The following guide gives an introduction to the generalized random forests algorithm as implemented in the grf
package. It aims to give a complete description of the training and prediction procedures, as well as the options available for tuning. This guide is intended as an informal and practical reference; for a theoretical treatment of GRF, please consult the ‘Generalized Random Forests’ paper.
GRF extends the idea of a classic random forest to allow for estimating other statistical quantities besides the expected outcome. Each forest type, for example quantile_forest
, trains a random forest targeted at a particular problem, like quantile estimation. The most common use of GRF is in estimating treatment effects through the function causal_forest
.
In this section, we describe GRF’s overall approach to training and prediction. The descriptions given in this section apply to all the available forest models. Specific details about the causal_forest
method can be found in the ‘Causal Forests’ section below.
We begin with a simple example to illustrate the estimation process:
# Train a causal forest.
n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
causal.forest = causal_forest(X, Y, W)
# Estimate causal effects on new test data.
X.test = matrix(0, 100, p)
X.test[,1] = seq(-2, 2, length.out = 100)
predictions = predict(causal.forest, X.test)$predictions
# Estimate causal effects for the training data using out-of-bag prediction.
oob.predictions = predict(causal.forest)$predictions
We now explore each of these steps in more detail.
A random forest is at its core an ensemble model, composed of a group of decision trees. During training, a number of trees are grown on random subsamples of the dataset. Individual trees are trained through the following steps:
x
, we look at all of its possible values v
and consider splitting it into two children based on this value. The goodness of a split (x
, v
) is determined by how much it increases heterogeneity in the quantity of interest. Certain splits are not considered, because the resulting child nodes would be too small or too different in size.x
that are less than or equal to the split value v
are placed in a new left child node, and all examples with values greater than the v
are placed in a right child node.The main difference between GRF’s approach to growing trees and that of classic random forests is in how the quality of a split is measured. Because the various forest types seek to estimate different statistical quantities like quantiles and treatment effects, splitting must be tailored to the particular task at hand. The approach taken in GRF is to maximize the heterogeneity in the quantity of interest across the child nodes. For example, with causal effect estimation, the goodness of a split relates to how different the treatment effect estimates are in each node. A theoretical motivation for this split criterion can be found in section 2 of the GRF paper.
The quality of a split must be calculated for each possible split variable x
and value v
, so it is critical for it to be fast to compute. Optimizing the heterogeneity criterion directly is therefore too expensive; instead, we take the gradient of the objective and optimize a linear approximation to the criterion. This approach allows us to reuse classic tree splitting algorithms that work in terms of cumulative sums. This gradient-based approach also has close ties to the concept of ‘influence functions’, as discussed in 2.3 of the GRF paper.
Given a test example, the GRF algorithm computes a prediction as follows:
Those familiar with classic random forests might note that this approach differs from the way forest prediction is usually described. The traditional view is that to predict for a test example, each tree makes a prediction on that example. To make a final prediction, the tree predictions are combined in some way, for example through averaging or through ‘majority voting’. It’s worth noting that for regression forests, the GRF algorithm described above is identical this ‘ensemble’ approach, where each tree predicts by averaging the outcomes in each leaf, and predictions are combined through a weighted average.
The following table illustrates with a simple regression example. Given a forest trained on n = 4 training samples, we are given a new test point x we want to compute a prediction at. We consider all the training samples used to build the forest, and ask for each tree: for which training samples does the new x land in the same terminal node? As a table this would look like:
Training sample | Does test sample x fall into the same leaf? | Sum “Yes” | Similarity weight for x: α(x) | Y (outcome) | α * Y | |||
---|---|---|---|---|---|---|---|---|
tree 1 | tree 2 | … | tree B | |||||
X_{1} | Yes | No | Yes | 2 | 0.4 | 4 | 1.6 | |
X_{2} | No | No | No | 0 | 0 | -0.5 | 0 | |
X_{3} | Yes | No | No | 1 | 0.2 | 2 | 0.4 | |
X_{4} | Yes | Yes | No | 2 | 0.4 | 4 | 1.6 | |
Final prediction | 1.6 + 0.4 + 1.6 = 3.6 |
where in the right column we sum the number of “Yes” and normalize it to sum to one. This effectively acts as a similarity measure telling us how “similar” each training sample is to x. Combining these forest weights α(x) with the estimation method given by the particular task at hand, is what delivers the final point predictions. For regression, the estimation method is a mean, giving us a forest weighted average (in a “traditional” random forest tree 1 would use sample 1, 3 and 4 to predict an average Y, etc., and these averages would be aggregated for the final prediction). For other statistical tasks, the final “aggregation” could be more complicated, for example, when using a random forest for instrumental variables estimation, GRF would effectively compute a forest-weighted two-stage least squares estimate for each target sample x.
Some readers may find this description similar to kernel weighted estimators, and that is right: GRF uses random forests as an adaptive nearest neighbor method. The GRF algorithm finds a weighted set of neighbors α(x) that are similar to a test point x, where the notion of similarity varies by the statistical task. For regression, it is how similar the outcomes are, for CATE estimation, it is how similar the estimated treatment effects are. An important consequence of this is that asking for individual tree predictions, as one would for a traditional ensemble random forest, is inappropriate here, as it is not individual trees that make up predictions, but rather the neighborhood they induce.
If a dataset is provided to the predict
method, then predictions are made for these new test example. When no dataset is provided, prediction proceeds on the training examples. In particular, for each training example, all the trees that did not use this example during training are identified (the example was ‘out-of-bag’, or OOB). Then, a prediction for the test example is made using only these trees. These out-of-bag predictions can be useful in understanding the model’s goodness-of-fit, and are also used in several of the methods for causal effect estimation methods described later in this guide.
sample.fraction
The sample.fraction
parameter is a number in the range (0, 1] that controls the fraction of examples that should be used in growing each tree. By default, sample.fraction
is set to 0.5. As noted in the section on honest forests, the fractional subsample will be further split into halves when honesty is enabled.
num.trees
The num.trees
parameter controls how many trees are grown during training, and defaults to 2000. Tree training is parallelized across several threads in an effort to improve performance. By default, all available cores are used, but the number of threads can be set directly through num.threads
.
Forests are a randomized ensemble algorithm, and as such every forest grown with a different initial seed will produce slightly different estimates, even when fit on the same data. We call this sort of perturbation error the excess.error
, because it does not come from the inherent sampling variability in the data. Large forests have smaller excess.error
, so we recommend that users grow as many trees as necessary to ensure that the values in excess.error
are negligible relative to variance.estimates
.
In addition, obtaining tighter confidence intervals requires growing even more trees than are needed for accurate predictions. When the number of trees in a forest is small, the confidence intervals can be too wide, and therefore too conservative. We recommend that users grow trees in proportion to the number of observations.
If you are interested in checking the evolution of excess.error
or confidence interval widths as the number of trees increases, you can grow forests iteratively using the function merge_forests
(see Merging Forests below).
honesty
, honesty.fraction
, honesty.prune.leaves
By default, ‘honest’ forests are trained. The motivation behind honesty is to reduce bias in tree predictions, by using different subsamples for constructing the tree and for making predictions. Honesty is a well-explored idea in the academic literature on random forests, but is not yet common in software implementations. This section gives an algorithmic explanation of honesty; for a more formal overview, please see section 2.4 of Wager and Athey (2018).
In a classic random forest, a single subsample is used both to choose a tree’s splits, and for the leaf node examples used in making predictions. In contrast, honest forests randomly split this subsample in half, and use only the first half when performing splitting. The second half is then used to populate the tree’s leaf nodes: each new example is ‘pushed down’ the tree, and added to the leaf in which it falls. In a sense, the leaf nodes are ‘repopulated’ after splitting using a fresh set of examples.
After repopulating a tree’s leaves using the second half-sample, it is possible that some leaves end up empty. With empty leaves, a tree is not able to handle certain test examples and needs to be skipped when computing those predictions. By default, empty leaves are pruned away after training so that each tree is able to handle all test points. GRF’s behavior with respect to empty leaves can be controlled through the parameter honesty.prune.leaves
.
It’s important to note that honesty may hurt performance when working with very small datasets. In this set-up, the subsample used to determine tree splits is already small, and honesty further cuts this subsample in half, so there may no longer be enough information to choose high-quality splits. There are a couple options for mitigating the cost of honesty in small samples:
honesty.fraction
allows for increasing the fraction of samples used in selecting tree splits. For example, an honesty fraction of 0.7 directs GRF to use 70% of the tree subsample for splitting, and the other 30% to populate the leaf nodes. If GRF is not working well on a small sample, we’ve found empirically that it can help to increase honesty.fraction
and set honesty.prune.leaves
to FALSE
. With these settings, it may also be necessary to increase the number of trees grown in training through num.trees
.honesty
to FALSE
.When automatically tuning parameters through tune.parameters
, GRF will try varying honesty.fraction
between 0.5 and 0.8, and consider both options for honesty.prune.leaves
. More information on automatic parameter selection can be found in the ‘Parameter Tuning’ section below.
mtry
The mtry
parameter determines the number of variables considered during each split. The value of mtry
is often tuned as a way to improve the runtime of the algorithm, but can also have an impact on statistical performance.
By default, mtry
is taken as min(sqrt(p) + 20, p)
, where p
is the number of variables (columns) in the dataset. This value can be adjusted by changing the parameter mtry
during training. Selecting a tree split is often the most resource-intensive component of the algorithm. Setting a large value for mtry
may therefore slow down training considerably.
To more closely match the theory in the GRF paper, the number of variables considered is actually drawn from a poisson distribution with mean equal to mtry
. A new number is sampled from the distribution before every tree split.
min.node.size
The parameter min.node.size
relates to the minimum size a leaf node is allowed to have. Given this parameter, if a node reaches too small of a size during splitting, it will not be split further.
There are several important caveats to this parameter:
min.node.size
setting.min.node.size
. Because of this, trees can have leaf nodes that violate the min.node.size
setting. We initially chose this behavior to match that of other random forest packages like randomForest
and ranger
, but will likely be changed as it is misleading (see #143).min.node.size
takes on a slightly different notion related to the number of treatment and control samples. More detail can be found in the ‘Selecting Balanced Splits’ section below, under the ‘Causal Forests’ heading.alpha
The parameter alpha
controls the maximum imbalance of a split. In particular, when splitting a parent node, the size of each child node is not allowed to be less than size(parent) * alpha
. Its value must lie between (0, 0.25), and defaults to 0.05.
When training a causal forest, this parameter takes on a slightly different notion related to the number of treatment and control samples. More detail can be found in the ‘Selecting Balanced Splits’ section below, under the ‘Causal Forests’ heading.
imbalance.penalty
The imbalance.penalty
parameter controls how harshly imbalanced splits are penalized. When determining which variable to split on, each split is assigned a ‘goodness measure’ related to how much it increases heterogeneity across the child nodes. The algorithm applies a penalty to this value to discourage child nodes from having very different sizes, specified by imbalance.penalty * (1.0 / size(left.child) + 1.0 / size(right.child)
. This penalty can be seen as a complement to the hard restriction on splits provided by alpha
. It defaults to 0 so that no split penalty is applied.
When training a causal forest, this parameter takes on a slightly different notion related to the number of treatment and control samples. More detail can be found in the ‘Selecting Balanced Splits’ section below, under the ‘Causal Forests’ heading.
By default, all forest models are trained in such a way as to support variance estimates. To calculate these estimates, the flag estimate.variance
can be provided to prediction:
causal.forest = causal_forest(X, Y, W)
prediction.result = predict(causal.forest, X.test, estimate.variance=TRUE)
standard.error = sqrt(prediction.result$variance.estimates)
The procedure works by training trees in small groups, then comparing the predictions within and across groups to estimate variance. In more detail:
PredictionStrategy::compute_variance
.Note that although training starts by drawing a half-sample, the sample.fraction
option still corresponds to a fraction of the full sample. This means that when variance estimates are requested, sample.fraction
cannot be greater than 0.5.
The number of trees in each group is controlled through the ci.group.size
parameter, and defaults to 2. If variance estimates are not needed, ci.group.size
can be set to 1 during training to avoid growing trees in small groups.
The causal.forest
method uses the same general training and prediction framework described above:
For a technical treatment of causal forest splitting and prediction, please refer to section 6.2 of the GRF paper.
Beyond this core training procedure, causal forests incorporate some additions specific to treatment effect estimation. These additions are described below.
Recall that causal forests assume that potential outcomes are independent of treatment assignment, but only after we condition on features X
. In this setting, in order to consistently estimate conditional average treatment effects, a naive causal forest would need to split both on features that affect treatment effects and those that affect treatment propensities. This can be wasteful, as splits ‘spent’ on modeling treatment propensities may not be useful in estimating treatment heterogeneity.
In GRF, we avoid this difficulty by ‘orthogonalizing’ our forest using Robinson’s transformation (Robinson, 1988). Before running causal_forest
, we compute estimates of the propensity scores e(x) = E[W|X=x]
and marginal outcomes m(x) = E[Y|X=x]
by training separate regression forests and performing out-of-bag prediction. We then compute the residual treatment W - e(x)
and outcome Y - m(x)
, and finally train a causal forest on these residuals. If propensity scores or marginal outcomes are known through prior means (as might be the case in a randomized trial) they can be specified through the training parameters W.hat
and Y.hat
. In this case, causal_forest
will use these estimates instead of training separate regression forests.
Empirically, we’ve found orthogonalization to be essential in obtaining accurate treatment effect estimates in observational studies. This first-stage residualization is conceptually different from the GRF splitting routine described in the previous sections. It can be interpreted as giving rise to a loss function tailored for heterogeneous treatment effects. In recognition of the work of Robinson (1988) and to emphasize the role of residualization, we refer to this approach as the ‘R-learner’. More details on the orthogonalization procedure in the context of forests can be found in section 6.1.1 of the GRF paper. For a broader discussion on Robinson’s transformation for conditional average treatment effect estimation, including formal results, please see Nie and Wager (2021).
In the sections above on min.node.size
, alpha
, and imbalance.penalty
, we described how tree training protects against making splits that result in a large size imbalance between the children, or leaf nodes that are too small. In a causal setting, it is not sufficient to consider the number of examples in each node – we must also take into account the number treatment vs. control examples. Without a reasonable balance of treated and control examples, there will not be enough information in the node to obtain a good estimate of treatment effect. In the worst case, we could end up with nodes composed entirely of control (or treatment) examples.
For this reason, causal splitting uses modified notions of each split balancing parameter:
min.node.size
usually determines the minimum number of examples a node should contain. In causal forests, the requirement is more stringent: a node must contain at least min.node.size
treated samples, and also at least that many control samples.alpha
and imbalance.penalty
help ensure that the size difference between children is not too large. When applying alpha
and imbalance.penalty
in causal forests, we use a modified measure of node size that tries to capture how much ‘information content’ it contains. The new size measure is given by \sum_{i in node} (W_i - \bar{W})^2
.The above description of min.node.size
assumes that the treatment is binary, which in most cases is an oversimplification. The precise algorithm for enforcing min.node.size
is as follows. Note that this same approach is used both when the treatment is binary or continuous.
min.node.size
samples with treatment value less than the average, and at least that many samples with treatment value greater than or equal to the average.In addition to personalized treatment effects, causal forests can be used to estimate the average treatment effect across the training population. Naively, one might estimate the average treatment effect by averaging personalized treatment effects across training examples. However, a more accurate estimate can be obtained by plugging causal forest predictions into a doubly robust average treatment effect estimator. As discussed in Chernozhukov et al. (2018), such approaches can yield semiparametrically efficient average treatment effect estimates and accurate standard error estimates under considerable generality. GRF provides the dedicated function average_treatment_effect
to compute these estimates.
The average_treatment_effect
function implements two types of doubly robust average treatment effect estimations: augmented inverse-propensity weighting “AIPW” (Robins et al., 1994), and targeted maximum likelihood estimation (van der Laan and Rubin, 2006). Which method to use can be specified through the method
parameter. The following estimates are available:
target.sample = all
): E[Y(1) - Y(0)]
.target.sample = treated
): E[Y(1) - Y(0) | Wi = 1]
.target.sample = control
): E[Y(1) - Y(0) | Wi = 0]
.target.sample = overlap
): E[e(X) (1 - e(X)) (Y(1) - Y(0))] / E[e(X) (1 - e(X))
, where e(x) = P[Wi = 1 | Xi = x]
.This last estimand is recommended by Li et al. (2018) in case of poor overlap (i.e., when the treatment propensities e(x) may be very close to 0 or 1), as it doesn’t involve dividing by estimated propensities.
Even though there is treatment effect heterogeneity across the population or a sub-population, the average treatment effect might still be zero. To assess if the estimated CATE function tau(x) = E[Y(1) - Y(0) | X = x]
does well in identifying personalised effects, another summary measure would be useful. One such proposed measure is the Rank-Weighted Average Treatment Effect (RATE), which uses the relative ranking of the estimated CATEs to gauge if it can effectively target individuals with high treatment effects on a separate evaluation data set (and can thus be used to test for the presence of heterogeneous treatment effects).
This approach is implemented in the function rank_average_treatment_effect
, which provides valid bootstrapped errors of the RATE, along with a Targeting Operator Characteristic (TOC) curve which can be used to visually inspect how well a CATE estimator performs in ordering observations according to treatment benefit. For more details on the RATE metric see Yadlowsky et al., 2021.
Sometimes, it is helpful to have a more expressive summary of the CATE function tau(x) = E[Y(1) - Y(0) | X = x]
than the average treatment effect. One useful summary of this type is the “best linear projection” of tau(x) onto a set of features A, i.e., the population solution to the linear model
tau(X) ~ beta0 + A * beta
Qualitatively, these betas can be used to assess the association of the CATE function with the features A. The features A could, for example, be a subset of the features X used to train the forest. Note that, if the features A are mean zero, the intercept beta0 corresponds to the average treatment effect.
The function best_linear_projection
provides estimates of the coefficients beta in the above model. Note that this function does not simply regress the CATE estimates tau.hat(Xi) given by the causal forest against Ai. Instead, we use a doubly robust estimator that generalizes the augmented inverse-propensity weighted estimator for the average treatment effect described above.
As mentioned above, causal forest choose splits that maximize the difference in the treatment effect tau between two child nodes by a gradient based approximation. When the response Yi
and treatment Wi
are real-valued this results in a scalar valued approximation ρi which is used by the subsequent splitting routine (see section 2.3 of the GRF paper for more details).
There are however empirical applications where there are more than one primary outcome Y
of interest, and a practitioner may wish to train a forest that jointly expresses heterogeneity in treatment effects across several outcomes Yj
, j=1…M. The same algorithmic framework described above can be employed for this purpose (assuming all outcomes are on the same scale) by computing the gradient based approximation of tau_j across each outcome, then concatenating them to a M-sized pseudo response vector ρ which is used by a subsequent CART regression splitting routine. The CART criterion GRF uses for vector-valued responses is the squared L2 norm.
Another closely related practical application are settings where there are several interventions, as for example in medical trials with multiple treatment arms. In the event there are K mutually exclusive treatment choices, we can use the same algorithmic principles described above to build a forest that jointly targets heterogeneity across the K-1 different treatment contrasts. In the software package this is done with (20) from the GRF paper, where Wi is a vector encoded as {0, 1}^(K-1), and ξ selects the K-1 gradient approximations of the contrasts.
The functionality described above is available in multi_arm_causal_forest
. The intended use-case for this function is for a “handful” of arms or outcomes. Statistical considerations aside (s.t. limited overlap with many arms), our implementation has a memory scaling that for each tree takes the form O(number.of.nodes * M * K^2)
. The general case where all Wk
are real-valued is implemented in the conditionally linear model forest: lm_forest
.
Many applications featuring time-to-event data involve right-censored responses where instead of observing the survival time Ti
we observe Yi = min(Ti, Ci)
along with an event indicator Di = 1{Ti <= Ci}
. To estimate treatment effects in such a setting we need to account for censoring in order to obtain unbiased estimates.
GRF supports this use-case in the function causal_survival_forest
by incorporating an extension of the celebrated AIPW score that is robust to censoring to the orthogonalization approach described in the previous section (see Cui et al., 2020 for more details). In addition to estimates of the propensity score, this approach relies on estimates of the survival functions S(t, x) = P(T > t | X = x)
and C(t, x) = P(C > t | X = x)
, which GRF estimates using a variant of random survival forests (Ishwaran et al., 2008), where the notable difference is GRF uses honest splitting and forest weights to produce a kernel-weighted survival function, instead of aggregating terminal node survival curves.
The balanced split criterions described in the earlier paragraph are extended to take censoring into account. For both causal_survival_forest
and survival_forest
the alpha
parameter controls the minimum number of non-censored samples each child node needs for splitting to proceed.
The following sections describe other features of GRF that may be of interest.
The accuracy of a forest can be sensitive to several training parameters:
min.node.size
, sample.fraction
, and mtry
honesty.fraction
and honesty.prune.leaves
alpha
and imbalance.penalty
GRF provides a cross-validation procedure to select values of these parameters to use in training. To enable this tuning during training, the option tune.parameters = "all"
can be passed to main forest method. Parameter tuning is currently disabled by default.
The cross-validation procedure works as follows:
tune.num.reps
).OptimizedPredictionStrategy#compute_debiased_error
.
tune.num.trees
). With such a small number of trees, the out-of-bag error gives a biased estimate of the final forest error. We therefore debias the error through a simple variance decomposition.tune.num.draws
).Note that honesty.fraction
and honesty.prune.leaves
are only considered for tuning when honesty = TRUE
(its default value). Parameter tuning does not try different options of honesty
itself.
Tuning can be sensitive. Tuning only a subset of the parameters (with for example tune.parameters = c("min.node.size", "honesty.prune.leaves")
may give better performance than trying to find the optimum for all. On smaller data, increasing tune.num.trees
may not necessarily increase training time drastically and could result in more stable results.
In order to ensure valid predictions and tight confidence intervals, users may need to grow a large number of trees. However, it is hard to know exactly how many trees to grow in advance. That is why GRF allows users to grow their forests separately and then create a single forest using the function merge_forests
. This functionality allows users to sequentially grow small forests, merge them into a large forest, and check if they have attained the desired level of excess.error
or tight enough confidence intervals.
Ghosal and Hooker (2018) show how a boosting step can reduce bias in random forest predictions. We provide a boosted_regression_forest
method which trains a series of forests, each of which are trained on the OOB residuals from the previous step. boosted_regression_forest
includes the same parameters as regression_forest
, which are passed directly to the forest trained in each step.
The boosted_regression_forest
method also contains parameters to control the step selection procedure. By default, the number of boosting steps is automatically chosen through the following cross-validation procedure:
regression_forest
is trained and added to the boosted forest.boost.trees.tune
on the OOB residuals. If it’s estimated OOB error reduces the OOB error from the previous step by more than boost.error.reduction
, then we will prepare for another step.regression_forest
on the residuals and add it to the boosted forest.boost.error.reduction
cannot be met when training the small forest, or if the total number of steps exceeds the limit boost.max.steps
.Alternatively, you can to skip the cross-validation procedure and specify the number of steps directly through the parameter boost.steps
.
Some additional notes about the behavior of boosted regression forests:
tune.parameters
is enabled, then parameters are chosen by the regression_forest
procedure once in the first boosting step. The selected parameters are then applied to train the forests in any further steps.estimate.variance
parameter is not available for boosted forests.For accurate predictions and variance estimates, it can be important to take into account for natural clusters in the data, as might occur if a dataset contains examples taken from the same household or small town. GRF provides support for cluster-robust forests by accounting for clusters in the subsampling process. To use this feature, the forest must be trained with the relevant cluster information by specifying the clusters
and optionally the equalize.cluster.weights
parameter. Then, all subsequent calls to predict
will take clusters into account, including when estimating the variance of forest predictions.
When clustering is enabled during training, all subsampling procedures operate on entire clusters as opposed to individual examples. Then, to determine the examples used for performing splitting and populating the leaves, samples_per_cluster
examples are drawn from the selected clusters. By default, samples_per_cluster
is all the observations in the clusters. If equalize.cluster.weights = TRUE
samples_per_cluster
is equal to the size of the smallest cluster. Concretely, the cluster-robust training procedure proceeds as follows:
estimate.variance
is enabled, sample half of the clusters. Each ‘tree group’ is associated with this half-sample of clusters (as described in the ‘Variance Estimates’ section). If we are not computing variance estimates, then keep the full sample instead.sample.fraction
of the clusters. Each tree is now associated with a list of cluster IDs.samples_per_cluster
examples from each of the cluster IDs, and do the same when repopulating the leaves for honesty.Note that when clusters are provided, standard errors from average_treatment_effect
estimation are also cluster-robust. Moreover, If clusters are specified, then each unit gets equal weight by default. For example, if there are 10 clusters with 1 unit each and per-cluster ATE = 1, and there are 10 clusters with 19 units each and per-cluster ATE = 0, then the overall ATE is 0.05 (additional sample.weights allow for custom weighting). If equalize.cluster.weights = TRUE each cluster gets equal weight and the overall ATE is 0.5.
When the distribution of data that you observe is not representative of the population you are interested in scientifically, it can be important to adjust for this. We can pass sample.weights
to specify that in our population of interest, we observe Xi with probability proportional to sample.weights[i]
. By default, these weights are constant, meaning that our population of interest is the population from which X1 … Xn are sampled. For causal validity, the weights we use should not be confounded with the potential outcomes — typically this is done by having them be a function of Xi. One common example is inverse probability of complete case weighting to adjust for missing data, which allows us to work with only the complete cases (the units with nothing missing) to estimate properties of the full data distribution (all units as if nothing were missing).
The impact these weights have depends on what we are estimating. When our estimand is a function of x, e.g. r(x) = E[Y | X=x]
in regression_forest
or tau(x)=E[Y(1)-Y(0)| X=x]
in causal_forest
, passing weights that are a function of x does not change our estimand. It instead prioritizes fit on our population of interest. In regression_forest
, this means minimizing weighted mean squared error, i.e. mean squared error over the population specified by the sample weights. In causal_forest
, this means minimizing weighted R-loss (Nie and Wager (2021), Equations 1/2). When our estimand is an average of such a function, as in average_treatment_effect
, it does change the estimand. Our estimand will be the average treatment/partial effect over the population specified by our sample weights.
Currently most forest types takes sample.weights
into account during splitting and prediction. The exception is local linear forests and quantile_forest
where sample weighting is currently not a feature. survival_forest
only takes sample.weights
into account during prediction (due to properties of the log-rank splitting criterion not being “amendable” to sample weights).
GRF can only handle numerical inputs. If your data has a column with categorical values, here are some ways in which you can proceed.
If there is a natural way to order the categories, then simply encode them as integers according to this ordering. For instance, an individual’s educational attainment could be encoded as 1 for “primary”, 2 for “secondary”, 3 for “undergraduate”, and so on. This representation will make it easy for forests to split individuals into lower and higher education groups. Moreover, even if your variable does not have an obvious ordering at first sight, you may want to be a little creative in imposing an artificial ordering that is relevant to your statistical problem. For example, numbering contiguous geographical areas sequentially can often help forests make splits that group nearby areas together. If geographical proximity is important, the forest performance will likely increase.
However, sometimes the category has no reasonable ordering. In that case, one common option is to create one-hot encoded vectors (also know as dummy vectors), replacing the column with several binary columns indicating the category to which each observation belongs. Unfortunately, this sort of representation can be wasteful, because each binary column will likely not carry a lot of information. When the number of categories is large, you might want to use a more sophisticated representation method such as feature hashing. Alternatively, we recommend users try our sister package sufrep
, which implements several categorical variable representation methods. It is currently in beta version, so please feel free to submit github issues for bugs or additional feature requests.
Finally, please bear in mind that in general the statistical performance of any statistical algorithm will depend on how the data is represented. Therefore, it may be worthwhile to spend some time thinking about how to encode your variables before starting the actual analysis.
GRF can handle missing covariate values, meaning that a the input covariate matrix can contain NA
in some cells instead of numerical values.
These missing values can be informative, in the sense that the fact that a covariate is missing can be predictive of the unobserved value itself or of the outcome Y, or it can be non-informative. For instance, an individual’s annual income could be missing because this individual is wealthy and prefers to conceal his/her income and this in turn is likely to be predictive, for instance, of the type of housing of this individual.
GRF handles missing values implicitly using the missing incorporated in attributes criterion (MIA; Twala et al., 2008), meaning that as soon as there is a missing value in a variable j, there are three candidate splits for a given threshold.
Note that by default, missing values are sent to the left, which implies that if the covariates from the training set are completely observed but there are missing values occurring in a new covariate matrix from the test set, then the prediction functions will not produce an error but simply send the incomplete observations to the left side of the corresponding nodes.
Missing values occur almost inevitably in most applications and the way to handle them depends on the type of analysis one wishes to perform and on the mechanism that generated these missing values. For a broader discussion on treatment effect estimation with missing attributes please see Mayer et al. (2019).
If you observe poor performance on a dataset with a small number of examples, there are two changes worth looking into:
honesty
parameter gives guidance on how to mitigate the issue.ci.group.size
to 1 during training, then increasing sample.fraction
. Because of how variance estimation is implemented, sample.fraction
cannot be greater than 0.5 when it is enabled. If variance estimates are not needed, it may help to disable this computation and use a larger subsample size for training.GRF is a package designed for forest-based estimation on datasets of ‘moderate’ size. The tree building process can be sped up significantly on a machine with many cores, but note that on very large datasets the problem quickly becomes memory bound, due to the information (sufficient statistics, etc.) stored by each tree. As a point of reference a causal_forest
with 1 000 000 observations and 30 continuous covariates takes around 1 hour to train (using default settings on a machine with 24 cores and 150 GB RAM), and the final forest takes up ca. 25 GB of memory (this is expected to scale linearly in the number of trees). In order to make forest training on very large data more manageable, we provide the following recommendations for parameters to experiment with:
min.node.size
, effectively growing shallower trees. Having a min.node.size
in the thousands will make scaling to a data set in the millions perfectly feasible.num.trees
. The default of 2000 is high in order to ensure good performance of the optionally estimated pointwise confidence intervals. If these are not needed, fewer trees might give very similar prediction performance.In addition, the following might also be useful, depending on application:
sample.fraction
, reducing the number of samples per tree.merge_forests
(note that merging is still a copy inducing operation).In this case, it would be good to try growing a larger number of trees. Obtaining good variance estimates often requires growing more trees than it takes to only obtain accurate predictions. See the discussion under num.trees
and Merging Forests.
If the output of the causal.forest
method doesn’t pass a sanity check based on your knowledge of the data, it may be worth checking whether the overlap assumption is violated. In order for conditional average treatment effects to be properly identified, a dataset’s propensity scores must be bounded away from 0 and 1. A simple way to validate this assumption is to calculate the propensity scores by regressing the treatment assignments W against X, and examining the out-of-bag predictions. Concretely, you can perform the following steps:
propensity.forest = regression_forest(X, W)
W.hat = predict(propensity.forest)$predictions
hist(W.hat, xlab = "propensity score")
If there is strong overlap, the histogram will be concentrated away from 0 and 1. If the data is instead concentrated at the extremes, the overlap assumption likely does not hold.
For further discussion of the overlap assumption, please see Imbens and Rubin (2015). In practice, this assumption is often violated due to incorrect modeling decision: for example one covariate may be a deterministic indicator that the example received treatment.
While the algorithm in regression_forest
is very similar to that of classic random forests, it has several notable differences, including ‘honesty’, group tree training for variance estimates, and restrictions during splitting to avoid imbalanced child nodes. These features can cause the predictions of the algorithm to be different, and also lead to a slower training procedure than other packages. We welcome GitHub issues that shows cases where GRF does notably worse than other packages (either in statistical or computational performance), as this will help us choose better defaults for the algorithm, or potentially point to a bug.
Overall, GRF is designed to produce the same estimates across platforms when using a consistent value for the random seed through the training option seed. However, there are still some cases where GRF can produce different estimates across platforms. When it comes to cross-platform predictions, the output of GRF will depend on a few factors beyond the forest seed.
One such factor is the compiler that was used to build GRF. Different compilers may have different default behavior around floating-point rounding, and these could lead to slightly different forest splits if the data requires numerical precision. Another factor is how the forest construction is distributed across different threads. Right now, our forest splitting algorithm can give different results depending on the number of threads that were used to build the forest.
Therefore, in order to ensure consistent results, we provide the following recommendations.
seed
and num.threads
are the same across platformsAlso, please note that we have not done extensive testing on Windows platforms, although we do not expect random number generation issues there to be different from Linux/Mac. Regardless of the platform, if results are still not consistent please help us by submitting a Github issue.
Athey, Susan, Julie Tibshirani and Stefan Wager. Generalized Random Forests, Annals of Statistics, 2019. (arxiv)
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 2018.
Cui, Yifan, Michael R. Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. Estimating Heterogeneous Treatment Effects with Right-Censored Data via Causal Survival Forests. arXiv preprint arXiv:2001.09887, 2020.
Ghosal, Indrayudh, and Giles Hooker. Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and its Variance Estimate. Journal of Computational and Graphical Statistics, 2020.
Ishwaran, Hemant, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. The Annals of Applied Statistics, 2008.
Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
Li, Fan, Kari Lock Morgan, and Alan M. Zaslavsky. Balancing covariates via propensity score weighting. Journal of the American Statistical Association, 2018.
Mayer, Imke, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager and Julie Josse. Doubly robust treatment effect estimation with missing attributes. The Annals of Applied Statistics, 2020.
Nie, Xinkun, and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 2021.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 1994.
Robinson, Peter M. Root-n-consistent semiparametric regression. Econometrica, 1988.
Twala, B. E. T. H., M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. Pattern Recognition Letters 29,2008.
Van Der Laan, Mark J., and Daniel Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2006.
Wager, Stefan, and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2018.
Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. arXiv preprint arXiv:2111.07966, 2021.