library(grf)

Here we consider a problem in which we want to impute outcomes that are missing at random. We observe \((X_i, W_i Y_i, W_i)\), where \(Y_i\) is real-valued and \(W_i\) is binary, and we will use a regression forest to estimate \(\mu(x) = E[Y_i \mid X_i = x, W_i = 1]\) using the units with \(W_i = 1\). Fitting this forest without weights yields a \(\mu\) minimizing an empirical version of \(E[(Y_i - \mu(X_i))^2 \mid W_i = 1]\), the squared error over the population of units with nonmissing outcomes. A more natural measure of imputation quality, however, is \(E[(Y_i - \mu(X_i))^2 \mid W_i = 0]\), the squared error over the population of units for which we will be imputing outcomes. With the inverse-propensity weights \(w(x) = (1 - e(x)) / e(x)\), where \(e(x) = P(W_i = 1 \mid X_i = x)\), we instead minimize an unbiased estimate of this more natural measure (up to a constant factor that does not affect the minimizer), since \(E[w(X_i) (Y_i - \mu(X_i))^2 \mid W_i = 1] \propto E[(Y_i - \mu(X_i))^2 \mid W_i = 0]\).
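To see the weighting identity at work, here is a small standalone Monte Carlo check (separate from the vignette's example below): a \(w(x)\)-weighted average of any function of \(X\) over the units with \(W = 1\) recovers its plain average over the units with \(W = 0\).

```r
# Monte Carlo check of the weighting identity. With a symmetric logistic
# propensity and standard normal X, P(W = 1) = 1/2, so the proportionality
# constant P(W = 0) / P(W = 1) equals 1 and the two sides agree exactly.
set.seed(1)
n <- 1e6
x <- rnorm(n)
e <- 1 / (1 + exp(-x))      # propensity P(W = 1 | X = x)
w <- runif(n) <= e
f <- x^2                    # stand-in for the squared error (Y - mu(X))^2

lhs <- mean(((1 - e[w]) / e[w]) * f[w])  # weighted average over observed units
rhs <- mean(f[!w])                       # plain average over missing units
c(lhs, rhs)
```

Both quantities estimate \(E[X^2 \mid W = 0]\), so they agree up to Monte Carlo error.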

We simulate data from this missing-data model, then compare the two forests' errors on the imputed observations and check the coverage of their pointwise confidence intervals.

n <- 2000
p <- 2
X <- matrix(rnorm(n * p), n, p)
mu <- X[, 1]
Y <- mu + 0.1 * rnorm(n)

e <- 1 / (1 + exp(-X[, 1]))  # propensity of observing Y: e(x) = P(W = 1 | X = x)
w <- runif(n) <= e           # w[i] is TRUE when Y[i] is observed
sample.weights <- (1 - e[w]) / e[w]
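In this simulation the propensity \(e(x)\) is known exactly. In practice it would usually have to be estimated first; one possible approach (a sketch, not part of the example above) is to fit a separate regression forest to the missingness indicator and build the weights from its out-of-bag predictions:

```r
# Sketch: estimating e(x) = P(W = 1 | X = x) when it is unknown.
# Reuses the simulation design above; e.hat would replace the true e.
library(grf)
set.seed(1)
n <- 2000
X <- matrix(rnorm(n * 2), n, 2)
W <- as.numeric(runif(n) <= 1 / (1 + exp(-X[, 1])))

e.forest <- regression_forest(X, W)
e.hat <- predict(e.forest)$predictions  # out-of-bag estimates of P(W = 1 | X)
sample.weights.hat <- (1 - e.hat[W == 1]) / e.hat[W == 1]
```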

forest <- regression_forest(X[w, ], Y[w])
forest.weighted <- regression_forest(X[w, ], Y[w], sample.weights = sample.weights)

mu.forest <- predict(forest, X[!w, ], estimate.variance = TRUE)
mu.forest.weighted <- predict(forest.weighted, X[!w, ], estimate.variance = TRUE)
z.forest <- abs(mu.forest$predictions - mu[!w]) / sqrt(mu.forest$variance.estimates)
z.forest.weighted <- abs(mu.forest.weighted$predictions - mu[!w]) / sqrt(mu.forest.weighted$variance.estimates)

mean((mu.forest$predictions - mu[!w])^2)
#> [1] 0.0001010782
mean((mu.forest.weighted$predictions - mu[!w])^2)
#> [1] 0.001717534

mean(z.forest <= 1.96)
#> [1] 0.8972163
mean(z.forest.weighted <= 1.96)
#> [1] 0.8319058
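These coverage figures estimate \(P(|\hat\mu(X) - \mu(X)| \le 1.96\, \hat\sigma(X))\) on the held-out units. As a point of reference (a toy check independent of grf), when the estimation error is Gaussian and the variance estimate is correct, this probability equals the nominal 95%:

```r
# Coverage of nominal 95% intervals under exact Gaussian variance estimates.
set.seed(1)
n <- 1e5
sigma <- 0.3                      # suppose the variance estimate is exact
mu.hat <- 1 + sigma * rnorm(n)    # estimates scattered around the truth mu = 1
z <- abs(mu.hat - 1) / sigma
mean(z <= 1.96)                   # close to 0.95
```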