--- title: "Automatic Construction of Bootstrap Confidence Intervals" author: "Bradley Efron and Balasubramanian Narasimhan" date: '`r Sys.Date()`' bibliography: bcaboot.bib output: html_document: fig_caption: yes theme: cerulean toc: yes toc_depth: 2 vignette: > %\VignetteIndexEntry{Automatic Construction of Bootstrap Confidence Intervals} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r echo=FALSE} knitr::opts_chunk$set( message = FALSE, warning = FALSE, error = FALSE, tidy = FALSE, cache = FALSE ) library(bcaboot) ``` ## Introduction Bootstrap confidence intervals depend on three elements: - the cdf of the \eqn{B} bootstrap replications $t_i^*$, $i=1\ldots B$ - the bias-correction number $z_0 = \Phi(\sum_i^B I(t_i^* < t_0) / B )$ where $t_0=f(x)$ is the original estimate - the acceleration number $a$ that measures the rate of change in $\sigma_{t_0}$ as $x$, the data changes. The first two of these depend only on the bootstrap distribution, and not how it is generated: parametrically or non-parametrically. Package `bcaboot` aims to make construction of bootstrap confidence intervals _almost_ automatic. The three main functions for the user are: - `bcajack` and `bcajack2` for nonparametric bootstrap - `bcapar` for parametric bootstrap Further details are in the @efronnaras2018 paper. Much of the theory behind the approach can be found in references @efron1987, @diciccio1992, @diciccio1996, and @efron2016. ## A Nonparametric Example Suppose we wish to construct bootstrap confidence intervals for an $R^2$-statistic from a linear regression. Using the diabetes data from the [`lars`](https://cran.r-project.org/package=lars) (442 by 11) as an example, we use the function below to regress the `y` on `x`, a matrix of of 10 predictors, to compute $R^2$. ```{r} data(diabetes, package = "bcaboot") Xy <- cbind(diabetes$x, diabetes$y) rfun <- function(Xy) { y <- Xy[, 11] X <- Xy[, 1:10] summary(lm(y ~ X) )$adj.r.squared } ``` Constructing bootstrap confidence intervals involves merely calling `bcajack`: ```{r} set.seed(1234) result <- bcajack(x = Xy, B = 2000, func = rfun, verbose = FALSE) ``` The `result` contains several components. The confidence interval limits can be obtained via ```{r} knitr::kable(result$lims, digits = 3) ``` The first column shows the estimated Bca confidence limits at the requested alpha percentiles which can be compared with the standard limits $\theta \pm \hat{\sigma}z_{\alpha}$ under the column titled `standard`. The `jacksd` column jacksd gives the internal standard errors for the Bca limits, quite small in this example. The `pct` column gives percentiles of the ordered `B` bootstrap replications corresponding to the Bca limits, e.g. the 91.85 percentile equals the the .975 Bca limit .5600968. Further details are provided by the `stats` component. ```{r} knitr::kable(result$stats, digits = 3) ``` The first column `theta` is the original point estimate of the parameter of interest, `sdboot` is its bootstrap estimate of standard error. The quantity `z0` is the Bca bias correction value, in this case quite negative; `a` is the acceleration, a component of the Bca limits (nearly zero here). Finally, `sdjack` is the jackknife estimate of standard error for `theta`. The bottom line gives the internal standard errors for the five quantities above. This is substantial for `z0` above. The component `ustats` of the result provides the bias-corrected estimator and an estimate of its sampling error. 
The component `ustats` of the result provides the bias-corrected estimator and an estimate of its sampling error.

```{r}
knitr::kable(t(result$ustats), digits = 3)
```

The resulting object can be plotted using `bcaplot`.

```{r}
bcaplot(result)
```

## A Parametric Example

```{r, echo = FALSE}
if (!requireNamespace("glmnet", quietly = TRUE)) {
    stop("Please install the glmnet package for this vignette.")
}
load(system.file("extdata", "neonates.rda", package = "bcaboot"))
```

A logistic regression was fit to data on 812 neonates at a large clinic. Here is a summary of the dataset.

```{r}
str(neonates)
```

The goal was to predict death versus survival ($y$ is 1 or 0, respectively) on the basis of 11 baseline variables, of which one, `resp`, was of particular concern. (There were 207 deaths and 605 survivors.) So here the parameter of interest $\theta$ is the coefficient of `resp`. Discussions with the investigator suggested a weighting of 4 to 1 for deaths versus non-deaths.

### A Logistic Model

```{r}
weights <- with(neonates, ifelse(y == 0, 1, 4))
glm_model <- glm(formula = y ~ ., family = "binomial",
                 weights = weights, data = neonates)
summary(glm_model)
```

Parametric bootstrapping in this context requires us to independently sample the response according to the estimated probabilities from the regression model. As discussed in the paper accompanying this software, the routine `bcapar` also requires the sufficient statistics $\hat{\beta} = M^\prime y$, where $M$ is the model matrix. Therefore, it makes sense to have a function do the work. The function `glm_boot` below returns a list of the estimate $\hat{\theta}$, the bootstrap estimates, and the sufficient statistics.

```{r}
glm_boot <- function(B, glm_model, weights, var = "resp") {
    pi_hat <- glm_model$fitted.values
    n <- length(pi_hat)
    ## Simulate B response vectors from the fitted probabilities
    y_star <- sapply(seq_len(B), function(i) ifelse(runif(n) <= pi_hat, 1, 0))
    ## Refit the model to each simulated response
    beta_star <- apply(y_star, 2, function(y) {
        boot_data <- glm_model$data
        boot_data$y <- y
        coef(glm(formula = y ~ ., data = boot_data,
                 weights = weights, family = "binomial"))
    })
    list(theta = coef(glm_model)[var],
         theta_star = beta_star[var, ],
         suff_stat = t(y_star) %*% model.matrix(glm_model))
}
```

Now we can compute the bootstrap estimates using `bcapar`.

```{r}
set.seed(3891)
glm_boot_out <- glm_boot(B = 2000, glm_model = glm_model, weights = weights)
glm_bca <- bcapar(t0 = glm_boot_out$theta,
                  tt = glm_boot_out$theta_star,
                  bb = glm_boot_out$suff_stat)
```

We can examine the bootstrap limits and statistics.

```{r}
knitr::kable(glm_bca$lims, digits = 3)
```

```{r}
knitr::kable(glm_bca$stats, digits = 3)
```

The estimate for `resp` and its bootstrap standard error, based on $B = 2000$ samples, can be read off from the last table as $0.943 \pm 0.155$. We can also see a small upward bias from the fact that a proportion `r with(glm_boot_out, sum(theta_star > theta) / 2000)` of the bootstrap replicates were above $0.943$. This is also reflected in the bias-corrector term $\hat{z}_0 = -0.215$ in the table above, with an internal standard error of $0.024$.

### A Penalized Logistic Model

Now suppose we wish to use a nonstandard estimation procedure, for example via the `glmnet` package, which uses cross-validation to choose a best-fitting penalization parameter $\lambda$ (named `lambda.min`).

```{r}
X <- as.matrix(neonates[, seq_len(11)])
Y <- neonates$y
glmnet_model <- glmnet::cv.glmnet(x = X, y = Y, family = "binomial", weights = weights)
```
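Before extracting the coefficients, it can be helpful to look at the penalty that cross-validation selected and the curve it came from. This inspection is purely illustrative and is not needed for the bootstrap calculations below; `lambda.min` and the plot method are standard parts of the `cv.glmnet` output.

```{r}
## Penalty selected by cross-validation, and the cross-validation curve
## (the vertical dashed lines mark lambda.min and lambda.1se)
glmnet_model$lambda.min
plot(glmnet_model)
```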
We can examine the estimates at `lambda.min` as follows.

```{r}
coefs <- as.matrix(coef(glmnet_model, s = glmnet_model$lambda.min))
knitr::kable(data.frame(variable = rownames(coefs), coefficient = coefs[, 1]),
             row.names = FALSE, digits = 3)
```

Following the same lines as above, we create a helper function to perform the bootstrap.

```{r}
glmnet_boot <- function(B, X, y, glmnet_model, weights, var = "resp") {
    lambda <- glmnet_model$lambda.min
    theta <- as.matrix(coef(glmnet_model, s = lambda))
    pi_hat <- predict(glmnet_model, newx = X, s = "lambda.min", type = "response")
    n <- length(pi_hat)
    ## Simulate B response vectors from the fitted probabilities
    y_star <- sapply(seq_len(B), function(i) ifelse(runif(n) <= pi_hat, 1, 0))
    ## Refit the penalized model, at the same lambda, to each simulated response
    beta_star <- apply(y_star, 2, function(y) {
        as.matrix(coef(glmnet::glmnet(x = X, y = y, lambda = lambda,
                                      weights = weights, family = "binomial")))
    })
    rownames(beta_star) <- rownames(theta)
    list(theta = theta[var, ],
         theta_star = beta_star[var, ],
         suff_stat = t(y_star) %*% X)
}
```

And off we go.

```{r}
glmnet_boot_out <- glmnet_boot(B = 2000, X, Y, glmnet_model, weights)
glmnet_bca <- bcapar(t0 = glmnet_boot_out$theta,
                     tt = glmnet_boot_out$theta_star,
                     bb = glmnet_boot_out$suff_stat)
```

We can now examine the bootstrap limits and statistics and compare them with what we got from `glm` above.

```{r}
knitr::kable(glmnet_bca$lims, digits = 3)
```

```{r}
knitr::kable(glmnet_bca$stats, digits = 3)
```

The shrinkage is evident; the bootstrap estimate is now $0.862 \pm 0.127$. In fact, only a proportion `r with(glmnet_boot_out, sum(theta_star > theta) / 2000)` of the bootstrap replicates are above $0.862$. Therefore, the bias corrector is large: $\hat{z}_0 = 0.411$.

Finally, we can plot both the `glm` and `glmnet` results side-by-side.

```{r fig.width = 8, echo = FALSE}
opar <- par(mfrow = c(1, 2))
bcaplot(glm_bca)
bcaplot(glmnet_bca)
par(opar)
```

## Ratio of Independent Variance Estimates

Assume we have two independent estimates of variance from normal theory:
\[
\hat{\sigma}_1^2 \sim \frac{\sigma_1^2\chi_{n_1}^2}{n_1},
\]
and
\[
\hat{\sigma}_2^2 \sim \frac{\sigma_2^2\chi_{n_2}^2}{n_2}.
\]
Suppose now that our parameter of interest is
\[
\theta = \frac{\sigma_1^2}{\sigma_2^2},
\]
for which we wish to compute confidence limits. In this setting, since $\hat{\theta}/\theta \sim F_{n_1,n_2}$, theory yields exact limits:
\[
\hat{\theta}(\alpha) = \frac{\hat{\theta}}{F_{n_1,n_2}^{1-\alpha}}.
\]

We can apply `bcapar` to this problem. As before, here are our helper functions.

```{r}
ratio_boot <- function(B, v1, v2) {
    ## Simulate B pairs of scaled chi-squared variance estimates;
    ## n1 and n2 are taken from the enclosing environment
    s1 <- v1 * rchisq(n = B, df = n1) / n1
    s2 <- v2 * rchisq(n = B, df = n2) / n2
    theta_star <- s1 / s2
    beta_star <- cbind(s1, s2)
    list(theta = v1 / v2, theta_star = theta_star, suff_stat = beta_star)
}

funcF <- function(beta) {
    beta[1] / beta[2]
}
```

Note that we have an additional function `funcF`, which corresponds to $\tau(\hat{\beta}^*)$ in the paper. This is the function expressing the parameter of interest as a function of the sufficient statistic.

```{r}
B <- 16000
n1 <- 10
n2 <- 42
ratio_boot_out <- ratio_boot(B, 1, 1)
ratio_bca <- bcapar(t0 = ratio_boot_out$theta,
                    tt = ratio_boot_out$theta_star,
                    bb = ratio_boot_out$suff_stat,
                    func = funcF)
```

The limits obtained are shown below, along with the exact limits in the last column.

```{r}
exact <- 1 / qf(df1 = n1, df2 = n2, p = 1 - as.numeric(rownames(ratio_bca$lims)))
knitr::kable(cbind(ratio_bca$lims, exact = exact), digits = 3)
```

Clearly the BCa limits match the exact values very well and suggest a large upward correction to the standard limits.
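To make "very well" concrete, we can tabulate the relative differences between the BCa and exact limits. This is a purely illustrative check, and it assumes (as in the `bcajack` output earlier) that the first column of `ratio_bca$lims` holds the BCa limits.

```{r}
## Relative difference between the BCa limits and the exact F-based limits;
## assumes column 1 of ratio_bca$lims holds the BCa limits
knitr::kable(data.frame(alpha = rownames(ratio_bca$lims),
                        bca = ratio_bca$lims[, 1],
                        exact = exact,
                        relative_difference = ratio_bca$lims[, 1] / exact - 1),
             row.names = FALSE, digits = 3)
```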
Here the corrections are all positive, as seen in the table below: $\hat{z}_0 = 0.093$ and $\hat{a} = 0.092$.

```{r}
knitr::kable(ratio_bca$stats, digits = 3)
```

```{r}
knitr::kable(t(ratio_bca$abcstats), digits = 3)
```

```{r}
knitr::kable(ratio_bca$ustats, digits = 3)
```

The plot below shows that there is a moderate amount of internal error in $\hat{\theta}_{bca}(0.975)$, as shown by the red bar. The `pct` column suggests why: $\hat{\theta}_{bca}(0.975)$ occurs at the $0.996$-quantile of the 16,000 replications, i.e., at the 64th largest $\hat{\theta}^*$, where there is a limited amount of data for estimating the distribution.

```{r}
bcaplot(ratio_bca)
```

## References