Title: | Empirical Bayes Estimation Strategies |
---|---|
Description: | Empirical Bayes methods for learning prior distributions from data. An unknown prior distribution (g) has yielded (unobservable) parameters, each of which produces a data point from a parametric exponential family (f). The goal is to estimate the unknown prior ("g-modeling") by deconvolution and Empirical Bayes methods. Details and examples are in the paper by Narasimhan and Efron (2020, <doi:10.18637/jss.v094.i11>). |
Authors: | Bradley Efron [aut], Balasubramanian Narasimhan [aut, cre] |
Maintainer: | Balasubramanian Narasimhan <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2-1 |
Built: | 2024-11-18 05:40:37 UTC |
Source: | https://github.com/bnaras/deconvolver |
-modeling using exponential families.deconvolveR
is a package for Empirical Bayes Deconvolution
and Estimation. A friendly introduction is provided in the JSS
paper reference below and this package includes a vignette
containing a number of examples.
Bradley Efron. Empirical Bayes Deconvolution Estimates. Biometrika 103(1), 1-20, ISSN 0006-3444. doi:10.1093/biomet/asv068. http://biomet.oxfordjournals.org/content/103/1/1.full.pdf+html
Bradley Efron and Trevor Hastie. Computer Age Statistical Inference. Cambridge University Press. ISBN 978-1-1-7-14989-2. Chapter 21.
Balasubramanian Narasimhan and Bradley Efron. deconvolveR: A G-Modeling Program for Deconvolution and Empirical Bayes Estimation. doi:10.18637/jss.v094.i11
Shakespeare word counts in the entire canon: 14,376 distinct words appeared exactly once, 4343 words appeared twice etc.
data(bardWordCount)
data(bardWordCount)
Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, Vol 63(3), doi:10.1093/biomet/63.3.435.
A function to compute Empirical Bayes estimates using deconvolution
deconv( tau, X, y, Q, P, n = 40, family = c("Poisson", "Normal", "Binomial"), ignoreZero = TRUE, deltaAt = NULL, c0 = 1, scale = TRUE, pDegree = 5, aStart = 1, ... )
deconv( tau, X, y, Q, P, n = 40, family = c("Poisson", "Normal", "Binomial"), ignoreZero = TRUE, deltaAt = NULL, c0 = 1, scale = TRUE, pDegree = 5, aStart = 1, ... )
tau |
a vector of (implicitly m) discrete support points for
|
X |
the vector of sample values: a vector of counts for
Poisson, a vector of z-scores for Normal, a 2-d matrix with
rows consisting of pairs, (trial size |
y |
the multinomial counts. See details below |
Q |
the Q matrix, implies y and P are supplied as well; see details below |
P |
the P matrix, implies Q and y are supplied as well; see details below |
n |
the number of support points for X. Applies only to
Poisson and Normal. In the former, implies that support of X is
1 to n or 0 to n-1 depending on the |
family |
the exponential family, one of |
ignoreZero |
if the zero values should be ignored (default =
|
deltaAt |
the theta value where a delta function is desired
(default |
c0 |
the regularization parameter (default 1) |
scale |
if the Q matrix should be scaled so that the spline
basis has mean 0 and columns sum of squares to be one, (default
|
pDegree |
the degree of the splines to use (default 5). In
notation used in the references below, |
aStart |
the starting values for the non-linear optimization, default is a vector of 1s |
... |
further args to function |
a list of 9 items consisting of
mle |
the maximum
likelihood estimate |
Q |
the m by p matrix Q |
P |
the n by m matrix P |
S |
the ratio of
artificial to genuine information per the reference below,
where it was referred to as |
cov |
the covariance matrix for the mle |
cov.g |
the covariance
matrix for the |
stats |
an m by 6 or 7 matrix
containing columns for |
loglik |
the negative log-likelihood function for the data
taking a |
statsFunction |
a function to compute the statistics returned above |
The data X
is always required with two exceptions. In the Poisson case,
y
alone may be specified and X
omitted, in which case the sample space of
the observations $$ is assumed to be 1, 2, ..,
length(y)
. The second exception is
for experimentation with other exponential families besides the three implemented here:
y
, P
and Q
can be specified together.
Note also that in the Poisson case where there is zero truncation,
the stats
matrix has an additional column "tg"
which
accounts for the thinning correction induced by the truncation. See
vignette for details.
Bradley Efron. Empirical Bayes Deconvolution Estimates. Biometrika 103(1), 1-20, ISSN 0006-3444. doi:10.1093/biomet/asv068. http://biomet.oxfordjournals.org/content/103/1/1.full.pdf+html
Bradley Efron and Trevor Hastie. Computer Age Statistical Inference. Cambridge University Press. ISBN 978-1-1-7-14989-2. Chapter 21.
set.seed(238923) ## for reproducibility N <- 1000 theta <- rchisq(N, df = 10) X <- rpois(n = N, lambda = theta) tau <- seq(1, 32) result <- deconv(tau = tau, X = X, ignoreZero = FALSE) print(result$stats) ## ## Twin Towers Example ## See Brad Efron: Bayes, Oracle Bayes and Empirical Bayes ## disjointTheta is provided by deconvolveR package theta <- disjointTheta; N <- length(disjointTheta) z <- rnorm(n = N, mean = disjointTheta) tau <- seq(from = -4, to = 5, by = 0.2) result <- deconv(tau = tau, X = z, family = "Normal", pDegree = 6) g <- result$stats[, "g"] if (require("ggplot2")) { ggplot() + geom_histogram(mapping = aes(x = disjointTheta, y = ..count.. / sum(..count..) ), color = "blue", fill = "red", bins = 40, alpha = 0.5) + geom_histogram(mapping = aes(x = z, y = ..count.. / sum(..count..) ), color = "brown", bins = 40, alpha = 0.5) + geom_line(mapping = aes(x = tau, y = g), color = "black") + labs(x = paste(expression(theta), "and x"), y = paste(expression(g(theta)), " and f(x)")) }
set.seed(238923) ## for reproducibility N <- 1000 theta <- rchisq(N, df = 10) X <- rpois(n = N, lambda = theta) tau <- seq(1, 32) result <- deconv(tau = tau, X = X, ignoreZero = FALSE) print(result$stats) ## ## Twin Towers Example ## See Brad Efron: Bayes, Oracle Bayes and Empirical Bayes ## disjointTheta is provided by deconvolveR package theta <- disjointTheta; N <- length(disjointTheta) z <- rnorm(n = N, mean = disjointTheta) tau <- seq(from = -4, to = 5, by = 0.2) result <- deconv(tau = tau, X = z, family = "Normal", pDegree = 6) g <- result$stats[, "g"] if (require("ggplot2")) { ggplot() + geom_histogram(mapping = aes(x = disjointTheta, y = ..count.. / sum(..count..) ), color = "blue", fill = "red", bins = 40, alpha = 0.5) + geom_histogram(mapping = aes(x = z, y = ..count.. / sum(..count..) ), color = "brown", bins = 40, alpha = 0.5) + geom_line(mapping = aes(x = tau, y = g), color = "black") + labs(x = paste(expression(theta), "and x"), y = paste(expression(g(theta)), " and f(x)")) }
values that have a bimodal distribution for testingA set of values that have a bimodal distribution for testing
data(disjointTheta)
data(disjointTheta)
,
) where
is the number of satellites removed
and
is the number of satellites found to be malignant.Intestinal surgery data involving 844 cancer patients. The data consists of
pairs (,
) where
is the number of satellites removed
and
is the number of satellites found to be malignant.
data(surg)
data(surg)
Gholami, et. al. Number of Lymph Nodes Removed and Survival after Gastric Cancer Resection: An Analysis from the US Gastric Cancer Collaborative. J Am Coll Surg. 2015 Aug;221(2):291-9. doi: 10.1016/j.jamcollsurg.2015.04.024.