Title: | Contrast Trees and Boosting |
---|---|
Description: | Contrast trees represent a new approach for assessing the accuracy of many types of machine learning estimates that are not amenable to standard (cross) validation methods; see "Contrast trees and distribution boosting", Jerome H. Friedman (2020) <doi:10.1073/pnas.1921562117>. In situations where inaccuracies are detected, boosted contrast trees can often improve performance. Functions are provided to to build such trees in addition to a special case, distribution boosting, an assumption free method for estimating the full probability distribution of an outcome variable given any set of joint input predictor variable values. |
Authors: | Jerome Friedman [aut, cph], Balasubramanian Narasimhan [aut, cre] |
Maintainer: | Balasubramanian Narasimhan <[email protected]> |
License: | Apache License 2.0 |
Version: | 0.3-1 |
Built: | 2024-10-31 21:13:00 UTC |
Source: | https://github.com/bnaras/contree |
Contrast trees represent a new approach for assessing the accuracy of many types of machine learning estimates that are not amenable to standard (cross) validation methods. In situations where inaccuracies are detected, boosted contrast trees can often improve performance. Functions are provided to to build such trees in addition to a special case, distribution boosting, an assumption free method for estimating the full probability distribution of an outcome variable given any set of joint input predictor variable values.
Original code (C) by Jerome H. Friedman, minor modifications, formatting, and packaging by Balasubramanian Narasimhan
Jerome Friedman (2019). Contrast Trees and Distribution Boosting https://arxiv.org/abs/1912.03785
The data come from 9243 questionnaires filled out by shopping mall
customers in the San Francisco Bay Area (Impact Resources, Inc.,
Columbus, OH). Here we attempt to estimate a persons age as a
function of the other 13 demographic variables. For this data set
age value is reported as being in one of seven intervals {13-17, 18-24, 25-34, 35-44, 45-54, 55-64, >= 65}
. Each persons age is
randomly generated uniformly within its corresponding reported
interval. For the last interval an exponential distribution was
used with mean corresponding to life expectancy after reaching age
65.
age_data
age_data
age_data
A list of 3 items.
data frame of 8856 observations on 13 variables
Randomly generated age in the range above
gradient boosting model for median age given x
The Elements of Statistical Learning, Data Mining, Second Edition, by Hastie, Tibshirani, and Friedman.
The data set consists of hourly averaged measurements from an array
of 5 metal oxide chemical sensors embedded in an air quality
chemical multisensor device. The outcome variable y
is the
corresponding true hourly averaged concentration CO taken from a
reference analyzer. The input variables x
are taken to be the
corresponding hourly averaged measurements of the 13 other
quantities.
air_quality
air_quality
air_quality
A list with 4 items.
data frame of 9357 observations on 13 variables
hourly averaged CO concentration
sample membership indicator
probability propensity score
https://archive.ics.uci.edu/ml/datasets/air+quality
Includes a data frame of 1994 US census income from 48,842 people
divided into a training set of 32,561 and an independent test set
of 16,281. The training outcome variable y
(yt
for test) is
binary and indicates whether or not a person’s income is greater
than $50,000 per year. There are 12 predictor variables x
(xt
for test) consisting of various demographic and financial
properties associated with each person. It also included estimates
of obtained by several machine learning methods:
gradient boosting on logistic scale using maximum likelihood (GBL),
random forest (RF), and gradient boosting on the probability scale
(GBP) using least–squares.
census
census
census
A list of 10 items.
training data frame of 32561 observations on 12 predictor variables
training binary response whether salary is above $50K or not
test data frame of 16281 observations predictor variables
test binary response whether salary is above $50K or not
training GBL response variable
test GBL response variable
training GBP response variable
test GBP response variable
training RF response probabilities
test GBP response probabilities
https://archive.ics.uci.edu/ml/datasets/census+income
Build contrast tree
Build boosted contrast tree model
Bootstrap contrast trees
contrast( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, mode = c("onesamp", "twosamp"), type = "dist", pwr = 2, quant = 0.5, nclass = NULL, costs = NULL, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 1, fnodes = 0.25, fsamp = 1, doprint = FALSE ) modtrast( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, learn.rate = 0.1, type = c("dist", "diff", "class", "quant", "prob", "maxmean", "diffmean"), pwr = 2, quant = 0.5, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 1, fnodes = 0.25, fsamp = 1, doprint = FALSE, niter = 100, doplot = FALSE, span = 0, plot.span = 0.15, print.itr = 10 ) bootcri( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, mode = "onesamp", type = "dist", pwr = 2, quant = 0.5, nclass = NULL, costs = NULL, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 100, fnodes = 1, fsamp = 1, doprint = FALSE )
contrast( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, mode = c("onesamp", "twosamp"), type = "dist", pwr = 2, quant = 0.5, nclass = NULL, costs = NULL, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 1, fnodes = 0.25, fsamp = 1, doprint = FALSE ) modtrast( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, learn.rate = 0.1, type = c("dist", "diff", "class", "quant", "prob", "maxmean", "diffmean"), pwr = 2, quant = 0.5, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 1, fnodes = 0.25, fsamp = 1, doprint = FALSE, niter = 100, doplot = FALSE, span = 0, plot.span = 0.15, print.itr = 10 ) bootcri( x, y, z, w = rep(1, nrow(x)), cat.vars = NULL, not.used = NULL, qint = 10, xmiss = 9e+35, tree.size = 10, min.node = 500, mode = "onesamp", type = "dist", pwr = 2, quant = 0.5, nclass = NULL, costs = NULL, cdfsamp = 500, verbose = FALSE, tree.store = 1e+06, cat.store = 1e+05, nbump = 100, fnodes = 1, fsamp = 1, doprint = FALSE )
x |
training input predictor data matrix or data frame. Rows are observations and columns are variables. Must be a numeric matrix or a data frame. |
y |
vector, or matrix containing training data input outcome values or censoring intervals for each observation. if y is a vector then it implies that y uncensored outcome values or other contrasting quantity. If y is a matrix, then then y is assumed to be censoring intervals for each observation; see details below |
z |
vector containing values of a second contrasting quantity for each observation |
w |
training observation weights |
cat.vars |
vector of column labels (numbers or names) indicating categorical variables (factors). All variables not so indicated are assumed to be orderable numeric; see details below |
not.used |
vector of column labels (numbers or names) indicating predictor variables not to be used in the model |
qint |
maximum number of split evaluation points on each predictor variable |
xmiss |
missing value flag. Must be numeric and larger than
any non missing predictor/abs(response) variable value.
Predictor variable values greater than or equal to xmiss are
regarded as missing. Predictor variable data values of |
tree.size |
maximum number of terminal nodes in generated trees |
min.node |
minimum number of training observations in each tree terminal node |
mode |
indicating one or two-sample contrast; see details below for how it works with type |
type |
type of contrast; see details below for how it works with mode |
pwr |
center split bias parameter. Larger values produce less center split bias. |
quant |
specified quantile p (type='quant' only) |
nclass |
number of classes (type ='class' only) default=2 |
costs |
nclass by nclass misclassification cost matrix (type='class' only); default is equal valued diagonal (error rate) |
cdfsamp |
= maximum subsample size used to compute censored CDF (censoring only) |
verbose |
a logical flag indicating print/don't print censored
CDF computation progress, default |
tree.store |
size of internal tree storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size, or in response to diagnostic message or erratic program behavior |
cat.store |
size of internal categorical value storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size in the presence of many categorical variables (factors) with many levels, or in response to diagnostic message or erratic program behavior |
nbump |
number of bootstrap replications |
fnodes |
top fraction of node criteria used to evaluate trial bumped trees |
fsamp |
fraction of observations used in each bootstrap sample for bumped trees |
doprint |
logical flag |
learn.rate |
learning rate parameter in |
niter |
number of trees |
doplot |
a flag to display/not display graphical plots |
span |
span for qq-plot transformation smoother |
plot.span |
running median smoother span for discrepancy plot ( |
print.itr |
tree discrepancy printing iteration interval |
The varible xmiss
is the missing value flag, Must be
numeric and larger than any non missing predictor/abs(response) variable value. Predictor variable values greater than or equal to xmiss
are regarded as missing. Predictor variable data values of NA
are internally set to the value of xmiss and thereby regarded as missing.
If the response y is a matrix, it is assumed to contain censoring intervals for each observation. Rows are observations.
First/second column are lower/upper boundary of censoring interval (Can be same value for uncensored observations) respectively
y[,1] = -xmiss
implies outcome less than or equal to y[,2]
(censored from above)
y[,2] = xmiss
implies outcome greater than or equal to y[,1]
Note that censoring is only allowed for type='dist'
; see further below.
If x is a data frame and cat.vars
(the columns indicating
categorical variables), is missing, then components of type factor
are treated as categorical variables. Ordered factors should be
input as type numeric with appropriate numerical scores. If
cat.vars
is present it will over ride the data frame typing.
The mode
argument is either
'onesamp'
(default) meaning one x
-vector for each (x,z)
pair
'twosamp'
implies two-sample contrast with
x
are predictor variables for both samples
y
are outcomes for both samples
z
is sample identity flag with z < 0
implying first sample observations and z > 0
, the second sample observations.
The type
argument indicates the type of contrast. It can be either a user defined function or a string. If mode
is 'onesamp'
, the default,
type = 'dist'
(default) implies contrast distribution of y
with that of z
(y
may be censored - see above)
type = 'diff'
implies contrast joint paired values of y
and z
type = 'class'
implies classification: contrast class labels y[i]
and z[i]
are two class labels (in 1:nclass
) for each observation.
type = 'prob'
implies contrast predicted with empirical probabilities: y[i] = 0/1
and z[i]
is predicted probability for
-th observation
type = 'quant'
is contrast predicted with empirical quantiles: y[i]
is outcome value for -th observation and
z[i]
is predicted -th quantile value (see below) for
-th observation
type = 'diffmean'
implies maximize absolute mean difference between y
and z
type = 'maxmean'
implies maximize signed mean difference between y
and z
When mode is 'twosamp'
type= 'dist'
(default) implies contrast y
distributions of both samples
type = 'diffmean'
implies maximize absolute difference between means of two samples
type = 'maxmean'
maximize signed difference between means of two samples
When type
is a function, it must be a function of three arguments
f(y,z,w)
where y
and z
are double vectors and w
is a weight
vector, not necessarily normalized. The function should return a
double vector of length 1 as the result. See example below.
a contrast model object use as input to interpretation procedures
a contrast model object to be used with predtrast()
a named list with out$bcri
the bootstraped discrepancy values
Jerome H. Friedman
Jerome H. Friedman (2020). doi:10.1073/pnas.1921562117
data(census, package = "conTree") dx <- 1:10000; dxt <- 10001:16281; # Build contrast tree tree <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = 'prob') # Summarize tree treesum(tree) # Get terminal node identifiers for regions containing observations 1 through 10 getnodes(tree, x = census$xt[1:10, ]) # Plot nodes nodeplots(tree, x = census$xt[dx, ], y = census$yt[dx], z = census$gblt[dx]) # Summarize contrast tree against (precomputed) gradient boosting # on logistic scale using maximum likelihood (GBL) nodesum(tree, census$xt[dxt,], census$yt[dxt], census$gblt[dxt]) # Use a custom R discrepancy function to build a contrast tree dfun <- function(y, z, w) { w <- w / sum(w) abs(sum(w * (y - z))) } tree2 <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = dfun) nodesum(tree2, census$xt[dxt,], census$yt[dxt], census$gblt[dxt]) # Generate lack of fit curve lofcurve(tree, census$xt[dx,], census$yt[dx], census$gblt[dx]) # Build contrast tree boosting models # Use small # of iterations for illustration (typically >= 200) modgbl = modtrast(census$x, census$y, census$gbl, type = 'prob', niter = 10) # Plot model accuracy as a function of iteration number xval(modgbl, census$x, census$y, census$gbl, col = 'red') # Produce predictions from modtrast() for new data. ypred <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter) # Produce distribution boosting estimates yhat <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)
data(census, package = "conTree") dx <- 1:10000; dxt <- 10001:16281; # Build contrast tree tree <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = 'prob') # Summarize tree treesum(tree) # Get terminal node identifiers for regions containing observations 1 through 10 getnodes(tree, x = census$xt[1:10, ]) # Plot nodes nodeplots(tree, x = census$xt[dx, ], y = census$yt[dx], z = census$gblt[dx]) # Summarize contrast tree against (precomputed) gradient boosting # on logistic scale using maximum likelihood (GBL) nodesum(tree, census$xt[dxt,], census$yt[dxt], census$gblt[dxt]) # Use a custom R discrepancy function to build a contrast tree dfun <- function(y, z, w) { w <- w / sum(w) abs(sum(w * (y - z))) } tree2 <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = dfun) nodesum(tree2, census$xt[dxt,], census$yt[dxt], census$gblt[dxt]) # Generate lack of fit curve lofcurve(tree, census$xt[dx,], census$yt[dx], census$gblt[dx]) # Build contrast tree boosting models # Use small # of iterations for illustration (typically >= 200) modgbl = modtrast(census$x, census$y, census$gbl, type = 'prob', niter = 10) # Plot model accuracy as a function of iteration number xval(modgbl, census$x, census$y, census$gbl, col = 'red') # Produce predictions from modtrast() for new data. ypred <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter) # Produce distribution boosting estimates yhat <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)
Get terminal node observation assignments
getnodes(tree, x)
getnodes(tree, x)
tree |
model object output from contrast() or prune() |
x |
training input predictor data matrix or data frame in same format as in contrast() |
vector of tree terminal node identifiers (numbers) corresponding to each observation (row of x)
Produce lack-of-fit curve for a contrast tree
lofcurve( tree, x, y, z, w = rep(1, length(y)), doplot = "first", col = "black", ylim = NULL )
lofcurve( tree, x, y, z, w = rep(1, length(y)), doplot = "first", col = "black", ylim = NULL )
tree |
model object output from contrast() or prune() |
x |
training input predictor data matrix or data frame in same format as in contrast() |
y |
vector, or matrix containing training data input outcome values or censoring intervals for each observation in same format as in contrast() |
z |
vector containing values of a second contrasting quantity for each observation in same observation format as in contrast () |
w |
observation weights |
doplot |
logical flag. doplot="first" implies start new display. doplot="next" implies super impose plot on existing display. doplot="none" implies no plot displayed. |
col |
color of plotted curve |
ylim |
y-axis limit |
a named list of plotted x
and y
points
Summarize contrast tree
Show graphical terminal node summaries
nodesum(tree, x, y, z, w = rep(1, nrow(x)), doplot = FALSE) nodeplots( tree, x, y, z, w = rep(1, nrow(x)), nodes = NULL, xlim = NULL, ylim = NULL, pts = "FALSE", span = 0.15 )
nodesum(tree, x, y, z, w = rep(1, nrow(x)), doplot = FALSE) nodeplots( tree, x, y, z, w = rep(1, nrow(x)), nodes = NULL, xlim = NULL, ylim = NULL, pts = "FALSE", span = 0.15 )
tree |
model object output from contrast() or prune() |
x |
training input predictor data matrix or data frame in same format as in contrast() |
y |
vector, or matrix containing training data input outcome values or censoring intervals for each observation in same format as in contrast() |
z |
vector containing values of a second contrasting quantity for each observation in same observation format as in contrast() |
w |
observation weights |
doplot |
a flag to display/not display plots of output quantities |
nodes |
selected tree terminal node identifiers. Default is all terminal nodes |
xlim |
x-axis limit |
ylim |
y-axis limit |
pts |
logical flag indicating whether to show |
span |
running median smoother span ( |
The graphical representations of terminal node contrasts depend on the tree type
graphical representations of terminal node contrasts depending on tree type
-type = 'dist'
implies CDFs of y and z in each terminal node. (Only top nine nodes are shown). Note that y can be censored (see above)
-type = 'diff'
implies plot y versus z in each terminal node. (Only top nine nodes are shown).
-type = 'class'
implies barplot of misclassification risk (upper) amd total weight (lower) in each terminal node
-type = 'prob'
implies upper barplot contrasting empirical (blue) and predicted (red) in each terminal node. Lower barplot showing total weight in each terminal node.
type = 'quant' => upper barplot of fraction of y-values greater than or equal to corresponding z-values (quantile prediction) in each terminal node. Horizontal line reflects specified target quantile. Lower barplot showing total weight in each terminal node.
type = 'diffmean'
or type = 'maxmean'
implies upper barplot contrasting y-mean (blue) and z-mean (red) in each terminal node. Lower barplot showing total weight in each terminal node.
a named list of four items:
nodes
the tree terminal node identifiers
cri
the terminal node criterion values (depends on contrast type see above)
wt
sum of weights in each terminal node
avecri
weighted criterion average over all terminal nodes
These functions are mostly useful when one wants to
test one's own discrepancy function in R f(y, z, w)
to
determine if the results are correct. So a natural test
is to experiment by programming one of the already implemented
discrepancy functions in R. However, the Fortran
implementations of such discrepancy measures use some
parameters in the computations and therefore the returned
results from a simple R implementation may not exactly
match. Using these parameters, one can ensure that they
do. These are to be interpreted as follows. For one sample,
the type = "dist"
implementation in the package returns 0 if
the length of y
is less than nmin
which is (100L). The eps = 1.0e-5
parameter is used to ensure that the denominator in
the formula for the Anderson-Darling statistic is at least
eps
. Next, for type = "prob"
, if the length of the vector
is less than nmin = 20
the discrepancy is computed to be
0. And so on. Refer to the R and Fortran source for further
details as this is an advanced topic.
onesample_parameters() twosample_parameters()
onesample_parameters() twosample_parameters()
a named list for each of the types.
Predict y-values from boosted contrast model
predtrast(model, x, z, num = model$niter)
predtrast(model, x, z, num = model$niter)
model |
model object output from modtrast() |
x |
x-values for new data |
z |
z-values for new data |
num |
number of trees used to compute model values |
predicted y-values for new data from model
Prune a contrast tree
prune(tree, thr = 0.1)
prune(tree, thr = 0.1)
tree |
a tree model object output from contrast |
thr |
a split improvement threshold, default is 0.1 |
a bottom-up pruned tree with splits corresponding to improvement less than threshold thr
removed
Show all possible pruned subtrees
prune.seq(tree, eps = 0.01, plot.it = TRUE)
prune.seq(tree, eps = 0.01, plot.it = TRUE)
tree |
a tree model object output from contrast |
eps |
small increment defining grid of threshold values |
plot.it |
a logical flag indicating plot/don't plot of number of nodes versus threshold value for all pruned subtrees, default |
a named list of two items:
thr
a set of threshold values that sequentially reduce tree size
nodes
the corresponding tree sizes (number of terminal nodes)
Save the function f for calling from fortran
save_rfun(f)
save_rfun(f)
f |
the R function to be called using |
TRUE, invisibly.
Print terminal node x-region boundaries
treesum(tree, nodes = NULL)
treesum(tree, nodes = NULL)
tree |
model object output from contrast() or prune() |
nodes |
vector of terminal node identifiers for the tree specifying the desired regions. The default is all terminal nodes. |
The predictor variable x-boundaries defining each terminal node are printed.
For numeric variables: variable | sign | value
sign + => value=lower boundary on variable
sign - => value upper boundary on variable
For categorical variables: cat variable | sign | set of values
sign + => values in node
sign - => values not in node (compliment values in node) graphical representations of terminal node contrasts depend on the tree type
No return value (invisble NULL
)
Cross-validate boosted contrast tree boosted with (new) data
xval( mdl, x, y, z, num = length(mdl$tree), del = 10, span = 0.15, ylab = "Average Discrepancy", doplot = "first", doprint = FALSE, col = "red" )
xval( mdl, x, y, z, num = length(mdl$tree), del = 10, span = 0.15, ylab = "Average Discrepancy", doplot = "first", doprint = FALSE, col = "red" )
mdl |
model output from modtrast() |
x |
data predictor variables is same format as input to modtrast |
y |
data y values is same format as input to modtrast |
z |
data z values is same format as input to modtrast |
num |
number of trees used to compute model values |
del |
plot discrepancy value computed every del-th iteration (tree) |
span |
running median smoother span ( |
ylab |
graphical parameter ('doplot="first", only) |
doplot |
logical flag. doplot="first" implies start new display. doplot="next" implies super impose plot on existing display. doplot="none" implies no plot displayed. |
doprint |
logical flag |
col |
color of plotted curve |
a named list of two items: ntree
the iteration numbers, and error
the corresponding discrepancy values
approximates
for type = 'dist' onlyTransform z-values t(z) such that the distribution of approximates
for type = 'dist' only
ydist(model, x, z, num = model$niter)
ydist(model, x, z, num = model$niter)
model |
model object output from modtrast() |
x |
vector of predictor variable values for a (single) observation |
z |
sample of z-values drawn from |
num |
number of trees used to compute model values |
vector of length(z)
containing transformed values t(z) approximating