Title: | Pam: Prediction Analysis for Microarrays |
---|---|
Description: | Some functions for sample classification in microarrays. |
Authors: | Trevor Hastie [aut], Rob Tibshirani [aut], Balasubramanian Narasimhan [aut, cre], Gilbert Chu [aut] |
Maintainer: | Balasubramanian Narasimhan <[email protected]> |
License: | GPL-2 |
Version: | 1.57 |
Built: | 2024-10-29 04:46:09 UTC |
Source: | https://github.com/bnaras/pamr |
The khan
data frame has 2308 rows and 65 columns. These are one of
the datasets data used in the Tibshirani et al paper in PNAS on nearest
shrunken centroids.
The first two columns of gene ids and names and the remaining
columns are gene expression values for 63 samples. An attribute
cancer_type
contains the cancer type for each sample.
A function to adaptive choose threshold scales, for use in pamr.train
pamr.adaptthresh(object, ntries = 10, reduction.factor = 0.9, full.out = FALSE)
pamr.adaptthresh(object, ntries = 10, reduction.factor = 0.9, full.out = FALSE)
object |
The result of a call to pamr.train |
ntries |
Number of iterations to use in algorithm |
reduction.factor |
Amount by which a scaling is reduced in one step of the algorithm |
full.out |
Should full output be returned? Default FALSE |
pamr.adaptthresh
Adaptively searches for set of good threshold
scales. The baseline (default) scale is 1 for each class. The idea is that
for easy to classify classes, the threshold scale can be increased without
increasing the error rate for that class, and resulting in fewer genes
needed for the classification rule. The scalings from pamr.adaptthresh are
then used in pamr.train, and pamr.cv. The results may be better than those
obtained with the default values of threshold.scale.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. "Diagnosis of multiple cancer types by shrunken centroids of gene expression" PNAS 2002 99:6567-6572 (May 14).
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu (2002). Class prediction by nearest shrunken centroids,with applications to DNA microarrays. Stanford tech report.
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) new.scales <- pamr.adaptthresh(mytrain) mytrain2 <- pamr.train(mydata, threshold.scale=new.scales) myresults2 <- pamr.cv(mytrain2, mydata)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) new.scales <- pamr.adaptthresh(mytrain) mytrain2 <- pamr.train(mydata, threshold.scale=new.scales) myresults2 <- pamr.cv(mytrain2, mydata)
A function to mean-adjust microarray data by batches
pamr.batchadjust(data)
pamr.batchadjust(data)
data |
The input data. A list with components: x- an expression genes in the rows, samples in the columns, and y- a vector of the class labels for each sample, and batchlabels- a vector of batch labels for each sample. |
pamr.batchadjust
does a genewise one-way ANOVA adjustment for
expression values. Let be the expression for gene
in sample
.
Suppose sample
in in batch
, and let
be the set of all samples in batch
. Then
pamr.batchadjust
adjusts to
where the mean is taken over all samples
in
.
A data object of the same form as the input data, with x replaced by the adjusted x
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) #generate some data x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) batchlabels <- sample(c(1:5),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y),batchlabels=factor(batchlabels)) mydata2 <- pamr.batchadjust(mydata)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) #generate some data x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) batchlabels <- sample(c(1:5),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y),batchlabels=factor(batchlabels)) mydata2 <- pamr.batchadjust(mydata)
A function giving a table of true versus predicted values, from a nearest shrunken centroid fit.
pamr.confusion(fit, threshold, extra = TRUE)
pamr.confusion(fit, threshold, extra = TRUE)
fit |
The result of a call to pamr.train or pamr.cv |
threshold |
The desired threshold value |
extra |
Should the classwise and overall error rates be returned? Default TRUE |
pamr.confusion
Gives a cross-tabulation of true versus predicted
classes for the fit returned by pamr.train or pamr.cv, at the specified
threshold.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.confusion(mytrain, threshold=2) pamr.confusion(mycv, threshold=2)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.confusion(mytrain, threshold=2) pamr.confusion(mycv, threshold=2)
computes confusion matrix for (survival.time,censoring) outcome based on fit object "fit" and class predictions "yhat" soft response probabilities for (survival.time,censoring) are first estimated using Kaplan-Meier method applied to training data
pamr.confusion.survival(fit, survival.time, censoring.status, yhat)
pamr.confusion.survival(fit, survival.time, censoring.status, yhat)
fit |
The result of a call to pamr.train or pamr.cv |
survival.time |
Survival time |
censoring.status |
censoring status |
yhat |
class predictions |
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
A function to cross-validate the nearest shrunken centroid classifier produced by pamr.train
pamr.cv(fit, data, nfold = NULL, folds = NULL, ...)
pamr.cv(fit, data, nfold = NULL, folds = NULL, ...)
fit |
The result of a call to pamr.train |
data |
A list with at least two components: x- an expression genes in the rows, samples in the columns), and y- a vector of the class labels for each sample. Same form as data object used by pamr.train. |
nfold |
Number of cross-validation folds. Default is the smallest class size |
folds |
A list with nfold components, each component a vector of indices of the samples in that fold. By default a (random) balanced cross-validation is used |
... |
Any additional arguments that are to be passed to pamr.train |
pamr.cv
carries out cross-validation for a nearest shrunken centroid
classifier.
A list with components
threshold |
A vector of the thresholds tried in the shrinkage |
errors |
The number of cross-validation errors for each threshold value |
loglik |
The cross-validated multinomial log-likelihood value for each threshold value |
size |
A vector of the number of genes that survived the thresholding, for each threshold value tried. |
.
yhat |
A matrix of size n by nthreshold, containing the cross-validated class predictions for each threshold value, in each column |
prob |
A matrix of size n by nthreshold, containing the cross-validated class probabilities for each threshold value, in each column |
folds |
The cross-validation folds used |
cv.objects |
Train objects (output of pamr.train), from each of the CV folds |
call |
The calling sequence used |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata)
A function to decorrelate (adjust) the feature matrix with respect to some additional predictors
pamr.decorrelate( x, adjusting.predictors, xtest = NULL, adjusting.predictors.test = NULL )
pamr.decorrelate( x, adjusting.predictors, xtest = NULL, adjusting.predictors.test = NULL )
x |
Matrix of training set feature values, with genes in the rows, samples in the columns |
adjusting.predictors |
List of training set predictors to be used for adjustment |
xtest |
Optional matrix of test set feature values, to be adjusted in the same way as the training set |
adjusting.predictors.test |
Optional list of test set predictors to be used for adjustment |
pamr.decorrelate
Does a least squares regression of each row of x on
the adjusting predictors, and returns the residuals. If xtest is provided,
it also returns the adjusted version of xtest, using the training set least
squares regression model for adjustment
A list with components
x.adj |
Adjusted x matrix |
xtest.adj |
Adjusted xtest matrix, if xtest we provided |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu Diagnosis of multiple cancer types by shrunken centroids of gene expression PNAS 99: 6567-6572. Available at www.pnas.org
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x<-matrix(rnorm(1000*20),ncol=20) y<-c(rep(1,10),rep(2,10)) adjusting.predictors=list(pred1=rnorm(20), pred2=as.factor(sample(c(1,2),replace =TRUE,size=20))) xtest=matrix(rnorm(1000*10),ncol=10) adjusting.predictors.test=list(pred1=rnorm(10), pred2=as.factor(sample(c(1,2),replace =TRUE,size=10))) # decorrelate training x wrt adjusting predictors x.adj=pamr.decorrelate(x,adjusting.predictors)$x.adj # train classifier with adjusted x d=list(x=x.adj,y=y) a<-pamr.train(d) # decorrelate training and test x wrt adjusting predictors, then make #predictions for test set temp <- pamr.decorrelate(x,adjusting.predictors, xtest=xtest, adjusting.predictors.test=adjusting.predictors.test) d=list(x=temp$x.adj,y=y) a<-pamr.train(d) aa<-pamr.predict(a,temp$xtest.adj, threshold=.5)
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x<-matrix(rnorm(1000*20),ncol=20) y<-c(rep(1,10),rep(2,10)) adjusting.predictors=list(pred1=rnorm(20), pred2=as.factor(sample(c(1,2),replace =TRUE,size=20))) xtest=matrix(rnorm(1000*10),ncol=10) adjusting.predictors.test=list(pred1=rnorm(10), pred2=as.factor(sample(c(1,2),replace =TRUE,size=10))) # decorrelate training x wrt adjusting predictors x.adj=pamr.decorrelate(x,adjusting.predictors)$x.adj # train classifier with adjusted x d=list(x=x.adj,y=y) a<-pamr.train(d) # decorrelate training and test x wrt adjusting predictors, then make #predictions for test set temp <- pamr.decorrelate(x,adjusting.predictors, xtest=xtest, adjusting.predictors.test=adjusting.predictors.test) d=list(x=temp$x.adj,y=y) a<-pamr.train(d) aa<-pamr.predict(a,temp$xtest.adj, threshold=.5)
A function to estimate false discovery rates for the nearest shrunken centroid classifier
pamr.fdr( trained.obj, data, nperms = 100, xl.mode = c("regular", "firsttime", "onetime", "lasttime"), xl.time = NULL, xl.prevfit = NULL )
pamr.fdr( trained.obj, data, nperms = 100, xl.mode = c("regular", "firsttime", "onetime", "lasttime"), xl.time = NULL, xl.prevfit = NULL )
trained.obj |
The result of a call to pamr.train |
data |
Data object; same as the one passed to pamr.train |
nperms |
Number of permutations for estimation of FDRs. Default is 100 |
xl.mode |
Used by Excel interface |
xl.time |
Used by Excel interface |
xl.prevfit |
Used by Excel interface |
pamr.fdr
estimates false discovery rates for a nearest shrunken
centroid classifier
A list with components:
results |
Matrix of estimates FDRs for various various threshold values. Reported are both the median and 90th percentile of the FDR over permutations |
pi0 |
The estimated proportion of genes that are null, i.e. not significantly different |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) mytrain <- pamr.train(mydata) myfdr <- pamr.fdr(mytrain, mydata)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) mytrain <- pamr.train(mydata) myfdr <- pamr.fdr(mytrain, mydata)
A function to plot the genes that survive the thresholding, from the nearest shrunken centroid classifier produced by pamr.train
pamr.geneplot(fit, data, threshold)
pamr.geneplot(fit, data, threshold)
fit |
The result of a call to pamr.train |
data |
The input data. In the same format as the input data for pamr.train |
threshold |
The desired threshold value |
pamr.geneplot
Plots the raw gene expression for genes that survive
the specified threshold. Plot is stratified by class. Plot is set up to
display only up to about 20 or 25 genes, otherwise it gets too crowded.
Hence threshold should be chosen to yield at most about 20 or 25 genes.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) pamr.geneplot(mytrain, mydata, threshold=1.6)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) pamr.geneplot(mytrain, mydata, threshold=1.6)
A function that takes estimate class probabilities and produces a class prediction or indeterminate prediction
pamr.indeterminate(prob, mingap = 0)
pamr.indeterminate(prob, mingap = 0)
prob |
Estimated class probabilities, from pamr.predict with type="posterior") |
mingap |
Minimum difference between highest and second highest probability. If difference is < mingap, prediction is set to indeterminate (NA) |
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) prob<- pamr.predict(mytrain, mydata$x , threshold=1, type="posterior") pamr.indeterminate(prob,mingap=.75)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) prob<- pamr.predict(mytrain, mydata$x , threshold=1, type="posterior") pamr.indeterminate(prob,mingap=.75)
A function to list the genes that survive the thresholding, from the nearest shrunken centroid classifier produced by pamr.train
pamr.listgenes(fit, data, threshold, fitcv = NULL, genenames = FALSE)
pamr.listgenes(fit, data, threshold, fitcv = NULL, genenames = FALSE)
fit |
The result of a call to pamr.train |
data |
The input data. In the same format as the input data for pamr.train |
threshold |
The desired threshold value |
fitcv |
Optional object, result of a call to pamr.cv |
genenames |
Include genenames in the list? If yes, they are taken from "data". Default is false (geneid is always included in the list). |
pamr.listgenes
List the geneids, and standardized centroids for each
class, for genes surviving at the given threshold. If fitcv is provided, the
function also reports the average rank of the gene in the cross-validation
folds, and the proportion of times that the gene is chosen (at the given
threshold) in the cross-validation folds.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) #train classifier mytrain<- pamr.train(mydata) pamr.listgenes(mytrain, mydata, threshold=1.6)
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) #train classifier mytrain<- pamr.train(mydata) pamr.listgenes(mytrain, mydata, threshold=1.6)
function to interactively define classes from a clustering tree
pamr.makeclasses(data, sort.by.class = FALSE, ...)
pamr.makeclasses(data, sort.by.class = FALSE, ...)
data |
The input data. A list with components: x- an expression genes in the rows, samples in the columns, and y- a vector of the class labels for each sample, and batchlabels- a vector of batch labels for each sample. This object if the same form as that produced by pamr.from.excel. |
sort.by.class |
Optional argument. If true, the clustering tree is forced to put all samples in the same class (as defined by the class labels y in ‘data’) together in the tree. This is useful if a regrouping of classes is desired. Eg: given classes 1,2,3,4 you want to define new classes (1,3) vs (2,4) or 2 vs (1,3) |
... |
Any additional arguments to be passed to hclust |
pamr.makeclasses
Using this function the user interactively defines a
new set of classes, to be used in pamr.train, pamr.cv etc. After invoking
pamr.makeclasses, a clustering tree is drawn. This callss the R function
hclust
, and any arguments for hclust
can be passed to it.
Using the left button, the user clicks at the junction point defining the
subgroup 1. More groups can be added to class 1 by clicking on further
junction points. The user ends the definition of class 1 by clicking on the
rightmost button (in Windows, an additional menu appears and he chooses
Stop). This process is continued for classes 2,3 etc. Note that some
sample may be left out of the new classes. Two consecutive clicks of the
right button ends the definition for all classes.
At the end, the clustering is redrawn, with the new class labels shown.
Note: this function is "fragile". The user must click close to the junction point, to avoid confusion with other junction points. Classes 1,2,3.. cannot have samples in common (if they do, an Error message will appear). If the function is confused about the desired choices, it will complain and ask the user to rerun pamr.makeclasses. The user should also check that the labels on the final redrawn cluster tree agrees with the desired classes.
A vector of class labels 1,2,3... If a component is NA (missing),
then the sample is not assigned to any class. This vector should be
assigned to the newy component of data, for use in pamr.train etc. Note
that pamr.train uses the class labels in the component newy'' if it is present. Otherwise it uses the data labels
y”.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) #generate some data x <- matrix(rnorm(1000*40),ncol=40) y <- sample(c(1:4),size=40,replace=TRUE) batchlabels <- sample(c(1:5),size=40,replace=TRUE) mydata <- list(x=x,y=factor(y),batchlabels=factor(batchlabels), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) # mydata$newy <- pamr.makeclasses(mydata) Run this and define some new classes train <- pamr.train(mydata)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) #generate some data x <- matrix(rnorm(1000*40),ncol=40) y <- sample(c(1:4),size=40,replace=TRUE) batchlabels <- sample(c(1:5),size=40,replace=TRUE) mydata <- list(x=x,y=factor(y),batchlabels=factor(batchlabels), geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep="")) # mydata$newy <- pamr.makeclasses(mydata) Run this and define some new classes train <- pamr.train(mydata)
A function to plot the shrunken class centroids, from the nearest shrunken centroid classifier produced by pamr.train
pamr.plotcen(fit, data, threshold)
pamr.plotcen(fit, data, threshold)
fit |
The result of a call to pamr.train |
data |
The input data, in the same form as that used by pamr.train |
threshold |
The desired threshold value |
pamr.plotcen
plots the shrunken class centroids for each class, for
genes surviving the threshold for at least once class. If genenames are
included in "data", they are added to the plot. Note: for many classes and
long gene names, this plot may need some manual prettying.
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y,genenames=as.character(1:1000)) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.plotcen(mytrain, mydata,threshold=1.6)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y,genenames=as.character(1:1000)) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.plotcen(mytrain, mydata,threshold=1.6)
A function to plot the cross-validated error curves the nearest shrunken centroid classifier
pamr.plotcv(fit)
pamr.plotcv(fit)
fit |
The result of a call to pamr.cv |
pamr.plotcv
plots the cross-validated misclassification error curves,
from nearest shrunken centroid classifier. An overall plot, and a plot by
class, are produced.
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain, mydata) pamr.plotcv(mycv)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain, mydata) pamr.plotcv(mycv)
A function to plot the cross-validated sample probabilities from the nearest shrunken centroid classifier
pamr.plotcvprob(fit, data, threshold)
pamr.plotcvprob(fit, data, threshold)
fit |
The result of a call to pamr.cv |
data |
A list with at least two components: x- an expression genes in the rows, samples in the columns), and y- a vector of the class labels for each sample. Same form as data object used by pamr.train. |
threshold |
Threshold value to be used |
pamr.plotcvprob
plots the cross-validated sample probabilities the
from nearest shrunken centroid classifier, stratified by the true classses.
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.plotcvprob(mycv,mydata,threshold=1.6)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.plotcvprob(mycv,mydata,threshold=1.6)
A function to plot the FDR curve the nearest shrunken centroid classifier
pamr.plotfdr(fdrfit, call.win.metafile = FALSE)
pamr.plotfdr(fdrfit, call.win.metafile = FALSE)
fdrfit |
The result of a call to pamr.fdr |
call.win.metafile |
Used by Excel interface |
pamr.plotfdr
plots the FDR curves from nearest shrunken centroid
classifier. The median FDR (solid line) and upper 90 percentile (broken
line) are shown
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:2),size=20,replace=TRUE) x[1:50,y==2]=x[1:50,y==2]+3 mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) myfdr <- pamr.fdr(mytrain, mydata) pamr.plotfdr(myfdr)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:2),size=20,replace=TRUE) x[1:50,y==2]=x[1:50,y==2]+3 mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) myfdr <- pamr.fdr(mytrain, mydata) pamr.plotfdr(myfdr)
A function to plot the survival curves in each Kaplan Meier stratum
pamr.plotstrata(fit, survival.time, censoring.status)
pamr.plotstrata(fit, survival.time, censoring.status)
fit |
The result of a call to pamr.train |
survival.time |
Vector of survival times |
censoring.status |
Vector of censoring status values |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) pamr.plotstrata(a3, d$survival.time, d$censoring.status)
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) pamr.plotstrata(a3, d$survival.time, d$censoring.status)
A function to plots Kaplan-Meier curves stratified by a group variable
pamr.plotsurvival(group, survival.time, censoring.status)
pamr.plotsurvival(group, survival.time, censoring.status)
group |
A grouping factor |
survival.time |
Vector of survival times |
censoring.status |
Vector of censoring status values: 1=died, 0=censored |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) #make class predictions yhat <- pamr.predict(a3,d$x, threshold=1.0) pamr.plotsurvival(yhat, d$survival.time, d$censoring.status)
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) #make class predictions yhat <- pamr.predict(a3,d$x, threshold=1.0) pamr.plotsurvival(yhat, d$survival.time, d$censoring.status)
A function giving prediction information, from a nearest shrunken centroid fit
pamr.predict( fit, newx, threshold, type = c("class", "posterior", "centroid", "nonzero"), prior = fit$prior, threshold.scale = fit$threshold.scale )
pamr.predict( fit, newx, threshold, type = c("class", "posterior", "centroid", "nonzero"), prior = fit$prior, threshold.scale = fit$threshold.scale )
fit |
The result of a call to pamr.train |
newx |
Matrix of features at which predictions are to be made |
threshold |
The desired threshold value |
type |
Type of prediction desired: class predictions, posterior probabilities, (unshrunken) class centroids, vector of genes surviving the threshold |
prior |
Prior probabilities for each class. Default is that specified in "fit" |
threshold.scale |
Additional scaling factors to be applied to the thresholds. Vector of length equal to the number of classes. Default is that specified in "fit". |
pamr.predict
Give a cross-tabulation of true versus predicted classes
for the fit returned by pamr.train or pamr.cv, at the specified threshold
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.predict(mytrain, mydata$x , threshold=1)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) mycv <- pamr.cv(mytrain,mydata) pamr.predict(mytrain, mydata$x , threshold=1)
A function giving prediction information for many threshold values, from a nearest shrunken centroid fit
pamr.predictmany( fit, newx, threshold = fit$threshold, prior = fit$prior, threshold.scale = fit$threshold.scale, ... )
pamr.predictmany( fit, newx, threshold = fit$threshold, prior = fit$prior, threshold.scale = fit$threshold.scale, ... )
fit |
The result of a call to pamr.train |
newx |
Matrix of features at which predictions are to be made |
threshold |
The desired threshold values |
prior |
Prior probabilities for each class. Default is that specified in "fit" |
threshold.scale |
Additional scaling factors to be applied to the thresholds. Vector of length equal to the number of classes. Default is that specified in "fit". |
... |
Additional arguments to be passed to pamr.predict |
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) pamr.predictmany(mytrain, mydata$x)
suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=y) mytrain <- pamr.train(mydata) pamr.predictmany(mytrain, mydata$x)
A function to assign observations to categories, based on their survival times.
pamr.surv.to.class2( y, icens, cutoffs = NULL, n.class = NULL, class.names = NULL, newy = y, newic = icens )
pamr.surv.to.class2( y, icens, cutoffs = NULL, n.class = NULL, class.names = NULL, newy = y, newic = icens )
y |
vector of survival times |
icens |
Vector of censorng status values: 1=died, 0=censored |
cutoffs |
Survival time cutoffs for categories. Default NULL |
n.class |
Number of classes to create: if cutoffs is NULL, n.class equal classes are created. |
class.names |
Character names for classes |
newy |
New set of survival times, for which probabilities are computed (see below). Default is y |
newic |
New set of censoring statuses, for which probabilities are computed (see below). Default is icens |
pamr.pamr.surv.to.class2
splits observations into categories based on
their survival times and the Kaplan-Meier estimates. For example if
n.class=2, it makes two categories, one below the median survival, the other
above. For each observation (newy, ic), it then computes the probability of
that observation falling in each category. For an uncensored observation
that probability is just 1 or 0 depending on when the death occurred. For a
censored observation, the probabilities are based on the Kaplan Meier and
are typically between 0 and 1.
class |
The category labels |
prob |
The estimates class probabilities |
cutoffs |
The cutoffs used |
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) # generate test data junkk<- gendata(n=500) dd <- list(x=junkk$x, survival.time=junkk$y, censoring.status=junkk$ic) # compute soft labels proby <- pamr.surv.to.class2(dd$survival.time, dd$censoring.status, n.class=a3$ngroup.survival)$prob
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) # generate test data junkk<- gendata(n=500) dd <- list(x=junkk$x, survival.time=junkk$y, censoring.status=junkk$ic) # compute soft labels proby <- pamr.surv.to.class2(dd$survival.time, dd$censoring.status, n.class=a3$ngroup.survival)$prob
A function giving a table of true versus predicted values, from a nearest shrunken centroid fit from survival data.
pamr.test.errors.surv.compute(proby, yhat)
pamr.test.errors.surv.compute(proby, yhat)
proby |
Survival class probabilities, from pamr.surv.to.class2 |
yhat |
Estimated class labels, from pamr.predict |
pamr.test.errors.surv.compute
computes the erros between the true
'soft" class labels proby and the estimated ones "yhat"
Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) # generate test data junkk<- gendata(n=500) dd <- list(x=junkk$x, survival.time=junkk$y, censoring.status=junkk$ic) # compute soft labels proby <- pamr.surv.to.class2(dd$survival.time, dd$censoring.status, n.class=a3$ngroup.survival)$prob # make class predictions for test data yhat <- pamr.predict(a3,dd$x, threshold=1.0) # compute test errors pamr.test.errors.surv.compute(proby, yhat)
gendata<-function(n=100, p=2000){ tim <- 3*abs(rnorm(n)) u<-runif(n,min(tim),max(tim)) y<-pmin(tim,u) ic<-1*(tim<u) m <- median(tim) x<-matrix(rnorm(p*n),ncol=n) x[1:100, tim>m] <- x[1:100, tim>m]+3 return(list(x=x,y=y,ic=ic)) } # generate training data; 2000 genes, 100 samples junk<-gendata(n=100) y<-junk$y ic<-junk$ic x<-junk$x d <- list(x=x,survival.time=y, censoring.status=ic, geneid=as.character(1:nrow(x)), genenames=paste("g",as.character(1:nrow(x)),sep= "")) # train model a3<- pamr.train(d, ngroup.survival=2) # generate test data junkk<- gendata(n=500) dd <- list(x=junkk$x, survival.time=junkk$y, censoring.status=junkk$ic) # compute soft labels proby <- pamr.surv.to.class2(dd$survival.time, dd$censoring.status, n.class=a3$ngroup.survival)$prob # make class predictions for test data yhat <- pamr.predict(a3,dd$x, threshold=1.0) # compute test errors pamr.test.errors.surv.compute(proby, yhat)
A function that computes a nearest shrunken centroid for gene expression (microarray) data
pamr.train( data, gene.subset = NULL, sample.subset = NULL, threshold = NULL, n.threshold = 30, scale.sd = TRUE, threshold.scale = NULL, se.scale = NULL, offset.percent = 50, hetero = NULL, prior = NULL, remove.zeros = TRUE, sign.contrast = "both", ngroup.survival = 2 )
pamr.train( data, gene.subset = NULL, sample.subset = NULL, threshold = NULL, n.threshold = 30, scale.sd = TRUE, threshold.scale = NULL, se.scale = NULL, offset.percent = 50, hetero = NULL, prior = NULL, remove.zeros = TRUE, sign.contrast = "both", ngroup.survival = 2 )
data |
The input data. A list with components: x- an expression genes in the rows, samples in the columns), and y- a vector of the class labels for each sample. Optional components- genenames, a vector of gene names, and geneid- a vector of gene identifiers. |
gene.subset |
Subset of genes to be used. Can be either a logical vector of length total number of genes, or a list of integers of the row numbers of the genes to be used |
sample.subset |
Subset of samples to be used. Can be either a logical vector of length total number of samples, or a list of integers of the column numbers of the samples to be used. |
threshold |
A vector of threshold values for the centroid shrinkage.Default is a set of 30 values chosen by the software |
n.threshold |
Number of threshold values desired (default 30) |
scale.sd |
Scale each threshold by the wthin class standard deviations? Default: true |
threshold.scale |
Additional scaling factors to be applied to the thresholds. Vector of length equal to the number of classes. Default- a vectors of ones. |
se.scale |
Vector of scaling factors for the within class standard errors. Default is sqrt(1/n.class-1/n), where n is the overall sample size and n.class is the sample sizes in each class. This default adjusts for different class sizes. |
offset.percent |
Fudge factor added to the denominator of each t-statistic, expressed as a percentile of the gene standard deviation values. This is a small positive quantity to penalize genes with expression values near zero, which can result in very large ratios. This factor is expecially impotant for Affy data. Default is the median of the standard deviations of each gene. |
hetero |
Should a heterogeneity transformation be done? If yes, hetero must be set to one of the class labels (see Details below). Default is no (hetero=NULL) |
prior |
Vector of length the number of classes, representing prior probabilities for each of the classes. The prior is used in Bayes rule for making class prediction. Default is NULL, and prior is then taken to be n.class/n, where n is the overall sample size and n.class is the sample sizes in each class. |
remove.zeros |
Remove threshold values yielding zero genes? Default TRUE |
sign.contrast |
Directions of allowed deviations of class-wise average
gene expression from the overall average gene expression. Default is
|
ngroup.survival |
Number of groups formed for survival data. Default 2 |
pamr.train
fits a nearest shrunken centroid classifier to gene
expression data. Details may be found in the PNAS paper referenced below.
One feature not described there is "heterogeneity analysis". Suppose there
are two classes labelled "A" and "B". CLass "A" is considered a normal
class, and "B" an abnormal class. Setting hetero="A" transforms expression
values to
where the mean is taken only over
samples in class "A". The transformed feature values are then used in Pam.
This is useful when the abnormal class "B" is heterogeneous, i.e. a given
gene might have higher expresion than normal for some class "B" samples, and
lower for others. With more than 2 classes, each class is centered on the
class specified by hetero.
A list with components
y |
The outcome classes. |
yhat |
A matrix of predicted classes, each column representing the results from one threshold. |
.
prob |
A array of predicted class probabilities. of dimension n by nclass by n.threshold. n is the number samples, nclass is the number of classes, n.threshold is the number of thresholds tried |
centroids |
A matrix of (unshrunken) class centroids, n by nclass |
hetero |
Value of hetero used in call to pamr.train |
norm.cent |
Centroid of "normal" group, if hetero was specified |
centroid.overall |
A vector containing the (unshrunken) overall centroid (all classes together) |
sd |
A vector of the standard deviations for each gene |
threshold |
A vector of the threshold tried in the shrinkage |
nonzero |
A vector of the number of genes that survived the thresholding, for each threshold value tried |
threshold.scale |
A vector of threshold scale factors that were used |
se.scale |
A vector of standard error scale factors that were used |
call |
The calling sequence used |
prior |
The prior probabilities used |
errors |
The number of trainin errors for each threshold value |
Trevor Hastie,Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu Diagnosis of multiple cancer types by shrunken centroids of gene expression PNAS 99: 6567-6572. Available at www.pnas.org
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y)) #train classifier results<- pamr.train(mydata) # train classifier on all data except class 4 results2 <- pamr.train(mydata,sample.subset=(mydata$y!=4)) # train classifier on only the first 500 genes results3 <- pamr.train(mydata,gene.subset=1:500)
#generate some data suppressWarnings(RNGversion("3.5.0")) set.seed(120) x <- matrix(rnorm(1000*20),ncol=20) y <- sample(c(1:4),size=20,replace=TRUE) mydata <- list(x=x,y=factor(y)) #train classifier results<- pamr.train(mydata) # train classifier on all data except class 4 results2 <- pamr.train(mydata,sample.subset=(mydata$y!=4)) # train classifier on only the first 500 genes results3 <- pamr.train(mydata,gene.subset=1:500)