Algorithms for Regularized Linear Discriminant Analysis
Jan Kalina (1) and Jurjen Duintjer Tebbens (2,3)
(1) Dept. of Medical Informatics and Biostatistics, Institute of Computer Science AS CR, Prague, Czech Republic
(2) Dept. of Computational Methods, Institute of Computer Science AS CR, Prague, Czech Republic
(3) Dept. of Biophysics and Physical Chemistry, Faculty of Pharmacy in Hradec Králové, Charles University in Prague, Hradec Králové, Czech Republic
Keywords:
Classification Analysis, Regularization, High-dimensional Data, Matrix Decomposition, Computational
Aspects.
Abstract:
This paper is focused on regularized versions of classification analysis and their computation for high-dimensional data. A variety of regularized classification methods have been proposed and we critically discuss their computational aspects. We formulate several new algorithms for regularized linear discriminant analysis, which exploit a regularized covariance matrix estimator shrunken towards a regular target matrix. Numerical linear algebra considerations are used to propose tailor-made algorithms for specific choices of the target matrix. Further, we propose a new classification method based on L_2-regularization of both the group means and the pooled covariance matrix, and accompany it by an efficient algorithm for its computation.
1 INTRODUCTION
Classification analysis methods aim to construct (learn) a decision rule from a training data set which is able to automatically assign new data to one of K groups. Linear discriminant analysis (LDA) is a standard statistical classification method. Throughout the paper, we consider n observations of p variables, observed in K different samples (groups) with p > K >= 2,

  X_{11}, \dots, X_{1 n_1}, \dots, X_{K1}, \dots, X_{K n_K},   (1)

where n = \sum_{k=1}^{K} n_k. LDA assumes the data in each group to come from a Gaussian distribution. The covariance matrix Σ is assumed to be the same across groups and its estimator will be denoted by S.
Various tasks in bioinformatics deal with high-dimensional data, i.e. data with the number of observed variables p exceeding the number of observations. Analyzing data in this so-called large-p/small-n problem is especially important e.g. in gene expression studies. Unfortunately, LDA in its standard form is infeasible for n < p, because the matrix S of size p × p is singular and computing its inverse must be replaced by an appropriate alternative. Available approaches in this context are based e.g. on pseudoinverse matrices, which are however unstable due to a small n (Guo et al., 2007).
Regularized versions of LDA for n ≪ p were proposed (Guo et al., 2007) mainly for applications in bioinformatics. They are based on a regularized estimator of the covariance matrix which is guaranteed to be regular and positive definite even for n ≪ p. Fast computation and numerical stability are key requirements on reliable statistical and data mining procedures (Kogan, 2007; Duintjer Tebbens and Schlesinger, 2007) and remain an important issue also for regularized classification methods (Hastie et al., 2008; Pourahmadi, 2013). Let us now describe the most important approaches and critically discuss their computation.
This paper studies efficient algorithms for computing various regularized versions of LDA. Section 2 formulates several algorithms for regularized LDA, which exploit a regularized covariance matrix estimator shrunken towards a regular target matrix. The computational efficiency of the algorithms is inspected using arguments of numerical linear algebra. For a specific choice of the target matrix, we are able to propose a tailor-made algorithm with a lower computational cost compared to algorithms formulated for a general context. Besides, we propose a new version of LDA based on regularization in the sense of the L_2 norm and accompany it by an efficient algorithm for its computation in Section 3. The classification performance of the methods
is illustrated on real data in Section 4. Finally, Sec-
tion 5 concludes the paper.
2 COMPUTATION OF REGULARIZED LINEAR DISCRIMINANT ANALYSIS
This section is devoted to proposing and comparing new algorithms for a habitually used version of regularized LDA. We use suitable matrix decompositions to propose efficient algorithms either for a general choice of the target matrix T or for its specific choices. To the best of our knowledge, tailor-made algorithms for a specific T have not been described. We compare the new algorithms in terms of their computational costs as well as numerical stability.
We will describe one of the habitually used regularized versions of LDA. It will be denoted as LDA* to avoid confusion, because the concept of regularized discriminant analysis encompasses several different methods (Pourahmadi, 2013; Kalina, 2014). A given target matrix T will be used, which must be a symmetric positive definite (and thus regular) matrix of size p × p. Its most common choices include the identity matrix I_p or a diagonal (non-identity) matrix; other target matrices have been considered by Schäfer and Strimmer (2005).
Let us denote the mean of the observed values in the k-th group (k = 1, ..., K) by X̄_k. LDA* assigns a new observation Z = (Z_1, ..., Z_p)^T to group k if l_k > l_j for every j ≠ k, where the regularized linear discriminant score for the k-th group (k = 1, ..., K) has the form

  l_k = \bar{X}_k^T (S^*)^{-1} Z - \frac{1}{2} \bar{X}_k^T (S^*)^{-1} \bar{X}_k + \log p_k,   (2)

where p_k is the prior probability of observing an observation from the k-th group and

  S^* = \lambda S + (1 - \lambda) T   (3)

for λ ∈ (0,1] denotes a regularized estimator of the covariance matrix across groups. We do not treat the situation with l_k = l_{k'} for k' ≠ k separately, because it occurs with zero probability for data coming from a continuous distribution. Equivalently, LDA* assigns a new observation Z to group k if

  (\bar{X}_k - Z)^T (S^*)^{-1} (\bar{X}_k - Z) = \min_{j=1,\dots,K} (\bar{X}_j - Z)^T (S^*)^{-1} (\bar{X}_j - Z).   (4)
Various versions of the target matrix T have been suggested by Schäfer and Strimmer (2005). We point out that the formula (3) can be justified by Bayesian reasoning. Let us assume normally distributed data with a covariance matrix Σ, while Σ^{-1} is assumed to be a random variable following the Wishart distribution W_p(T/γ, k) for some γ > 0 and an integer k. Following Haff (1980), the mean of the posterior distribution of Σ is equal to (3) up to a normalizing constant, which does not influence the classification rule.
First, the standard approach for computing LDA* may be improved by employing the eigendecomposition of S^* for a fixed λ. A suitable value of λ is found by cross-validation in the form of a grid search over all possible values of λ ∈ (0, 1].
Algorithm 1. LDA* for the general regularization (3) based on eigendecomposition.
Step 1. Compute the matrix

  A = [\bar{X}_1 - Z, \dots, \bar{X}_K - Z]   (5)

of size p × K whose k-th column is X̄_k − Z.
Step 2. Compute S^* according to (3) with a fixed λ ∈ (0,1].
Step 3. Compute the eigendecomposition of S^* as

  S^* = Q^* D^* (Q^*)^T.   (6)

Step 4. Compute the matrix

  B = (D^*)^{-1/2} (Q^*)^T A   (7)

and assign Z to group k if the column of B with the smallest Euclidean norm is the k-th column.
Step 5. Repeat steps 2 to 4 with different values of λ and find the classification rule with the best classification performance.
The group assignment (4) is done by using

  (\bar{X}_j - Z)^T (S^*)^{-1} (\bar{X}_j - Z) = (\bar{X}_j - Z)^T Q^* (D^*)^{-1} (Q^*)^T (\bar{X}_j - Z) = \| (D^*)^{-1/2} (Q^*)^T (\bar{X}_j - Z) \|^2.   (8)
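To make the linear-algebra steps concrete, the following minimal sketch implements Algorithm 1 with NumPy under stated assumptions: Xbar is a hypothetical K × p array of group means, S the p × p pooled covariance estimator, T the target matrix, Z the new observation and lam a fixed value of λ. It illustrates the eigendecomposition route; it is not a reference implementation.

import numpy as np

def lda_star_eig(Xbar, S, T, Z, lam):
    """Sketch of Algorithm 1: classify Z via the eigendecomposition of S*."""
    A = (Xbar - Z).T                      # p x K matrix of differences, cf. (5)
    S_star = lam * S + (1.0 - lam) * T    # regularized estimator (3)
    D, Q = np.linalg.eigh(S_star)         # eigendecomposition S* = Q* D* (Q*)^T, cf. (6)
    B = (Q.T @ A) / np.sqrt(D)[:, None]   # B = (D*)^{-1/2} (Q*)^T A, cf. (7)
    # by (8), the squared column norms of B are the regularized Mahalanobis
    # distances, so the closest group is the column with the smallest norm
    return np.argmin(np.linalg.norm(B, axis=0))   # 0-based group index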
The costs of the algorithm can be reduced by replacing the eigendecomposition of S^* with its Cholesky decomposition

  S^* = L^* (L^*)^T,   (9)

where L^* is a nonsingular lower triangular matrix.
Algorithm 2. LDA* for the general regularization (3) based on Cholesky decomposition.
Step 1. Compute the matrix

  A = [\bar{X}_1 - Z, \dots, \bar{X}_K - Z]   (10)

of size p × K whose k-th column is X̄_k − Z.
Step 2. Compute S^* according to (3) with a fixed λ ∈ (0,1].
AlgorithmsforRegularizedLinearDiscriminantAnalysis
129
Step 3. Compute the Cholesky factor L^* of S^*.
Step 4. Compute the matrix

  B = (L^*)^{-1} A   (11)

and assign Z to group k if the column of B with the smallest Euclidean norm is the k-th column.
Step 5. Repeat steps 2 to 4 with different values of λ and find the classification rule with the best classification performance.
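A corresponding sketch of the Cholesky route (Algorithm 2) follows, with the same hypothetical inputs as above; the reading of (11) as a forward solve with the factor L* is our assumption, and the triangular solver avoids forming any explicit inverse.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def lda_star_chol(Xbar, S, T, Z, lam):
    """Sketch of Algorithm 2: classify Z via the Cholesky factor of S*."""
    A = (Xbar - Z).T                        # p x K, columns Xbar_k - Z, cf. (10)
    S_star = lam * S + (1.0 - lam) * T      # regularized estimator (3)
    L = cholesky(S_star, lower=True)        # S* = L* (L*)^T, cf. (9)
    B = solve_triangular(L, A, lower=True)  # B = (L*)^{-1} A, cf. (11)
    # column norms of B equal the regularized Mahalanobis distances
    return np.argmin(np.linalg.norm(B, axis=0))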
For specific target matrices, we can further reduce computational costs by using the following algorithm for LDA*. The pooled estimator S can be written in the form S = Y^T Y, where

  Y = [X_{11} - \bar{X}, \dots, X_{1 n_1} - \bar{X}, \dots, X_{K1} - \bar{X}, \dots, X_{K n_K} - \bar{X}]^T   (12)

is of size n × p. Then, using the singular value decomposition (SVD) of Y in the form

  Y = P \Sigma Q^T,   (13)

we can express the eigendecomposition of S as

  S = Y^T Y = (P \Sigma Q^T)^T P \Sigma Q^T = Q \Sigma^2 Q^T.   (14)

The costs will be about 4 · n p^2 floating point operations, thus with p ≫ n the gain is considerable.
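The identity (14) can be realized in a few lines; the sketch below assumes Y is the n × p matrix from (12) with n ≪ p and returns only the n nonzero eigenvalues of S together with the corresponding eigenvectors (the remaining p − n eigenvalues are zero).

import numpy as np

def pooled_eig_via_svd(Y):
    """Nonzero eigenpairs of S = Y^T Y obtained from the thin SVD of Y, cf. (13)-(14)."""
    _, sigma, QT = np.linalg.svd(Y, full_matrices=False)  # Y = P Sigma Q^T
    return sigma**2, QT.T   # eigenvalues sigma_i^2 and eigenvectors (columns of Q)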
Moreover, if

  S^* = \lambda S + (1 - \lambda) I_p, \quad \lambda \in (0,1],   (15)

we immediately obtain the needed eigendecomposition of S^* as

  S^* = \lambda S + (1 - \lambda) I_p = Q \left( \lambda \Sigma^2 + (1 - \lambda) I_p \right) Q^T.   (16)

The SVD can be computed in a backward stable way with all singular values accurate up to the machine precision level (Barlow et al., 2005). For the special case (15), which is commonly denoted as Tikhonov or ridge regularization of S, a more efficient computation can be performed as follows.
Algorithm 3. LDA* for the ridge regularization (15).
Step 1. Compute the matrix

  A = [\bar{X}_1 - Z, \dots, \bar{X}_K - Z]   (17)

of size p × K whose k-th column is X̄_k − Z, and compute the matrix Y in (12).
Step 2. Compute the singular value decomposition of Y as

  Y = P \Sigma Q^T,   (18)

with singular values {σ_1, ..., σ_n}, and complement these singular values with p − n zero values σ_{n+1} = ··· = σ_p = 0.
Step 3. For a fixed λ ∈ (0,1], compute

  D^* = \mathrm{diag}\{\lambda \sigma_1^2 + (1 - \lambda), \dots, \lambda \sigma_p^2 + (1 - \lambda)\}.   (19)

Step 4. Compute the matrix

  B = (D^*)^{-1/2} Q^T A   (20)

and assign Z to group k if the column of B with the smallest Euclidean norm is the k-th column.
Step 5. Repeat steps 3 and 4 with different values of λ and find the classification rule with the best classification performance.
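One way to realize Algorithm 3 in code is sketched below, assuming the same hypothetical inputs as before plus the matrix Y from (12) and λ < 1. It avoids forming the full p × p eigenvector matrix: the p − n eigenvalues of S* belonging to the orthogonal complement of the range of Q all equal 1 − λ, so that part of the distance can be computed from a residual.

import numpy as np

def lda_star_ridge(Xbar, Y, Z, lam):
    """Sketch of Algorithm 3: ridge-regularized LDA* via the thin SVD of Y (lam < 1)."""
    A = (Xbar - Z).T                                      # p x K, cf. (17)
    _, sigma, QT = np.linalg.svd(Y, full_matrices=False)  # thin SVD, cf. (18)
    d = lam * sigma**2 + (1.0 - lam)                      # leading eigenvalues of S*, cf. (19)
    C = QT @ A                                            # coordinates of A in range(Q)
    # squared column norms of B = (D*)^{-1/2} Q^T A, cf. (20), split into the
    # range of Q and its orthogonal complement (eigenvalue 1 - lam there)
    in_range = np.sum(C**2 / d[:, None], axis=0)
    residual = np.sum((A - QT.T @ C)**2, axis=0) / (1.0 - lam)
    return np.argmin(in_range + residual)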
Eigenvalues of S^* evaluated in (19) can be interpreted as regularized eigenvalues. They are however different from shrinkage eigenvalues (Pourahmadi, 2013), which are obtained by regularizing S by applying a penalization criterion directly on the eigenvalues.
3 L_2-REGULARIZED LINEAR DISCRIMINANT ANALYSIS
Disadvantages of SCRDA (shrunken centroids regularized discriminant analysis; Guo et al., 2007) include its computational intensity as well as an inconsistent approach to regularization: the means are modified by an L_1-norm regularization, while the covariance matrix is regularized in the sense of the L_2-norm. Examples of other regularized LDA versions include the Prediction Analysis of Microarrays (PAM) (Tibshirani et al., 2003), which can be described as a diagonalized LDA (DLDA) with means regularized in the L_1-norm.
As an alternative, this section proposes a new regularized version of LDA denoted as L_2-LDA. As a unique feature, the means in each group as well as the pooled covariance matrix are regularized in the same way, i.e. in the L_2-norm. We propose an efficient algorithm for the computation of the method.
The classification rule of L_2-LDA assigns a new observation Z to the k-th group if l_k > l_j for every j ≠ k, where

  l_k = (\bar{X}'_k)^T (S^*)^{-1} Z - \frac{1}{2} (\bar{X}'_k)^T (S^*)^{-1} \bar{X}'_k + \log p_k   (21)
and X̄'_k denotes the mean of the k-th group shrunken towards the overall mean computed across groups. The method can be interpreted as based on an L_2-regularized Mahalanobis distance. As another contrast with the habitually used algorithm of SCRDA, we will estimate the parameter λ in a straightforward way, using an asymptotically optimal value minimizing the mean square error (Schäfer and Strimmer,
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
130
2005). To avoid confusion, the asymptotically optimal value of λ will be denoted by λ^* and the corresponding regularized covariance matrix by

  S^* = \lambda^* S + (1 - \lambda^*) T.   (22)
Algorithm 4. L_2-LDA.
Step 1. Compute λ^* as

  \lambda^* = \frac{2 \sum_{i=2}^{p} \sum_{j=1}^{i-1} \widehat{\mathrm{var}}(S_{ij})}{2 \sum_{i=2}^{p} \sum_{j=1}^{i-1} S_{ij}^2 + \sum_{i=1}^{p} (S_{ii} - 1)^2},   (23)

where \widehat{\mathrm{var}}(S_{ij}) is the maximum likelihood estimator of the variance of the values S_{ij} for a fixed i and j.
Step 2. Compute the eigendecomposition of S^* as

  S^* = Q^* D^* (Q^*)^T.   (24)

Step 3. For a fixed δ ∈ [0,1], compute

  \bar{X}'_k = \delta \bar{X}_k + (1 - \delta) \bar{X}, \quad k = 1, \dots, K.   (25)

Step 4. Assign Z to group k if

  \| (D^*)^{-1/2} (Q^*)^T (\bar{X}'_k - Z) \| = \min_{j=1,\dots,K} \| (D^*)^{-1/2} (Q^*)^T (\bar{X}'_j - Z) \|.   (26)

Step 5. Repeat steps 3 and 4 for various δ and find the optimal classification rule yielding the best classification performance.
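A minimal sketch of Algorithm 4 follows, split into the estimate of λ* and the classification step. The inputs are hypothetical: Xc is the n × p matrix of observations centered as in (12), Xbar the K × p group means, Xbar_all the overall mean, and delta a fixed shrinkage weight. The estimator of var(S_ij) below is a simple moment-based stand-in; the exact estimator is the one of Schäfer and Strimmer (2005).

import numpy as np

def lambda_star(Xc):
    """Assumed realization of (23) for the target T = I_p."""
    n = Xc.shape[0]
    S = Xc.T @ Xc / n                        # covariance estimator built from Xc
    M2 = (Xc**2).T @ (Xc**2) / n             # mean of the squared products Xc_ki * Xc_kj
    var_S = (M2 - S**2) / n                  # assumed ML-style estimate of var(S_ij)
    off = ~np.eye(S.shape[0], dtype=bool)    # off-diagonal mask
    num = var_S[off].sum()                   # equals 2 * sum_{i<j} var_hat(S_ij)
    den = (S[off]**2).sum() + ((np.diag(S) - 1.0)**2).sum()
    return num / den

def l2_lda(Xbar, Xbar_all, S_star, Z, delta):
    """Steps 2-4 of Algorithm 4 for a new observation Z."""
    D, Q = np.linalg.eigh(S_star)                        # cf. (24)
    means = delta * Xbar + (1.0 - delta) * Xbar_all      # shrunken means, cf. (25)
    B = (Q.T @ (means - Z).T) / np.sqrt(D)[:, None]
    return np.argmin(np.linalg.norm(B, axis=0))          # rule (26)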
The main computational costs are in step 2; the eigendecomposition costs about 9 · p^3 floating point operations. Note that we need not (and should never) compute the inverse of S^*, thus avoiding the additional computations of the Mahalanobis distance, which are expensive (of order p^3) and numerically rather unstable.
Algorithm 4 is formulated for a general target matrix T. For a specific choice of T, a computationally cheaper method can be obtained in an analogous way as Algorithms 2 and 3 were obtained from the general Algorithm 1. In particular, the costs of the Cholesky decomposition (9) are about p^3/3 floating point operations. On the other hand, the Cholesky decomposition will suffer from instability when S^* is not positive definite.
Analogous reasoning can be applied to obtain an algorithm for an L_2-regularized version of quadratic discriminant analysis (QDA), which is another standard classification method derived for data following a Gaussian distribution, but without the assumption of a common covariance matrix. A regularized covariance matrix estimator S_k^* in the k-th group will be considered with the optimal value of the regularization parameter, again yielding the best classification performance. The regularized quadratic discriminant score for the k-th group has the form

  q_k = \bar{X}_k^T (S_k^*)^{-1} Z - \frac{1}{2} \bar{X}_k^T (S_k^*)^{-1} \bar{X}_k - \frac{1}{2} Z^T (S_k^*)^{-1} Z - \frac{1}{2} \log |S_k^*| + \log p_k,   (27)

where | · | denotes the determinant of a matrix.
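For completeness, a sketch of evaluating the regularized quadratic score (27) for a single group is given below; S_k_star and p_k are hypothetical inputs, and a Cholesky factorization supplies both the quadratic forms and log|S_k*| without any explicit inverse.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def q_score(Xbar_k, S_k_star, Z, p_k):
    """Regularized quadratic discriminant score (27) for the k-th group."""
    c = cho_factor(S_k_star, lower=True)
    SinvZ = cho_solve(c, Z)                          # (S_k*)^{-1} Z
    SinvXk = cho_solve(c, Xbar_k)                    # (S_k*)^{-1} Xbar_k
    logdet = 2.0 * np.sum(np.log(np.diag(c[0])))     # log|S_k*| from the Cholesky factor
    return (Xbar_k @ SinvZ - 0.5 * Xbar_k @ SinvXk
            - 0.5 * Z @ SinvZ - 0.5 * logdet + np.log(p_k))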
4 EXAMPLES
We present two examples on real molecular genetic data sets in order to illustrate the behavior of the newly proposed L_2-LDA method and to compare its performance with that of classical classification procedures.
Example 1 contains data from our own cardiovascular genetic study on 24 patients having a cerebrovascular stroke and 24 control persons (Kalina and Zvárová, 2013). There were p = 38 590 gene transcripts measured, which correspond to the whole genome.
In Example 2, a prostate cancer metabolomic data set (Sreekumar et al., 2009) is analyzed, which contains p = 518 metabolites measured over two groups of patients, namely those with a benign prostate cancer (16 patients) and those with other cancer types (26 patients). The task in both examples is to learn a classification rule that discriminates between the two classes of individuals.
In both examples, we computed the classification methods described in this paper using the algorithms of Sections 2 and 3. For comparison, we also computed other available classification methods, including support vector machines (SVM), a classification tree, Kohonen's self-organizing map, and a multilayer perceptron with 2 hidden layers. The regularized versions of LDA were computed with the most common choice T = I_p as well as with the alternative choice

  S^* = \lambda S + (1 - \lambda) s I_p, \quad s = \sum_{i=1}^{p} S_{ii} / p,   (28)

for λ ∈ (0, 1]. We used the default settings of the user-submitted packages accompanying the free software R, which are listed in Table 1. The classification performance is measured by means of Youden's index, which is defined as

  \mathrm{sensitivity} + \mathrm{specificity} - 1.   (29)
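As an illustration of the evaluation criterion (29), a small helper computing Youden's index from binary labels might look as follows; it is independent of the R packages listed in Table 1.

import numpy as np

def youden_index(y_true, y_pred):
    """Youden's index (29) = sensitivity + specificity - 1 for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    sensitivity = (y_pred & y_true).sum() / y_true.sum()
    specificity = (~y_pred & ~y_true).sum() / (~y_true).sum()
    return sensitivity + specificity - 1.0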
The results obtained on raw data as well as after a dimensionality reduction reveal that the regularized versions of LDA perform quite similarly.
AlgorithmsforRegularizedLinearDiscriminantAnalysis
131
Table 1: Results of Example 1 and Example 2. LDA* was computed using Algorithm 3 for the choice (15) and Algorithm 2 for (28). L_2-LDA was computed using Algorithm 4. PCA uses 20 principal components.

                                                        Youden's index
Method                   S^*    R package   Function    Example 1    Example 2
SVM                      -      e1071       svm         1.00         1.00
Classification tree      -      tree        tree        0.94         0.97
Self-organizing map      -      kohonen     som         0.88         0.93
Multilayer perceptron    -      nnet        nnet        Infeasible   Infeasible
LDA                      -      MASS        lda         Infeasible   Infeasible
SCRDA                    (15)   rda         rda         1.00         1.00
LDA*                     (15)   -           -           1.00         1.00
LDA*                     (28)   -           -           1.00         1.00
L_2-LDA                  (15)   -           -           1.00         1.00
L_2-LDA                  (28)   -           -           1.00         1.00
PCA ⇒ LDA                -      -           -           0.54         0.90
PCA ⇒ SCRDA              (15)   -           -           0.71         0.92
PCA ⇒ LDA*               (15)   -           -           0.63         0.81
PCA ⇒ LDA*               (28)   -           -           0.63         0.81
PCA ⇒ L_2-LDA            (15)   -           -           0.71         0.92
PCA ⇒ L_2-LDA            (28)   -           -           0.71         0.92
The newly proposed method L_2-LDA, computed with an efficient algorithm, seems to perform comparably with the available regularized methods, whose computation is less efficient. Besides, the choice of the target matrix T does not seem to play an important role. Some of the standard classification methods are infeasible because of the dimensionality (n ≪ p).
Further, we investigated the effect of dimensionality reduction on the classification performance. Principal component analysis (PCA) is performed and the consequent classification is applied on 20 principal components. To explain the notation, the approach denoted as PCA ⇒ L_2-LDA, for example, corresponds to performing L_2-LDA (using Algorithm 4) on 20 principal components of the original data.
The method PCA ⇒ L_2-LDA yields improved results compared to its standard counterpart (PCA ⇒ LDA) and is not outperformed by any other method. It is remarkable that the combination PCA ⇒ LDA* does not improve the results compared to PCA ⇒ LDA. The first three principal components seem rather arbitrary and they explain only 6 %, 6 %, and 5 % of the variability in the data, respectively. Besides, we did not find any clear interpretation of the principal components, and there seems to be no small group of genes responsible for a large portion of the variability of the data.
5 CONCLUSIONS
For high-dimensional continuous data, the main obstacle of the traditional Mahalanobis distance is the singularity of the empirical covariance matrix. As a solution, various regularized versions of LDA have been proposed, which are commonly used to learn a classification rule from high-dimensional data. This paper presents our view that the methodology in its standard form represents a set of ad hoc procedures rather than a coherent approach, moreover without efficient algorithms available for the computation. Besides, we point out some open problems concerning regularized versions of LDA, e.g. specific algorithms tailor-made for a specific choice of the target matrix.
Several new algorithms are proposed for regularized LDA in Section 2. We propose the L_2-regularized version of LDA, advocating our position that the mean and covariance matrix of multivariate data can be estimated by means of the same regularization principle. The new regularized classification method L_2-LDA is proposed in Section 3, accompanied by an efficient algorithm, and its computational costs are discussed. The regularization can be interpreted as a tailor-made correction for a small sample size. For larger sample sizes still satisfying n < p, the effect of the regularization becomes less important.
L_2-LDA can be interpreted as a shrinkage estimator of the covariance matrix, in the light of Stein's result on estimating the mean of multivariate normal data (Stein, 1956; Hastie et al., 2008). It is possible to interpret the method as an approach based on a regularized version of the Mahalanobis distance. The method is reliable under the implicit assumption that the variability is not substantially different across variables. The optimal regularization parameters seem to be reasonable in the classification analysis context. Another possibility is to regularize the within-group covariance matrix instead of regularizing S, which is however computationally more
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
132
intensive. An analysis of two real data sets reveals
its classification performance to be comparable to
available regularized classification methods for high-
dimensional data.
In general, some regularized statistical methods for the analysis of high-dimensional data have been empirically observed to possess reasonable robustness properties with respect to outlying measurements in the data. To give an example, regularized means have been observed to yield a certain local robustness against small departures in the observed data (Tibshirani et al., 2003). Regularization itself cannot ensure robustness against serious outliers (Filzmoser and Todorov, 2011) for continuous data. In the words of robust statistics, regularization does not imply robustness in terms of the breakdown point, and regularized LDA cannot replace robust classification procedures with a high breakdown point (Kalina, 2012). It remains an open problem to systematically investigate the relationship between regularization and statistical robustness for continuous data. Another warning should be given that there is no reason to suppose that the optimal procedure for the regularized model will perform well away from that model (Davies, 2014).
Alternative approaches could be formulated by means of regularization requiring a certain level of sparsity (Chen et al., 2012). Moreover, L_2-LDA can be derived in an alternative way as a Bayesian estimator or as the optimal method in the sense of robust optimization (Xanthopoulos et al., 2013).
As future research, we plan to investigate suitable choices of the target matrix T and to extend the regularized Mahalanobis distance to the context of cluster analysis. From the theoretical point of view, the robustness of LDA regularized in the L_1-norm has not been inspected, nor have regularized versions of the highly robust MWCD estimator (Kalina, 2012). We plan to apply and compare regularized versions of LDA in pattern recognition problems in the analysis of 3D neuroimages of spontaneous brain activity. There, we plan to exploit the new L_2-LDA without the usual sparseness assumption, allowing T to be chosen so as to model the high correlation of neighboring voxels.
ACKNOWLEDGEMENTS
The work was financially supported by the Neuron
Fund for Support of Science. The work of J. Kalina
was supported by the grant GA13-17187S of the
Czech Science Foundation. The work of J. Duintjer
Tebbens was supported by the grant GA13-06684S of
the Czech Science Foundation.
REFERENCES
Barlow, J., Bosner, N., and Drmac, Z. (2005). A new stable
bidiagonal reduction algorithm. Linear Algebra and
its Applications, 397:35–84.
Chen, X., Kim, Y., and Wang, Z. (2012). Efficient mini-
max estimation of a class of high-dimensional sparse
precision matrices. IEEE Transactions on Signal Pro-
cessing, 60:2899–2912.
Davies, P. (2014). Data Analysis and Approximate Mod-
els: Model Choice, Location-Scale, Analysis of Vari-
ance, Nonparametric Regression and Image Analysis.
Chapman & Hall/CRC, Boca Raton.
Duintjer Tebbens, J. and Schlesinger, P. (2007). Improving
implementation of linear discriminant analysis for the
high dimension/small sample size problem. Compu-
tational Statistics & Data Analysis, 52:423–437.
Filzmoser, P. and Todorov, V. (2011). Review of robust multivariate statistical methods in high dimension. Analytica Chimica Acta, 705:2–14.
Guo, Y., Hastie, T., and Tibshirani, R. (2007). Regularized
discriminant analysis and its application in microar-
rays. Biostatistics, 8:86–100.
Haff, L. (1980). Empirical Bayes estimation of the multivariate normal covariance matrix. Annals of Statistics, 8:586–597.
Hastie, T., Tibshirani, R., and Friedman, J. (2008). The
elements of statistical learning. Springer, New York,
2nd edition.
Kalina, J. (2012). Highly robust statistical methods in med-
ical image analysis. Biocybernetics and Biomedical
Engineering, 32(2):3–16.
Kalina, J. (2014). Classification analysis methods for
high-dimensional genetic data. Biocybernetics and
Biomedical Engineering, 34:10–18.
Kalina, J. and Zvárová, J. (2013). Decision support systems in the process of improving patient safety. In E-health Technologies and Improving Patient Safety: Exploring Organizational Factors, pages 71–83. IGI Global, Hershey.
Kogan, J. (2007). Introduction to clustering large and high-
dimensional data. Cambridge University Press, Cam-
bridge.
Pourahmadi, M. (2013). High-dimensional covariance es-
timation. Wiley, New York.
Schäfer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 32:1–30.
Sreekumar, A. et al. (2009). Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature, 457:910–914.
Stein, C. (1956). Inadmissibility of the usual estimator for
the mean of a multivariate normal distribution. Pro-
ceedings of the Third Berkeley Symposium on Mathe-
matical Statistics and Probability, 1:197–206.
Tibshirani, R., Hastie, T., and Narasimhan, B. (2003). Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science, 18:104–117.
Xanthopoulos, P., Pardalos, P., and Trafalis, T. (2013). Ro-
bust data mining. Springer, New York.
AlgorithmsforRegularizedLinearDiscriminantAnalysis
133