ent is chosen due to its advantages over first-order optimization schemes such as steepest descent, and for its ease of implementation; a minimal sketch contrasting the two updates is given at the end of this section. We test the OCG-LDA algorithm on various UCI datasets to demonstrate its classification performance, average run time, and average iteration count.
OCG-LDA clearly outperforms the L2 LDA and SD-LDA, and performs comparably to the L1 versions of the least-squares algorithms. However, the proposed methodology is simple and easy to implement, and is therefore a good alternative to other algorithms for building a robust classification model. As the No Free Lunch theorem states, no single classification algorithm can outperform all others when performance is analyzed over many classification datasets.
In conclusion, OCG-LDA can be used as a basic classifier unit in a multi-stage classification scheme.
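For illustration only, the following minimal sketch contrasts a steepest descent step with a Polak-Ribiere conjugate gradient loop for a generic smooth objective. The gradient function grad_J, the fixed step size lr, and the stopping tolerance are placeholder assumptions; this is not the paper's OCG-LDA implementation.

import numpy as np

def steepest_descent_step(w, grad_J, lr=1e-2):
    # First-order update: always follows the negative gradient.
    return w - lr * grad_J(w)

def conjugate_gradient(w, grad_J, n_iter=100, lr=1e-2, tol=1e-6):
    g = grad_J(w)
    d = -g                          # initial search direction
    for _ in range(n_iter):
        w = w + lr * d              # a line search would replace lr here
        g_new = grad_J(w)
        if np.linalg.norm(g_new) < tol:
            break
        # Polak-Ribiere coefficient: mixing the previous direction into
        # the new one is what lets CG converge faster than steepest descent.
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        g = g_new
    return w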
7 FUTURE WORK
The OCG-LDA methodology is an evident advancement in the L1 family of LDA subspace algorithms.
As a future direction, a multiple optimal learning factor scheme based on the Gauss-Newton approximation (Malalur and Manry, 2010) can be investigated; a sketch of the underlying idea follows below. Recently, Cai et al. (2011) proposed an efficient partial Hessian calculation that does not involve matrix inversion and has been successfully applied to radial basis function neural networks. Therefore, a study can be conducted on second-order algorithms that incorporate a regularization parameter.
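As a hedged illustration of the Gauss-Newton idea above, the sketch below computes a single optimal learning factor along a given search direction for a least-squares objective J(w) = 0.5 ||r(w)||^2. The names r, jac_r, and optimal_learning_factor are hypothetical placeholders; the MOLF scheme of Malalur and Manry (2010) generalizes this scalar case to one factor per hidden unit.

import numpy as np

def optimal_learning_factor(w, d, r, jac_r):
    # Gauss-Newton model of J(w + z*d) ~ J(w) + z*(g.d) + 0.5*z^2*(d.Hd),
    # with the Hessian approximated by J_r^T J_r (no explicit inversion).
    J = jac_r(w)                # m x n Jacobian of the residual vector
    g = J.T @ r(w)              # gradient of 0.5*||r(w)||^2
    Hd = J.T @ (J @ d)          # Gauss-Newton Hessian-vector product
    return -(g @ d) / (d @ Hd)  # minimizer of the quadratic model in z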
REFERENCES
Bell, A. and Sejnowski, T. (1995). An information-
maximization approach to blind separation and blind
deconvolution. Neural Computation, 7.
Cai, X., Tyagi, K., and Manry, M. (2011). An optimal con-
struction and training of second order rbf network for
approximation and illumination invariant image seg-
mentation.
Cao, L., Chua, K., Chong, W., Lee, H., and Gu, Q. (2003).
A comparison of pca, kpca and ica for dimensionality
reduction in support vector machine. Neurocomput-
ing, 55.
Chong, E. and Zak, S. (2013). An introduction to op-
timization. Wiley, USA, 3rd edition.
Claerbout, J. F. and Muir, F. (1973). Robust modeling with
erratic data.
Duda, R., Hart, P., and Stork, D. (2012). Pattern Classifica-
tion. Wiley-Interscience, USA, 2nd edition.
Fisher, R. (1936). The use of multiple measurements in
taxonomic problems. Annals of Eugenics, 7(2).
Fukunaga, K. (1990). Introduction to Statistical Pattern
Recognition. Academic Press, USA, 2nd edition.
Haykin, S. (2009). Neural networks and learning machines.
Prentice Hall, USA, 3rd edition.
Scales, J. A., Gersztenkorn, A., Treitel, S., and Lines, L. R. (1988).
Robust optimization methods in geophysical inverse
theory.
Ji, J. (2006). Cgg method for robust inversion and its appli-
cation to velocity-stack inversion.
Koren, Y. and Carmel, L. (2008). Robust linear dimension-
ality reduction. IEEE Transactions on Visualization
and Computer Graphics, 10(4):459–470.
Kwak, N. (2008). Principal component analysis based on
l-1 norm maximization. IEEE Trans. Pattern Analysis
and Machine Intelligence, 30(9):1672–1680.
Kwak, N. and Choi, C.-H. (2003). Feature extraction based
on ica for binary classification problems. IEEE Trans.
on Knowledge and Data Engineering, 15(6):1374–
1388.
Kwon, O.-W. and Lee, T.-W. (2004). Phoneme recognition
using ica-based feature extraction and transformation.
Signal Processing, 84(6).
Li, X., Hu, W., Wang, H., and Zhang, Z. (2010). Linear dis-
criminant analysis using rotational invariant l1 norm.
Neurocomputing, 73.
Malalur, S. S. and Manry, M. T. (2010). Multiple optimal
learning factors for feed-forward networks.
Oh, J. and Kwak, N. (2013). Generalization of linear dis-
criminant analysis using lp-norm. Pattern Recognition
Letters, 34(6):679–685.
Nevanlinna, O. (1993). Convergence of iterations for linear equations. Birkhäuser, USA, 3rd edition.
Sugiyama, M. (2007). Dimensionality reduction of mul-
timodal labeled data by fisher discriminant analysis.
Journal of Machine Learning Research, 8:1027–1061.
Theodoridis, S. and Koutroumbas, K. (2009). Pattern
Recognition. Academic Press, USA, 4th edition.
Turk, M. and Pentland, A. (1991). Face recognition using eigenfaces.