can tell from Figure 3 that the results can differ substantially from those of projection pursuit, which is specialized in finding the optimal direction.
We further compare our results with a random projection search by generating 10^5 uniformly distributed random points on the 13-dimensional unit sphere. The estimated weights are presented in Figure 4. From the plot we can easily see that there are considerable differences for all variable weights except one. Our explanation is that even though we generate 10^5 uniformly distributed random points on the 13-dimensional unit sphere, they are still distributed very sparsely in the space. These random points may not cover the whole space, and hence are very likely to miss the true direction that maximizes our projection index, which is defined as the MAE in this particular example.
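A minimal sketch of such a random projection search, assuming generic names (`random_directions`, `random_projection_search` are hypothetical) and using variance as a stand-in projection index, since the MAE index above depends on the specific regression setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_directions(n, d):
    """Draw n points uniformly on the unit sphere in d dimensions
    by normalizing standard Gaussian vectors."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def random_projection_search(X, index, n_draws):
    """Evaluate the projection index at each random direction and
    return the best direction found together with its score."""
    dirs = random_directions(n_draws, X.shape[1])
    scores = np.array([index(X @ w) for w in dirs])
    best = int(np.argmax(scores))
    return dirs[best], scores[best]

# Toy 13-dimensional data; variance as a stand-in projection index.
X = rng.standard_normal((200, 13)) * np.arange(1.0, 14.0)
w_best, score_best = random_projection_search(X, np.var, n_draws=2000)
```

Even with many draws, the best direction found this way is only as good as the densest sampling of the sphere allows, which is the sparsity issue noted above.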
Figure 3: Weight of each of the 13 variables in the Boston Housing dataset. The red bars denote weights found using the two-stage projection pursuit algorithm. The orange bars denote weights found using principal component analysis.
Figure 4: Weight of each of the 13 variables in the Boston Housing dataset. The red bars denote weights found using the two-stage projection pursuit algorithm. The blue bars denote weights found using the random projection pursuit method.
5 CONCLUSIONS
In this note we have introduced a new technique, namely the two-stage projection pursuit algorithm, for achieving variable selection with high-dimensional data. We stress that PCA is based on maximizing the proportion of total variance explained by the principal components, which may not be suitable for variable selection under certain scenarios, as shown in our simulation studies. The projection pursuit algorithm, on the other hand, can be applied with a more flexible objective function, which includes PCA as a special case. Previous efforts optimized such projection indices only on lower-dimensional unit spheres due to the computational burden. Our proposed two-stage algorithm overcomes this limitation by carrying out the optimization within a high-dimensional variable space. We believe this projection pursuit based method is more flexible and can be more efficient for feature selection. In this paper we used a common machine learning dataset to illustrate the performance of our projection pursuit based method. Note that the proposed method can be applied in other application settings without much modification. Furthermore, a larger and more intensive simulation study is needed to consolidate our proposed method and will be included in future work.
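The claim that PCA is a special case of projection pursuit can be illustrated with a small sketch on made-up data: take the variance of the projected data as the projection index; its maximizer over the unit sphere is the leading eigenvector of the covariance matrix, i.e. the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data whose columns have decreasing spread.
X = rng.standard_normal((300, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0)  # center, as PCA does

def variance_index(X, w):
    """Projection index that recovers PCA: variance of the 1-d projection."""
    return np.var(X @ w)

# The unit vector maximizing w' S w (S the covariance matrix) is the
# leading eigenvector of S, i.e. the first principal component.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]
```

No unit-norm direction can beat `pc1` on this index, which is the sense in which projection pursuit with a variance index reduces to PCA; swapping in a different index (such as the MAE used above) leaves PCA's guarantee behind.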
BIOINFORMATICS 2020 - 11th International Conference on Bioinformatics Models, Methods and Algorithms