large number of predictors. In practice, the Irrepresentability Condition is difficult to satisfy, and the Lasso then provides no guarantees on the number of false discoveries. However, Stability Selection has not been widely adopted because of its computational cost. When the irrepresentable condition (IC) is violated, two-stage procedures (e.g., thresholding or the adaptive Lasso) are used to achieve consistent variable selection. Such two-stage procedures involve choosing several tuning parameters, which further complicates the problem. We propose to combine the strengths of the (adaptively weighted) Lasso and Stability Selection for efficient and stable feature selection. In the first step, we apply the Lasso with a small regularization parameter, which selects a subset containing a small number of features. In the second step, stability feature selection using a weighted Lasso is applied to the restricted Lasso active set to select the most stable features. For the weighted $\ell_1$ penalization, the weights are computed from the first-stage Lasso estimator so that covariates with large effects in the Lasso fit receive smaller weights and covariates with small effects receive larger weights. We call the combination of the two the Post-Lasso Stability Selection (PLSS).
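To make the two stages concrete, the following Python sketch outlines PLSS using scikit-learn's Lasso. It is a minimal illustration, not the exact procedure used in our experiments: the function name plss_select, the weight rule $w_j = 1/(|\hat{\beta}_j| + \varepsilon)$, the subsample size $n/2$, and the default values of lam1, lam2 and the stability threshold pi_thr are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def plss_select(X, y, lam1=0.01, lam2=0.1, n_subsamples=100, pi_thr=0.6,
                eps=1e-6, seed=None):
    """Illustrative sketch of Post-Lasso Stability Selection (PLSS)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape

    # Stage 1 (pre-selection): Lasso with a small regularization parameter.
    beta_init = Lasso(alpha=lam1, max_iter=10_000).fit(X, y).coef_
    active = np.flatnonzero(beta_init)          # restricted Lasso active set
    if active.size == 0:
        return active

    # Adaptive weights from the first-stage fit: large effects get small weights.
    # The rule 1 / (|beta_j| + eps) is an assumed, conventional choice.
    w = 1.0 / (np.abs(beta_init[active]) + eps)

    # Stage 2 (selection): stability selection with a weighted Lasso,
    # applied only to the restricted active set.
    counts = np.zeros(active.size)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # random subsample
        X_sub = X[np.ix_(idx, active)] / w                # column rescaling = weighted l1 penalty
        coef = Lasso(alpha=lam2, max_iter=10_000).fit(X_sub, y[idx]).coef_
        counts += coef != 0

    # Keep predictors whose selection frequency exceeds the stability threshold.
    return active[counts / n_subsamples >= pi_thr]
```

Rescaling each retained column by its weight lets a standard Lasso solver impose the weighted $\ell_1$ penalty; the rescaling changes coefficient magnitudes but not which coefficients are nonzero, so the selected support is unaffected.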
Several authors have previously considered two-stage Lasso-type procedures with better variable selection properties than the single-stage Lasso, such as the adaptive Lasso (Zou, 2006), the thresholded Lasso (Zhao and Yu, 2006), the relaxed Lasso (Meinshausen, 2007) and the Gauss-Lasso (Javanmard and Montanari, 2013), to name a few. The Post-Lasso
Stability Selection is a special case of such a two-stage variable selection procedure: (1) Pre-selection stage: selection of predictors using the Lasso with a small tuning parameter; and (2) Selection stage: selection of the most stable features from the preselected predictors using Stability Selection with a weighted Lasso. We prove that under the Generalized Irrepresentability Condition (GIC) (Javanmard and Montanari, 2013), the initial Lasso active set obtained with a small tuning parameter contains the true active set S with high probability. Stability feature selection, with the weighted Lasso as the base selection procedure, then correctly identifies the stable predictors when applied to the restricted Lasso active set. The contributions of this paper are summarized as follows.
1. We briefly review two-stage procedures for stable feature selection and estimation.
2. We propose a new combined approach, the Post-Lasso Stability Selection (PLSS): the Lasso selects an initial active set, and Stability Selection with a weighted Lasso selects the stable features from that set.
3. We utilize the estimates from the initial-stage Lasso to compute the weights of the selected predictors used in the second stage.
4. We prove that under the GIC, PLSS correctly identifies the true active set with high probability.
5. We empirically show that PLSS yields fewer false positives than the standard Lasso and the adaptive Lasso.
6. We evaluate the computational complexity of PLSS and show that it is lower than that of standard stability feature selection using the Lasso.
The rest of this paper is organized as follows. In Section 2, we provide background, notation, assumptions and a brief review of the relevant work. In Section 3, we define and illustrate the Post-Lasso Stability Selection. In Section 4, we carry out simulation studies, and we conclude in Section 5.
2 BACKGROUND AND NOTATIONS
In this section, we state notations, assumptions and
definitions that will be used in later sections. We also
provide a brief review of relevant work and our con-
tribution.
2.1 Notations and Assumptions
We consider the sparse high-dimensional linear regression setup as in (1), where $p \gg n$. We assume that the components of the noise vector $\varepsilon$ are i.i.d. $N(0, \sigma^2)$. The true active set, or support of $\beta$, is denoted by $S$ and defined as $S = \{ j \in \{1, \ldots, p\} : \beta_j \neq 0 \}$. We assume sparsity in $\beta$ such that $s \ll n$, where $s = |S|$ is the sparsity index. The $\ell_1$-norm and the squared $\ell_2$-norm are defined as $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$, respectively. For a matrix $X \in \mathbb{R}^{n \times p}$, we use superscripts for the columns of $X$, i.e., $X^j$ denotes the $j$-th column, and subscripts for the rows, i.e., $X_i$ denotes the $i$-th row. For any $S \subseteq \{1, \ldots, p\}$, we denote by $X_S$ the restriction of $X$ to the columns in $S$, and $\beta_S$ is the vector $\beta$ restricted to the support $S$, with zeros outside $S$. Without loss of generality, we can assume that the first $s = |S|$ variables are the active variables, and we partition the empirical covariance matrix, $C = \frac{1}{n} X^T X$, according to the active and the redundant variables as follows:
$$
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} \qquad (3)
$$