large number of predictors. In practice, the Irrepresentability Condition is difficult to satisfy, and the Lasso then provides no guarantees on the number of false discoveries. However, Stability Selection has not been widely adopted because of its computational cost. When the irrepresentable condition (IC) is violated, two-stage procedures (e.g., thresholding or the adaptive Lasso) are used to achieve consistent variable selection. Such two-stage procedures involve choosing several tuning parameters, which further complicates the problem. We propose to combine the strengths of the (adaptively weighted) Lasso and Stability Selection for efficient and stable feature selection. In the first step, we apply the Lasso with a small regularization parameter, which selects a subset containing a small number of features. In the second step, stability feature selection using a weighted Lasso is applied to the restricted Lasso active set to select the most stable features. For the weighted $\ell_1$ penalization, the weights are computed from the first-stage Lasso estimator so that covariates with large effects in the Lasso fit receive smaller weights and covariates with small effects receive larger weights. We call the combination of the two the Post-Lasso Stability Selection (PLSS).
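To make the two stages concrete, the following Python sketch outlines PLSS using scikit-learn's Lasso. It is a minimal illustration, not the exact procedure used in our experiments: the function name plss_select, the weight rule $w_j = 1/(|\hat{\beta}_j| + \varepsilon)$, the subsample size $n/2$, and the default values of lam1, lam2 and the stability threshold pi_thr are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def plss_select(X, y, lam1=0.01, lam2=0.1, n_subsamples=100, pi_thr=0.6,
                eps=1e-6, seed=None):
    """Illustrative sketch of Post-Lasso Stability Selection (PLSS)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape

    # Stage 1 (pre-selection): Lasso with a small regularization parameter.
    beta_init = Lasso(alpha=lam1, max_iter=10_000).fit(X, y).coef_
    active = np.flatnonzero(beta_init)          # restricted Lasso active set
    if active.size == 0:
        return active

    # Adaptive weights from the first-stage fit: large effects get small weights.
    # The rule 1 / (|beta_j| + eps) is an assumed, conventional choice.
    w = 1.0 / (np.abs(beta_init[active]) + eps)

    # Stage 2 (selection): stability selection with a weighted Lasso,
    # applied only to the restricted active set.
    counts = np.zeros(active.size)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # random subsample
        X_sub = X[np.ix_(idx, active)] / w                # column rescaling = weighted l1 penalty
        coef = Lasso(alpha=lam2, max_iter=10_000).fit(X_sub, y[idx]).coef_
        counts += coef != 0

    # Keep predictors whose selection frequency exceeds the stability threshold.
    return active[counts / n_subsamples >= pi_thr]
```

Rescaling each retained column by its weight lets a standard Lasso solver impose the weighted $\ell_1$ penalty; the rescaling changes coefficient magnitudes but not which coefficients are nonzero, so the selected support is unaffected.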
Several authors have previously considered two-stage Lasso-type procedures with better variable selection properties than the single-stage Lasso, such as the adaptive Lasso (Zou, 2006), the thresholded Lasso (Zhao and Yu, 2006), the relaxed Lasso (Meinshausen, 2007) and the Gauss-Lasso (Javanmard and Montanari, 2013), to name a few. The Post-Lasso
Stability Selection is a special case of such a two-stage variable selection procedure: (1) Pre-selection stage: selection of predictors using the Lasso with a small tuning parameter; and (2) Selection stage: selection of the most stable features from the preselected predictors using Stability Selection with a weighted Lasso. We prove that under the Generalized Irrepresentability Condition (GIC) (Javanmard and Montanari, 2013), the initial Lasso active set obtained with a small tuning parameter contains the true active set S with high probability. Stability feature selection, with the weighted Lasso as the base selection procedure, then correctly identifies the stable predictors when applied to the restricted Lasso active set. The contributions of this paper are summarized as follows.
1. We briefly review two-stage procedures for stable feature selection and estimation.
2. We propose a new combined approach, the Post-Lasso Stability Selection (PLSS): the Lasso selects an initial active set, and Stability Selection with a weighted Lasso selects the stable features from that set.
3. We utilize the estimates from the initial-stage Lasso to compute the weights of the selected predictors used in the second stage.
4. We prove that under the GIC, PLSS correctly identifies the true active set with high probability.
5. We empirically show that PLSS yields fewer false positives than the standard Lasso and the adaptive Lasso.
6. We evaluate the computational complexity of PLSS and show that it is lower than that of standard stability feature selection using the Lasso.
The rest of this paper is organized as follows. In Section 2, we provide background, notation, assumptions and a brief review of the relevant work. In Section 3, we define and illustrate the Post-Lasso Stability Selection. In Section 4, we carry out simulation studies, and we conclude in Section 5.
2 BACKGROUND AND NOTATIONS
In this section, we state notations, assumptions and
definitions that will be used in later sections. We also
provide a brief review of relevant work and our con-
tribution.
2.1 Notations and Assumptions
We consider the sparse high-dimensional linear regression setup as in (1), where $p \gg n$. We assume that the components of the noise vector $\varepsilon$ are i.i.d. $N(0, \sigma^2)$. The true active set, or support of $\beta$, is denoted by $S$ and defined as $S = \{ j \in \{1, \ldots, p\} : \beta_j \neq 0 \}$. We assume sparsity in $\beta$ such that $s \ll n$, where $s = |S|$ is the sparsity index. The $\ell_1$-norm and the squared $\ell_2$-norm are defined as $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$, respectively. For a matrix $X \in \mathbb{R}^{n \times p}$, we use superscripts for the columns of $X$, i.e., $X^j$ denotes the $j$-th column, and subscripts for the rows, i.e., $X_i$ denotes the $i$-th row. For any $S \subseteq \{1, \ldots, p\}$, we denote by $X_S$ the restriction of $X$ to the columns in $S$, and $\beta_S$ is the vector $\beta$ restricted to the support $S$, with zeros outside $S$. Without loss of generality, we can assume that the first $s = |S|$ variables are the active variables, and we partition the empirical covariance matrix, $C = \frac{1}{n} X^T X$, according to the active and the redundant variables as follows:
$$
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} \qquad (3)
$$