Automatic Feature Selection for Sleep/Wake Classification with Small Data Sets

J. Foussier¹, P. Fonseca², X. Long² and S. Leonhardt¹

¹ Philips Chair for Medical Information Technology, RWTH Aachen University, Pauwelsstrasse 20, 52074 Aachen, Germany
² Philips Research Eindhoven, High Tech Campus 34, 5656AE Eindhoven, The Netherlands
Keywords: Sleep Monitoring, Sleep Staging, Feature Selection, Linear Discriminant Classification, Unobtrusive Monitoring, Cohen's Kappa, Spearman's Rank-order Correlation.
Abstract: This paper describes an automatic feature selection algorithm integrated into a classification framework developed to discriminate between sleep and wake states during the night. The feature selection algorithm proposed in this paper uses the Mahalanobis distance and Spearman's rank-order correlation as selection criteria to restrict the search in a large feature space. The algorithm was tested using a leave-one-subject-out cross-validation procedure on 15 single-night PSG recordings of healthy sleepers and then compared to the results of a standard Sequential Forward Search (SFS) algorithm. It achieved comparable performance in terms of Cohen's kappa (κ = 0.62) and the area under the Precision-Recall curve (AUC_PR = 0.59), while reducing the computational time by a factor of nearly 10. The feature selection procedure, applied on each iteration of the cross-validation, was found to be stable, consistently selecting a similar list of features. It selected an average of 10.33 features per iteration, nearly half of the 21 features selected by SFS. In addition, learning curves show that the training and testing performances converge faster than for SFS and that the final training-testing performance difference is smaller, suggesting that the new algorithm is better suited to data sets with a small number of subjects.
1 INTRODUCTION

Sleep is an essential process in most animals, including human beings, and although it has been studied for centuries, relatively little is known about it. It is clear, however, that sleep is essential for survival, as sleep deprivation studies on rats have shown (Rechtschaffen and Bergmann, 1995). Computer-aided sleep assessment was introduced to reduce the manpower and costs needed to collect and interpret data during these studies. However, most of these systems still require the subjects to spend one or more nights in a sleep laboratory, which remains a rather expensive and inconvenient procedure. Ambulatory sleep monitoring aims precisely at eliminating this requirement and can effectively be used for diagnosing several sleep disorders. For this, new sensors and algorithms are needed. Significant work has been done to exploit the fact that certain autonomic changes associated with different sleep stages also manifest themselves differently in parameters such as cardiorespiratory activity and body movements. By evaluating how these parameters change, it should be possible, at least to a certain extent, to distinguish some of these stages without resorting to EEG. Several research groups have worked on the extraction of cardiorespiratory and body movement features (e.g., (Devot et al., 2007), (Devot et al., 2010), (Redmond et al., 2007) or (Zoubek et al., 2007)). However, one of the main issues is that many publications address the sleep stage classification problem with a rather limited set of physiological features. Many authors report how successful a certain feature is for the classification task, instead of focusing on methods that aim at selecting the best set of features, as we will show later. There is, in fact, a plethora of features described in the literature which can be readily used for the task of sleep staging or the extraction of relevant sleep parameters.
Most available PSG data were generated for patients with sleep disorders. As a result, prior work related to sleep staging of healthy subjects with cardiorespiratory signals or actigraphy often relies on very small data sets (often fewer than a dozen subjects) that were collected by individual research groups for the validation of a new sensor and/or feature.
Many authors opt to perform a single feature selection step on the entire data set when applying traditional machine learning approaches, clearly biasing the classification results towards optimistic performance. Moreover, each single epoch is subject- and time-dependent. Therefore the data of several subjects cannot be (randomly) mixed and tested in a traditional leave-one-out cross-validation (LOOCV), but rather require a leave-one-subject-out cross-validation (LOSOCV) procedure, which reduces the number of possible folds in the cross-validation.
Finding the ideal set of features for sleep/wake classification is a complex and challenging task, especially for small data sets and when the number of features is large. An exhaustive search, although leading to the optimal feature set, becomes impractical in terms of computational time as soon as the dimension of the feature space grows. Sequential search, backward (SBS) and forward (SFS), tries to address this issue by following a single search path during the process (Whitney, 1971). However, it often delivers sub-optimal solutions, especially in problems with small data sets. An example of the use of the SFS algorithm can be found in the work of (Zoubek et al., 2007).
Building upon previous research published by (Devot et al., 2007; Devot et al., 2010), we will describe a new feature selection method that is particularly well suited for use in each single training step of the LOSOCV and for linear discriminant classifiers. Linear discriminant classifiers, like most other classifiers, are sensitive to the dimension of the feature space. Even assuming a certain degree of independence between the features, a large number of features can cause over-fitting and prevent the classifier from generalizing well to new data. On the other hand, if the dimension is too small the classifier will often be too sensitive to noise (Duda et al., 2001). Computational time also plays a role, especially when the number of available features increases. All these constraints have been taken into account during the design of the feature selection algorithm. In addition, as we will show, this feature selection method is also well suited for data sets with a small number of subjects. Finally, by integrating feature selection in the training step of a cross-validation procedure, we guarantee that the training (including feature selection) and testing steps are performed on mutually exclusive data sets, and at the same time on the largest possible data set. We will then apply and evaluate the proposed feature selection method within a classification framework used for sleep/wake detection in healthy sleepers. In order to highlight the properties of the proposed feature selection algorithm, all classification results, including total computational time, stability of the selected features and generalization capabilities, are compared to a standard Sequential Forward Search (SFS) algorithm.
2 METHODS AND MATERIALS
2.1 Data Set
The data set consists of 15 single-night PSG recordings of healthy sleepers: ten female (age 31±12.4 yrs, BMI 24.76±3.7 kg/m²) and five male subjects (age 31±5.5 yrs, BMI 24.38±2.72 kg/m²). Each PSG recording includes at least the EEG channels recommended by the American Academy of Sleep Medicine (AASM), a 2-lead ECG and the thoracic respiratory effort. In addition, actigraphy was acquired with a Philips Actiwatch and synchronized with the PSG. Nine subjects were measured in Boston (USA), at the Sleep Health Center, and six subjects in Eindhoven (The Netherlands), at the sleep laboratory of the High Tech Campus. The study protocol was approved by the Ethics Committee of the respective center and all subjects signed an informed consent form. Sleep stages were scored by professional sleep technicians according to the guidelines of the AASM as wake, non-REM sleep 1-3 (N1-N3) and REM sleep, using 30-second epochs. In order to train and test our classifier for the sleep and wake classes, we merged the N1-N3 and REM classes into a single sleep class. Since the data were recorded in two different sleep laboratories, with two differently configured PSG systems, the data were first resampled to a common sampling rate (512 Hz for ECG, 10 Hz for respiratory effort, and 30-second periods for actigraphy).
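A minimal sketch of this resampling step, assuming SciPy; the original sampling rates used here (500 Hz ECG, 25 Hz respiratory effort) are illustrative assumptions, not values reported in the paper:

import numpy as np
from scipy.signal import resample_poly

def to_common_rates(ecg, fs_ecg, resp, fs_resp):
    """Resample ECG to 512 Hz and respiratory effort to 10 Hz (integer rates assumed)."""
    ecg_512 = resample_poly(ecg, up=512, down=fs_ecg)
    resp_10 = resample_poly(resp, up=10, down=fs_resp)
    return ecg_512, resp_10

ecg = np.random.randn(500 * 60)   # one minute of toy 500 Hz ECG
resp = np.random.randn(25 * 60)   # one minute of toy 25 Hz effort
ecg_512, resp_10 = to_common_rates(ecg, 500, resp, 25)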
2.2 Classification Framework
The classification framework, illustrated in Figure 1, is divided into two main parts: training and classification. Before training, the data set is first split into independent training and testing sets. Each contains manually annotated sleep scores, indicating the sleep stage for every epoch. The predictions of the classifier are compared with the ground-truth annotations and the performance of the classifier on the testing set is computed.
[Figure 1: Block diagram illustrating the classification framework. The training block performs feature selection and training on the training data and produces a feature short-list and a model; the classification block applies the model to the testing data to produce predictions, which are then evaluated.]

As mentioned in the introduction, autonomic changes associated with different sleep stages will manifest themselves differently in certain physiological parameters. In order to exploit these changes we
extracted a total of 60 features from the ECG, the respiratory (thoracic) effort and the actigraphy signals on 30-second epochs. The cardiac features are based on heart rate variability (HRV) evaluated in the time and frequency domains. Non-linear properties were also examined based on Detrended Fluctuation Analysis (DFA) and Sample Entropy. The respiratory features were defined in the time domain (including statistical measures derived from both the signal waveform and the respiratory period, as well as non-linear measures of "similarity") and in the frequency domain. For actigraphy, we used so-called activity counts, directly acquired with the Actiwatch. Since feature extraction was not a focus of this work, we will not discuss it further in this paper and refer instead to previous work (Devot et al., 2010; Long et al., 2012).
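For flavor only, a sketch of two common per-epoch time-domain HRV features (mean heart rate and SDNN); these names and definitions are standard HRV measures and stand in for the paper's actual 60-feature set, which is described in the cited prior work:

import numpy as np

def hrv_features(ibi_s):
    """ibi_s: inter-beat intervals (seconds) falling inside one 30-s epoch."""
    hr = 60.0 / ibi_s                                   # instantaneous heart rate (bpm)
    return {
        "mean_hr": float(np.mean(hr)),
        "sdnn_ms": float(np.std(ibi_s, ddof=1) * 1e3),  # SDNN in milliseconds
    }

epoch_ibis = np.array([0.82, 0.85, 0.80, 0.88, 0.84])   # toy data
print(hrv_features(epoch_ibis))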
The training step comprises an iterative feature selection procedure whereby a short-list of the original 60 features is chosen. This short-list should comprise the set of features that best characterizes the different sleep stages according to the annotations of the training set. On each iteration of feature selection, the input feature vectors are reduced to a subset of feature vectors. This subset is then used to train a model which is in turn used to classify the same input data. The training classification performance is fed back to the feature selection procedure.
Assuming that all features are normally distributed and the covariance matrices for all classes are identical, i.e., $\boldsymbol{\Sigma}_a = \boldsymbol{\Sigma}$, we have a "linear discriminant" function given by

$$g_a(\mathbf{f}) = -\frac{1}{2}(\mathbf{f} - \boldsymbol{\mu}_a)'\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{f} - \boldsymbol{\mu}_a) + \ln(P(\omega_a)) \quad (1)$$
where $\boldsymbol{\mu}_a$ and $\boldsymbol{\Sigma}$ are the mean vector for class $\omega_a$ and the pooled covariance matrix, respectively (Duda et al., 2001; Redmond et al., 2007). To use this function in the training step of our classification framework, we need to compute the sample mean and the prior probability of each class and the inverse pooled covariance matrix $\boldsymbol{\Sigma}^{-1}$. We chose the linear discriminant instead of a quadratic discriminant, because quadratic discriminants are known to require larger sample sizes than linear discriminants and they also seem to be more sensitive to possible violations of the basic assumption of normality (Friedman, 1989). This is particularly important for the classification of features derived from physiological data, which very often do not follow a normal distribution. Furthermore, for problems with small sample sizes it is also common to use the pooled covariance estimate as a replacement for the population class covariance matrices (Friedman, 1989).
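A minimal NumPy sketch of Eq. (1) with a pooled covariance estimate; function and variable names are ours, not the authors':

import numpy as np

def fit_lda(X, y):
    """X: (n_epochs, n_features); y: integer class labels. Returns per-class parameters."""
    classes = np.unique(y)
    n, d = X.shape
    means, priors = {}, {}
    pooled = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        priors[c] = len(Xc) / n
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)  # class scatter
    pooled /= (n - len(classes))           # pooled covariance matrix
    return means, priors, np.linalg.inv(pooled)

def discriminant(f, mean, prior, inv_pooled):
    """Eq. (1): g_a(f) = -1/2 (f - mu_a)' Sigma^-1 (f - mu_a) + ln P(omega_a)."""
    diff = f - mean
    return -0.5 * diff @ inv_pooled @ diff + np.log(prior)

An epoch is then assigned to the class whose discriminant value is largest.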
Regarding the prior probabilities $P(\omega_a)$ of each class, we used the observation that the different classes have different probabilities throughout the night (Redmond et al., 2007). The time-dependent prior probabilities for a given class can be obtained by counting, for each epoch relative to the beginning of each recording (i.e., when the lights were turned off), the number of times that epoch was annotated with that class. The prior probability term in the linear discriminant (1) can be used to bias the classification towards a certain class.
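A sketch of this counting procedure, assuming the annotations are available as one integer label array per recording (0 = sleep, 1 = wake), aligned to lights-off:

import numpy as np

def time_dependent_priors(annotations, n_classes=2):
    """annotations: list of per-recording label arrays. Returns P(class | epoch index)."""
    max_len = max(len(a) for a in annotations)
    counts = np.zeros((max_len, n_classes))
    for labels in annotations:
        for t, c in enumerate(labels):
            counts[t, c] += 1          # one vote per recording per epoch index
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1            # avoid division by zero past shorter recordings
    return counts / totals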
The feature selection procedure described in this paper aims at selecting features that offer a high discrimination power between classes while being mutually uncorrelated. It makes use of the classifier structure and theory. It can be shown that when the linear discriminant function in (1) is used in a two-class problem where the classes are equiprobable, the error probability of the classifier depends on the following metric:
$$\delta^2 = (\boldsymbol{\mu}_a - \boldsymbol{\mu}_b)'\,\boldsymbol{\Sigma}^{-1}\,(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b). \quad (2)$$
This metric, also called the Mahalanobis distance, reflects the "class separability" for a given feature set. As such, it is an appropriate measure of the discriminating power of each individual feature. When evaluating a single feature $k$, the inter-class distance between the classes $\omega_a$ and $\omega_b$ can be rewritten as

$$d_k = \frac{|\mu_a^k - \mu_b^k|}{\sigma_k} \quad (3)$$

where $\mu_a^k$ and $\mu_b^k$ are the population means of each class and $\sigma_k$ is the standard deviation of feature $k$. The top-discriminating features are those with the highest inter-class distance $d_k$. As a measure of correlation between features, the algorithm uses Spearman's rank-order correlation (Abdullah, 1990). This correlation measure is particularly robust in the presence of outliers, which are very common when measuring physiological signs. Unlike Pearson's correlation, it does not require a linear relation between the features to express the correlation between them. As an example of signals with a high Spearman's rank-order and a lower Pearson's correlation, consider the inter-beat interval (IBI) and the derived instantaneous heart rate (HR): $\mathrm{HR} = \mathrm{IBI}^{-1}$. It is clear that HR and IBI correlate monotonically, but not linearly.
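This can be verified numerically with SciPy (a toy demonstration, not part of the original study): since HR = 1/IBI is a strictly monotone transform, the rank correlation is exactly -1, while the Pearson correlation is not.

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
ibi = rng.uniform(0.5, 1.5, size=1000)    # toy inter-beat intervals (s)
hr = 1.0 / ibi                            # instantaneous heart rate

print("Pearson :", pearsonr(ibi, hr)[0])  # strongly negative, but not -1
print("Spearman:", spearmanr(ibi, hr)[0]) # exactly -1.0 (monotone relation)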
Maximum discrimination power and minimum correlation are combined in the feature selection algorithm described in Algorithm 1 ("mahal"). The algorithm assigns a score to each feature (steps 1 to 4). The higher the score, the better a feature is for our classification task. An iterative procedure then searches over a variety of score thresholds and determines the classification performance obtained with the corresponding feature short-list (step 5). The highest performance corresponds to the optimal short-list of features for our training set. Note that when cross-validation is used to evaluate the performance of a classifier, this procedure can be applied to the training set defined on each iteration. Each short-list can then be used for classification with the testing set of the same iteration.
Both the mahal and SFS feature selection methods will be evaluated by comparing the performance in terms of κ and AUC_PR on the training and the testing sets. Performance curves during the feature selection procedure, learning curves and AUC_PR values of the classification results using the selected features of each cross-validation step will all help give a good insight into the overall performance of each feature selection method. In addition, the number and diversity of the selected features and the total computation time are analyzed. It can be shown that the fraction of misclassified epochs during LOSOCV corresponds to the maximum likelihood estimate for the (unknown) error rate of a classifier (Duda et al., 2001). Although this procedure has also been used to evaluate the performance of similar classifiers in earlier work, the feature selection there was applied on the complete data set and, therefore, also on the testing set. The feature selection described in this paper is applied in each iteration of the LOSOCV, guaranteeing that the testing data used to validate the classifier were not exposed to the tuning and training steps. This means that for each iteration a separate short-list of features is determined.
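A sketch of this protocol, with feature selection repeated inside every LOSOCV fold so the held-out subject never influences the short-list; `select_features`, `train_classifier` and the `predict` interface are our stand-ins for mahal/SFS and the linear discriminant above:

import numpy as np

def losocv(features_by_subject, labels_by_subject, select_features, train_classifier):
    predictions = {}
    subjects = list(features_by_subject)
    for held_out in subjects:
        train_ids = [s for s in subjects if s != held_out]
        X_train = np.vstack([features_by_subject[s] for s in train_ids])
        y_train = np.concatenate([labels_by_subject[s] for s in train_ids])
        short_list = select_features(X_train, y_train)       # fold-specific short-list
        model = train_classifier(X_train[:, short_list], y_train)
        predictions[held_out] = model.predict(
            features_by_subject[held_out][:, short_list])    # unseen subject only
    return predictions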
First, the performance of the classifier was evaluated using the traditional metrics of accuracy, precision, specificity and sensitivity (considering wake as the positive class) for each iteration of the LOSOCV and for the pooled results (Fawcett, 2004). However, because the wake and sleep classes are very imbalanced (the wake epochs represent less than 10% of all epochs), these metrics can fail to give an accurate overview of the performance for both classes (He and Garcia, 2009). For that reason we do not present those metrics in this paper, but compute and analyze Cohen's kappa coefficient of agreement (κ) instead.
Algorithm 1: (mahal).

For the feature values and associated ground-truth $\{\mathbf{f}_i, y_i\}$ of each epoch in a given training set:

Step 1. Compute the inter-class distance for each feature $k$ as
$$d_k = \frac{|\bar{f}_a^k - \bar{f}_b^k|}{\sigma_k} \quad (4a)$$
where the sample mean for a given class $z$ and the standard deviation are given by
$$\bar{f}_z^k = \frac{\sum_{i \in Z} f_i^k}{\#Z}, \quad \text{for } Z = \{i \mid y_i = z\} \quad (4b)$$
$$\sigma_k = \sqrt{\frac{\sum_{i=1}^{N}\left(f_i^k - \bar{f}^k\right)^2}{N-1}}, \quad \text{with } \bar{f}^k = \frac{\sum_{i=1}^{N} f_i^k}{N} \quad (4c)$$
Collect all unique inter-class distances in an array $\mathbf{d}$.

Step 2. Compute the Spearman's rank-order correlation $c_{k,l}$ between each feature and the remaining features,
$$c_{k,l} = \mathrm{corr}\left(\mathbf{f}'_k, \mathbf{f}'_l\right) \quad (4d)$$
where $\mathbf{f}'_k$ and $\mathbf{f}'_l$ are the values of features $k$ and $l$, respectively, for each epoch in the training set.

Step 3. Assign a "score" $s_k$ of zero to each feature,
$$s_k := 0, \quad \text{for } k \in \{1, \ldots, N_F\} \quad (4e)$$

Step 4. For each $M$ in $\mathbf{d}$ and for $C = 0 \ldots 1$, step size $\Delta_C = 0.01$: for each feature $k$, if $d_k > M$ and the feature is uncorrelated with the others,
$$c_{k,l} < C, \quad \forall l \in \{1, \ldots, N_F\} \quad (4f)$$
or has a higher distance than the features it is correlated with,
$$d_k > d_l, \quad \forall l \in \{l \mid l = 1, \ldots, N_F,\; c_{k,l} \geq C\} \quad (4g)$$
increase its score,
$$s_k := s_k + 1/N_S \quad (4h)$$
where $N_S$ is the number of loop steps,
$$N_S = \frac{\#\mathbf{d}}{\Delta_C} \quad (4i)$$

Step 5. For $S = 0 \ldots 1$, step size $\Delta_S = 1/N_S$: compile a short-list of features $l_S$ with score higher than the threshold, $l_S = \{k \mid s_k > S\}$, and compute the performance $\kappa_S$ of the classifier on the training set using $l_S$.

Step 6. Return as the final short-list the list of features which gave the highest performance, $l = l_{S_{\mathrm{MAX}}}$, with
$$S_{\mathrm{MAX}} = \{S \mid \kappa_S = \max(\{\kappa_1, \kappa_2, \ldots\})\}. \quad (4j)$$
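One possible reading of Steps 1 to 4 as an unoptimized Python sketch (not the authors' code); it assumes NumPy/SciPy, binary labels in {0, 1}, and treats correlation magnitudes, i.e. |c_{k,l}|, as the redundancy measure:

import numpy as np
from scipy.stats import spearmanr

def mahal_scores(X, y, dC=0.01):
    """X: (n_epochs, n_features); y: binary class labels. Returns per-feature scores s_k."""
    d = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)) \
        / X.std(axis=0, ddof=1)                    # Eq. (4a)
    c = np.abs(spearmanr(X).correlation)           # Eq. (4d), pairwise |c_{k,l}|
    n_feat = X.shape[1]
    scores = np.zeros(n_feat)                      # Eq. (4e)
    dist_thresholds = np.unique(d)
    corr_thresholds = np.arange(0.0, 1.0 + dC, dC)
    n_steps = len(dist_thresholds) * len(corr_thresholds)    # Eq. (4i)
    for M in dist_thresholds:                      # Step 4 double sweep
        for C in corr_thresholds:
            for k in range(n_feat):
                mask = (np.arange(n_feat) != k) & (c[k] >= C)
                if d[k] > M and (not mask.any()            # Eq. (4f): uncorrelated
                                 or np.all(d[k] > d[mask])):  # Eq. (4g)
                    scores[k] += 1.0 / n_steps     # Eq. (4h)
    return scores

Step 5 then sweeps a score threshold S over [0, 1], trains and evaluates the classifier on each short-list {k | s_k > S}, and Step 6 returns the short-list with the highest training κ.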
AutomaticFeatureSelectionforSleep/WakeClassificationwithSmallDataSets
181
[Figure 2: Performance κ on the training set for different score thresholds.]
This metric is directly interpretable as the proportion of joint judgments for which there is agreement, after chance agreement is excluded (Cohen, 1960). Despite their widespread use, these metrics only assess the classifier's performance at a single point in the entire solution space, namely that obtained by directly comparing the output of the discriminants defined for both classes by (1). To compare a classifier with other classifiers, this single point might not be sufficient: by using it, we assume equal misclassification costs and fully known class distributions (Provost et al., 1998). When there is a large imbalance between the classes, as is the case for sleep/wake classification, Precision-Recall (PR) curves should be used instead of Receiver Operating Characteristic (ROC) curves (Davis and Goadrich, 2006). In order to assess the performance of a classifier across the entire solution space, it is customary to compute the area under the curve, in this case under the PR curve (AUC_PR). Unlike the computation of the AUC for the ROC curve, computing the AUC for the PR curve requires a more complex procedure, the composite trapezoidal method proposed by (Davis and Goadrich, 2006).
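For reference, a sketch of both headline metrics using scikit-learn equivalents (an assumption; the paper does not name its tooling). Note that sklearn.metrics.auc applies plain trapezoidal integration to the PR curve, so values may differ slightly from the interpolation discussed by Davis and Goadrich (2006):

from sklearn.metrics import auc, cohen_kappa_score, precision_recall_curve

def evaluate(y_true, y_pred, wake_scores):
    """y_true/y_pred: 0 = sleep, 1 = wake; wake_scores: continuous discriminant outputs."""
    kappa = cohen_kappa_score(y_true, y_pred)
    precision, recall, _ = precision_recall_curve(y_true, wake_scores)
    auc_pr = auc(recall, precision)   # area under the PR curve
    return kappa, auc_pr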
Finally, we also computed so-called "learning curves" (Duda et al., 2001) to gain insight into the generalization capabilities of the classifier. By varying the number of subjects in the data set, these curves can help predict what the performance of the classifier would be when using more training data. They can be obtained by computing the testing and training error or, alternatively, the testing and training performance (e.g., κ) for data subsets of different size n randomly selected from the whole data set. As n increases, the testing and training performance should approach the same asymptotic value. The convergence speed indicates how well a classifier is suited for small data sets and has to be taken into account when classifying them.
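A sketch of this procedure, assuming a `run_losocv` stand-in that returns pooled training and testing κ for a given subject subset (averaging over several random draws per subset size is our addition for smoothing):

import random

def learning_curve(subject_ids, run_losocv, n_min=2, repeats=5):
    curve = []
    for n in range(n_min, len(subject_ids) + 1):
        train_k, test_k = [], []
        for _ in range(repeats):                    # average over random subject draws
            subset = random.sample(subject_ids, n)
            k_train, k_test = run_losocv(subset)
            train_k.append(k_train)
            test_k.append(k_test)
        curve.append((n, sum(train_k) / repeats, sum(test_k) / repeats))
    return curve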
[Figure 3: Performance κ on the training set for different desired numbers of features in the SFS algorithm.]
3 RESULTS

First, the performance κ obtained on the training set for each score threshold S (step 5 of the feature selection algorithm) is illustrated in Figure 2. To further show the stability of the feature selection process, we plotted as shaded bands the range between the minimum and maximum κ obtained for each threshold across all iterations of the LOSOCV. The performance peaks around S = 0.5, with a relatively narrow shaded band. Intuitively, a narrow performance band means that the performance obtained for a given score threshold is similar across all iterations of the LOSOCV, suggesting that the procedure is stable. Note that with thresholds beyond 0.6, and therefore with smaller short-lists, the performance drops. There seems to be an optimal number of features which on the one hand prevents over-fitting and on the other maximizes the generalization capabilities of our classifier. The use of a score threshold in this procedure is advantageous since, unlike many other feature selection algorithms, we do not need to specify the "desired" number of features, letting it instead depend on the actual properties of the training set. In a similar manner, by sweeping through a desired number of features, we can observe how the performance of SFS evolves (Figure 3). The average performance is maximal when using 26 features. The width of the shaded band is comparable to the one in Figure 2. Note that the performance on the training set is higher than with mahal, where features that correlate with each other are excluded from the start, even if they have a high discriminative power. This typically leads to a higher performance on the training data set but, as we will show, not necessarily on the testing data set.
Table 1 lists the performance of the classifier on the testing set for each iteration of the LOSOCV, as well as the overall performance. The first column indicates the iteration number of the LOSOCV. The three columns under each of the headers mahal and SFS indicate the κ performance, the AUC_PR and the number of selected features for each iteration of the LOSOCV for the mahal and the SFS algorithm, respectively. The row "pooled" indicates the overall performance obtained after pooling all classification epochs. Note that this is different from the average results, which are indicated in the row "mean" together with the standard deviation "std". In addition, the total computation time for all 15 feature selection steps is included in the header of the table.

Table 1: Results on the testing set during cross-validation, considering the wake class as positive.

              mahal (total time = 984 s)     SFS (total time = 9205 s)
    it        κ      AUC_PR   # feat.        κ      AUC_PR   # feat.
    1         0.77   0.83      8             0.77   0.85     20
    2         0.58   0.73      7             0.70   0.75     15
    3         0.66   0.61      7             0.71   0.77     21
    4         0.55   0.82     10             0.65   0.89     31
    5         0.61   0.80     15             0.67   0.68     31
    6         0.94   0.94      7             0.73   0.92     17
    7         0.76   0.85     16             0.69   0.78     24
    8         0.89   0.93     10             0.81   0.86     22
    9         0.76   0.89      8             0.63   0.88     27
    10        0.28   0.28     11             0.32   0.32     17
    11        0.70   0.79     10             0.71   0.77     21
    12        0.60   0.83      9             0.64   0.74     11
    13        0.56   0.68     14             0.68   0.76     10
    14        0.24   0.35     13             0.49   0.59     30
    15        0.53   0.80     10             0.58   0.80     18
    pooled    0.62   0.59      -             0.64   0.60      -
    mean      0.63   0.74     10.33          0.65   0.76     21
    std       0.19   0.19      2.94          0.12   0.15      6.69
Comparing the classification results against the ground-truth, similar kappa values of 0.62 and 0.64 were obtained for mahal and SFS, respectively. As can also be seen, κ ranges from 0.24 to 0.94 for mahal and from 0.32 to 0.81 for SFS. This shows that between-subject variation is quite high, reflecting important physiological differences between individuals, regardless of the employed feature selection algorithm. The AUC_PR values of both algorithms are also comparable, with 0.59 (mahal) and 0.60 (SFS). A Wilcoxon signed rank test confirmed that the results are not significantly different (p = 0.39).
[Figure 4: Learning curves obtained by varying the number of subjects in the training and testing sets (training and testing curves shown for both mahal and SFS).]

Figure 4 displays the learning curves obtained by varying the number of subjects in the data set. The overall performances (pooled LOSOCV results) achieved on the training and testing data sets with mahal converge rapidly from 5 subjects onwards and stabilize around a κ of 0.65. SFS converges much more slowly, and its gap between training and testing performance remains larger than for mahal, even with 15 subjects: its performance on the training set remains slightly above 0.7, whereas the performance on the testing set is around 0.65 for both mahal and SFS.
4 DISCUSSION AND CONCLUSIONS

The feature selection step, essential to the proper design of a good classifier, usually suffers from an important methodological issue: in the presence of little available data, researchers often opt to perform feature selection on the complete data set. The feature selection algorithm proposed in this paper addresses this issue by offering the possibility of being integrated into a cross-validation procedure. The algorithm is fully automatic and, more importantly, does not require the choice of a desired number of features.
Figure 2 and Figure 3 do not reflect how many (different) features were chosen. Inspecting Table 1, a big difference is noticeable in the number of selected features. Where mahal selects about 10 features on average, with a standard deviation (std) of less than 3 features, SFS selects 21 features with a standard deviation of nearly 7. The diversity of the features selected by SFS is very high, whereas mahal consistently selects a similar set of features during the different iterations of the cross-validation procedure, even with a small data set. In order to compare how consistent the feature selection algorithms were across the different iterations of the cross-validation procedure, we computed the mean number (and standard deviation) of iterations in which each feature was selected. On average, the mahal algorithm chose each selected feature in 9.12 (5.42) iterations and SFS in 6.43 (4.59) iterations. Only 17 different features were selected by mahal over the whole procedure, compared to 46 for SFS. Furthermore, each feature is selected, on average, more often than with SFS, which further illustrates how robust the selection procedure is to changes in the training set. A higher diversity of features has two main drawbacks. First, it is more difficult to choose a final set of features when designing a classifier, since this set seems to vary with every small change in the training set. Second, more features mean a higher feature extraction time. Despite the smaller number of selected features, the classification performance was not significantly affected.
The performance of a feature selection algorithm can also be described in terms of the total computational time that the algorithm needs to find the optimal feature set. Here, we only analyze the time needed by the feature selection itself; the feature extraction step is not taken into account. mahal is nearly 10 times faster than SFS, with 984 s and 9205 s, respectively. By design, on each iteration SFS must redo the entire classification step on the training set for each candidate feature before choosing which feature to add to the feature set, and this time increases steeply with each new feature added. In contrast, the computational time of the mahal algorithm increases approximately linearly, as the performance calculations are only performed on the selected feature subsets (step 5 of the algorithm). In addition, the algorithm automatically restricts the list of features that have to be tested during selection by evaluating their statistical power in advance, i.e., via the Mahalanobis distance and Spearman's rank-order correlation.
Our classifier achieves a performance of κ = 0.62 in distinguishing sleep from wake, which is at least as high as in most work published so far, while using fewer features during classification. However, the differences in performance obtained for different subjects are too large to be ignored. It seems from the learning curves that this classifier is approaching its maximum performance with the currently extracted features. In order to further improve it, new approaches seem to be needed. These could take into account, or better yet, compensate for subject-specific differences in the physiological expression of different sleep stages. Nevertheless, the feature selection algorithm mahal described in this paper seems well suited for this problem since it is stable enough to be integrated in a cross-validation procedure, even in the presence of small data sets.
ACKNOWLEDGEMENTS

The authors thank Dr. Reinder Haakma and Sandrine Devot, as well as Prof. Ronald Aarts, for their comments and careful reading of the manuscript.
REFERENCES

Abdullah, M. (1990). On a robust correlation coefficient. The Statistician, 39(4):455–460.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 233–240, Pittsburgh (USA). ACM Press.

Devot, S., Bianchi, A. M., Naujokat, E., Mendez, M., Brauers, A., and Cerutti, S. (2007). Sleep monitoring through a textile recording system. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2560–2563.

Devot, S., Dratwa, R., and Naujokat, E. (2010). Sleep/wake detection based on cardiorespiratory signals and actigraphy. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 5089–5092. IEEE.

Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. Wiley, 2nd edition.

Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Laboratories.

Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175.

He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.

Long, X., Fonseca, P., Foussier, J., Haakma, R., and Aarts, R. (2012). Using dynamic time warping for sleep and wake discrimination. In IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 886–889, Hong Kong/Shenzhen (China).

Provost, F., Fawcett, T., and Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning, pages 445–453.

Rechtschaffen, A. and Bergmann, B. (1995). Sleep deprivation in the rat by the disk-over-water method. Behavioural Brain Research, 69(1-2):55–63.

Redmond, S. J., de Chazal, P., O'Brien, C., Ryan, S., McNicholas, W. T., and Heneghan, C. (2007). Sleep staging using cardiorespiratory signals. Somnologie, 11(4):245–256.

Whitney, A. (1971). A direct method of nonparametric measurement selection. IEEE Transactions on Computers, C-20(9):1100–1103.

Zoubek, L., Charbonnier, S., Lesecq, S., Buguet, A., and Chapotot, F. (2007). Feature selection for sleep/wake stages classification using data driven methods. Biomedical Signal Processing and Control, 2(3):171–179.
BIOINFORMATICS2013-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
184