remaining zones, we need to define the polytopes of
the samples in Reference 1 and Reference 2. These
polytopes are built using the convex hull of the ro-
bust principal component scores. More specifically,
the boundary of the green zone is defined by com-
puting the convex hull of the robust principal compo-
nent scores of Reference 1. A short description of each zone is provided in Table 1. Before determining the color tag for each new data point, the samples are checked for missing values, which are imputed when necessary using multivariate imputation methods such as that of Josse et al. (2011). The idea behind the validity assessment is visualized in Figure 4. For simplicity, only two sensors are used for all computations in Figure 4, and a 2D representation of the zones is plotted in the sensors' coordinates. Suppose that $X_{N\times 11}$ represents the matrix of sensor values for $N$ samples, $y_N$ the vector of corresponding odor concentration values, and $x_l^\top$ the $l$th row of $X_{N\times 11}$, $l = 1, 2, \ldots, N$.
Furthermore, suppose that $n_1$ refers to the number of samples in the proposed sampling set and $n_2$ to the number of samples in the calibration set. The samples of the proposed set are always available, but not necessarily those of the calibration set. Two different scenarios occur based on the availability of the calibration set: if the calibration set is accessible, we are in Scenario 1; otherwise, we only deal with Scenario 2. Scenario 1 is the general case and is explained in more detail. The data undergo a pre-processing stage, including imputation and outlier detection, before any further analyses. After the pre-processing stage, the data are stored as Reference 1, $X_{n_1\times 11}$, and Reference 2, $X_{n_2\times 11}$. The first $k$, e.g. $k = 2, 3$, robust principal components of $X_{n_1\times 11}$ are calculated and the corresponding loading matrix is denoted by $L_1$.
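As an illustration only, a minimal Python sketch of this pre-processing step is given below. It assumes that missing values are imputed with scikit-learn's IterativeImputer (a stand-in for the method of Josse et al. (2011)) and that the robust loading matrix $L_1$ is taken from the leading eigenvectors of a Minimum Covariance Determinant (MCD) scatter estimate, a simplified substitute for a dedicated robust PCA such as ROBPCA; the function name preprocess_reference1 is hypothetical.

# Minimal sketch of the pre-processing step, not the authors' exact pipeline:
# impute missing sensor values, then extract k robust principal component loadings L1.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.covariance import MinCovDet

def preprocess_reference1(X_ref1, k=2):
    """X_ref1: (n1 x 11) sensor matrix, possibly with NaNs. Returns imputed data, the MCD fit, and L1 (11 x k)."""
    # Multivariate imputation (stand-in for Josse et al., 2011)
    X_imp = IterativeImputer(random_state=0).fit_transform(X_ref1)

    # Robust scatter estimate; its leading eigenvectors play the role of the
    # robust PCA loading matrix L1.
    mcd = MinCovDet(random_state=0).fit(X_imp)
    eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(eigvals)[::-1]   # sort eigenvectors by decreasing variance
    L1 = eigvecs[:, order[:k]]          # (11 x k) loading matrix
    return X_imp, mcd, L1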
Sub-Algorithm: (Scenario 1).
1: if the point $x_l^\top$, $l = 1, 2, \ldots, N$, is identified as an outlier by the AO measure then
2: $x_l^\top$ is in the red zone,
3: else if $x_l^\top L_1 \in \mathrm{ConvexHull}^{(1)}$ AND $x_l^\top L_1 \notin \mathrm{ConvexHull}^{(2)}$ then
4: $x_l^\top$ is in the green zone,
5: else if $x_l^\top L_1 \notin \mathrm{ConvexHull}^{(1)}$ AND $x_l^\top L_1 \in \mathrm{ConvexHull}^{(2)}$ then
6: $x_l^\top$ is in the blue zone,
7: else if $x_l^\top L_1 \in \mathrm{ConvexHull}^{(1)}$ AND $x_l^\top L_1 \in \mathrm{ConvexHull}^{(2)}$ then
8: $x_l^\top$ is in the orange zone,
9: else
10: $x_l^\top$ is in the yellow zone.
11: end if
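A minimal sketch of this zone assignment is given below. It assumes the two convex hulls are represented as scipy.spatial.Delaunay triangulations of the projected reference scores (so membership can be tested with find_simplex) and that the AO outlyingness decision is supplied as a boolean flag computed elsewhere; the names in_hull and classify_zone are illustrative, not the paper's implementation. Passing hull2=None reproduces Scenario 2, where $\mathrm{ConvexHull}^{(2)} = \emptyset$ and the blue and orange zones disappear.

# Illustrative zone assignment following the Sub-Algorithm (Scenario 1).
# hull1 / hull2 are scipy.spatial.Delaunay objects built on projected
# reference scores (e.g. Delaunay(X_ref @ L1)); hull2=None gives Scenario 2.
import numpy as np
from scipy.spatial import Delaunay

def in_hull(score, hull):
    """True if the projected point lies inside the triangulated convex hull."""
    return hull is not None and hull.find_simplex(score) >= 0

def classify_zone(x_l, L1, hull1, hull2, is_ao_outlier):
    """x_l: one row of sensor values (length 11); is_ao_outlier: flag from the AO measure."""
    score = np.asarray(x_l) @ L1   # project onto the robust loadings
    if is_ao_outlier:
        return "red"
    if in_hull(score, hull1) and not in_hull(score, hull2):
        return "green"
    if not in_hull(score, hull1) and in_hull(score, hull2):
        return "blue"
    if in_hull(score, hull1) and in_hull(score, hull2):
        return "orange"
    return "yellow"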
The pseudo code of the two algorithms for Scenario 1 is given by the Sub-Algorithm above and the Main Algorithm below. Scenario 2 is a special case of Scenario 1 in which Sub-Algorithm (Scenario 1) is used with $\mathrm{ConvexHull}^{(2)} = \emptyset$, which eliminates the blue and the orange zones. Consequently, there is no model for odor concentration prediction in the Main Algorithm.
Main Algorithm: (Scenario 1).
Require: $X_{n_1\times 11}$, $X_{n_2\times 11}$, and the loading matrix $L_1$ obtained using robust PCA over Reference 1, $X_{n_1\times 11}$.
1: $\mathrm{ConvexHull}^{(1)} \leftarrow$ the convex hull of the projected values of Reference 1, $X_{n_1\times 11} L_1$.
2: Train a supervised learning model on Reference 2, $X_{n_2\times 11}$, and its odor concentration vector, $y_{n_2}$.
3: $\mathrm{ConvexHull}^{(2)} \leftarrow$ the convex hull of the projected values of Reference 2, $X_{n_2\times 11} L_1$.
4: Do the Sub-Algorithm for new data $x^*$.
5: Predict the odor concentration for new data $x^*$ using the trained supervised learning model.
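Putting the pieces together, a hedged sketch of the Main Algorithm could look as follows. It reuses the hypothetical preprocess_reference1 and classify_zone helpers from the earlier sketches, triangulates the projected reference scores to obtain the two convex hulls, and uses a random forest regressor purely as a placeholder for whichever supervised learning model is trained on Reference 2; Reference 2 is assumed to be already pre-processed.

# Sketch of the Main Algorithm (Scenario 1); helper functions are the
# illustrative ones defined above, and the regressor is only a placeholder.
import numpy as np
from scipy.spatial import Delaunay
from sklearn.ensemble import RandomForestRegressor

def main_algorithm(X_ref1, X_ref2, y_ref2, x_new, is_ao_outlier, k=2):
    # Robust loadings L1 from Reference 1 (pre-processing sketch above).
    X1_imp, _, L1 = preprocess_reference1(X_ref1, k=k)

    # Step 1: ConvexHull^(1) of the projected Reference 1 scores.
    hull1 = Delaunay(X1_imp @ L1)

    # Step 2: supervised model on Reference 2 and its odor concentrations.
    model = RandomForestRegressor(random_state=0).fit(X_ref2, y_ref2)

    # Step 3: ConvexHull^(2) of the projected Reference 2 scores.
    hull2 = Delaunay(X_ref2 @ L1)

    # Step 4: zone assignment for the new sample x*.
    zone = classify_zone(x_new, L1, hull1, hull2, is_ao_outlier)

    # Step 5: odor concentration prediction for x*.
    prediction = model.predict(np.atleast_2d(x_new))[0]
    return zone, prediction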
The above steps are applied in Section 6 to eight months of data collected by the e-nose. To justify our choice of statistical techniques, the proposed methodology is first run on a set of simulated data in the following section.
5 SIMULATION
To emphasize the importance of the assumptions considered in our methodology, such as non-elliptically contoured distributions and robust estimation, we examine the methodology on a set of simulated data. Assume that the rows $x_l^\top = (x_{l1}, x_{l2})$, $l = 1, 2, \ldots, N$, of the data matrix $X_{N\times 2}$ are generated according to a mixture of Gaussian and Student's $t$ distributions, Figure 5 (top left panel). Ignoring the distribution of the data and applying a classical approach to outlier detection mistakenly renders some observations as outliers, Figure 5 (top right panel).
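Such a contrast can be reproduced with a short simulation along the following lines; it mirrors, but does not reproduce, the setting of Figure 5. The sample size, mixture weights, and degrees of freedom are arbitrary illustration values, and scikit-learn's EmpiricalCovariance and MinCovDet serve as stand-ins for the classical and robust estimators of the mean vector and covariance matrix.

# Simulated mixture of a Gaussian bulk and a heavy-tailed Student's t component,
# compared under classical vs. robust (MCD) location/scatter estimation.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
N = 500
n_t = N // 5   # share of Student's t observations (arbitrary choice)

gaussian = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=N - n_t)
student_t = rng.standard_t(df=2, size=(n_t, 2))   # heavy-tailed component
X = np.vstack([gaussian, student_t])              # X is N x 2, as in the text

classical = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

# The classical estimates are inflated by the heavy-tailed points,
# while the MCD estimates track the Gaussian majority.
print("classical mean:", classical.location_)
print("robust mean:   ", robust.location_)
print("classical covariance:\n", classical.covariance_)
print("robust covariance:\n", robust.covariance_)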
The parameters of interest, the mean vector and the covariance matrix, need to be estimated robustly; otherwise, the confidence region misrepresents the underlying distribution. In Figure 5 (bottom left panel), the classical confidence region is pulled toward the outlying observations. In contrast, the robust confidence region accurately unveils the distribution of the majority of observations because of the robust and efficient estimation of the mean and the
covariance matrix. Consequently, the classical prin-
cipal components are affected by the inefficient esti-