Multifactorial Dimensionality Reduction for Disordered Trait
Alexander Rakitko
Steklov Mathematical Institute, Russian Academy of Sciences, Moscow, Russia
Keywords:
GWAS, Multifactorial Dimensionality Reduction, Associated Factors, Disordered Response, Cross-validation Procedure.
Abstract:
We develop our recent work on the identification of the factors associated with a complex disease. The case of a disordered discrete trait is studied. We build two models (3D and 2D) for the range of the response variable indicating the state of health of a patient. We consider the problem of optimal forecasting of the response variable depending on a finite collection of factors with values in an arbitrary finite set. The quality of prediction is described by an error function involving a penalty function. Estimation of the error requires a cross-validation procedure. The developed approach provides a basis for identifying the set of significant factors. Such problems arise naturally, e.g., in genome-wide association studies. Using simulated data we illustrate the efficiency of our method.
1 INTRODUCTION
The problem of high dimensionality arises naturally in various stochastic models. For instance, in genetic studies the number of explanatory factors (e.g., SNPs, single nucleotide polymorphisms) $X_1,\dots,X_p$ is much larger than the available sample size. Nowadays, however, most specialists share the paradigm that not all of these factors are relevant to provoking a particular complex disease. So the challenging problem is to find, among a huge number of factors, the collection $X_{k_1},\dots,X_{k_r}$ of factors associated with the disease.
In previous works (Bulinski and Rakitko, 2014) we considered the case when the response variable $Y$ ($Y$ depends on a number of factors $X_1,\dots,X_p$ and indicates the disease status) takes values in the set $\{-m,\dots,0,\dots,m\}$ for some natural $m \in \mathbb{N}$. However, in many applications it is impossible to introduce a linear order on the traits under consideration. In this paper we study a model in which $Y$ can take values in an arbitrary finite set, with no assumption that this set can be linearly ordered.
In medical and biological studies there exists a special research domain called genome-wide association studies (GWAS). This branch of bioinformatics already includes a number of different approaches to the identification of significant factors. Among the powerful statistical methods applied in GWAS one can find principal component analysis (Lee et al., 2012), logic and logistic regression (Ruczinski et al., 2003), (Sikorska et al., 2013), LASSO (Tibshirani and Taylor, 2012) and various machine learning methods (Hastie et al., 2001). In our work we concentrate on the development of the multifactorial dimensionality reduction (MDR) method. This method was first implemented in the paper by M. Ritchie (Ritchie et al., 2001). Our variant of the method is based on the estimation of error functions. Details are given in the next section.
The paper is organized as follows. In the second section we describe our method and build two models for the response variable. Moreover, we introduce a procedure for simulating genetic data. Some statistical results for the proposed estimates are given as well. In the third section we discuss the results of applying our method to the analysis of the generated data. Here we compare the two models and give some recommendations on the choice of the size of the significant collection. A summary of the work is given in the conclusion.
2 MATERIALS AND METHODS
Let $X = (X_1,\dots,X_p)$ be a random vector with components $X_k : \Omega \to \{0,1,\dots,s\}$, where $k = 1,\dots,p$ and $s, p \in \mathbb{N}$. All random variables are defined on a probability space $(\Omega, \mathcal{F}, \mathsf{P})$. In general, for different $k$ one could consider different values of $s$, as this is of no importance. For instance, $X_k$ can characterize a single nucleotide polymorphism (SNP) and take values in the set $\{0, 1, 2\}$ (corresponding to the number of minor alleles), whereas $X_l$, $l \neq k$, can be binary and indicate smoking addiction.
We assume that $X : \Omega \to \mathcal{X}$ (e.g., $\mathcal{X} = \{0,1,\dots,s\}^p$) and $Y : \Omega \to \mathcal{Y}$. In our model the response variable $Y$ depends on the factors $X_1,\dots,X_p$ and describes the state of health of a patient. In some recent papers (Bulinski and Rakitko, 2014) the case of a linearly ordered set $\mathcal{Y}$ was studied. For example, if $\mathcal{Y} = \{-1, 0, 1\}$ then $Y = -1$ or $Y = 1$ means that a person is sick or healthy, respectively. The value 0 can be interpreted as "intermediate": in this "grey zone" one cannot conclude whether the disease appears or not. However, in some applications it is difficult (or even impossible) to introduce a linear order on the set $\mathcal{Y}$. For instance, $Y$ can indicate the subtype of acute ischemic stroke according to the TOAST classification (Adams et al., 1993): large-artery atherosclerosis, cardioembolism or small-artery occlusion. Besides, we should take into account the other two groups, corresponding to stroke of other determined etiology and stroke of undetermined etiology. So we add an extra value of $Y$ which is responsible for such uncertainty.
2.1 2D and 3D Models
Here we consider two models for a linearly unordered set $\mathcal{Y}$. The following theory could be extended, with little effort, to a set $\mathcal{Y}$ of any finite cardinality. For simplicity we assume that $\mathcal{Y} = \{y_0, y_1, y_2, y_3\}$, where $y_1$, $y_2$, $y_3$ correspond, for example, to the three subtypes of acute ischemic stroke and $y_0$ indicates an unclassified patient.
3D-Model. In this model all elements of $\mathcal{Y}$ are equidistant from each other. This means that $\{y_i\}_{i=0}^{3}$ are located at the vertices of a tetrahedron (Figure 1). Without loss of generality, let the edge of the tetrahedron equal 1.
2D-Model. Let us put the elements $y_1$, $y_2$, $y_3$ at the vertices of an equilateral triangle with unit edge. The element $y_0$ is located at the center of the triangle (Figure 2).
2.2 Methods
In this subsection we describe a new modification of the MDR-EFE method (Bulinski and Rakitko, 2014) adapted to the 2D- and 3D-Models introduced above.
Figure 1: 3D-case.
Figure 2: 2D-case.
MDR-EFE Method. To predict $Y$ we use a deterministic function $f : \mathcal{X} \to \mathcal{Y}$ of the factors $X_1,\dots,X_p$. The quality of such an $f$ is determined by means of the error function $\mathrm{Err}(f)$ involving a penalty function $\psi : \mathcal{Y} \to \mathbb{R}_+$:
$$\mathrm{Err}(f) := \mathsf{E}\,|Y - f(X)|\,\psi(Y), \qquad (1)$$
where $|Y - f(X)|$ stands for the Euclidean distance between the corresponding points of $\mathcal{Y}$ in the 2D- or 3D-Model. The trivial case $\psi \equiv 0$ is excluded. In fact, the choice of the penalty function $\psi$ gives an additional degree of freedom. For further analysis we take the function
$$\psi(y) = \frac{c}{\mathsf{P}(Y = y)}, \quad y \in \mathcal{Y}, \quad c = \mathrm{const} > 0, \qquad (2)$$
proposed by Velez (Velez et al., 2007). Assuming here that $\mathsf{P}(Y = y) > 0$ for all $y \in \mathcal{Y}$, one can take $c = 1$ without loss of generality. In (Bulinski, 2014) it was explained why this choice is natural.
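As an illustration, the empirical counterpart of (1) with the penalty (2) can be sketched as follows; the label coordinates, the encoding of $\mathcal{Y}$ as $\{0,1,2,3\}$ and the function names below are merely one possible choice.

```python
import numpy as np

# Coordinates of the labels in the 2D-Model: y_1, y_2, y_3 at the vertices of a
# unit equilateral triangle, y_0 at its center (distance 1/sqrt(3) to each vertex).
# Labels are encoded as 0 (= y_0), 1, 2, 3.
COORDS_2D = {
    1: np.array([0.0, 0.0]),
    2: np.array([1.0, 0.0]),
    3: np.array([0.5, np.sqrt(3) / 2]),
    0: np.array([0.5, np.sqrt(3) / 6]),
}

def penalized_error(y_true, y_pred, coords=COORDS_2D):
    """Empirical analogue of Err(f) = E |Y - f(X)| psi(Y) with psi(y) = 1 / P(Y = y)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels, counts = np.unique(y_true, return_counts=True)
    freq = dict(zip(labels, counts / len(y_true)))           # estimate of P(Y = y)
    dist = np.array([np.linalg.norm(coords[a] - coords[b])   # |Y - f(X)| in the embedding
                     for a, b in zip(y_true, y_pred)])
    psi = np.array([1.0 / freq[y] for y in y_true])          # penalty psi(Y)
    return float(np.mean(dist * psi))
```

For the 3D-Model one would instead use the coordinates of a regular tetrahedron with unit edge, so that all pairwise distances between labels equal 1.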
For $r = 1,\dots,p$ set $\mathcal{X}_r = \{0,1,\dots,s\}^r$. Then $\mathcal{X} = \mathcal{X}_p$. We write $\alpha = (k_1,\dots,k_r)$, $X_\alpha = (X_{k_1},\dots,X_{k_r})$ and $x_\alpha = (x_{k_1},\dots,x_{k_r})$, where $x_i \in \{0,\dots,s\}$, $i = 1,\dots,p$. In many models it is natural to assume that the response variable $Y$ depends significantly only on a certain collection of factors $X_{k_1},\dots,X_{k_r}$, where $1 \le k_1 < \dots < k_r \le p$. In other words, for $x \in \mathcal{X}$ with $\mathsf{P}(X = x) > 0$ and $y \in \mathcal{Y}$ the following relation holds true:
$$\mathsf{P}(Y = y \mid X = x) = \mathsf{P}(Y = y \mid X_\alpha = x_\alpha). \qquad (3)$$
Here $\mathsf{P}(X_\alpha = x_\alpha) \ge \mathsf{P}(X = x) > 0$.
In medical and biological studies the factors $X_{k_1},\dots,X_{k_r}$ can be viewed as essential for provoking a complex disease, whereas the impact of the others can be neglected. Any collection of indexes $\{k_1,\dots,k_r\}$ satisfying (3) is called significant.
Fortunately, it is possible to describe all functions $f : \mathcal{X}_r \to \mathcal{Y}$ which solve the problem $\mathrm{Err}(f) \to \inf$. It is natural to approximate $Y$ by means of one of these optimal functions. So we take a certain function $f^{\beta}_{opt}$, defined below, from the whole class of optimal functions.
Optimal function $f^{\beta}_{opt}$ in the 3D-Model. A simple calculation shows that in the 3D-Model the following $f^{\beta}_{opt}$ is optimal. Let us define the system of sets $B^{\beta}_{y}$, $y \in \mathcal{Y}$, as follows:
$$x \in B^{\beta}_{y} \iff \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y) > \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = z) \ \text{ for all } z \neq y,\ z \in \mathcal{Y}.$$
If for some $z \neq y$, $z, y \in \mathcal{Y}$, and $x \in \mathcal{X}$ one has $\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y) = \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = z)$, then we add this $x$ to one of $B^{\beta}_{y}$ or $B^{\beta}_{z}$ at one's discretion. The augmented sets $B^{\beta}_{y}$ are denoted by $A^{\beta}_{y}$ for each $y \in \mathcal{Y}$. Obviously, $\{A^{\beta}_{y}\}_{y \in \mathcal{Y}}$ is a partition of the set $\mathcal{X}$. Then the following function is optimal:
$$f^{\beta}_{opt}(x) = \sum_{y \in \mathcal{Y}} y\,\mathbb{I}\{x \in A^{\beta}_{y}\}. \qquad (4)$$
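A plug-in version of the rule (4) simply replaces the conditional probabilities $\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y)$ by empirical frequencies. The sketch below uses our own notation; cells of $x_{\beta}$ never observed in the sample are mapped to the most frequent label, a convention not fixed by (4).

```python
import numpy as np
from collections import Counter

def fit_f_opt_3d(X, y, beta):
    """Plug-in estimate of f^beta_opt in the 3D-Model: for each observed value of
    x_beta choose the label maximizing the empirical P(X_beta = x_beta | Y = y)."""
    X = np.asarray(X)
    y = np.asarray(y)
    labels = np.unique(y)
    # empirical conditional frequencies P(X_beta = cell | Y = label)
    cond = {lab: Counter(map(tuple, X[y == lab][:, beta])) for lab in labels}
    n_lab = {lab: max((y == lab).sum(), 1) for lab in labels}

    cells = set()
    for c in cond.values():
        cells.update(c)
    rule = {}
    for cell in cells:
        probs = [cond[lab][cell] / n_lab[lab] for lab in labels]
        rule[cell] = labels[int(np.argmax(probs))]   # ties: first maximizer wins
    majority = labels[int(np.argmax([(y == lab).sum() for lab in labels]))]

    def predict(Xnew):
        Xnew = np.asarray(Xnew)
        return np.array([rule.get(tuple(row[beta]), majority) for row in Xnew])
    return predict
```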
Optimal function $f^{\beta}_{opt}$ in the 2D-Model. In this case let us define the sets $\{B^{\beta}_{y}\}_{y \in \mathcal{Y}}$ in the following way:
$$x \in B^{\beta}_{y_0} \iff \begin{cases} \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_0) > (1-\sqrt{3})\bigl[\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_2) + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_3)\bigr] + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_1),\\ \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_0) > (1-\sqrt{3})\bigl[\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_1) + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_2)\bigr] + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_3),\\ \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_0) > (1-\sqrt{3})\bigl[\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_1) + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_3)\bigr] + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_2). \end{cases} \qquad (5)$$
And for $i = 1, 2, 3$ and $k, l \in \{1,2,3\} \setminus \{i\}$, $k \neq l$,
$$x \in B^{\beta}_{y_i} \iff \begin{cases} \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_i) > (\sqrt{3}-1)\bigl[\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_k) + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_l)\bigr] + \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_0),\\ \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_i) > \mathsf{P}(X_{\beta} = x_{\beta} \mid Y = y_j), \quad j \in \{1,2,3\} \setminus \{i\}. \end{cases} \qquad (6)$$
Then, after augmentation of the sets $\{B^{\beta}_{y}\}_{y \in \mathcal{Y}}$, we come to the sets $\{A^{\beta}_{y}\}_{y \in \mathcal{Y}}$ and the optimal function
$$f^{\beta}_{opt}(x) = \sum_{y \in \mathcal{Y}} y\,\mathbb{I}\{x \in A^{\beta}_{y}\}. \qquad (7)$$
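The inequalities (5)-(6) can be read as a geometric rule: for a cell $x_{\beta}$ one chooses the label $v$ minimizing $\sum_{z \in \mathcal{Y}} |z - v|\,\mathsf{P}(X_{\beta} = x_{\beta} \mid Y = z)$, where $|z - v|$ is the distance in the triangle embedding. A plug-in sketch of this rule for the 2D-Model is given below (with empirical frequencies, and with tie-breaking and unseen cells handled by simple conventions of our own); in the 3D-Model, where all pairwise distances are equal, the same scheme reduces to the rule of (4).

```python
import numpy as np
from collections import Counter

SQ3 = np.sqrt(3.0)
# 2D-Model embedding of Y: y_1, y_2, y_3 at the vertices of a unit equilateral
# triangle and y_0 at its center; labels are encoded as 0, 1, 2, 3.
COORDS_2D = {0: (0.5, SQ3 / 6), 1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.5, SQ3 / 2)}

def fit_f_opt_2d(X, y, beta):
    """Plug-in rule for the 2D-Model: for every observed cell x_beta choose the label v
    minimizing sum_z dist(z, v) * P_hat(X_beta = x_beta | Y = z), which reproduces the
    system (5)-(6) up to tie-breaking."""
    coords = {lab: np.asarray(c) for lab, c in COORDS_2D.items()}
    X, y = np.asarray(X), np.asarray(y)
    labels = sorted(coords)
    counts = {lab: Counter(map(tuple, X[y == lab][:, beta])) for lab in labels}
    n_lab = {lab: max((y == lab).sum(), 1) for lab in labels}

    def decide(cell):
        cond = {lab: counts[lab][cell] / n_lab[lab] for lab in labels}
        costs = [sum(np.linalg.norm(coords[z] - coords[v]) * cond[z] for z in labels)
                 for v in labels]
        return labels[int(np.argmin(costs))]

    cells = set().union(*(set(c) for c in counts.values()))
    rule = {cell: decide(cell) for cell in cells}
    fallback = max(labels, key=lambda lab: n_lab[lab])    # unseen cells: majority label
    return lambda Xnew: np.array([rule.get(tuple(row[beta]), fallback)
                                  for row in np.asarray(Xnew)])
```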
In fact, the function $f^{\beta}_{opt}$ depends only on $x_{\beta}$. Moreover, if the collection of indexes $\alpha$ is significant, then the optimality of $f^{\alpha}_{opt}$ implies, for any collection $\beta = (m_1,\dots,m_r)$ with $1 \le m_1 < \dots < m_r \le p$, the following relation:
$$\mathrm{Err}(f^{\alpha}_{opt}) \le \mathrm{Err}(f^{\beta}_{opt}). \qquad (8)$$
Let $\xi_1, \xi_2, \dots$ be a sequence of independent identically distributed (i.i.d.) random vectors having the same law as $(X, Y)$. For $N \in \mathbb{N}$, set $\xi^N = (\xi_1,\dots,\xi_N)$. We will use an approximation of $\mathrm{Err}(f)$ by means of $\xi^N$ (as $N \to \infty$) and a prediction algorithm (PA). This PA employs a function $f_{PA} = f_{PA}(x, \xi^N)$ defined for $x \in \mathcal{X}$ and $\xi^N$ and taking values in $\mathcal{Y}$. More exactly, we operate with a family of functions $f_{PA}(x, v_t)$ (with values in $\mathcal{Y}$) defined for $x \in \mathcal{X}$ and $v_t \in (\mathcal{X} \times \mathcal{Y})^t$, $t \in \mathbb{N}$; to simplify the notation we write $f_{PA}(x, v_t)$ for every member of this family.
Following (Bulinski and Rakitko, 2014) we can construct an estimate of $\mathrm{Err}(f)$ involving $\xi^N$, the prediction algorithm defined by $f_{PA}$ and $K$-fold cross-validation (on cross-validation we refer, e.g., to (Arlot and Celisse, 2010)).
Theorem 1. Let $\alpha = (k_1,\dots,k_r)$, where $\{k_1,\dots,k_r\} \subset \{1,\dots,n\}$ is a significant collection. Then, for any $\varepsilon > 0$ and each $\beta = (m_1,\dots,m_r)$ with $\{m_1,\dots,m_r\} \subset \{1,\dots,n\}$, the following inequality holds:
$$\widehat{\mathrm{Err}}_K(\widehat{f}^{\,\alpha}_{PA}) \le \widehat{\mathrm{Err}}_K(\widehat{f}^{\,\beta}_{PA}) + \varepsilon \quad \text{a.s.} \qquad (9)$$
for all $N$ large enough.
Theorem 1 shows that it is quite natural to take for further analysis, as significant, such a collection of indexes $\{k_1,\dots,k_r\} \subset \{1,\dots,n\}$ that $\widehat{\mathrm{Err}}_K(\widehat{f}^{\,\alpha}_{PA}, \xi^N)$ with $\alpha = (k_1,\dots,k_r)$ has the minimal value (or a value close to the minimal one) among all $\widehat{\mathrm{Err}}_K(\widehat{f}^{\,\beta}_{PA}, \xi^N)$, where $\beta = (m_1,\dots,m_r)$ and $\{m_1,\dots,m_r\} \subset \{1,\dots,n\}$.
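A simplified sketch of this selection step is given below: a $K$-fold cross-validated error estimate for a fixed subset $\beta$, and an exhaustive ranking of all $r$-element subsets. Here `fit_predictor` stands for any prediction algorithm $f_{PA}$ (for instance, the plug-in rules sketched above) and `penalized_error` for the empirical analogue of (1)-(2); both the splitting scheme and the names are only illustrative.

```python
import itertools
import numpy as np

def cv_error(X, y, beta, fit_predictor, penalized_error, K=5, seed=0):
    """K-fold cross-validated estimate of Err for the factor subset beta."""
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        predict = fit_predictor(X[train], y[train], beta)   # f_PA built on the training part
        errors.append(penalized_error(y[test], predict(X[test])))
    return float(np.mean(errors))

def best_subsets(X, y, r, fit_predictor, penalized_error, K=5, top=10):
    """Rank all r-element subsets of factors by their cross-validated error estimate."""
    p = np.asarray(X).shape[1]
    scores = [(cv_error(X, y, list(beta), fit_predictor, penalized_error, K), beta)
              for beta in itertools.combinations(range(p), r)]
    return sorted(scores)[:top]
```

To choose $r$ one can repeat the search for $r = 1, 2, \dots$ and stop as soon as the minimal error no longer decreases; this heuristic is used in Section 3.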
2.3 Simulation
Our next aim is to test the MDR-EFE algorithm on generated data. So in this subsection we introduce a way to simulate genetic markers ($X$) and the corresponding phenotypes ($Y$).
First of all we generate an array of genotypes. We define the marginal distribution of each factor by selecting allele frequencies in a certain way and add a prescribed correlation structure responsible for linkage disequilibrium (LD).
Figure 3: Correlation matrix.
Let us assume that there are $N$ patients and for each of them we observe $p$ factors. Then set $X_i = (X_{i,1},\dots,X_{i,p})$ to be the genetic information about the $i$-th person. The disease status for this patient will be defined below. Every factor takes values in the set $\{0, 1, 2\}$. For all $j \in \{1,\dots,p\}$ we generate numbers $q^{(1)}_j$ and $q^{(2)}_j$ such that $q^{(1)}_j \sim U[0,1]$ and $q^{(2)}_j \sim U[q^{(1)}_j, 1]$. We consider the collection $p_j = (p_{j,AA}, p_{j,Aa}, p_{j,aa}) = [\,q^{(1)}_j,\; q^{(2)}_j - q^{(1)}_j,\; 1 - q^{(2)}_j\,]$ (here the square brackets $[\cdot]$ mean ordering in ascending order) as the marginal distribution of the $j$-th factor. For any $j$ we can write
$$p_{j,A} = p_{j,AA} + \frac{p_{j,Aa}}{2}.$$
Here "A" and "a" denote the major and minor allele, respectively.
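A minimal sketch of this step (the variable and function names are ours): draw $q^{(1)}_j$ and $q^{(2)}_j$, split the unit interval into three parts, order them as described above and compute the allele frequency $p_{j,A}$.

```python
import numpy as np

def genotype_marginals(p, rng=None):
    """For each of p factors return (p_AA, p_Aa, p_aa) and the allele frequency p_A."""
    rng = rng or np.random.default_rng(0)
    q1 = rng.uniform(0.0, 1.0, size=p)                 # q_j^(1) ~ U[0, 1]
    q2 = rng.uniform(q1, 1.0)                          # q_j^(2) ~ U[q_j^(1), 1]
    parts = np.sort(np.column_stack([q1, q2 - q1, 1.0 - q2]), axis=1)  # ascending order
    p_AA, p_Aa, p_aa = parts[:, 0], parts[:, 1], parts[:, 2]
    p_A = p_AA + p_Aa / 2.0                            # frequency of allele "A"
    return p_AA, p_Aa, p_aa, p_A
```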
We sample a matrix $A$ of dimension $N \times p$ with independent elements such that $A_{i,j} \sim \mathrm{Ber}(p_{j,A})$. To obtain the desired correlations between factors we apply the Iman and Conover algorithm (Iman and Conover, 1982) to the matrix $A$. We take a block-diagonal correlation matrix whose different blocks may correspond to different genes. Within each block the correlations are constructed by means of the positive definite function
$$LD(i,j) = \exp\{-|i-j|/c\}\,(1 + |i-j|/c).$$
In such a way we get the alleles on the first chromosome. Using the same procedure we generate a matrix $B$, with the only difference that $B_{i,j} \sim \mathrm{Ber}(p_{j,A|A_{i,j}})$, where $p_{j,A|A_{i,j}}$ is the conditional probability of finding allele A on the second chromosome given allele $A_{i,j}$ on the first one. One can see that in this case
$$X_{i,j} = A_{i,j} + B_{i,j} = \begin{cases} 0, & \text{with probability } p_{j,AA},\\ 1, & \text{with probability } p_{j,Aa},\\ 2, & \text{with probability } p_{j,aa}. \end{cases}$$
Parameters of the Simulation:
1. $N = 6000$;
2. $p = 10$;
3. the correlation matrix consists of two blocks of size $5 \times 5$ with $c_1 = 2$ and $c_2 = 1$;
4. $Y \in \{y_1, y_2, y_3\}$ with $\mathsf{P}(Y = y_k) = \theta_k$, $k = 1, 2, 3$.
To define $\theta = (\theta_1, \theta_2, \theta_3)$ we introduce
$$\alpha_{i,1} = \exp\{1 + 2X_{i,1} - 1.5X_{i,2} + X_{i,3}\},$$
$$\alpha_{i,2} = \exp\{1 + 0.5X_{i,1} + 2X_{i,2} - 1.5X_{i,3}\},$$
$$\alpha_{i,3} = \exp\{1 + 0.5X_{i,1} + 0.6X_{i,2} + 0.7X_{i,3}\},$$
and assume that $(\theta_1, \theta_2, \theta_3) \sim \mathrm{Dir}(\alpha_{i,1}, \alpha_{i,2}, \alpha_{i,3})$.
To add more noise and uncertainty to our data we change the disease status of each patient to $y_0$ with probability 0.1. After the simulation the following distribution of the response variable $Y$ was obtained: 611, 1856, 1169, 2364.
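A sketch of the phenotype step under these parameters: for every patient the Dirichlet parameters are computed from $X_{i,1}$, $X_{i,2}$, $X_{i,3}$, a vector $\theta$ is drawn, the label is sampled from $\{y_1, y_2, y_3\}$ with probabilities $\theta$, and finally the status is switched to $y_0$ with probability 0.1 (labels are encoded as 0, 1, 2, 3; the sign pattern follows the formulas as written above).

```python
import numpy as np

def simulate_phenotypes(X, rng=None):
    """Response Y in {0, 1, 2, 3}: labels 1-3 drawn via a per-patient Dirichlet model
    on the first three factors, label 0 injected as noise with probability 0.1."""
    rng = rng or np.random.default_rng(0)
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    alpha = np.column_stack([
        np.exp(1 + 2.0 * x1 - 1.5 * x2 + 1.0 * x3),
        np.exp(1 + 0.5 * x1 + 2.0 * x2 - 1.5 * x3),
        np.exp(1 + 0.5 * x1 + 0.6 * x2 + 0.7 * x3),
    ])
    y = np.empty(len(X), dtype=int)
    for i, a in enumerate(alpha):
        theta = rng.dirichlet(a)                  # (theta_1, theta_2, theta_3)
        y[i] = rng.choice([1, 2, 3], p=theta)     # draw the subtype
    noise = rng.random(len(X)) < 0.1              # uncertainty: relabel to y_0
    y[noise] = 0
    return y
```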
3 RESULTS
Here we demonstrate the results obtained by applying our algorithm. In Table 1 one can see the top 10 collections (those with the minimal estimates of the error function Err). The estimates are computed assuming the 3D-model.
Table 1: 3D-case.
n_1  n_2  n_3  Error
1    2    3    2.3794
2    3    5    2.4929
2    3    9    2.4967
2    3    4    2.5034
2    3    6    2.5043
2    3    8    2.5088
2    3    10   2.5111
2    3    7    2.5285
1    2    10   2.6117
1    2    5    2.6178
In Table 2 the results for the 2D-model are listed. In both models the significant collection is identified correctly. However, one can see that the gap between the first and the second collections in the 2D-case is almost twice as big as in the 3D-case. This may indicate that the model with the triangle and its center works better with data involving uncertainty. It should be noted that, in fact, we are interested not in minimizing the absolute value of the error, but in widening the gap between the significant collection and the other collections.
Table 2: 2D-case.
n_1  n_2  n_3  Error
1    2    3    2.5394
2    3    9    2.7503
2    3    8    2.7539
2    3    5    2.7595
2    3    10   2.7647
2    3    6    2.7662
2    3    7    2.7682
2    3    4    2.7697
1    2    5    2.7754
1    2    6    2.8143
Table 3: r-variation.
n_1  n_2  n_3  n_4  n_5  Error
3    -    -    -    -    2.95372
2    3    -    -    -    2.74907
1    2    3    -    -    2.56917
1    2    3    4    -    2.57371
1    2    3    5    9    2.56517
There is no formal rule for the choice of the optimal $r$ (the size of the significant collection). But it seems quite natural to stop increasing $r$ once doing so no longer decreases the estimate of the error. Table 3 contains the results of another application of our method to simulated data; it lists the estimates of Err for different $r$. It is not difficult to see that the error does not decrease when $r$ exceeds 3 (the number of significant factors). Besides, one more reason to choose $r = 3$ is that for $r = 4$ the gap between the first and the second collections is just 0.0003 (the error of the collection $(X_1, X_2, X_3, X_4)$ equals 2.57371 and the error of the collection $(X_1, X_2, X_3, X_6)$ equals 2.57396). This means that the factors $X_4$ and $X_6$ are not in strong association with the trait, in contrast to the factors $X_1$, $X_2$, $X_3$.
4 CONCLUSIONS
In this paper we studied the problem of identifying the collection of significant factors determining a disordered complex trait. We introduced two models for the set of possible values of the response variable and developed a multifactorial dimensionality reduction approach based on the estimation of an error function. Using simulated data we demonstrated the efficiency of our method. A comparison of our algorithm with other dimensionality reduction methods (e.g., Discriminant Principal Component Analysis) remains for further research.
ACKNOWLEDGEMENTS
This work is partially supported by RSF grant 14-21-00162.
REFERENCES
Adams, H. P., Bendixen, B. H., Kappelle, L. J., Biller, J.,
Love, B. B., Gordon, D. L., and Marsh (1993). Clas-
sification of subtype of acute ischemic stroke. Defi-
nitions for use in a multicenter clinical trial. TOAST.
Trial of Org 10172 in Acute Stroke Treatment. Stroke,
24(1):35–41.
Arlot, S. and Celisse, A. (2010). A survey of cross-
validation procedures for model selection. Statistics
Surveys, 4:40–79.
Bulinski, A. (2014). On foundation of the dimensionality
reduction method for explanatory variables. Journal
of Mathematical Sciences, 199(2):113–122.
Bulinski, A. and Rakitko, A. (2014). Estimation of
nonbinary random response. Doklady Mathematics,
89(2):225–229.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The
Elements of Statistical Learning. Springer Series in
Statistics. Springer New York Inc., New York, NY,
USA.
Iman, R. L. and Conover, W. J. (1982). A distribution-free
approach to inducing rank correlation among input
variables. Communications in Statistics - Simulation
and Computation, 11(3):311–334.
Lee, S., Epstein, M. P., Duncan, R., and Lin, X. (2012).
Sparse principal component analysis for identifying
ancestry-informative markers in genome-wide associ-
ation studies. Genetic Epidemiology, 36(4):293–302.
Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R.,
Dupont, W. D., Parl, F. F., and Moore, J. H.
(2001). Multifactor-dimensionality reduction reveals
high-order interactions among estrogen-metabolism
genes in sporadic breast cancer. The American Jour-
nal of Human Genetics, 69(1):138 – 147.
Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003).
Logic regression. Journal of Computational and
Graphical Statistics, 12(3):475–511.
Sikorska, K., Lesaffre, E., Groenen, P. F. J., and Eilers, P.
H. C. (2013). Gwas on your notebook: fast semi-
parallel linear and logistic regression for genome-
wide association studies. BMC Bioinformatics, pages
166–166.
Tibshirani, R. J. and Taylor, J. (2012). Degrees of free-
dom in lasso problems. The Annals of Statistics,
40(2):1198–1232.
Velez, D. R., White, B. C., Motsinger, A. A., Bush, W. S.,
Ritchie, M. D., Williams, S. M., and Moore, J. H.
(2007). A balanced accuracy function for epista-
sis modeling in imbalanced datasets using multifac-
tor dimensionality reduction. Genetic Epidemiology,
31(4):306–315.