MINING QUANTITATIVE ASSOCIATION RULES IN MICROARRAY

DATA USING EVOLUTIVE ALGORITHMS

M. Mart´ınez-Ballesteros, C. Rubio-Escudero, J. C. Riquelme

Department of Computer Science, University of Seville, Seville, Spain

F. Mart´ınez-

Alvarez

Department of Computer Science, Pablo de Olavide University of Seville, Seville, Spain

Keywords:

Data mining, Evolutionary algorithms, Quantitative association rules, MicroArray.

Abstract:

The microarray technique is able to monitor the change in concentration of RNA in thousands of genes si-

multaneously. The interest in this technique has grown exponentially in recent years and the difﬁculties in

analyzing data from such experiments, which are characterized by the high number of genes to be analyzed

in relation to the low number of experiments or samples available. In this paper we show the result of apply-

ing a data mining method based on quantitative association rules for microarray data. These rules work with

intervals on the attributes, without discretizing the data before. The rules are generated by an evolutionary

algorithm.

1 INTRODUCTION

The use of massive processing techniques has revolu-

tionized the biotechnology research and it has highly

increased the amount of data obtained(Durbin et al.,

1998). In particular, microarray technology has revo-

lutionized the biological research due to its ability to

monitor changes in RNA concentration in thousands

of genes simultaneously (Durbin et al., 1998). Re-

search in molecular biology has traditionally focused

on the study gene to gene, but nowadays we are in

the genomic era and genes are studied in thousands or

even whole genomes. Besides the genes, it is neces-

sary to know the relationships between them.

In this context we present the result of applying a

data mining technique, speciﬁcally, association rules,

to gene expression data from experiments using mi-

croarray technology. The aim of this process of min-

ing association rules is to discover the presence of

pairs (attribute - value), which appear in a dataset

with a certain frequency. This technique is applied to

discover associations between genes from microarray

datasets, in which gene expression is linked to another

gene expression, Gen1 ⇒ Gen2.

There are many efﬁcient algorithms to ﬁnd these

rules, most focused on discrete data. However in the

real world, particularly in the problem to deal in this

paper, datasets consists of continuous attributes. In

addition, the tools that work in continuous domains

just discretize the attributes using a speciﬁc strategy

and treat these attributes as if they were discrete (Van-

nucci and Colla, 2004). In this paper, the result of

applying a genetic algorithm (GA) is presented. The

algorithm can ﬁnd association rules in databases with

continuous attributes from microarray data, avoiding

the discretization as a step in the process. The results

will show that the rules obtained have been able to

successfully characterize the data underlying and also

to group relevant genes for the problem studied.

The rest of the paper is divided as follows. Sec-

tion 2 provides the methodology used in this work.

The results obtained by the algorithm developed are

discussed in Section 3. Finally, Section 4 describes

the achieved conclusions.

2 METHODOLOGY

2.1 Search of Rules

This work is focused on a continuous domain. It is

necessary to group the sets of values in intervals to be

able to express the membership of the values to each

574

Martinez Ballesteros M., Rubio Escudero C., C. Riquelme J. and Martíınez-Álvarez F..

MINING QUANTITATIVE ASSOCIATION RULES IN MICROARRAY DATA USING EVOLUTIVE ALGORITHMS.

DOI: 10.5220/0003152705740577

In Proceedings of the 3rd International Conference on Agents and Artiﬁcial Intelligence (ICAART-2011), pages 574-577

ISBN: 978-989-8425-40-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

group. Ranges have not been ﬁxed for intervals. The

Genetic Algorithm ﬁnds and adjust the most appro-

priate intervals to ﬁnd quantitative association rules.

Each individual in the population is a rule. The set

of rules comprising the population undergo an evolu-

tionary process in which mutation and crossover op-

erators will be applied. The individual with the best

ﬁtness at the end of the process represents the best

rule. The user can drive the search process because

the ﬁtness function has been provided with a set of

parameters. Our proposal performs an IRL process

(Iterative Rule Learning) (Venturini, 1993) to penal-

ize instances already covered by rules in order to em-

phasize the covering of instances still not covered.

In the following sections we provide details of the

general scheme of the algorithm, the ﬁtness function,

representation of individuals and genetic operators.

2.2 Scheme of the Algorithm

First, the rules population is initialized and evaluated.

All rules are evaluated according to equation 1. Thus,

in each iteration the selection operator is applied to se-

lect the best rules on the basis of the ﬁtness function.

Then, the crossover operator is applied to the selected

rules while the population size is not completed. In-

dividuals are randomly selected according to p

ut in

order to apply the mutation operator. Finally, the new

population is again evaluated by the ﬁtness function

and the evolutive process restarts. Note that the pro-

cess will be repeated as many times as the maximum

number of preset generations indicates.

2.3 Individuals Codiﬁcation

The lower and upper limits of the intervals of each

attribute will be represented by the different genes of

an individual. Because the attributes are continuous,

individuals are represented by an real coding. An in-

dividual consists of a not ﬁxed number of attributes

less than n, which represents the number of attribute

in the database.

The representation of an individual consists in two

data structures as shown in Figure 1. The upper struc-

ture includes all the attributes of the database, where

is the lower limit of the range and s

is the upper

limit. The bottom structure indicates the membership

of an attribute to the rule represented by a individual.

The type of each attribute t

, can have three values: 0

when the attribute does not belong to the rule, 1 if it

belongs to the antecedent of the rule and 2 when it be-

longs to the consequent part. If an attribute is wanted

to be retrieved for a speciﬁc rule, it can be done by

modifying the value equal to 0 of the type by a value

equal to 1 o or 2 depending on the antecedent or con-

sequent.

Figure 1: Representation of an individual of the population.

An example of one individual of the population is

shown in Figure 2.

∈ [20.1, 23.5] and A

∈ [10.3, 15.8] =⇒ A

∈

[54.4, 59.6].

Figure 2: Example of an individual of the population.

2.4 Initial Population

The number of attributes for each individual is ran-

domly chosen to generate the initial population tak-

ing into account the desired format for the rules. In

addition, the minimum and maximum numbers in the

antecedents and consequents, the minimum and max-

imum number of attributes that belong to rule repre-

sented by an individual are controlled.

2.5 Genetic Operators

The genetic operators implemented in the propose ge-

netic algorithm are: Selection, Crossover and Muta-

tion.

• Selection. An elitist strategy replicating the indi-

vidual with best ﬁtness and a roulette selection-

based method for the remaining individuals ac-

cording to their ﬁtness are used .

• Crossover. Two parents are chosen by the roulette

selection-based method and they are combined to

generate a new individual. The type of all the rel-

evant attributes in both parents are analyzed.

If both parents have an equal type for the same

attribute, it will assigned to offspring. The interval

is obtained as a random value between the limits

of the intervals of both parents.

Nevertheless, if both parents have a different type

for the same attribute, one of the two parents is

randomly chosen and offspring have the intervals

and type attribute of the selected parent.

• Mutation. Individuals of the population are ran-

domly selected in order to apply the mutation de-

MINING QUANTITATIVE ASSOCIATION RULES IN MICROARRAY DATA USING EVOLUTIVE ALGORITHMS

575

pending on a mutation probability p

Mut

. The mu-

tation process consists in modifying individuals

genes, according to a probability p

MutGen

in the

individuals selected. The mutation can be focused

on the attribute type or on the intervals, in which

are possible three separate cases: mutation of the

upper limit, lower limit or both limits of the inter-

val.

For this aim, a random value between 0 and 10%

of the total domain in the attribute is generated

and it is added or subtracted to the limit of the

interval randomly selected.

2.6 Fitness Function

The ﬁtness function calculation involves several mea-

sures that give us information about the rules. In par-

ticular, the most representative are the support and

conﬁdence that will positively affect the rule. How-

ever, it is necessary to take into account a number of

factors with negative affect in the quality of the rule.

In the amplitude of the intervals, the algorithm may

try to extend the intervals to complete the domain of

each attribute. For this aim, it is necessary to include

a measure limiting growth of the intervals during the

evolutive process.

The evaluation function f should be maximized in

the evolutionary process is given by the equation . It

consists in several parameters which values are cal-

culated from the individual multiplied by a weight to

calibrate the effect of each parameter in the overall

evaluation.

f(i) = w

· sup+ w

· conf − w

· recov (1)

· nAttrib − w

· ampl

where sup is the support, conf is the conﬁdence,

recov is the number of recovered instances, nAttrib is

the number of attributes in the rule, ampl is the av-

erage size of intervals of the attributes belong to the

rule and w

, w

and w

are weights in order to

drive the process of search of rules.

The meaning of each parameters in the equation

is:

• Support (sop). Percentage of records in the

dataset covered by the rule.

• Conﬁdence (conf ). Conditional probability of

consequent given the antecedent. Conﬁdence is

calculated dividing the support of the rule and the

support of the antecedent.

• Number of Recovered Instances (recub). It is

used to indicate a sample has already been cov-

ered by a previous rule. Rules covering different

regions of search of space are preferred.

• Number of Attributes (natrib). Number of at-

tributes (genes) belong to the rule (individual).

• Amplitude (ampl). Average of intervals size of

the attributes belong to the rule.

3 RESULTS

The results of applying the algorithm proposed in

Section 2 to a dataset acquired from a microarray

experiment related to inﬂammation and immune re-

sponse are presented. Inﬂammation is a critical pro-

cess because the human body uses to protect it-

self from infections and lesions. In this experi-

ment, conducted at the University of St. Louis, Mis-

souri(Calvano et al., 2005), the blood of eight volun-

teers is analyzed, four treated with a toxin produces

an inﬂammatory process and 4 with placebo. Sam-

ples has been taken at 6 time points over 24 hours,

obtaining a total of 48 microarrays.

The algorithm was tested with the following pa-

rameters of AG: 100 for the size of the population,

100 for the number of generations, 20 for the num-

ber of rules to obtain, 0.8 for the mutation probability

Mut

of the individuals and 0.2 for the mutation prob-

ability p

MutGen

of each gene in the individual. The ﬁt-

ness function weights are: 1 for w

, 0.5 for w

, 0.3 for

, 0.1 for w

, and 0.1 for w

. The reason to assign

a high value to the weight w

is to cover the maxi-

mum number of examples obtained by the rules. The

weight associated to the instances covered by other

rules and the size of the intervals are set to penalize

rules whose intervals are too large and covering ex-

amples already covered by other rules.

The algorithm has been executed 10 times, and

only those rules that cover a minimum of 6 samples

out of 48 (support 12.5 %) have been taken into ac-

count, obtaining a total of 76 rules of the 200 possible

rules (10 executions x 20 rules in each run). The limit

of the support has been set at that value because 6

samples shows data from a complete volunteer, and

such low limit for the support has been chosen be-

cause in this type of experiments we are interested

both in frequent relations, but also in the not so fre-

quent ones (McIntosh and Chawla, 2007).

The average support obtained for the 151 rules

has been 47.17% with a conﬁdence close to 100% for

most of them. The average amplitude of intervals in

the rules was 24.7%, which justiﬁes the use of quanti-

tative rules in place of the classical rules in which the

whole domain of the attribute is taken into account.

The rules obtained have accurately characterized

dataset treated, having two types of rules: those with

a support value between 75 % and 100 %, and those

ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence

576

Table 1: Analyzed Rules.

Id Rule Sup. (%) Conf. (%) Ampl. (%)

1 215091 s at ∈ [98.35 , 376.99] and 215760 s at ∈ [527.04 , 1168.82] 52 100 17

=⇒ 203944 x at ∈ [890.80 , 5308.61]

2 205119 s at ∈ [783.83 , 1527.60] and 215597 x at ∈ [8301.78 , 9819.85] 20 100 16

=⇒ 212967 x at ∈ [2076.59 , 2592.60]

3 222099 s at ∈ [859.491 , 1425.210] 55 100 17

=⇒ 49327 at ∈ [1517.45 , 2239.45]

with support values less than 50 % where in almost

cases cover records or endotoxin-treated group or

placebo group.

The number of rules covering the placebo group is

signiﬁcantly higher, which makes sense because this

group has gene expression values more stable and fre-

quent than the group treated with endotoxin(Rubio-

Escudero, 2007). To examine the relevance of the

rules obtained in the studied problem, we have used

the Onto-CC software (Romero-Zliz et al., 2008),

which retrieves information regarding the functional-

ity of a set of genes that is passed as a query, and a PI

value (the probability of intersection) associated with

the relevance of these genes appear together in one

rule. PI is a value to minimize between 0 and 1 and

considered relevant those obtained under 0.05.

The results of only 3 rules are listed in Table 1 for

readability. When Onto-CC is applied, the PI values

obtained for every rule are quite low, indicating the

relevance of grouping these three genes with respect

to these terms, immune response and related terms are

explicitly included.

4 CONCLUSIONS

In this paper we present the result of applying an evo-

lutive technique for extracting association rules from

microarray data. We have seen the rules obtained are

able to successfully characterize the dataset applied,

either covering almost all samples, or covering sam-

ples only one of the two groups in the data: treated

with endotoxin or treated with placebo. In addition,

the mean amplitude of the intervals was 24.7%, which

justiﬁes the use of quantitative rules in place of the

classical rules.

We have shown the relevance of the rules obtained

for the problem studied using the Onto-CC program.

The PI values obtained show signiﬁcance in the group

of genes found in the rules, and secondly the terms

obtained querying Gene Ontology are closely related

to the problem of inﬂammation.

Thus, we conclude that the use of quantitative as-

sociation rules, in particular those obtained by the

algorithm proposed, is a valid method for analyzing

microarray data, and we consider it a starting point

for future work, applying this technique to other mi-

croarray data, comparing with other analytical tech-

niques and seeing the importance of the inﬂuence

antecedent-consequent obtained by the rules with re-

gard to genetic networks.

ACKNOWLEDGEMENTS

The ﬁnancial support from the Spanish Ministry of

Science and Technology, project TIN 2007-68084-C-

02, and from the Junta de Andaluca, project P07-TIC-

02611, is acknowledged.

REFERENCES

Calvano, S. E., Xiao, W., Richards, D. R., Felciano, R. M.,

Baker, H. V., Cho, R. J., Chen, R. O., Brownstein,

B. H., Cobb, J. P., Tschoeke, S. K., Miller-Graziano,

C., Moldawer, L. L., Mindrinos, M. N., Davis, R. W.,

Tompkins, R. G., Lowry, S. F., and Large Scale Collab

Res Program, I. A. (2005). A network-based analysis

of systemic inﬂammation in humans. Nature, 437.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998).

Biological Sequence Analysis: Probabilistic Models

of Proteins and Nucleic Acids. Cambridge University

Press.

McIntosh, T. and Chawla, S. (2007). High-conﬁdence rule

mining for microarray analysis. IEEE/ACM Transac-

tions on Computational Biology and Bioinformatics,

4(4):611–623.

Romero-Zliz, R., del Val, C., Cobb, J., and Zwir, I. (2008).

Onto-cc: a web server for identifying gene ontology

conceptual clusters. Nucleic Acids Res, 36(4):W352–

W357.

Rubio-Escudero, C. (2007). Fusion of Knowledge towards

Identiﬁcation of Genetic Proﬁles in the Systemic In-

ﬂammation Problem. Ph.D Thesis. Unversidad de

Granada.

Vannucci, M. and Colla, V. (2004). Meaningful discretiza-

tion of continuous features for association rules min-

ing by means of a som. In Proceedings of the Euro-

pean Symposium on Artiﬁcial Neural Networks, pages

489–494.

Venturini, G. (1993). SIA: a Supervised Inductive Algo-

rithm with genetic search for learning attribute based

concepts. In Proceedings of the European Conference

on Machine Learning, pages 280–296.

MINING QUANTITATIVE ASSOCIATION RULES IN MICROARRAY DATA USING EVOLUTIVE ALGORITHMS

577