RULE EXTRACTION FROM MEDICAL DATA WITHOUT

DISCRETIZATION OF NUMERICAL ATTRIBUTES

Juan L. Domínguez-Olmedo, Jacinto Mata, Victoria Pachón and Manuel J. Maña

Escuela Técnica Superior de Ingeniería, Universidad de Huelva, Ctra. Palos de la Frontera SN, Huelva, Spain

Keywords: Data mining, Association rules, Atherosclerosis data.

Abstract: Association rule mining is a popular technique used to find associations between attributes in a dataset.

When using deterministic algorithms, if the attributes have numerical values the usual approach is to

discretize them defining proper intervals. But the discretization can notably affect the quality of the rules

generated. This work presents a method based on a deterministic exploration of the interval search space

without a previous discretization of the numerical attributes. It has been applied to medical data from an

atherosclerosis study. The quality of the obtained rules seems to support this method as a valid alternative

for this kind of rule extraction.

1 INTRODUCTION

In computer science, the field of Knowledge

Discovery in Databases (KDD) treats the problem of

finding useful knowledge from data. It is based on

several stages that try to derive information of

interest from raw data: selection, preprocessing,

transformation, data mining and evaluation (Fayyad,

1996).

One kind of task of application in the data

mining step is association rule learning. This

technique is used to discover associations between

several variables (attributes) in a dataset, which

could be of interest because they might show

unexpected or unknown relationships between the

variables they associate.

Association rules are not predictive but

descriptive. Descriptive mining tasks characterize

the general properties of the data (Han, 2006).

In this work, a deterministic method to generate

association rules without the discretization of

numerical variables is applied to medical data from

an atherosclerosis study.

Concretely, the STULONG dataset has been

used to extract association rules from it. This dataset

holds data from a longitudinal study of the factors of

the atherosclerosis in the population of 1417 middle

aged men (Boudík, 2004).

Atherosclerosis (also known as arteriosclerotic

vascular disease) is a condition in which an artery

wall thickens as a result of the accumulation of fatty

materials such as cholesterol. Among the first

symptom of atherosclerotic cardiovascular disease is

heart attack or sudden cardiac death.

The organization of this paper is as follows. The

next section provides a preliminary on association

rules and some quality measures. Section 3 describes

the method employed. In section 4 the experimental

results are shown. Finally, section 5 provides some

conclusions.

2 ASSOCIATION RULES

Association rule learning is a popular technique used

to find associations between several attributes in a

dataset. An association rule takes the form A  C,

where A and C express conditions on attributes of

the dataset, respectively, the antecedent and the

consequent of the rule.

At the beginning of the use of association rules

for data mining tasks, its application was mainly to

transactional format data, as for market basket

datasets (Agrawal, 1993), where the aim was to

discover regularities between products in large scale

transaction data from supermarkets, as the basis for

decisions about marketing activities.

In addition to this early application, association

rules are employed today in many application areas

including scientific data analysis, Web usage mining

or bioinformatics.

397

L. Domínguez-Olmedo J., Mata J., Pachón V. and J. Maña M..

RULE EXTRACTION FROM MEDICAL DATA WITHOUT DISCRETIZATION OF NUMERICAL ATTRIBUTES.

DOI: 10.5220/0003784603970400

In Proceedings of the International Conference on Health Informatics (HEALTHINF-2012), pages 397-400

ISBN: 978-989-8425-88-1

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Rules potentially of interest (strong rules) are

those ones with “good” measures of their support

and confidence. The support evaluates the number of

instances (also it can be shown as a percentage) in

which both the antecedent and the consequent of the

rule hold. The confidence is the quotient between the

support of the rule and the number of instances in

which the antecedent holds (the accuracy of the

rule):

conf (A  C) = supp (A  C) / supp (A)

(1)

So, the extraction of association rules is based on the

search for those ones satisfying minsup and minconf,

thresholds for the minimum support and minimum

confidence of a rule to be considered interesting.

Another measure of the interestingness of a rule is

the lift (Brin, 1997), which measures how many

times more often the antecedent and consequent hold

together in the dataset than would be expected if

they were statistically independent. It can be

computed by the quotient between the confidence of

the rule and the probability of the consequent:

lift (A



C) = conf (A



C) / P (C)

(2)

Several methods have been developed to treat this

problem, mainly variations of the Apriori algorithm

(Bodon, 2005); (Borgelt, 2003).

Nevertheless, applying these methods to data

containing numerical (quantitative) attributes, the

common case in many datasets, can not be done

directly (Srikant, 1996).

The typical approach to the treatment of

numerical attributes is by a previous discretization of

them. But when discretization is applied to

numerical attributes in association rule mining, it is

not possible to employ the usual methods of

application in classification models, such as those

based on the information theory (Tsai, 2008); (Lee,

2007). Instead, unsupervised discretization methods

such as equi-width or equi-frequency have to be used

(Liu, 2002).

3 METHOD EMPLOYED

In this work, we have used a method based on a

deterministic approach to treat the problem of

generating association rules without a previous

discretization of the numerical attributes.

In contrast to the typical deterministic fashion of

obtaining quantitative association rules, that is, by

previously discretizing those attributes, the method

employed is based on a dynamic generation of

intervals for each numerical attribute, searching for

valid rules satisfying the thresholds minsup and

minconf.

The method also employs auxiliary data

structures and certain optimizations to reduce the

search and improve the quality of the rules extracted.

Following are described its main features:

 The bounds of the intervals of the numerical

attributes are restricted to existing values in the

dataset.

 To have an efficient way of generating the

interval bounds and calculating the rule quality

measures, several auxiliary tables are used. The

bounds are going to be searched in the range [1, n],

where n is the number of instances of the dataset.

And at the end, the bounds are transformed into the

original values for each attribute.

 To reduce the number of rules generated,

although probably discarding some rules of good

quality, a parameter delta  [0, 1] is also used to

control the exhaustivity of the rule searching

process.

4 APPLICATION TO MEDICAL

DATA

The method described has been applied to medical

data from an atherosclerosis study.

STULONG is a dataset concerning the twenty

years lasting longitudinal study of the factors of the

atherosclerosis in the population of 1417 middle

aged men. Its table Entry holds results of the entry

examinations of each patient, and the table Death

holds data concerning the death of 389 patients. In

the experiments the table Entry_Dead has been used.

This table is the join of Entry and Death tables, with

some previous preprocessing (Salleb, 2004).

First, we have searched for rules associating

attributes of the group “Physical examination” (BMI

“Body Mass Index”, SYST “Blood pressure

systolic”, DIAST “Blood pressure diastolic”, TRIC

“Skin fold triceps” and SUBSC “Skin fold

subscapularis”) with attributes of the group

“Biochemical examination” (CHLST “Cholesterol”,

TRIGL “Triglycerides”, MOC_SUC “Urine sugar”

and MOC_ALB “Urine albumen”).

We have run the algorithm searching for those

kinds of rules, with the parameters minsup = 29

(2%), minconf = 0.5 and delta = 0.2. A selection of

the rules generated is shown in Table 1, based on

rules with high values of lift or with both high

confidence and not-low lift.

HEALTHINF 2012 - International Conference on Health Informatics

398

Table 1: A selection of rules associating the groups of

attributes “Physical examination” and “Biochemical

examination”.

Rule Sup Conf Lift

CHLST [112, 242]

TRIGL [28, 71]



BMI [19.05, 28.41]

TRIC [3, 35]

SUBSC [7, 16]

31 0.79 2.3

DIAST [82, 125]

BMI [24.11, 27.36]

TRIC [11, 35]

SUBSC [16, 70]



MOC_SUC = ‘no’

MOC_ALB = ‘no’

CHLST [221, 250]

TRIGL [105, 274]

29 0.51 2.6

TRIC [7, 12]

SUBSC [32, 49]



MOC_SUC = ‘no’

CHLST [211, 300]

TRIGL [103, 350]

29 1.00 2.0

The first of the shown rules states that “79% of

patients having a cholesterol measure in the interval

[112, 242] and a triglycerides measure in the

interval [28, 71], also had a value of BMI in the

interval [19.05, 28.41], a value in the interval [3,

35] for skin fold triceps and a value in the interval

[7, 16] for skin fold subscapularis. The conditions of

the consequent occur 2.3 more times in the group of

patients holding the conditions of the antecedent

than in the whole group of studied patients”.

The last shown rule states that “all patients with

a value in the interval [7, 12] for skin fold triceps

and a value in the interval [32, 49] for skin fold

subscapularis, had also no urine sugar, a

cholesterol measure in the interval [211, 300] and a

triglycerides measure in the interval [103, 350].

The conditions of the consequent occur 2 more times

in the group of patients holding the conditions of the

antecedent than in the whole group of studied

patients”.

We have also searched for rules associating the

attribute

DEATH? with the attributes ALCO_CONS

(“Alcohol consumption”), TOBA_CONS (“Tobacco

consumption”), TOBA_DURA (“Smoking

duration”), MOC_SUC (“Urine sugar”), MOC_ALB

(“Urine albumen”), CHLST (“Cholesterol”), TRIGL

(“Triglycerides”), SYST (“Blood pressure systolic”),

DIAST (“Blood pressure diastolic”) and BMI (“Body

Mass Index”).

The attribute ALCO_CONS measures the volume

of alcohol ingested, taking into account three

factors: the equivalent amount of alcohol, the type of

alcohol, and the patient’s weight (as a normalizing

factor).

Table 2 presents a selection of the rules obtained

after searching with the parameters minsup = 29

(2%), minconf = 0.5 and delta = 0.2.

Table 2: A selection of rules associating the attribute

DEATH?

Antecedent Sup Conf Lift

ALCO_CONS [1, 1.2]

TOBA_DURA = 20

CHLST [197, 273]

SYST [149, 220]



DEATH? = yes

30 0.73 2.7

TOBA_CONSO = 0

CHLST [190, 261]

SYST [ 80, 133]

BMI [22.64, 27.41]



DEATH? = no

52 1.00 1.4

Table 3: Analysis of a rule regarding patient’s death.

Antecedent Sup Conf Lift

ALCO_CONS [1.12, 1.69]

TOBA_CONSO = 1.25

SYST [140, 220]



DEATH? = yes

30 0.65 2.4

ALCO_CONS [1.12, 1.69]

TOBA_CONSO = 1.25



DEATH? = yes

81 0.43 1.6

ALCO_CONS [1.12, 1.69]

SYST [140, 220]



DEATH? = yes

71 0.41 1.5

TOBA_CONSO = 1.25

SYST [140, 220]



DEATH? = yes

47 0.56 2.0

ALCO_CONS [1.12, 1.69]



DEATH? = yes

214 0.27 1.0

TOBA_CONSO = 1.25



DEATH? = yes

132 0.38 1.4

SYST [140, 220]



DEATH? = yes

127 0.39 1.4

RULE EXTRACTION FROM MEDICAL DATA WITHOUT DISCRETIZATION OF NUMERICAL ATTRIBUTES

399

Finally, a kind of analysis about the “additive”

effect of conditions in patient’s death has been done.

For that, the first of the rules shown in Table 3 has

been chosen from the rules generated previously. It

associates alcohol consumption, tobacco

consumption and systolic blood pressure with

patient’s death. This rule expresses that “65% of the

patients with an alcohol consumption in [1.12,

1.69], smoking more than 20 cigarettes/day and with

a systolic blood pressure in [140, 220], were dead”.

To compare the effect of those conditions, alone

and in pairs, rules having the desired conditions have

been selected, and their quality measures are shown

in Table 3.

An analysis of the rules indicates that although

the condition associated to alcohol consumption is

less correlated to death (with a lift value of 1) than

the other two conditions evaluated, when added to

the combination of tobacco consumption and blood

pressure, it increases the confidence from 0.56 to

0.65.

5 CONCLUSIONS

In this work, medical data from an atherosclerosis

study has been used to extract association rules from

it.

Association rules can express unknown

knowledge present in data, in the form of

relationships between the values of the variables.

The method employed is based on a

deterministic approach that generates association

rules without a previous discretization of the

numerical attributes. Discretization can notably

affect the quality of the rules generated, and it is

usually difficult to know the best discretization

technique to apply it to a deterministic algorithm for

a particular dataset.

A variety of rules has been obtained, with good

values of their quality measures, what seems to

support the method employed as a valid way to

generate association rules without a previous

discretization of the numerical attributes.

Also, a particular analysis of a selected rule has

been performed. The rule associates some conditions

with the death of patients object of the study.

ACKNOWLEDGEMENTS

This work was partially funded by the Spanish

Ministry of Science and Innovation, the Spanish

Government Plan E and the European Union through

ERDF (TIN2009-14057-C03-03).

REFERENCES

Agrawal, R., Imielinski, T., Swami, A., 1993. Mining

Association Rules between Sets of Items in Large

Databases. In ACM SIGMOD ICMD, pp. 207-216.

ACM Press.

Bodon, F., 2005. A Trie-based APRIORI Implementation

for Mining Frequent Item Sequences. In 1st

International Workshop on Open Source Data Mining:

Frequent Pattern Mining Implementations, Chicago,

Illinois, pp. 56–65. ACM Press.

Borgelt, C., 2003. Efficient Implementations of Apriori

and Eclat. In Workshop on Frequent Itemset Mining

Implementations. CEUR Workshop Proc. 90, Florida.

Boudík, F., Tomečková, M., Bultas, J., 2004. STULONG

medical project. http://euromise.vse.cz/challenge2004.

Prague.

Brin, S., Motwani, R., Ullman, J.D., Tsur, S., 1997.

Dynamic Itemset Counting and Implication Rules for

Market Basket Data. In Proc. of the ACM SIGMOD

1997, pp. 265-276.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From

Data Mining to Knowledge Discovery in Databases.

AI Magazine, Vol. 17, pp. 37-54.

Han, J., Kamber, M., 2006. Data Mining: Concepts and

Techniques. Morgan Kaufmann, San Francisco.

Lee, C.-H., 2007. A Hellinger-based Discretization

Method for Numeric Attributes in Classification

Learning. Knowledge-Based Systems, 20(4), 419-425.

Liu, H., Hussain, F., Tan, C., Dash, M., 2002.

Discretization: An Enabling Technique. Data Mining

and Knowledge Discovery, 6(4), 393-423.

Salleb, A., Turmeaux, T., Vrain, C., Nortet, C., 2004.

Mining Quantitative Association Rules in a

Atherosclerosis Dataset. Contribution to the PKDD

Discovery Challenge 2004, http://www.univ-

orleans.fr/lifo/Members/salleb/Challenge2004.

Srikant, R., Agrawal, R., 1996. Mining Quantitative

Association Rules in Large Relational Tables. In Proc.

of the ACM SIGMOD 1996, pp. 1-12.

Tsai, C.-J., Lee, C.-I., Yang, W.-P., 2008. A Discretization

Algorithm Based on Class-Attribute Contingency

Coefficient. Information Science, 178(3), 714-731.

HEALTHINF 2012 - International Conference on Health Informatics

400