DATABASES REDUCTION
Roberto Ruiz, José C. Riquelme and Jesús S. Aguilar-Ruiz
Department of Computer Science, University of Seville
Avda. Reina Mercedes s/n, 41012 Sevilla, Spain
Keywords:
Data mining, preprocessing techniques, database reduction, feature selection, data set editing.
Abstract:
Progress in digital data acquisition and storage technology has resulted in the growth of huge databases. Nevertheless, data mining techniques applied to them often have a high computational cost, so it is advisable to apply a preprocessing phase to reduce the time complexity. These preprocessing techniques are fundamentally oriented to one of two goals: horizontal reduction of the databases, or feature selection; and vertical reduction, or editing. In this paper we present a new proposal to reduce databases by applying vertical and horizontal reduction techniques sequentially. Both are based on our previous work and use the concept of projection to choose representative examples and features. The results are very satisfactory, because the reduced database offers the same intrinsic performance for the later application of classification techniques, with low computational resources.
1 INTRODUCTION
Data mining researchers, especially those dedicated to the study of algorithms that produce knowledge in some of the usual representations (decision lists, decision trees, association rules, etc.), usually test their algorithms on standard, accessible databases (most of them small). The purpose is to independently verify and validate their results. Nevertheless, these algorithms are later adapted to solve specific problems, for example real databases that contain much more information (tens of attributes and tens of thousands of examples) than the standard ones used in training. Applying these data mining techniques then takes a great deal of time and memory, even with the capability of current computers, which makes adapting the algorithms to such problems extremely difficult.
Therefore, it is important to apply preprocessing techniques to the databases. These preprocessing techniques are fundamentally oriented to one of the following goals: feature selection (eliminating non-relevant attributes) or editing (reducing the number of examples). Existing methods solve only one of these two problems.
In this paper we present a new approach that reduces databases in both directions, through the sequential application of vertical and horizontal reduction techniques. Both are based on our previous work and use the concept of projection to choose representative examples and features.
2 VERTICAL REDUCTION
2.1 Related work
Editing methods are related to the nearest neigh-
bours (NN) techniques (Cover and Hart, 1967). Some
of them are briefly cited in the following lines.
Hart (Hart, 1968) proposed to include in the subset S those examples of the training set T whose classification with respect to S is wrong using the nearest neighbour technique, so that every member of T is closer to a member of S of the same class than to a member of S of a different class; Aha et al. proposed a variant of Hart's method; Wilson (Wilson, 1972) proposed to eliminate the examples with incorrect k-NN classification, so that each member of T is removed if it is incorrectly classified by its k nearest neighbours; Tomek (Tomek, 1976) extended Wilson's idea, eliminating the examples incorrectly classified for any number of neighbours i = 1, ..., k, where k is the maximum number of neighbours to be analysed; the work of Ritter et al. (Ritter et al., 1975) extended Hart's method so that every member of T must be closer to a member of S of the same class than to any member of T of a different class.
Other variants are based on Voronoi diagrams (Klee,
1980), Gabriel neighbours (two examples are said to be Gabriel neighbours if their diametrical sphere does not contain any other example) or relative neighbours (Toussaint, 1980) (two examples p and q are relative neighbours if, for all other examples x in the set, dist(p,q) < max{dist(p,x), dist(q,x)}). All of these techniques need to calculate distances between examples, which is rather time consuming. If N examples with M attributes are considered, the first methods take O(MN^2) time, Ritter's algorithm is O(MN^2 + N^3), and the Voronoi, Gabriel and relative neighbour approaches are O(MN^3).
The most important characteristics of our editing algorithm, called EOP (Editing by Ordered Projection) (Aguilar et al., 2000), are:
- Considerable reduction of the number of examples.
- Lower computational cost, O(MN log N), than other algorithms.
- Absence of distance calculations.
- Conservation of the decision boundaries, which is especially interesting for applying classifiers based on axis-parallel decision rules (like C4.5).
2.2 Editing
If we choose a region in which all the examples have the same class, we can perhaps discard some of them that are not decisive in establishing the boundaries of the region. For example, in two dimensions we need at most four examples to determine the boundaries of one region. In general, in d dimensions we will need at most 2d examples. Therefore, if a region has more than 2d examples, we can reduce their number.
This is the main idea of our algorithm: to eliminate the examples that are not on the boundaries of the regions to which they belong. The aim is to calculate which set of examples could be covered by a pure region and then eliminate those inside that do not establish the boundaries. A region is pure if all the examples inside it have the same class.
Consider the situation depicted in Figure 1: the projection of the examples onto the abscissa axis produces four ordered sequences {I; P; I; P} corresponding to the examples {[9; 3; 5; 1; 11]; [8]; [7]; [4; 6; 2; 12; 10]}. Likewise, the projection onto the ordinate axis gives the sequences {P; I; P; I} formed by the examples {[12; 10; 8; 6; 4]; [11]; [2]; [9; 7; 5; 3; 1]}. Each sequence represents a rectangular region as a possible solution of a classifier (a rule), and the initial and final examples of the sequence (if it has only one, it is simultaneously the initial and the final one) give the lower and upper values for each coordinate of this rectangle. For example, there is a rectangle formed by the examples {1; 3; 5; 7; 9}.
Figure 1: Results of applying EOP (twelve examples of two classes, I and P, projected onto both axes).
This region needs the examples {9; 7} to establish the boundaries in one dimension and the examples {1; 9} in the other. Therefore, the remaining examples are candidates to be eliminated, because they are never boundaries. The idea is best understood by analysing the non-empty regions obtained by means of projections on every axis and deleting the examples that are not relevant for establishing the boundaries of a rule.
Regarding the analysis of this editing method, we have dealt with eighteen databases from the UCI repository (Blake and Merz, 1998). To show the performance of our method we have used C4.5 (Quinlan, 1993) and k-NN (Cover and Hart, 1967) before and after applying EOP. The results obtained (Riquelme et al., 2003) prove the validity of the method.
Table 1: EOP algorithm

Input:  E training (N ex., M att.)
Output: E reduced (L ex., M att.)

for each example e in E
    weakness(e) = 0
for each attribute a
    sort E in increasing order of a
    for each example e in E
        if e is NOT a border of its sequence
            increase weakness(e)
for each example e in E
    if weakness(e) = M
        delete e
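As an illustration of how the projections are used, the following Python sketch implements this pseudocode under simplifying assumptions (numeric attributes and no special handling of repeated attribute values); it is not the authors' original implementation.

import numpy as np

def eop(X, y):
    """Editing by Ordered Projection (illustrative sketch).

    X : (N, M) array of numeric attribute values.
    y : (N,) array of class labels.
    Returns a boolean mask of the examples to keep.
    """
    n, m = X.shape
    weakness = np.zeros(n, dtype=int)
    for a in range(m):
        order = np.argsort(X[:, a], kind="stable")   # project onto attribute a
        labels = y[order]
        for pos, idx in enumerate(order):
            prev_same = pos > 0 and labels[pos - 1] == labels[pos]
            next_same = pos < n - 1 and labels[pos + 1] == labels[pos]
            if prev_same and next_same:              # interior of a same-class run: not a border
                weakness[idx] += 1
    return weakness < m      # delete only the examples that are interior in every projection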
3 HORIZONTAL REDUCTION
3.1 Related work
Depending on the evaluation strategies, feature selec-
tion algorithms can generally be placed into one of
two broad categories. Wrappers (Kohavi and John, 1997) employ a statistical re-sampling technique (such as cross-validation) with the actual target learning algorithm to estimate the accuracy of feature subsets. This approach has proved to be useful but is very slow to execute, because the learning algorithm is called upon repeatedly. The other option, called filters, operates independently of any learning algorithm: undesirable features are filtered out of the data before induction begins. Filters use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. As a consequence, filter methods are generally much faster than wrapper methods and, as such, are more practical for use on data of high dimensionality. LVF (Liu and Setiono, 1996) uses class consistency as an evaluation measure. Another method, Chi2 (Liu and Setiono, 1995), performs selection by discretization. Relief (Kira and Rendell, 1992) works by randomly sampling an instance from the data and then locating its nearest neighbour from the same and from the opposite class. Relief was originally defined for two-class problems and was later extended as ReliefF (Kononenko, 1994) to handle noise and multi-class data sets, while RReliefF handles regression problems. Other authors suggest neural networks as attribute selectors. In addition, learning procedures can be used to select attributes, like ID3 (Quinlan, 1986), FRINGE (Pagallo and Haussler, 1990) and C4.5 (Quinlan, 1993), as well as methods based on correlations, like CFS (Hall, 1997).
The most important characteristics of our feature selection algorithm, called SOAP (Selection of Attributes by Projection) (Ruiz et al., 2002), are very similar to those of EOP.
3.2 Feature selection
In this paper we propose a new feature selection criterion that is not based on measures calculated between attributes, nor on complex and costly distance calculations. The criterion is based on a single value, the NLC, which relates each attribute to the label used for classification. This value is calculated by projecting the data set elements onto the axis of the attribute (i.e. ordering the examples by that attribute), then traversing the axis from the smallest to the greatest attribute value and counting the Number of Label Changes (NLC) produced.
Consider the situation depicted in Figure 2: the projection of the examples onto the abscissa axis produces three ordered sequences {O; E; O} corresponding to the examples {[1,3,5], [8,4,10,2,6], [7,9]}. Likewise, the projection onto the ordinate axis gives the sequences {O; E; O; E; O; E} formed by the examples {[8], [7,5], [10,6], [9,3], [4,2], [1]}. Then we calculate the Number of Label Changes (NLC): two for the first attribute and five for the second.
E 8
7
5
10
6
9
3
4
2
O 1
O E O
O
E
O
E
Figure 2: Results of applying SOAP
We conclude that it is easier to classify by the attributes with the smallest number of label changes. If the attributes are placed in ascending order of NLC, we obtain a ranking list with the best attributes from the point of view of classification.
We have dealt with eighteen databases from the UCI repository (Blake and Merz, 1998). To show the performance of our method we have used k-NN and C4.5 before and after applying SOAP. The results obtained (Ruiz et al., 2002) prove the validity of the method.
Table 2: SOAP algorithm

Input:  E training (N ex., M att.)
Output: E reduced (N ex., K att.)

for each attribute a
    sort E in increasing order of a
    count the label changes (NLC)
rank the attributes by NLC
choose the K first attributes
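The NLC count itself takes only a few lines. The following Python sketch mirrors Table 2 under the same simplifying assumptions as the EOP sketch above (numeric attributes, no special treatment of repeated values); the helper name soap_ranking and the parameter k are illustrative, not part of the original implementation.

import numpy as np

def soap_ranking(X, y):
    """Selection of Attributes by Projection (illustrative sketch).

    Ranks attributes by their Number of Label Changes (NLC): the fewer
    changes along the sorted projection, the more relevant the attribute.
    """
    nlc = np.empty(X.shape[1], dtype=int)
    for a in range(X.shape[1]):
        labels = y[np.argsort(X[:, a], kind="stable")]         # project onto attribute a
        nlc[a] = np.count_nonzero(labels[1:] != labels[:-1])   # count label changes
    return np.argsort(nlc), nlc                                # attribute indices, best first

# Keeping the k best attributes:
# ranking, nlc = soap_ranking(X, y)
# X_reduced = X[:, ranking[:k]]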
4 INTEGRATION OF
REDUCTION TECHNIQUES
The size of a data set can be measured in two dimen-
sions, number of features and number of instances.
Both can be very large. This enormity may cause se-
rious problems to many data mining systems.
Our approach is to reduce the database in the two
directions, vertically and horizontally, applying the
aforementioned algorithms sequentially.
The algorithm is very simple and efficient. The computational cost of both EOP and SOAP is O(MN log N), the lowest in their respective categories; therefore, the new algorithm is efficient too.
Figure 3 shows the process of reducing a database with two thousand examples and forty-one attributes, the last one being the class. There are three possible labels: A, B and C. At the beginning, vertical reduction is applied with the EOP algorithm, and the number of examples decreases to three hundred and fifty.
Figure 3: Database reduction process. EOP (vertical reduction) and SOAP (horizontal reduction) are applied alternately: 2000 ex. x (40 att. + class) -> EOP -> 350 ex. x 41 att. -> SOAP -> 350 ex. x 9 att. -> EOP -> 334 ex. x 9 att.
Table 3: Main algorithm

Input:  E training (N ex., M att.)
Output: E reduced (L ex., K att.)

E' = EOP(E)
while (E' <> E)
    E  = SOAP(E')
    E' = EOP(E)
endwhile
E' = SOAP(E)
Then, the most relevant attributes are chosen by means of the SOAP algorithm, and the number of features decreases to nine. The two algorithms are applied alternately, using EOP as a stopping criterion, until a new execution no longer reduces the data set.
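A rough Python sketch of this alternation, reusing the eop and soap_ranking sketches above, is shown below. It follows the spirit of Table 3 rather than reproducing it line by line; the fixed number k of attributes kept per SOAP pass and the stopping test (an EOP pass that removes no example) are simplifying assumptions.

def reduce_database(X, y, k):
    """Sequential integration of EOP and SOAP (illustrative sketch)."""
    keep = eop(X, y)                    # initial vertical reduction
    X, y = X[keep], y[keep]
    while True:
        ranking, _ = soap_ranking(X, y)
        X = X[:, ranking[:k]]           # horizontal reduction: keep the k best attributes
        keep = eop(X, y)                # vertical reduction
        if keep.all():                  # EOP removed nothing: the data set is stable
            return X, y
        X, y = X[keep], y[keep]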
5 EXPERIMENTS
In this section we analyse our algorithm on a large data set. We wanted a set of considerable size (more than twenty thousand examples), with few missing values, and whose attributes' relevancy is unknown. Such a set is not easy to find, so we decided to create one. We considered that a database with forty thousand examples and forty features would be large enough and would present an adequate difficulty to test our algorithm in the desired environment.
We generate the examples randomly and label them according to a given set of rules. At the end, we add a percentage of examples with noise. We try to fix the minimum number of parameters (Table 4); the remaining parameters are chosen randomly. There are three important fixed parameters: the number of rules (thirty-five), the number of conditions in each rule (four), and the set of possible labels (five). We assign labels to the rules consecutively (rule 0: label A, rule 1: label B, ..., rule 5: label A, ...). Each label can therefore be obtained from seven different rules (35 rules / 5 classes = 7).
Table 4: Parameters

Examples             40000
Attributes           40
Rules                35
Conditions per rule  4
Labels               5

35 rules x 4 = 140 conditions
35 rules / 5 labels = 7 rules per label
There are four attributes in each rule (one for each condition, see Table 5) and they are fixed, but the intervals for each attribute are obtained randomly within some fixed limits. Each rule differs from the rest in at least two attributes, so we offer a wide range of regions.

Condition: (li < att < hi)
att: constant
li, hi: random
for all X, Y: #(ruleX ∩ ruleY) ≤ 2 atts.
We consider that one attribute is more relevant than another if it is used more times in the set of rules.
Table 5: Rules

rule 1:  condition1 AND condition2 AND condition3 AND condition4 THEN label A
rule 2:  condition5 AND condition6 AND condition7 AND condition8 THEN label B
rule 3:  condition9 AND condition10 AND condition11 AND condition12 THEN label C
...
rule 35: condition137 AND condition138 AND condition139 AND condition140 THEN label E

Ex.: [lo1 < att1 < hi1] AND [lo2 < att2 < hi2] AND [lo6 < att6 < hi6] AND [lo9 < att9 < hi9] THEN label A
Figure 4: Relevance of the attributes. A1,...,A5 appear 14 times each (x 5 atts = 70 conditions), A6,...,A10 appear 8 times each (= 40), A11,...,A15 appear 4 times each (= 20), and A16,...,A20 appear 2 times each (= 10), from most to least relevant; the remaining attributes are redundant or irrelevant.
We want to obtain four groups of well-separated attributes: relevant, less relevant, redundant and dummy attributes, with ten attributes in each group. The first and the second group (relevant and less relevant) are divided into two subgroups, with five attributes more relevant than the other five. Therefore, we have six different groups of attributes. The importance of an attribute is related to the number of times it appears in the set of rules. Rules are built from four of the twenty attributes that belong to the relevant and less relevant groups. All in all, the distribution of the attributes in the set of rules is as follows: five attributes of the relevant group appear fourteen times each and the other five appear eight times each; five attributes of the less relevant group appear four times each and the other five appear two times each.
The first step (Figure 5) in the process followed to generate the database is to obtain, randomly, the values of twenty of the forty attributes (those belonging to the relevant and less relevant groups). Then each example is filtered through the set of rules and classified with a label. If an example satisfies more than one rule with different labels, it is thrown out. Otherwise, we set the values of the remaining attributes: ten features obtained randomly (the irrelevant attributes) and another ten computed from some of the other thirty attributes (the redundant group). This process is repeated until the database is full. At the end of the process, we add a number of registers with noise (10%), i.e. the value of each attribute is obtained randomly (label included). The only thing remaining is to mix the attributes of the different groups.
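The following Python sketch mirrors this generation process. The value ranges, the function used for the redundant attributes and the rule representation are invented for illustration; they are not the parameters used in the paper, and the attribute shuffling step is omitted.

import numpy as np

rng = np.random.default_rng(0)

def generate_database(rules, n_examples=40000, n_atts=40, noise=0.10):
    """Sketch of the synthetic database generation.

    `rules` is assumed to be a list of (conditions, label) pairs, where
    conditions maps an attribute index in 0..19 to a (low, high) interval.
    """
    n_noise = int(noise * n_examples)
    data, labels = [], []
    while len(data) < n_examples - n_noise:
        x = np.empty(n_atts)
        x[:20] = rng.uniform(0, 100, 20)        # relevant and less relevant attributes
        fired = {label for conds, label in rules
                 if all(lo < x[a] < hi for a, (lo, hi) in conds.items())}
        if len(fired) != 1:                     # conflicting labels (or no rule): discard
            continue
        x[20:30] = rng.uniform(0, 100, 10)      # irrelevant attributes: random
        x[30:40] = 2.0 * x[:10] + 1.0           # redundant attributes: function of others (assumed)
        data.append(x)
        labels.append(fired.pop())
    # noisy registers: every value is random, label included
    noise_x = rng.uniform(0, 100, (n_noise, n_atts))
    noise_y = rng.choice(sorted(set(labels)), size=n_noise)
    return np.vstack([np.array(data), noise_x]), np.concatenate([labels, noise_y])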
We conclude that obtaining a learning model from the generated data set is very complex.
Random(A
1
A
2
,...,A
20
)
36000 ex.
Random(A
1
A
2
,A
3
,...,A
40
,class) 
4000 ex.
rules
Class
Throw out
f(A
1
A
2
,A
3
,...,A
30
)= (A
31
A
32
,A
33
,...,A
40
)
Random(A
21
A
22
,A
23
,...,A
30
)
1label
>1
Figure 5: Process to generate de database
If we try to generate a set of decision rules or a decision tree from the database, the accuracy of the model on the training set will be low. Therefore, the preprocessing techniques must be very robust and efficient, because of the size of the data.
Table 6: Results obtained

             Original   Reduced
Examples     40000      22451
Attributes   40         5
C4.5 %       53.33      58.45
C4.5 size    9897       8011
Table 6 shows that, after generating the data set with 40000 examples and 40 attributes, we applied the classifier C4.5 and obtained a tree of size 9897 with 21334 correctly classified examples, i.e. 53.33% of the data. We then applied our new algorithm and obtained a reduced data set with 22451 examples and 5 attributes: 56% of the original examples and 13% of the original attributes. Overall, the size of the database (40000 x 40 = 1,600,000 values) is reduced to 22451 x 5 = 112,255 values. Our method removes 93% of the original size, leaving us with 7% of the original data.
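As a quick arithmetic check of these percentages (plain Python on the figures in Table 6):

original_cells = 40000 * 40            # 1,600,000 values in the original database
reduced_cells = 22451 * 5              # 112,255 values in the reduced database
print(reduced_cells / original_cells)  # ~0.0702, i.e. about 7% of the data is kept
print(22451 / 40000, 5 / 40)           # 0.56 of the examples, 0.125 (~13%) of the attributes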
Applying C4.5 to the reduced database, i.e. using only 7% of the original data, we obtain a smaller tree of 8011 nodes. If we classify the 40000 examples, the accuracy is 58.45%, better than with the
original data. Therefore, this new method is robust. We conclude that a classifier like C4.5 generates a better result when an effective preprocessing technique is used.
6 CONCLUSIONS
In this paper we present an integration of database reduction algorithms. The integration is based on two techniques applied sequentially: the first is an example reduction method (editing, or vertical reduction) and the second is an attribute reduction algorithm (feature selection, or horizontal reduction). Both techniques were validated in previous papers and are efficient, since their computational costs are the lowest in their respective categories.
Given the satisfactory results obtained with a specifically generated database of extreme complexity, we can state that the integration of the two reduction techniques (horizontal and vertical) is very interesting from the point of view of applying data mining techniques.
The results show a very important data reduction of 93%. Nevertheless, the quality of the information is preserved with only 7% of the original data: a model based on a decision tree is generated, and its accuracy is slightly better than the accuracy obtained with all the data.
Our future work will be oriented towards the integration of other reduction techniques and their application to real-world data sets.
ACKNOWLEDGEMENTS
This work has been supported by the Spanish Re-
search Agency CICYT under grant TIC2001-1143-
C03-02.
REFERENCES
Aguilar, J. S., Riquelme, J. C., and Toro, M. (2000). Data set editing by ordered projection. In Proceedings of the 14th European Conference on Artificial Intelligence, pages 251–255, Berlin, Germany.
Blake, C. and Merz, E. K. (1998). UCI repository of machine learning databases.
Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pat-
tern classification. IEEE Transactions on Information
Theory, IT-13(1):21–27.
Hall, M. (1997). Correlation-based feature selection for
machine learning. PhD thesis, University of Waikato,
Hamilton, New Zealand, Department of Computer
Science.
Hart, P. (1968). The condensed nearest neighbor rule. IEEE
Transactions on Information Theory, 14(3):515–516.
Kira, K. and Rendell, L. (1992). A practical approach to
feature selection. In International Conference on Ma-
chine Learning, pages 368–377.
Klee, V. (1980). On the complexity of d-dimensional
voronoi diagrams. Arch. Math., 34:75–80.
Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of relief. In European Conference on Machine Learning, pages 171–182.
Liu, H. and Setiono, R. (1995). Chi2: Feature selection and
discretization of numeric attributes. In Proceedings of
the Seventh IEEE International Conference on Tools
with Artificial Intelligence.
Liu, H. and Setiono, R. (1996). Feature selection and clas-
sification: a probabilistic wrapper approach. In Pro-
ceedings of the IEA-AIE.
Pagallo, G. and Haussler, D. (1990). Boolean feature
discovery in empirical learning. Machine Learning,
5:71–99.
Quinlan, J. (1986). Induction of decision trees. Machine
Learning, 1:81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learn-
ing. Morgan Kaufmann, San Mateo, California.
Riquelme, J., Aguilar-Ruiz, J. S., and Toro, M. (2003).
Finding representative patterns with ordered projec-
tions. Pattern Recognition, 36(4):1009–1018.
Ritter, G., Woodruff, H., Lowry, S., and Isenhour, T. (1975).
An algorithm for a selective nearest neighbor deci-
sion rule. IEEE Transactions on Information Theory,
21(6):665–669.
Ruiz, R., Riquelme, J., and Aguilar-Ruiz, J. S. (2002). Projection-based measure for efficient feature selection. Journal of Intelligent and Fuzzy Systems, 12(3-4):175–183.
Tomek, I. (1976). An experiment with the edited nearest-
neighbor rule. IEEE Transactions on Systems, Man
and Cybernetics, 6(6):448–452.
Toussaint, G. T. (1980). The relative neighborhood graph
of a finite planar set. Pattern Recognition, 12(4):261–
268.
Wilson, D. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2(3):408–421.