INFORMED k-MEANS: A CLUSTERING PROCESS BIASED BY PRIOR KNOWLEDGE

A case study in the dactyloscopic domain

Wagner Francisco Castilho

Federal Savings Bank and Catholic University of Brasília, Brasília, Brazil

Hércules Antônio do Prado

Brazilian Enterprise for Agricultural Research and Catholic University of Brasília, Brasília, Brazil

Marcelo Ladeira

University of Brasília, Brasília, Brazil

Keywords: Knowledge Discovery in Databases (KDD), Clustering Analysis, Dactyloscopy.

Abstract: Knowledge Discovery in Databases (KDD) is the process by which previously unknown and useful
knowledge and information are extracted, by automatic or semi-automatic methods, from large amounts of data.
With the evolution of Information Technology and the rapid growth in the number and size of databases, the
development of methodologies, techniques, and tools for data mining has become a major concern for
researchers, and has led, in turn, to the development of applications in a variety of areas of human activity.
Around 1997, the processes and techniques associated with cluster analysis began to be researched with
increasing intensity by the KDD community. Within the context of a model intended to support decisions
based on cluster analysis, prior knowledge about the data structure and the application domain can be used
as important constraints that lead to better results in the clusters’ configurations. This paper presents an
application of cluster analysis in the area of public safety using a scheme that takes into account the body
of prior knowledge acquired from statistical analysis of the data. This information was used as a bias for
the k-means algorithm, which was applied to identify the dactyloscopic (fingerprint) profile of criminals in the
Brazilian capital, also known as the Federal District. The results were then compared with those of a similar
analysis that disregarded the prior knowledge. The analysis using prior knowledge generated clusters that
are more coherent with the expert knowledge.

1 INTRODUCTION

Fayyad (1996) argues that KDD is the process of extracting new, useful, and interesting knowledge
from databases. This process is iterative and interactive in nature and is composed of a series of
activities that also draw on prior knowledge and the adequate interpretation of results.
KDD is a field in which various areas related to knowledge converge, integrating mature
technologies associated with Statistics, Databases, Machine Learning, Computational Intelligence, Data
Warehousing, Artificial Intelligence, and Pattern Recognition. Its application has also spread to
various areas of human activity, such as finance, science, government, health care, sales and
marketing, health insurance and plans, transportation, and industry, among others.

This paper presents an application of KDD techniques in the area of public safety, which centers
on identifying the patterns corresponding to the dactyloscopic (fingerprint) profile of criminals in the
Brazilian capital (Federal District), in comparison to the national profile, on the basis of the
application of a clustering task and statistical resources.

Francisco Castilho W., Antônio do Prado H. and Ladeira M. (2004).
INFORMED k-MEANS: A CLUSTERING PROCESS BIASED BY PRIOR KNOWLEDGE - A case study in the dactyloscopic domain.
In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 469-475
DOI: 10.5220/0002646704690475
SciTePress

2 INFORMED CLUSTERING

Figure 1 presents a general scheme of the informed clustering process. Starting from information
about the data structure and the application domain, the objectives and expectations of the
clustering analysis are defined. Subsequently, a set of variables or attributes of the objects is
selected that is relevant or discriminant for the classification problem under consideration. In
constructing an information matrix, we can consider the gradations of interest or relevance of the
attributes, as well as the implication and correlation maps among them. Information on production
rules can influence the homogeneity coefficient and guide the specification of prior knowledge as
hypotheses that are introduced at the reallocation step of the algorithm, in the search for a better
configuration.
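The scheme above can be illustrated with a minimal sketch. The paper does not spell out how the prior knowledge enters the algorithm, so the two mechanisms used here are assumptions made for illustration only: the initial centroids (`seeds`) are supplied from prior knowledge instead of being chosen at random, and each attribute carries a relevance weight (`weights`) that scales its contribution to the distance. The function names are hypothetical, and the type codes are treated as plain numeric coordinates.

```python
def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with one relevance weight per attribute."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def informed_kmeans(points, seeds, weights, max_iter=100):
    """k-means started from prior-knowledge seeds, using weighted distances.

    Assumption: the bias is expressed only through `seeds` and `weights`;
    the original study may have injected prior knowledge differently.
    """
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reallocation step: each object moves to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: weighted_sq_dist(p, centroids[c], weights))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Running the same function with arbitrary seeds and uniform weights gives an uninformed baseline, which is the kind of comparison the case study reports.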

3 RECOGNITION OF PATTERNS IN THE DACTYLOSCOPIC DOMAIN

The purpose of Biometrics is to identify an

individual based on his or her physical

characteristics. In conjunction with the resources

offered by Information Technology, Biometrics

offers interesting and effective solutions in the area

of public safety, particularly in the identification of

individuals involved in criminal activity.

Dactyloscopy is a biometric technique that has been

widely used to identify criminals, given that it

satisfies the requirements of the permanence,

immutability, and singularity of fingerprints

(Oliveira, 2003). Dactyloscopy is the process by

which individuals are identified through the

examination of their fingerprints. The fingerprint (digital
impression) is the mirror image of the ridge pattern on the finger.

3.1 Dactyloscopic classification system

Dactyloscopic classification systems were developed

to reduce the complexity of and time required for the

identification of fingerprints. Two main

classification systems have been adopted around the

world: Vucetich and Henry. The Brazilian police

force employs the Vucetich system, the most widely

used method in the world. Vucetich defined four

primary types of digital impressions in his system

with the following classifications: arches, internal

loops, external loops, and whorls. Subsequently, the

accidental, scar, and amputation types were added.

These seven primary types are defined as follows (INI, 1987):

Figure 1: Informed clustering process

ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS


Arch: refers to the dactylogram made up of generally parallel and convex ridges that run or tend
to run from one side of the print to the other and very often reveal angular or vertical ridges.
Represented by the number 1 or the letter A.

Internal Loop: refers to the dactylogram that presents a delta to the observer’s right and a
nucleus composed of one or more ridges, which run from the left of the print toward the center,
recurving and returning, or tending to return, to the side from which they originated, thereby
forming one or more loops. Loops involve the two-way movement of a papillary line, which must
have perfect inflection. Represented by the number 2 or the letter I.

External Loop: refers to the dactylogram that reveals a delta to the observer’s left and a
nucleus composed of one or more ridges that run from the right of the print toward the center,
recurving and returning, or tending to return, to the side from which they originated, thereby
forming one or more loops. Represented by the number 3 or the letter E.

Whorl: refers to the dactylogram characterized by the presence of a delta on both the observer’s
left and right and a varied nucleus, which presents at least one curved ridge in front of each
delta. Represented by the number 4 or the letter W.

Accidental: refers to the dactylogram that does

not fit within any of the four primary types cited

before and which is represented by the number 5.

Scar: refers to the dactylogram that presents a

permanent mark caused by a cut, pustule, burn, or

crushing, thereby making its classification within

one of the 5 types cited above impossible and which

is represented by the number 6.

Amputation (or failure): refers to the type in

which a total or partial loss of the phalange is

evidenced, therefore compromising or even

precluding the classification of the primary type, and

which is represented by the number 7.

If we create a fraction in which the numerator is

the number formed by the numbers that represent the

pattern of the fingers of the right hand, extending

from the thumb to the small finger, and the

denominator constitutes the same number for the left

hand, we arrive at the dactyloscopic formula, as it is

known.
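The construction of the formula can be sketched in a few lines. This is an illustration only: it uses the numeric codes 1 to 7 for every finger (the letter notation is ignored for simplicity), and the function name and the sample values are hypothetical.

```python
def dactyloscopic_formula(right_hand, left_hand):
    """Build 'numerator/denominator' from two sequences of five type codes.

    Each sequence runs from the thumb to the small finger; the right hand
    forms the numerator and the left hand the denominator.
    """
    for hand in (right_hand, left_hand):
        if len(hand) != 5 or any(code not in range(1, 8) for code in hand):
            raise ValueError("each hand needs five type codes between 1 and 7")
    numerator = "".join(str(code) for code in right_hand)
    denominator = "".join(str(code) for code in left_hand)
    return f"{numerator}/{denominator}"


# Hypothetical record: whorl thumbs, external loops on the remaining right-hand
# fingers and internal loops on the remaining left-hand fingers.
print(dactyloscopic_formula((4, 3, 3, 3, 3), (4, 2, 2, 2, 2)))
```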

Two fingerprints will only be considered

identical when they demonstrate twelve or more

characteristic points having the same configuration

and location. In the majority of countries, these

criteria are required by law for purposes of a positive

identification in criminal cases.

4 CASE STUDY

The purpose of the analysis was to identify the

pattern of the dactyloscopic (fingerprint) profile of

criminals in the Federal District, in comparison to

the national profile, on the basis of the application of

a clustering analysis and statistics, supported by a

clustering model that uses prior knowledge.

The National Identification Institute (INI), a branch of the Federal Police Department (DPF)
linked to the Ministry of Justice, was founded in 1963 and is headquartered in Brasília. Its task
is to centralize, through the dactyloscopic identification process, the information and
fingerprints associated with the subjects of police investigations or individuals charged with
crimes within the territorial boundaries of Brazil, as well as foreign nationals subject to
registration. The Dactyloscopic Research Section has an Individual Dactyloscopic Archive (AID)
comprising 19 manual archiving machines, model NG Class 5500, in which approximately 1,360,000
individual criminal dactyloscopic records are stored. Each record has ten fields, in which the ten
fingerprints are stored. The individual records are initially archived on the basis of the
fundamental types established in the classification key. The Dactyloscopic Formula (FD) is the
set of numerical symbols representing the primary classification of the AID.

The database, known as “MECA-Sinic”, was extracted from the DPF’s mainframe in November
2000 by a domain expert. It has a total of 502,052 records, a randomly extracted sample of 37% of
the total number of identification records. The attributes with complete values are the criminal
violation code, sex, skin, birth date, and the main type for each finger; the State is stored as
a text attribute. From the database, all records whose State is the Federal District were
selected, a total of 5,363. The attributes selected for the clustering analysis were the 10
primary types corresponding to each finger.


4.1 Incidence of the fundamental dactyloscopic types

According to Araújo (2003), considering the country as a whole, the average incidences of the
fundamental types are: Whorl (31.16%), Internal Loop (30.84%), External Loop (29.21%), and Arch
(7.50%). The primary types Accidental, Scar, and Amputation, combined, do not reach 1.5% of the
cases. Of the total, 90.16% were found to be men and 9.84% women. The External Loop type was
found to be predominant in the right hand, while the Internal Loop type proved predominant in the
left hand, for both sexes (the principle of symmetry). Both the right and left index fingers
have the highest degree of frequency distribution. The small finger revealed the lowest degree.
This highlights the long-standing error in Brazil of using the right thumb instead of the right
index finger as the standard for identification documents.
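Incidence figures of this kind are relative frequencies of each type code in one finger position. A minimal sketch, assuming each record is a sequence of ten type codes and using a hypothetical helper name and toy data (not the MECA-Sinic records):

```python
from collections import Counter


def type_incidence(records, finger):
    """Percentage of each type code observed at position `finger` (0 to 9)."""
    counts = Counter(record[finger] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}  # no records, so no incidence can be reported
    return {code: 100.0 * n / total for code, n in counts.items()}


# Toy data for illustration only: four records, two finger positions each.
toy_records = [(4, 3), (4, 2), (1, 3), (4, 3)]
print(type_incidence(toy_records, finger=0))
```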

Table 2 presents the percentage of the fundamental types compared by sex. There is a
higher incidence among men of the Whorl and Internal Loop types, while for women higher
incidences are found in the External Loop and Arch types. The analysis found that among women, the
most frequent fundamental type is the External Loop (32.11%), followed by the Internal Loop
(29.43%), Whorl (28.01%), and Arch (9.59%), while men displayed a higher incidence of the Whorl
type (31.51%), followed by the Internal Loop (30.99%), External Loop (28.89%), and Arch (7.28%).
The small fingers show the highest absolute incidences among the fundamental types: in women,
External Loop (83.35%, Right Small Finger) and Internal Loop (80.17%, Left Small Finger). Among
men, the order is inverted: Internal Loop (81.80%, Left Small Finger) and External Loop (78.84%,
Right Small Finger).

Table 1: Incidence of fundamental types (%)

Table 2: Fundamental types by sex (%)

4.2 Incidence of the fundamental dactyloscopic types in the Federal District

The statistics for the data corresponding to the Federal District are presented in Table 3. The
highest incidences of the fundamental types are: Internal Loop (81.79%) on the left small finger;
External Loop (80.10%) on the right small finger; Whorl (50.81%) on the right thumb; and Arch,
with an incidence of 16.89% on the left index finger, followed by 14.64% on the right index finger.

The overall average incidence of the fundamental types in the Federal District reveals only small
absolute differences in relation to the national statistics: lower for the arch (0.17 percentage
points) and internal loop (0.04) types; higher for the external loop (0.19) and whorl (0.41)
types. Figure 3 presents the differences between the slopes of the Federal District data and the
national data for each fundamental type and for each finger.
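The size of these differences can be made concrete by applying them to the national averages quoted in Section 4.1. The sketch below is plain arithmetic on figures stated in the text; the resulting Federal District averages are derived values, not figures copied from Table 3, and the reading of the differences as percentage points is an assumption.

```python
# National averages (%) quoted in Section 4.1 (Araújo, 2003).
national = {"arch": 7.50, "internal_loop": 30.84,
            "external_loop": 29.21, "whorl": 31.16}

# Signed differences stated in the text, read as percentage points
# (negative: the Federal District is below the national average).
difference = {"arch": -0.17, "internal_loop": -0.04,
              "external_loop": 0.19, "whorl": 0.41}

# Implied Federal District averages (derived here, not taken from Table 3).
federal_district = {t: round(national[t] + difference[t], 2) for t in national}

for fundamental_type, value in federal_district.items():
    print(f"{fundamental_type}: {value:.2f}%")
```

Under these assumptions the ranking of the four types is the same in both populations, with Whorl first and Arch last.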

Table 3: Fundamental types in the Federal District (%)


Figure 2: Fundamental types in the Federal District (%)

4.3 Distribution of the dactyloscopic formulas

In Vucetich’s dactyloscopic classification system, many alternative formulas can occur, all
constructed by combining the seven primary types for each finger of both hands. In the database
originally constructed by the domain experts, containing 502,052 records, only 36,175 distinct
formulas were observed. Of these, five formulas have markedly higher frequencies and together
represent 11.26% of the occurrences. The dactyloscopic formulas with the highest incidences are
set out in Figure 4. The remaining occurrences (88.74%) belong to less frequent formulas, most of
which (90.72%) have an individual frequency of less than 1% (Araújo, 2003).
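To give a sense of scale, the number of formulas that could occur in principle can be compared with the number observed. The sketch assumes that every finger can independently take any of the seven primary types; the count of observed formulas is the one reported above.

```python
PRIMARY_TYPES = 7   # arch, internal loop, external loop, whorl, accidental, scar, amputation
FINGERS = 10        # five fingers on each hand

# Assumption: any type may occur on any finger, independently of the others.
possible_formulas = PRIMARY_TYPES ** FINGERS
observed_formulas = 36_175  # distinct formulas reported for the 502,052 records

share = observed_formulas / possible_formulas
print(f"possible formulas: {possible_formulas:,}")
print(f"observed share of the formula space: {share:.4%}")
```

Under this assumption only a very small fraction of the formula space is populated, which is consistent with a few formulas concentrating a large share of the records.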

Experiments with different numbers of clusters k narrowed the decision to 6, 5, or 4 groups. With
k = 6 and k = 5, the third and fifth groups produced were close to each other. Thus, in light of
this particular configuration of the data structure and the application domain, and following
numerous trials using samples, a total of four clusters was chosen for the segmentation of the
database containing 7,953 records, corresponding to the total volume of data for the Federal
District.

Figure 4: Dactyloscopic formulas with highest incidences (%)

4.4 Results of the clustering

Tables 4 and 5 present the comparative results of the sum of squared residuals (SQRes) and the
information on the configuration of the resulting clusters for the standard algorithm and the
classification process supported by prior knowledge.
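The paper does not give a formula for SQRes. A minimal sketch, assuming it is the usual within-cluster sum of squared Euclidean distances between each object and the centroid of its cluster (the function name and the toy data are hypothetical):

```python
def sq_res(points, assignment, centroids):
    """Sum, over all objects, of the squared distance to the assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroids[cluster]))
        for point, cluster in zip(points, assignment)
    )


# Toy data for illustration only: three objects in two clusters.
toy_points = [(0, 0), (0, 2), (10, 10)]
toy_assignment = [0, 0, 1]
toy_centroids = [(0, 1), (10, 10)]
print(sq_res(toy_points, toy_assignment, toy_centroids))
```

A smaller value means the objects lie closer to their cluster centers, so two partitions of the same data into the same number of clusters can be compared directly on this measure.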

Figure 3: Difference between the slopes of the fundamental types in Brazil and the Federal District


The results presented indicate that the informed clustering process performed better. With this
process, the SQRes was smaller and the dactyloscopic formulas were more evenly distributed
among the four groups, with formulas 3 and 5 located within the same group. Cluster 1 received
3,340 objects, with a greater resemblance to formula 1 (375 objects). Cluster 2 received 1,672
objects, more closely related to formula 2 (206). Cluster 3 received 1,591 objects, more closely
related to formulas 3 (179) and 5 (102). Cluster 4 received 1,750 objects, with a greater
resemblance to formula 4. The results suggest, therefore, a notable homogeneity in the pattern of
distribution of the fundamental dactyloscopic formulas. The percentage pattern of the Federal
District is highly similar to the national figure.

Table 4: Sum of the Residual Square (SQRes)

Table 5: Distribution of the objects

5 CONCLUSION

KDD has provided useful results for researchers and companies alike. In this context,
cluster analysis occupies an important place. The clustering process is complex and requires
multiple decisions at each stage that influence the final results (Han, 2001). Hanson (1990)
highlights the need for a quota of prior knowledge as a requirement for the clustering process.
The purpose of informed clustering is to make use of information on the data structure and the
application domain, not only to optimize the operation of the process, but also as constraints
derived from the data space or the expert’s knowledge. This is known as Domain Theory, which
guides the inductive process. The structural and causal correlations and dependencies among the
attributes can be applied to the homogeneity coefficient. A gradation scale of relevance and
interest for the attributes in relation to the desired configuration for the clustering process
can also be employed. Research has also been directed at applying production rules to the
homogeneity coefficient of clustering algorithms. A fertile field for research has thus opened
up, especially regarding applications in the area of public safety.

The progress of biometric techniques, together with the development of Information Technology,
offers interesting possibilities with respect to the problems of civil and criminal
identification and differentiation of individuals. According to Araújo (2003), continued research
on the database and information of the Federal Police Department’s National Identification
Institute is important for strengthening the technical criteria used in developing the laws
governing civil and criminal identification. Advances in this area of research may also serve to
stimulate and support the development and implementation of computerized fingerprint analysis
systems (Automated Fingerprint Identification System, AFIS), thereby generating interesting
possibilities for the study of criminal psychology, biology, and anthropology, among other
specializations, through improved management and application of the information collected from
efforts in the area of civil and criminal identification.

ACKNOWLEDGMENT

We are grateful to the papilloscopist Mr. Marcos Elias Cláudio de Araújo from the INI/DPF, who
kindly provided access to the “MECA-Sinic” database.

REFERENCES

Araújo, M. E. C., Bossois L. M., Santana J. L., 2003. O

Arquivo datiloscópico criminal brasileiro: os tipos

fundamentais e suas freqüências. In XIII Congresso

Mundial de Criminologia. Sociedade Internacional de

Criminologia.

Fayyad, U. M. et al, 1996. From data mining to knowledge

discovery: an overview. In: Fayyad, U. M. et al.

Advances in Knowledge discovery and data mining,

AAAI Press. Menlo Park, CA.

Han, J., Kamber, M., 2001. Data Mining: concepts and

techniques, Morgan Kaufmann Publishers.

Hanson, S. J., 1990. Conceptual Clustering and
Categorization: Bridging The Gap Between Induction
and Causal Models. In: Kodratoff, Y. & Michalski, R.
(Eds.), Machine Learning: An Artificial Intelligence
Approach, Morgan: San Mateo, CA.

INI - Instituto Nacional de Identificação, 1987.

Identificação Papiloscópica, Departamento de Polícia

Federal (DPF). Brasília.

Oliveira, M. G., 2003. Otimização de busca decadactilar

para métodos manuais, tradicionais ou sistemas

automatizados de identificação de impressões digitais

(AFIS), utilizando técnicas de Data Mining. UNB.

Brasília.
