AN ONTOLOGY DRIVEN DATA MINING PROCESS

Laurent Brisson

Institut TELECOM, TELECOM Bretagne, CNRS FRE 3167 LAB-STICC, Technopôle Brest-Iroise, France

Martine Collard

INRIA Sophia Antipolis, 2004 route des Lucioles, 06902 BP93 Sophia Antipolis, France

University of Nice Sophia Antipolis, France

Keywords:

Data mining, Knowledge integration, Ontology Driven Information System.

Abstract:

This paper deals with knowledge integration in a data mining process. We propose modeling domain knowledge during the business understanding and data understanding steps in order to build an ontology driven information system (ODIS). We present the KEOPS methodology, which is based on this approach. In KEOPS, the ODIS is dedicated to data mining tasks. It allows expert knowledge to be used for efficient data selection, data preparation and model interpretation. In this paper, we detail each of these ontology driven steps and we define a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge.

1 INTRODUCTION

In knowledge discovery from data, methods and techniques are developed for discovering specific trends in the business of a system or organization by analyzing its data. The real advantage for decision making lies in the added value provided by comparing extracted knowledge against a priori domain knowledge. Integrating a priori domain knowledge during the data mining process is currently an important research issue in the data mining field.

In this paper, we present the KEOPS methodology, based on an ontology driven information system which integrates a priori knowledge all along the data mining process in a coherent and uniform manner. We detail each of these ontology driven steps and we define a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge.

The paper is organized in six sections. Section 2 presents the issue addressed and the main characteristics of KEOPS. Section 3 is devoted to ontology driven information systems. In Section 4, the KEOPS methodology is presented step by step. In Section 5, we comment on some results which demonstrate the relevance of the approach. We conclude in Section 6.

2 KNOWLEDGE INTEGRATION

IN DATA MINING

The data mining process described according to the CRISP-DM model (Chapman et al., 2000) is presented as both iterative and interactive. The iterative nature is due to the way processes run, cycling through trial-and-error experiments. Indeed, data miners have to repeat the pre-processing steps of domain understanding, data understanding and data preparation until the final models are considered relevant. The interactive nature is inherent to a data mining activity, since communication with experts is necessary for understanding the domain and the data and for interpreting results. Issues in evaluating and interpreting mining process results are currently major research challenges. In order to avoid useless iterations on preliminary tasks and to facilitate model interpretation, one solution is to explore expert knowledge and source data in depth in order to formalize them in conceptual structures, and to exploit these structures both for robust data preparation and for flexible model interpretation.

In the literature, partial solutions for domain knowledge interpretation are proposed for optimizing pre-processing steps (Kedad and Métais, 2002). For model evaluation, detailed studies have been devoted to interestingness measures (McGarry, 2005). A consensus among researchers is now established to distinguish objective interestingness from subjective interestingness. Objective interestingness is traditionally evaluated by a variety of statistical indexes, while subjective interestingness is generally evaluated by comparing discovered patterns to user knowledge or to the a priori convictions of domain experts. In this paper we present the KEOPS methodology, based on an ontology driven information system which addresses the knowledge integration issue. The system relies on three main components: an ontology, a knowledge base and a mining oriented database rebuilt from source raw data. These components make it possible to model domain concepts and the relationships among them. They are used to pre-process data and to identify mappings between discovered patterns and expert knowledge.

Figure 1: KEOPS methodology.

Brisson L. and Collard M. (2008). AN ONTOLOGY DRIVEN DATA MINING PROCESS. In Proceedings of the Tenth International Conference on Enterprise Information Systems - AIDSS, pages 54-61. DOI: 10.5220/0001697400540061. SciTePress

3 ONTOLOGY DRIVEN

INFORMATION SYSTEM

(ODIS)

An ontology driven information system is an information system (IS) which relies mainly on an explicit ontology. This ontology may underlie all aspects and components of the information system. An ODIS contains three kinds of components: application programs, information resources and user interfaces. Guarino (1998) discusses the impact of an ontology on an information system along a temporal and a structural dimension.

The temporal dimension refers to the role of the ontology during IS construction and at run-time. If we have a set of reusable ontologies, the semantic content they express can be transformed and translated into an IS component. Even if the volume of ontology knowledge available is modest, it may nevertheless help a designer in a conceptual analysis task. This task frequently consists of redesigning an existing information system. This approach fits the needs of data mining tasks, where an operational database has to be transformed into datasets before the data mining modeling step.

The structural dimension refers to each informa-

tion system component which may use the ontology

in a speciﬁc way.

• Database component: at development time, an on-

tology can play an important role in requirement

analysis and conceptual modeling. The resulting

conceptual model can be represented as a com-

puter processable ontology mapped to a concrete

target platform (Ceri and Fraternali, 1997). Usu-

ally, IS conceptual schemes (CS) are created from

scratch, wasting a lot of time and resources.

• Interface components may be assisted by ontolo-

gies which are used to generate personalized in-

terfaces or to manage user proﬁles (Guarino et al.,

1998; Penarrubia et al., 2004).

• Application program components use implicit

knowledge in order to perform a task. However,

this knowledge is often hardcoded in software.

Ontologies may provide a formal base helping to

access domain knowledge.


4 KEOPS METHODOLOGY

KEOPS is a methodology which drives data mining

processes by integrating expert knowledge. These are

the goals addressed:

• To manage interactions between knowledge and

data all along the data mining process: data prepa-

ration, datasets generation, modeling, evaluation

and results visualization.

• To evaluate extracted models according to domain

expert knowledge.

• To provide easy navigation throughout the space

of results.

KEOPS (cf. ﬁg. 1) is based upon an ontology driven

information system (ODIS) set up with four compo-

nents:

• An application ontology whose concepts and relationships are dedicated to the domain and to the data mining task.

• A Mining Oriented DataBase (MODB): a rela-

tional database whose attributes and values are

chosen among ontology concepts.

• A knowledge base to express consensual knowl-

edge, obvious knowledge and user assumptions.

• A set of information system components - user in-

terfaces, extraction algorithms, evaluation meth-

ods - in order to select the most relevant extracted

models according to expert knowledge.

The KEOPS methodology extends the CRISP-DM process model by integrating knowledge in most steps of the mining process. The initial step focuses on business understanding. The second step focuses on data understanding and on activities that check data reliability. Data reliability problems are solved during the third step, data preparation. The fourth step is the evaluation of extracted models. In this paper we do not focus on the modeling step of the CRISP-DM model, since we ran the CLOSE algorithm (Pasquier et al., 1999), which extracts association rules without domain knowledge.

4.1 Business Understanding

During the business understanding step, documents, data, domain knowledge and discussions between experts lead to assessing the situation, determining business objectives and success criteria, and evaluating risks and contingencies. However, this step is often rather informal.

The KEOPS methodology requires building an ontology driven information system during the next step, data understanding. Consequently, an informal specification of business objectives and expert knowledge is no longer sufficient. Thus, it is necessary to formalize expert knowledge during business understanding. We chose to state knowledge with production rules, also called “if ... then ...” rules. These rules are modular, each defining a small and independent piece of knowledge. Furthermore, they can be easily compared to extracted association rules. Each knowledge rule has some essential properties used to select the most interesting association rules:

• Knowledge conﬁdence level: ﬁve different values

are available to describe knowledge conﬁdence

according to a domain expert. These values are

ranges of conﬁdence values: 0-20%, 20-40%, 40-

60%, 60-80% and 80-100%. We call conﬁdence

the probability for the rule consequence to occur

when the rule condition holds.

• Knowledge certainty:

– Obvious: knowledge cannot be contradicted.

– Consensual: domain knowledge shared among

experts.

– Assumption: knowledge the user wants to

check.

Since describing an expert interview methodology for capturing knowledge is beyond the scope of this paper, the reader should refer to (Becker, 1976).
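The two properties of a knowledge rule listed above can be captured in a small data structure. The following is a minimal sketch, not the KEOPS implementation: the names `KnowledgeRule`, `Certainty` and `confidence_level_of` are hypothetical, and the choice of assigning a boundary value such as 20% to the upper range is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    OBVIOUS = "obvious"        # knowledge that cannot be contradicted
    CONSENSUAL = "consensual"  # domain knowledge shared among experts
    ASSUMPTION = "assumption"  # knowledge the user wants to check


# The five confidence ranges from the text, as (low, high) fractions.
CONFIDENCE_LEVELS = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]


@dataclass(frozen=True)
class KnowledgeRule:
    condition: frozenset     # items of the "if" part
    consequence: frozenset   # items of the "then" part
    confidence_level: int    # index into CONFIDENCE_LEVELS (0..4)
    certainty: Certainty

    def confidence_range(self):
        """Return the (low, high) confidence range given by the expert."""
        return CONFIDENCE_LEVELS[self.confidence_level]


def confidence_level_of(confidence):
    """Map a confidence in [0, 1] to the index of its range.

    Assumption: a boundary value belongs to the upper range, except 1.0.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return min(int(confidence * 5), 4)


# The expert assumption used later in the paper, with a 60%-80% confidence.
K = KnowledgeRule(
    condition=frozenset({("book", "JAVA"), ("buyer", "student")}),
    consequence=frozenset({("bookshop", "Academic")}),
    confidence_level=3,
    certainty=Certainty.ASSUMPTION,
)
```

Storing the range index rather than a single number keeps the expert's estimate as coarse as it was given, which is what the comparison with extracted rules relies on.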

4.2 Data Understanding

Data understanding means selecting and describing source data in order to capture their semantics and reliability. During this step, the ontology is built in order to identify domain concepts and the relationships between them (the objective is to select among the data the most interesting attributes according to the business objectives), to resolve ambiguities within the data and to choose data discretization levels.

Consequently, the ontology formalizes domain concepts and information about data. This ontology is an application ontology; it contains the knowledge essential to drive data mining tasks. Ontology concepts are related to domain concepts, whereas the relationships between them model database relationships. During the next step, data preparation (cf. section 4.3), a relational database called the Mining Oriented DataBase (MODB) will be built.

ICEIS 2008 - International Conference on Enterprise Information Systems

Figure 2: Bookshop ontology snapshot.

In order to understand the links between the MODB and the ontology, it is necessary to define the notions of domain, concept and relationship:

• Domain: in the KEOPS methodology, this notion refers to the notion of domain in relational theory. A domain represents a set of values associated with a semantic entity (or concept).

• Concept: Each concept of the ontology has a

property deﬁning its role. There exist two classes

of concepts: attribute concepts and value con-

cepts.

– An attribute concept is identiﬁed by a name and

a domain.

– Each value of the domain is called a value concept.

Thus a domain is described by an attribute concept and by value concepts organized into a taxonomy. Each MODB attribute is linked to one and only one attribute concept and takes its values in the associated domain. In figure 2, “Bookshop” is an attribute concept, “Academic” is a value concept and the set {Academic, General, Sciences, Letters} defines the “Bookshop” domain.

• Relationships: There exist three kinds of relationships between concepts:

– A data-related relationship: the “valueOf” relationship between an attribute concept and a value concept. The set of value concepts linked to an attribute concept by the “valueOf” relationship defines a domain within the MODB.

– A subsumption relationship between two value concepts. A concept subsumed by another one is a member of the same domain. This relationship is useful during data preparation (to select data granularity in datasets), reduction of rule volume (to generate generalized association rules, see 4.4.1), comparison between models and knowledge (to consider sibling and ancestor concepts) and final results visualization.

– Semantic relationships between value concepts. These relationships can be order, composition, exclusion or equivalence relationships. They can be used to compare extracted models with knowledge and to visualize results.
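The notions of attribute concept, value concept, “valueOf” and subsumption defined above can be illustrated with a minimal sketch. The class `Ontology` and its methods are hypothetical names, not part of the KEOPS software, and placing “Sciences” and “Letters” under “Academic” is an assumption about figure 2, which is not reproduced here. Semantic relationships (order, composition, exclusion, equivalence) are left out.

```python
class Ontology:
    """Toy application ontology: valueOf and subsumption links only."""

    def __init__(self):
        self.value_of = {}  # value concept -> attribute concept ("valueOf")
        self.parent = {}    # value concept -> subsuming value concept

    def add_value(self, attribute, value, parent=None):
        """Register a value concept, optionally subsumed by `parent`."""
        if parent is not None and self.value_of.get(parent) != attribute:
            # A concept subsumed by another one is a member of the same domain.
            raise ValueError("a subsumed concept must share its parent's domain")
        self.value_of[value] = attribute
        if parent is not None:
            self.parent[value] = parent

    def domain(self, attribute):
        """All value concepts linked to `attribute` by valueOf."""
        return {v for v, a in self.value_of.items() if a == attribute}

    def ancestors(self, value):
        """Subsuming concepts of `value`, nearest first."""
        result = []
        while value in self.parent:
            value = self.parent[value]
            result.append(value)
        return result


onto = Ontology()
onto.add_value("Bookshop", "Academic")
onto.add_value("Bookshop", "General")
onto.add_value("Bookshop", "Sciences", parent="Academic")  # assumed hierarchy
onto.add_value("Bookshop", "Letters", parent="Academic")   # assumed hierarchy
```

With this structure, `onto.domain("Bookshop")` yields the four value concepts of the “Bookshop” domain, and `onto.ancestors("Sciences")` walks the subsumption links upward.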

The KEOPS methodology aims to extract models that are interesting according to user knowledge. Consequently, during ontology construction it is necessary to pay attention to some usual data mining concerns:

• Aggregation level: like data, ontology concepts have to represent disjoint domains.

• Discretization level: ontology concepts have to model various solutions for data discretization. Bad choices may affect the efficiency of the modeling step.

• Data correlation: if concepts are strongly related in the MODB, extracted models might be trivial and uninteresting.

Since these concerns are beyond the scope of this paper, the reader should refer to (De Leenheer and de Moor, 2005) for better insight into concept elicitation and to (Berka and Bruha, 1998) for better insight into discretization and grouping.

Table 1: Bookshop ontology concept elicitation.

Source Data        Attribute Concept    Value Concept
St Denis Shop      Bookshop             Academic
St Denis Shop      Location             St Michel bd
Rive Gauche 5th    Bookshop             General
Rive Gauche 5th    Location             5th District

Example. Let us take the case of a bookstore company with several bookshops in Paris which plans to improve customer relationships. Bookshops may be specialized in a field, such as “academic”, or not (general) (see figure 2). Bookshops are located geographically. Data are provided on bookshops, customers and sales. Table 1 shows one way of mapping source values to ontology concepts.
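The mapping of Table 1 can be written down as a simple lookup. This is only an illustrative sketch: the names `ELICITATION` and `to_modb_row` are hypothetical, and the paper does not describe how the KEOPS software stores this mapping.

```python
# Source value -> {attribute concept: value concept}, copied from Table 1.
ELICITATION = {
    "St Denis Shop": {"Bookshop": "Academic", "Location": "St Michel bd"},
    "Rive Gauche 5th": {"Bookshop": "General", "Location": "5th District"},
}


def to_modb_row(source_value):
    """Translate a source value into attribute/value concepts."""
    if source_value not in ELICITATION:
        raise KeyError(f"no ontology mapping for source value {source_value!r}")
    # Return a copy so that callers cannot alter the mapping itself.
    return dict(ELICITATION[source_value])
```

Failing loudly on an unmapped source value is a design choice made for this sketch: it surfaces data that the ontology does not yet cover instead of silently dropping it.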


4.3 Data Preparation

Data preparation is highly iterative and time consuming. The objective is to refine data: discretize, clean and build new attributes and values in the MODB. During this step, KEOPS suggests building the MODB by mapping original data to ontology concepts. The database contains only bottom-level ontology concepts. The objective is to structure knowledge and data in order to make mining tasks efficient and to save time spent on data preparation. The idea is to allow the generation of multiple datasets from the MODB using ontology relationships, without another preparation step from raw data. Furthermore, during ODIS construction, experts can express their knowledge using the ontology, which is consistent with the data.

4.3.1 Mining Oriented Database (MODB)

Construction

Databases often contain several tables sharing similar information. However, it is desirable that each MODB table gathers all semantically close information, and it is important that these tables observe normal forms. During dataset generation, joins can easily be used to create interesting datasets to be mined. These datasets, however, do not have to observe normal forms.

4.3.2 Datasets Generation

It’s often necessary, in a data mining process, to

step back to data preparation. Algorithms were pro-

posed to choose relevant attributes among large data

sources. However, sometimes results don’t satisfy

user expectations and datasets have to be built again

to run new tests. KEOPS methodology suggests using

the ontology in order to describe domain values and

relationships between these values. Consequently,

various datasets could be generated according to ex-

pert user choices.

The ontology driven information system allows any data preparation strategy to be chosen, providing various datasets from the same source values. A dataset is built using the following operators:

• Traditional relational algebra operators: projection, selection and join.

• Data granularity: this operator allows choosing, among the ontology concepts, those which will be in the mining oriented database.

In order to generate datasets, we developed software whose inputs are the MODB and user parameters and whose outputs are new datasets. The user can graphically select relational algebra operators and data granularity. Since database attributes and values are also ontology concepts, the KEOPS methodology and the KEOPS software make the data preparation task easier.
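One possible reading of the data granularity operator is sketched below: each value is replaced by its nearest subsuming concept that belongs to the level chosen by the user. The function names are hypothetical, the concept “Programming” is invented for the example, and the actual KEOPS software may proceed differently.

```python
def generalize(value, parent, chosen):
    """Return `value` or its nearest ancestor that belongs to `chosen`."""
    current = value
    while current not in chosen:
        if current not in parent:
            raise ValueError(f"{value!r} has no ancestor among the chosen concepts")
        current = parent[current]
    return current


def apply_granularity(rows, attribute, parent, chosen):
    """Build a new dataset where `attribute` is expressed at the chosen level."""
    return [
        {**row, attribute: generalize(row[attribute], parent, chosen)}
        for row in rows
    ]


# Subsumption links (child -> parent); "Programming" is a hypothetical concept.
PARENT = {"J2EE": "JAVA", "JAVA": "Programming"}

rows = [
    {"book": "J2EE", "bookshop": "Academic"},
    {"book": "JAVA", "bookshop": "General"},
]
coarse = apply_granularity(rows, "book", PARENT, chosen={"JAVA"})
```

Because the operator returns new rows and leaves the input untouched, several datasets at different granularities can be generated from the same stored data, which is the point made in the text.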

4.4 Evaluation

This step assesses to what extent models meet the

business objectives and seeks to determine if there is

some business reason why these models are deﬁcient.

Furthermore, algorithms may generate lots of mod-

els according to parameters chosen for the extraction.

That’s why evaluation is an important task in KEOPS

methodology in order to select the most interesting

models according to expert knowledge.

4.4.1 Rule Volume Reduction

We choose an association rule extraction algorithm

which can generate bases containing only minimal

non-redundant rules without information loss. Then,

these rules are ﬁltered to suppress semantic redun-

dancies. KEOPS methodology is based on Srikant’s

generalized association rules deﬁnition (Srikant and

Agrawal, 1995). These rules are minimal because

they forbid all irrelevant relationships within their

items. We give a formal deﬁnition below:

Let T be a taxonomy of items. R : A → C is called

generalized association rule if:

• A ⊂ T

• C ⊂ T

• No item in C is an ancestor of any item in A or C

• No item in A is an ancestor of any item in A

Consequently relationships appearing within

these rules are semantic and generalization relation-

ships from C items to A items. The objective is to

maximize information level in minimal rules. The

last step consists of replacing a set of these rules by a

more generalized one.
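The four conditions of the definition above can be checked mechanically. The sketch below assumes the taxonomy is given as a child-to-parent map without cycles; the function names are hypothetical, and the subset tests are written with `<=` (subset or equal), which is an interpretation of the ⊂ symbol used in the definition.

```python
def ancestors(item, parent):
    """Return the set of all ancestors of `item` in the taxonomy."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def is_generalized_rule(antecedent, consequent, taxonomy_items, parent):
    """Check the conditions of a generalized association rule A -> C."""
    a, c = set(antecedent), set(consequent)
    # A and C must be made of taxonomy items.
    if not (a <= taxonomy_items and c <= taxonomy_items):
        return False
    anc_a = set().union(*(ancestors(i, parent) for i in a))
    anc_c = set().union(*(ancestors(i, parent) for i in c))
    # No item in C is an ancestor of any item in A or C.
    if c & (anc_a | anc_c):
        return False
    # No item in A is an ancestor of any item in A.
    return not (a & anc_a)


# Small taxonomy built from the bookshop example (child -> parent).
PARENT = {"J2EE": "JAVA", "Sciences": "Academic"}
ITEMS = {"J2EE", "JAVA", "Sciences", "Academic"}
```

For instance, `is_generalized_rule({"J2EE"}, {"Academic"}, ITEMS, PARENT)` holds, whereas a rule whose antecedent contains both “J2EE” and its ancestor “JAVA” is rejected.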

4.4.2 Rule Interestingness Evaluation

KEOPS methodology suggests comparing extracted

rules with expert’s knowledge. Extracted rules having

one or more items that are in relationship with some

knowledge rules items (i.e. value concepts are linked

in the ontology) have to be selected. Then, for each

pair knowledge rule/extracted rule:

• The extracted rule antecedent coverage is compared to the knowledge rule antecedent coverage, then the extracted rule consequent coverage is compared to the knowledge rule consequent coverage.


Table 2: Interestingness measure if confidence levels are similar.

                     Rule R informative level
Kind of knowledge    More than K    Similar    Less than K
Obvious              weak           none       none
Consensual           medium         weak       weak
Assumption           strong         medium     medium

• By coverage comparison the most informative

rule is deduced, i.e. rule predicting the largest

consequence from the smallest condition.

• IMAK interestingness measure is applied (Bris-

son, 2007). This measure evaluates rule qual-

ity considering relative conﬁdence values, relative

information levels and knowledge certainty (see

section 4.1).

Thus, ontology driven information systems are useful for formalizing domain concepts, expressing knowledge, generating models and facilitating the ontology-based comparison of knowledge and models.
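The selection of extracted rules that are in relationship with a knowledge rule can be sketched as follows. This sketch makes two assumptions that the text does not state: only subsumption links are considered (semantic and sibling relationships are ignored), and two identical items count as related. All names are hypothetical, and the taxonomy is assumed to be free of cycles.

```python
def ancestors(concept, parent):
    """Return every concept that subsumes `concept` in the taxonomy."""
    found = set()
    while concept in parent:
        concept = parent[concept]
        found.add(concept)
    return found


def related(x, y, parent):
    """True if x and y are identical or one subsumes the other (assumption)."""
    return x == y or x in ancestors(y, parent) or y in ancestors(x, parent)


def comparable(extracted_items, knowledge_items, parent):
    """True if at least one extracted item is related to a knowledge item."""
    return any(
        related(e, k, parent)
        for e in extracted_items
        for k in knowledge_items
    )


# Subsumption links of the bookshop example (child -> parent).
PARENT = {"J2EE": "JAVA", "student": "youngs"}
K_ITEMS = {"JAVA", "student", "Academic"}
R_ITEMS = {"J2EE", "youngs", "Academic"}
```

Here `comparable(R_ITEMS, K_ITEMS, PARENT)` holds, so the pair would go on to the coverage comparison and to the interestingness measure.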

Example. Let us assume that a domain expert makes the following assumption: “If a student wants to buy a book about JAVA he comes to an academic bookshop.” and gives it a 60%-80% estimation of confidence. Let us assume that the extracted rule is slightly different because it says that “Every young customer buying a book about J2EE comes to an academic bookshop” and has 75% confidence.

Assumption K. book=’JAVA’ ∧ buyer=’student’ →

bookshop=’Academic’

Extracted Rule R. book=’J2EE’ ∧ buyer=’youngs’

→ bookshop=’Academic’

According to the KEOPS methodology, these two rules are said to be comparable because at least one extracted rule item is in relationship with a knowledge rule item: ’youngs’ is more general than ’student’ and ’JAVA’ is more general than ’J2EE’. Then, the algorithm compares the coverage of these two rules in order to determine the more informative one. Let us assume that R is more informative than K. Since these two rules have similar confidence, we can use table 2 to evaluate the interestingness of the extracted rule (similar tables for various confidence levels are presented in (Brisson, 2007)). Since the knowledge is an assumption, the interestingness degree of the extracted rule is strong.
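The last step of this example is a lookup in table 2. The sketch below encodes only that table, which applies when confidence levels are similar; the tables for other confidence levels are given in (Brisson, 2007) and are not reproduced. The names `TABLE_2` and `interestingness_similar_confidence` are hypothetical.

```python
# Table 2: kind of knowledge -> informative level of R relative to K -> degree.
TABLE_2 = {
    "obvious":    {"more": "weak",   "similar": "none",   "less": "none"},
    "consensual": {"more": "medium", "similar": "weak",   "less": "weak"},
    "assumption": {"more": "strong", "similar": "medium", "less": "medium"},
}


def interestingness_similar_confidence(certainty, informative_level):
    """Interestingness degree of R when R and K have similar confidence.

    certainty: "obvious", "consensual" or "assumption".
    informative_level: "more", "similar" or "less" (R compared to K).
    """
    return TABLE_2[certainty][informative_level]


# The example above: K is an assumption and R is more informative than K.
degree = interestingness_similar_confidence("assumption", "more")
```

In this example `degree` is `"strong"`, which matches the conclusion drawn in the text.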

5 EXPERIMENTS

Although we illustrated the KEOPS methodology in this paper with the bookstore example, we ran experiments on real data provided by the French Family Allowance Office (CAF: Caisses d’allocations familiales). In this section we do not report specific rules about the behavior of allowance beneficiaries (for privacy reasons) but only the reliability of the extracted rules. These results show that we are able to select relevant rules to provide to experts for final human evaluation.

CAF data were extracted during 2004 in the town of Grenoble (France). Each row describes one contact between the office and a beneficiary with 15 attributes, and data about 443716 contacts were provided. We ran the CLOSE algorithm and extracted 4404 association rules. The interestingness measure, IMAK, helps to filter the best ones. Figure 5 plots the 4404 rules according to confidence and lift.

The experiments illustrated by figures 3 and 4 compare these rules to a specific piece of knowledge. We may observe that only a few of the extracted rules are selected. The selection condition is to match the knowledge and to have an interestingness value greater than 0. In these figures, the interestingness value is illustrated by the dot size.

In figure 3, the lift of the selected rules is greater than 1 and often greater than the knowledge lift (lift equals 1 at independence). Furthermore, some extracted rules have a better confidence but a smaller support: they illustrate the discovery of rare events, which could be very interesting for expert users.
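The three objective indexes discussed here can be computed from simple counts. The sketch below uses the standard definitions of support, confidence and lift for a rule A → C; the function name and the counts in the example are illustrative and are not taken from the CAF data.

```python
def rule_metrics(n, n_a, n_c, n_ac):
    """Support, confidence and lift of a rule A -> C.

    n:    total number of rows
    n_a:  rows where the condition A holds
    n_c:  rows where the consequence C holds
    n_ac: rows where both A and C hold
    """
    if min(n, n_a, n_c) <= 0:
        raise ValueError("n, n_a and n_c must be positive")
    support = n_ac / n
    confidence = n_ac / n_a
    # Lift compares the confidence to the overall frequency of C;
    # it equals 1 when A and C are independent.
    lift = confidence / (n_c / n)
    return support, confidence, lift


# Illustrative counts: A and C are independent here, so the lift is 1.
support, confidence, lift = rule_metrics(n=1000, n_a=200, n_c=500, n_ac=100)
```

A rare but reliable rule shows up as a small `support` together with a high `confidence` and a `lift` well above 1, which is the situation described for some of the selected rules.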

Figure 4 shows some results for another specific piece of knowledge. We may observe again that only a few rules are selected. These rules offer various trade-offs, allowing the selection of rare events (low support and high confidence) or general rules (high support and good confidence) to provide to domain experts.

As future work, we plan to evaluate rules selected

by KEOPS software with the help of some expert

groups who are able to validate their semantic rele-

vance.


Figure 3: Extracted rules (dots) matching knowledge rule 335 (square) (IMAK interestingness value increases with dot size). a) Confidence vs Lift - b) Confidence vs Support.

Figure 4: Extracted rules (dots) matching knowledge rule 565 (square) (IMAK interestingness value increases with dot size). a) Confidence vs Lift - b) Confidence vs Support.

6 CONCLUSIONS

Managing domain knowledge during the data min-

ing process is currently an important research issue

in the data mining ﬁeld. In this paper, we presented

the so-called KEOPS methodology for integrating ex-

pert knowledge all along the data mining process in a

coherent and uniform manner.

We built an ontology driven information system (ODIS) based on an application ontology, a knowledge base and a mining oriented database rebuilt from source raw data. Thus, expert knowledge is used during the business and data understanding, data preparation and model evaluation steps. We showed that integrating expert knowledge during the first step gives experts better insight into the whole data mining process. In the last step we introduced IMAK, a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge.

We developed the KEOPS software in order to run experiments. Experimental results show that the IMAK measure helps to select a reduced rule set among data mining results. These rules offer various trade-offs, allowing experts to select rare events or more general rules which are relevant according to their knowledge.


Figure 5: Conﬁdence vs Lift of all of the extracted rules.

REFERENCES

Becker, H. S. (1976). Sociological Work: Method and Sub-

stance. Transaction Publishers, U. S.

Berka, P. and Bruha, I. (1998). Discretization and grouping:

Preprocessing steps for data mining. In PKDD, pages

239–245.

Brisson, L. (2007). Knowledge extraction using a concep-

tual information system (excis). In Ontologies-Based

Databases and Information Systems, volume 4623 of

Lecture notes in computer science, pages 119 – 134,

Berlin, Heidelberg. Springer.

Ceri, S. and Fraternali, P. (1997). Designing Database Ap-

plications with Objects and Rules: The IDEA Method-

ology. Series on Database Systems and Applications.

Addison Wesley.

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS Inc.

De Leenheer, P. and de Moor, A. (2005). Context-driven

disambiguation in ontology elicitation. In Shvaiko, P.

and Euzenat, J., editors, Context and Ontologies: The-

ory, Practice and Applications, pages 17–24, Pitts-

burgh, Pennsylvania. AAAI, AAAI Press.

Guarino, N. (1998). Formal Ontology in Information

Systems. IOS Press, Amsterdam, The Netherlands.

Amended version of previous one in Proceedings

of the 1st International Conference June 6-8, 1998,

Trento, Italy.

Guarino, N., Masolo, C., and Vetere, G. (1998). Ontoseek:

Using large linguistic ontologies for gathering infor-

mation resources from the web. Technical report,

LADSEB-CNR.

Kedad, Z. and Métais, E. (2002). Ontology-based data cleaning. In NLDB ’02: Proceedings of the 6th International Conference on Applications of Natural Language to Information Systems-Revised Papers, pages 137–149, London, UK. Springer-Verlag.

McGarry, K. (2005). A survey of interestingness measures

for knowledge discovery. Knowl. Eng. Rev., 20(1):39–

61.

Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1999). Closed set based discovery of small covers for association rules. In Actes des 15èmes journées Bases de Données Avancées (BDA’99), pages 361–381.

Penarrubia, A., Fernandez-Caballero, A., Gonzalez, P.,

Botella, F., Grau, A., and Martinez, O. (2004).

Ontology-based interface adaptivity in web-based

learning systems. In ICALT ’04: Proceedings of the

IEEE International Conference on Advanced Learn-

ing Technologies (ICALT’04), pages 435–439, Wash-

ington, DC, USA. IEEE Computer Society.

Srikant, R. and Agrawal, R. (1995). Mining generalized as-

sociation rules. In VLDB ’95: Proceedings of the 21th

International Conference on Very Large Data Bases,

pages 407–419, San Francisco, CA, USA. Morgan

Kaufmann Publishers Inc.
