NOTES ON THE ARCHITECTURAL DESIGN OF TMINER

Design and Use of a Component-based Data Mining Framework

Fernando Berzal, Juan-Carlos Cubero and Aída Jiménez

Department of Computer Science and Artiﬁcial Intelligence, University of Granada

ETSIIT, C/ Periodista Daniel Saucedo Aranda, s/n, 18071 Granada, Spain

Keywords:

Data Mining, component-based systems, application frameworks, software architecture, design patterns.

Abstract:

This paper describes the rationale behind some of the key design decisions that guided the development of

the TMiner component-based data mining framework. TMiner is a ﬂexible framework that can be used as a

stand-alone tool or integrated into larger Business Intelligence (BI) solutions. TMiner is a general-purpose

component-based system designed to support the whole KDD process within a single framework and thus

facilitate the implementation of complex data mining scenarios.

1 INTRODUCTION

Traditional on-line transaction processing systems,

also known as OLTP systems, work with relatively

small chunks of data at a time, while on-line analyt-

ical processing systems, or OLAP systems, require

the analysis of huge amounts of data (Chaudhuri and

Dayal, 1997). It comes as no surprise that OLAP sys-

tems have very speciﬁc needs that conventional appli-

cation frameworks do not properly address. This fact

has led to the development of data mining (Tan et al.,

2006) (Han and Kamber, 2006) and data warehousing

(Widom, 1995) (Kimball and Ross, 2002), which try

to satisfy the expectations of the so-called knowledge

workers (executives, managers, and analysts).

This paper describes the rationale behind some

key design decisions that led to the development of

a component-based data mining framework called

TMiner. As we will see, TMiner can be used as a

ﬂexible stand-alone data mining tool, but it has also

been designed so that it can be easily incorporated

into larger Business Intelligence solutions.

It should be noted that the tools and techniques

TMiner collects somewhat overlap with existing Ma-

chine Learning algorithm collections, such as Weka

(Witten and Frank, 2005). However, TMiner is more

than a mere collection of independent algorithms for

data mining tasks that can be directly applied on pre-

pared datasets or invoked from your own code.

Some open-source and commercial data mining li-

braries (Prudsys, 2008) (Rapid-I, 2008) include facili-

ties for their integration into actual enterprise systems.

TMiner also provides usage modes specially designed

for its tight integration into larger solutions.

TMiner is a general-purpose component-based

system designed to support the whole KDD process

within a single framework and thus facilitate the

implementation of complex data mining scenarios. In

this sense, TMiner is designed to be useful in a wide

variety of application domains, in sharp contrast to

domain-speciﬁc data mining systems such as iKDD or

SA. While the Interactive Knowledge Discovery and

Data mining system, iKDD, was designed for partic-

ular bioinformatics-related problems (Etienne et al.,

2006), Perttu Laurinen’s Smart Archive, SA, has been

proposed for implementing data mining applications

using data streams (Laurinen et al., 2005).

The rest of our paper is organized as follows. Sec-

tion 2 describes the architectural design of the TMiner

framework and its component model. Section 3 de-

scribes the facilities TMiner offers for different usage

scenarios, from the casual user who wants to perform

simple data analysis tasks and the researcher who

needs a more thorough experimentation, to the sys-

tems integrator who needs to incorporate data mining

features into ﬁnal solutions. Finally, Section 4 con-

cludes our paper with some comments on the current

status of TMiner and our expectations for its future.

Berzal F., Cubero J. and Jiménez A. (2008).

NOTES ON THE ARCHITECTURAL DESIGN OF TMINER - Design and Use of a Component-based Data Mining Framework.

In Proceedings of the Third International Conference on Software and Data Technologies - ISDM/ABF, pages 98-103

DOI: 10.5220/0001870600980103

 SciTePress













[Class diagram showing TMinerComponent, TMinerComponentMetadata, Port, TMinerModel, and TMinerTask]


Figure 1: TMiner component model base classes.

2 TMINER COMPONENT MODEL

This section describes the TMiner component model,

what TMiner components look like and the basic in-

frastructure provided by the TMiner framework for

the design and use of new components.

2.1 TMiner Components

A software component is “a unit of composition with

contractually deﬁned interfaces and explicit context

dependencies” (Szyperski et al., 2002). These con-

text dependencies are speciﬁed by stating the re-

quired interfaces and the acceptable execution plat-

form(s) for the software component. TMiner com-

ponents have well-deﬁned I/O ports whose speciﬁc

properties can be easily consulted by means of the

TMinerComponentMetadata object that is attached

to every TMinerComponent (see Figure 1).

Software components are units of independent deployment,
in contrast to objects in object-oriented
programming (OOP), which are mere units of instantiation.
Components are also units of third-party composition
and they typically have no (externally) observable
state. However, like objects in OOP, they
encapsulate their state and behavior.

Obviously, components still act through objects in

object-oriented systems, but they do not only contain

classes. They can also contain additional resources.

In TMiner, each component must include a component
descriptor describing its I/O ports and any additional
metadata that might be required for the component
to be used in practice. Figure 2 shows an example
of such a descriptor in XML format. The component
descriptor includes the names, textual descriptions,
and required interfaces for all the component
I/O ports, as well as default values for component
parameters. This information is extremely useful, for
instance, in the automatic generation of the user interface
for data mining tools (component developers do
not need to worry about user interface issues and
can focus on the development of the components
themselves).

<?xml version="1.0" encoding="utf-8" ?>
<component>
  <type>tminer.kdd.association.ItemsetMiner</type>
  <name>Itemset miner</name>
  <description>TMiner base itemset miner</description>
  <inputs>
    <input>
      <id>dataset</id>
      <name language="en">Dataset</name>
      <description language="en">Input dataset</description>
      <type>tminer.data.Dataset</type>
    </input>
    <input>
      <id>encoder</id>
      <name language="en">Dataset encoder</name>
      <description language="en">Dataset encoder</description>
      <type>tminer.data.instance.DatasetEncoder</type>
    </input>
  </inputs>
  <outputs>
    <output>
      <id>itemsets</id>
      <name language="en">Itemset collection</name>
      <description language="en">Itemsets</description>
      <type>tminer.kdd.association.ItemsetCollection</type>
    </output>
  </outputs>
</component>

Figure 2: TMiner component descriptor.
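As a rough illustration of how a component descriptor could drive user interface generation, the following JavaScript sketch turns a descriptor, already parsed into a plain object, into a list of form-field specifications. The object layout, the buildFormFields helper, the widget names, and the minSupport parameter with its default value are assumptions made for this example; they are not part of the TMiner API.

```javascript
// Hypothetical in-memory form of a component descriptor (assumed layout).
const descriptor = {
  type: "tminer.kdd.association.ItemsetMiner",
  name: "Itemset miner",
  inputs: [
    { id: "dataset", name: "Dataset", type: "tminer.data.Dataset" },
    { id: "encoder", name: "Dataset encoder", type: "tminer.data.instance.DatasetEncoder" },
    // Illustrative parameter with a default value (invented for this sketch).
    { id: "minSupport", name: "Minimum support", type: "number", defaultValue: 0.05 }
  ]
};

// Builds one form-field specification per input port.
function buildFormFields(desc) {
  return desc.inputs.map(function (port) {
    return {
      field: port.id,
      label: port.name,
      // Numeric ports become numeric inputs; any other port becomes a selector
      // from which the user picks an existing object of the required type.
      widget: port.type === "number" ? "number-input" : "object-selector",
      value: port.defaultValue !== undefined ? port.defaultValue : null
    };
  });
}

const fields = buildFormFields(descriptor);
console.log(fields);
```

Because the form is derived entirely from the descriptor, adding a port to a component would change the generated form without any user interface code being written by the component developer.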

In a data mining framework such as TMiner, the

user has to analyze large datasets with the help of min-

ing tools and techniques. Data is gathered from differ-

ent data sources and data mining algorithms are used

in order to build knowledge models (Berzal et al.,

2002). Hence TMiner components fall into two main

categories:

• TMinerModels represent the entities data miners

work with. They may provide the information our

user needs to access different data sources (i.e.

dataset metadata). They can also be descriptive or

predictive models built from those data sources.

They can even be used as the input to other min-

ing algorithms in order to solve second-order data

mining problems.

• TMinerTasks represent the tasks data miners

must perform to analyze data. They are the active


objects that users need to build new models.
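The relationship between these two categories can be sketched in a few lines of JavaScript. Only the class names TMinerComponent, TMinerComponentMetadata, TMinerModel, and TMinerTask come from Figure 1; the constructor arguments, the run method, and the toy RowCount and CountRows classes are assumptions introduced for illustration.

```javascript
// Metadata attached to every component: describes its I/O ports (assumed shape).
class TMinerComponentMetadata {
  constructor(inputs, outputs) {
    this.inputs = inputs;
    this.outputs = outputs;
  }
}

// Common base class: every component carries its metadata.
class TMinerComponent {
  constructor(metadata) {
    this.metadata = metadata;
  }
}

// Models are the entities data miners work with (datasets, classifiers, itemsets).
class TMinerModel extends TMinerComponent {}

// Tasks are the active objects that build new models.
class TMinerTask extends TMinerComponent {
  run() {
    throw new Error("run() must be implemented by concrete tasks");
  }
}

// Toy model, purely illustrative.
class RowCount extends TMinerModel {
  constructor(count) {
    super(new TMinerComponentMetadata([], []));
    this.count = count;
  }
}

// Toy task that builds a RowCount model from its input rows.
class CountRows extends TMinerTask {
  constructor(rows) {
    super(new TMinerComponentMetadata(["rows"], ["rowCount"]));
    this.rows = rows;
  }
  run() {
    return new RowCount(this.rows.length);
  }
}

const rowCountModel = new CountRows([{ a: 1 }, { a: 2 }, { a: 3 }]).run();
console.log(rowCountModel.count);
```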

All components in TMiner must be serializable

(i.e. they can be stored as byte sequences for

later reconstruction). This is necessary because

TMinerModels must be stored for later use, while

TMinerTasks might need to be transferred to differ-

ent processing nodes in a distributed computing envi-

ronment.
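A minimal sketch of this requirement follows. The paper does not state which serialization mechanism TMiner uses, so JSON encoded as UTF-8 bytes is assumed here purely for illustration; the serialize and deserialize helpers and the fields of the task object are hypothetical.

```javascript
// Encodes a component's state as a byte sequence (JSON in UTF-8, an assumption).
function serialize(component) {
  return new TextEncoder().encode(JSON.stringify(component));
}

// Rebuilds the component state from the byte sequence.
function deserialize(bytes) {
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Hypothetical task state that might be shipped to another processing node.
const pendingTask = {
  type: "classification.tdidt.TDIDTClassifier",
  parameters: { divisionRule: "GainRatio", pruning: true, pruningCF: 0.25 }
};

const taskBytes = serialize(pendingTask);
const restoredTask = deserialize(taskBytes);
console.log(taskBytes.length, restoredTask.parameters.divisionRule);
```

The same round trip serves both purposes mentioned above: the bytes can be written to storage for later reconstruction or sent to a remote node.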

Additionally, TMiner core classes include some

standardized interfaces designed to simplify the rep-

resentation, distribution, and use of the models dis-

covered in the KDD process. For instance, XMLable

components can be represented as XML documents,

hence facilitating their storage for later use, as well as

their visualization in standard web browsers (with the

help of the corresponding XSLT style sheets). Likewise,

SQLable components, such as many different

kinds of symbolic classiﬁcation models, can be con-

verted into SQL scripts that might be invaluable in

practice, since they provide a very convenient method

for using such models on relational databases.
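To make the idea of an SQLable model concrete, the sketch below converts a small decision tree into a SQL CASE expression. The tree structure, the toCase and toSQL functions, and the table and column names are invented for this example and do not reflect how TMiner actually generates its scripts.

```javascript
// A toy decision tree: internal nodes test a condition, leaves carry a class label.
const tree = {
  condition: "age < 30",
  yes: { label: "low" },
  no: {
    condition: "income >= 50000",
    yes: { label: "high" },
    no: { label: "medium" }
  }
};

// Recursively renders the tree as a nested SQL CASE expression.
function toCase(node) {
  if (node.label !== undefined) {
    return "'" + node.label + "'";
  }
  return "CASE WHEN " + node.condition +
         " THEN " + toCase(node.yes) +
         " ELSE " + toCase(node.no) + " END";
}

// Wraps the expression in a query that labels every row of a table.
function toSQL(node, table, classColumn) {
  return "SELECT *, " + toCase(node) + " AS " + classColumn + " FROM " + table;
}

const sql = toSQL(tree, "customers", "predicted_class");
console.log(sql);
```

Once a symbolic model is expressed this way, it can be applied inside the database itself, without moving the data to the mining tool.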

Any information system can be described by a

structure, a mechanism, and a policy following Perry

and Kaiser’s SMP model (Perry and Kaiser, 1991). In

our case, TMiner models determine the structure of

the system. TMiner tasks, which are responsible for

the implementation of data mining techniques, are the

mechanisms that let us solve data mining problems.

Finally, the set of rules and strategies imposed by the

system environment are used to establish its usage and

security policies. This is the job of the TMiner frame-

work we now proceed to describe.

2.2 The TMiner Framework

A component framework is “a collection of rules and

interfaces (contracts) that govern the interaction of

components plugged into the framework” (Szyperski

et al., 2002). Component frameworks can also be seen

as components that plug into higher-level component

frameworks (e.g. when integrated into larger solu-

tions).

Component-based frameworks are intended to

help developers to build increasingly complex sys-

tems, enhance productivity and promote component

reuse by means of well-deﬁned patterns (Fayad and

Schmidt, 1997) (Larsen, 2000). Such frameworks are

widely used in enterprise information systems, but
they usually provide only low-level information processing
capabilities, since they are oriented toward OLTP
applications. In contrast, TMiner is a component-based
framework that has been custom-tailored to solve decision
support problems, even though it should be
noted that its approach could also be of use in a wide
range of scientific computing applications.

[Diagram of the pattern's roles: Client, Factory Proxy, Remote Proxy, Component, Container, and Persistence Service]

Figure 3: The Enterprise Component Framework.

Most commercial component-based frameworks,

such as Sun Microsystems Enterprise JavaBeans or

the Microsoft .NET Common Language Runtime,

are based on a common architectural pattern, known

as the Enterprise Component Framework (Kobryn,

2000). A simpliﬁed representation of this pattern is

depicted in Figure 3. This pattern contains six roles,

shown as rectangles in the ﬁgure, whose responsibili-

ties can be described as follows:

• Clients are the entities that request services from

a component in the framework. End users, com-

ponent developers, researchers, and other auto-

mated systems may act as clients in TMiner (see

Section 3 for more information on TMiner usage

modes).

• Components provide the services requested by

clients. As mentioned above, both data access

modules and knowledge models are full-ﬂedged

components in TMiner. Data mining algorithms

could also be considered as components on their

own, but they are just used through factory prox-

ies to build knowledge models.

• Proxies relay calls from clients to components.

This level of indirection is hidden from the client

perspective and it makes location transparency

possible (when needed, it also supports message

interception). Factory proxies perform object fac-

tory operations that are common to all frame-

work components (such as creation or retrieval),

while remote proxies handle operations that are

component-speciﬁc (e.g. inspection and param-

eter setting). Proxies are usually supported by the

reﬂection capabilities found in modern computing

platforms.

• The Container represents the framework’s run-

time environment. The container provides dis-

tributed computing services, load balancing, in-

terprocess communication, security, persistence,

resource discovery, and hot deployment mechanisms.

Transactions are usually supported by

enterprise frameworks but are not needed in the

TMiner data mining framework. TMiner, how-

ever, needs speciﬁc scheduling, monitoring, and

notiﬁcation mechanisms to manage data mining

tasks.

• Finally, the persistence service, typically man-

aged and coordinated by the container, is used

for the storage and retrieval of framework com-

ponents.

Component-based frameworks are suitable for

large-scale systems because they provide a solid foun-

dation upon which whole applications and product

lines can be deployed in a systematic and controlled

way. The TMiner architecture makes the development

of new techniques and algorithms faster. Once a new

component has been devised and tested in the labo-

ratory, it can be easily deployed into production en-

vironments with the help of a simple component de-

scriptor. This simpliﬁes maintenance and upgrade ef-

forts, hence smoothing the evolution any system must

face during its lifetime.

2.3 TMiner Subsystems

Apart from the common infrastructure needed to build

typical data mining tools, TMiner provides support

for ﬂexible data access and many of the most com-

mon data mining algorithms and techniques:

• The TMiner data access subsystem acts as

its extract, transform, and load (ETL) front-end

and it makes heavy use of well-known object-

oriented design patterns for improving its ﬂexibil-

ity (Gamma et al., 1995).

• The TMiner knowledge-discovery subsystem

provides a wealth of classiﬁcation models and

clustering techniques, as well as efﬁcient associa-

tion rule mining components and anomaly detec-

tion tools.

3 TMINER USAGE MODES

System usability is critical for user acceptance. Since

knowledge workers are not necessarily

knowledgeable about computers, TMiner must pro-

vide different usage modes for different usage scenar-

ios.

TMiner supports the progressive usage model or

triphasic model. This model recognizes that patterns

of usage evolve as users build experience and that sup-

porting these patterns requires speciﬁc and somewhat

different facilities within the user interface architec-

ture (Constantine and Lockwood, 1999). TMiner,

therefore, provides different usage modes for novice,

intermediate, and advanced users:

3.1 Basic Usage Modes

Novice users tend to perform basic tasks, with a lim-

ited use of alternatives. The underlying complexity of

the system should be invisible to this kind of user.

TMiner acknowledges this fact and the default values

typically included in a component descriptor let users

employ a data mining component ‘out of the box’.

Beginners usually interact with the system by trial

and error, and they have a strong dependence on

help and guidance. For them, TMiner provides an

attractive Web interface that incorporates novel vi-

sualization techniques in order to motivate explo-

ration, such as VisAR (Techapichetvanich and Datta,

2005). This rule visualization technique is based on

parallel coordinates, a common way of visualizing

high-dimensional geometry and analyzing multivariate

data (Inselberg, 1985).
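In parallel coordinates, each attribute gets its own vertical axis and each record becomes a polyline that crosses every axis at the record's value for that attribute. A minimal sketch of this mapping is given below; it illustrates the general idea only, not VisAR itself, and the toPolyline helper, the attribute ranges, and the sample record are hypothetical.

```javascript
// Maps one record to polyline vertices:
// x is the axis index, y is the attribute value scaled to [0, 1].
function toPolyline(record, ranges) {
  return ranges.map(function (range, axis) {
    const value = record[range.attribute];
    return { x: axis, y: (value - range.min) / (range.max - range.min) };
  });
}

// One axis per attribute, each with its own value range (assumed values).
const axisRanges = [
  { attribute: "age", min: 0, max: 100 },
  { attribute: "income", min: 0, max: 200000 },
  { attribute: "children", min: 0, max: 4 }
];

const polyline = toPolyline({ age: 25, income: 50000, children: 2 }, axisRanges);
console.log(polyline);
```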

The ability to share data among system users is
another aspect that is closely related to the usability
of a data mining system. TMiner lets casual system users
browse through already-discovered models and share
their own models with other system users. This
computer-supported cooperative work (CSCW) focus
is especially relevant in data mining applications,
where the discovered knowledge must be properly
represented and communicated (Berzal et al., 2003).

3.2 Intermediate Usage Modes

Intermediate TMiner users have expanding needs and

they typically exhibit changing patterns of interac-

tions. “Intermediates (those who are neither begin-

ners nor old hands, and who make up most of the user

population) are perhaps the most neglected user seg-

ment in terms of interface design, yet there are pos-

sibly more intermediate users than beginners or ex-

perts.” (Constantine, 1994)

The TMiner component model facilitates the construction

of data mining tools that let knowledgeable users

tune data mining techniques by playing with the

knobs TMiner components provide, i.e. their I/O

ports. Component descriptors are extremely useful

here, since they can be used to automatically gener-

ate the user interface needed for such knob-turning.


// Database connection details...
var database = {
  driver: "interbase.interclient.Driver",
  url: "jdbc:interbase://localhost/datasets.gdb",
  user: "SYSDBA",
  password: "*****"
};

// Dataset...
var id = "CENSUS";
var dataset = new tminer.data.wrapper.jdbc.JDBCDataset(
  id, database.user, database.password,
  database.url, database.driver);

// Attributes
var discreteAttributes = dataset.getAttributeIDs(
  tminer.data.type.StringType);
var continuousAttributes = dataset.getAttributeIDs(
  tminer.data.type.NumericType);

Figure 4: TMinerScript code snippet needed to access a par-

ticular dataset through the standard JDBC API.

3.3 Advanced Usage Modes

Expert users’ primary concerns are efﬁciency and pro-

ductivity. They need to perform complex, sophisti-

cated tasks that are often nonstandard or might be un-

supported. They need interfaces that operate in mul-

tiple modes, frequently changed to ﬁt the particular

demands of the task at hand.

These advanced users can directly use TMiner

components from their own Java code. They can even

customize them by attaching dynamic ports to TMiner

components without having to modify their source

code or create component subclasses. This can be

helpful when monitoring system performance or ad-

dressing other cross-cutting concerns that appear in

practice. AspectJ, an aspect-oriented extension for the

Java programming language, can also be used with

TMiner components to simplify the implementation

of such cross-cutting concerns.
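As a hedged illustration of the idea, the sketch below attaches a monitoring port to an existing component instance at run time, without touching its class or creating a subclass. The paper does not describe TMiner's actual dynamic-port API, so the attachDynamicPort helper, the ItemsetCounter component, and the shape of the emitted timing records are all hypothetical.

```javascript
// An existing component whose source we do not want to modify.
class ItemsetCounter {
  build(transactions) {
    return transactions.length;
  }
}

// Attaches a named dynamic port to one instance: the port collects a
// record every time the wrapped method finishes.
function attachDynamicPort(component, methodName, portName) {
  const original = component[methodName];
  component[portName] = [];                      // values emitted on the port
  component[methodName] = function (...args) {
    const start = Date.now();
    const result = original.apply(this, args);
    this[portName].push({ method: methodName, elapsedMs: Date.now() - start });
    return result;
  };
  return component;
}

const counter = attachDynamicPort(new ItemsetCounter(), "build", "timing");
const itemsetCount = counter.build([["a", "b"], ["a"], ["b", "c"]]);
console.log(itemsetCount, counter.timing.length);
```

Only the instance passed to the helper is affected; other instances of the same class keep their original behavior, which is what makes this approach suitable for ad hoc monitoring.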

Sometimes, advanced users prefer faster methods

to interact with TMiner components, without having

to go through a complete edit-compile-build-run cycle
each time they want to tweak anything. TMinerScript
is included in TMiner for such users. TMinerScript
is a scripting language that can be used to control
TMiner using the syntax of JavaScript (Flanagan,

2006). Scripting languages such as TMinerScript, by

being accessible to the end user, enable the behavior

of an application to be adapted to the user’s speciﬁc

needs and thus provide the greatest possible degree of

control to the user. Figures 4 through 6 show some

simple TMinerScript scripts that can be used to per-

form common data mining tasks in TMiner.

...

// Classification model
var classifier = new classification.tdidt.TDIDTClassifier();
classifier.setDefault();

// Classifier parameters (common to all algorithms)
classifier.dataset = dataset;   // e.g. JDBC Dataset
classifier.encoder = encoder;   // ... to encode input data
classifier.classAttribute = "classLabel";
classifier.discreteAttributes = discreteAttributes;
classifier.continuousAttributes = continuousAttributes;

// Specific parameters (TDIDT algorithm)
classifier.divisionRule = "GainRatio";
classifier.pruning = true;
classifier.pruningCF = 0.25;
classifier.binarySplits = true;
classifier.balanceThreshold = 0.2;

// Classifier construction
classifier.build();

// Output (e.g. in SQL format)
classifier.toSQL("TABLE", "CLASS");

Figure 5: TMinerScript code snippet needed to build a de-

cision tree classiﬁer.

4 CURRENT STATUS AND FUTURE DIRECTIONS

We have described some of the main features of

TMiner, a component-based data mining framework.

TMiner can be used as a stand-alone web-based

data mining tool, providing components for many

of the tasks we might need to analyze data, ranging

from data access to knowledge discovery. Its current

version lets users build classiﬁcation models, cluster

data, mine associations, and detect anomalies.

The TMiner component model is also suitable for

integration into larger solutions whose requirements

include some of the data mining features TMiner pro-

vides. In fact, TMiner offers alternative usage modes

intended to facilitate such integration. On the one

hand, it can be directly called from third-party code

as any other component library. On the other hand,

clients can use the scripting facilities TMiner provides

for automating the execution of data mining tasks.

Our current research efforts focus on the develop-

ment of novel data mining techniques (e.g. dealing

with different kinds of data sources) as well as on the

improvement of current data mining solutions by pro-

viding a scalable data mining system for scientiﬁc and

business applications.

ICSOFT 2008 - International Conference on Software and Data Technologies


...

// Algorithm details
var algorithm = "tminer.kdd.classification.tdidt.TDIDTClassifier";
var parameters = new tminer.model.adt.Dictionary();
parameters.add("divisionRule", "GainRatio");
parameters.add("binarySplits", true);
parameters.add("pruning", true);
parameters.add("pruningCF", 0.25);

// Cross-validation experiment
var experiment = new classification.CrossValidation();
experiment.type = algorithm;
experiment.parameters = parameters;
experiment.partitions = 10;
experiment.dataset = dataset;
experiment.encoder = encoder;
experiment.classAttribute = "classLabel";
experiment.discrete = discreteAttributes;
experiment.continuous = continuousAttributes;
experiment.run();

// Experiment results
experiment;

Figure 6: TMinerScript code snippet needed to run a cross-

validation experiment.
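For readers unfamiliar with the procedure behind the partitions setting in Figure 6, the following sketch shows plain k-fold cross-validation in JavaScript. It is a generic illustration, not TMiner's implementation: the makeFolds, trainMajority, and crossValidate functions, the round-robin fold assignment, and the majority-class learner are assumptions chosen to keep the example short.

```javascript
// Splits instance indices into k folds in round-robin fashion.
function makeFolds(n, k) {
  const folds = Array.from({ length: k }, () => []);
  for (let i = 0; i < n; i++) {
    folds[i % k].push(i);
  }
  return folds;
}

// Trivial learner: always predicts the most frequent training label.
function trainMajority(trainingLabels) {
  const counts = new Map();
  for (const label of trainingLabels) {
    counts.set(label, (counts.get(label) || 0) + 1);
  }
  let best = null;
  for (const [label, count] of counts) {
    if (best === null || count > counts.get(best)) best = label;
  }
  return () => best;
}

// Runs k-fold cross-validation and returns the overall accuracy:
// each fold is held out once while the model is trained on the rest.
function crossValidate(allLabels, k) {
  const folds = makeFolds(allLabels.length, k);
  let correct = 0;
  for (const testFold of folds) {
    const held = new Set(testFold);
    const training = allLabels.filter((_, i) => !held.has(i));
    const predict = trainMajority(training);
    for (const i of testFold) {
      if (predict() === allLabels[i]) correct++;
    }
  }
  return correct / allLabels.length;
}

const classLabels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "no"];
const accuracy = crossValidate(classLabels, 5);
console.log(accuracy);
```

With the ten labels above and five folds, the majority learner predicts "yes" in every fold, so the reported accuracy equals the proportion of "yes" instances, 0.7.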

ACKNOWLEDGEMENTS

Work partially supported by research project

TIN2006-07262.

REFERENCES

Berzal, F., Blanco, I., Cubero, J. C., and Marín, N. (2002).

Component-based data mining frameworks. Commu-

nications of the ACM, 45(12):97–100.

Berzal, F., Cubero, J. C., Marín, N., Serrano, J.-M., and

Blanco, I. (2003). Usability issues in data mining sys-

tems. In ICEIS 2003: Proceedings of the 5th Interna-

tional Conference on Enterprise Information Systems

(Volume II - Artiﬁcial Intelligence and Decision Sup-

port Systems), pages 418–421.

Chaudhuri, S. and Dayal, U. (1997). An overview of

data warehousing and OLAP technology. SIGMOD

Record, 26(1):65–74.

Constantine, L. L. (1994). Interfaces for intermediates.

IEEE Software, 11(4):96–99.

Constantine, L. L. and Lockwood, L. A. D. (1999). Soft-

ware for Use: A practical guide to the models and

methods of usage-centered design. ACM Press /

Addison-Wesley.

Etienne, J., Wachmann, B., and Zhang, L. (2006). A

component-based framework for knowledge discov-

ery in bioinformatics. In KDD ’06: Proceedings of

the 12th ACM SIGKDD international conference on

Knowledge discovery and data mining, pages 916–

921.

Fayad, M. E. and Schmidt, D. C. (1997). Object-oriented

application frameworks. Communications of the

ACM, 40(10):32–38.

Flanagan, D. (2006). JavaScript: The Deﬁnitive Guide.

O’Reilly & Associates, Inc., Sebastopol, CA, USA.

Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1995).

Design Patterns: Elements of reusable object-oriented

software. Addison-Wesley.

Han, J. and Kamber, M. (2006). Data Mining: Concepts

and Techniques. Morgan Kaufmann.

Inselberg, A. (1985). The plane with parallel coordinates.

The Visual Computer, 1(2):69–91.

Kimball, R. and Ross, M. (2002). The Data Warehouse

Toolkit: The Complete Guide to Dimensional Model-

ing. John Wiley & Sons, Inc.

Kobryn, C. (2000). Modeling components and frameworks

with UML. Communications of the ACM, 43(10):31–

38.

Larsen, G. (2000). Component-based enterprise frame-

works. Communications of the ACM, 43(10):24–26.

Laurinen, P., Tuovinen, L., and Roning, J. (2005). Smart

archive: A component-based data mining application

framework. In ISDA’05: Proceedings of the 5th In-

ternational Conference on Intelligent Systems Design

and Applications, pages 20–25.

Perry, D. E. and Kaiser, G. E. (1991). Models of soft-

ware development environments. IEEE Transactions

on Software Engineering, 17(3):283–295.

Prudsys (2008). XELOPES library - eXtEnded

Library fOr Prudsys Embedded Solutions.

http://www.prudsys.com/.

Rapid-I (2008). RapidMiner (formerly YALE, Yet Another

Learning Environment). http://rapid-i.com/.

Szyperski, C., Gruntz, D., and Murer, S. (2002). Com-

ponent Software: Beyond Object-Oriented Program-

ming. Addison-Wesley.

Tan, P.-N., Steinbach, M., and Kumar, V. (2006). Introduc-

tion to Data Mining. Addison-Wesley.

Techapichetvanich, K. and Datta, A. (2005). VisAR: A

new technique for visualizing mined association rules.

In ADMA 2005: 1st International Conference on Ad-

vanced Data Mining and Applications, LNCS 3584,

pages 88–95.

Widom, J. (1995). Research problems in data warehousing.

In CIKM ’95, Proceedings of the 1995 International

Conference on Information and Knowledge Manage-

ment, November 28 - December 2, 1995, Baltimore,

Maryland, USA, pages 25–30. ACM.

Witten, I. H. and Frank, E. (2005). Data Mining: Practi-

cal machine learning tools and techniques. Morgan

Kaufmann.
