DATA TYPE MANAGEMENT IN A DATA MINING APPLICATION FRAMEWORK

Lauri Tuovinen, Perttu Laurinen and Juha Röning

Department of Electrical and Information Engineering, P.O. Box 4500, FIN-90014 University of Oulu, Finland

Keywords:

Data mining, Application framework, Pipes and filters, Data representation, Data type, Interoperability.

Abstract:

Building application frameworks is one of the major approaches to code and design reuse in object-oriented

software engineering. Some frameworks target a particular application domain, adopting a number of domain-

speciﬁc problems to be addressed by the framework in such a fashion that there is no need for application

developers to devise solutions of their own to those problems. When the target domain is data mining, one

interesting domain-speciﬁc problem is management of the data types of model parameters and data variables.

This is not trivial because the framework must be able to convert parameter and variable values between

different representations, and it would be preferable to have these conversions take place transparently, without

involving the application programmer. This is not difﬁcult to achieve if the framework restricts the programmer

to a predeﬁned set of allowed data types, but if such a restriction is undesirable, the framework needs an

extension mechanism in its type management subsystem. Smart Archive, a framework for developing data

mining applications in Java or C++, includes such a mechanism, based on a type dictionary document and

a type renderer programming interface. These make it possible to handle even highly complex values such

as collections of instances of programmer-deﬁned classes in a variety of platform-independent representation

formats. The beneﬁts of this approach can be seen in how the framework interfaces with databases through

data sinks and in how it exports and imports application conﬁgurations.

1 INTRODUCTION

Reusing designs and code is considered desirable in

software engineering. An application framework—a

software skeleton that can be specialized into differ-

ent applications by plugging in a comparably small

quantity of new code—provides both a reusable de-

sign and a body of code that implements the design

in a reusable form. A good framework is therefore

a valuable commodity in software development, and

in certain areas a thoroughly indispensable one; one

would hardly consider writing a graphical applica-

tion for a modern desktop operating system without

adopting a framework such as MFC, Swing or GTK+

to take care of handling user interface elements and

events (Fayad and Schmidt, 1997).

The purpose of an application framework is to

support a family of applications, often deﬁned by

a particular type of functionality required. Smart

Archive (Laurinen et al., 2005; Tuovinen et al., 2008)

is a framework for applications that use data mining to

extract knowledge from a large body of data. It pro-

vides an application design based on the pipes-and-

ﬁlters architectural style, breaking down the solution

to a data mining problem into a sequence of transfor-

mations that can be implemented independently. The

transformations (ﬁlters) are encapsulated in compo-

nents, which may also contain a data sink for storing

the transformation results into a ﬁle or a database.

Smart Archive is intended to be highly generic,

adaptable to a wide variety of data mining problems

without having to modify the framework code. One

aspect of this generic nature is the manner in which

the data types of variables and parameters are han-

dled. Clearly it would make the framework less

generic if only a small ﬁxed set of data types could

be used. On the other hand, there needs to be a way

of controlling the types used in Smart Archive appli-

cations in order to ensure interoperability between the

framework and other application subsystems such as

databases. Interoperability between different implementations of the framework is another important issue, as Smart Archive is being concurrently developed

for two programming languages, Java and C++.


Tuovinen L., Laurinen P. and Röning J. (2009).

DATA TYPE MANAGEMENT IN A DATA MINING APPLICATION FRAMEWORK.

In Proceedings of the International Conference on Agents and Artiﬁcial Intelligence, pages 333-338

DOI: 10.5220/0001658903330338

 SciTePress

A concrete example of how the type management

problem affects application development can be seen

by considering the case of storing the results of a data

transformation into a database table. In order to ac-

complish this an application has to generate a table

structure with suitable column types and a sequence

of SQL queries to write the data into the table. This,

in turn, requires a mapping from application platform

(e.g. Java) types into SQL types and conversions of

values into strings that can be appended to an INSERT query. Without type management this involves considerable manual effort.

Smart Archive solves this problem by introducing

an extensible data type engine. The engine offers a

range of commonly used default types and is open

to the addition of new types by developers using the

framework. The extension API ensures that each new

type created by application programmers satisﬁes cer-

tain conditions designed to enforce interoperability.

Thus the use of data types in Smart Archive is unre-

stricted without also being uncontrolled. The primary

contribution to the state of the art is an abstract type

system covering a number of concrete data processing

and storage platforms; the usefulness of the type sys-

tem is highlighted by important framework features

that could not be implemented without it.

Section 2 describes how applications are devel-

oped using Smart Archive and brieﬂy reviews other

data mining frameworks. Section 3 explains in detail

how the Smart Archive type engine works. Section 4

discusses the practical implications of type manage-

ment in Smart Archive, demonstrating how the type

engine beneﬁts the framework. Section 5 discusses

the ﬁndings, pointing out strengths and opportunities

for future work as well as known shortcomings and

possible remedies. Section 6 concludes the paper.

2 DEVELOPMENT WITH SMART ARCHIVE

Applications based on Smart Archive are built from

components. A component, in this context, is very

specifically defined as an instance of a class that implements the Component interface, which is one of the

core elements of the framework. A component class

may have some distinguishing features of its own, but

the functionality of a component is determined by the

ﬁlter and sink it contains. Simply put, a component is

a receptacle into which different ﬁlters and sinks can

be plugged as the application developer wishes.

Similarly, a filter is an instance of a class that implements the Filter interface. A filter receives its input from its parent component and likewise sends its

output to the parent component; in between the data is

transformed by the ﬁlter’s processing algorithm. The

interface between the ﬁlter and the component con-

sists of groups of logically interrelated data variables

called channels: input channels that the ﬁlter requires

to be supplied and output channels that the ﬁlter sup-

plies. For example, the ﬁlter could implement a sen-

sor fusion algorithm, in which case each input channel

would deliver the data generated by one sensor.

Finally, a sink is an instance of a class that implements the Sink interface. A sink acts as an abstraction

layer between a component and a persistent data stor-

age system, typically a relational database. When a

component has a sink, all of the output produced by

the component’s ﬁlter is delivered to it, and the sink

executes the operations required to write the data into

persistent storage. Alternatively, the component can

retrieve data previously stored into the sink, in which

case the sink assumes the role of the ﬁlter, producing

data that becomes the component’s output.
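The relationship between the three interfaces can be illustrated with a minimal Java sketch. Only the names Component, Filter and Sink come from the framework; the method signatures, the BasicComponent and MemorySink classes and the use of double[] rows are assumptions made for illustration and are not the actual Smart Archive API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified versions of the three core interfaces.
interface Filter {
    List<double[]> process(List<double[]> input);
}

interface Sink {
    void store(List<double[]> rows);
    List<double[]> retrieve();
}

interface Component {
    List<double[]> run(List<double[]> input); // transform with the filter
    List<double[]> replay();                  // read back from the sink
}

// A component is a receptacle for one filter and, optionally, one sink.
class BasicComponent implements Component {
    private final Filter filter;
    private final Sink sink; // null when the component has no sink

    BasicComponent(Filter filter, Sink sink) {
        this.filter = filter;
        this.sink = sink;
    }

    @Override
    public List<double[]> run(List<double[]> input) {
        List<double[]> output = filter.process(input);
        if (sink != null) {
            sink.store(output); // all filter output is delivered to the sink
        }
        return output;
    }

    @Override
    public List<double[]> replay() {
        if (sink == null) {
            throw new IllegalStateException("component has no sink");
        }
        return sink.retrieve(); // the sink assumes the role of the filter
    }
}

// In-memory stand-in for a database-backed sink.
class MemorySink implements Sink {
    private final List<double[]> rows = new ArrayList<>();

    @Override
    public void store(List<double[]> newRows) {
        rows.addAll(newRows);
    }

    @Override
    public List<double[]> retrieve() {
        return new ArrayList<>(rows);
    }
}
```

Because Filter has a single method in this sketch, a transformation can be supplied as a lambda, and the component that follows cannot tell whether its input came from run or from replay.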

Components are linked together with pipes to

form a directed graph with a single root node. Raw in-

put data is inserted into the graph at the root and then

pushed through each component in turn in an order

dictated by the inter-component links (pipes). Thus

the data is reﬁned in a stepwise manner until the de-

sired result of the application is obtained. This style of

software architecture is natural for data mining appli-

cations, which generally progress in steps following

a process where each step brings the data closer to a

form that can be interpreted and applied.
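As a rough illustration of the push-style execution described above, the following Java sketch links nodes with pipes and pushes data from the root downwards. The PipelineNode class and its methods are hypothetical, and the sketch assumes that every node has a single predecessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical node type: one transformation plus its outgoing pipes.
// Assumes every node has a single predecessor, so each is visited once.
class PipelineNode {
    private final String name;
    private final UnaryOperator<List<Double>> transform;
    private final List<PipelineNode> successors = new ArrayList<>();
    private List<Double> lastOutput;

    PipelineNode(String name, UnaryOperator<List<Double>> transform) {
        this.name = name;
        this.transform = transform;
    }

    // Adds a pipe from this node to 'next' and returns 'next' for chaining.
    PipelineNode pipeTo(PipelineNode next) {
        successors.add(next);
        return next;
    }

    // Transforms the input, then pushes the result down every outgoing pipe.
    void push(List<Double> input) {
        lastOutput = transform.apply(input);
        for (PipelineNode next : successors) {
            next.push(lastOutput);
        }
    }

    String name() {
        return name;
    }

    List<Double> lastOutput() {
        return lastOutput;
    }
}
```

Calling push on the root node is enough to run the whole graph; each node keeps its last output so that intermediate results can be inspected.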

D2K (Automated Learning Group, 2003), KN-

IME (Berthold et al., 2006) and RapidMiner (for-

merly YALE) (Mierswa et al., 2006) are data min-

ing frameworks that, like Smart Archive, employ the

paradigm of assembling independent components into

solutions. Each provides a graphical interface for se-

lecting, conﬁguring and coupling components, which

is a convenient way of creating processing sequences.

Smart Archive trades off some of this convenience for

the ability of developers to decide exactly which parts

of the framework are used in an application and how,

allowing applications to be tailored in a greater vari-

ety of ways. The greater degree of control implies a

greater number of things to learn and to keep track of,

but it still leaves the option of hiding the greater com-

plexity behind a comparably simple interface, so in a

sense this is a best-of-both-worlds approach.

Another major difference is that only Smart

Archive is implemented and maintained in more than

one programming language, making cross-language

portability of applications a consideration that the

other frameworks do not need to address. An ex-

ception to this is the XELOPES library (Prudsys

ICAART 2009 - International Conference on Agents and Artificial Intelligence


AG, 2007), which uses a single platform-independent

model, mapped to multiple platform-speciﬁc models,

to achieve portability. However, the focus and un-

derlying paradigm of XELOPES differ so much from

the frameworks discussed above that it does not really

make sense to think of it as a parallel to them.

As reported in (Tuovinen et al., 2008), Smart

Archive has been successfully used to develop data

mining applications in diverse application domains.

Thus the framework is not merely a proven concept

but a practical foundation for the implementation of

real-world applications.

3 TYPE MANAGEMENT IN SMART ARCHIVE

In Smart Archive, every data type used for parameters

or variables in applications must have a corresponding

entry in the type dictionary. The type dictionary is an

XML document containing the following information

about each type:

• abstract type name

• type renderer class

• concrete type names

• type parameters (if applicable)

An excerpt from the dictionary is shown in Figure 1.

The abstract type name is what is used in appli-

cation code to identify the type when specifying pa-

rameters and variables. An abstract type is realized

by one or more concrete types, each in a particular

context. One context is the framework itself, and con-

crete types in this context are types in the implemen-

tation language of the framework. In the other cur-

rently implemented context—relational databases—

concrete types are SQL data types. It is possible to

have more than one concrete type per context by mak-

ing the choice of type dependent on type parameters.
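The lookup from an abstract type name to a renderer and to context-specific concrete types could be sketched in Java as follows. The ctype element name is taken from the dictionary; the type element, the name, renderer and context attributes and the TypeDictionary class itself are assumptions, since the actual dictionary schema is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Assumed layout:
// <type name=".." renderer=".."><ctype context="..">..</ctype></type>
class TypeDictionary {
    // abstract type name -> renderer class name
    private final Map<String, String> renderers = new HashMap<>();
    // abstract type name -> (context -> concrete type name)
    private final Map<String, Map<String, String>> concreteTypes = new HashMap<>();

    static TypeDictionary parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            TypeDictionary dict = new TypeDictionary();
            NodeList types = doc.getElementsByTagName("type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                String name = type.getAttribute("name");
                dict.renderers.put(name, type.getAttribute("renderer"));
                Map<String, String> byContext = new HashMap<>();
                NodeList ctypes = type.getElementsByTagName("ctype");
                for (int j = 0; j < ctypes.getLength(); j++) {
                    Element ctype = (Element) ctypes.item(j);
                    byContext.put(ctype.getAttribute("context"),
                            ctype.getTextContent().trim());
                }
                dict.concreteTypes.put(name, byContext);
            }
            return dict;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid type dictionary", e);
        }
    }

    String rendererFor(String abstractName) {
        String renderer = renderers.get(abstractName);
        if (renderer == null) {
            throw new IllegalArgumentException("unknown type: " + abstractName);
        }
        return renderer;
    }

    String concreteType(String abstractName, String context) {
        Map<String, String> byContext = concreteTypes.get(abstractName);
        if (byContext == null || !byContext.containsKey(context)) {
            throw new IllegalArgumentException(
                    "no " + context + " type for " + abstractName);
        }
        return byContext.get(context);
    }
}
```

Type parameters are left out of this sketch; supporting them would mean choosing among several ctype entries per context.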

Besides translating type names between different

contexts, the type engine also converts values be-

tween different representations. Each type has an in-

ternal representation, dictated by the framework im-

plementation language, and two external representa-

tions, SQL string and XML element. Conversions

between representations are handled by classes that

implement a special interface, known as type render-

ers. The framework includes a default renderer that

provides a number of commonly needed basic data

types. For types not supported by the default renderer

it is possible to build an extension by coding a new

renderer and writing entries for the new types into

the type dictionary. The framework, when started, will load all type renderers and place them into a data structure where they can be accessed quickly.

Figure 1: An excerpt from the type dictionary, showing entries for some of the types supported by the default renderer. The ctype elements indicate the concrete Java and SQL data types corresponding to each abstract type.

Figure 2 illustrates what happens when the frame-

work needs to convert a value from one representation

to another. First, a ‘render’ message is passed to the

type engine; the message contains the abstract type

name, the current and target representation formats

and the value itself, an Object in Java. The type engine uses the type name to look up the type renderer in charge of the type in the type dictionary. It then passes the ‘render’ message to the renderer, which returns the converted value. Finally, the return value is passed to the routine that generated the original ‘render’ message. The return value, another Object, can then be cast into its runtime type, in this case String, as the

purpose of the rendering is to obtain a value that can

be appended to an SQL query string.
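The dispatch sequence described above might look roughly like this in Java. The Representation enum, the TypeRenderer interface, the TypeEngine and BasicRenderer classes and all method signatures are hypothetical, and a plain map stands in for the type dictionary lookup.

```java
import java.util.HashMap;
import java.util.Map;

enum Representation { INTERNAL, SQL, XML }

// Hypothetical renderer interface; the real one is not shown in the paper.
interface TypeRenderer {
    Object render(String abstractType, Representation from,
                  Representation to, Object value);
}

class TypeEngine {
    // Stand-in for the dictionary lookup: abstract type name -> renderer.
    private final Map<String, TypeRenderer> renderers = new HashMap<>();

    void register(String abstractType, TypeRenderer renderer) {
        renderers.put(abstractType, renderer);
    }

    // Receives the 'render' message, finds the renderer, forwards the message
    // and returns the converted value to the caller.
    Object render(String abstractType, Representation from,
                  Representation to, Object value) {
        TypeRenderer renderer = renderers.get(abstractType);
        if (renderer == null) {
            throw new IllegalArgumentException("unknown type: " + abstractType);
        }
        return renderer.render(abstractType, from, to, value);
    }
}

// Minimal renderer covering only the internal-to-SQL direction.
class BasicRenderer implements TypeRenderer {
    @Override
    public Object render(String abstractType, Representation from,
                         Representation to, Object value) {
        if (from == Representation.INTERNAL && to == Representation.SQL) {
            if (value instanceof String) {
                // Quote the string and double any embedded single quotes.
                return "'" + ((String) value).replace("'", "''") + "'";
            }
            return String.valueOf(value);
        }
        throw new UnsupportedOperationException(from + " -> " + to);
    }
}
```

The caller casts the returned Object to String before appending it to a query, mirroring the sequence in the text.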

In the case of compound values such as lists and

maps the rendering mechanism is applied recursively:

a render operation invoked on such a value triggers

nested operations to render the elements of the compound value. Figure 3 shows the result of this when a map object is rendered as XML. The initial render operation generates the hierarchy of XML elements and then uses recursion to fill in the contents of the key and value nodes. The XML representation is also used as the SQL form of these data types.

Figure 2: The processing sequence of a rendering operation. The type engine receives a message (1.) and does a type dictionary lookup (2.) to find out which type renderer to call. The message is then forwarded to the renderer (3.), which performs the conversion. Finally, the converted value is returned (4.) to the routine that invoked the type engine.

Figure 3: A string-to-boolean map object rendered as XML by the type engine.
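A minimal Java sketch of recursive rendering is given below. The key and value element names follow the description above; the map, entry, list and item element names and the XmlRender class are assumptions, as the exact XML layout used by the framework is not reproduced here.

```java
import java.util.List;
import java.util.Map;

class XmlRender {
    // Renders a value as XML, recursing into maps and lists.
    static String render(Object value) {
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("<map>");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append("<entry><key>").append(render(e.getKey()))
                  .append("</key><value>").append(render(e.getValue()))
                  .append("</value></entry>");
            }
            return sb.append("</map>").toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("<list>");
            for (Object item : (List<?>) value) {
                sb.append("<item>").append(render(item)).append("</item>");
            }
            return sb.append("</list>").toString();
        }
        // Simple values end the recursion.
        return escape(String.valueOf(value));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Nested compound values, such as a map whose values are lists, need no extra code because each level delegates to the same render method.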

D2K, KNIME and RapidMiner each manage the

data types of variables in such a way that, while the

implementation approach varies, the overall effect is

largely the same. A concrete difference, however, is

that D2K, KNIME and RapidMiner each delegate the

responsibility for handling variables in external con-

texts to the components that interface with those con-

texts. For example, while each framework provides

a component whose function is similar to that of the

data sink in Smart Archive, the conversion of values

from internal to external representation is up to the

components themselves. In Smart Archive the con-

version functionality is provided by the framework

and available also to ﬁlters and other application el-

ements that employ some form of persistent storage.

Another notable difference is the treatment of pa-

rameters (values that control the behavior of com-

ponents): in Smart Archive, unlike the other frame-

works, the same mechanisms are used to handle both

parameters and variables. Thus in Smart Archive ev-

ery data type that is legal for parameters is also legal

for variables and vice versa. On the other hand, if one

knows that a type will only be used for a particular

purpose, extending the framework to support the type

can be simpliﬁed considerably; in the simplest case it

is sufﬁcient to just insert an entry into the type dictio-

nary without writing any new code.

4 THE TYPE ENGINE IN PRACTICE

To see how the type engine affects Smart Archive

application development in practice, let us consider

a simple example application with three components

that are executed sequentially. The ﬁrst component

takes the raw input data, cleans and normalizes it

and outputs the preprocessed data in a format under-

stood by the second component, which extracts fea-

tures from the data. The third component takes the

extracted features and uses them for cluster analysis.

The outputs of the ﬁlters of the three components

are written into a database using a data sink. Creat-

ing the sinks requires generating database table struc-

tures suitable for storing the outputs of the ﬁlters.

The type engine makes this highly convenient because

it can translate the output variable speciﬁcations of

the ﬁlters into the equivalent SQL column speciﬁca-

tions. All Smart Archive ﬁlters are required to de-

clare their output variables, so given a ﬁlter, the ap-

propriate sink can be generated with little extra work.

In fact, in Smart Archive a single method call is

enough, whereas setting up the table structures man-

ually would easily require tens of lines of code.
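To give an idea of what such a translation involves, the Java sketch below derives a CREATE TABLE statement from a filter's declared output variables. The TableGenerator class, the abstract type names and their SQL counterparts are illustrative assumptions rather than the contents of the actual type dictionary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

class TableGenerator {
    // Assumed abstract-to-SQL mapping; in the framework this information
    // would come from the type dictionary.
    private static final Map<String, String> SQL_TYPES = new HashMap<>();
    static {
        SQL_TYPES.put("integer", "INTEGER");
        SQL_TYPES.put("real", "DOUBLE PRECISION");
        SQL_TYPES.put("string", "VARCHAR(255)");
        SQL_TYPES.put("boolean", "BOOLEAN");
    }

    // outputVariables maps variable name -> abstract type name,
    // in the order the columns should appear.
    static String createTable(String table, Map<String, String> outputVariables) {
        StringJoiner columns =
                new StringJoiner(", ", "CREATE TABLE " + table + " (", ")");
        for (Map.Entry<String, String> variable : outputVariables.entrySet()) {
            String sqlType = SQL_TYPES.get(variable.getValue());
            if (sqlType == null) {
                throw new IllegalArgumentException(
                        "no SQL type for " + variable.getValue());
            }
            columns.add(variable.getKey() + " " + sqlType);
        }
        return columns.toString();
    }
}
```

Passing an insertion-ordered map such as LinkedHashMap keeps the column order identical to the order in which the filter declares its variables.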

When a batch of data is fed into the application, it

is progressively reﬁned by each ﬁlter and then written

into a database by the associated sink. The sinks use

the type engine to render as strings the variable values

produced by the filters. The rendered strings are formatted such that they can be directly appended to the VALUES clause of an SQL INSERT query. The queries

are automatically generated by the sinks so that the

entire process is almost completely transparent; the

only database-related concepts the programmer needs

to address are the connection parameters.
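The query generation step can be sketched in Java as follows. InsertGenerator and its two methods are hypothetical, and the rendering rules shown (numbers verbatim, booleans as TRUE or FALSE, strings quoted with embedded quotes doubled) are assumptions about what a renderer for basic types might do.

```java
import java.util.List;
import java.util.StringJoiner;

class InsertGenerator {
    // Renders one value as a string suitable for a VALUES clause.
    static String renderSql(Object value) {
        if (value == null) {
            return "NULL";
        }
        if (value instanceof Number) {
            return value.toString();
        }
        if (value instanceof Boolean) {
            return ((Boolean) value) ? "TRUE" : "FALSE";
        }
        // Quote everything else and double any embedded single quotes.
        return "'" + value.toString().replace("'", "''") + "'";
    }

    // Builds a complete INSERT statement for one row of filter output.
    static String insert(String table, List<String> columns, List<?> row) {
        if (columns.size() != row.size()) {
            throw new IllegalArgumentException("column/value count mismatch");
        }
        StringJoiner values = new StringJoiner(", ", "(", ")");
        for (Object value : row) {
            values.add(renderSql(value));
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES " + values;
    }
}
```

The sketch follows the string-appending approach described in the text; with JDBC, a PreparedStatement with bound parameters would be the usual alternative when values come from untrusted sources.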

Once the dataset has been processed by all compo-

nents, the database contains not only the ﬁnal results

(from cluster analysis) but also an image of the data

after each intermediate step (preprocessing and fea-

ture extraction). Supposing that the application oper-

ator now wants to repeat the clustering with different

parameters, there is no need to start the whole process

over: the results of the feature extractor can be re-

trieved from its sink, leaving only the clusterer to be


re-executed. Similarly, if the operator wants to try out

a different feature set, the results of the preprocessor

can be retrieved, leaving only the other two compo-

nents to be re-executed. Thus Smart Archive can be

used to implement the stepwise approach to data min-

ing discussed in (Laurinen et al., 2004).
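The partial re-execution described above can be illustrated with a small Java sketch in which each stage keeps its last output, playing the part of a sink. The StepwisePipeline class and its methods are hypothetical, and in-memory lists stand in for database-backed sinks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class StepwisePipeline {
    private final List<UnaryOperator<List<Double>>> stages = new ArrayList<>();
    // One stored output per stage, standing in for that stage's sink.
    private final List<List<Double>> stored = new ArrayList<>();
    private int executions = 0;

    void addStage(UnaryOperator<List<Double>> stage) {
        stages.add(stage);
        stored.add(null);
    }

    // Swaps in a stage with a different algorithm or parameters.
    void replaceStage(int index, UnaryOperator<List<Double>> stage) {
        stages.set(index, stage);
    }

    List<Double> runAll(List<Double> raw) {
        return runFrom(0, raw);
    }

    // Re-executes only stages index..end, reading the input of stage
    // 'index' from the stored output of the preceding stage.
    List<Double> rerunFrom(int index) {
        if (index <= 0 || stored.get(index - 1) == null) {
            throw new IllegalStateException(
                    "no stored output before stage " + index);
        }
        return runFrom(index, stored.get(index - 1));
    }

    private List<Double> runFrom(int index, List<Double> input) {
        List<Double> data = input;
        for (int i = index; i < stages.size(); i++) {
            data = stages.get(i).apply(data);
            stored.set(i, data);
            executions++;
        }
        return data;
    }

    int executions() {
        return executions;
    }
}
```

After a full run of three stages, replacing the last stage and calling rerunFrom(2) executes one stage instead of three.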

Finally, supposing that the operator wants to trans-

mit the application to someone else, it is enough to

send a description of the application conﬁguration

provided that the recipient also has an installation of

Smart Archive. Another necessary condition is that

the recipient has access to all classes used by the ap-

plication. Application classes that are not part of the

framework are mainly ﬁlter classes, so the latter con-

dition is generally true if the application uses only

algorithms found in common libraries. Interfaces to

such libraries are among the planned future exten-

sions to Smart Archive.

Components, ﬁlters and sinks in Smart Archive

may have any number of parameters that control

their functions. The parameters, like variables, may

be of any data type supported by the type engine.

Smart Archive allows the values of the parameters,

along with all other information that deﬁnes a Smart

Archive application, to be written into an XML-based

conﬁguration ﬁle. The type engine is used to render

the parameter values as XML elements.

Generating the conﬁguration ﬁle proceeds from

the top of the application hierarchy to the bottom. The

top-level routine generates the outline of the conﬁgu-

ration document and iterates through the components

of the application, requesting each in turn to describe

itself as an XML element. Each component then re-

quests the same from its ﬁlter and sink and inserts

the results into the description it returns to the top-

level routine. The top-level routine collects the com-

ponent descriptions and inserts them into the conﬁg-

uration document. Loading a conﬁguration into the

framework is basically a reversal of this process: ev-

ery component, ﬁlter and sink class also knows how

to generate an application object from an XML ele-

ment. The framework therefore simply needs to parse

the XML ﬁle, ﬁnd the speciﬁed classes and have them

instantiate themselves by giving them as input the ap-

propriate sections of the conﬁguration description.

The generation of component descriptions, like

the generation of data sinks, takes place in an entirely

transparent fashion. This is because the only parts of

a component description that need to be generated dy-

namically are the class names and parameter values of

the component and its ﬁlter and sink. The former are

trivial to ﬁnd and the latter are handled by the type en-

gine, so the XML operations can be implemented in

a common superclass without any knowledge of the

classes that are derived from it. As long as applica-

tion programmers derive their component, ﬁlter and

sink classes from these standard abstract superclasses,

their classes are automatically able to perform conﬁg-

uration export and import operations as required.
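One way such a common superclass could be organized is sketched below in Java. The Configurable class, the object and param element names and the storage of parameter values as plain strings are all assumptions; in the framework the parameter values would be rendered by the type engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical common superclass for components, filters and sinks.
abstract class Configurable {
    final Map<String, String> parameters = new LinkedHashMap<>();
    // Nested objects, e.g. the filter and sink of a component.
    final List<Configurable> parts = new ArrayList<>();

    // Export: only class names and parameter values are generated dynamically.
    Element describe(Document doc) {
        Element element = doc.createElement("object");
        element.setAttribute("class", getClass().getName());
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            Element param = doc.createElement("param");
            param.setAttribute("name", p.getKey());
            param.setTextContent(p.getValue());
            element.appendChild(param);
        }
        for (Configurable part : parts) {
            element.appendChild(part.describe(doc));
        }
        return element;
    }

    // Import: find the named class, instantiate it and restore its state.
    static Configurable load(Element element) {
        try {
            Configurable object = (Configurable) Class
                    .forName(element.getAttribute("class"))
                    .getDeclaredConstructor().newInstance();
            NodeList children = element.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (!(child instanceof Element)) {
                    continue;
                }
                Element e = (Element) child;
                if (e.getTagName().equals("param")) {
                    object.parameters.put(e.getAttribute("name"), e.getTextContent());
                } else {
                    object.parts.add(load(e));
                }
            }
            return object;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("cannot instantiate configured class", ex);
        }
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}

// Application classes inherit export and import without any XML code.
class DemoComponent extends Configurable { }

class DemoFilter extends Configurable {
    DemoFilter() {
        parameters.put("clusters", "3"); // default parameter value
    }
}
```

The subclasses contain no XML handling at all, which is the property the paragraph above relies on; the sketch assumes every subclass has a no-argument constructor.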

5 DISCUSSION

The interchangeability of ﬁlters and sinks is one of

the most important design ideas in Smart Archive.

Whether a component is transforming data with its

ﬁlter or retrieving data from its sink makes no differ-

ence to the components that follow it: in either case

the component is producing output that the other components process further. With sinks providing access

to all the data that has passed through the application,

the effects of a changed parameter or algorithm can be

tested by rerunning only the affected part of the pro-

cessing sequence. Compared to the tightly integrated

data sink, an external database interface would not

provide an adequate solution, but working with sinks

could easily become a nuisance if the mappings be-

tween variables and database columns had to be done

manually. The type engine prevents this.

The recursive approach adopted by the type en-

gine in rendering compound types can also be used

by application developers writing renderers for data

types they want to use in their Smart Archive appli-

cations. Provided that the state of a class consists

entirely of types already supported by the type en-

gine, a type renderer for the class mostly just needs

to invoke already implemented render operations and

compile their output. Conceivably this could even be

performed automatically by the framework, but at this

time the detailed speciﬁcation and implementation of

such a mechanism must be deferred to future work.
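As a hedged illustration of this idea, the Java sketch below renders a programmer-defined class as XML purely by delegating to render operations for its fields. The Measurement class, the element names and the three helper methods, which stand in for operations the type engine is assumed to provide already, are all hypothetical.

```java
// A programmer-defined class whose state uses only already supported types.
class Measurement {
    final String sensor;
    final double value;
    final boolean valid;

    Measurement(String sensor, double value, boolean valid) {
        this.sensor = sensor;
        this.value = value;
        this.valid = valid;
    }
}

class MeasurementRenderer {
    // Stand-ins for render operations the type engine is assumed to provide.
    static String renderString(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String renderReal(double d) {
        return Double.toString(d);
    }

    static String renderBoolean(boolean b) {
        return Boolean.toString(b);
    }

    // The renderer only invokes existing operations and compiles their output.
    static String toXml(Measurement m) {
        return "<measurement>"
                + "<sensor>" + renderString(m.sensor) + "</sensor>"
                + "<value>" + renderReal(m.value) + "</value>"
                + "<valid>" + renderBoolean(m.valid) + "</valid>"
                + "</measurement>";
    }
}
```

Since toXml does nothing beyond visiting each field and wrapping the result, it is the kind of code that could plausibly be generated automatically, for example by reflection over the fields of the class.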

A possible weakness in the Smart Archive type

rendering scheme is the restriction that there can be

only one renderer for each type. This could prove inconvenient in the future if the framework is extended

with a new representation for values. In this case one

might want to write a new renderer to handle the new

representation rather than modify existing ones, espe-

cially if one does not have access to the source code

of some of the renderers one is using. The restriction

may be lifted in future revisions of the framework.

About conﬁguration documents it is worth not-

ing that they describe applications in a format that

does not depend on a particular implementation of

the framework. A conﬁguration exported from a

Smart Archive Java application could therefore be im-

ported into the C++ version of the framework. The

one language-speciﬁc feature of the format—class


names—means, however, that an additional dictio-

nary would be needed, mapping the names of Java

classes to those of the equivalent C++ classes. This

aspect of the C++ version of Smart Archive has not

yet been fully developed, so testing the portability of

conﬁgurations has not been possible.

6 CONCLUSIONS

This paper addressed the problem of managing the

data types of parameters and variables in a data min-

ing application framework. This problem is inter-

esting and nontrivial because, on the one hand, ar-

tiﬁcially limiting the set of types a programmer can

choose from is undesirable, but on the other hand, not

controlling types at all may lead to problems in inter-

operability. Smart Archive, a data mining framework

for Java and C++, employs a solution based on a type

dictionary and one or more type renderers, allowing

developers to extend the selection of available types

while ensuring that the framework knows how to han-

dle each type in various situations relating to conver-

sion of values between representations.

The type dictionary is an XML document contain-

ing information on each data type, including the name

of the type renderer that handles the type. A type ren-

derer, in turn, is a class that implements a special ren-

dering interface through which the framework uses its

services. New types are created by editing the dictio-

nary and coding a renderer. The dictionary and ren-

derers allow the framework to keep track of any num-

ber of concrete types and representations associated

with a given type, while the application programmer

only needs to be aware of a single abstract type name

and a single representation, namely the one used in

the implementation language of the framework.

The main principles of Smart Archive application

development were introduced first, along with other

systems intended for the same purpose. The Smart

Archive solution to data type management was then

examined in detail. The practical implications of this

type engine were explored by walking through a hy-

pothetical case study. Finally, some notable strengths,

weaknesses and open issues were identiﬁed and an-

alyzed. The type engine has proved useful for im-

plementing framework features that make application

coding considerably quicker and more convenient.

ACKNOWLEDGEMENTS

The authors would like to thank the Finnish Funding Agency for Technology and Innovation (http://www.tekes.fi), Rautaruukki (http://www.ruukki.com) and Polar Electro (http://www.polar.fi) for funding the research on Smart Archive in the SAMURAI project. L. Tuovinen wishes to thank the Graduate School in Electronics, Telecommunications and Automation (http://signal.hut.fi/geta/) for funding his postgraduate work.

REFERENCES

Automated Learning Group (2003). D2K Toolkit

User Manual. Technical manual, available at

http://alg.ncsa.uiuc.edu.

Berthold, M. R., Cebron, N., Dill, F., di Fatta, G., Gabriel,

T. R., Georg, F., Meinl, T., Ohl, P., Sieb, C., and

Wiswedel, B. (2006). Knime: The Konstanz infor-

mation miner. In Proceedings of the 4th Annual In-

dustrial Simulation Conference, Workshop on Multi-

Agent Systems and Simulation.

Fayad, M. E. and Schmidt, D. C. (1997). Object-oriented

application frameworks. Communications of the

ACM, 40(10):32–38.

Laurinen, P., Tuovinen, L., Haapalainen, E., Junno, H., Röning, J., and Zettel, D. (2004). Managing and implementing the data mining process using a truly stepwise approach. In Proceedings of the Sixth International Baltic Conference on Databases & Information Systems, pages 246–257.

Laurinen, P., Tuovinen, L., and Röning, J. (2005). Smart Archive: a component-based data mining application framework. In Proceedings of the Fifth International Conference on Intelligent Systems Design and Applications (ISDA 2005), pages 20–25.

Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and

Euler, T. (2006). YALE: Rapid prototyping for com-

plex data mining tasks. In Proceedings of the 12th

ACM SIGKDD International Conference on Knowl-

edge Discovery and Data Mining, pages 935–940.

Prudsys AG (2007). Xeli’s Intro. Introduction to

XELOPES. Technical manual, available at

http://www.prudsys.com.

Tuovinen, L., Laurinen, P., Juutilainen, I., and Röning, J. (2008). Data mining applications for diverse industrial application domains with Smart Archive. In Proceedings of the IASTED International Conference on Software Engineering (SE 2008), pages 56–61.
