
The developers can then replace the proxies and dummies with regular filters once the algorithms they represent have been designed, tested and coded in the language of the framework. Some proxies may even be left in the final application, for example if a gateway filter has been used to build an interface to a third-party algorithm library.
This paper proposes gateway filters as a means of supporting algorithm design and prototype implementation in data mining software development. Gateways provide an encapsulated and open-ended interface that allows functionality provided by other tools to be included in SA-based applications. Encapsulation makes the interface convenient to use, because regular filters can be replaced with gateways and vice versa without modifying other parts of the application. Openness ensures that the range of supported tools is not limited to just one or a few. A positive side effect of openness is that gateway filters have several potential applications besides the principal one.
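As a rough illustration of this interchangeability, consider the following sketch. The interface and class names are purely illustrative and do not correspond to the actual Smart Archive API; a double array stands in for whatever data container the framework passes between filters.

// Hypothetical sketch only; names are illustrative, not SA's real API.
public interface Filter {
    double[] process(double[] input);
}

class SmoothingFilter implements Filter {
    // Regular filter: the algorithm is coded in the framework's language.
    public double[] process(double[] input) {
        double[] out = input.clone();
        for (int i = 1; i < out.length; i++) {
            out[i] = 0.5 * (input[i - 1] + input[i]);   // simple two-point smoothing
        }
        return out;
    }
}

class ExternalToolGateway implements Filter {
    // Gateway filter: delegate the computation to an external tool
    // (e.g. via a pipe, socket or native library call) and wrap the result.
    public double[] process(double[] input) {
        return callExternalTool(input);
    }
    private double[] callExternalTool(double[] input) {
        // Placeholder for the actual inter-process or library call.
        return input;
    }
}

Because both classes satisfy the same interface, the rest of the application does not need to distinguish a gateway from a regular filter, which is what allows the two to be swapped freely.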
Section 2 discusses related work, most importantly other data mining frameworks. The main contribution is given in Section 3, which introduces the concept of gateway filters, and Section 4, which presents a gateway filter implementation and documents some tests performed on it. Section 5 discusses the findings, and Section 6 concludes the paper.
2 RELATED WORK
Weka (Witten and Frank, 2005) is a library of machine learning algorithms developed at the University of Waikato, New Zealand. The library provides a collection of algorithms for regression, classification, clustering and association rule mining, as well as I/O interfaces and preprocessing, evaluation and visualization methods. The algorithms can be invoked from Java code or operated through a set of graphical user interfaces bundled with the library.
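For example, a classifier from the library can be trained and evaluated from plain Java in a few lines. The following is a minimal sketch; the data file name and the choice of learner are arbitrary.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF data set; "iris.arff" is just a placeholder file name.
        Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        // Train a C4.5-style decision tree and estimate its accuracy with 10-fold CV.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}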
D2K (Automated Learning Group, 2003) is a data mining environment developed by the Automated Learning Group of the University of Illinois. It provides a graphical toolkit for arranging application components, called modules, into processing sequences, called itineraries, but developers also have the option of using the core components of D2K without the toolkit. In addition to a selection of reusable modules for various purposes (I/O, data preparation, computation, visualization), there is an API for programming new ones.
KNIME (Berthold et al., 2006), developed at the University of Konstanz, Germany, is another data mining environment with a graphical user interface for manipulating components, called nodes. Although KNIME and D2K are superficially different, there is a close analogy between the nodes and workflows of KNIME and the modules and itineraries of D2K. Based on the Eclipse integrated development environment, KNIME bears more resemblance to a programming tool and allows arbitrary code snippets to be inserted into workflows.
RapidMiner, formerly known as YALE (Mierswa et al., 2006), was originally developed at the University of Dortmund, Germany. Like D2K and KNIME, it is a graphical environment for building data mining applications from components, called operators. However, unlike D2K and KNIME, RapidMiner models the operator sequence as a tree rather than an arbitrary directed graph. The operator tree can be manipulated either graphically or by editing its XML representation, which is always available in its own tab in the user interface.
Smart Archive shares with Weka, D2K, KNIME and RapidMiner the principle of composing applications from components, with the output of one component becoming input for the next. The sequence of components, analogous to a D2K itinerary, is called the execution graph. Another layer of functionality built on top of the execution graph, called the mining kernel, is responsible for gathering data from data sources (via interface objects known as input receivers) and feeding it into the execution graph.
A general difference between Smart Archive and the other frameworks discussed above is the greater emphasis that SA puts on openness and flexibility. In practice this means that SA offers greater freedom to work directly with application code and to use selected parts of the framework to build a precisely tailored application. Another significant difference lies in the approach to using databases in applications: Smart Archive, with its data sinks embedded in components, provides a tighter form of integration than the other frameworks, where database inputs and outputs are simply one kind of component among many.
From a software engineering perspective, the integration of SA with external tools falls within the scope of software interoperability, on which there is a considerable body of literature. Since SA is a component-based framework, research on component-based systems is of particular interest. Broadly speaking, SA belongs to the category of open systems (Oberndorf, 1997), and it is this openness that allows external tools to be connected to it. The concrete tool integration approach of SA is similar to that of the generic architecture presented in (Grundy et al., 1998): the external tools are wrapped in component-based interfaces.
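As a concrete, though hypothetical, example of such wrapping, a component can launch an external command-line tool, stream the data to it, and convert the tool's output back into the framework's data representation. The tool name, its flag and the line-based data format below are placeholders, not part of SA or of the cited architecture.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper component around an external command-line tool.
public class ExternalToolWrapper {
    public List<String> run(List<String> inputLines) throws Exception {
        Process p = new ProcessBuilder("external_tool", "--stdin").start();

        // Send the input data to the tool's standard input.
        Writer toTool = new OutputStreamWriter(p.getOutputStream());
        for (String line : inputLines) {
            toTool.write(line + "\n");
        }
        toTool.close();

        // Read the tool's standard output back as the component's result.
        BufferedReader fromTool =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
        List<String> result = new ArrayList<String>();
        String line;
        while ((line = fromTool.readLine()) != null) {
            result.add(line);
        }
        p.waitFor();
        return result;
    }
}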
Although the idea of gateway filters is not new,