
The developers can then replace the proxies and dummies with regular filters once the algorithms they represent have been designed, tested and coded in the language of the framework. Some proxies may even be left in the final application, for example if a gateway filter has been used to build an interface to a third-party algorithm library.
This paper proposes gateway filters as a means of supporting algorithm design and prototype implementation in data mining software development. Gateways provide an encapsulated and open-ended interface that allows functionality provided by other tools to be included in SA-based applications. Encapsulation makes the interface convenient to use, because regular filters can be replaced with gateways and vice versa without modifying other parts of the application. Openness ensures that the range of supported tools is not limited to just one or a few. A positive side effect of openness is that gateway filters have several potential applications besides the principal one.
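As a rough illustration of this interchangeability, consider the following sketch. The interface and class names are purely illustrative and do not correspond to the actual Smart Archive API; a double array stands in for whatever data container the framework passes between filters.

// Hypothetical sketch only; names are illustrative, not SA's real API.
public interface Filter {
    double[] process(double[] input);
}

class SmoothingFilter implements Filter {
    // Regular filter: the algorithm is coded in the framework's language.
    public double[] process(double[] input) {
        double[] out = input.clone();
        for (int i = 1; i < out.length; i++) {
            out[i] = 0.5 * (input[i - 1] + input[i]);   // simple two-point smoothing
        }
        return out;
    }
}

class ExternalToolGateway implements Filter {
    // Gateway filter: delegate the computation to an external tool
    // (e.g. via a pipe, socket or native library call) and wrap the result.
    public double[] process(double[] input) {
        return callExternalTool(input);
    }
    private double[] callExternalTool(double[] input) {
        // Placeholder for the actual inter-process or library call.
        return input;
    }
}

Because both classes satisfy the same interface, the rest of the application does not need to distinguish a gateway from a regular filter, which is what allows the two to be swapped freely.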
Section 2 discusses related work, most importantly other data mining frameworks. The main contribution is given in Section 3, which introduces the concept of gateway filters, and Section 4, which presents a gateway filter implementation and documents some tests performed on it. Section 5 discusses the findings, and Section 6 concludes the paper.
2 RELATED WORK
Weka (Witten and Frank, 2005) is a library of machine learning algorithms developed at the University of Waikato, New Zealand. The library provides a collection of algorithms for regression, classification, clustering and association rule mining, as well as I/O interfaces and preprocessing, evaluation and visualization methods. The algorithms can be invoked from Java code or operated through a set of graphical user interfaces bundled with the library.
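For example, a classifier from the library can be trained and evaluated from plain Java in a few lines. The following is a minimal sketch; the data file name and the choice of learner are arbitrary.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF data set; "iris.arff" is just a placeholder file name.
        Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        // Train a C4.5-style decision tree and estimate its accuracy with 10-fold CV.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}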
D2K (Automated Learning Group, 2003) is a data mining environment developed by the Automated Learning Group of the University of Illinois. It provides a graphical toolkit for arranging application components, called modules, into processing sequences, called itineraries, but developers also have the option of using the core components of D2K without the toolkit. In addition to a selection of reusable modules for various purposes (I/O, data preparation, computation, visualization), there is an API for programming new ones.
KNIME (Berthold et al., 2006), developed at the University of Konstanz, Germany, is another data mining environment with a graphical user interface for manipulating components, called nodes. Although KNIME and D2K are superficially different, there is a close analogy between the nodes and workflows of KNIME and the modules and itineraries of D2K. Based on the Eclipse integrated development environment, KNIME bears more resemblance to a programming tool and allows arbitrary code snippets to be inserted into workflows.
RapidMiner, formerly known as YALE (Mierswa et al., 2006), was originally developed at the University of Dortmund, Germany. Like D2K and KNIME, it is a graphical environment for building data mining applications from components, called operators. However, unlike D2K and KNIME, RapidMiner models the operator sequence as a tree rather than an arbitrary directed graph. The operator tree can be manipulated either graphically or by editing its XML representation, which is always available in its own tab in the user interface.
Smart Archive shares with Weka, D2K, KNIME and RapidMiner the principle of composing applications from components, with the output of one component becoming input for the next. The sequence of components, analogous to a D2K itinerary, is called the execution graph. Another layer of functionality built on top of the execution graph, called the mining kernel, is responsible for gathering data from data sources (via interface objects known as input receivers) and feeding it into the execution graph.
A general difference between Smart Archive and the other frameworks discussed above is the greater emphasis that SA puts on openness and flexibility. In practice this means that SA offers greater freedom to work directly with application code and to use selected parts of the framework to build a precisely tailored application. Another significant difference lies in the approach to using databases in applications: Smart Archive, with its data sinks embedded in components, provides a tighter form of integration than the other frameworks, where database inputs and outputs are simply one kind of component among many.
From a software engineering perspective, the integration of SA with external tools falls within the scope of software interoperability, on which there is a considerable body of literature. Since SA is a component-based framework, research on component-based systems is of particular interest. Broadly speaking, SA belongs to the category of open systems (Oberndorf, 1997), and it is this openness that allows external tools to be connected to it. The concrete tool integration approach of SA is similar to that of the generic architecture presented in (Grundy et al., 1998): the external tools are wrapped in component-based interfaces.
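As a concrete, though hypothetical, example of such wrapping, a component can launch an external command-line tool, stream the data to it, and convert the tool's output back into the framework's data representation. The tool name, its flag and the line-based data format below are placeholders, not part of SA or of the cited architecture.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper component around an external command-line tool.
public class ExternalToolWrapper {
    public List<String> run(List<String> inputLines) throws Exception {
        Process p = new ProcessBuilder("external_tool", "--stdin").start();

        // Send the input data to the tool's standard input.
        Writer toTool = new OutputStreamWriter(p.getOutputStream());
        for (String line : inputLines) {
            toTool.write(line + "\n");
        }
        toTool.close();

        // Read the tool's standard output back as the component's result.
        BufferedReader fromTool =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
        List<String> result = new ArrayList<String>();
        String line;
        while ((line = fromTool.readLine()) != null) {
            result.add(line);
        }
        p.waitFor();
        return result;
    }
}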
Although the idea of gateway filters is not new,