Development Studio. The majority of studies
conducted on those tools focus primarily on
their functions rather than their
performance. Our study mainly focuses
on the performance of the tools based on five
attributes: memory shortages, excess paging with a
disk bottleneck, paging file fragmentation, memory
leaks, and cache manager efficiency.
Based on our study, data mining tools such as
Oracle Data Miner and Microsoft Business
Intelligence Development Studio cache a certain
percentage of both unmined and mined data in the
application tier. Such a strategy off-loads computing
cycles from the backend systems (for example,
Microsoft Business Intelligence Development
Studio and Oracle Data Miner). However, neither the
unmined nor the mined data is fully persisted or
cached in the backend systems. As a result, two
users may mine the same data set independently,
which duplicates the work performed.
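The redundancy described above can be illustrated with a minimal sketch of a shared, server-side result cache. This is an illustration under stated assumptions, not a description of how any of the surveyed tools work: the class `SharedMiningCache`, its methods, and the data set identifier are hypothetical names introduced here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical server-side cache: one mining result per data set,
// shared by every user of the server.
final class SharedMiningCache<R> {
    private final Map<String, R> results = new ConcurrentHashMap<>();
    private final AtomicInteger runs = new AtomicInteger();
    private final Function<String, R> miner;

    SharedMiningCache(Function<String, R> miner) {
        this.miner = miner;
    }

    // Mines the data set on the first request only; later requests
    // for the same data set reuse the cached result.
    R mine(String dataSetId) {
        return results.computeIfAbsent(dataSetId, id -> {
            runs.incrementAndGet();
            return miner.apply(id);
        });
    }

    // Number of times the (expensive) mining function actually ran.
    int miningRuns() {
        return runs.get();
    }
}

final class SharedMiningCacheDemo {
    public static void main(String[] args) {
        // The lambda stands in for a real mining job (assumption).
        SharedMiningCache<String> cache =
                new SharedMiningCache<>(id -> "model-for-" + id);
        cache.mine("sales-2006"); // user A
        cache.mine("sales-2006"); // user B reuses user A's result
        System.out.println(cache.miningRuns()); // prints 1
    }
}
```

With the result held once at the server, the second user's request costs a map lookup rather than a second mining run.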
Our study also reveals that data mining at the
memory level will lead to better performance. For
example, IBM Intelligent Miner consumes only 15%
of Physical Disk\Disk Time and 100MB of
Memory\Available Bytes. This suggests a trade-off
between memory and disk: the more work is done
at the memory level, the less time is spent
on disk activity (also referred to as
Disk I/O). Disk I/O is often a major bottleneck to
data mining performance.
To improve the performance of data mining, our
study reveals that major data mining activities
should be performed in-memory at the server level.
Therefore, the memory repository of the proposed
middleware will be adopted from SQL Server
Analysis Services. With the transition from 32-bit
to 64-bit computing, we believe the
proposed middleware will be able to take advantage
of larger addressable memory. In the near future,
we believe major data mining tools like Microsoft
Business Intelligence Development Studio and Oracle
Data Miner, which are largely vendor dependent, will
also do more of their work at the memory level.
At the time of our study, tools like Microsoft
Business Intelligence Development Studio, Oracle
Data Miner, and SAS Institute Enterprise Miner only
support a predefined set of data sources. For
example, Microsoft Business Intelligence
Development Studio only supports ODBC, OLEDB
and other predefined data source types. Oracle
Data Miner, on the other hand, only supports
JDBC-compliant drivers such as OCI-based drivers.
Adding new data sources to such tools is
difficult and often requires an understanding of the
relevant data source API specification. For
example, in the case of Oracle Data Miner,
implementers need to understand the JDBC API
specification.
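One common way to reduce this burden is an adapter interface, so that the mining code never touches a vendor API directly. The sketch below is an assumption made for illustration, not the design of any tool discussed here: `MiningDataSource`, `InMemorySource`, `CsvTextSource`, and `RowCounter` are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical adapter contract: mining code sees rows, never the
// API of the underlying data source.
interface MiningDataSource {
    List<String[]> readRows();
}

// Adapter over rows that are already held in memory.
final class InMemorySource implements MiningDataSource {
    private final List<String[]> rows;

    InMemorySource(List<String[]> rows) {
        this.rows = rows;
    }

    @Override
    public List<String[]> readRows() {
        return rows;
    }
}

// Adapter over comma-separated text. A JDBC- or ODBC-backed adapter
// would implement the same interface and keep its driver calls private.
final class CsvTextSource implements MiningDataSource {
    private final String text;

    CsvTextSource(String text) {
        this.text = text;
    }

    @Override
    public List<String[]> readRows() {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) {
                rows.add(line.trim().split(","));
            }
        }
        return rows;
    }
}

final class RowCounter {
    // Works with any adapter; no source-specific code is needed here.
    static int countRows(MiningDataSource source) {
        return source.readRows().size();
    }
}

final class DataSourceDemo {
    public static void main(String[] args) {
        MiningDataSource csv = new CsvTextSource("a,1\nb,2\n");
        MiningDataSource mem = new InMemorySource(
                Collections.singletonList(new String[] {"c", "3"}));
        System.out.println(RowCounter.countRows(csv) + RowCounter.countRows(mem)); // prints 3
    }
}
```

Under this arrangement, only the author of a new adapter needs to understand the source's API specification; code written against the interface is unaffected.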
A data mining tool may also be constrained by
platform dependence (Sanjiv, 2006). Tools such as
Microsoft Business Intelligence Development Studio
and SPSS Clementine are not platform independent.
Microsoft Business Intelligence Development Studio
depends on the .NET Framework, which currently only
supports the Windows platform. Supporting
other platforms such as Linux requires tedious
customization. SPSS Clementine, on the
other hand, ships different binaries for different
platforms. Oracle Data Miner uses the same binaries
on different platforms, and as such, is platform
independent.
3 PROPOSED DATA MINING
MIDDLEWARE
This paper discusses our proposed architecture for a
data mining middleware that builds on the
strengths and eliminates the weaknesses
of other data mining tools available in the market.
We refer to this middleware as the Java-Based Data
Mining Middleware (JDMM). The proposed
architecture is a server-centric middleware that
does not limit the set of data mining
techniques: new data mining
techniques can be plugged into the
middleware. In addition, JDMM will be a platform-,
data source-, and data mining technique-independent
middleware which is accessible from front-, back-
and web-office environments. JDMM is designed to
minimize the level of disk activity (Disk I/O) over
time during data mining by introducing a
memory-optimized repository and other
technologies. Disk I/O is an important performance
metric during data mining because disks are often a
major bottleneck for data mining performance.
The performance of any application that performs
I/O will be limited by that I/O, so further CPU
performance improvements will be wasted (Peter &
David, 1993). This is particularly true for
database-driven data mining. Hence, the JDMM
architecture needs to address the issue of disk I/O
throughput to enable a highly scalable and almost
instantly responsive server-centric data mining middleware.
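The idea that new data mining techniques can be plugged into the middleware can be sketched as a small registry keyed by technique name. This is a minimal illustration under our own assumptions, not the JDMM implementation: `MiningTechnique`, `FrequencyCount`, and `TechniqueRegistry` are hypothetical names, and item counting stands in for a real mining algorithm.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in contract for a mining technique.
interface MiningTechnique {
    Map<String, Integer> mine(List<String> items);
}

// Example technique: counts how often each item occurs.
final class FrequencyCount implements MiningTechnique {
    @Override
    public Map<String, Integer> mine(List<String> items) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }
}

// Registry that lets techniques be added without changing the
// middleware code that invokes them.
final class TechniqueRegistry {
    private final Map<String, MiningTechnique> techniques = new LinkedHashMap<>();

    void plugIn(String name, MiningTechnique technique) {
        techniques.put(name, technique);
    }

    Map<String, Integer> run(String name, List<String> items) {
        MiningTechnique technique = techniques.get(name);
        if (technique == null) {
            throw new IllegalArgumentException("No technique registered as: " + name);
        }
        return technique.mine(items);
    }
}

final class TechniqueRegistryDemo {
    public static void main(String[] args) {
        TechniqueRegistry registry = new TechniqueRegistry();
        registry.plugIn("frequency", new FrequencyCount());
        System.out.println(
                registry.run("frequency", Arrays.asList("milk", "bread", "milk")));
        // prints {milk=2, bread=1}
    }
}
```

A new technique is added by implementing the interface and calling `plugIn`; the caller selects it by name at run time.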
ARCHITECTURE-CENTRIC DATA MINING MIDDLEWARE SUPPORTING MULTIPLE DATA SOURCES AND
MINING TECHNIQUES