(partitioning) process can be critical to the
performance of the subsequent ontology matching process.
Regularized Markov Clustering still suffers from
scalability and storage issues that result from the
high time and space complexity of the clustering
process (Bustamam et al., 2012). At the core of the
regularized variants of Markov clustering, the
complexity is dominated by the iterative sparse
matrix-matrix multiplication and normalization
steps, which are intensive and time-consuming.
Thus, for scalability and performance reasons, we
need a computationally efficient algorithm that
performs parallel sparse matrix-matrix computations
and parallel sparse Markov matrix normalizations in
order to improve MCL performance.
To overcome the above MCL issues, we propose a
parallel Markov clustering approach that improves
the performance of Markov Clustering by
implementing the expansion, inflation and pruning
operators as parallel tasks using the Spark
framework (see Figure 1), a fast and general-purpose
cluster computing system.
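For concreteness, the three MCL operators that our approach parallelizes can be sketched sequentially in plain Python. This is an illustrative sketch only; the actual implementation distributes these steps over Spark, and the threshold value is a hypothetical choice:

```python
def normalize(m):
    """Column-normalize a square matrix so each column sums to 1
    (i.e. make it a column-stochastic Markov matrix)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m

def expand(m):
    """Expansion: square the matrix, simulating random walks of length 2."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise each entry to the power r, then re-normalize,
    strengthening strong flows and weakening weak ones."""
    return normalize([[x ** r for x in row] for row in m])

def prune(m, threshold=1e-4):
    """Pruning: drop entries below a small threshold (illustrative value)
    to keep the matrix sparse between iterations."""
    return [[x if x >= threshold else 0.0 for x in row] for row in m]
```

A typical MCL loop applies `expand`, `inflate`, and `prune` repeatedly, re-normalizing after pruning, until the matrix converges to a clustering.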
3.3.1 Apache Spark Framework
Apache Spark is a unified programming
environment that provides a computing framework
designed for scheduling, distributing, and
monitoring applications that consist of many
computational tasks across many commodity worker
machines (a computing cluster) (Shanahan and Dai,
2015).
The computational engine, Spark Core, provides
the basic functionality of Spark, including
components for memory management, task
scheduling, fault recovery, and interaction with
storage systems. Furthermore, it offers the ability to
run in-memory computations, which provides faster
and more expressive processing and increases the
speed of the system. In-memory processing is faster
since no time is spent moving data and processes in
and out of the hard disk. Accordingly, Spark caches
much of the input data in memory for further use,
which particularly benefits iterative algorithms that
access the same data repeatedly.
3.3.2 Our Implementation
In order to improve MCL performance, the
parallel implementation of the Regularized Markov
Clustering algorithm constitutes an important
challenge. We therefore introduce a fast Markov
clustering algorithm that uses the Spark framework
to perform in parallel the sparse matrix-matrix
computations that are at the heart of Markov
clustering.
First of all, the graphs resulting from large-scale
ontologies are generally sparse; thus, the storage
issues of Markov clustering can be resolved using
the distributed sparse matrix data structures offered
by resilient distributed datasets (RDDs) (Bosagh et
al., 2016), which are partitioned collections of
objects spread across many compute nodes that can
be manipulated in parallel. These matrices are
distributed across computational nodes by entries
using the CoordinateMatrix implementation, by
rows via RowMatrix, or by blocks via
BlockMatrix.
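As a rough local analogue of the entry-wise CoordinateMatrix layout, a coordinate-format sparse matrix stores only the nonzero (row, column, value) triples, which can then be split into partitions the way a cluster spreads them across worker nodes. The class and partitioning scheme below are illustrative, not the Spark API:

```python
class CooMatrix:
    """Minimal coordinate-format (COO) sparse matrix: only nonzero
    (i, j, value) entries are kept, mirroring how an entry-wise
    distributed matrix stores a sparse graph adjacency matrix."""

    def __init__(self, entries, shape):
        # Drop explicit zeros so storage scales with the nonzeros only.
        self.entries = [(i, j, v) for (i, j, v) in entries if v != 0.0]
        self.shape = shape

    def partitions(self, num_parts):
        """Split the entries into num_parts chunks, as a cluster would
        spread them across nodes (here, hash-partitioned by row index)."""
        parts = [[] for _ in range(num_parts)]
        for (i, j, v) in self.entries:
            parts[i % num_parts].append((i, j, v))
        return parts
```

Each partition can then be processed by a separate worker, which is what makes the entry-wise distribution suitable for very large, very sparse ontology graphs.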
Moreover, the distributed sparse matrices offered
by Spark ignore the large number of zero entries
found in the initial stochastic matrices of Markov
clustering algorithms. Hence, the distributed
matrices reduce the computational load by avoiding
additions and multiplications with zero entries, since
zero values do not influence any of the Markov
clustering operations. Furthermore, Spark comes
with a library of common computational
functionality (Spark MLlib) (Meng et al., 2016),
in particular for matrix-matrix computations. These
properties are the primary source of our system's
acceleration.
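The saving from skipping zero entries can be illustrated with a small sparse matrix-matrix product over `{row: {col: value}}` dictionaries; this is a local sketch of the idea, not the MLlib implementation:

```python
def sparse_matmul(a_rows, b_rows):
    """Multiply two sparse matrices stored as {row: {col: value}} dicts.
    Only nonzero entries are ever visited, so zero entries cost nothing --
    the property that lets distributed sparse matrices cut the load."""
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = {}
        for k, a_ik in a_row.items():
            # Combine row i of A only with the stored (nonzero) row k of B.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            c_rows[i] = acc
    return c_rows
```

The work is proportional to the number of nonzero pairings rather than to n^3, which is exactly why sparsity dominates the cost of the expansion step.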
Besides, Spark provides a way to parallelize our
Markov clustering strategy across clusters while
hiding the complexity of distributed programming,
network communication, and fault tolerance,
thereby ensuring the effective and reliable use of
cache memory and a balanced process load. The
latter is enhanced by dividing the data according to
performance- and scalability-friendly data
structures, allowing the system to dynamically adapt
the fine-grained allocation of both resources and
computations to the workload. The pseudo-code of
our parallel Markov Clustering implementation is
given in the following:
Input: An ontology O, Balance
parameter b, Inflation rate r
Output: Clusters Set C={C1,C2,…,Cn}
{//Phase 1: Ontology Parsing}
//get ontology concepts
Nodes:= Concepts(O)
//taxonomic and non-taxonomic relations
Edges:= Relations(O)
//create associated graph
G := CreateGraph(Nodes,Edges);
//Graph adjacency matrix
A := Adjacency(G)
{//Phase 2: Markov Clustering}