In this work, we propose a solution for ontology
matching that initially partitions two ontologies using
Community Detection techniques (Fortunato, 2010).
Next, we consider three different aspects
of the ontology partitions: terminological content,
topological features and the instance-based aspect, also
known as the extensional aspect. For each aspect, we
apply Independent Component Analysis (ICA)
(Honkela, Hyvärinen and Väyrynen, 2010) for
dimensionality reduction. ICA is a technique inspired
by the problem of blind signal separation. It applies
linear transformations to the data to obtain statistically
independent components, reducing the data to its most
relevant features. After applying ICA, we cluster
the ontology partitions according to each aspect,
considered separately, and find a consensus clustering
by applying Bayesian Cluster Ensembles (BCE) (Wang,
Shan and Banerjee, 2011). Finally, we match classes
and properties of ontology partitions that belong to
the same consensual cluster.
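
As an illustration of the final step only, the following minimal sketch shows how a consensus clustering can restrict matching to partitions that share a consensual cluster. The function name `candidate_pairs`, the partition identifiers and the two dictionaries are hypothetical and serve only as an example; the actual matching of classes and properties inside each pair is not shown.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(cluster_of, ontology_of):
    """Return cross-ontology partition pairs that share a consensual cluster.

    cluster_of:  partition id -> consensual cluster label
    ontology_of: partition id -> name of the source ontology
    """
    # Group partitions by consensual cluster, then by source ontology.
    groups = defaultdict(lambda: defaultdict(list))
    for part, cluster in cluster_of.items():
        groups[cluster][ontology_of[part]].append(part)

    pairs = []
    for cluster in sorted(groups):
        by_onto = groups[cluster]
        names = sorted(by_onto)
        # Only pairs drawn from two different ontologies are match candidates.
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                pairs.extend(product(sorted(by_onto[a]), sorted(by_onto[b])))
    return pairs


# Hypothetical toy input: two ontologies (A and B) with two partitions each.
cluster_of = {"A1": 0, "A2": 1, "B1": 0, "B2": 1}
ontology_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
pairs = candidate_pairs(cluster_of, ontology_of)
print(pairs)
```

In this toy input, only two of the four possible cross-ontology partition pairs remain as candidates, which is the kind of search-space reduction the clustering step aims for.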
This paper is structured as follows: in section
2, we explain BCE; in section 3, we review related
work; in section 4, we explain our methods; in
section 5, we outline the expected results. Since this
work is an ongoing project, we plan to present its
results and conclusions in future publications.
2 BAYESIAN CLUSTER ENSEMBLES
Cluster Ensembles techniques combine clustering
solutions (base clusterings), obtained by different
algorithms, into a consensual clustering, which
captures different assumptions of the algorithms,
making the solution more accurate and more robust
(Wang, Shan and Banerjee, 2011).
In BCE, given n data points to be clustered, Wang,
Shan and Banerjee (2011) assume that each data point
participates in all consensual clusters, in different
proportions, given by probabilities. BCE is based on
a probabilistic generative process, which considers
that the consensual clusters generate the base
clusterings. Figure 1 illustrates the generative process
of BCE. Matrix B represents the base clusterings and
matrix C refers to the consensual clusters. In matrix
B, the rows represent seven data points, $x_i$
(i = 1,…,7). There are three base clusterings, $\lambda_j$
(j = 1, 2, 3), which are the columns of B. The entries of B are
the base clusterings’ labels. In the generative process,
these labels are drawn from probabilistic distributions
related to the consensual clusters (matrix C). The
labels of the base clusterings follow discrete
distributions.
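
The generative process can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: it assumes, following the usual description of BCE, that the membership vector of each data point is drawn from a Dirichlet distribution with parameter α and that β holds one discrete label distribution per consensual cluster and base clustering. All function names are hypothetical, and all numeric values are illustrative except the distribution (0.1, 0.1, 0.8), which is taken from the example of figure 1.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_discrete(probs, rng):
    """Draw an index in 0..len(probs)-1 with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_base_clusterings(n_points, alpha, beta, seed=0):
    """Sample a matrix B of base-clustering labels (one row per data point).

    alpha: assumed Dirichlet parameter, one entry per consensual cluster.
    beta:  beta[k][j] is the discrete distribution over the labels of
           base clustering j, given consensual cluster k.
    """
    rng = random.Random(seed)
    n_base = len(beta[0])
    B = []
    for _ in range(n_points):
        theta = sample_dirichlet(alpha, rng)  # membership of this data point
        row = []
        for j in range(n_base):
            z = sample_discrete(theta, rng)   # consensual cluster for this entry
            label = sample_discrete(beta[z][j], rng)
            row.append(label + 1)             # report labels as 1-based
        B.append(row)
    return B


# Illustrative parameters: 2 consensual clusters, 3 base clusterings, 3 labels.
alpha = [1.0, 1.0]
beta = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # consensual cluster 1
    [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8], [0.6, 0.2, 0.2]],  # consensual cluster 2
]
B = generate_base_clusterings(7, alpha, beta)
for row in B:
    print(row)
```

Each printed row corresponds to one data point and each column to one base clustering, mirroring the layout of matrix B.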
In the example of figure 1, suppose that the label of $x_1$
in $\lambda_1$ was generated by consensual cluster 2.
Then, according to column 1 and row 2 of C, the
probability that $x_1$ is in cluster 1 of $\lambda_1$ is 0.1. Following
the same discrete distribution, the probability that $x_1$
is in cluster 2 is 0.1 and the probability that it is in
cluster 3 is 0.8. Given that 0.8 is the highest
probability for $\lambda_1$ (column 1 of C) across the
two consensual clusters, we conclude that
consensual cluster 2 generated $x_1$ and that $x_1$ has label
3.
The goal of BCE is to infer the consensus
clustering with Bayesian inference, treating the base
clusterings as the observed data. As figure 1 shows,
the inference process of BCE follows the inverse
direction of the generative process. BCE infers the
degree of membership θ of each data point in the
consensual clusters and the consensual label z
assigned to each data point, with α and β as the
parameters of the model. Wang, Shan
and Banerjee (2011) conducted an experiment with
scientific datasets to compare BCE to other Cluster
Ensembles techniques and clustering algorithms, e.g.
Hypergraph Partitioning Algorithm and K-means.
They measured the
clustering accuracy as the number of data
points correctly assigned to a cluster, based on a gold
standard. BCE outperformed the other techniques and
algorithms in most cases.
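
One way to compute such an accuracy is sketched below. The text above does not state how clusters are mapped to gold-standard classes, so this sketch assumes a majority-vote mapping: each cluster is associated with its most frequent gold class, and a data point counts as correct when it belongs to that class. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_accuracy(predicted, gold):
    """Fraction of points that belong to the majority gold class of their cluster."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one data point is required")
    classes_in_cluster = defaultdict(Counter)
    for cluster, cls in zip(predicted, gold):
        classes_in_cluster[cluster][cls] += 1
    # For each cluster, count the points of its most frequent gold class.
    correct = sum(c.most_common(1)[0][1] for c in classes_in_cluster.values())
    return correct / len(gold)


# Hypothetical toy data: 7 points, 3 predicted clusters, 3 gold classes.
predicted = [0, 0, 0, 1, 1, 2, 2]
gold = ["a", "a", "b", "b", "b", "c", "c"]
acc = clustering_accuracy(predicted, gold)
print(acc)
```

In the toy data, only the third point disagrees with the majority class of its cluster, so six of the seven points count as correct.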
3 RELATED WORKS
Algergawy, Massmann and Rahm (2011) and
Moawed et al. (2015) used the Vector Space Model
(VSM) (Manning, Raghavan and Schütze, 2009) and
clustered ontology partitions solely based on their
terminological content, without considering the other
aspects of the partitions. Considering multiple
aspects for clustering can help find additional
correct clusters, which can increase the number of
similar ontology elements grouped in the same
cluster and thus help improve the precision of the
matching results.
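
To make the terminological view concrete, the sketch below represents each partition as a bag of terms, weights the terms with TF-IDF and compares partitions by cosine similarity, as is common in the VSM. It is a minimal sketch, not the procedure of the cited works: the weighting scheme, the function names and the toy term lists are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical partitions, each reduced to the terms of its element names.
partitions = [
    ["author", "paper", "review"],
    ["author", "article", "review"],
    ["hotel", "room", "booking"],
]
vecs = tfidf_vectors(partitions)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(sim_01, sim_02)
```

The first two partitions share terms and obtain a positive similarity, while the third shares none and obtains zero. A clustering algorithm applied to these vectors would therefore see only the terminological aspect of the partitions.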
Ferrara et al. (2015) found a consensus clustering
based on the co-occurrence of ontology elements in
the same cluster in different clustering solutions.
However, they did not apply BCE, which can
provide more accurate clustering results than other
Cluster Ensembles techniques (Wang, Shan and
Banerjee, 2011). This accuracy can improve the
ontology clustering result and, in turn, increase the
precision of the ontology alignment.
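
For contrast with BCE, a generic co-occurrence-based consensus can be sketched as follows. This is not the exact method of Ferrara et al. (2015): it assumes a simple co-association matrix that records how often two elements share a cluster across the base clusterings, followed by a threshold of 0.5 and connected components. The function names, the threshold and the toy data are illustrative assumptions.

```python
def co_association(base_clusterings):
    """co[i][j] = fraction of base clusterings that put items i and j together."""
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    co = [[0.0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / m
    return co


def consensus_by_threshold(co, threshold=0.5):
    """Link items whose co-association exceeds the threshold and
    return the connected components as consensus labels."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical base clusterings of five ontology elements.
base = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0],
]
consensus = consensus_by_threshold(co_association(base))
print(consensus)
```

Unlike BCE, this scheme yields hard assignments only and has no probabilistic model of how the base clusterings were generated.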