On the other hand, thinking in the context of the Spark architecture (cf. Figure 3), we can map our approach to the distributed setting as follows (see the sketch after the list):
• data: feature sets (the P_i's) residing as distributed data (i.e. RDDs of model-feature pairs),
• cache: maximal feature set (F) precomputed and distributed to each worker node to be held in memory cache,
• tasks: feature comparison as the atomic unit of parallel execution.
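A minimal sketch of this mapping in Spark's Scala API is given below. The Feature type, the extract function and the input location are hypothetical placeholders for the corresponding SAMOS building blocks, not the actual implementation:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; the names below are our own stand-ins, not SAMOS code.
    object VsmSketch {
      type Feature = String   // e.g. an attributed-node bigram

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("samos-vsm-sketch").getOrCreate()
        val sc = spark.sparkContext

        // placeholder for SAMOS parsing + feature extraction: source -> (id, P_i)
        val extract: String => (String, Seq[Feature]) =
          src => (src.hashCode.toString, src.split("\\s+").toSeq)

        // data: RDD of (modelId, feature) pairs, one entry per feature p in P_i
        val pairs = sc.textFile("hdfs:///metamodels/*")   // hypothetical location
          .map(extract)
          .flatMap { case (id, ps) => ps.map(f => (id, f)) }

        // cache: the maximal feature set F, precomputed once and broadcast so
        // that every worker node holds it in memory
        val F = sc.broadcast(pairs.values.distinct().collect())
      }
    }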
While feature-to-feature comparison is the atomic unit for parallelization in this setting, for practical reasons we aim for a coarser granularity: we perform a single pass for each feature, comparing it with the maximal set. Each parallel task in turn consists of (1) pair-wise comparing a feature p in P_i against F and computing an intermediate vector, and (2) computing the final VSM vector for the corresponding model, e.g. by summing the intermediate ones (the frequency setting of SAMOS (Babur and Cleophas, 2017)). To exemplify, a model consisting of m features is processed m-way, and the partial results are eventually integrated to calculate the single row of the VSM corresponding to that model.
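Continuing the sketch above, where pairs and F are as defined earlier, the two task steps could look as follows; the compare function is again a placeholder, as SAMOS's actual comparison schemes (Babur and Cleophas, 2017) are more elaborate:

    // placeholder feature-to-feature similarity
    val compare: (Feature, Feature) => Double =
      (p, q) => if (p == q) 1.0 else 0.0

    // (1) one pass per feature p in P_i: compare p against every q in F,
    //     yielding an intermediate vector of |F| similarity scores
    val intermediate = pairs.mapValues(p => F.value.map(q => compare(p, q)))

    // (2) per model: sum the intermediate vectors (frequency setting) to
    //     obtain the single VSM row for that model
    val vsmRows = intermediate.reduceByKey((a, b) =>
      a.zip(b).map { case (x, y) => x + y })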
Note that a great deal of the necessary functionality for distributed operation is provided by Spark: partitioning and distribution (shuffling) of the data, synchronisation of the tasks and the workflow, data collection and I/O, and so on. The necessary modifications to SAMOS mostly consisted of wrapping the relevant building blocks (e.g. parsing and extraction, feature comparison) into parallel Spark RDD operations, with minimal glue code around them.
5 PRELIMINARY RESULTS AND DISCUSSION
We performed some preliminary experiments with our technique. As the computational platform, we used SURFSara (https://www.surf.nl/en/about-surf/subsidiaries/surfsara/), the computational infrastructure for ICT research in the Netherlands. SURFSara provides a Hadoop cluster with Spark support, consisting of 170 data/compute nodes with 1370 CPU cores for parallel processing and a distributed file system with a capacity of 2.3 PB.
Next, as for SAMOS, we chose bigrams of attributed nodes (for the clone detection scenario (Babur, 2018)), as one of the more computationally intensive settings (compared with, e.g., extracting simple word features for domain analysis (Babur et al., 2016)).
As for the dataset, we mined GitHub for (1) a limited set of 250 Ecore (https://www.eclipse.org/modeling/emf/) metamodels, and (2) a large set of 7312 Ecore metamodels (after removing exact duplicates and files smaller than 2KB). Table 1 shows some details on the sizes of the two datasets. A further SAMOS framework setting to mention is that we turned off the expensive NLP checks for semantic relatedness and synonymy for this preliminary experiment.
Normally, we have a simplistic all-or-none strategy for NLP-caching: for small datasets we iterate over all the model elements to compute, and keep in memory, the word-to-word similarity scores (i.e. full caching). For the distributed execution we have disabled this feature completely, as the relevant data for very large model sets (the goal is to process tens of thousands of models) cannot fit into memory. As future work, we plan to investigate more sophisticated caching approaches to circumvent this issue; one possible direction is sketched below.
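As an illustration only (this is our own assumption, not part of SAMOS), a bounded least-recently-used cache per worker would memoise frequently recurring word pairs without requiring the full pairwise similarity table to fit in memory; the similarity function below is a hypothetical stand-in for the expensive NLP check:

    import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

    // Hypothetical bounded LRU cache for word-to-word similarity scores.
    // One instance would live per executor (e.g. created inside mapPartitions).
    class SimilarityCache(maxEntries: Int, similarity: (String, String) => Double) {
      private val cache =
        new JLinkedHashMap[(String, String), Double](16, 0.75f, true) {
          // evict the least-recently-used entry beyond the bound
          override protected def removeEldestEntry(
              eldest: JMap.Entry[(String, String), Double]): Boolean =
            size() > maxEntries
        }

      def score(a: String, b: String): Double = {
        val key = if (a <= b) (a, b) else (b, a)   // order-insensitive lookup
        if (cache.containsKey(key)) cache.get(key) // a hit refreshes LRU order
        else {
          val s = similarity(key._1, key._2)
          cache.put(key, s)
          s
        }
      }
    }

Such a cache would trade some recomputation for a fixed per-worker memory footprint.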
Table 1: Description of the datasets: number of metamodels, total file size and number of model elements.

dataset   #models   file size   #model elem.
1         250       4.8 MB      ∼50k
2         7312      133.6 MB    ∼1 million
Performance for Dataset 1. On dataset 1, we ran the single-core local version of SAMOS, with and without NLP-caching, and the distributed version with 1, 10, 50, 100, 250 and 500 executors without NLP-caching. Figure 4 depicts the results. For the single-core case, local execution has the best performance, especially with NLP-caching enabled. We have included the single-core distributed case to roughly assess the overhead: 17.1 hours (distributed) versus 13.8 hours (local), i.e. roughly 24% overhead. It is evident that as the number of executors increases, the performance improves as well, though with diminishing returns.
Performance for Dataset 2. As a bigger challenge for our approach, we made an attempt to run dataset 2 with the same (expensive) settings as above. One could argue for more approximate, hence cheaper, settings (unigrams instead of bigrams, ignoring instead of including attributes, etc.) for such a large dataset, but we performed this experiment in order to load-test and assess the limits of our technique. We successfully calculated the resulting VSM using a total of ∼1500 executor cores (215 executors with 7 cores and 8GB of memory each) processing the 5000-way partitioned data on SURFSara.
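For reference, a configuration of this shape could be expressed as follows; this is our own illustration, as the exact invocation used on SURFSara is not given in the text:

    import org.apache.spark.SparkConf

    // Hypothetical configuration mirroring the reported resources:
    val conf = new SparkConf()
      .setAppName("samos-vsm")
      .set("spark.executor.instances", "215")  // 215 executors
      .set("spark.executor.cores", "7")        // 7 cores each, ~1500 in total
      .set("spark.executor.memory", "8g")      // 8GB per executor
    // with the input repartitioned 5000-way, e.g. pairs.repartition(5000)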