3.4 Discussion
In our evaluation, we trained different neural networks with varying parameters and assessed their precision in suggesting beneficial data sources. First of all, it should be noted that
the expected value, i.e., the baseline, was always ex-
ceeded. For the setup of our evaluation, only 5 dif-
ferent domains were used, which is why the expected
value is comparatively high. In reality, it can be as-
sumed that there are many more different domains
available to a domain expert, and with each additional
domain, the expected value drops significantly. How-
ever, the comparison of the first two scenarios S1 and S2 shows that our approach is not noticeably affected by the availability of an additional unfamiliar domain. Thus,
we can expect that similarly promising results can be achieved if additional domains are provided, and that a domain expert's selection of beneficial data sources can be aided tremendously.
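One way to see why the baseline drops with more domains is a simple probabilistic model (our own back-of-the-envelope sketch, not the paper's evaluation code): if the baseline is taken as the chance that a uniformly random top-k suggestion out of d candidate domains includes the one beneficial domain, it equals k/d.

```python
def random_baseline(num_domains: int, k: int = 1) -> float:
    """Probability that a uniformly random top-k suggestion includes
    the single beneficial domain (hypothetical baseline model)."""
    return min(k / num_domains, 1.0)

# With the 5 domains of the evaluation setup, the top-1 baseline is
# already 0.2; it drops quickly as more domains become available.
for d in (5, 10, 50, 100):
    print(d, random_baseline(d))
```

Under this model, a top-1 baseline of 0.2 for five domains shrinks to 0.01 for a hundred domains, which matches the observation that the expected value is comparatively high in the evaluation setup.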
A more detailed analysis of the various parameters
shows that the models that were trained on a larger
number of features achieved better results on the test
datasets. Therefore, we conclude that context groups
become more apparent with more features. In addition, the less effective detection of context groups with fewer features could also be related to our reliance on combinations of features instead of permutations, chosen to reduce training complexity.
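The complexity reduction from using combinations rather than permutations is easy to quantify (a generic illustration with hypothetical feature counts, not the paper's training code): for n features taken k at a time, the number of permutations exceeds the number of combinations by a factor of k!.

```python
from math import comb, perm

n, k = 10, 3  # hypothetical numbers of features, for illustration only

print(comb(n, k))  # order-insensitive feature combinations: 120
print(perm(n, k))  # ordered feature permutations: 720, a factor of k! = 6 more
```

For larger k the gap grows factorially, so training on combinations keeps the number of generated training samples far smaller.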
The number of domain-foreign features also has a noticeable effect: the achieved mean precision decreases as domain-foreign features are added, in particular in the top-1 evaluation. However, a beneficial data source would
still be suggested in approximately every second at-
tempt and the expected value is significantly ex-
ceeded. Since our approach targets domain experts,
they can manually check the suggestions. We expect
the top-4 performance to be more relevant for this rea-
son, since four data sources can be easily reviewed for
suitability and the most beneficial data source can be
manually selected. In this case, there is hardly any
influence by domain-foreign features.
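The top-1 and top-4 measurements discussed above can be sketched as follows (our own illustrative helper with made-up source names, not the original evaluation code): a query counts as a hit if at least one beneficial data source appears among the first k ranked suggestions, and the mean precision is the fraction of queries that are hits.

```python
def hit_at_k(ranked_sources, beneficial, k):
    """True if any beneficial source is among the top-k suggestions."""
    return any(s in beneficial for s in ranked_sources[:k])

def mean_precision_at_k(queries, k):
    """Fraction of queries whose top-k suggestions contain a beneficial source."""
    hits = sum(hit_at_k(ranked, beneficial, k) for ranked, beneficial in queries)
    return hits / len(queries)

# Hypothetical example: 2 of 3 queries have a beneficial source at rank 1,
# while all 3 have one within the top 4.
queries = [
    (["s3", "s1", "s7", "s2"], {"s3"}),
    (["s5", "s4", "s9", "s8"], {"s4"}),
    (["s6", "s2", "s1", "s5"], {"s6"}),
]
print(mean_precision_at_k(queries, 1))  # 2/3
print(mean_precision_at_k(queries, 4))  # 1.0
```

This also illustrates why the top-4 metric is more forgiving: a beneficial source slipping from rank 1 to rank 2 hurts top-1 precision but leaves top-4 precision unchanged.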
With regard to the number of instances, no particular influence of the different parameterizations could be found. Yet, this is not surprising and meets our expectation, since the vectorization of a feature relies mainly on a fraction of the available metrics, which does not change much when more data is included. Similarly, we
did not find a strong influence of the number of repetitions. The precision does increase slightly, but at the expense of a longer training time.
In terms of the evaluated transfer learning, it
should be noted that the achieved mean precision de-
creases notably and the results on unfamiliar domains
are lower than those on familiar domains. However,
even data sources from unfamiliar domains are pre-
dicted successfully and the expected value is still eas-
ily outperformed.
In summary, the detailed results show that our ap-
proach provides significant benefits for domain ex-
perts. Moreover, different models were compared
during the evaluation and no matter which parame-
terization was used, the expected value, i.e., the base-
line, was always exceeded. Likewise, the same trends
are evident for all parameterizations (with the excep-
tion of n=100 000 and rep=20). This indicates a high
robustness of our approach.
4 RELATED WORK
A common approach to identify related data is the
use of similarity metrics. In general, similarity met-
rics aim to measure the similarity between instances.
Common metrics applied for this purpose are, for
instance, Euclidean distance (O’Neill, 2006), Manhattan distance (Craw, 2017; Krause, 1975), or cosine similarity (Han et al., 2012). These metrics
work on numeric values only; for text, we can apply metrics like the Levenshtein distance (Levenshtein, 1966) or the Hamming distance (Hamming, 1950). However, these metrics are limited to pairwise comparisons and do not allow comparisons between text
and numeric values. To overcome this issue, oftentimes a vectorization of instances or features is applied. In this process, various characteristics of the instance or feature are measured, e.g., occurrences of characters or the number of lowercase characters. As a result, an instance or feature is transformed into a vector of meta-features, as typically required for deep learning approaches. However, these meta-features are limited to statistical information about the data. To integrate semantic information, a state-of-the-art approach is the use of Word2Vec (Mikolov et al., 2013).
Word2Vec calculates a vector for each word based on
a neural network trained on large text corpora. Thus,
a text value is represented by an n-dimensional nu-
meric vector, a so-called embedding. This approach
allows performing calculations on the representations, e.g., king-man+woman = queen. Even if similarity metrics can be applied to measure the similarity between two vectors, the underlying semantic relation is not considered, i.e., in the example above it is not specified whether a vector represents a title (queen, king), a gender (man, woman), or cards in a game. One approach to add semantic knowledge to data is the use
of so-called context clusters (Rekatsinas et al., 2015),
which structure the semantics in a knowledge base.
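To make the surveyed notions concrete, the following sketch uses toy, hand-crafted 3-dimensional vectors (not real Word2Vec embeddings, which are learned from large corpora) to compute cosine similarity and reproduce the king-man+woman = queen analogy by nearest-neighbor search:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy embeddings with made-up dimensions (royalty, masculine, feminine).
emb = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.1],
}

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest neighbor by cosine similarity, excluding the query words.
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Note that the cosine score alone only says that two vectors point in similar directions; it cannot tell whether a shared direction encodes a title, a gender, or something else entirely, which is exactly the gap that context clusters aim to fill.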
SDRank: A Deep Learning Approach for Similarity Ranking of Data Sources to Support User-Centric Data Analysis