6 CONCLUSION
In this paper, we introduced a novel task: recommending
scientific datasets. The task is to recommend datasets
similar to a target, which is either a query or another
dataset. For this task, we introduced several approaches
based on popular similarity methods, and we ran
experiments to evaluate these approaches on biomedical
datasets.
We can draw a number of lessons from our experiments:
1) The task of recommending similar datasets based only
on the meta-data of these datasets, and possibly a query,
is much harder than one might expect, with even the
best-performing methods rarely scoring higher than 80%.
2) The semantic, ontology-based methods are not capable
of solving this task, and the statistical methods from
machine learning or information retrieval far outperform
them. Even boosting the ontology-based methods with a
machine learning method did not give an acceptable
result. 3) The BM25-based approach performs well on
dataset recommendation from both a query target and a
dataset target, reaching 70% accuracy in our experiments.
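To illustrate the kind of BM25-based ranking referred to in lesson 3, the following minimal Python sketch scores the textual meta-data of candidate datasets against a target, which may be a query string or the meta-data of another dataset. It is a sketch under stated assumptions rather than the implementation evaluated in this paper: the function names, the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF variant, and the three toy meta-data strings are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system
    # would likely add stop-word removal and stemming.
    return text.lower().split()


def bm25_rank(target, corpus, k1=1.5, b=0.75):
    """Rank corpus entries by BM25 score against the target text.

    Returns a list of (index, score) pairs, best match first.
    k1 and b are common default values, assumed here.
    """
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many entries each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    target_terms = set(tokenize(target))

    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in target_terms:
            if term not in tf:
                continue
            # Smoothed IDF (always positive because of the "1 +").
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term-frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical meta-data descriptions of three datasets.
metadata = [
    "gene expression profiles of human breast cancer tissue",
    "rainfall measurements for european weather stations",
    "rna sequencing of mouse liver gene expression",
]
ranking = bm25_rank("breast cancer gene expression", metadata)
```

In this toy example the first description ranks highest, the third ranks second because it shares only two target terms, and the second receives a score of zero. For a dataset target, the same function can be called with that dataset's meta-data text as the target.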
In this paper, we have used only textual meta-data
for dataset recommendation. Given the high variability
in the syntax and semantics of the content of datasets
(ranging from gene sequences to geographical maps to
spreadsheets with financial data), it is nearly
impossible to use the dataset contents themselves.
Nevertheless, there are other signals that could be
considered for similar dataset recommendation. The
authors of datasets are one such signal: the co-author
network could be used for dataset recommendation if we
could match a dataset's authors to nodes in this network.
The Open Academic Graph (OAG) is a large and very popular
knowledge graph that unifies two separate billion-scale
academic graphs, MAG and AMiner (Microsoft, 2021; Sinha
et al., 2015; Tang et al., 2008). In future work, we will
investigate whether this resource can be exploited to
improve the task of recommending similar datasets that we
defined in this paper.
ACKNOWLEDGEMENT
This work has been funded by the Netherlands Science
Foundation NWO under grant nr. 652.001.002, which is also
partially funded by Elsevier. The first author is funded
by the China Scholarship Council (CSC) under grant number
201807730060.
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet allocation. Journal of Machine Learning
Research, 3(Jan):993–1022.
Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G.,
Ibáñez, L.-D., Kacprzak, E., and Groth, P. (2020).
Dataset search: a survey. The VLDB Journal,
29(1):251–272.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding.
Ellefi, M. B., Bellahsene, Z., Dietze, S., and Todorov, K.
(2016). Dataset recommendation for data linking: An
intensional approach. In ESWC 2016, pages 36–51.
Springer.
Google Developers (2021). Dataset | Google Search
Central.
https://developers.google.com/search/docs/data-types/dataset.
Kunze, S. R. and Auer, S. (2013). Dataset retrieval. In 7th
IEEE Int. Conf. on Semantic Computing, ICSC ’13,
pages 1–8.
Le, Q. V. and Mikolov, T. (2014). Distributed representa-
tions of sentences and documents. In ICML.
Leme, L. A. P. P., Lopes, G. R., Nunes, B. P., Casanova,
M. A., and Dietze, S. (2013). Identifying candidate
datasets for data interlinking. In Web Engineering,
pages 354–366. Springer.
Microsoft (2021). Open Academic Graph - Microsoft
Research.
https://www.microsoft.com/en-us/research/project/open-academic-graph/.
Patra, B. G., Roberts, K., and Wu, H. (2020). A
content-based dataset recommendation system for re-
searchers—a case study on Gene Expression Omnibus
(GEO) repository. Database, 2020.
Resnik, P. (1995). Using information content to evaluate
semantic similarity in a taxonomy. In IJCAI, IJCAI’95,
pages 448–453.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu,
M. M., and Gatford, M. (1995). Okapi at TREC-3.
In Overview of the Third Text REtrieval Conference
(TREC-3), pages 109–126.
Singhal, A., Kasturi, R., Sivakumar, V., and Srivastava, J.
(2013). Leveraging web intelligence for finding inter-
esting research datasets. In WI-IAT 2013, pages 321–
328. IEEE.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-
J. P., and Wang, K. (2015). An overview of Microsoft
Academic Service (MAS) and applications. In WWW
Conference, pages 243–246. ACM.
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z.
(2008). ArnetMiner: Extraction and mining of academic
social networks. In SIGKDD, pages 990–998.
ACM.
Wang, X., Huang, Z., and van Harmelen, F. (2020). Evalu-
ating similarity measures for dataset search. In WISE
2020, pages 38–51. Springer.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical
selection. In 32nd Annual Meeting of the Association
for Computational Linguistics, pages 133–138.
Biomedical Dataset Recommendation