5.1 Experimental Setup
Dataset.
The datasets we used in our experiments come from Mendeley Data⁷. We chose 960 datasets from Mendeley Data, each of which is associated with a published paper in a known journal. The distribution of research domains across these datasets is as follows:
• 60 datasets from the biomedical domain.
• 33 datasets from the computer science domain.
• 180 datasets from the physics domain.
• 683 datasets from the finance domain.
• 4 datasets from the environment domain.
The URIs of all of these datasets are made available by us⁸.
We chose these 960 datasets for two reasons. First, all of them are retrieved from Mendeley, which means they are scientific datasets actually shared by scientists; second, they are all annotated with a link to an associated paper in their metadata, which means we can retrieve the gold-standard label through the link to the journal of the paper associated with the dataset. The first reason ensures the ecological validity of our benchmark; the second ensures that we have a gold standard to evaluate our results against.
On inspection of the 960 datasets in our gold standard, we find that the distribution of domains is strongly biased, with roughly 70% of the datasets labelled as "finance" (683 of 960). To compensate for this, we add a balanced-distribution experiment to check whether this bias influences the experimental results. In the balanced-distribution experiment, we choose 217 datasets (60 from biomedical, 60 from physics, 60 from finance, 33 from computer science and 4 from environment).
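To make the skew and the balanced subset concrete, the following sketch shows how such a stratified sample could be drawn from the 960 records; the field name "subjectAreas" and the cap of 60 datasets per domain follow Figure 1 and the numbers above, while the helper name and the sampling seed are our own illustrative choices, not part of the original pipeline.

import random
from collections import Counter

def balanced_subset(records, per_domain_cap=60, seed=0):
    """Keep at most `per_domain_cap` datasets per domain (fewer if a domain
    is smaller), reproducing the 217-dataset balanced distribution:
    60 + 60 + 60 + 33 + 4."""
    rng = random.Random(seed)
    by_domain = {}
    for rec in records:
        by_domain.setdefault(rec["subjectAreas"], []).append(rec)
    subset = []
    for recs in by_domain.values():
        rng.shuffle(recs)
        subset.extend(recs[:per_domain_cap])
    return subset

# The full collection is heavily skewed towards finance:
counts = Counter({"Finance": 683, "Physics": 180, "Biomedical": 60,
                  "Computer Science": 33, "Environment": 4})
print(counts["Finance"] / sum(counts.values()))   # 683 / 960 ≈ 0.71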
id: 109531211225411812111911998:MENDELEY DATA,
title: "Data for: Distribution network prices and solar PV: Resolving rate ...",
description: "Abstract of associated article: 1-in-4 detached households in ...",
subjectAreas: Finance,
Keywords_CSO: [article, household, rate, ...],
Keywords_Physics: [article, rate, distribution, ...],
Keywords_FINANCE: [article, household, rate, ...],
Keywords_Envo: [article, solar, rate, distribution, ...],
Keywords_Bio: [signal recognition particle 7s rna, ...],
Keywords: [rate, tariff, household, network, ...],
dataset_url: https://data.mendeley.com/datasets/bwwyv6zy5m,
DOI: 10.17632/bwwyv6zy5m.1,
licence: "CC BY NC 3.0"

Figure 1: Meta-data of a Mendeley dataset in JSON.
⁷ https://data.mendeley.com/
⁸ https://github.com/eva01wx/WISEClassifiactionPaper_Datasets
We give an example of the meta-data of a Mendeley dataset in Figure 1. The Mendeley collection contains descriptive metadata (id, title, description, etc.) and administrative metadata (licence). The metadata field "id" is the unique identifier used to index the dataset. The metadata fields "title" and "description" describe the content and usage of the dataset; we use these to extract keywords for the datasets and to compute the ontology-specific views. The metadata field "extractedKeywords" is the set of keywords of a dataset given in Mendeley. We compute five additional ontology-specific "extractedKeywords" fields, such as "extractedKeywords_CSO", which is the ontology-specific view of the computer science ontology CSO for this dataset, and similarly for the other ontologies. The metadata field "dataset_url" is the URL linking to the Mendeley Data search engine; through this URL, one can find the description of the dataset (such as the title, the associated paper, etc.).
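To make this structure concrete, the sketch below encodes the example record from Figure 1 as valid JSON and separates the descriptive from the administrative fields. The values are abbreviated and the key spellings follow the running text ("extractedKeywords", "extractedKeywords_CSO", ...), which differ slightly from the abbreviated labels in the figure, so this is illustrative rather than a faithful dump of the Mendeley export.

import json

record = json.loads("""
{
  "title": "Data for: Distribution network prices and solar PV: Resolving rate ...",
  "description": "Abstract of associated article: 1-in-4 detached households in ...",
  "subjectAreas": "Finance",
  "extractedKeywords": ["rate", "tariff", "household", "network"],
  "extractedKeywords_CSO": ["article", "household", "rate"],
  "dataset_url": "https://data.mendeley.com/datasets/bwwyv6zy5m",
  "DOI": "10.17632/bwwyv6zy5m.1",
  "licence": "CC BY NC 3.0"
}
""")

# Descriptive metadata identifies and describes the dataset ...
descriptive = {k: record[k] for k in ("title", "description", "subjectAreas")}
# ... while administrative metadata records the terms under which it may be used.
administrative = {"licence": record["licence"]}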
We only use the title and description of a dataset for our classification task, without considering any other information from the dataset itself. This is because we treat the dataset itself as a "black box" from which no information can be obtained except for the title and description. Many scientific datasets have highly specialised data formats (gene sequences, images, geo-coordinates, etc.), and these are not suitable for extracting information in a general-purpose search engine. We therefore chose the hardest possible case, namely assuming that no information can be gained from the dataset itself, and that all we have are the human-readable descriptions.
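A minimal sketch of that restriction is given below: the classifier input is built from the title and description strings alone. The keyword extraction shown here (lowercased tokens minus stopwords) is a simplified stand-in for whatever extractor is actually used, not the paper's exact pipeline.

import re

STOPWORDS = {"of", "the", "a", "an", "for", "and", "in", "to", "on", "with"}

def dataset_text(record):
    """Only the human-readable metadata is used; the dataset files themselves
    are never opened."""
    return "{}. {}".format(record.get("title", ""), record.get("description", ""))

def candidate_keywords(text):
    """Naive keyword extraction: lowercase alphabetic tokens minus stopwords."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

example = {"title": "Data for: Distribution network prices and solar PV",
           "description": "1-in-4 detached households ..."}
print(candidate_keywords(dataset_text(example)))
# ['data', 'distribution', 'network', 'prices', 'solar', 'pv', 'detached', 'households']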
Ontology Candidates.
We use five ontology candidates from five domains for our ontology selection task: FIBO⁹ (Finance), UMLS¹⁰ (Biomedical), CSO¹¹ (Computer Science), ENVO¹² (Environment) and OPB¹³+physics¹⁴ (Physics). We chose these ontology candidates because they are the richest or most popular ontologies in their domain. For the physics domain, since no existing ontology covers most concepts in the domain, we combined the physics-for-biology ontology with a physics-for-astronomy ontology.
⁹ https://spec.edmcouncil.org/fibo/
¹⁰ https://www.nlm.nih.gov/research/umls/
¹¹ https://cso.kmi.open.ac.uk/home
¹² http://environmentontology.org/
¹³ https://sites.google.com/site/semanticsofbiologicalprocesses/projects/the-ontology-of-physics-for-biology-opb
¹⁴ http://www.astro.umd.edu/~eshaya/astro-onto/owl/physics.owl
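As an illustration of how these ontology candidates could be loaded and how the two physics ontologies can be merged into a single concept-label set, the sketch below uses rdflib; the local file names, the RDF/XML assumption and the choice of rdfs:label as the concept-name property are assumptions on our part, and some candidates (notably UMLS) are distributed in formats that need their own loaders.

from rdflib import Graph
from rdflib.namespace import RDFS

def concept_labels(*rdf_files):
    """Collect the rdfs:label strings from one or more OWL/RDF files."""
    labels = set()
    for path in rdf_files:
        g = Graph()
        g.parse(path, format="xml")   # assuming an RDF/XML serialisation
        for _, _, label in g.triples((None, RDFS.label, None)):
            labels.add(str(label).lower())
    return labels

# Hypothetical local copies of the ontology files.
ontology_views = {
    "Finance": concept_labels("fibo.owl"),
    "Computer Science": concept_labels("cso.owl"),
    "Environment": concept_labels("envo.owl"),
    # No single ontology covers physics, so the physics view is the union of
    # the physics-for-biology ontology (OPB) and the astronomy physics ontology.
    "Physics": concept_labels("opb.owl", "physics.owl"),
    # UMLS (Biomedical) ships as relational files rather than OWL and would
    # need its own loader, so it is omitted from this sketch.
}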