Moving towards a General Metadata Extraction Solution for Research

Data with State-of-the-Art Methods

Benedikt Heinrichs

and Marius Politze

IT Center, RWTH Aachen University, Seffenter Weg 23, Aachen, Germany

Keywords:

Research Data Management, Metadata Extraction, Metadata Generation, Linked Data.

Abstract:

Many research data management processes, especially those deﬁned by the FAIR Guiding Principles, rely on

metadata for making it ﬁndable and re-usable. Most Metadata workﬂows however require the researcher to

describe their data manually, a tedious process which is one of the reasons it is sometimes not done. Therefore,

automatic solutions have to be used in order to ensure the ﬁndability and re-usability. Current solutions only

focus and are effective on extracting metadata in single disciplines using domain knowledge. This paper aims,

therefore, at identifying the gaps in current metadata extraction processes and deﬁning a model for a general

extraction pipeline for research data. The results of implementing such a model are discussed and a proof-of-

concept is shown in the case of video-based data. This model is basis for future research as a testbed to build

and evaluate discipline-speciﬁc automatic metadata extraction workﬂows.

1 INTRODUCTION

Research Data Management (RDM) is a central ﬁeld

in today’s universities and is something every re-

searcher should come across. With current state-of-

the-art methods institutions are trying to up their ﬁeld

in providing good ways for researchers to make their

data ﬁndable, accessible, re-usable and accessible ac-

cording to the FAIR Guiding Principles described by

(Wilkinson et al., 2016). Recent works like the cre-

ation of a draft for an interoperability framework by

the EOSC, described by (Corcho et al., 2020), show

the interest and necessity of standards. They describe

the need of fulﬁlling the FAIR Guiding Principles

even when focusing on semantic interoperability for

describing data in a way that other machines can un-

derstand its purpose and meaning. Especially looking

at re-usability, there is therefore a clear need of hav-

ing a description of the content of data, so that people

or machines ﬁnding it know what they are looking at.

The current state on what such a person will ﬁnd is

however administrative things like when the data was

created, to which organization it is assigned or who

holds the rights to it as schemas like Dublin Core, de-

scribed by (DCMI Usage Board, 2002), are widely

used standards for it. While this information can be

https://orcid.org/0000-0003-3309-5985

https://orcid.org/0000-0003-3175-0659

very useful, it does not help in making the data search-

able or understandable on a deeper level. The only

way of ﬁguring out what it is about is by looking at the

data directly, contacting the creator or with some luck

stumbling across some documentation ﬁle or paper

that describes it. This further creates an issue when

looking at the research data-life-cycle research data

moves through shown in ﬁgure 1 and described by

(Schmitz and Politze, 2018). Since data moves with-

out trace between some of those phases it becomes

even more important to have a good representation of

the data whenever the opportunity presents itself for

having some chance to try and understand the path it

took. Therefore, there is a clear need for descriptive

research data that has a description of the content for

tackling these problems. Since this is furthermore a

very tedious or near impossible task to do manually,

automatic ways would greatly reduce the workload on

a researcher and improve the quality of research data.

However, research data is not uniformly, so while

domain-dependent automatic methods might work on

some research data, a general automatic method that

can include domain-dependent methods is necessary

in tackling the issues for every data type. This paper

therefore will focus on the general automatic creation

of so-called metadata that also describes the content

in hopes of being a start to solve the mentioned is-

sues.

Heinrichs, B. and Politze, M.

Moving towards a General Metadata Extraction Solution for Research Data with State-of-the-Art Methods.

DOI: 10.5220/0010129502270234

In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR, pages 227-234

ISBN: 978-989-758-474-9

227

Figure 1: Research data-life-cycle of the RWTH Aachen

University.

2 CURRENT STATE AND

RESEARCH GOAL

The creation of metadata is a signiﬁcant task to make

the described research data re-usable and ﬁndable.

Metadata can mean many things, however in this case

it is meant as the description of an entity in the re-

search data or the accumulation of any number of

such descriptions. In this paper, metadata is divided

into three distinct types: administrative, technical and

descriptive. The deﬁnitions are described by (Lubas

et al., 2013) and detailed in the following:

• Administrative Metadata.

Description of the administrative elements that are

necessary in determining e.g. who is responsible

for a certain data entity.

• Technical Metadata.

Description of the necessary information that de-

scribes how a data entity came about, where it is

currently located and how to use it.

• Descriptive Metadata.

Description of the elements that describe a data

entity directly. This means that these elements can

be ﬁle attributes like ﬁle size, ﬁle name and other

delivered properties like creation date or more de-

tailed information like creator and title. Further-

more, for this paper especially the content infor-

mation as metadata is signiﬁcant to be described

in logical triples like subject “Sample AS001”,

predicate “uses chemical” and object “Tetraethy-

lorthosilicate” which are derived from a data en-

tity and currently manually described in projects

like (Politze et al., 2019b).

Following the discussed problems, the metadata type

being focused on is descriptive metadata. However,

it can be noted that certain created information as de-

scriptive metadata can help the creation of adminis-

trative or technical metadata.

2.1 Current State on Metadata

Extraction

Current methods of metadata extraction more than

often focus on the researchers to describe their re-

search data manually as described in (Politze et al.,

2019a). Such values can be discipline-speciﬁc and

include what title a research project has, who has

the rights to it or what subject area it belongs to but

can also specify e.g. the microscope which was be-

ing used, depending on the schema. In (Jane Green-

berg, 2004) and (Mattmann and Zitting, 2011) the re-

searchers even create descriptive metadata automat-

ically based on the attributes of data as ﬁles or us-

ing META tags. This however leads to the content

of research data being largely ignored in the process.

Therefore, a lot of information that could be derived

by a description of the content as metadata is lost. For

this reason, researchers in single domains try to look

into the issue of extracting more descriptive metadata

from their research data by analyzing the content. In

(Burgess and Mattmann, 2014), (Rodrigo et al., 2018)

and (Grunzke et al., 2018) the researchers present

their success in their speciﬁc domains by creating spe-

ciﬁc models and generating descriptive metadata for

their use-case. The problem with this however is, that

this works as long as the research data being analyzed

is uniformly similar. In the broad context of research

data management this however is not the case, which

is why the case for general metadata extraction is be-

ing made, so that metadata can be extracted from ev-

ery research data entity.

2.1.1 Current State on Metadata Representation

Metadata can be represented in widely different ways

and be in different shapes or forms. However, looking

speciﬁcally at research data, a trend can be detected

in the usage of linked data and representing metadata

in the RDF (Resource Description Framework) for-

mat, described by (Cyganiak et al., 2014). Standards,

like Dublin Core described in (DCMI Usage Board,

2002) or EngMeta which is speciﬁcally created for

research data from the Engineering Sciences and de-

scribed in (Iglezakis and Schembera, 2019), are orga-

nized in this format, giving an additional beneﬁt for

using it. Furthermore, metadata in this format can be

seen as a knowledge graph that can be searched and

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

228

contains information about the represented data en-

tity. In this paper, metadata will be therefore looked

as represented by RDF.

2.2 Automatic Metadata Extraction

Since there are multiple methods metadata can be

extracted from different types of data, some general

methods are discussed in the following.

2.2.1 Generic Metadata Extraction

Generic metadata extraction is here currently under-

stood as the extraction of descriptive metadata based

on ﬁle attributes. This means the focus lies on ﬁelds

like creator, creation date and many more. One tool

for extracting this kind of metadata is called Apache

Tika. It was ﬁrst presented by (Mattmann and Zitting,

2011) and is a tool that can extract the attributes of

a ﬁle and represent them in a JSON format. Further-

more, given a text-based format, or a suitable adapter

for a format, it can extract the text content of a ﬁle.

The output however does not conform to the RDF for-

mat and has to be transformed to be represented as

linked data.

2.2.2 Metadata Extraction from Text

Since the representation of a ﬁle can be at many times

a text or at least understood as text, methods which

create metadata based on text are signiﬁcant and one

is called Pikes. It is a tool which was ﬁrst presented by

(Corcoglioniti et al., 2016) and uses natural language

processing techniques to convert text to a rich meta-

data representation in RDF as linked data. This rep-

resentation includes the content relation using NLP

(natural language processing) techniques, links from

entities to DBPedia described by (Lehmann et al.,

2015) and much more. Such metadata can make the

text understandable by machines and can be seen as a

representation of descriptive metadata which contains

the content information.

2.3 Text Extraction

Since a metadata extraction method that can create

the wanted descriptive metadata from text is estab-

lished, methods that represent other data types as text

are looked at to retrieve further descriptive metadata

from them. Research and work has already been done

on certain types and is presented in the following.

2.3.1 Image Text Extraction

Images, for example from microscopes, might con-

tain hints on the current state of the microscope or the

looked entity during an experiment. Therefore, it is

important to extract the text from such images. This

is not a new task and solutions exist for this prob-

lem. One solution is a tool called Tesseract which

was ﬁrst presented by (R. Smith, 2007) and provides

a so called OCR engine that can extract text from

images. Training this engine correctly can result in

a good method for this and further use cases. Fur-

thermore, Tesseract can be integrated in Apache Tika,

which was described in section 2.2.1, and be used to

get the text representation of an image ﬁle.

2.3.2 Audio Text Extraction

With a lot of focus on text-to-speech services, the

other speech-to-text way is also being explored

more and more. In (Kumar and Singh, 2019)

the researchers look at different models and repre-

sent the current state-of-the-art. Furthermore, with

cloud providers like Google (https://cloud.google.

com/speech-to-text) presenting the option, it was

never so simple to use such a method and work with

the results.

2.4 Challenges

The aim of this paper is to build a general workﬂow

to extract machine-readable metadata from the con-

tent of a research data entity. From the description of

the current state-of-the-art and this aim, the following

challenges could be detected:

• Manual metadata creation for every data entity is

not a feasible task

• Solutions for some domain-dependent metadata

extraction exist, but there is no solution which

tries to combine them

• There are many data formats where no speciﬁc

metadata extraction method exists

• Even if text can be represented as descriptive

metadata, the challenge still remains to see how

some data formats can be represented as text

• Generic metadata extraction solutions like

Apache Tika do not produce an output which is

represented as linked data

Moving towards a General Metadata Extraction Solution for Research Data with State-of-the-Art Methods

229

3 APPROACH

Based on the current state-of-the-art and identiﬁed

research gaps in this chapter a highly conﬁgurable

model is proposed for extracting descriptive meta-

data. The idea is that a pipeline is created which

receives a data entity, represented as any number of

ﬁles, as an input and stores the collected metadata

as the output directly in a metadata store. Since re-

search data in particular can be unique and many data

types are necessary to consider, even application spe-

ciﬁc data types, the model is intended to be as conﬁg-

urable and generic as possible so that custom extrac-

tors can be used to extend the pipeline with custom

data formats. This in particular tackles the issue of

previous research only tackling very speciﬁc use cases

by opening this method up to any data type imagin-

able. Furthermore, since a lot of data can be repre-

sented as text, a custom extraction method does only

have to provide the text representation and not even

deal with the metadata representation since a meta-

data from text extraction method can always be added

to the pipeline. However, if in any step metadata is

produced, it is important the corresponding RDF for-

mat is added as an output as well, so that it can be

processed in the last step. The proposed model goes

through the following steps:

1. The data entity starts with the generic extraction

step, extracting metadata based on ﬁle attributes

and text, if possible, by any number of conﬁg-

urable extractors.

2. Based on the data type any number of conﬁg-

urable MIME-Type-based extraction methods are

called for extracting metadata and a text represen-

tation of the input.

3. The text representation runs through any number

of conﬁgurable text-based metadata extractors for

getting the ﬁnal content-based metadata.

4. The produced metadata gets mapped to a deﬁned

format and veriﬁed for correctness (e.g. missing

entries).

A visualization of this model can be seen in ﬁgure 2

and the concrete steps are described in the following.

Figure 2: Data Entity Extraction Pipeline.

3.1 Generic Extraction

The generic extraction step is initially the most impor-

tant step in the pipeline, since this step shapes all fur-

ther steps. The requirements for a generic extraction

method are that it accepts as input a ﬁle, an identiﬁer

and a collection of conﬁguration values. Furthermore,

a generic extraction method has to fulﬁll multiple re-

quirements for the outputs it has to return, which are:

• Extract the descriptive metadata from a ﬁle using

the ﬁle attributes.

This means that elements which a ﬁle might al-

ready provide information for, like creator, cre-

ation date, ﬁle name and many more, are collected

and represented in a metadata schema.

• Detecting the MIME-Type of a ﬁle.

The MIME-Type of a ﬁle represents in what for-

mat the data is structured. Detecting this is es-

sential for continuing the process since it deter-

mines which MIME-Type-based extractors will

be called next.

• Extracting the text from text-based MIME-Types.

This is optional, however it greatly reduces com-

plexity in the next steps when the generic extrac-

tion method already extracts the textual elements

it ﬁnds from text-based MIME-Types.

3.2 MIME-Type-based Extraction

The speciﬁc custom implementations all follow the

same interface which requires them to register their

targeted MIME-Types to the pipeline. Depending on

if any of their targeted MIME-Types are represented

by the current data entity, they are executed. They re-

ceive as input a ﬁle, an identiﬁer and a collection of

conﬁguration values. The main goal of an extraction

method is either to extract text from the data by using

domain knowledge or extracting metadata directly.

As examples there can be text in images which can

be extracted, an image can be described with the dis-

played objects or audio which contains speech can be

converted to text by state-of-the-art methods. These

results, especially the text representation of the cer-

tain input type, are used and moved to the last step

and therefore that representation becomes even more

important to be extracted by the MIME-Type-based

extraction method.

3.3 Text-based Extraction

The ﬁnal extraction step is the text-based metadata

extraction. The idea here is that after the data en-

tity has passed through the other steps, a text repre-

sentation of some kind for the content exists, even if

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

230

the input was not textual data before. Therefore, ev-

ery text-based extraction method receives additionally

to the ﬁle, an identiﬁer and a collection of conﬁgura-

tion values also the currently extracted text. A lot of

research has been done on extracting metadata, espe-

cially as linked data, from text, therefore this property

can be exploited in this step. Such methods make use

of natural language processing and in general use of

the relations and grammar in a text. An implementa-

tion here should use the text and convert it to a meta-

data representation that describes what important in-

formation the content contains. An example can be to

retrieve the most important topics using topic extrac-

tion or describing the relations of certain nouns in the

text by utilizing the natural language grammar.

3.4 Metadata Representation

The ﬁnal resulting metadata has to be represented uni-

formly, so that any use can be drawn out of it by ma-

chines. The input for this ﬁnal step are a number of

metadata sets in any RDF format that ﬁrst have to be

merged into a singular metadata set. As previously

discussed, the metadata is represented as linked data

and therefore as triples which have a subject, predi-

cate and object. Every entry in the metadata is linked

to the subject that describes the incoming ﬁle. How-

ever, a ﬁlename is not a unique identiﬁer, so the model

requires an additional identiﬁer to be present, other-

wise a random identiﬁer will be generated. Using a

combination of some speciﬁed preﬁx and the iden-

tiﬁer, the subject is created. This creates a linked

data knowledge graph about the input data entity. It

is however not possible for every extraction method

to create metadata with the same quality or quantity

since different data types and extraction methods pro-

duce vastly different results. Therefore, it is necessary

for every extraction method to produce and describe

their representation of data following commonly ac-

cepted structures. Furthermore, certain ﬁelds can be

deﬁned in there as requirements for the metadata set,

so that some common ground can exist between cre-

ated metadata sets. An option for such a require-

ment enforcement is the Shapes Constraint Language

(SHACL) which was speciﬁed by (Knublauch and

Kontokostas, 2017). This language can be used to

form a so-called application proﬁle which deﬁnes the

requirements on the metadata set and with it, the re-

sulting graph can be veriﬁed. Since the graph might

not be in the needed structure, an implementation can

be provided to restructure and map the triples in the

graph to the needed ones. There are a couple of

entries from the Dublin Core schema which should

be always present like http://purl.org/dc/elements/1.

1/creator, http://purl.org/dc/elements/1.1/title or http:

//purl.org/dc/elements/1.1/identiﬁer so it is checked,

if they exist in the graph. Then furthermore, a conﬁg-

ured ruleset as application proﬁle can be taken and ap-

plied to the graph for verifying if the resulting meta-

data conforms to what is required. If this is true, the

metadata should be stored.

4 PRELIMINARY RESULTS

This section will discuss the ﬁrst implementation of

the proposed model as a proof-of-concept and the pre-

liminary results of it. The used tools for each step

will be discussed and the usage of them and modiﬁ-

cations on them will be explained. It is to be noted

that this will not cover every single implementation,

only the most general one and one concrete example

in the case of video ﬁles.

4.1 Current Concrete File Extraction

Pipeline

Figure 3: Concrete File Extraction Pipeline.

In ﬁgure 3 the concrete implementation of the pro-

posed model is shown. There are two concrete imple-

mentations listed which replace the proposed steps,

“Apache Tika” for “Generic Extractors” and “Pikes”

for “Text-Based Extraction” since both of them follow

the requirements. The MIME-Type-based Extraction

still remains open, since this ensures compatibility to

every data type.

4.2 Case Study - Video

Figure 4: Video File Extraction Pipeline.

As a proof-of-concept a video ﬁle is used which con-

tains text and spoken audio for showing the steps in

a realistic scenario. For choosing a video, research

Moving towards a General Metadata Extraction Solution for Research Data with State-of-the-Art Methods

231

results from a recent published work is being used.

The concrete video ﬁle being used is from the work

of (Heilmann et al., 2020) and is available as the sec-

ond supplement material at https://doi.org/10.23641/

asha.12456719.v1. The resulting pipeline of using a

video ﬁle can be seen in ﬁgure 4. The MIME-Type-

based Extraction step utilizes Tesseract and Speech

Recognition. Each step and what the output is, will

be discussed in the following.

4.2.1 Usage of Apache Tika

As described in section 2.2.1, Apache Tika can make

an ideal ﬁt for the generic extraction. How it performs

regarding the requirements is discussed in the follow-

ing:

• Extract the descriptive metadata from a ﬁle using

the ﬁle attributes.

When passing a ﬁle to Apache Tika, it lo-

cates all descriptive metadata a ﬁle has to of-

fer using the ﬁle attributes and returns them.

This is in this proof-of-concept information

like “dcterms:created” with its value “2014-05-

30T16:13:18Z”.

• Detecting the MIME-Type of a ﬁle.

Apache Tika returns the MIME-Type of a ﬁle,

even if it was not speciﬁed by the ﬁlename exten-

sion. In this proof-of-concept it is “video/mp4”.

• Extracting the text from text-based MIME-Types.

One of the return values from Apache Tika is the

content which represents the ﬁle as text. Further-

more, with the use of technologies like Tesser-

act, text can also be retrieved from images with

Apache Tika. Since for this proof-of-concept a

video ﬁle is being used, there is however no text

extracted.

The requirements are therefore all fulﬁlled. One thing

which however still needs to be done is to trans-

form the output. Apache Tika produces a JSON

structure which does not completely conform to the

RDF format. There are however some links which

are done by some keys that represent an ontol-

ogy like Dublin Core. For utilizing this, the val-

ues containing such a key are linked to the sub-

ject of the ﬁle directly as predicate (key) and object

(value). In this proof-of-concept one such triple can

look like: “ﬁle:{identiﬁer} dcterms:created ’2014-05-

30T16:13:18Z”’ where “ﬁle” is an example preﬁx.

A challenge is to convert every other key-value-pair

so that it can be described in the RDF format and

stored in a metadata store. One idea here for the

proof-of-concept is that the keys which do not con-

form to an ontology and are not URLs are added to a

constant preﬁx, e.g. “http://example.org/tika/{key}”.

This can then be used in the proof-of-concept with

a triple like: “ﬁle:{identiﬁer} tika:Content Type

’video/mp4”’. Further work still has to be done on

mapping the results from Apache Tika accurately to

an ontology and verifying the results using an applica-

tion proﬁle. After the conversion, a concrete metadata

entry can be pushed further to the metadata store.

4.2.2 Usage of Tesseract and Speech Recognition

It is a difﬁcult task to extract metadata from video

ﬁles normally, since the information can only be dis-

tinguished frame by frame, but even then there is the

issue of audio and images which need to be analyzed.

For this proof-of-concept, the model gets split into

an audio and image part so that the input can still

be processed. First the audio gets analyzed by spe-

ciﬁc audio ﬁle extractors which transform the audio

into a text representation of the spoken text by using

speech-to-text methods. This results in sentences be-

ing extracted like “This short video is an example of

[...]”. Next, the images of every frame are passed

to image ﬁle extractors which extract e.g. the text

shown in an image with Tesseract and describe the

objects shown in an image in a metadata representa-

tion. With respect to frames being similar over the

cause of time, the metadata and text representations

of the images are combined. In this proof-of-concept

this results in extracted text like “CONVERSATION -

SCHOOL AGE” as seen in ﬁgure 5 and metadata like

“ﬁle:{identiﬁer} foaf:depicts imageobject:clock” and

“imageobject:clock imageobject:count 1” where “im-

ageobject” is an example preﬁx and links to the frame

shown in ﬁgure 6.

Figure 5: Video frame with the text “‘CONVERSATION -

SCHOOL AGE”.

Figure 6: Video frame with a clock.

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

232

4.2.3 Usage of Pikes

As a text-based extraction method, the tool Pikes

is being used as described in section 2.2.2. The

accumulated text gets pushed to Pikes either all at

once, or depending on the number of sentences and

a pre-conﬁgured threshold per batch. The received

output contains the content relations, however also

includes a lot of grammar information and deﬁnitions

on where the words are placed in the text. Since

the interest is here on describing the content and

the value is low on the natural language semantics

and concrete position of something, the output

is ﬁltered to only include the content relations,

e.g. subject “http://pikes.fbk.eu/#child”, predicate

“http://dkm.fbk.eu/ontologies/knowledgestore#mod”

and object “attr:school-age”, and deﬁnitions for

certain entities e.g. for “http://pikes.fbk.eu/#math”

a DBPedia entry (“dbpedia:Mathematics”) exists

and is linked to it. Every relation is represented

by Pikes in fact graphs, however storing them as

numerous fact graphs and linking them to a ﬁle is

creating a large number of triples without much

beneﬁt. Therefore, the fact graphs are merged into

one fact graph with the preﬁx being a static one e.g.

“http://example.org/factgraph/” combined with the

ﬁle identiﬁer. This factgraph is then linked to the ﬁle,

completing text-based extraction.

4.2.4 Metadata Representation

In this proof-of-concept the created metadata does

not have to be further modiﬁed and can be validated

against an application proﬁle which requires certain

Dublin Core values to be present but is open to any

number of further triples. This succeeds and the meta-

data is stored. Therefore, the model allows describ-

ing information extracted from a video as structured

metadata that could be indexed by semantic search en-

gines.

5 CONCLUSION

This paper describes the need of a general way to ex-

tract descriptive metadata which works regardless of

the type and can be extended. As a starting point,

a pipeline is proposed that allows the extraction of

RDF-based descriptive metadata to represent a pro-

vided data entity. It offers extension points to become

agnostic to the representation of the information in

the data entity, ranging from textual data to audiovi-

sual representations. The proposed model provides

a solution in generalizing the ways metadata can be

extracted and clear requirements, necessary modules

and a structure for making it work are discussed. A

ﬁrst implementation of the model is shown with state-

of-the-art technology and the case of video ﬁles is

shown as a proof-of-concept.

5.1 Identiﬁed Research Gaps

The created pipeline was an experimental setup which

can now be tried in practice. This paper describes the

setting and provides the model as a ﬁrst way of deal-

ing with the current challenges. However, some fur-

ther gaps which need to be looked into were discov-

ered and are formulated as questions in the following:

1. Following the generic extraction methods, how

should the resulting metadata look like and how

should the results be represented or mapped?

2. How can such a proposed pipeline be evaluated?

3. There are many data types which such a proposed

pipeline could support, but which data types really

are the ones which are needed for strictly working

with research data?

4. The here used proof-of-concept extracts metadata

from a video ﬁle. During this process the pipeline

extracts images and extracts text and metadata

from them. How can such a video ﬁle be com-

pared to an image using the metadata and how can

it be detected that they are similar and one is stem-

ming from the other?

5. Since there is a large variety of data, how can a

mapping of the resulting metadata to an applica-

tion proﬁle be created?

5.2 Future Work

Building on this work, the extraction methods will be

enhanced so that more and more data types are sup-

ported. Deep learning could furthermore be used if a

model can be created which represents research data

as an input and metadata about it as an output. In

an analysis it shows that speciﬁcally source code and

numerical data are signiﬁcant aspects for current re-

search data and therefore should be included. Fur-

thermore, the nature of the created pipeline allows

for previous discipline-speciﬁc methods to be inte-

grated. Looking at the resulted metadata, methods to

map them more accurately to application proﬁles and

specifying a concept in which often occurring pred-

icates are mapped using speciﬁed rules to common

standards will greatly increase the quality of the out-

put. This can help to utilize the produced metadata in

other domains like checking with such a representa-

tion the similarity to another data entity without the

Moving towards a General Metadata Extraction Solution for Research Data with State-of-the-Art Methods

233

need of looking at the whole content. Furthermore,

reducing the produced metadata to only the most rele-

vant and uniquely representing parts will increase the

interpretability and comparability. Lastly, collecting

such metadata on research data for a research project

could be used to act as a knowledge graph of the

whole research to identify what the research is actu-

ally about and discovering the novelty of it.

REFERENCES

Burgess, A. B. and Mattmann, C. A. (2014). Automati-

cally classifying and interpreting polar datasets with

apache tika. In Joshi, J., editor, 2014 IEEE 15th Inter-

national Conference on Information Reuse and Inte-

gration (IRI), pages 863–867, Piscataway, NJ. IEEE.

Corcho, O., Eriksson, M., Kurowski, K., Ojster

sek,

M., Choirat, C., van de Sanden, M., and Cop-

pens, F. (2020). Eosc interoperability frame-

work (v1.0): Draft for community consultation.

https://www.eoscsecretariat.eu/sites/default/ﬁles/eosc-

interoperability-framework-v1.0.pdf.

Corcoglioniti, F., Rospocher, M., and Aprosio, A. P. (2016).

Frame-based ontology population with pikes. IEEE

Transactions on Knowledge and Data Engineering,

28(12):3261–3275.

Cyganiak, R., Lanthaler, M., and Wood, D. (2014). RDF

1.1 concepts and abstract syntax. W3C recommenda-

tion, W3C. http://www.w3.org/TR/2014/REC-rdf11-

concepts-20140225/.

DCMI Usage Board (2002). Dcmi metadata terms.

https://www.dublincore.org/speciﬁcations/dublin-

core/dcmi-terms/.

Grunzke, R., Hartmann, V., Jejkal, T., Kollai, H., Prabhune,

A., Herold, H., Deicke, A., Dressler, C., Dolhoff,

J., Stanek, J., Hoffmann, A., M

uller-Pfefferkorn, R.,

Schrade, T., Meinel, G., Herres-Pawlis, S., and Nagel,

W. E. (2018). The masi repository service — com-

prehensive, metadata-driven and multi-community re-

search data management. Future Generation Com-

puter Systems.

Heilmann, J., Tucci, A., Plante, E., and Miller, J. F. (2020).

Assessing functional language in school-aged chil-

dren using language sample analysis. Perspectives of

the ASHA Special Interest Groups, 5(3):622–636.

Iglezakis, D. and Schembera, B. (2019). EngMeta - a Meta-

data Scheme for the Engineering Sciences.

Jane Greenberg (2004). Metadata extraction and harvesting.

Journal of Internet Cataloging, 6(4):59–82.

Knublauch, H. and Kontokostas, D. (2017). Shapes

constraint language (SHACL). W3C recommenda-

tion, W3C. https://www.w3.org/TR/2017/REC-shacl-

20170720/.

Kumar, Y. and Singh, N. (2019). A comprehensive view

of automatic speech recognition system - a systematic

literature review. In 2019 International Conference on

Automation, Computational and Technology Manage-

ment (ICACTM), pages 168–173.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kon-

tokostas, D., Mendes, P. N., Hellmann, S., Morsey,

M., van Kleef, P., Auer, S., and Bizer, C. (2015). Db-

pedia – a large-scale, multilingual knowledge base ex-

tracted from wikipedia. Semantic Web, 6(2):167–195.

Lubas, R. L., Jackson, A. S., and Schneider, I. (2013). In-

troduction to metadata. In Lubas, R. L., Jackson,

A. S., and Schneider, I., editors, The Metadata Man-

ual, Chandos Information Professional Series, pages

1–15. Chandos Publishing.

Mattmann, C. and Zitting, J. (2011). Tika in action.

Politze, M., Bensberg, S., and M

uller, M. (op. 2019a).

Managing discipline-speciﬁc metadata within an inte-

grated research data management system. In Filipe, J.,

editor, Proceedings of the 21st International Confer-

ence on Enterprise Information Systems ICEIS 2019,

Heraklion, Crete - Greece, May 3 - 5, 2019, ICEIS

(Set

ubal), pages 253–260, [S. l.]. SciTePress.

Politze, M., Schwarz, A., Kirchmeyer, S., Claus, F., and

uller, M. S. (2019b). Kollaborative Forschungsun-

terst

utzung : Ein Integriertes Probenmanagement. In

E-Science-Tage 2019 : data to knowledge / heraus-

gegeben von Vincent Heuveline, Fabian Gebhart und

Nina Mohammadianbisheh, pages 58–67, Heidelberg.

E-Science-Tage, Heidelberg (Germany), 27 Mar 2019

- 29 Mar 2019, Universit

atsbibliothek Heidelberg.

R. Smith (2007). An overview of the tesseract ocr engine.

In Ninth International Conference on Document Anal-

ysis and Recognition (ICDAR 2007), pages 629–633.

Rodrigo, G. P., Henderson, M., Weber, G. H., Ophus, C.,

Antypas, K., and Ramakrishnan, L. (2018). Science-

search: Enabling search through automatic metadata

generation. In 2018 IEEE 14th International Confer-

ence on e-Science (e-Science), pages 93–104.

Schmitz, D. and Politze, M. (2018). Forschungsdaten man-

agen – bausteine f

ur eine dezentrale, forschungsnahe

unterst

utzung. o-bib. Das offene Bibliotheksjournal /

Herausgeber VDB, 5(3):76–91.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J.,

Appleton, G., Axton, M., Baak, A., Blomberg, N.,

Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E.,

Bouwman, J., Brookes, A. J., Clark, T., Crosas, M.,

Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T.,

Finkers, R., Gonzalez-Beltran, A., Gray, A. J. G.,

Groth, P., Goble, C., Grethe, J. S., Heringa, J., ’t Hoen,

P. A. C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher,

S. J., Martone, M. E., Mons, A., Packer, A. L., Pers-

son, B., Rocca-Serra, P., Roos, M., van Schaik, R.,

Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T.,

Strawn, G., Swertz, M. A., Thompson, M., van der

Lei, J., van Mulligen, E., Velterop, J., Waagmeester,

A., Wittenburg, P., Wolstencroft, K., Zhao, J., and

Mons, B. (2016). The fair guiding principles for sci-

entiﬁc data management and stewardship. Scientiﬁc

data, 3:160018.

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

234