A Scalable Framework for Dynamic Data Citation of Arbitrary
Structured Data
Stefan Pröll¹,² and Andreas Rauber¹
¹SBA Research, Vienna, Austria
²Institute of Software Technology and Interactive Systems, Technical University of Vienna, Vienna, Austria
Keywords:
Research Data Citation, Subset Versioning, Subset Authenticity, Data Sharing.
Abstract:
Sharing research data is becoming increasingly important as it enables peers to validate and reproduce data driven experiments. Exchanging data also allows scientists to reuse data in different contexts and gather new knowledge from available sources. But with increasing volumes of data, researchers need to reference exact versions of datasets. Until now, access to research data has often been based on single archives of data files where versioning and subsetting support is limited. In this paper we introduce a mechanism that allows researchers to create versioned subsets of research data which can be cited and shared in a lightweight manner. We demonstrate a prototype that supports researchers in creating subsets based on filtering and sorting source data. These subsets can be cited for later reference and reuse. The system produces evidence, based on cryptographic hashing, that allows users to verify the correctness and completeness of a subset. We describe a replication scenario for enabling scalable data citation in dynamic contexts.
1 INTRODUCTION
Having access to data sources is crucial for our society, and recently many initiatives have been promoting the use, reuse and sharing of data as well as open access to it. Data sharing portals (e.g. http://www.figshare.com/) gain popularity and public institutions no longer hide and protect their valuable data. In the scientific domain we also see strong encouragement to share data and provide access to data sources, which enables peers to reuse data in completely new contexts.
Research is an iterative process that requires re-running experiments with different input data in order to verify results. Researchers often have to modify their datasets; they need to track different versions and have access to them on demand. Therefore scientists require a mechanism that allows them to work with dynamic datasets while still being able to reference any version for later reinspection.
Sharing data without providing additional authenticity and provenance metadata for its correctness is insufficient. We need to be sure that the data is accurate, unchanged, not manipulated and complete. Evidence that allows evaluating whether a dataset is available in the correct version, is complete and has not been manipulated is therefore essential.
In this paper we present a framework that allows
researchers to create, reference and cite subsets of
dynamically changing data without the need for full
sized data exports. We address solutions for a range
of disciplines which do not yet have the tradition of
using sophisticated data management tools such as
large scale structured databases. Our framework for
dynamic data citation allows researchers to generate
and retrieve secure evidence for the integrity of their smaller-scale datasets stored in widespread data formats such as CSV files.
The remainder of this paper is structured as fol-
lows. Section 2 describes the problem of dynamic
data citation. Section 3 provides an overview of ex-
isting work in the area and shows its application on
evolving databases which our work uses as a ba-
sis. Section 4 provides a sample use case. Sec-
tion 5 provides an overview of the server side im-
plementation and Section 6 describes a novel hash-
ing scheme that we use for result set identification
and integrity verification. Furthermore, we discuss
security improvements of our approach with regards
to tamper-resistance of datasets. Section 7 introduces the client component which allows researchers to assemble, store, reference and cite datasets and share them with peers. Section 8 covers the query
store which serves as the repository for the queries
that are used to retrieve the datasets. The paper closes
with a conclusion and a future outlook in Section 9.
2 MOTIVATION
Sharing data and referencing it is an important step
towards a more open scientific community that can actually reproduce the experiments of peers instead of relying on the paper-based publication alone. Classi-
cal paper publications without available data are in
many cases not sufficient to reproduce experiments.
Many journals now require data audits, data deposits
and data descriptors for their submissions. For this
reason researchers need tools that allow them to trace different versions of their datasets. Data citation deals with the question of how data can be identified and thus strongly supports the culture of sharing data between scientists by allowing them to precisely identify, reference, cite and credit their research data.
Researchers use various kinds of data formats
as input for their experiments, e.g. CSV, JSON,
XML, RDF and various office software formats. During their research, scientists often need to adapt their datasets to their requirements: they might need to update or delete values and create new subsets of their data. Each of these iterations results in a new version of the dataset. For text-based data files, many researchers do not use any version control software. Recently, source code versioning systems such as Subversion (https://subversion.apache.org/) or Git (https://www.git-scm.com/) have gained popularity outside computer science departments, but the final dataset which eventually gets published in a repository no longer contains any previous versions.
Additionally datasets are mainly published as archive
files on institutional Web sites where a hash string
serves as a digital fingerprint. If datasets are large
and consume considerable amounts of storage space, providing multiple versions with the appropriate history metadata places a burden on the data provider. Therefore, many institutions only provide the latest version of a dataset, without access to any previous versions.
The information about earlier versions is valuable
for several reasons. First of all, versions contribute
to the provenance track of a dataset and therefore are
evidence of how a dataset was obtained. Even if sev-
eral versions of the datasets are available, without the
appropriate metadata it is not possible to understand
how they have been created. This is especially true for
scientific disciplines that lack large-scale database management systems which would allow tracing the creation of a dataset. Secondly, the pro-
cess of creating a dataset consumes resources in the
form of time and money. Thus preserving informa-
tion about previous datasets is economically reason-
able and other peers would benefit from the historical
data. Thirdly, the attribution of datasets to their au-
thors is not possible without referencing a precisely
defined dataset.
Citing subsets of datasets is still a challenge, es-
pecially when the source data is still evolving. What
is needed is a solution that allows researchers to
cite arbitrary subsets of their research datasets and
share them with peers without the need to copy large
datasets.
We developed a prototype that supports scientists
in referencing, citing and sharing their datasets in a
lightweight fashion. In the scientific domain there exists a huge variety of data formats used for specialised research in diverse domains, and the focus of research is often directed towards highly specialised formats used in big data processing or in the structured databases of large scale scientific facilities such as CERN (http://home.web.cern.ch/about/computing) or STFC (http://pan-data.eu/STFC). But many disciplines still rely on simpler solutions such as CSV (comma-separated values) (Shafranovich, 2005). Although the format was originally only used for small scale datasets, larger datasets in this format now also need to support proper data citation (Bertin-Mahieux et al., 2011).
CSV data has recently received more attention (e.g. by the W3C, http://www.w3.org/2013/csvw) as it is a very clear and simple data format that is widely used in different domains. There also exist many tools that provide export and import facilities for transforming the data into other, possibly more complex formats. Therefore we chose CSV data as the reference data type for our prototypes, but our methods can be applied to any structured data format. The system that we propose in this paper can be used to create citable subsets of versioned CSV data by storing only the queries that were used to assemble a subset. In order to preserve the knowledge of how these datasets have been constructed, we need to collect dataset metadata.
Independently of the actual data format used, there are only two fundamental operations that researchers need to apply in order to compile their datasets: they select the attributes of a dataset which they are interested in (projection) and they filter the rows by some criteria (selection).
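In relational algebra terms (an illustrative example; the attribute names and the filter value are hypothetical), such a subset can be written as

$$\pi_{\mathrm{track\_id},\,\mathrm{tempo}}\big(\sigma_{\mathrm{year}=2011}(D)\big)$$

where $D$ denotes the source dataset, $\sigma$ the selection and $\pi$ the projection. In our prototype such operations are translated into the SELECT and WHERE clauses of an SQL query.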
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
224
For tracing the creation of a dataset, we need to
store the filtering and sorting operations that were
used in order to generate the dataset. If we apply these operations again on the versioned data, we obtain, ceteris paribus, the same result set again. We
provide mechanisms that ensure dataset integrity and
dataset stability, i.e. stable sorting of the dataset for
further processing.
In order to create and use such datasets, we require
basic data definition and manipulation languages.
Due to the compatibility of relational SQL databases
and CSV files, we can use the powerful features of
modern relational database management systems in
order to manage the data.
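As a rough illustration of this compatibility, the following sketch loads a small CSV snippet into a relational table and retrieves a subset with ordinary SQL (it uses an in-memory SQLite database for brevity instead of the MySQL setup of our prototype; all table and column names are made up for the example):

```python
import csv
import sqlite3
from io import StringIO

# Toy CSV data standing in for an uploaded research dataset.
RAW = """track_id,artist,year,tempo
TR001,Artist A,2009,120.3
TR002,Artist B,2011,98.7
"""

rows = list(csv.DictReader(StringIO(RAW)))
columns = list(rows[0].keys())

conn = sqlite3.connect(":memory:")
# Data definition: create a table matching the CSV header.
conn.execute("CREATE TABLE songs (%s)" % ", ".join("%s TEXT" % c for c in columns))
# Data manipulation: bulk insert the parsed CSV records.
conn.executemany(
    "INSERT INTO songs VALUES (%s)" % ", ".join("?" for _ in columns),
    [tuple(r[c] for c in columns) for r in rows],
)

# A subset expressed as projection (SELECT clause) and selection (WHERE clause).
subset = conn.execute(
    "SELECT track_id, tempo FROM songs WHERE year = ? ORDER BY track_id", ("2011",)
).fetchall()
print(subset)  # [('TR002', '98.7')]
```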
3 RELATED WORK
Data citation is an urgent topic (Lawrence et al.,
2011), yet the citation landscape is fragmented (Par-
sons et al., 2010) and various approaches exist
(CODATA-ICSTI, 2013). In February 2014 the Data Citation Synthesis Group published its Joint Declaration of Data Citation Principles (http://www.force11.org/datacitation), which addresses eight
core concepts for accessible research data. The docu-
ment highlights the importance of data as a first class
research object which requires the same attention with
regards to citations as paper publications (principle
1). Not only does data citation stimulate correct attribution among peers (principle 2); even more importantly, it serves as evidence (principle 3). When-
ever a statement within a publication is based upon the
foundation of data, the corresponding dataset needs
to be referenced. Obviously each dataset requires a
unique identifier (principle 4) that allows resolving a
dataset and its accompanying materials such as meta-
data (principle 5). These identifiers need to be avail-
able for the long term, i.e. persistent (principle 6).
Data citation needs to facilitate means for researchers
to assess the data’s provenance and fixity (principle
7). The guidelines encourage an interoperable design
that can be applied across research domain and com-
munity boundaries.
The authors of (Pröll and Rauber, 2013a) propose
a model for citing evolving data from SQL databases
which is based on time stamp annotated queries and
versioned data. The authors describe a query cen-
tric citation approach that augments SQL queries with
timing information, which can then be utilised in or-
der to retrieve the same data again at a later point
in time. Also they present a generalised model for
rendering dynamic data citable and define basic re-
quirements for citable subsets. In (Pröll and Rauber,
2013b) the same authors applied their model on a
use case and described a reference implementation.
They propose several implementation strategies and
describe how an existing relational database schema
can be extended to support data citation. They dis-
cuss the advantages and disadvantages of several im-
plementation variations and show how previous ver-
sions data can be retrieved from dynamically chang-
ing source data. Our work is based on these two pa-
pers and extends the model by applying a chained
hashing approach which ensures data integrity of sub-
sets. Furthermore, we apply the model to a client
server infrastructure that enables researchers to create
and reference arbitrary subsets from flat data files.
Fingerprinting and watermarking relational databases are common techniques to detect tampering (Li et al., 2005). The authors of (Narasimha and
Tsudik, 2006) describe a method for validating the
completeness and correctness of queries. In our
work, we use a similar method applying a row based
hash on the result set returned from a query.
4 A SCIENTIFIC USE CASE FOR
CITABLE CSV FILES: SYSTEM
OVERVIEW
In order to implement dynamic data citation for CSV
files we use a client-server approach. The server com-
ponent (see Section 5) is responsible for handling
the data, managing the metadata and ensuring the in-
tegrity of the raw data. The server component is also
responsible for interacting with data citation queries
that return previously created subsets with the appro-
priate evidence of authenticity. The client (see Sec-
tion 7) serves as frontend and allows users to assemble
datasets. The prototype that we developed supports
researchers in managing their datasets during the fol-
lowing exemplified experiment based on the Million
Song dataset (Bertin-Mahieux et al., 2011).
4.1 Use Case Description
Music classification is a widely used method which
has applications in many areas such as genre or style
classification, recommender services, playlist genera-
tion etc. These systems are based on feature extrac-
tion from a potentially large set of audio files. In
order to train the machine learning algorithms, spe-
cific sets of audio files and their features are required.
For interoperability reasons, many tools from that do-
main support the CSV file format and store the fea-
AScalableFrameworkforDynamicDataCitationofArbitraryStructuredData
225
tures in this simple textual representation (Hall et al.,
2009). Whenever the audio files which serve as input
for such experiments change, e.g. due to an increased
fidelity, detected errors in source files or newly avail-
able material, the features need to be extracted again and the datasets need to be updated. Researchers have to be able to differentiate between these versions in order to analyse the effects of the updated source mate-
rial on their results. Therefore precise citation which considers subsets and different versions is an essential requirement. Also, researchers need tools which
allow them to exchange and share data without hav-
ing to deal with the overhead of managing potentially
large file dumps of different versions of several sub-
sets which they need for their work.
4.2 Supporting Automated Dynamic
Data Citation
The prototype we developed allows researchers to up-
load their CSV data (e.g. from a feature extraction application) to a Web service. The server component ingests the data, automatically transforms it into a relational database model and bulk inserts it, see Section 5. During the ingestion phase, the database server annotates every new record with a time stamp. As each song has a unique identifier, updates of existing song data can be detected. In this case, the server adds additional versioning metadata. No data gets overwritten or deleted; markers are used to indicate the version number and record status.
Our prototype provides a frontend which allows
researchers to select specific subsets based on their
personal requirements, sortings and filters. The fron-
tend submits all selection and filtering operations to
the backend, which records them in a sequential man-
ner. The server adds metadata to the query, such as query execution time and sorting sequences. This
mechanism allows retrieving a specific version of any
dataset at a later point in time. Instead of being based
on specific database log file formats, this metadata re-
mains human readable and can be utilised in any other
database system in a similar fashion.
After a researcher has created a dataset, he confirms the new subset in the system. The server then
iterates over the result set and computes a hash value
as evidence for the integrity of the subset. The query
is stored and a persistent identifier is attached to the
dataset. This persistent identifier serves as a handle
which can be shared with other peers and be used in
publications. A resolver service may point users who
enter the PID to the specific version of a dataset where
it can be retrieved. Landing pages can provide addi-
tional information about the dataset such as previous
versions, query text and filter terms. As the system is
aware of updates and evolving data, researchers have
transparent access to specific versions. There is no need to store multiple versions of a dataset externally for the long term, as the system can reproduce
them on demand. As hashing methods are in place,
the integrity of the datasets can be verified.
5 SCALABLE BACKEND FOR
DATA CITATION OF DATA
The prototype we developed provides a Web service
for uploading data. Although many RDBMS natively
support importing CSV files, we used a CSV parser
library in order to analyse the data files and perform
data cleansing, header data generation and escaping
of special keywords and characters. As the content of the files is not known in advance, the database schema is generated on the fly from the column metadata. In this simple scenario we utilise VARCHAR
fields with the longest encountered field length that
is gathered during the upload process. Future ver-
sions may support more specific column data types
in order to increase search and indexing performance
of the system. The server automatically deploys the
table schema and appends columns for maintenance
metadata such as the sequence in which the data was
inserted and a timestamp. After the Web service has
populated the database, the researcher can utilise the
frontend we propose in Section 7. The server’s API
currently supports sorting of arbitrary columns of the
dataset in either ascending or descending order. As we have currently only implemented support for CSV data, only text based filtering is available. More complex filtering options will follow in future versions of the prototype.
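The following sketch illustrates this kind of schema generation (simplified; the table name and the MySQL-flavoured DDL are only illustrative assumptions): the longest value encountered per column determines the VARCHAR width, and maintenance columns for the insertion sequence and a timestamp are appended.

```python
import csv
from io import StringIO

RAW = """track_id,artist,tempo
TR001,Artist A,120.3
TR002,Another Artist,98.7
"""

rows = list(csv.DictReader(StringIO(RAW)))
header = list(rows[0].keys())

# Infer a VARCHAR width per column from the longest encountered value.
widths = {c: max(len(r[c]) for r in rows) for c in header}

column_defs = ["`%s` VARCHAR(%d)" % (c, widths[c]) for c in header]
# Maintenance metadata appended by the server: insertion sequence and timestamp.
column_defs += [
    "`seq` BIGINT AUTO_INCREMENT PRIMARY KEY",
    "`inserted_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP",
]

ddl = "CREATE TABLE `uploaded_csv` (\n  %s\n);" % ",\n  ".join(column_defs)
print(ddl)
```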
The goal of our data citation approach is to reduce the overhead of data citation to a minimum for data providers and hide it transparently from researchers. We use a second database instance, denoted
DCDB that replicates the primary database denoted
DB. The DCDB server implements the data citation
functionality. Hence the database DB only contains
the latest version of the records whereas the replicated
database DCDB is used for managing historical data.
Introducing a separate data citation system has sev-
eral advantages. Firstly it is possible to introduce data
citation without interfering with the primary database
DB. All additional metadata that is required in order
to facilitate data citation can be moved to the DCDB
database instance. The replicated server DCDB im-
plements triggers which react to updates or deletes.
Any operation that alters the original table is reflected
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
226
in history tables on the replicated server. Figure 1
shows an overview of the setup.
Figure 1: Replication Scheme.
The user interacts with database DB which han-
dles the current state of the data. Whenever a researcher creates a new subset, the query which was used is stored together with the timestamp of the
query execution time in the query store. All table
operations on DB are automatically replicated to the
data citation database instance DCDB and annotated
with versioning information. If records are updated
or deleted, these changes are immediately visible at
the DB server and replicated to the DCDB server in-
stance, which maintains the historical data.
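The following self-contained sketch illustrates the history-keeping idea with triggers (for brevity it uses SQLite triggers inside a single database, whereas our prototype relies on MySQL replication with triggers on the separate DCDB instance; all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE songs (track_id TEXT PRIMARY KEY, tempo TEXT);

-- History table with versioning metadata maintained by triggers.
CREATE TABLE songs_history (
    track_id TEXT, tempo TEXT,
    event TEXT,                                 -- INSERTED / UPDATED / DELETED
    valid_from TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER songs_ins AFTER INSERT ON songs BEGIN
    INSERT INTO songs_history (track_id, tempo, event)
    VALUES (NEW.track_id, NEW.tempo, 'INSERTED');
END;
CREATE TRIGGER songs_upd AFTER UPDATE ON songs BEGIN
    INSERT INTO songs_history (track_id, tempo, event)
    VALUES (NEW.track_id, NEW.tempo, 'UPDATED');
END;
CREATE TRIGGER songs_del AFTER DELETE ON songs BEGIN
    INSERT INTO songs_history (track_id, tempo, event)
    VALUES (OLD.track_id, OLD.tempo, 'DELETED');
END;
""")

conn.execute("INSERT INTO songs VALUES ('TR001', '120.3')")
conn.execute("UPDATE songs SET tempo = '121.0' WHERE track_id = 'TR001'")
conn.execute("DELETE FROM songs WHERE track_id = 'TR001'")

# The live table is now empty, but the full history remains queryable.
print(conn.execute("SELECT track_id, tempo, event FROM songs_history").fetchall())
# [('TR001', '120.3', 'INSERTED'), ('TR001', '121.0', 'UPDATED'), ('TR001', '121.0', 'DELETED')]
```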
For retrieving a specific dataset, the persistent identifier allows retrieving the query again, which is then issued against the historical data. The data citation capabilities in this setup do not place any burden on the data server DB, as the queries that gather historic data are only issued against the data citation server. Advantages of this scenario are horizontal scalability, which allows introducing several data citation instances that operate on the same data, and the possibility to implement different replication strategies. Delayed replication can be used, for instance, to shift the replication to off-peak hours and further reduce the load on the DB instance.
The replicated server holds a full copy of the pri-
mary database DB. The overhead in terms of storage
in this setup depends on the frequency of updates on
the original table in the DB instance. In terms of ad-
ditional data the replication server automatically ap-
pends at least three columns for insertion and modifi-
cation timestamps as well as the flags which indicate
the record status.
6 HASHING DATASETS
Researchers cannot rely on the assumption that the cited data itself has not changed since it was published; therefore evidence is needed. For ensuring the authenticity and integrity of published data archives, a checksum of the file is calculated and provided as a reference. Traditionally, checksums of datasets are based on a hash of the complete content of a data file. Tools such as md5sum or sha1sum iterate over the content of a file and compute a hash value over it. Researchers would then download the file, cal-
culate the checksum themselves locally and compare
the resulting hash with the one provided by the data
publisher. The resulting hash string differs for new or
changed result sets. The problem with this approach
is that it requires the complete dataset as a file for each
new version, hence it does not scale well for large sub-
sets.
6.1 A Chained Approach
There are two main requirements for result set hash-
ing: identification of content changes and recognising
deviations in the sorting sequence of subsets. Several
strategies exist to create a result set hash with regards
to these requirements. The most obvious approach is
using the complete result set as a basis for hashing.
This does not scale well for huge datasets. Another
approach would be to generate a hash by appending
all primary keys of the result set’s rows. This allows
tracing the sequence, but not the content integrity. A
further approach is the creation of row hashes by ap-
pending all columns of each row and computing the
hash value row wise. In this case all row hashes can
get sequentially appended and then serve as the basis
of the hash. This approach however does not reflect
the column selection that was made by the researcher.
In order to enable result set hashing with respect to integrity and sorting, we only use those columns which have been matched by the projection and those rows which have been included by the selection during the filtering. These rows are used for the hash value cal-
culation of the result set. In the next step, we calculate
the checksum of the particular subset by constructing
a hash chain. For each row $r_n$, where the subscript $n$ denotes the sequence number of the row in the result set, the system appends the data from the projected columns and concatenates the selected values into one string per row. We denote this string of appended values as $rv_n$. Then we calculate the hash of each concatenated string using a one-way cryptographic hash function, denoted $h(rv_n)$. For maintaining the sequence of the results in the subset, we use a chained hash approach as given in Equation 1.
\begin{equation}
h(rv_n) =
\begin{cases}
h(rv_0), & \text{if } n = 0\\
h\!\left(h(rv_{n-1}) + rv_n\right), & \text{if } n > 0
\end{cases}
\tag{1}
\end{equation}
AScalableFrameworkforDynamicDataCitationofArbitraryStructuredData
227
We prepend each row with the hash value of the
previous row in order to reflect the sequence of the
records in the result set. The system iterates in this
fashion over all rows of a specific subset and sequen-
tially computes a hash of the complete result set in
the correct order. As each row (except the first $rv_0$)
calculates its hash value based on its predecessor, the
sorting that was applied to the original dataset can be
maintained. Figure 2 shows this scheme.
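A compact Python sketch of this chained hashing scheme (using SHA1 as in our prototype; the separator and encoding details are simplifying assumptions):

```python
import hashlib

def chained_result_hash(rows, sep="|"):
    """Compute the chained hash h(rv_n) over an ordered result set.

    Each row is a sequence of the projected column values; the hash of the
    previous row is prepended to the concatenated values of the current row
    before hashing, so both content and ordering are reflected in the digest.
    """
    previous = ""
    for row in rows:
        rv = sep.join(str(value) for value in row)   # rv_n
        previous = hashlib.sha1((previous + rv).encode("utf-8")).hexdigest()
    return previous  # hash of the last row = checksum of the whole subset

subset = [("TR002", "98.7"), ("TR003", "140.0")]
checksum = chained_result_hash(subset)

# Reordering or modifying any row changes the overall checksum.
assert checksum != chained_result_hash(list(reversed(subset)))
print(checksum)
```

The checksum stored for a cited subset can later be compared with a recomputed one, on the server or on the client side, in order to verify that a retrieved subset is complete, unmodified and in the original order.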
The proposed hashing method allows detecting errors in the data, i.e. insertions, deletions or modifications. Furthermore, the sequence of the data and also its alignment (e.g. the sequence of the columns) of each result set can be checked against the original result set. As we calculate the hashes individually per row, the storage demand for each row is constant (e.g. SHA1 uses 160 bits per hash). The overall checksum for the result set is computed by chaining hashes. This keeps the storage demand for the creation of the hash low, as it never exceeds the length $|rv_n| + |h|$ for each hashing operation. Also, the scheme is agnostic regarding the hashing algorithm used.
Figure 2: Hashing Scheme.
Therefore our model provides evidence that a
dataset is authentic, complete and unchanged in a dy-
namic setting. This allows researchers to ensure that
they are using the correct dataset and they can ref-
erence an explicit version of a subset with a persis-
tent identifier. The authors of (Bakhtiari et al., 1995) present an overview of hash functions; widely used hash functions include MD5 and the SHA family. As MD5 has been found to be vulnerable to collision attacks (Wang et al., 2004;
Klima, 2005), we utilised the SHA1 hash function for
generating row based hashes in our prototype imple-
mentation.
6.2 A Secure System for Storing Data
Citation Metadata
The presence of hash keys alone is no guarantee for
security as they can be recomputed for manipulated
content. In order to harden our prototype implementation against data manipulation, we considered several mechanisms, which are described in the following.
It is clear that the data citation database which
holds the history data and their metadata requires pro-
tection from intended sabotage and unintended mis-
use. The same is true for the primary database. In
contrast to the primary database DB, data in DCDB is never deleted or updated. Only new records
along with the event type (e.g. INSERTED, UP-
DATED or DELETED) need to be inserted. Therefore
the historic tables do not allow updates or deletes at the permission level.
As an additional level of security, an archival
database storage mechanism needs to be deployed.
The MySQL RDBMS that we use in our prototype
implementation provides the ARCHIVE storage en-
gine that does not support DELETE, REPLACE and
UPDATE operations by design. This specialised stor-
age engine only supports INSERT and SELECT state-
ments which are the only two operations needed in
this scenario. Additionally, the rows are automatically
compressed upon data insertion which reduces the
storage footprint of the versioned data and its meta-
data used for data citation. In the particular case of
the ARCHIVE storage engine, it needs to be consid-
ered that this system does not support indexes, hence
there is a trade-off between storage demand and data
citation query performance.
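The following sketch merely prints the kind of MySQL statements this hardening relies on (database, table and account names are placeholders, not the actual schema of our prototype): an append-only history table using the ARCHIVE storage engine, and an account that is only ever granted INSERT and SELECT privileges.

```python
# Illustrative MySQL statements only; in the prototype they would be applied
# to the DCDB instance by an administrator, not by the Web service itself.
HISTORY_TABLE_DDL = """
CREATE TABLE songs_history (
    track_id   VARCHAR(32),
    tempo      VARCHAR(16),
    event      VARCHAR(16),                      -- INSERTED / UPDATED / DELETED
    valid_from TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=ARCHIVE;  -- the ARCHIVE engine accepts only INSERT and SELECT
"""

# Grant the citation service account only the two operations this scenario
# needs; UPDATE and DELETE are simply never granted.
PERMISSIONS = "GRANT INSERT, SELECT ON dcdb.songs_history TO 'citation_service'@'%';"

print(HISTORY_TABLE_DDL)
print(PERMISSIONS)
```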
6.3 Server and Client Side Dataset
Validation
The queries which are stored in the query store (as
described in Section 8) contain all information that is
required in order to rerun the query. Hence a dataset
integrity watch can be implemented by using stored
procedures that periodically recompute the hash value
of the datasets and compare it with the original hash
value. This server side data integrity check can be
provided by the API and called from the client in or-
der to assess the integrity of a previously obtained
dataset. As the hash value computation of a dataset
is kept simple, it can be easily computed on the client
side as well, which enhances transparency and in-
creases the trust in research data.
7 A FRONTEND FOR DATASET
ASSEMBLY
We developed a simple browser based frontend for
dynamic tabular data which documents the steps ap-
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
228
plied during the creation of the dataset. The structure
of the tables does not need to be known in advance
as the table configuration can be loaded dynamically.
This renders our approach flexible as it can be ap-
plied to any data format that can be represented in
tables. The client submits requests to the Web ser-
vice, which then queries the database and retrieves the
appropriate result set. All computationally intensive
operations such as filtering and sorting are moved to-
wards the server side. Therefore the client becomes
very lightweight. Researchers can use the frontend for browsing and creating even complex subsets from large data sources. As the API is generic, the client can be replaced at any time by domain specific approaches. Plugins for specialised data editors can be implemented that transparently hide the communication with the server, as long as the requests are compatible with the API of the Web service.
Figure 3: The Frontend for Creating Subsets.
The server generates SQL queries based on the fil-
ter criteria and applies the appropriate sorting trans-
parently to the user. The researcher can confirm each
filtering step and therefore apply different filter com-
binations and sortings that are applied sequentially on
the dataset. Each of these operations is traced on the
server side and stored in the query store, see Section 8.
When the user is finished with compiling the dataset,
the data can be exported as a CSV file, as JSON or
other formats. The server computes the hash as de-
scribed in Section 6 and attaches a persistent identifier
to the query. This identifier can be used later for re-
trieving the same data. Updates of the data on a record
level need to be reflected in the data citation database
instance. Hence the CSV data either needs a unique
primary key (e.g. the track id in the million song
dataset) or a frontend needs to be used, which allows
utilising the automatically generated record sequence
number. Updates can be detected by an altered hash
key of the set. If a query which already exists in the
query store delivers a different hash, the query gets
a new persistent identifier assigned and constitutes a
new version of the same set.
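The following sketch illustrates this version bookkeeping (the function and field names are hypothetical, not the actual implementation): when a stored query is re-executed and yields a different hash, a new persistent identifier is assigned and the previous version is kept as lineage.

```python
# Hypothetical sketch of the bookkeeping applied when a stored query is
# re-executed against the versioned data.
def register_execution(query_store, query_id, new_hash, mint_pid):
    entry = query_store[query_id]          # previously stored query metadata
    if entry["hash"] == new_hash:
        return entry["pid"]                # identical data: reuse the citation
    new_pid = mint_pid()                   # e.g. obtain a new DOI or handle
    query_store[query_id] = {"hash": new_hash, "pid": new_pid,
                             "previous": entry}   # keep the version lineage
    return new_pid

query_store = {"q1": {"hash": "abc", "pid": "pid-v1", "previous": None}}
print(register_execution(query_store, "q1", "abc", lambda: "pid-v2"))  # pid-v1
print(register_execution(query_store, "q1", "def", lambda: "pid-v2"))  # pid-v2
```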
8 AN ALL-PURPOSE QUERY
STORE
The query store collects the queries that have been
used in order to create a dataset, thus preserving the
information about the construction of subsets of data.
Whenever a researcher uses the frontend for assem-
bling a dataset, a query object is instantiated. This
object maintains a list of all operations the researcher
executed in their appropriate sequence. Each query
can handle multiple filters with arbitrary filter proper-
ties and it maintains the sorting direction for each of
the database columns that have been involved in the
query. This knowledge is important for preserving the sorting sequence of the dataset, which must be retained whenever a technology change forces a migration to a new data store. Figure 4 shows an ER
diagram of the query store.
Figure 4: The Query Store Holds the Metadata.
The information about the queries can be seen
as provenance data as it describes how a dataset has
been constructed. Each query contains a timestamp that allows mapping the query to a specific state of the database; hence only those records are fetched which were valid at the execution time of the query. Additionally, a persistent identifier such as
a DOI (Paskin, 2010) can be assigned to each dataset.
This allows scientists to reference a specific version
of a dataset e.g. in their publication.
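As a rough sketch of the entities the query store maintains (the field names are our own simplification and do not reproduce the exact schema of Figure 4):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Filter:
    column: str
    value: str                         # text based filtering in the current prototype

@dataclass
class Sorting:
    column: str
    direction: str                     # "ASC" or "DESC"

@dataclass
class Query:
    dataset: str                       # the uploaded source table
    filters: List[Filter] = field(default_factory=list)
    sortings: List[Sorting] = field(default_factory=list)
    executed_at: Optional[str] = None  # timestamp mapping the query to a DB state
    result_hash: Optional[str] = None  # chained hash of the result set
    pid: Optional[str] = None          # persistent identifier, e.g. a DOI

q = Query(
    dataset="million_song_features",
    filters=[Filter("year", "2011")],
    sortings=[Sorting("track_id", "ASC")],
    executed_at="2014-03-01T12:00:00Z",
    result_hash="<chained SHA1 checksum>",   # computed as described in Section 6
    pid="10.1234/example-subset-v1",         # example identifier only
)
print(q)
```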
9 CONCLUSIONS AND
OUTLOOK
In this paper we presented a framework and a pro-
totype implementation enabling dynamic data cita-
tion for a general purpose data format. We chose
the CSV format as it is used across domain boundaries, is simplistic yet flexible, and is therefore increasingly popular in settings involving larger volumes of data and dynamic data that is released in subsequent batches and integrated across versions of updated data. We thus deem it essential to facilitate con-
venient and transparent citation capabilities for such
types of data. We presented the steps necessary for scientists to create citable subsets of dynamic CSV
data. We proposed a solution which consists of a
server and a client component. The server side is re-
sponsible for data management, versioning, data se-
curity and citation facilities. It exposes an API via a
Web service for filtering, sorting and creating datasets
of arbitrary complexity that can be queried by clients.
Users can upload their datasets via a Web service to
the server which automatically migrates the file into a
relational database.
The data is annotated with extra metadata such
as original sequence of insertion, timestamps and
the row hash. The client component is a simple
browser based frontend which allows scientists to
create citable subsets from the previously uploaded
datasets. The frontend transmits each sorting or fil-
tering operation to the server component which stores
them in the query store. When the user concludes the
creation of a dataset, the server rewrites the filtering
and sorting information into a single SQL query and
appends timing metadata. A persistent identifier can
be assigned to the query and serves as reference infor-
mation for the specific subset.
We presented a novel hashing scheme which al-
lows verifying the integrity of the data and providing
result sets of provably correct sorting sequences. The
hashing mechanism is based on row based hashes and
concatenated row hashes. For enhancing the scalabil-
ity, we introduced a new replication scheme, which
allows separating the live system from the data cita-
tion instance.
In future revisions of our prototype we will inte-
grate support for several interfaces that are natively
used by scientists for assembling datasets. We will
develop plugins for various data editors that transpar-
ently hide the provenance data collection for creating
secure datasets. Furthermore, we will develop proto-
types and tools for a much broader range of data for-
mats, hence enabling stable and secure data citation
within diverse fields of research.
ACKNOWLEDGEMENTS
Part of this work was supported by the projects
APARSEN, TIMBUS and SCAPE, partially funded
by the EU under the FP7 contracts 269977, 269940
and 270137.
REFERENCES
Bakhtiari, S., Safavi-Naini, R., Pieprzyk, J., et al. (1995).
Cryptographic hash functions: A survey. Centre for
Computer Security Research, Department of Com-
puter Science, University of Wollongong, Australia.
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere,
P. (2011). The million song dataset. In Proceedings of
the 12th International Conference on Music Informa-
tion Retrieval (ISMIR 2011).
CODATA-ICSTI (2013). Out of cite, out of mind: The cur-
rent state of practice, policy, and technology for the
citation of data. CODATA-ICSTI Task Group on Data
Citation Standards and Practices.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The weka data min-
ing software: An update. SIGKDD Explor. Newsl.,
11(1):10–18.
Klima, V. (2005). Finding md5 collisions on a notebook pc
using multi-message modifications. IACR Cryptology
ePrint Archive, 2005:102.
Lawrence, B., Jones, C., Matthews, B., Pepler, S., and
Callaghan, S. (2011). Citation and peer review of
data: Moving towards formal data publication. Inter-
national Journal of Digital Curation, 6(2):4–37.
Li, Y., Swarup, V., and Jajodia, S. (2005). Fingerprinting re-
lational databases: Schemes and specialties. Depend-
able and Secure Computing, IEEE Transactions on,
2(1):34–45.
Narasimha, M. and Tsudik, G. (2006). Authentication of
outsourced databases using signature aggregation and
chaining. In Lee, M., Tan, K.-L., and Wuwongse, V.,
editors, Database Systems for Advanced Applications,
volume 3882 of Lecture Notes in Computer Science,
pages 420–436. Springer Berlin Heidelberg.
Parsons, M. A., Duerr, R., and Minster, J.-B. (2010). Data
citation and peer review. Eos, Transactions American
Geophysical Union, 91(34):297–298.
Paskin, N. (2010). Digital Object Identifier (DOI) Sys-
tem. Encyclopedia of library and information sci-
ences, 3:1586–1592.
Pröll, S. and Rauber, A. (2013a). Citable by Design
- A Model for Making Data in Dynamic Environ-
ments Citable. In 2nd International Conference
on Data Management Technologies and Applications
(DATA2013), Reykjavik, Iceland.
Pröll, S. and Rauber, A. (2013b). Scalable Data Citation
in Dynamic, Large Databases: Model and Reference
Implementation. In IEEE International Conference
on Big Data 2013 (IEEE BigData 2013), Santa Clara,
CA, USA.
Shafranovich, Y. (2005). Common Format and MIME Type
for Comma-Separated Values (CSV) Files. RFC 4180.
Wang, X., Feng, D., Lai, X., and Yu, H. (2004). Colli-
sions for Hash Functions MD4, MD5, HAVAL-128
and RIPEMD.
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
230