of data and in dynamic data that is released in sub-
sequent batches and integrated across versions of up-
dated data. We thus deem it essential to facilitate con-
venient and transparent citation capabilities for such
types of data. We presented the steps necessary for
scientists to create citable subsets of dynamic CSV
data. We proposed a solution which consists of a
server and a client component. The server side is re-
sponsible for data management, versioning, data se-
curity and citation facilities. It exposes an API via a
Web service for filtering, sorting and creating datasets
of arbitrary complexity that can be queried by clients.
Users can upload their datasets via a Web service to
the server, which automatically migrates the file into a
relational database.
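The upload-and-migrate step can be illustrated with a minimal sketch. It assumes SQLite as the relational database and SHA-256 as the hash function. The function name `migrate_csv` and the metadata columns `_seq`, `_inserted_at` and `_row_hash` are hypothetical names chosen for illustration, not the prototype's actual schema.

```python
import csv
import hashlib
import io
import sqlite3
from datetime import datetime, timezone


def migrate_csv(conn, table, csv_text):
    """Load CSV text into a new table and annotate every row with metadata.

    Assumption: the first CSV line is a header and all rows are well formed.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]

    # Data columns are stored as TEXT; the metadata columns are hypothetical.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"(_seq INTEGER PRIMARY KEY, {cols}, _inserted_at TEXT, _row_hash TEXT)"
    )

    placeholders = ", ".join("?" for _ in range(len(header) + 3))
    now = datetime.now(timezone.utc).isoformat()
    for seq, record in enumerate(records, start=1):
        # Join the fields with a unit separator so field boundaries are hashed too.
        payload = "\x1f".join(record).encode("utf-8")
        row_hash = hashlib.sha256(payload).hexdigest()
        conn.execute(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            (seq, *record, now, row_hash),
        )
    conn.commit()
    return len(records)
```

Calling `migrate_csv(sqlite3.connect(":memory:"), "stations", csv_text)` would create the table and return the number of migrated rows. The `_seq` column preserves the original sequence of insertion.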
The data is annotated with extra metadata such
as the original sequence of insertion, timestamps and
the row hash. The client component is a simple
browser-based frontend which allows scientists to
create citable subsets from the previously uploaded
datasets. The frontend transmits each sorting or filtering
operation to the server component, which stores
it in the query store. When the user concludes the
creation of a dataset, the server rewrites the filtering
and sorting information into a single SQL query and
appends timing metadata. A persistent identifier can
be assigned to the query and serves as reference infor-
mation for the specific subset.
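The rewriting of recorded operations into a single query can be sketched as follows. This is an illustration under stated assumptions, not the prototype's implementation: the class `SubsetSession`, its methods, and the UUID-based placeholder identifier are hypothetical, and a real deployment would obtain the persistent identifier from a PID service.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Comparison operators accepted in filters (an assumed whitelist).
ALLOWED_OPS = {"=", "<", ">", "<=", ">=", "<>"}


@dataclass
class SubsetSession:
    """Collects the filter and sort operations sent by the frontend."""

    table: str
    filters: list = field(default_factory=list)
    sorts: list = field(default_factory=list)

    def add_filter(self, column, op, value):
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        self.filters.append((column, op, value))

    def add_sort(self, column, descending=False):
        self.sorts.append((column, descending))

    def finalize(self):
        """Rewrite all recorded operations into one parameterized SQL query."""
        sql = f'SELECT * FROM "{self.table}"'
        params = []
        if self.filters:
            clauses = [f'"{col}" {op} ?' for col, op, _ in self.filters]
            sql += " WHERE " + " AND ".join(clauses)
            params = [value for _, _, value in self.filters]
        if self.sorts:
            keys = []
            for col, descending in self.sorts:
                direction = "DESC" if descending else "ASC"
                keys.append(f'"{col}" {direction}')
            sql += " ORDER BY " + ", ".join(keys)
        return {
            # Placeholder only; stands in for a real persistent identifier.
            "pid": f"hypothetical-pid:{uuid.uuid4()}",
            "sql": sql,
            "params": params,
            # Timing metadata appended when the user concludes the subset.
            "executed_at": datetime.now(timezone.utc).isoformat(),
        }
```

Keeping the values as bound parameters rather than interpolating them into the SQL string is a design choice of this sketch; it keeps the stored query text stable and avoids injection through filter values.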
We presented a novel hashing scheme which allows
verifying the integrity of the data and providing
result sets of provably correct sorting sequences. The
hashing mechanism is based on row-based hashes and
concatenated row hashes. To enhance scalability,
we introduced a new replication scheme, which
allows separating the live system from the data citation
instance.
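The idea of combining row hashes with a hash over their concatenation can be sketched as below. The choice of SHA-256, the field separator, and the function names `row_hash` and `result_set_hash` are assumptions made for illustration; the sketch only shows why such a scheme detects both changed values and changed row order.

```python
import hashlib


def row_hash(row):
    """Hash a single row; fields are joined with a unit separator."""
    joined = "\x1f".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def result_set_hash(rows):
    """Hash the concatenation of the row hashes in result-set order.

    Because the row hashes are fed in sequence, the same rows in a
    different order yield a different result-set hash.
    """
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row_hash(row).encode("ascii"))
    return digest.hexdigest()
```

Re-executing a stored query and comparing the recomputed result-set hash with the stored one would then indicate whether the subset, including its sorting sequence, is unchanged.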
In future revisions of our prototype we will inte-
grate support for several interfaces that are natively
used by scientists for assembling datasets. We will
develop plugins for various data editors that hide
the collection of provenance data from the user when
creating secure datasets. Furthermore, we will develop
prototypes and tools for a much broader range of data
formats, hence enabling stable and secure data citation
within diverse fields of research.
Part of this work was supported by the projects
APARSEN, TIMBUS and SCAPE, partially funded
by the EU under the FP7 contracts 269977, 269940
and 270137.
