of data and in dynamic data that is released in sub-
sequent batches and integrated across versions of up-
dated data. We thus deem it essential to facilitate con-
venient and transparent citation capabilities for such
types of data. We presented the steps necessary for
scientists to create citable subsets of dynamic CSV
data. We proposed a solution which consists of a
server and a client component. The server side is re-
sponsible for data management, versioning, data se-
curity and citation facilities. It exposes an API via a
Web service for filtering, sorting and creating datasets
of arbitrary complexity that can be queried by clients.
Users can upload their datasets via a Web service to
the server, which automatically migrates the file into a
relational database.
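
The migration step can be illustrated with a minimal sketch. It assumes SQLite as the relational database and SHA-256 as the row hash function; neither choice is taken from our prototype. The function name `migrate_csv` and the metadata column names `seq`, `inserted_at` and `row_hash` are hypothetical, and the sketch assumes the CSV header contains none of these names.

```python
import csv
import hashlib
import io
import sqlite3
from datetime import datetime, timezone


def migrate_csv(conn, table, csv_text):
    """Load CSV text into a new table and add per-row metadata columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Data columns are taken from the CSV header and stored as text.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" ('
        f"seq INTEGER PRIMARY KEY, inserted_at TEXT, row_hash TEXT, {cols})"
    )
    placeholders = ", ".join("?" for _ in range(len(header) + 3))
    now = datetime.now(timezone.utc).isoformat()
    for seq, row in enumerate(reader, start=1):
        # Assumption: the row hash is SHA-256 over the fields joined by a
        # unit separator; the actual hash input layout may differ.
        digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        conn.execute(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [seq, now, digest, *row],
        )
    conn.commit()
    return header


conn = sqlite3.connect(":memory:")
migrate_csv(conn, "songs", "title,year\nA,2001\nB,1999\n")
```

The `seq` column preserves the original sequence of insertion, so the uploaded file can be reproduced in its original order even after later sorting operations.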
The data is annotated with extra metadata such
as original sequence of insertion, timestamps and
the row hash. The client component is a simple
browser-based frontend which allows scientists to
create citable subsets from the previously uploaded
datasets. The frontend transmits each sorting or filtering
operation to the server component, which stores
it in the query store. When the user concludes the
creation of a dataset, the server rewrites the filtering
and sorting information into a single SQL query and
appends timing metadata. A persistent identifier can
be assigned to the query and serves as reference infor-
mation for the specific subset.
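
The rewriting of recorded operations into a single query can be sketched as follows. This is an illustration under assumptions, not the prototype's implementation: the class `QueryStoreEntry`, its methods and the shape of the returned record are hypothetical, and the set of supported operators is chosen only for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ALLOWED_OPERATORS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass
class QueryStoreEntry:
    """Collects the filter and sort operations sent by the frontend."""
    table: str
    filters: list = field(default_factory=list)  # (column, operator, value)
    sorts: list = field(default_factory=list)    # (column, direction)

    def add_filter(self, column, op, value):
        if op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        self.filters.append((column, op, value))

    def add_sort(self, column, direction="ASC"):
        if direction not in {"ASC", "DESC"}:
            raise ValueError(f"unsupported direction: {direction}")
        self.sorts.append((column, direction))

    def finalize(self, executed_at=None):
        """Rewrite all recorded operations into one parameterized query."""
        sql = f'SELECT * FROM "{self.table}"'
        params = []
        if self.filters:
            clauses = []
            for column, op, value in self.filters:
                clauses.append(f'"{column}" {op} ?')
                params.append(value)
            sql += " WHERE " + " AND ".join(clauses)
        if self.sorts:
            sql += " ORDER BY " + ", ".join(f'"{c}" {d}' for c, d in self.sorts)
        # Timing metadata: when the subset was created (UTC, ISO 8601).
        executed_at = executed_at or datetime.now(timezone.utc).isoformat()
        return {"sql": sql, "params": params, "executed_at": executed_at}


entry = QueryStoreEntry("songs")
entry.add_filter("year", ">=", "2000")
entry.add_sort("title", "DESC")
stored = entry.finalize()
```

The returned record is what a persistent identifier would point to: the query text, its parameters and the execution timestamp together describe the subset.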
We presented a novel hashing scheme which al-
lows verifying the integrity of the data and providing
result sets of provably correct sorting sequences. The
hashing mechanism is based on row-based hashes and
concatenated row hashes. To improve scalability,
we introduced a new replication scheme, which
allows separating the live system from the data cita-
tion instance.
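
The idea of combining row hashes with a hash over the concatenated row hashes can be sketched as below. The sketch assumes SHA-256 and a unit-separator field encoding; the function names `row_hash` and `result_set_hash` are hypothetical and the exact construction in the prototype may differ.

```python
import hashlib


def row_hash(row):
    """Hash a single row given as a sequence of string fields."""
    # Join with a unit separator so ("ab", "c") and ("a", "bc") differ.
    return hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()


def result_set_hash(rows):
    """Hash the concatenation of the row hashes in result set order."""
    combined = hashlib.sha256()
    for row in rows:
        combined.update(row_hash(row).encode("ascii"))
    return combined.hexdigest()


rows = [("A", "2001"), ("B", "1999")]
same_order = result_set_hash(rows) == result_set_hash(list(rows))
reordered = result_set_hash(rows) == result_set_hash(rows[::-1])
print(same_order, reordered)
```

Because the row hashes are concatenated in result set order, the combined hash changes both when a row's content changes and when the rows are returned in a different sequence, which is what allows the sorting to be verified.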
In future revisions of our prototype we will inte-
grate support for several interfaces that are natively
used by scientists for assembling datasets. We will
develop plugins for various data editors that transparently
handle the provenance data collection for creating
secure datasets. Furthermore, we will develop proto-
types and tools for a much broader range of data for-
mats, hence enabling stable and secure data citation
within diverse fields of research.
ACKNOWLEDGEMENTS
Part of this work was supported by the projects
APARSEN, TIMBUS and SCAPE, partially funded
by the EU under the FP7 contracts 269977, 269940
and 270137.
REFERENCES
Bakhtiari, S., Safavi-Naini, R., Pieprzyk, J., et al. (1995).
Cryptographic hash functions: A survey. Centre for
Computer Security Research, Department of Com-
puter Science, University of Wollongong, Australia.
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere,
P. (2011). The million song dataset. In Proceedings of
the 12th International Conference on Music Informa-
tion Retrieval (ISMIR 2011).
CODATA-ICSTI (2013). Out of cite, out of mind: The cur-
rent state of practice, policy, and technology for the
citation of data. CODATA-ICSTI Task Group on Data
Citation Standards and Practices.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The WEKA data mining
software: An update. SIGKDD Explor. Newsl.,
11(1):10–18.
Klima, V. (2005). Finding MD5 collisions on a notebook PC
using multi-message modifications. IACR Cryptology
ePrint Archive, 2005:102.
Lawrence, B., Jones, C., Matthews, B., Pepler, S., and
Callaghan, S. (2011). Citation and peer review of
data: Moving towards formal data publication. Inter-
national Journal of Digital Curation, 6(2):4–37.
Li, Y., Swarup, V., and Jajodia, S. (2005). Fingerprinting re-
lational databases: Schemes and specialties. Depend-
able and Secure Computing, IEEE Transactions on,
2(1):34–45.
Narasimha, M. and Tsudik, G. (2006). Authentication of
outsourced databases using signature aggregation and
chaining. In Lee, M., Tan, K.-L., and Wuwongse, V.,
editors, Database Systems for Advanced Applications,
volume 3882 of Lecture Notes in Computer Science,
pages 420–436. Springer Berlin Heidelberg.
Parsons, M. A., Duerr, R., and Minster, J.-B. (2010). Data
citation and peer review. Eos, Transactions American
Geophysical Union, 91(34):297–298.
Paskin, N. (2010). Digital Object Identifier (DOI) Sys-
tem. Encyclopedia of library and information sci-
ences, 3:1586–1592.
Pröll, S. and Rauber, A. (2013a). Citable by Design
- A Model for Making Data in Dynamic Environ-
ments Citable. In 2nd International Conference
on Data Management Technologies and Applications
(DATA2013), Reykjavik, Iceland.
Pröll, S. and Rauber, A. (2013b). Scalable Data Citation
in Dynamic, Large Databases: Model and Reference
Implementation. In IEEE International Conference
on Big Data 2013 (IEEE BigData 2013), Santa Clara,
CA, USA.
Shafranovich, Y. (2005). Common Format and MIME Type
for Comma-Separated Values (CSV) Files. RFC 4180.
Wang, X., Feng, D., Lai, X., and Yu, H. (2004). Colli-
sions for Hash Functions MD4, MD5, HAVAL-128
and RIPEMD.
DATA 2014 - 3rd International Conference on Data Management Technologies and Applications