requirement covers the capability of citing dynami-
cally changing data. Data sources can potentially be
huge in size. Citing individual attributes and cells
would require enormous numbers of unique identi-
fiers and yield infeasible citations. Hence the third re-
quirement covers scalable solutions, feasible to deal
with large data sources. The fourth requirement regards
usability. A solution will only be accepted if it is
pragmatic and transparent. The proposed requirements
are valid for all kinds of research data formats. We
demonstrate and motivate our proposed model using
relational databases to address these four requirements.
Section 3.3 then introduces
a generic model that can be used for other data such as
flat files, streaming data or various other data formats.
3.1 Dynamic Data Citation using
Relational Databases
Research data is often stored in relational database
management systems (RDBMS). The results that they
deliver are the basis for further processing. We con-
centrate on the queries and their results, not on the
large, indivisible data portions as a basis for refer-
ence. Our model increases the scalability of data cita-
tion by assigning unique identifiers only to the query
itself. Furthermore, our model increases the preservation
awareness or readiness of research projects. Our
model provides guidance on how to enhance the data
model used for processing research data, in order to
ensure it can be reliably cited and re-used in the fu-
ture.
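
The idea of identifying the query rather than individual cells can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how identifiers are assigned, so the hash-based identifier, the whitespace normalization and the names `normalize_query` and `cite_query` are hypothetical. A production system would more likely register a persistent identifier with an external service.

```python
import hashlib
from datetime import datetime, timezone


def normalize_query(sql: str) -> str:
    """Collapse runs of whitespace (naive: also affects string literals)."""
    return " ".join(sql.split())


def cite_query(sql: str, executed_at: datetime) -> dict:
    """Build one citation record for a query plus its execution timestamp.

    Assumption: the identifier is derived by hashing the normalized query
    text together with the timestamp. This scheme is illustrative only.
    """
    normalized = normalize_query(sql)
    timestamp = executed_at.astimezone(timezone.utc).isoformat()
    digest = hashlib.sha256(f"{normalized}|{timestamp}".encode("utf-8")).hexdigest()
    return {"identifier": digest[:16], "query": normalized, "timestamp": timestamp}


# One identifier covers the whole subset, regardless of how many rows it returns.
record = cite_query(
    "SELECT id, value\n  FROM measurements WHERE value > 10",
    datetime(2013, 7, 1, 12, 0, tzinfo=timezone.utc),
)
```

The number of identifiers grows with the number of cited queries, not with the number of rows or attributes in the database.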
Relational database management systems
(RDBMS) support many of our requirements off
the shelf. These databases can be used to retrieve
arbitrary subsets of data. Hence we concentrated on
this database model for a first pilot study before dis-
cussing the general applicability. Our model is based
on timestamped SELECT statements and versioned
data. Queries can be used to persistently
identify subsets of arbitrary complexity and size.
The dynamic nature of research databases requires
mechanisms that allow all changes that occurred over
time to be traced and monitored. Hence, temporal
aspects have to be included in the model. This timing
information needs to be stored for the affected records
on each UPDATE, INSERT or DELETE statement, so that
all changes can be traced.
As relational database systems are set based, the
order of the records in a result is not inherently
defined. Therefore, we need to specify stable sorting
criteria that are automatically applied to the subsets.
Depending on the size of the data set, the schema and
the complexity of the query, the retrieval of the result
set can be challenging. If these properties are met,
citing only the query persistently is sufficient to meet
our requirements. It guarantees not only consistent
result sets across time, but also consistent result lists,
even if the initial query specifies no sorting or an
ambiguous one, and even in the case of migration to a
different DBMS.
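
A minimal sketch of the stable sorting criterion, using SQLite through Python's standard `sqlite3` module. The table `measurements`, the helper names `with_stable_order` and `build_db`, and the choice of wrapping the original query in a subquery are assumptions made for illustration; the text only requires that a stable order is applied automatically. Here the user's sort column `value` is ambiguous because of ties, so the unique column `id` is appended as a tie-breaker.

```python
import sqlite3


def with_stable_order(sql, key_columns):
    """Wrap a query so that its result list is ordered unambiguously.

    key_columns must end with a column (or columns) that is unique
    within the result, otherwise ties remain possible.
    """
    order = ", ".join(key_columns)
    return f"SELECT * FROM ({sql}) AS subset ORDER BY {order}"


def build_db(rows):
    """Create an in-memory database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (id INTEGER NOT NULL UNIQUE, value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    return conn


rows = [(3, 7.5), (1, 12.0), (2, 12.0)]
original = build_db(rows)
# Simulate a migrated copy whose physical row order differs.
migrated = build_db(list(reversed(rows)))

# The original query sorts by value only, which leaves ids 1 and 2 tied.
query = "SELECT id, value FROM measurements WHERE value > 5 ORDER BY value"
stable_query = with_stable_order(query, ["value", "id"])

result_original = original.execute(stable_query).fetchall()
result_migrated = migrated.execute(stable_query).fetchall()
```

Both databases return the rows in the same sequence, so anything computed over the ordered list, such as a checksum, can be compared across systems.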
3.2 A Basic Model for Citing Data Sets
in Relational Databases
In timestamped RDBMS, timestamps are provided for
all records. This ensures that specific versions of data
can be retrieved without having to stall the database
tables for additional data. As records can change, they
need to be versioned, i.e. all changes that affect the
data need to be traceable. This entails that statements
such as DELETE or UPDATE must not destroy the
data, but rather set markers that indicate that a record
has been marked for deletion or that it has been
updated by a more recent version.
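
The non-destructive handling of UPDATE and DELETE can be sketched as follows, again with SQLite via Python's `sqlite3` module. The schema and all names (`version_start`, `version_end`, `closed_by`, `update_record`, `delete_record`) are hypothetical; this is one possible layout, not necessarily the one used in the pilot study. Each change closes the current version with a marker instead of overwriting or removing it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE measurements (
    record_id     INTEGER NOT NULL,
    value         REAL    NOT NULL,
    version_start TEXT    NOT NULL,  -- when this version was written
    version_end   TEXT,              -- NULL while the version is current
    closed_by     TEXT               -- 'UPDATE' or 'DELETE', NULL while current
)
"""


def insert_record(conn, record_id, value, ts):
    """Write a new current version of a record."""
    conn.execute(
        "INSERT INTO measurements (record_id, value, version_start) VALUES (?, ?, ?)",
        (record_id, value, ts),
    )


def _close_current(conn, record_id, ts, marker):
    """Mark the current version as ended; the row itself is kept."""
    conn.execute(
        "UPDATE measurements SET version_end = ?, closed_by = ? "
        "WHERE record_id = ? AND version_end IS NULL",
        (ts, marker, record_id),
    )


def update_record(conn, record_id, new_value, ts):
    """Logical UPDATE: close the old version, then add the new one."""
    with conn:  # both steps commit together or not at all
        _close_current(conn, record_id, ts, "UPDATE")
        insert_record(conn, record_id, new_value, ts)


def delete_record(conn, record_id, ts):
    """Logical DELETE: only set the deletion marker."""
    with conn:
        _close_current(conn, record_id, ts, "DELETE")


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_record(conn, 1, 10.0, "2013-01-01T00:00:00")
update_record(conn, 1, 11.5, "2013-02-01T00:00:00")
delete_record(conn, 1, "2013-03-01T00:00:00")

history = conn.execute(
    "SELECT value, version_start, version_end, closed_by "
    "FROM measurements ORDER BY version_start"
).fetchall()
```

After the update and the delete, both versions of record 1 are still present in `history`, each with the timestamp and marker of the operation that closed it.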
The construction of subsets of complex databases
can easily be achieved by issuing SELECT queries
against the RDBMS. To enable the data citation
facilities, the SQL query also has to be augmented with a
timestamp. This timestamp maps the subset to a
specific state of the data. As the records in the database
can be altered individually, it needs to be ensured that
the correct version that was valid at the query's
timestamp is selected for inclusion in the subset. Hence the
timestamp of the query can be used to retrieve
arbitrary subsets of a specific version of the data.
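
A sketch of how a query timestamp can select the record versions that were valid at that moment. The table layout (`version_start`, `version_end`, with NULL marking the current version), the sample rows and the helper `run_as_of` are assumptions for illustration. Timestamps are stored as ISO 8601 strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    "record_id INTEGER, value REAL, version_start TEXT, version_end TEXT)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        (1, 10.0, "2013-01-01T00:00:00", "2013-02-01T00:00:00"),  # superseded
        (1, 11.5, "2013-02-01T00:00:00", None),                   # current
        (2, 20.0, "2013-01-15T00:00:00", "2013-03-01T00:00:00"),  # deleted in March
        (3, 30.0, "2013-02-10T00:00:00", None),                   # inserted later
    ],
)

# The user's selection criterion (value > 5) is augmented with the
# timestamp predicate and an explicit, unambiguous sort order.
AS_OF_QUERY = """
SELECT record_id, value
FROM measurements
WHERE value > 5
  AND version_start <= :ts
  AND (version_end IS NULL OR version_end > :ts)
ORDER BY record_id
"""


def run_as_of(conn, ts):
    """Re-execute the cited query against the data state at timestamp ts."""
    return conn.execute(AS_OF_QUERY, {"ts": ts}).fetchall()


january_subset = run_as_of(conn, "2013-01-20T00:00:00")
march_subset = run_as_of(conn, "2013-03-15T00:00:00")
```

Re-running the query with the January timestamp returns the subset as it existed then, even though record 1 has since been updated and record 2 deleted.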
There are several ways to implement this version
information (Snodgrass, 1999).
The temporal timestamp contains the explicit date at
which the data has been changed. Suitable timestamps
are granular enough to capture the point in time that
differentiates two versions of the data. The actual
chronon to be picked depends on the potential frequency
of changes in the data, and choosing it is not a trivial
task (Jensen and Lomet, 2001).
Thus granularity can range from days to milliseconds.
Snodgrass et al. differentiate between valid time and
transaction time (Jensen et al., 1993). Valid time
refers to the period during which the data was considered a
true fact in the database. Transaction time refers to
the time when the change occurred on the system,
independent of its temporal meaning for the actual
data. The valid time concept is a reference to the
real world, the transaction time only refers to the sys-
tem time, at which a change of data was manifested.
Both concepts could be used for managing versions
in our model. As we are interested in the state of the
database at a given point in time, the transaction time
DATA 2013 - 2nd International Conference on Data Management Technologies and Applications