Data Curation Framework for Facilities Science
Vasily Bunakov and Brian Matthews
Scientific Computing Department, Science and Technology Facilities Council, Harwell OX11 0QX, U.K.
Keywords: Research Data, Research Lifecycle, Data Curation, Big Data, Linked Data.
Abstract: The trend in research data management practice is that the role of large facilities represented by particle
accelerators, neutron sources and other scientific instruments of scale extends beyond providing capabilities
for the raw data collection and its initial processing. Managing data and publications catalogues, shared
software repositories and sophisticated data archives have become common responsibilities of the research
facilities. We suggest that facilities can further move from managing data to curating them which implies
meaningful data enrichment, annotation and linkage according to the best practices which have emerged in
the facilities science itself or have been borrowed elsewhere. We discuss the challenges and opportunities
that are the drivers for this role transformation, and suggest a data curation framework harmonized with the
research lifecycle in facilities science.
1 INTRODUCTION
The growth of research complexity, the increased
costs of the advanced scientific instruments, and the
internationalization of science have led to the
emergence of research facilities that can be thought
of as well-equipped hubs where research teams
come to perform their experiments, often associated
with other experiments in the same or other research
centres.
The research facility core is typically represented
by a unique scientific instrument: a particle
accelerator, a neutron source, a powerful laser, a
telescope, or a supercomputer that allows detailed
simulation of natural phenomena, or by a few such
instruments that offer researchers different
experimental techniques. Examples would include
the Diamond Synchrotron Light Source
(www.diamond.ac.uk), ISIS neutron source
(www.isis.stfc.ac.uk) or the future Square Kilometre
Array (www.skatelescope.org). The exact boundary
between basic and applied research on such facilities
may be ill-defined, e.g. the same electron
synchrotron may be used part time to explore the
fundamental effects of particle collisions and part
time as the source of synchrotron radiation for
materials science, biology and pharmaceutics. For
the sake of clarity, we use the term “facilities
science” for the research performed on large-scale
scientific instruments by visitor teams or individual
researchers who obtain, via the application process,
access to the common facility resource in order to
conduct their experiments or observations, and to
collect the resulting data.
The instruments and experimental techniques
may be different between facilities; the purpose of
research may be more inclined to scientific inquiry,
or more practical in view of industrial applications.
What is common across facilities science is a
business model for servicing the facility users
(researchers); the users’ social habits, e.g. the
accepted modes of managing research output, are
less definitive but also important. These
commonalities lay a foundation for a generic data
lifecycle in facilities science, as well as for common
metadata models and information systems
architecture.
Our modelling and implementation effort in
respect to supporting the facilities’ data lifecycle is
mentioned in this paper but we concentrate on
challenges and opportunities that the facilities
science business model and the researchers’ social
attitudes present for data curators and technologists;
we then discuss a framework that should address
these challenges and opportunities.
211
Bunakov V. and Matthews B..
Data Curation Framework for Facilities Science.
DOI: 10.5220/0004593302110216
In Proceedings of the 2nd International Conference on Data Technologies and Applications (DATA-2013), pages 211-216
ISBN: 978-989-8565-67-9
Copyright
c
2013 SCITEPRESS (Science and Technology Publications, Lda.)
2 CHANGING LANDSCAPE
OF FACILITIES SCIENCE
The evolving changes in business model, technology
and facilities users’ behaviour are all interrelated and
result in new challenges and new opportunities for
the facility science stakeholders, specifically for data
curators and IT specialists.
2.1 Changes in Business Model,
Technology, and user Behaviour
A business model for user research on large facilities
that emerged more than 50 years ago has been
influenced by a few recent developments.
Instrumentation and data analysis have become more
user friendly than in early days of facilities science.
This has led, among other effects, to a lesser
significance of the instrumentation “gurus” with a
current trend of not including them as the authors of
research papers; the estimate e.g. for biology papers
is that about half of them do not now include any
facility staff members as co-authors (Mesot, 2012).
The advances of instrumentation and Internet
have also led to the emergence of specific services
for research and industry such as the UK National
Crystallography Service (Coles and Gale, 2012) that
allows users to send their samples for remote
investigation according to one of the service plans.
The sample exposure on a large facility like
synchrotron radiation source may be just one of the
experimental techniques included in the service plan
so that users have got a “seamless” interface for the
multi-aspect investigation of a crystal substance
submitted. The service provider then collects all the
experimental data and supplies them to the user in
pre-agreed formats. The facilities themselves have
also started offering this sort of “express” service
with the user presence not required for the conduct
of experiment.
The users’ attitude towards research may also
have a significant influence on the research lifecycle
and services in support of it. The user monitoring
exercise performed by PaNdata initiative showed
that about seven thousand of visitor researchers
across Europe, or 22 per cent of them have used
more than one neutron or synchrotron radiation
facility for their investigations (http://wiki.pan-
data.eu/CountingUsers). The reasons for this
substantial level of facilities sharing are often of a
research nature as the characteristics of the
experimental environment are different between
facilities. The facilities sharing is a strong incentive
for having a common infrastructure for data
management and user management which is now a
focus of PaNdata Open Data Infrastructure project
(see under http://pan-data.eu/).
Another driver for change in data management
and data curation is the emergence of new
experimental techniques like neutron tomography, or
using robots for manipulating multiple samples
exposed to a synchrotron beam, or studies of
dynamics of materials. The new techniques produce
larger volumes of data making Big Data bigger than
ever; they also raise potential opportunities for
researchers to perform comparative and multi-aspect
studies for the same samples using different
experimental techniques, or using the same
experimental technique for much wider variety of
different samples. These trends appeal to providing a
richer, well annotated and linked context for
experimental data across different facilities, different
experimental techniques and different sample types
so that the mentioned research opportunities for
comparative and multi-aspect studies could turn
reality.
2.2 Challenges and Opportunities
The challenge of Big Data in terms of more
processing power and more network bandwidth
required is imminent and well understood. We will
not detail it here apart from to note that addressing
particular parts of the data files and archives for their
inclusion in the research discourse, e.g. citing
granular parts of the immense datasets, requires an
adequate modelling of data, and scalability of data
management.
The change of the instrumentalists’ role who as
we mentioned do not always receive a due credit for
their job of preparing sophisticated experiments
requires re-thinking of the attribution methods for
research papers and other research outputs such as
datasets. Facilities science may look at the
developments such as role-based attribution in other
fields of research (Marcos et al., 2012); this is just
one example of how specialists in the facilities’
information departments could explore the new
information culture elsewhere, and promote the best
practices of it across facilities science. This example
also shows that data curation is in fact a
responsibility of everyone involved in research
lifecycle: the authors themselves, not any curation
unit down the research results distribution road,
should be able to add the structured description of
roles according to a reasonable metadata standard.
Information departments then can be seen as
hubs or centres of expertise which monitor, refine,
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
212
and communicate best practices of data curation for
other stakeholders (research papers authors in the
last example). The consistent and clearly formulated
framework will make a collaborative data curation
effort much better defined and communicated, and
the best data curation practices more readily adopted
by the research community. Supervision of various
kinds of information through the research lifecycle
will help then to create rich data aggregations and
reproducible research workflows with contributions
naturally made by different lifecycle stakeholders.
The next challenge and opportunity is presented
by the emergence of research services such as the
aforementioned UK National Crystallography
Service. This trend raises questions on the user
management, research proposals management and
data management in facilities science. Just one
example of that are the future role and the content of
data management policies which some facilities tend
to impose on their users as a pre-condition for
getting a facility resource for research. The policy
may ask users to agree with the public release of
their experimental data after a period of exclusive
access (typically a few years), or contain the
requirement to submit the list of resulting
publications back to the facility user office. This
works well in a traditional business model of
facilities science but does not take into account the
emergence of the service intermediaries who may
need to be a subject of the data management policy,
too, so that it becomes a multilateral agreement.
The data management policy format which is
now just plain text is also questionable as it is not
interpretable without a human; this will be likely not
enough for the automated research proposals
management and data release management across
different facilities. The development of licences for
data re-use, or the adoption of suitable ones could
alleviate the problem but licences might need a
proper machine-oriented modelling for policy
enforcement; the indication of what is possible in
respect to structured modelling and automation of
data licences can be seen in the recent formation of
the Linked Content Coalition
(www.linkedcontentcoalition.org) endorsed by the
European Commission and some national
governments. Again, information departments of
large research facilities might consider borrowing
the advanced practices and models of data licensing
for their re-use in facilities science.
Another important consideration is the
interoperability of metadata models and their actual
implementations for different research facilities. The
idealized metadata model for facilities science that
we call Core Scientific MetaData (CSMD)
(Matthews et al., 2012) is derived from a generic
research lifecycle in facilities science:
Figure 1: Generic research lifecycle in facilities science.
The different stages of research lifecycle produce
data artefacts (research proposals, user records,
datasets, publications etc.) that are similar across
research facilities so having a common metadata
model like CSMD seems sensible. However, it may
be applied differently by different facilities; there are
a few CSMD implementations in data catalogues
across Europe by virtue of the ICAT platform
(http://code.google.com/p/icatproject/) but the
model, and the actual use of its elements may vary
among implementations. This may result in extra
design and implementation overheads when we
consider federated services for a few facilities (even
when based on the same software platform), also
there is no guarantee that once we have the federated
solution agreed and implemented, it will be not
affected sooner or later by the diverging business
needs of different participants. The common data
curation framework for facilities science might help
to have these needs permanently monitored, properly
communicated and effectively reconciled thus
serving as a well-structured business analysis
wrapper for technology solutions.
An interesting development that may be
considered a part of the emerging data curation
framework but has exposed certain challenges, too,
is the recent effort of minting Digital Object
Identifiers for investigations performed on ISIS
neutron facility (Wilson, 2012). Having permanent
identifiers minted for particular investigations
(experiments) should be enough for linking them to
datasets and publications but in order to have a
structured and linkable representation of a facility
research environment, other parts of it such as
scientific instruments, experimental techniques,
people, organizations, software, derived data sets
etc. need minting or borrowing identifiers for them,
too. There is currently no sustainability model for
this activity, as well as for the steady production and
support of landing Web pages where the permanent
identifiers (all kinds of them) should ideally resolve
into. The different aspects – modelling,
technological, operational – of the permanent
identifiers management should be an important part
DataCurationFrameworkforFacilitiesScience
213
of the data curation framework for facilities science.
We should also mention organizational barriers
to sharing the content and the context of the research
discourse: grant applications, facilities beam time
applications (research proposals), the raw data
collected, the research outputs, the models and the
software used for data analysis or long-term digital
preservation – all these components tend to be
managed and published under separate ownership
but can and should be interlinked and navigable in
order to get the most of the impressive resource
spent on the preparation and the actual conduct of
facilities research. Linked Data might help here, and
it proved to be a productive methodology for
processing Big Data in some important research
fields with industrial output such as drug discovery
(Dumontier and Wild, 2012). There are even more
advanced data modelling techniques for sharing the
reproducible research workflows that are well
accepted in some research domains, e.g. biology
(Bechhofer, 2013). However, these techniques
typically cover only certain parts of the larger
research lifecycle that are immediately related to
research work, with the Researcher as a major target
of data linkage and data sharing. The needs of other
stakeholders residing in education, industry, research
management and funding, or policy making are
underrepresented and do not have a consistent
framework where all of them, along with the
researchers and intermediary services, could fit in.
In the absence of a structured data curation
framework, the information departments of large
facilities are often confined to supplying the
technology solutions and IT services when their next
role could be that of a conscious data curator helping
to increase data value across the entire research data
lifecycle for the variety of stakeholders (Wilson,
2012); information technologies and services would
be then a very important means to underpin the data
curation role but not the end in themselves.
In order to adopt this new role, the information
departments of large research facilities cannot
entirely rely on the existing organizational structure
as their role and actual influence in a larger research
context is inevitably limited. What they can do is
devise and elaborate a common framework for
sharing the existing best practices across different
organizational units and collaborative projects; the
framework will also serve to bring the best practices
from elsewhere for the adaptation to the needs of
facilities science. The projects, initiatives and
working groups that the information departments are
involved in will be a means to support certain
“themes” in the common data curation framework.
This should result in better opportunities for the
organizational units and collaborative projects to
interoperate, to reconcile their priorities, and to set
common (and commonly understandable) goals.
3 COMPONENTS OF DATA
CURATION FRAMEWORK
We consider some aspects of a suitable data curation
framework. It may take into account the actual
content and the stance of the existing frameworks in
the IT-relevant domains such as ITIL for service
management (www.itil-officialsite.com), or the
relevant project management frameworks. Those can
be to a certain extent “role models” of what may
constitute our own framework but there will be
substantial differences, too, owing to the specifics of
facilities science as business environment.
3.1 Data Curation Perspectives
The basis of the aforementioned mature frameworks
is typically two-fold: generalization of best practices
in the field and a consistent conceptual thinking
often represented by the notion of re-usable
“processes” and “functions” that reflect an
importance of the operational perspective in the
business world, and the functional nature of
management style in many business projects and
services. The framework for the research data
curation should include the operational perspective
and may develop a functional approach for certain
domains, too, OAIS reference model for digital
preservation (OAIS, 2012) being a good example of
it. However, owing to the cooperative nature of
scientific research (compared to the more direct
governance in business world) and to the need for
such a framework to be adaptive and comprehensive
enough, it should include more perspectives:
Business Analysis Perspective
The business case for data curation should be
well formulated and permanently updated
Modelling Perspective
Modelling may be applied to a variety of
artefacts: to data or metadata, or to the policies
and business processes
Technology Perspective
Technology is important and should be
consciously harnessed for data curation
Operation Perspective
Data curator should always keep in mind the
operational environment and issues that may
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
214
arise in it: scalability, sustainability etc.
Communication Perspective
Structured communication with various data
curation stakeholders should be a permanent
activity accompanying all the others.
3.2 Data Curation Themes
The outlined Perspectives allow considering all the
important aspects of a data curation problem or a
data curation solution; in addition to them, the
adaptive data curation framework will benefit from
having permanent Themes. One or more Themes
may be relevant to the scope of a particular project
or initiative hence they are the tool for mapping the
actual data curation effort (including development of
new approaches and techniques) to the rest of the
framework.
We list the Themes that are deemed important
according to our own experience in data curation
projects; as the framework evolves, they should be
refined through discussions with a variety of
stakeholders across different research facilities:
Table 1: Data curation themes.
Theme Comment
Identification of the
existing and emerging data
curation stakeholders
Also monitoring their needs
that may lead to the roles
change
Facility user management
practices and policies
Including comparative
studies across facilities
Data curation practices and
policies in facilities science
Analysing them for
different stages of facilities
research lifecycle
Data curation practices and
policies elsewhere
To adopt the best of them
in the facilities science
Permanent identifiers * Minting or re-using them
for instruments, techniques,
samples, papers, datasets
etc.
Data Context * Modelling and managing
various metadata and
Linked Data; monitoring
linkable data sources and
services
Data mining * Discovering data patterns;
data indexing and
classification
Data analysis and
visualization *
Including those in
collaborative environment
(“virtual labs”)
Data value and data cost How to model, measure,
and manage them
Standards and
recommendations
Adoption of the best and
opportunities to contribute
Star marked items may be considered particular
techniques of data curation but we reserved
dedicated Themes placeholders for them to
emphasize their importance.
Some of the Themes may look specific to certain
Perspectives but in fact, every Theme may require
many Perspectives applied. As an example, when we
consider minting DOIs we should employ the
Operation Perspective that will advise on the
feasibility and costs of exploiting the practice in a
sustainable manner, and the Communication
Perspective in order to educate stakeholders
concerned, and to get their feedback for the practice
improvement.
3.3 The Framework Application
The framework can be applied to the identified
Problems, or to Solutions in order to evaluate their
feasibility or quality. The recommended process can
be outlined as follows:
1) For a particular project aimed at management
or curation of facility science data, identify major
Problems or Solutions that seem viable.
2) Identify where the Problem or the Solution
applies in the facilities research lifecycle (see Figure
1); it may be one or a few stages.
3) Apply different Themes to the Problem or the
Solution, and decide which ones are most relevant or
most important in a particular case (prioritize
Themes for each Problem or Solution).
4) Consider each prioritized Theme from each of
the five Perspectives; decide which Perspectives are
most relevant or most important in a particular case.
5) Elaborate the prioritized Themes and
Perspectives against the Problem or Solution. If new
Problems or Solutions emerge whilst applying the
framework, apply it to them, too.
Figure 2: Data curation framework application.
DataCurationFrameworkforFacilitiesScience
215
As applying the framework will take into account
the significance of Themes and Perspectives in each
particular case, we expect that the entire number of
aspects to be considered (that is a multiplication of
the number of significant Themes by the number of
significant Perspectives) should not exceed a dozen
or so for a particular Problem or Solution. If this
reasonable limit is going to be exceeded, the
Problem or Solution should be decomposed, with the
framework applied to the identified components.
Applying the framework stops when all the
Problems or Solutions have been considered from all
significant Perspectives. The examples of particular
outputs resulted from the framework application will
be the IT solution quality assessment, or the data
management plan.
3.4 Further Works and Reference
Implementation
The core of the framework outlined in this paper
should be discussed with a variety of data curation
stakeholders in different research facilities, and
elaborated accordingly; PanData consortium and its
projects (www.pan-data.eu) will be a proper forum
for that. The resulted framework can be applied then
to a particular business case in the interests of a
certain research facility, or a few.
The case we are willing to consider is the long-
term digital preservation of the research outputs of
neutron and photon facilities; specifically, the
preservation of the more complex information
aggregations than just raw datasets. This will require
a more universal and multi-aspect approach than can
be found in particular digital preservation projects
that typically have their own specific agenda and use
the data samples of facilities research output only for
illustration purposes. One of the problems that as we
hope the framework will help to address in digital
preservation is the validated alignment of the system
architecture and technology to the actual data
preservation policies and procedures.
4 CONCLUSIONS
Large experimental facilities have a unique position
in the research landscape that allow them to evolve
from supplying the crude services (time slots and
experimental environment) through various modes
of managing research data to becoming the
researchers’ partners in meaningful data curation.
Sharing and refining the best practices across
organizational units and research centres should
result in birth and growth of a common data curation
framework for facilities science that covers the
entirety of the research lifecycle and takes into
account the business analysis, modelling,
technological, operational, and communication
perspectives. Such a framework will give a common
language for various case studies, system design and
implementation effort of different organizational
units and collaborative projects; it will be therefore a
valuable aid to the consistent and sustainable data
curation in large experimental facilities and
collaborations of them.
ACKNOWLEDGEMENTS
This paper is related to the projects of PaNdata
collaboration (www.pan-data.eu) supported by the
EU 7
th
Framework Programme for Research and
Technological Development. The authors would like
to thank their colleagues in PaNdata for their input
for this paper although the views expressed are the
views of the authors and not necessarily of the
collaboration.
REFERENCES
Bechhofer, S. et al., 2013. Why linked data is not enough
for scientists. Future Generation Computer Systems,
2013, 29(2), 599-611.
Coles, S. J. and Gale, P. A., 2012. Changing and
Challenging Times for Service Crystallography.
Chemical Science, 2012, 3 (3), 683-689.
Dumontier, M. and Wild, D., 2012. Linked Data in Drug
Discovery. IEEE Internet Computing, 2012, 16(6), 68-
71.
Matthews, B. et al., 2012. Model of the data continuum in
Photon and Neutron Facilities. PaNdata ODI,
Deliverable D6.1. http://pan-data.eu/sites/pan-
data.eu/files/PaNdataODI-D6.1.pdf.
Marcos, E. et al., 2012. Author order: what science can
learn from the arts. Communications of the ACM,
2012, 55(9),39-41.
Mesot, J., 2012. A need to rethink the business model of
user labs? Neutron News, 2012, 23 (4), 2-3.
OAIS, 2012. Reference Model for an Open Archival
Information System. CCSDS 650.0-M-2 (Magenta
Book) Issue 2, June 2012. http://public.ccsds.org/
publications/archive/650x0m2.pdf.
Wilson, M., 2012. Meeting a scientific facility provider's
duty to maximise the value of data. In DataCite
Summer Meeting, Digital Research Data in Practice
(DataCite2012), Copenhagen, Denmark. http://
epubs.stfc.ac.uk/work-details?w=62852.
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
216