concept in promoting good scientific principles of
accountability (who and what devices generated the
data), traceability (the origins of all data, including
external data sources), and reproducibility (all the
information and algorithms of the entire process are
available).
We start with a single data point. Most likely this
point has been read from a data file within a
repository, together with its associated information:
who entered the data, date of entry, date of data
production, origin of data, etc. The data point itself
represents some knowledge as to its context. This
context lies within a network of interconnected
concepts. The ontological knowledge base within
CHEMCONNECT provides this context.
If the data point is a ‘direct’ measurement, it is
connected to all the information associated with
the device. First, there is the specific device, found at
a specific place (institute, department, etc.) within a
specific organization (university, research center,
etc.) and operated by a specific researcher (including
collaboration, supervisor, position, and other
information about the researcher). The device itself
has a description and can be viewed as a collection of
subsystems, each of which has a purpose and a role
in producing the data point. Within the device there
could be the actual component which produced the
data with its specific properties, including accuracy,
reliability and dependence on other components in the
device. Part of the CHEMCONNECT database is a
device description. The specific device used to
produce the point is, of course, related to similar
devices with similar properties. The
CHEMCONNECT ontology knowledge base
provides templates of device descriptions and
describes how the device relates to, and is composed
relative to, other similar devices. In addition to the
meta-data about the device (parameterized
description, abstract description, references to
publications, institutes, researchers, etc.), the device
is viewed as a set of interconnected subsystems and
components.
Templates of these descriptions are found in the
database which also gives their role and purpose
within a larger scientific context.
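The view of a device as a set of interconnected
subsystems and components can be illustrated with a
minimal sketch. All class names, field names and the
example device below are hypothetical and chosen only
for illustration; they are not the CHEMCONNECT schema.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One part of a subsystem (illustrative fields, not the real schema)."""
    name: str
    purpose: str
    accuracy: str = "unspecified"
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Subsystem:
    """A group of components with a stated purpose within the device."""
    name: str
    purpose: str
    components: list[Component] = field(default_factory=list)


@dataclass
class Device:
    """A specific device at a specific place, run by a specific researcher."""
    name: str
    organization: str
    location: str
    operator: str
    subsystems: list[Subsystem] = field(default_factory=list)

    def find_component(self, component_name: str):
        """Return (subsystem, component) for the named component, or None."""
        for subsystem in self.subsystems:
            for component in subsystem.components:
                if component.name == component_name:
                    return subsystem, component
        return None


# Hypothetical example device; every value here is made up.
sensor = Component(
    name="pressure sensor",
    purpose="records the pressure trace",
    accuracy="1 percent of reading",
    depends_on=["amplifier"],
)
amplifier = Component(name="amplifier", purpose="conditions the sensor signal")
detection = Subsystem(
    name="detection",
    purpose="produces the raw signal",
    components=[sensor, amplifier],
)
device = Device(
    name="example apparatus",
    organization="example university",
    location="example department",
    operator="example researcher",
    subsystems=[detection],
)

subsystem, component = device.find_component("pressure sensor")
```

Starting from the component that produced a data point,
such a structure lets one walk outward to the subsystem,
the device, and the organization and researcher attached
to it.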
Final data point results reported in publications are
seldom direct measurements. Usually, there is a flow
of data manipulations leading from the ‘raw’
measurement made by the device to the final result
reported in a table in the publication. It is becoming
more critical within the scientific community,
especially for the chemical kinetics community, that
this data trace is included, particularly in error
analysis, for traceability, accountability and
reproducibility. For example, the propagation of
errors can be computed in a variety of ways, ranging
from the simple, which is usually done by the primary
data producer, to the complex, which can be done by
researchers with data expertise.
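As an illustration of the simple end of this range, the
sketch below applies the standard first-order
propagation formula for independent inputs,
sigma_f^2 = sum_i (df/dx_i)^2 sigma_i^2, with the
derivatives estimated by central differences. This is a
generic textbook approach, not a method prescribed by
CHEMCONNECT; the function name, the step size and the
example values are assumptions made for illustration.

```python
import math


def propagate_uncertainty(func, values, sigmas, rel_step=1e-6):
    """First-order uncertainty of func(*values) for independent inputs.

    Each partial derivative is estimated with a central difference, then
    the contributions are added in quadrature.
    """
    variance = 0.0
    for i, (x, sigma) in enumerate(zip(values, sigmas)):
        step = rel_step * (abs(x) if x != 0 else 1.0)
        upper = list(values)
        lower = list(values)
        upper[i] = x + step
        lower[i] = x - step
        derivative = (func(*upper) - func(*lower)) / (2 * step)
        variance += (derivative * sigma) ** 2
    return math.sqrt(variance)


def ratio(numerator, denominator):
    """Hypothetical derived quantity: a simple quotient of two measurements."""
    return numerator / denominator


# Made-up measurements: 10.0 +/- 0.1 divided by 2.0 +/- 0.05.
measured = [10.0, 2.0]
uncertainties = [0.1, 0.05]
result = ratio(*measured)
result_sigma = propagate_uncertainty(ratio, measured, uncertainties)
```

For a quotient, this agrees with the familiar rule that
relative uncertainties add in quadrature. A more complex
treatment would also account for correlations between
inputs, which this sketch ignores.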
The chain of data manipulations from ‘raw’
results to final published results is represented by a
protocol. The interconnectivity of data is further
promoted by each component in this chain. A
protocol essentially consists of the entire set of
algorithms, procedures, devices and intermediate data
produced from those algorithms. Within these
components are further connections to specific
organizations, researchers, publications, and other
external references. Within the CHEMCONNECT
knowledge base, templates for protocols are given,
that is, typical experimental procedures leading to
final results. A protocol is instantiated in the
database by providing the specific information about
a specific experiment. This instantiation supplements
the general contextual knowledge within the broader
knowledge base of experimental procedures and
devices.
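The relation between a protocol template and its
instantiation can be sketched as follows. The classes,
step names and identifiers are hypothetical and only
illustrate the idea; the sketch assumes a linear chain
in which each step produces one output.

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolTemplate:
    """A typical procedure: an ordered list of step names."""
    name: str
    step_names: list[str]


@dataclass
class StepRecord:
    """What actually happened in one step of a specific experiment."""
    step_name: str
    performed_by: str
    inputs: list[str]
    output: str


@dataclass
class ProtocolInstance:
    """A template filled in with the records of one experiment."""
    template: ProtocolTemplate
    records: list[StepRecord] = field(default_factory=list)

    def add_record(self, record: StepRecord) -> None:
        """Append a record, checking that it matches the next template step."""
        position = len(self.records)
        if position >= len(self.template.step_names):
            raise ValueError("protocol already complete")
        expected = self.template.step_names[position]
        if record.step_name != expected:
            raise ValueError(
                f"expected step {expected!r}, got {record.step_name!r}"
            )
        self.records.append(record)

    def trace(self, data_id: str) -> list[str]:
        """Return the steps that led to data_id, earliest first."""
        produced_by = {record.output: record for record in self.records}
        chain = []
        pending = [data_id]
        while pending:
            record = produced_by.get(pending.pop())
            if record is not None:
                chain.append(record.step_name)
                pending.extend(record.inputs)
        return list(reversed(chain))


# Hypothetical template and experiment; all names are made up.
template = ProtocolTemplate(
    name="example measurement protocol",
    step_names=["measure", "calibrate", "average"],
)
instance = ProtocolInstance(template)
instance.add_record(StepRecord("measure", "researcher A", [], "raw"))
instance.add_record(StepRecord("calibrate", "researcher A", ["raw"], "calibrated"))
instance.add_record(StepRecord("average", "researcher B", ["calibrated"], "final"))

steps_behind_final = instance.trace("final")
```

Here the template plays the role of the general
procedure, while the records carry the specific
information (who, which inputs, which output) that
makes a published value traceable back to the raw
measurement.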
Within the knowledge base of CHEMCONNECT,
templates for algorithms are provided, and their
specific implementations can be added to the database.
Algorithms can range from simple calculations to
computer software. The algorithm description
contains further references giving a broader context
to the algorithm. An ‘algorithm’ can also be a specific
experimental procedure describing (with references)
how the data was produced. The algorithm entries in
the ontological knowledge base provide information
about the role and purpose of the algorithm within the
larger context of data manipulation.
1.2 Structure of CHEMCONNECT
The general structure of CHEMCONNECT consists
of the interaction between these entities:
Knowledge Base: This is the heart of the
CHEMCONNECT system. It is an ontology
describing the data structures and the
structures and concepts of the domain.
Repository: This holds the data in the
researchers' original form. These are the files
that are parsed and interpreted using the
knowledge base, with the results stored in the
database.
Database: This is the primary persistent
storage of individual pieces of interconnected
data. The database not only holds the data
itself, but also the data specifications and
templates used to input and interpret data.