In addition to general archive databases, NCBI
has created specialized resources and tools that can
be utilized in metagenomics data analysis. These
resources are discussed in section 3.
2 PRIMARY DATA ARCHIVES
As a national resource for molecular biology
information, the National Center for Biotechnology
Information (NCBI) develops, distributes, supports,
and coordinates access to a variety of databases and
software for the scientific and medical communities
(Sayers, 2010).
2.1 Sequence Read Archive
The advent of massively parallel sequencing
technologies has opened an extensive new vista of
research possibilities, including elucidation of the
human microbiome, discovery of polymorphisms and
mutations in individual genomes, mapping of
protein–DNA interactions, and positioning of
nucleosomes, to name just a few. To achieve these
research goals, researchers must be able to
effectively store, access, and manipulate the
enormous volume of read data generated by massively
parallel sequencing experiments.
In response to the research community’s call for
such a resource, NCBI, EBI, and DDBJ, under the
auspices of the International Nucleotide Sequence
Database Collaboration (INSDC), have developed
the Sequence Read Archive (SRA) data storage and
retrieval system (Shumway, 2010). The SRA not
only provides a place where researchers can archive
their sequence read data, but also enables them to
quickly access known data and their associated ex-
perimental descriptions (metadata).
Now that the archive has reached an initial state
of completion and is publicly available at NCBI, it
is being deployed at EBI (under the name European
Read Archive, or ERA) and will soon also be
deployed at DDBJ (under the name DDBJ Read
Archive, or DRA). NCBI and EBI have already begun
exchanging data, and once the DRA is in place at
DDBJ, there will be a regular data exchange
among all three INSDC members.
In order to store and retrieve the enormous
amount of data generated by massively parallel se-
quencing technologies, NCBI, EBI and DDBJ
needed to create a data repository that has much of
the power of a relational database while being
lightweight, transportable and flexible like flat-file
storage. The solution was to create a hybrid relation-
al database with a file-based and column-oriented
design.
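To illustrate what a column-oriented layout means in this context, the following minimal sketch contrasts row-wise and column-wise storage of a few reads. It is not the SRA's actual file format; the field names (`name`, `bases`, `quality`), the toy reads, and the helpers `to_columns` and `get_row` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Row-oriented view: one record per read (hypothetical fields and toy data).
rows: List[Dict[str, str]] = [
    {"name": "read1", "bases": "ACGT", "quality": "IIII"},
    {"name": "read2", "bases": "GGCA", "quality": "HHHH"},
    {"name": "read3", "bases": "TTAG", "quality": "GGGG"},
]


def to_columns(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Pivot row records into one list per field (column-oriented view)."""
    columns: Dict[str, List[str]] = {}
    for record in records:
        for field_name, value in record.items():
            columns.setdefault(field_name, []).append(value)
    return columns


def get_row(columns: Dict[str, List[str]], index: int) -> Dict[str, str]:
    """Reassemble a single read from the separate columns."""
    return {field_name: values[index] for field_name, values in columns.items()}


columns = to_columns(rows)

# Reading one field for every read touches a single column only,
# leaving the names and quality strings untouched.
all_bases = columns["bases"]
```

In a layout of this kind, a query that needs only the base calls can skip the other columns entirely, while `get_row` shows that whole records remain recoverable when needed.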
Within SRA the data are organized into four
types of records: studies (SRP accessions), experi-
ments (SRX accessions), samples (SRS accessions)
and runs (SRR accessions). Studies contain one or
more experiments, each of which contains one or
more runs, each of which in turn may contain data
on tens of millions of individual reads. The various
record types representing data from a study are all
linked to one another within Entrez
(www.ncbi.nlm.nih.gov/sra/), allowing users to
browse the data easily on the web.
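The record hierarchy described above can be sketched in a few lines of Python. This is an illustrative model only, not an NCBI API: the class names, the `sample_accession` field linking an experiment to its sample, the `read_count` field, and the accession numbers are assumptions made for the example. Only the four prefixes (SRP, SRX, SRS, SRR) come from the text.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Accession prefixes for the four SRA record types named in the text.
PREFIX_TO_TYPE = {
    "SRP": "study",
    "SRX": "experiment",
    "SRS": "sample",
    "SRR": "run",
}

# Assumed shape for illustration: a known prefix followed by digits.
_ACCESSION = re.compile(r"^(SRP|SRX|SRS|SRR)(\d+)$")


def record_type(accession: str) -> str:
    """Return the record type implied by an accession's prefix."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized SRA accession: {accession!r}")
    return PREFIX_TO_TYPE[match.group(1)]


@dataclass
class Run:
    accession: str
    read_count: int = 0  # hypothetical field: number of reads in the run


@dataclass
class Experiment:
    accession: str
    sample_accession: str  # assumed link from an experiment to its sample
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def total_reads(study: Study) -> int:
    """Sum the read counts over every run of every experiment in a study."""
    return sum(run.read_count for exp in study.experiments for run in exp.runs)


# Placeholder accessions and counts, not real records.
study = Study(
    "SRP000001",
    [Experiment("SRX000001", "SRS000001",
                [Run("SRR000001", 100), Run("SRR000002", 200)])],
)
```

Nesting runs inside experiments and experiments inside studies mirrors the containment stated in the text; samples are referenced by accession rather than nested, which is a design choice of this sketch.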
2.2 GenBank – Nucleotide Sequence
Archive Database
GenBank (Benson et al., 2010) is a comprehensive
database that contains publicly available nucleotide
sequences for more than 300 000 organisms named
at the genus level or lower, obtained primarily
through submissions from individual laboratories
and batch submissions from large-scale sequencing
projects, including whole genome shotgun (WGS)
and environmental sampling projects. NCBI builds
GenBank primarily from the submission of sequence
data from authors and from the bulk submission of
expressed sequence tag (EST), genome survey se-
quence (GSS) and other high-throughput data from
sequencing centers. GenBank data are available at no
cost over the Internet, through FTP and a wide range
of web-based retrieval and analysis services.
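As one example of web-based retrieval, a GenBank flat-file record can be requested through NCBI's E-utilities `efetch` service. The sketch below only builds the request URL and does not contact the network. The helper name `genbank_record_url` is hypothetical, and the accession shown is just an example string to be replaced with the record of interest.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build an efetch URL requesting one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession of the requested record
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain-text response
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Example accession; substitute any GenBank accession.
url = genbank_record_url("U00096")
```

The resulting URL can be opened in a browser or passed to any HTTP client. Bulk downloads are better served by the FTP distribution mentioned above.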
2.3 Metadata: BioProject and
BioSample
BioProject. New technologies have significantly
increased the volume of data that can be generated
and submitted to archival database resources. A
genome project is no longer limited to genome
sequencing, assembly, and annotation. New types of
experimental studies include epigenomics,
proteomics, metabolomics, and other 'omics' fields.
Advances in sequencing technologies have also
changed the scope of genomic studies; it has become
possible to sequence multiple genomes of many
different organisms, ranging from hundreds of
bacterial strains to 1000 human individuals. It is
also possible to study microbial populations in
their natural environment without growing them in
culture, by sequencing samples collected directly
from the environment. Our view of genomic,
metagenomic, and biomedical projects is rapidly
changing. This affects the way the data are
organized and represented in
BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms