INFRASTRUCTURE FOR METAGENOME DATA

MANAGEMENT AND ANALYSIS

Tatiana Tatusova

National Center for Biotechnology Information, National Library of Medicine

National Institutes of Health, 9600 Rockville Pike, Bethesda, MD, 20892, U.S.A.

Keywords: Database, Sequence analysis, Metagenomics.

Abstract: Metagenome sequencing projects are generating unprecedented amounts of data. Public sequence archive

databases are challenged with large-scale data management issues including data storage, quick search and

retrieval of the sequence data for further analysis. The sequence data is linked to the rich set of metadata

attributes such as geochemical and ecological parameters for environmental projects and clinical patient in-

formation for human microbiome studies. That complex collection of heterogeneous information has to be

integrated, organized and presented to the users in a meaningful and the most useful way. For the last 20

years The National Center for Biotechnology Information (NCBI) has been developing the infrastructure

that allows an easy storage and distribution of various types of bimolecular data as well as data integration

and easy navigation in complex information space. Here we describe NCBI resources that are used for me-

tagenomics data management.

1 INTRODUCTION

New generation sequencing technology made it

possible to study microbial communities in their

natural environment. By collecting samples directly

from the environment and sequencing DNA without

isolation and growing in the artificial conditions

researches are given an opportunity to understand

the role of microbial organisms in ecological sys-

tems. The questions the researches are usually ask

are:

1) what is the structure of microbial community and

relative abundance of different species?;

2) What is the functional role of the bacterial com-

munities in the ecosystem? In other words scientists

want to know “who they are?” and “what they do?”

One well established way to answer the first ques-

tion is to collect and sequence 16S RNA genes and

perform phylogenetic analysis. To answer the

second question genomic DNA, assembled and an-

notated. The analysis of the predicted proteins might

provide some insight into the functional role of bac-

terial communities in the regulation of biochemical

processes in ecosystems. More recently with the

RNA Seq technology metatranscriptome data be-

came available for the analysis the expression level

of functional activity of microbial communities.

Sequence data generated by metagenome

projects is made available to the research community

through public data achieves described in section 1.

Sequence data by itself doesn’t contain enough

information for the analysis, it is necessary to frame

the physic-chemical context within which the data is

to be interpreted. The initial step in any metagenom-

ics study requires the collection of samples destined

for analysis. The geo-chemical or medical characte-

ristics describing a sample constitute the "meta data"

intricately tied to a given sample and aid in interpret-

ing the biological significance of the genetic infor-

mation. Linking the metadata to the sequence data

extracted from the sample is one of the key elements

in further analysis. Computational analysis of the

metagenomics creates a great challenge due to the

huge volume of the metagenomics data requiring

extremely powerful computational resources and

novel approaches to sequence analysis and visualiza-

tion methods.

NCBI has recently developed new resources that

allow capturing some metadata associated with se-

quence submission such as the description of the

project, description of each sample and geochemical

and ecological data associated with the study. Sec-

tion 2 will provide the detailed description of the

specialized resources.

357

Tatusova T..

INFRASTRUCTURE FOR METAGENOME DATA MANAGEMENT AND ANALYSIS.

DOI: 10.5220/0003333803570362

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (Meta-2011), pages 357-362

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

In addition to general archive databases NCBI

has created specialized resources and tools that can

be utilized in metagenomics data analysis. These

resources are discussed in section 3.

2 PRIMARY DATA ARCHIVES

As a national resource for molecular biology infor-

mation, National Center for Biotechnology Informa-

tion develops, distributes, supports, and coordinates

access to a variety of databases and software for the

scientific and medical communities (Sayers, 2010).

2.1 Sequence Read Archive

The advent of massively parallel sequencing tech-

nologies has opened an extensive new vista of re-

search possibilities — elucidation of the human

microbiome, discovery of polymorphisms and muta-

tions in individual genomes, mapping of protein–

DNA interactions, and positioning of nucleosomes

— to name just a few. In order to achieve these

research goals, researchers must be able to effective-

ly store, access, and manipulate the enormous vo-

lume of read data generated from massively parallel

sequencing experiments.

In response to the research community’s call for

such a resource, NCBI, EBI, and DDBJ, under the

auspices of the International Nucleotide Sequence

Database Collaboration (INSDC), have developed

the Sequence Read Archive (SRA) data storage and

retrieval system (Shumway, 2010). The SRA not

only provides a place where researchers can archive

their sequence read data, but also enables them to

quickly access known data and their associated ex-

perimental descriptions (metadata).

Now that the archive has reached an initial state

of completion and is publically available at NCBI, it

is being deployed at EBI (under the name European

Read Archive, or ERA), and soon will also be dep-

loyed at DDBJ (under the name DDBJ Read Arc-

hive, or DRA). NCBI and EBI have already begun

exchanging data, and once the DRA is in place at

DDBJ, there will be a regular data exchange be-

tween all three INSDC members.

In order to store and retrieve the enormous

amount of data generated by massively parallel se-

quencing technologies, NCBI, EBI and DDBJ

needed to create a data repository that has much of

the power of a relational database while being

lightweight, transportable and flexible like flat-file

storage. The solution was to create a hybrid relation-

al database with a file-based and column-oriented

design.

Within SRA the data are organized into four

types of records: studies (SRP accessions), experi-

ments (SRX accessions), samples (SRS accessions)

and runs (SRR accessions). Studies contain one or

more experiments, each of which contains one or

more runs, each of which in turn may contain data

on tens of millions of individual reads. The various

record types representing data from a study are all

linked to one another within Entrez

(www.ncbi.nlm.nih.gov/sra/), allowing users to

browse the data easily on the web.

2.2 GenBank – Nucleotide Sequence

Archive Database

GenBank (Benson et al., 2010) is a comprehensive

database that contains publicly available nucleotide

sequences for more than 300 000 organisms named

at the genus level or lower, obtained primarily

through submissions from individual laboratories

and batch submissions from large-scale sequencing

projects, including whole genome shotgun (WGS)

and environmental sampling projects. NCBI builds

GenBank primarily from the submission of sequence

data from authors and from the bulk submission of

expressed sequence tag (EST), genome survey se-

quence (GSS) and other high-throughput data from

sequencing centers. GenBank data is available at no

cost over the Internet, through FTP and a wide range

of web-based retrieval and analysis services.

2.3 Metadata: BioProject and

BioSample

BioProject. New technologies have significantly

increased the volume of data that can be generated

and submitted to archival database resources. Ge-

nome project is no longer limited to the genome

sequencing, assembly, and annotation. New types of

experimental studies include epigenomics, proteo-

mics, metabolomics and more ’omics’. Advances in

sequencing technologies have also changed the

scope of genomic studies; it became possible to

sequence multiple genomes of many different organ-

isms starting from hundreds of bacterial strains to

1000 human individuals. It is also possible to se-

quence microbial populations in their natural envi-

ronment without growing them in culture but by

sequencing the samples collected from the environ-

ment. Our view on genomic, metagenomics and

biomedical projects is rapidly changing. That affects

the way the data is organized and represented in

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

358

NCBI databases. BioProject database provides a

mechanism to access datasets that are otherwise

difficult to find. The definition of a set of related

data, a ‘project’ is flexible and supports the need to

define a complex project and various distinct sub-

projects using different parameters.

BioSample. The time, place and collection method

can profoundly affect the microbial composition in a

sample. Geographical location, biochemical charac-

teristic of the natural habitat, ecological and clinical

information needs to be captured and linked to the

sample data during the submission of the raw data. It

is highly important to develop a set of uniform stan-

dards for sample information to make future com-

parisons between different data sets easier and so

provide greater biological insight.

NCBI new BioSample database (http://www.

ncbi.nlm.nih.gov/biosample) is meant to support

sample descriptions and standard attributes for all

biological samples.

The new database provide a good infrastructure

for future submissions of sample information but a

common set of standard attributes is yet to be devel-

oped.

Example of BioSample record in Entrez:

Soil metagenome SRA sample SRS009922

Identifiers

SRA:SRS009922

Organism

Soil metagenome

unclassified sequences; metagenomes; ecological metagenomes

Attributes

No attributes

Submitter

JGI

Description

The tropical forest soil sample used for metagenome se-

quencing was collected in a subtropical lower montane

wet forest in the Luquillo Experimental Forest (18.30N,

65.83W), which is part of the NSF-sponsored Long-Term

Ecological Research program in Puerto Rico. The climate

in this region is relatively aseasonal, with mean annual

rainfall of 4500 mm and mean annual temperatures of

22C to 24C. Soils were collected from the Bisley wa-

tershed approximately 250 meters above sea level from

the 0-10 cm depth using a 2.5 cm diameter soil corer.

Sampling date: Summer 2008

ID: 8167

The new database provide a good infrastructure

for future submissions of sample information but a

common set of standard attributes is yet to be devel-

oped.

2.4 Developing Community Standards

for Metagenome Data

The astonishing increase in the amount of data gen-

erated by metagenomics projects that involve shot-

gun sequencing of all the organisms in an environ-

mental sample creates an unanticipated situation in

the field. Data storage and retrieval is becoming a

problem for current database designs, and compre-

hensive analysis of the metagenomics data, which is

far more complex analysis of a genome, is becoming

computationally intractable with existing resources

and pipelines. (see Nature Methods 6, 623 (2009)).

A single lab can no longer alone perform a compre-

hensive analysis of metagenomics data. The devel-

opment common standards would facilitate the data

exchange, sharing and comparisons of the results

across different groups.

The recently formed M5 (metagenomics, meta-

data, meta analysis, multi-scale models and meta

infrastructure) Consortium will be proposing a

promising solution, the 'M5 Platform', later this year.

The success of developing standards depends on the

ability of the public repositories and biologists gene-

rating the data to agree on common data models and

unified data formats. There is commitment, howev-

er, from GenBank, the European Molecular Biology

Laboratory's Nucleotide Sequence Database, and the

DNA Databank of Japan, to capture the metadata

and associate it with the genome records, in the

sequence records and in a project description.

3 REFERENCE SEQUENCE

COLLECTION

NCBI's Reference Sequence (RefSeq) is a public

database of nucleotide and protein sequences with

feature and bibliographic annotation. For more de-

tails see (Pruit et al, 2009).

3.1 Reference Microbial Genomes

Reference collection of complete microbial genomes

includes complete annotated genomes that can be

used as standards for microbial genome annotation,

WGS (Whole Genome Shotgun) genomes that

represent major taxonomic group in the absence of a

complete genome.

3.2 Reference Targeted Loci

Reference collection of targeted loci includes tar-

geted sequence regions that support specific report-

INFRASTRUCTURE FOR METAGENOME DATA MANAGEMENT AND ANALYSIS

359

ing or identification needs; for example, gene-

specific benchmarks that are used for identification

purposes. The small subunit ribosomal RNA (16S in

prokaryotes and 18S in eukaryotes) is a useful phy-

logenetic marker that has been used extensively for

evolutionary analyses. This project is the result of an

international collaboration with a number of ribo-

somal RNA databases that curate and maintain se-

quence datasets for these markers. The initial scope

of the project is to compare curated 16S markers that

correspond to type strains and near full length se-

quences from all contributing databases. Sequences

and taxonomic assignments that are in agreement in

all databases will have Reference Sequence records

corresponding to the original GenBank record. The

RefSeqs may contain corrections to the sequence or

taxonomy as compared to the original INSD submis-

sion, and may have additional information added

that is not found in the original. The Refseq Tar-

geted Loci web resource http://www.ncbi.

nlm.nih.gov/genomes/static/refseqtarget.html con-

tains comparison tool for different outside resources

of targeted loci data. One of the goals of the project

is to create a unique reference set that can be used by

many existing databases. The data are available for

download at NCBI ftp site ftp://ftp.ncbi.nih.

gov/genomes/TARGET/

4 ANALYSIS TOOLS

AND RESOURCES

4.1 Family of Standard BLAST

Programs

The BLAST programs (Altschul et al., 1990;

Altschul et al., 1997; Ye et al., 2006) perform se-

quence-similarity searches against a variety of nuc-

leotide and protein databases.

A special search program for genomic and meta-

genomic data MegaBLAST (Ye et al., 2006), is a

faster version of standard nucleotide BLAST de-

signed to find alignments between nearly identical

sequences, typically from the same species. It is

available through a separate web interface that han-

dles batch nucleotide queries and can be used to

search the rapidly growing Sequence Read Archive

as well as the standard BLAST databases. For rapid

cross-species nucleotide queries, NCBI offers Dis-

contiguous MegaBLAST, which uses a nonconti-

guous word match (Zhang et al., 2000) as the nuc-

leus for its alignments. Discontiguous MegaBLAST

is far more rapid than a translated search such as

blastx, yet maintains a competitive degree of sensi-

tivity when comparing coding regions. Sequence

read BLAST searches are now offered for transcript

and whole genome sequence data sets from 454

Sequencing systems, and regular expression pattern

matching against short reads of all types is now

possible.

4.2 Customized BLAST Databases

4.2.1 SRA BLAST

SRA data are rapidly dominating all other sequence

data. Already the number of DNA bases available in

SRA exceeds the number of bases in GenBank. In

fact the output of a single important project, the

1000 genomes project (www.1000genomes.org),

will produce more than 25 times the number of bases

that are currently in GenBank by the time the project

is completed. The NCBI and SRA will continue to

support submission, retrieval, and analyses of these

increasingly challenging and complex sequencing

data. Means of displaying data, analyses, and inte-

gration of SRA data with other molecular databases

will continue to improve making the SRA data a

prominent part of the discovery system at the NCBI.

In addition to text searches of the SRA experi-

ments through Entrez, NCBI also offers a nucleotide

BLAST service for sequence similarity searching of

454 sequencing reads for transcriptome studies. This

service is accessible from the “Specialized BLAST”

section of the BLAST Homepage.

4.2.2 Genomic BLAST

Genomic BLAST (Cummings et al., 2002), a novel

graphical tool for simplifying BLAST searches

against complete and draft genome assemblies. This

tool allows the user to compare the query sequence

against a virtual database of DNA and/or protein

sequences from a selected group of organisms with

finished or unfinished genomes. The organisms for

such a database can be selected using either a graph-

ic taxonomy-based tree or an alphabetical list of

organism-specific sequences. The first option is

designed to help explore the evolutionary relation-

ships among organisms within a certain taxonomy

group when performing BLAST searches.

4.2.3 Concise BLAST

The vast increase in genomic sequences has led to a

flood of data to the protein databases as well. Many

strain-specific genomes are now being sequenced

(for example Streptococcal genomes). The result can

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

360

Figure 1: 16S Ribosomal RNA Reference Sequence Similarity Search Beta release. This tool visualizes BLAST results of

the query sequence search by mapping them on a phylogenetic tree. Query: >NC_002162 |:145338-146803|16S ribosomal

RNA| [gene=rRNA_16S-1] [locus_tag=UUr01].

be an overwhelming amount of data to look through

when executing BLAST similarity searches. In order

to help alleviate both the processing of the data and

to present a broader taxonomic view, the concise

protein database was constructed. Web interface can

be accessed from Microbial Genomes home page or

directly at http://www.ncbi.nlm.nih.gov/genomes/

prokhits.cgi.

The database is constructed from the clusters of

related proteins [ref] Clusters may span a large tax-

onomic branch (kingdom) or may reside at a specific

node (family, genus, species, etc.). Clusters may

consist of many proteins, or be comprised of only

two proteins. From this entire set of clusters, genus-

specific clusters are used for this database. From the

proteins at the genus-level, one (randomly selected)

is chosen as a representative for the Concise Micro-

bial Protein BLAST database and will be found in

BLAST queries. The other proteins in the cluster are

automatically linked to this representative and will

also be found in the search results, although without

the BLAST score and E-value as they are not specif-

ically examined. All proteins that do not belong to

the genus-level clusters are also added to the data-

base for completeness. The result is faster

processing times and reduced load on the database.

The broader taxonomic view will help eliminate

some of the redundancy that is found when many

proteins of closely related organisms are found in

BLAST results.

4.3 Analysis of 16S Ribosomal

RNA Data

Similarity search give you a list of the 10 top

BLAST hits as well as the position of the hits on the

phylogenetic tree. Coursing tree visualization algo-

rithms developed at NCBI allow showing trees for

large datasets with various levels of the resolution.

5 CONCLUSIONS

NCBI provides a basic infrastructure for the se-

quence data and a framework for metadata that de-

scribes project, study, and sample. However, com-

mon standards for the metadata and a new data mod-

el for metagenome sequence data have yet to be

INFRASTRUCTURE FOR METAGENOME DATA MANAGEMENT AND ANALYSIS

361

developed. A special interest group (SIG M3) at

ISMB meeting had brought together researchers

collecting samples for metagenomic analysis with

those building the computational infrastructure re-

quired to fully exploit them with those thinking

about the implementation of standards. This discus-

sion initiated by Genomic Standards Consortium -

GSC

(http://gensc.org/gc_wiki/index.php/Main_Page) is a

good start towards developing standards for metage-

nome data that will be supported by major databases

and utilized through already existing NCBI infra-

structure.

REFERENCES

Sayers E. W. et al.: Database resources of the National

Center for Biotechnology Information. Nucleic Acids

Res. 2010 Jan; 38 (Database issue): D5-16.

Shumway M.: The Sequence Read Archive (SRA) – A

worldwide resource. Nucleic Acids Res. 2010 Jan; 38

(Database issue): D.

Benson D. A., Karsch-Mizrachi I., Lipman D. J., Ostell J.,

Sayers E. W.: GenBank. Nucleic Acids Res. 2010 Jan;

38 (Database issue): D46-51.

Pruitt K. D., Tatusova T., Klimke W., Maglott D. R.:

NCBI Reference Sequences: current status, policy and

new initiatives. Nucleic Acids Res. 2009 Jan; 37 (Da-

tabase issue): D32-6.

Altschul S. F., Gish W., Miller W., Myers E. W., Lipman

D. J.: Basic local alignment search tool. J. Mol. Biol.

1990; 215: 403-410.

Altschul S. F., Madden T. L., Schaffer A. A., Zhang J.,

Zhang Z., Miller W., Lipman D. J.: Gapped BLAST

and PSI-BLAST: A new generation of protein data-

base search programs. Nucleic Acids Res. 1997;

25:3389-3402.

Ye J., McGinnis S., Madden T. L.: BLAST: improvements

for better sequence analysis. Nucleic Acids Res. 2006;

34: W6-W9.

Zhang Z., Schwartz S., Wagner L., Miller W.: A greedy

algorithm for aligning DNA sequences. J. Comput.

Biol. 2000; 7: 203-214.

Cummings L., Riley L., Black L., Souvorov A., Resen-

chuk S., Dondoshansky I., Tatusova T.: Genomic

BLAST: custom-defined virtual databases for com-

plete and unfinished genomes. FEMS Microbiol Lett.

2002 Nov 5; 216 (2): 133-8.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

362