2 BACKGROUND
DNA sequencing is the task of obtaining the bases
(A, C, G and T) of the so-called short-read sequences
(SRS), which are fragments of the chromosomes of
one or more organisms in a genome project.
DNA sequencing was strongly improved by the new
technologies introduced with automatic sequencers
such as Illumina and 454 Roche, among others. These
sequencers are capable of producing millions of SRS
from one or more genomes in a single sequencing
round, with SRS lengths from 30 to 1,000 bases,
depending on the adopted technology.
To illustrate the volume of data to be analyzed in a
genome project, we present two examples. Filichkin
et al. (2010) worked with approximately 271 million
SRS sequenced by Illumina, each fragment with 32
bases. These SRS were mapped to the relatively short
genome of the plant Arabidopsis thaliana
(≈ 120 million bases), with the objective of
identifying alternative splicing. Other research
groups (Pan et al., 2008; Sultan et al., 2008) aimed
to identify alternative splicing from about
15 million SRS, mapping them to the human genome of
approximately 3 billion bases.
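To make the first example concrete, the figures quoted above can be turned into a raw base count and a nominal read depth. The snippet below is only a back-of-the-envelope sketch: the function names are ours, the depth is simply total sequenced bases divided by genome size, and it assumes that every SRS maps to the reference. The second example is left out because the text does not give a read length for it.

```python
def total_bases(num_reads: int, read_length: int) -> int:
    """Total number of sequenced bases for fixed-length reads."""
    return num_reads * read_length


def nominal_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Sequenced bases per reference base, assuming every read maps."""
    return total_bases(num_reads, read_length) / genome_size


# Figures quoted in the text for the Arabidopsis thaliana example.
reads = 271_000_000          # approximately 271 million SRS
read_length = 32             # bases per SRS
genome_size = 120_000_000    # approximately 120 million bases

bases = total_bases(reads, read_length)
depth = nominal_depth(reads, read_length, genome_size)

print(f"sequenced bases: {bases:,}")    # 8,672,000,000
print(f"nominal depth:   {depth:.1f}x")  # 72.3x
```

Even for this relatively short genome, a single experiment therefore yields several billion bases to be stored and mapped.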
In this scenario, many projects have been developed
to use cloud computing in bioinformatics applications
(Schatz, 2009; Langmead et al., 2010; Wall et al.,
2010; Angiuoli et al., 2011; Pratt et al., 2011;
Zhang et al., 2011). However, working with a single
cloud environment can be restrictive, mostly due to
execution on the infrastructure of private companies,
the absence of mechanisms to handle failures, and the
fact that current clouds are not flexible enough to
allow unanticipated usages, which are common in
bioinformatics.
Federation is a particular research area in cloud
computing. Celesti et al. (2010) note that
middleware implementations (OpenQRM, 2011; Bresnahan
et al., 2011; Nurmi et al., 2009) lack federation
features, and explore general concepts about
cross-cloud federation. Virtualization in cloud
computing allows computational resources to be
increased and IT service costs to be reduced by
hiding the underlying infrastructure behind a logical
layer between the physical infrastructure and the
computational processes. In a cross-cloud federation,
each cloud provider can enlarge its virtualization
resources by requesting further computing and storage
capabilities from other clouds, transparently to
users. Celesti et al. also point out that, despite
its clear advantages, implementing a cross-cloud
federation is not trivial: clouds are heterogeneous
and dynamic, whereas federation models are designed
for static environments, and agreements among the
partners are required to create the federation.
In a previous work, we proposed BioNimbus (Saldanha
et al., 2011), a cloud computing infrastructure for
managing bioinformatics tasks, designed to be
flexible and fault tolerant. The objective was to
offer the illusion that computational resources are
unlimited or, in other words, that users' demands for
computation or storage space will always be met.
To reach this objective, we improved
BioNimbus using the federated cloud model
(Figure 1). The infrastructure allows physically
separate platforms, each modelled as a cloud, to be
integrated, which means that independent,
heterogeneous, private or public clouds providing
bioinformatics applications can be combined into a
single federated cloud. In BioNimbus, the resources
of each user can be used to their maximum, but if
more are required, other clouds can be requested to
participate, in a transparent way, so that the
virtual resources of BioNimbus are enlarged by the
computational and data storage capabilities of all
the clouds forming the federation. A plug-in maps the
communication between each cloud provider joining
the federation and the management services, thus
providing a simple and efficient way to include a new
cloud provider in BioNimbus.
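The idea of one plug-in per provider, with requests spilling over to other members of the federation, can be sketched as follows. This is a minimal illustration under our own assumptions and not the BioNimbus implementation: the names CloudPlugin, InMemoryCloud and Federation are hypothetical, resources are reduced to a CPU count, and the "home cloud first, then the others in registration order" policy is chosen only for simplicity.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CloudPlugin(ABC):
    """Hypothetical adapter between one cloud provider and the core services."""

    name: str

    @abstractmethod
    def free_cpus(self) -> int:
        """Number of CPUs this provider can still offer."""

    @abstractmethod
    def allocate(self, cpus: int) -> None:
        """Reserve `cpus` CPUs on this provider."""


class InMemoryCloud(CloudPlugin):
    """Toy provider used only for illustration; no real cloud API is called."""

    def __init__(self, name: str, cpus: int) -> None:
        self.name = name
        self._free = cpus

    def free_cpus(self) -> int:
        return self._free

    def allocate(self, cpus: int) -> None:
        if cpus > self._free:
            raise ValueError(f"{self.name}: only {self._free} CPUs free")
        self._free -= cpus


class Federation:
    """Pools the resources of every registered plug-in."""

    def __init__(self) -> None:
        self._clouds: List[CloudPlugin] = []

    def register(self, plugin: CloudPlugin) -> None:
        # Adding a provider only requires a plug-in for it.
        self._clouds.append(plugin)

    def total_free_cpus(self) -> int:
        return sum(c.free_cpus() for c in self._clouds)

    def request(self, home: CloudPlugin, cpus: int) -> Dict[str, int]:
        """Use the home cloud first, then spill over to the other members."""
        if cpus > self.total_free_cpus():
            raise RuntimeError("federation cannot satisfy the request")
        placement: Dict[str, int] = {}
        remaining = cpus
        others = [c for c in self._clouds if c is not home]
        for cloud in [home] + others:
            take = min(remaining, cloud.free_cpus())
            if take > 0:
                cloud.allocate(take)
                placement[cloud.name] = take
                remaining -= take
            if remaining == 0:
                break
        return placement


private = InMemoryCloud("private", cpus=8)
public = InMemoryCloud("public", cpus=32)
federation = Federation()
federation.register(private)
federation.register(public)

# A request for 12 CPUs exhausts the home cloud and borrows 4 elsewhere.
placement = federation.request(home=private, cpus=12)
print(placement)
```

From the user's point of view the request is a single call; which member of the federation serves each part of it is decided behind the plug-in interface.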
Management services are implemented in the
BioNimbus core, which offers computational resources
such as virtual machines, data storage and networks.
A web interface was created to facilitate
communication with users. The main services of the
BioNimbus core are:
• Discovery Service: identifies service providers
and consolidates information about storage ca-
pacity, processing, network latency and resource
availability;
• Monitoring Service: verifies whether a requested
service is available in a cloud provider, searching
for another cloud in the federation if it is not;
receives the tasks to be executed from the job
controller and sends them to the scheduling service
to be distributed, guaranteeing that all the tasks of
a process are actually executed; informs the job
controller when a task executes successfully;
• Storage Service: coordinates the storage strategy
of the files consumed and produced by the exe-
cuted tasks, deciding about distribution, replica-
tion and file access control among the services;
• Security Service: guarantees integrity among the
distinct tasks executed in the federated clouds;
• Fault Tolerance Service: guarantees that all the
core services are always available. To this end, it
sends messages to all the services and, if any of
them does not respond, initiates an election
algorithm to run that service again on another
machine;