A Cloud-based GWAS Analysis Pipeline for Clinical Researchers
Paul Heinzlreiter 1,2, James Richard Perkins 3, Oscar Torreño 5,1, Johan Karlsson 5,6, Juan Antonio Ranea 5, Andreas Mitterecker 7,1, Miguel Blanca 4 and Oswaldo Trelles 5,1
1 RISC Software GmbH, Softwarepark 35, 4232 Hagenberg, Austria
2 Leibniz Supercomputing Centre (LRZ), Boltzmannstr. 1, 85748 Garching, Germany
3 Research Laboratory, University Hospital-IBIMA, Málaga, Spain
4 Allergy Unit, University Hospital-IBIMA, Málaga, Spain
5 Department of Computer Architecture, University of Málaga, Málaga, Spain
6 Integromics S.L., Avenida de la Innovación 1, 18100 Armilla, Granada, Spain
7 Institute of Bioinformatics, Joh. Kepler University Linz, Linz, Austria
Keywords:
Cloud Computing, Bioinformatics, Biomedicine.
Abstract:
The cost of obtaining genome-scale biomedical data continues to drop rapidly, and many hospitals and universities are now able to produce large amounts of data. Managing and analysing such ever-growing datasets is becoming a crucial issue. Cloud computing presents a good solution to this problem due to its flexibility in obtaining computational resources. However, it is essential to allow end-users with no experience to take advantage of the cloud computing model of elastic resource provisioning. This paper presents a workflow that allows the end-user to perform the core steps of a genome-wide association analysis, in which raw genotyping data is quality assessed. A number of steps in this process are computationally intensive and vary greatly depending on the size of the study, from a few samples to a few thousand. Cloud computing is therefore an ideal fit, since elastic resource provisioning allows the analysis to scale with the study. The key contributions of this paper are a real-world application of cloud computing that addresses a critical problem in biomedicine by parallelizing the appropriate parts of the workflow, and a system that takes care of the computational aspects so that the end-user can concentrate on data analysis and the biological interpretation of results.
1 INTRODUCTION
This paper presents genCloud, a workflow allowing
the end-user to employ various state of the art tools
for the analysis of genome wide association studies
(GWAS). A workflow describes a series of activi-
ties as part of a process. These activities are typi-
cally linked and information is transferred from one
activity to another. Such links represent dependen-
cies (for example, one activity must occur before an-
other). Currently, running these GWAS applications
requires computational infrastructure and knowledge
of specific tools, including the command line, script-
ing, and the R computational language. These tools
often require an amount of computational resources
greater than what is typically available in a moderate-size hospital, which is a typical environment where such an analysis takes place. The workflow described
here presents a contribution to the areas of cloud com-
puting and biomedicine allowing end-users with basic
computational knowledge, such as clinical scientists,
to analyse their own experiments. This is possible
because the workflow is made available to end-users
through the jORCA software client (Martin-Requena
et al., 2010). By presenting a case study of such an analysis using real-life data, we apply cloud computing to solve an important biomedical problem in a real-world situation. Our biomedical use case is
particularly well suited to cloud computing because it
can require very different amounts of computational
resources depending on the size of the analysis.
In terms of the underlying cloud infrastructure
the work presented in this paper offers several dis-
tinguishing characteristics as compared to standard
cloud installations:
- User authentication based on grid technology.
- Specific support for transferring big datasets
  through grid computing protocols.
- Enactment of specific biomedical and bioinformatics workflows through a user-friendly software client, which significantly lowers the level of computational experience required by the end-users.
1.1 Applying Cloud Computing for
Workflows
Cloud computing (Mell and Grance, 2011) specifi-
cally addresses requirements of the biomedical work-
flow solution described here:
Dynamic Instantiation of Additional Resources
through Elasticity: The capacity needed for run-
ning user applications can easily be increased or
decreased either automatically through a schedul-
ing component or following a user request without
requiring interaction with the service provider.
Flexible Configuration of Workflow Modules with-
out Administrator Intervention: This approach
is specifically viable in a scientific environment,
where the users often rely on specific libraries
and software packages to perform their research.
The domains of biomedicine and bioinformatics
which are addressed within the scope of the work
described here are no exception to this rule: For
example the statistical software environment R (R
Development Core Team, 2008) is heavily used in
these research domains.
Storing Readily Configured Instances for Reuse through Snapshotting Mechanisms:
An infrastructure-as-a-service (IaaS) approach
is used giving the user the utmost flexibility by
providing access to fully configurable virtual
instances. Similarly the instances involved in
an analysis workflow can be stored as snapshots
and later reused, allowing the user to recreate the
analysis setup when data sources are updated (for
example for new genome releases), or to allow
other users to recreate their original analysis.
Such reproducibility of published results is a
recurrent problem in biomedical research (Button
et al., 2013) (Mobley et al., 2013).
1.2 Related Work
Considerable work has been done in the field of bioin-
formatics in connecting independent web-services
(WS) into higher-level workflows solving more com-
plex problems.
Taverna (Oinn et al., 2004) is a widely used tool in bioinformatics for composing web-services into workflows and enacting them. The stand-alone tool allows users to access various web-service repositories and to connect the services graphically. It is important to note that Taverna is not specific to cloud computing environments; instead, it focuses on web-services, which could in turn be deployed on cloud computing environments.
Galaxy (Afgan et al., 2010) is an open, web-based
platform for biomedical and bioinformatics research.
The tool allows users to create and execute workflows
either from the web browser or through the API pro-
vided. Users can register and use the system on a
public Galaxy server. Additionally, it is possible to de-
ploy the tool easily on top of commodity hardware
or in cloud environments such as Amazon EC2 and
OpenStack (Le Bras and Chilton, 2013). CloudBioLinux (Krampis et al., 2012) and CloudMan (Afgan et al., 2012) allow researchers to easily and quickly get access to a functional compute infrastructure. This infrastructure can be configured in a matter of minutes using the CloudMan tool and terminated when it is no longer required. The infrastructure provides access to a set of bioinformatics tools, as specified in the CloudBioLinux virtual machines. After the configuration of the compute infrastructure, the Galaxy workflow engine is deployed. With the Galaxy tool deployed and the compute infrastructure configured, end-users can focus on creating analysis workflows and running them, rather than on the specific details of configuration and deployment.
2 WORKFLOW
This section describes the biomedical workflow and
its underlying technologies in detail. Starting from the workflow itself, the focus then shifts to the underlying cloud technology applied during its execution, such as data handling and user authentication. While the specific workflow as described in
section 2.1 serves as proof of concept, the underly-
ing cloud, data transfer, and authentication solutions
as well as the client software presented in section 2.5
can also be used for other computational workflows
and are not limited to the bioinformatics and biomed-
ical domains.
2.1 GWAS Workflow
An overview of the bioinformatics workflow is given
in Figure 1. The user begins by uploading CEL files
to the cloud data storage. Each CEL file contains the raw data generated from a single nucleotide polymorphism (SNP) microarray analysis for a single patient; an SNP is an area of variation in the human genome, and the raw data contains the details needed to determine the patient's genotype. The amount of data
involved can range from a few megabytes to poten-
tially terabytes in size, depending on the number of
patients included in the study. Once the CEL files
have been uploaded, the birdseed algorithm (Korn
et al., 2008), which is implemented in the Affymetrix
Power Tools package, is executed. It produces a ta-
ble of genotype calls for each of the different probes
on the microarray. Each probe represents a differ-
ent SNP. Next, this table is filtered to remove SNPs that do not vary across different patients and SNPs that could not be called accurately by the birdseed algorithm. This is achieved through the use of a Python
script that parses the output of the birdseed algorithm.
Once this table has been filtered, an additional Python
script must be called to convert the birdseed output,
which is specific for Affymetrix data, into a variant
call format (VCF) file, the standard representation of
a genotype for biomedical data. This script must be
called separately for each CEL file, which is specific
for each patient. Therefore it is an embarrassingly
parallel problem and as such very suitable for a data-
parallel execution on cloud infrastructures. Finally
the user can download the set of VCF files and can
then apply standard web-based tools to view the data,
or further packages for downstream analysis. Given
that this is a standard file format for several types of
genomic biomedical data, it is the starting point for
many well established further workflows. Several of
these workflows are currently under development.
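As an illustration of how the per-patient conversion step can be fanned out, the following Python sketch distributes the VCF conversion across worker processes on a single instance. It is a minimal sketch only: the script name convert_to_vcf.py, its command-line options and the directory layout are placeholders, not the actual components of the published pipeline.

    import glob
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def convert(cel_file):
        """Convert the genotype calls derived from one CEL file into a VCF file."""
        vcf_file = cel_file.replace(".CEL", ".vcf")
        # convert_to_vcf.py stands in for the conversion script of the workflow.
        subprocess.check_call(["python", "convert_to_vcf.py",
                               "--input", cel_file, "--output", vcf_file])
        return vcf_file

    if __name__ == "__main__":
        cel_files = glob.glob("filtered_calls/*.CEL")
        # Each file belongs to a single patient and is processed independently,
        # so the conversions can run concurrently.
        with ProcessPoolExecutor(max_workers=4) as pool:
            for vcf in pool.map(convert, cel_files):
                print("produced", vcf)

The same independence is what allows the conversion to be spread across several cloud instances instead of local processes, as described in section 2.6.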
The workflow uses already available software for all of its stages. Its specific novelty lies in its execution on a community cloud environment, whose functionality has been extended specifically with regard to the data transfer between the stages of the workflow.
It should also be noted that the workflow as pre-
sented here is designed to analyse data produced us-
ing the Affymetrix SNP chip microarray platform.
However it is also possible to adapt the workflow to
other types of input data by modifying instances of
the genCloud cloud images or creating new ones as
necessary to implement related workflows processing
data from different platforms, as well as sequencing
data.
Taken together, these raw data analysis work-
flows and different downstream analysis workflows
will provide a flexible, easy to use service enabling a
wide variety of potential analysis workflows, for var-
ious genomics technologies.
Figure 1: The GWAS workflow (upload of CEL files; Birdseed algorithm via Affymetrix Power Tools; QC Python script to clean the Birdseed output; conversion to the VCF variant call format).
Although the workflow as shown in Figure 1 only contains a small number of steps, it is computationally quite involved and contains the key steps of an
initial GWAS data analysis. Also the different steps
have varying computational requirements. Therefore
we have chosen to implement this workflow as a first
step for testing the different tools available and ensur-
ing its validity for more complex analysis.
2.2 Cloud Computing Infrastructure
The cloud computing infrastructure is provided on
a small community cloud installation running Open-
Stack (Pepple, 2011) as cloud middleware. Besides
offering IaaS cloud services, as an improvement be-
yond the state of the art the cloud installation has been
extended to support efficient transfer of large data sets
through the GridFTP protocol (Allcock et al., 2003).
In addition, the data storage is built on top of the dis-
tributed file system Ceph (Weil et al., 2006a), which
is accessed by OpenStack through the RADOS Gate-
way (Ceph Team, 2012), which implements the storage API used by the OpenStack services. Through this gateway, Ceph is exposed to OpenStack services such as object containers and persistent volumes. The object containers can
be used as storage for files similar to a directory in
a file system. However OpenStack containers cannot
be nested. Persistent volumes on the other hand are
equivalent to ordinary hard disks and can be dynami-
cally attached to running OpenStack instances.
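Since the object containers are exposed through a Swift-compatible API, they can also be accessed programmatically. The following minimal Python sketch, assuming the python-swiftclient library and purely illustrative endpoint, credential and container names, shows how a file could be placed into and retrieved from a container; it is not the project's actual configuration.

    import swiftclient

    conn = swiftclient.Connection(
        authurl="https://cloud.example.org:5000/v2.0",  # Keystone endpoint (assumed)
        user="demo_user",
        key="demo_password",
        tenant_name="gwas_tenant",   # the OpenStack tenant, i.e. the VO mapping
        auth_version="2")

    conn.put_container("gwas-input")              # containers cannot be nested
    with open("patient01.CEL", "rb") as f:
        conn.put_object("gwas-input", "patient01.CEL", contents=f)

    # Read the object back; get_object returns (headers, body).
    headers, body = conn.get_object("gwas-input", "patient01.CEL")
    print(len(body), "bytes retrieved")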
To glue all of these components together, standard
web-service protocols have been used. To enable the
transfer of mass data into OpenStack object contain-
ers they have been made accessible for the GridFTP
data transfer protocol by mounting them into the local
file system of a GridFTP server. The same mounting
approach is applied for providing access to the con-
tainers from inside the running virtual instances.
This underlying infrastructure enables the config-
uration and execution of arbitrary workflows such as
the genCloud workflow discussed in this paper.
ACloud-basedGWASAnalysisPipelineforClinicalResearchers
389
2.3 Data Storage and Transfer
The Ceph distributed file system (Weil et al., 2006a)
is used as the basic data storage system in our cloud
setup. Its main advantage is given by the fact that it
does not have a single point of failure due to not re-
quiring a specific master node. The storage server is
determined by the CRUSH data distribution mecha-
nism (Weil et al., 2006b).
To make the Ceph storage accessible to the Open-
Stack services, a web-service enabled access mecha-
nism is exposed through the RADOS Gateway, which
implements the web-service interfaces defined by the
Amazon S3 (Amazon Web Services, 2013) and Open-
Stack Swift storage services on top of Ceph.
In addition to the web-service based data access
mechanisms, we have configured the OpenStack con-
tainers to be accessible for the mass data transfer pro-
tocol GridFTP. Since a GridFTP server operates on
a local file system, the object containers - which are
normally accessed through web-service calls - need to
be mounted into the local file system of the GridFTP
server.
This is achieved by the Cloudfuse daemon, which
makes the files within the object container accessible
through POSIX file system calls.
The user authentication services represent another
crucial part of the setup of the GridFTP server. They
are performed using Grid credentials, namely X.509
certificates. These can be either long-lived personal
user credentials or short-lived credentials generated by a service such as MyProxy (Novotny et al., 2001) upon user request.
Our setup of GridFTP works with short-lived cre-
dentials, which are generated by MyProxy upon a user
request and can be used by the GridFTP service to
authenticate itself towards remote servers whilst per-
forming the data transfer requested by the user.
To ensure a reliable transfer of the data between
different GridFTP servers, we are relying on the
GlobusOnline service (Allen et al., 2011), which
manages the data transfers on the user’s behalf, as
soon as a GridFTP server has been registered as an
endpoint of the service.
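To make the credential and transfer handling more concrete, the sketch below first obtains a short-lived proxy credential from a MyProxy server and then triggers a GridFTP transfer between two endpoints. Host names, user names, paths and the exact command-line options are illustrative assumptions, and in our setup the transfer itself is normally delegated to GlobusOnline rather than invoked directly in this way.

    import subprocess

    myproxy_server = "myproxy.example.org"    # assumed MyProxy host
    username = "gwas_user"                    # assumed user name

    # Retrieve a short-lived proxy certificate valid for 12 hours;
    # myproxy-logon prompts for the passphrase on the terminal.
    subprocess.run(
        ["myproxy-logon", "-s", myproxy_server, "-l", username,
         "-t", "12", "-o", "/tmp/x509_proxy"],
        check=True)

    # Transfer a CEL file from a hospital-side GridFTP server to the
    # GridFTP server that exposes the mounted OpenStack container.
    subprocess.run(
        ["globus-url-copy", "-vb", "-p", "4",
         "gsiftp://gridftp.hospital.example.org/data/study01/patient01.CEL",
         "gsiftp://gridftp.cloud.example.org/containers/gwas-input/patient01.CEL"],
        check=True)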
2.4 User Authentication
The user authentication for the whole OpenStack in-
stallation is performed through the Lightweight Di-
rectory Access Protocol (LDAP) (Howes and Smith,
1995) thus allowing for an easy and central manage-
ment of the local credentials for all services. The
management of the OpenStack users follows the cen-
tral grid computing (Foster and Kesselman, 2003)
concept of a Virtual Organization (VO) (Foster et al.,
2001), which groups users from different organizations who are, for example, working on a shared project. Within OpenStack the VOs are mapped onto
user groups called tenants, thus enabling the partici-
pants of one project to easily share resources among
themselves.
The VO concept is used to assign the users to spe-
cific OpenStack tenants and also to ease the manage-
ment of access rights, which are assigned at the level
of VOs. A good example is given by the access to
the local GridFTP server, enabling users to access the
service and to use GlobusOnline to transfer data to
the local OpenStack containers. All the access rights
based on the VOs are represented within the central
LDAP tree of the installation, which is subsequently
queried by the different services during user authenti-
cation.
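The following sketch, assuming the python-ldap library and a purely illustrative directory layout (the actual LDAP tree of the installation is not reproduced here), shows how a service could check whether a user belongs to the VO that is allowed to use the GridFTP endpoint.

    import ldap

    def user_in_vo(uid, vo_name):
        """Return True if the given user is a member of the given VO group."""
        # Server address, bind DN and group schema are assumptions for illustration.
        conn = ldap.initialize("ldap://ldap.cloud.example.org")
        conn.simple_bind_s("cn=reader,dc=example,dc=org", "reader_password")
        try:
            results = conn.search_s(
                "ou=groups,dc=example,dc=org",
                ldap.SCOPE_SUBTREE,
                "(&(cn=%s)(memberUid=%s))" % (vo_name, uid),
                ["cn"])
            return len(results) > 0
        finally:
            conn.unbind_s()

    if user_in_vo("jdoe", "vo_mrsymbiomath"):
        print("access to the GridFTP endpoint granted")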
2.5 Client Software
jORCA (Martin-Requena et al., 2010) is a user-
friendly software client for using web-services. The
application was specifically designed to help users
take advantage of computational resources made
available as web-services, i.e. discover web-services,
display available parameters, request information and
finally execute the web-service. To enable all these
activities, jORCA uses metadata repositories (i.e.
containers of meta-information), with information
about available web-services and data types. The
MAPI library (Karlsson and Trelles, 2013) provides
the unification of metadata information. By using this
library, it is possible to extend the execution function-
ality of jORCA with components called workers.
2.5.1 Web-service Invocation and Data Transfer
Support for invoking a RESTful WS frontend from jORCA has also been implemented, providing an interface between jORCA and the underlying cloud
infrastructure. This interface provides operations to
submit new jobs, to cancel previous ones, to poll for
status and to retrieve intermediate and final results.
To submit a new job, the user fills in the required WS parameters as references to data already uploaded to the cloud data storage. Once the frontend receives the job submission, it contacts the job scheduler to dispatch the new task. Depending on these parameters, the scheduler decides whether it needs to create a new instance and whether some idle instances can be deleted. The scheduler will
use euca2ools (Debian Wiki, 2012) to perform the de-
scribed operations.
CLOSER2014-4thInternationalConferenceonCloudComputingandServicesScience
390
At the end of the job submission the front-end returns a new unique URL resource to the user, through which the job can be cancelled and polled for its status, intermediate results and, when ready, the final result. The intermediate and final results are data references to results stored on the data storage. The user can now
choose to download the data from the data storage
(using the jORCA plugin developed) and/or submit
the result data references as input to another WS. This
greatly facilitates the invocation of a series of WS as
a workflow because the intermediate data is already
available on the infrastructure for a subsequent WS.
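A client-side interaction with such a REST frontend could look like the following Python sketch. The endpoint URL, resource paths and JSON field names are invented for illustration, since the concrete interface is described by the jORCA metadata rather than documented here.

    import time
    import requests

    FRONTEND = "https://frontend.cloud.example.org/api"   # hypothetical endpoint

    # Submit a job that references data already stored in a container.
    job = requests.post(FRONTEND + "/jobs", json={
        "service": "birdseed_genotyping",
        "inputs": {"cel_files": "swift://gwas-input/study01/"},
    }).json()
    job_url = job["url"]          # unique resource returned by the frontend

    while True:
        status = requests.get(job_url).json()["status"]
        if status in ("finished", "error"):
            break
        time.sleep(30)            # poll every 30 seconds

    if status == "finished":
        # The result is a reference to data in the cloud storage, which can be
        # downloaded or passed directly as input to the next web-service.
        print(requests.get(job_url + "/result").json())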
The potentially large input and output data
sizes when invoking web-services deployed on the
cloud infrastructure made it essential to use a well-
established protocol to transfer large data sets reli-
ably and securely. The Globus Online (GO) initia-
tive (Allen et al., 2011) uses GridFTP to transfer large
amounts of data. Additionally, GO provides easy-to-
install software for client-side transfer of data and also
supports long-running data transfers where GO medi-
ates the data transfer. The actual data is transferred di-
rectly between two GridFTP servers being registered
as GO endpoints; GO only monitors and controls the
transfer.
jORCA can initiate a GO-mediated data transfer
either upon user request or automatically if the meta-
data of the web-service to be invoked indicates that it
requires the input data to be transferred via GO prior
to invocation. To transfer data from the local com-
puter running jORCA, it is necessary to previously
install and configure the Globus Connect software,
which acts as a local GO endpoint. The transfer status (running, finished, or error) is displayed to the end-user through jORCA. Once finished, the neces-
sary information about the uploaded file is transferred
to the service which, in turn, will use GO to move the
data if necessary.
2.6 Mapping the Workflow onto the
Cloud Infrastructure
The biomedical workflow described in Section 2.1 consists of
a sequence of computational steps, which are inter-
connected by data stored in POSIX-compliant files.
The output of one step is stored on the file system,
from where it is read by the next module within the
workflow.
To enable a flexible execution of the workflow on
top of the above-described cloud infrastructure and
across multiple cloud instances, the intermediate data
files are being stored in OpenStack containers.
The current cloud setup enables the containers to
be mounted on the local file system of the running
instances. Therefore the existing programs and scripts
accessing local files do not need to be adapted to run
on the cloud infrastructure. When a user wishes to
execute the workflow, the first step is to upload the
input data and store it in a container.
Since the initial input data will typically come
from an external source, the GO service can be in-
voked through jORCA to transfer the data to the local
cloud installation.
Sequential parts of the workflow consisting of
multiple consecutive stages can either be collocated
on one instance or distributed across multiple in-
stances depending on their computational load.
The parts of the workflow that are executed in parallel are replicated across multiple instances running concurrently.
The instances monitor the availability of their in-
put data: as soon as the corresponding input dataset is
available in the specified container, the instance starts
to copy the input data from the container to its local
hard disk and to subsequently execute its computa-
tional algorithm. As soon as a group of collocated
stages on one instance is finished, the output is written
to a container, enabling the next stage of the workflow
running on a different instance to proceed by consum-
ing the newly produced intermediate data in the same
way.
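The per-instance behaviour described above can be summarised in a small Python sketch. The mount point, marker file name and the command being executed are placeholders standing in for the actual shell script of the prototype.

    import shutil
    import subprocess
    import time
    from pathlib import Path

    CONTAINER = Path("/mnt/gwas-container")   # object container mounted via Cloudfuse
    LOCAL = Path("/scratch/work")             # instance-local scratch disk

    def wait_for(marker):
        """Block until the previous workflow stage has written its marker file."""
        while not (CONTAINER / marker).exists():
            time.sleep(60)

    wait_for("filtering.done")
    LOCAL.mkdir(parents=True, exist_ok=True)
    for f in CONTAINER.glob("filtered/*"):
        shutil.copy(f, LOCAL / f.name)        # local disk access is faster than the container

    subprocess.check_call(["./run_step.sh", str(LOCAL)])   # placeholder for the actual step

    (CONTAINER / "results").mkdir(exist_ok=True)
    for out in LOCAL.glob("*.vcf"):           # publish results for the next stage
        shutil.copy(out, CONTAINER / "results" / out.name)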
3 RESULTS
The workflow described in section 2.1 has been exe-
cuted on a small community cloud installation using
the technologies described in section 2.6.
The user starts by uploading the input data for the
workflow, such as the SNP and CEL data, to an Open-
Stack container using GO through the jORCA client.
To perform the data processing workflow, a sim-
ple bash-shell script has been set up which calls the
executables performing the computational as well as
data management steps within the workflow.
The infrastructure enables the containers to be
mounted into the local file system of a running virtual
instance, and thus the data access methodology of the
original non-cloud-aware workflow modules remains
valid.
Initial runtime measurements have shown that the last step of the workflow, converting the Birdseed output into a VCF file, is by far the most time-consuming (see Tables 1, 2, and 3). A typical use case involves processing a large number of CEL files, and the VCF conversion of a single CEL file takes significantly longer than the rest of the workflow combined.
Given these facts, it is clear that the parallelization of
ACloud-basedGWASAnalysisPipelineforClinicalResearchers
391
the conversion to VCF is key to improving the pro-
cess.
Therefore the following parallelization approach
has been realized:
- All steps before the VCF conversion are executed sequentially on one instance.
- The VCF conversion is parallelized at the file level by assigning the conversion tasks for different files to different machine instances.
The workflow starts with the sequential component
retrieving all the input data required for the birdseed
algorithm from the OpenStack container and storing it
on a local disk within the instance. It is important to
note that while the data is directly accessible through
the mounted container, the copy operation speeds up
the computational step, which only has to access a lo-
cal disk as compared to web-service calls for access-
ing the OpenStack container.
After the data has been made available locally the
birdseed algorithm and the filtering of its results are
performed in sequence before the output of the filter-
ing step is again stored in the container.
Making the filtered data accessible within the con-
tainer enables the data-parallel execution of the final
VCF generation step across different cloud instances.
For the initial prototype all instances are execut-
ing the same shell script and accessing the same con-
tainer. The existence of a specific file marks whether
the sequential part of the workflow is still active.
While this is the case the other instances taking
part in the parallel VCF generation load the required
input data for the last workflow step from the con-
tainer onto their local hard disks. Thus this data trans-
fer overlaps in time with the computation of the first
steps of the workflow.
Since the input data for the VCF generation in-
cludes human genome data of significant size, this
overlapping reduces the overall processing time of the
workflow.
As soon as the previous steps of the workflow have
been completed, the responsible instance also loads
the input data for the last step. Meanwhile the other
instances can already start the computation within the
last step of the workflow. As mentioned before, the
completion of the previous workflow steps is indi-
cated by the existence of a specific file in the shared
container, which can be tested by the script being ex-
ecuted by all instances.
To properly distribute the files that need to be converted to the VCF format among the involved instances, the same lock-by-file approach is used:
Within a loop, the script iterates over the set of
files to be converted, and adds a corresponding lock
file to the container marking each file that is currently
being or has already been processed. If an input file is
already locked by another instance, it is skipped and
the local instance continues with the next file avail-
able.
While this approach does not guarantee the avoidance of concurrent access to the same file, it has been chosen as the synchronization mechanism for this first workflow prototype due to its simplicity.
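A Python rendering of this lock-by-file loop could look as follows; the prototype itself uses a shell script, and the directory names, lock suffix and conversion script are illustrative placeholders.

    import subprocess
    from pathlib import Path

    SHARED = Path("/mnt/gwas-container")      # container shared by all instances
    (SHARED / "locks").mkdir(exist_ok=True)
    (SHARED / "vcf").mkdir(exist_ok=True)

    for cel in sorted(SHARED.glob("filtered/*.CEL")):
        lock = SHARED / "locks" / (cel.name + ".lock")
        if lock.exists():
            continue                          # another instance has claimed this file
        lock.touch()                          # claim the file (not atomic, as noted above)
        subprocess.check_call(["python", "convert_to_vcf.py",
                               "--input", str(cel),
                               "--output", str(SHARED / "vcf" / (cel.stem + ".vcf"))])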
3.1 Test Runs
Test runs of this setup have been performed based on
the following input data set:
- 8 CEL files with 66 MB each
- Human genome data slightly bigger than 3 GB
The following cloud resources have been used to per-
form the computations:
- 5 instances with 4 GB of RAM and two Intel Core2 Duo CPUs at 2.4 GHz each. However, only one CPU per instance was used for the execution.
One of the instances was used to perform the sequen-
tial part and afterwards all five instances processed the
eight CEL files in parallel.
Table 1: Runtimes measured for the sequential part.
    Workflow Step                 Runtime
    Input data upload             62 s
    Birdseed algorithm            263 s
    Filtering SNPs                26 s
    Storing Result in Container   3 s
    Sum                           354 s
Within the parallel execution the data upload has
only been performed once for each instance. The up-
loading times for the input data for the sequential and
parallel parts of the workflow are different due to the
different types of input data of the different workflow
stages.
The result data upload in tables 2 and 3 refers to
the result data of the sequential part of the workflow.
Table 2: Maximum runtimes measured for the parallel part.
    Workflow Step         Max Runtime
    Genome data upload    558 s
    Result data upload    8 s
    VCF conversion        5207 s
    Sum                   5773 s
CLOSER2014-4thInternationalConferenceonCloudComputingandServicesScience
392
Table 3: Mean runtimes measured for the parallel part.
    Workflow Step         Mean Runtime
    Genome data upload    547.80 s
    Result data upload    2.80 s
    VCF conversion        4970.33 s
    Sum                   5430.93 s
3.2 Application of the Workflow to Real
Data
We have shown an example usage scenario for this workflow, where it was used to analyse the genotypes of patients with cross-intolerance to non-steroidal anti-inflammatory drugs, one of the most important drug allergies, which can lead to potentially fatal reactions (Ayuso et al., 2013). A group of over 100 patients were genotyped, along with over 100 control subjects. More details on the data and its analysis are given in (Cornejo-García et al., 2013). This approach
led to the discovery of a number of SNPs that show
a strong association with the pathology. These SNPs
are currently being further investigated in a new co-
hort of patients and through the use of in vitro exper-
imental assays.
4 DISCUSSION AND
CONCLUSIONS
We have presented a novel cloud-based workflow,
genCloud, which allows an end-user to perform a
number of important and computationally intensive
steps in the analysis of genome wide association data.
We believe that close collaboration between target
end-users and developers is essential for the devel-
opment of such a service and to ensure that user de-
mands are met. We also believe that cloud-based systems are the most suitable and flexible solutions for
the computational needs of clinical research groups
in hospitals and medical research groups for two main
reasons:
- They require minimal computational know-how
from the end-user in terms of installation, admin-
istration and scripting. The user can therefore fo-
cus on understanding the details of the tools them-
selves and the biological interpretation of the re-
sults, rather than their implementation.
They can also be scaled according to the user’s
needs: the analysis presented here focussed on
a small, proof of principle analysis, with 8 sam-
ples. However, some GWAS experiments pro-
duce 100s, even 1000s of samples, requiring more
computational resources. This isn’t an issue for
a cloud based system: the cloud instances started
for the analysis can be as large and as numerous
as required. The upload of the genome data - rep-
resenting the biggest input dataset by far - is only
required once. Afterwards multiple sets of SNP
files can be processed by the workflow.
It should be made clear that the workflow described
here is only able to handle data from the Affymetrix
SNP chip microarray platform. Given the ever grow-
ing popularity of whole genome sequencing data, as
well as its ever reducing cost, the next step will be
to extend the workflow for this kind of data. A typical high-coverage whole genome sequencing file can exceed 100 GB per individual. Extracting the genotypes from these files can take around a week and requires around 1 TB of temporary hard disk space per individual. As this step is com-
pletely independent for each individual it will benefit
greatly from a cloud-based implementation, since this
will enable flexible scaling and parallelisation. More-
over, given that the output of the analysis of this data
is also a VCF file, it could be combined with the same
downstream analysis workflows that are used for the
current data.
The runtime measurements presented in Section 3.1 have been collected from instances that were started manually by the user; the actual workflow was then invoked by starting the workflow script on each instance through an interactive shell. While
the automatic invocation of instances stored as snap-
shots has already been performed for other applica-
tions on the same infrastructure it has not been ap-
plied yet for the workflow described here. The auto-
matic deployment of new instances will be realised as
part of the scheduler component of the infrastructure,
which is still in the design phase.
Future work will focus on the integration of the above-mentioned cloud technologies and our community cloud with the Galaxy workflow engine, subsequently enabling us to provide a much better and simpler user experience to medical and research staff who are not experts in cloud computing and workflows. As an additional step, we will investigate possibilities for the automatic exploitation of cloud elasticity to enable scalable workflows that can handle different input sizes.
ACKNOWLEDGEMENTS
This publication is supported by the European Com-
munity through the FP7 IAPP project Mr. Sym-
BioMath, grant agreement number 324554.
ACloud-basedGWASAnalysisPipelineforClinicalResearchers
393
REFERENCES
Afgan, E., Baker, D., Coraor, N., Chapman, B., Nekrutenko,
A., and Taylor, J. (2010). Galaxy cloudman: deliv-
ering cloud compute clusters. BMC Bioinformatics,
11(Suppl 12)(S4).
Afgan, E., Chapman, B., Jadan, M., Franke, V., and Taylor,
J. (2012). Using cloud computing infrastructure with
cloudbiolinux, cloudman, and galaxy. Current Proto-
cols in Bioinformatics.
Allcock, W., Bester, J., Bresnahan, S., Plaszczak, P., and
Tuecke, S. (2003). Gridftp: protocol extensions to ftp
for the grid. Technical Report GFD-R-P.020, Open
Grid Forum. Proposed Recommendation.
Allen, B., Bresnahan, J., Childers, L., Foster, I., Kan-
daswamy, G., and Kettimuthu, R. (2011). Globus
online: Radical simplification of data movement via
saas. Technical Report Preprint CI-PP-5-0611, Com-
putation Institute, The University of Chicago.
Amazon Web Services (2013). Amazon simple storage ser-
vice (amazon s3). http://aws.amazon.com/s3/.
Ayuso, P., Blanca-López, N., Doña, I., Torres, M., Guéant-Rodriguez, R., Canto, G., Sanak, M., Mayorga, C., Guéant, J., Blanca, M., and Cornejo-García, J. (2013). Advanced phenotyping in hypersensitivity drug reactions to nsaids. Clinical and Experimental Allergy, 43(10):1097–1109.
Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E., and Munafò, M. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5):365–376.
Ceph Team (2012). Rados gateway - ceph documentation.
http://eu.ceph.com/docs/wip-3060/radosgw/.
Cornejo-García, J., Liu, B., Blanca-López, N., Doña, I., Chen, C., Chou, Y., Chuang, H., Wu, J., Chen, Y., Plaza-Serón, M., Mayorga, C., Guéant-Rodríguez, R., Lin, S., Torres, M., Campo, P., Rondón, C., Laguna, J., Fernández, J., Guéant, J., Canto, G., Blanca, M., and Lee, M. (2013). Genome-wide association study in nsaids-induced acute urticaria/angioedema in spanish and han-chinese populations. Pharmacogenomics. in press.
in press.
Debian Wiki (2012). euca2ools - debian wiki.
https://wiki.debian.org/euca2ools.
Foster, I. and Kesselman, C., editors (2003). The Grid 2:
Blueprint for a New Computing Infrastructure. Else-
vier.
Foster, I., Kesselman, C., and Tuecke, S. (2001). The
anatomy of the grid: Enabling scalable virtual organi-
zations. International Journal of Supercomputer Ap-
plications, 3(15).
Howes, T. and Smith, M. (1995). A scalable, deployable di-
rectory service framework for the internet. Technical
Report UM-CITI 95-7, University of Michigan.
Karlsson, J. and Trelles, O. (2013). Mapi: a software frame-
work for distributed biomedical applications. Journal
of Biomedical Semantics, 4(4).
Korn, J., Kuruvilla, F., McCarroll, S., Wysoker, A.,
Nemesh, J., Cawley, S., Hubbell, E., Veitch, J.,
Collins, P., Darvishi, K., Lee, C., Nizzari, M., Gabriel,
S., Purcell, S., Daly, M., and Altshuler, D. (2008). In-
tegrated genotype calling and association analysis of
snps, common copy number polymorphisms and rare
cnvs. Nature Genetics, 10(40):1253–1260.
Krampis, K., Booth, T., Chapman, B., Tiwari, B., Bicak, M., Field, D., and Nelson, K. (2012). Cloud biolinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics, 13(1):42.
Le Bras, Y. and Chilton, J. (2013). Deploying pro-
duction galaxy instances on openstack with
cloudbiolinux and cloudman. https://www.e-
biogenouest.org/resources/243.
Martin-Requena, V., Rios, J., Garcia, M., Ramirez, S., and
Trelles, O. (2010). jorca: easily integrating bioinfor-
matics web services. Bioinformatics, 26(4):553–559.
Mell, P. and Grance, T. (2011). The nist definition of cloud
computing. Technical Report 800-145, National Insti-
tute of Standards and Technology.
Mobley, A., Linder, S., Braeuer, R., Ellis, L., and Zwelling,
L. (2013). A survey on data reproducibility in can-
cer research provides insights into our limited ability
to translate findings from the laboratory to the clinic.
PLoS One, 8(5). e63221.
Novotny, J., Tuecke, S., and Welch, V. (2001). An online credential repository for the grid: Myproxy. In Proceedings of the Tenth International Symposium on High Performance Distributed Computing.
Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M.,
Greenwood, M., Carver, T., Glover, K., Pocock, M.,
Wipat, A., and Li, P. (2004). Taverna: a tool for the
composition and enactment of bioinformatics work-
flows. Bioinformatics, 20(17):3045–3054.
Pepple, K. (2011). Deploying OpenStack. O’Reilly Media,
first edition.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Weil, S., Brandt, S., Miller, E., Long, D., and Maltzahn,
C. (2006a). Ceph: A scalable, high-performance dis-
tributed file system. In Proceedings of the 7th Sym-
posium on Operating System Design and Implementa-
tion, pages 307–320.
Weil, S., Brandt, S., Miller, E., and Maltzahn, C. (2006b).
Crush: controlled, scalable, decentralized placement
of replicated data. In Proceedings of the 2006
ACM/IEEE Conference on Supercomputing.
CLOSER2014-4thInternationalConferenceonCloudComputingandServicesScience
394