An HPC Application Deployment Model on Azure Cloud for SMEs

Fan Ding

1,2

, Dieter an Mey

, Sandra Wienke

, Ruisheng Zhang

and Lian Li

School of Mathematics and Statistics, Lanzhou University, Lanzhou, China

Center for Computing and Communication, RWTH Aachen University, Aachen, Germany

School of Information Science & Engineering, Lanzhou University, Lanzhou, China

Keywords: Azure Cloud, MPI, HPC, Azure HPC Scheduler, SMEs.

Abstract: With the advance of high-performance computing (HPC), more and more scientific applications which

cannot be satisfied by on-premises compute power need large scale of computing resources, especially for

small and medium-sized enterprises (SMEs). Emerging cloud computing offerings promise to provide us

with enormous on-demand computing power. Many cloud platforms have been developed to provide users

with various kinds of computer and storage resources. The user only needs to pay for the required resources

and does not need to struggle with the underlying configuration of the operation system. But it is not always

convenient for a user to migrate on-premises applications to these cloud platforms, which is especially true

for an HPC application. In this paper, we proposed an HPC application deployment model based on the

Windows Azure cloud platform, and developed an MPI application case on Azure.

1 INTRODUCTION

High-performance computing (HPC) dedicates big

processing power to compute-intensive complex

applications for scientific research, engineering and

academic subjects. From the mainframe era to

clusters and then grid, more and more available

computing resources can be employed by HPC

applications on-premises. Nowadays we have come

to the times of cloud computing. There are some

differences to provide users with the resources

between on-premises and cloud. On-premises

require users to invest much cost in purchasing

equipment and software for building their

infrastructures at their initial development. This may

be a challenge for some SMEs (small and medium-

sized businesses) since they have not enough

capability to invest. With cloud computing, users

can obtain on-demand resources from cloud virtual

servers not only for the basic infrastructure but also

for the extra resources which used to process

complex work cannot be finished by the

enterprise/researcher competence of its own.

Moreover, the massive data centers in the cloud can

meet the requirement of data-intensive applications.

By means of these advantages of the cloud

paradigm, it will be an inevitable trend to migrate

the HPC applications into the cloud.

Nowadays, there are many cloud service

platforms provided by different cloud vendors. The

major cloud platforms include Amazon’s Elastic

Compute Cloud (EC2) (see Amazon), IBM

SmartCloud (see IBM), Google Apps (see Google

Apps) and Microsoft’s Azure cloud (see Windows

Azure). Users can choose the different kind of cloud

platform according to which level of cloud service

they need. Generally, they provide three types of

cloud services, SaaS (Software as a Service), PaaS

(Platform as a Service) and IaaS (Infrastructure as a

Service), also called service models. These

platforms with their support for HPC have been

summarized in Table 1.

From the point of view of a user, who is not

familiar with complex computer configuration, it is

difficult to migrate existing on-premises applications

into the cloud because of the differences and the

complex configuration of the user interface in these

cloud platforms. Some methods or middleware

which enable users to use cloud platform easily are

required by scientists and other potential users of

cloud resources. Much work has been done to study

how to take an existing application into the cloud.

The work presented in paper (Costaa and

Cruzb,2012) is similar to ours, but the authors

focused on moving a web application to Azure.

CloudSNAP also was a web deployment platform to

253

Ding F., an Mey D., Wienke S., Zhang R. and Li L..

An HPC Application Deployment Model on Azure Cloud for SMEs.

DOI: 10.5220/0004412202530259

In Proceedings of the 3rd International Conference on Cloud Computing and Services Science (CLOSER-2013), pages 253-259

ISBN: 978-989-8565-52-5

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

deploy Java EE web applications into a Cloud

infrastructure (Mondéjar, Pedro, Carles, Lluis,

2013). In paper (Marozzo, Lordan, Rafanell, Lezzi

and Badia, 2012), the authors developed a

framework to execute an e-Science application on

Azure platform through expending COMPSs

programming framework. But all of these

aforementioned efforts did not consider parallel

computing for the compute-intensive applications in

HPC.

Table 1: Major current cloud platforms.

Service

vendor

Cloud

Platform

Description

Support

for HPC

Amazon EC2 (IaaS)

Provides

scalable web

services,

enables

users to

change

capacity

quickly.

Cluster

Compute

and

Cluster

GPU

Instances

IBM

SmartCloud

(IaaS )

Allows users

to operate

and manage

virtual

machines

and data

storage

according to

their need

IBM

HPC

clouds

Microsoft

Windows

Azure

(PaaS)

Composed

of Windows

Azure, SQL

Azure and

Azure

AppFabric

Azure

HPC

Scheduler

Google

Apps

(SaaS)

Provisions

web-based

applications

(Gmail,

Google Talk

and Google

Calendar)

and file

storage

(Google

Docs)

Google

Compute

Engine

(see

hpccloud)

In our paper, we take the advantage of Azure to

develop a cloud deployment model through

expending the Azure HPC scheduler (see HPC

scheduler). Windows Azure is an open cloud

platform developed by Microsoft on which users can

build, deploy and manage applications across a

global network of the Microsoft datacenter. Our

work is based on the project ‘Technical Cloud

Computing for SMEs in Manufacturing’. In this

project, the application “ZaKo3D” (Brecher,

Gorgels, Kauffmann, Röthlingshöfer, Flodin and

Henser, 2010) developed by WZL (the Institute for

Machine Tools and Production Engineering, RWTH

Aachen University) (see RWTH WZL), which aims

to do FE-based 3D Tooth Contact Analysis, is a

high-performance technical computing software tool

based on simulation of the tooth contact. It reads

several geometry data of the flank and a FE-Model

of a gear section as the software’s variation and then

performs a set complex of variation computations.

As a result of the variation computations, one gets

contact distance, loads and deflections on the tooth.

Such a variation computation includes e.g.

thousands of variants to be processed which lead to

computing times of around months on a single

desktop PC. Currently, this challenge is addressed

by employing on-premises

HPC resources are available at RWTH Aachen

University, e.g. if one variant takes 1 hour

computing time, the entire variant time for 5000

variations would take 5000 hours computing time.

This exceeds by far the capabilities even of multi-

socket multi-core workstations, but can well be

performed in parallel by an HPC cluster. But the

small and medium-sized businesses in general can

neither access those, nor do they maintain similar

capabilities themselves. As described earlier, the

availability of HPC resources in the cloud with a

pay-on-demand model may significantly change this

picture.

In this paper, we also developed a case which

deploys the “ZaKo3D” application on Azure

according to our deployment model. This use case

convinces SMEs to adopt cloud computing to

address this computational challenge. Moreover, this

application is a serial version, as outlined before, to

execute a large scale of variation calculation for a

long time. It is a challenge for us to design an

optimal method to improve the computing

efficiency. We have figured out a framework to

deploy the application on Azure which parallelizes

the variation computation and run the application on

Azure by using our deployment model.

The rest of the paper is structured as follows. In

Section 2, we introduce the Azure cloud and its

module Azure HPC Scheduler. Section 3 describes

our problem statement. Followed by our deployment

model and framework for parallel execution of the

Zako3D application on Azure presented in Section

4. In Section 5, we compare the runtime between on-

premises and cloud for HPC by deploying our

application on the RWTH Cluster and Azure. This

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

254

comparison can be used as a reference point for

potential users to consider whether employ the on-

demand resources. Finally, Section 6 concludes the

paper with a summary and future work.

2 WINDOWS AZURE AND HPC

SCHEDULER

In this section we introduce the Azure cloud

platform and related technologies that we have used

in the development of the deployment model

architecture. The Windows Azure platform was

announced by our collaborative partner Microsoft in

2010. In our work, they provided us the Azure cloud

accounts for initial development and testing. This

platform includes Windows Azure, SQL Azure and

AppFabric. Our work focuses on Windows Azure,

which is a Platform as a Service offering and

provides us the compute resources and scalable

storage services. We employ Windows Azure HPC

Scheduler to deploy the serial ZaKo3D by means of

an MPI-based framework developed by us, to the

Azure compute resources. For data management, the

Azure storage blob service facilitates the transfer

and storage of massive data in the execution of our

MPI application on the Azure cloud.

2.1 The Three Roles of Windows Azure

Windows Azure provides the user with three types

of roles to develop a cloud application: Web Roles,

Work Roles and VM Roles. Web Roles aim to

display websites and present web applications,

supported by IIS. Work Roles are used to execute

tasks which require the compute resources. Work

roles can communicate with other roles by means of

Message queues as a choice in various techniques.

The VM Role differs from the other two roles in that

it acts as an IaaS to provide services instead of PaaS.

The VM Role allows us to run a customized instance

of Windows Server 2008 R2 in Windows Azure. It

facilities migrate with some application, which is

difficult to bring to cloud, into the cloud.

2.2 HPC Scheduler

for High-performance Computing

in Azure Cloud

Microsoft developed the HPC Scheduler to support

running HPC applications in the Azure cloud.

Compute resources are virtualized as instances on

Windows Azure. When an HPC application requires

an Azure instance to execute, it means the work will

be divided into lots of small work items, all running

in parallel on many virtual machines simultaneously.

The HPC Scheduler allows scheduling this kind of

applications built to use the Message Passing

Interface (MPI) and distributes their works across

some instances. The deployment build with

Windows Azure SDK includes a job scheduling

module and a web portal to submit job and resource

management.

The role types and the service topology can be

defined when creating a service model in

configuring cloud hosting service. HPC Scheduler

supports Windows Azure roles through offering

plug-ins. There are three types of nodes which

provide different function and runtime support.

 Head node: Windows Azure work role with

HpcHeadNode plug-in, provides job scheduling

and resource management functionality.

 Compute node: Windows Azure work role with

HpcComputeNode plug-in, provides runtime

support for MPI and SOA.

 Front node: Windows Azure web role with

HpcFrontEnd plug-in, provides web portal (based

on REST) as the job submission interface for HPC

Scheduler.

Visual studio has been specified as the

development environment for this component.

2.3 Azure Storage Blob Service

Azure storage service provides data storage and

transfer to applications in Windows Azure and

supports multiple types of data: binary, text data,

message and structured data. It includes three types

of storage service:

 Blobs (Binary Large Objects), the simplest way for

storing binary and text data (up to 1TB for each

file) that can be accessed from anywhere in the

world via HTTP or HTTPS.

 Tables, for storing non-relational data using

structured storage method.

 Queues, for storing messages that may be accessed

by a client, and communicate messages between

two roles asynchronously.

In our deployment model, we employ the Blob

storage service to manage our application’s data

because ZaKo3D uses text data for input files and

output files. As described in Figure 1, there are three

layers in the concept of Blob storage. In order to

store data into the Windows Azure with the blob

storage service, a blob storage account has to be

created which can contain multiple containers. A

AnHPCApplicationDeploymentModelonAzureCloudforSMEs

255

container looks like a folder in which we place our

blob items.

Figure 1: Three layers concept of Blobs storage

3 PROBLEM STATEMENT

As described in the introduction, ZaKo3D is a

software package, part of the WZL Gear toolbox,

and has been used to process the FE-based 3D Tooth

Contact Analysis. It reads several geometric data of

the flank and a FE-Model of a gear section as the

software’s variation. The results of the execution are

the contact distance, loads and deflections on the

tooth.

Figure 2: ZaKo3D variation file.

The variant computing will process a large numbers

of variants. For example, there are 8 deviations at

pinion and gear in the input data of one analysis, and

each deviation with 4 values, so the number of

variants the ZaKo3D needs to process is, 48= 65536

variants. Calculating such large amounts of variants

one by one on a single PC, it would take way too

long. Obviously, this traditional way does not work

well. Our method is to split these variants and then

compute them on different work units in the cloud in

parallel. Figure 2 represents the variants description

in the parameter input file. We consider ZaKo3D as

an HPC application and develop an automatic

parallel framework to distribute the parameter file

over a fixed number of cloud nodes and execute the

application in parallel.

4 HPC APPLICATION

DEPLOYMENT MODEL

AND PARALLEL

FRAMEWORK

In order to deploy our HPC application on the Azure

cloud platform, we developed an HPC application

deployment model as described in Figure 3.

Furthermore, a parallel framework based on MPI

was developed to ensure effective and efficient

execution of the application. The function of the

framework includes parsing variation file and

distributing the variations on Azure nodes in

parallel.

The deployment model combines on-premises

resources and cloud resources. The scientist deploys

the application binary through an on-premises

cluster server or any windows desktop based on

HPC Scheduler. All computing tasks will be

processed by the Azure cloud computing resources.

4.1 Move Application to Cloud

Firstly, the head node and compute nodes on Azure

have to be configured with Azure HPC Scheduler.

The number of compute nodes as worker role in

Azure is allowed to be set from the deployment

interface according to the user’s requirement.

Secondly, after configuration for the cloud hosting

service, there are two methods to move an

application onto the configured Azure instance: from

a local server and from a head node on the Azure

portal. We focus on the way of using the head node.

Here we have three steps: 1. Move the HPC

application pack on the head node. 2. Upload the app

package to Azure blob storage. 3. Synchronize all

compute nodes with the app package. These

operations result in each compute instance

maintaining a copy of the HPC application.

4.2 Parallel Framework and HPC Job

Scheduler

As described in Figure 3, the deployment module

takes the application binary to Azure with a parallel

framework based on MPI as the job scheduler. This

framework aims to distribute and run in parallel the

application binary onto Azure nodes. We take the

ZaKo3D for example of HPC application. ZaKo3D

is a serial application. It will take a long time for

executing a number of parameters due to a large

numbers of variations computing involved. Look

back to the problem statement, the variation file

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

256

Figure 3: HPC application deployment architecture.

Figure 4: Variation distribution workflow.

needs to be divided and distributed to Azure nodes.

When an HPC job in the cloud was scheduled,

firstly, the framework will perform a distribution

workflow for parsing variation file and distributing

variations can be seen in Figure 4, in order to ensure

each node gets same number of variations, the

variation distribution workflow adopts loop method

to distribute the variations to Azure nodes based

MPI. After completing the variation distribution

workflow, each MPI process

will get a part of

variations which is an average of total variation, and

then merge these obtained variations into a sub

parameter file on different Azure nodes owned by

the MPI process. This process is depicted by the job

flow in Figure 3.

After job submission with our parallel

framework, the MPI job will execute on Azure

which needs to configure windows firewall setting to

allow these MPI sub jobs to communicate across

Azure node. The application binary with the sub

parameter file is deployed on each compute node

called Azure worker instance, and then executed on

these allocated nodes in parallel. Windows Azure

deals with load balancing for us, so we do not need

to handle this on our own.

Three methods can be used to submit an HPC job

by means of this model.

1. Azure portal: Azure provides us with a job portal

in the HPC scheduler. Through this portal, we

can manage all jobs, submit a new job or delete a

job, or view the status of a job.

2. Command prompt on Azure node: job

submission API is supported by Azure job

submission interface similar as in Windows HPC

Server 2008 R2 through a remote connection on

the command prompt of an Azure node.

3. HpcJobManager on Azure node: an interface for

submitting and monitoring jobs, similar to the

HPC Job Manager on cluster.

4.3 Data Management

Data management, described by data flow in Figure

3, is supported by a sub module which dedicates to

manage the application data on Blob storage, and

gather output results from each compute node. The

AnHPCApplicationDeploymentModelonAzureCloudforSMEs

257

application data, which includes input files, library

files and the application executable file, is

synchronized on each compute node. As a result, all

compute instances get a copy of the application. For

gathering the work result, the results generated on

each compute node are merged and then copied to

the Blob Container by this module. Afterwards,

results can be viewed and downloaded from the

Azure web portal.

5 PERFORMANCE ANALYSIS

We have conducted a set of experiments on both

Azure platform and the RWTH Cluster to compare

the difference between cloud resource and on-

premises. Moreover, we hope to evaluate whether

SMEs can profit from cloud’s advantage in HPC.

We deployed ZaKo3D application with the

developed parallel framework using our developed

deployment model with the same number of

variations (120) on the two different platforms. We

distribute the work on respectively 1, 2, 3, 4, 5, 6, 8,

10, 12, 15 compute nodes of Azure and cluster (the

number of instances to distribute variations must be

the divisor of the number of variations in our case),

this means that each experiment submits a job on

different numbers of processors. It should be pointed

out that there are some differences regarding the

deployment on these two platforms: After the

deployment of the application, all Azure nodes have

a copy of the application data automatically,

whereas using the cluster, we need to create a copy

of the application data manually for each compute

node.

Figure 5: Runtime of ZaKo3D on Cluster and Azure

nodes.

We gathered our results on up to 15 nodes of the

Windows part of the RWTH Compute Cluster, each

node containing 2 Intel Xeon E5450 8-core CPU

running at 3.00 GHz with 16 GB memory. The

Windows Azure platform can supply us a hosted

service with max 20 small virtual instances as

compute nodes with Quad-Core AMD Opteron

Processors at 2.10 GHz, 1.75 GB of memory.

Figure 5 presents the runtimes on different

number of nodes of Cluster and Azure. Due to the

different node configuration, we can see that the

Azure’s curve is always above the cluster one.

Azure’s performance cannot catch up with the

cluster. But as the number of compute nodes

increases, the two curves will probably get closer as

depicted with 15 nodes in Figure 5. Furthermore,

from Figure 6, we can see that scaling is well for a

small number of nodes. However, due to the design

of the application with a portion of 1.6% sequential

code, we are restricted by Amdahl’s law and could

get a maximum speedup of 58. We assume that

applications with a higher portion of parallel code

may scale well on Azure nodes for a high number of

nodes. This indicates that cloud have good

scalability and cloud’s power can support HPC

application’s execution under the circumstances that

user’s inadequate on-premises resources cannot

satisfy the requirements of large scale of compute-

intensive applications.

Figure 6: Scalability of ZaKo3D execution on different

number of Cluster and Azure nodes.

6 CONCLUSIONS

In this paper, we have presented a cloud deployment

model for an HPC application. Moreover, a parallel

framework for the HPC application ZaKo3D has

been developed which enables the application to run

on a number of cloud nodes, thus easing the

CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience

258

migrating process of HPC application from existed

on-premises resources into the cloud. The advantage

of running HPC applications in the cloud

environment is that using on-demand cloud

resources can reduce the cost of maintenance and

save on purchase of the software and equipment.

This work can give a reference to SMEs (small

and medium-sized enterprises) to develop their HPC

applications for cloud environments. We have to

point out that although cloud can leverage the

enterprise’s HPC application development, current

cloud power can only be used to supply to the status

when an organization does not have enough on-

premises resources to support its development, due

to the capability of current cloud cannot catch up

with on-premises HPC resources.

Considering our future research, for the parallel

framework, we will make efforts to decrease the

overhead in the parallel scheduler. Furthermore,

based on our performance analysis of the cluster and

the cloud, in the next step we will investigate the

price of cluster and cloud, through comparing

differences between these two platforms, figure out

an available rule for users to make the best decision

to choose HPC platforms in rational combination of

the price and performance within their capability.

REFERENCES

Amazon. Amazon’s Elastic Compute Cloud.

http://aws.amazon.com/ec2/.

Brecher, C., Gorgels, C., Kauffmann, P., Röthlingshöfer,

T., Flodin, A., Henser, J., 2010. ZaKo3D-Simulation

Possibilities for PM Gears. In World Congress on

Powder Metal, Florenz.

Costaa, P., Cruzb, A., 2012. Migration to Windows Azure-

Analysis and Comparison. In CENTERIS 2012 -

Conference on ENTERprise Information Systems /

HCIST 2012 – International Conference on Health

and Social Care Information Systems and

Technologies. Procedia Technology 5 ( 2012 ) 93 –

102.

Google Apps. http://www.google.com/apps/intl/en/

business/cloud.html.

Hpccloud. Google Enters IaaS Cloud Race.

http://www.hpcinthecloud.com/hpccloud/2012-07-03/

google_enters_iaas_cloud_race.html.

HPC scheduler. Windows Azure HPC scheduler,

http://msdn.microsoft.com/en-

us/library/hh560247(v=vs.85).aspxWindows Azure.

http://www.microsoft.com/windowsazure/.

IBM. IBM SmartCloud. http://www.ibm.com/cloud-

computing/us/en/.

Mondéjar, R., Pedro, ., Carles, P., Lluis, P., 2013.

CloudSNAP: A transparent infrastructure for

decentralized web deployment using distributed

interception. Future Generation Computer Systems 29

(2013) 370–380.

Marozzo, F., Lordan, F., Rafanell, R., Lezzi, D., Talia D.,

and Badia, R., 2012. Enabling Cloud Interoperability

with COMPSs. In Euro-Par 2012, LNCS 7484(2012)

16–27.

RWTH WZL. http://www.wzl.rwth-aachen.de/en/

index.htm.

AnHPCApplicationDeploymentModelonAzureCloudforSMEs

259