the node’s characteristics, the operations to perform
upon them and the system architecture.
Executing the RADL through the IM prepares the whole environment (PS and MR nodes) to run the training script. The complete version of the resource allocation configuration files and code is available on GitHub (https://github.com/JJorgeDSIC/Master-Thesis-Scalable-Distributed-Deep-Learning/) and is thoroughly described in (Jorge Cano, 2019).
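To give a flavour of how such a deployment can be triggered programmatically, the following minimal sketch submits an RADL document to an IM server through its REST interface. The endpoint, credentials file and RADL file names are placeholders, and the exact header layout may vary between IM versions:

import requests

# Hypothetical IM endpoint and input files (placeholders).
IM_URL = "http://im.example.org:8800"
AUTH_LINES = open("auth.dat").read()   # IM authorization lines (cloud credentials)
RADL = open("cluster.radl").read()     # RADL describing the PS and MR nodes

# Request the creation of the infrastructure; the IM replies with its URI.
resp = requests.post(
    f"{IM_URL}/infrastructures",
    data=RADL,
    headers={"Authorization": AUTH_LINES, "Content-Type": "text/plain"},
)
resp.raise_for_status()
print("Infrastructure created:", resp.text)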
We have relied on an Ansible role to deploy the Hadoop cluster in order to use HDFS as shared data storage among the nodes.
Once this step is completed, the Hadoop cluster is fully deployed and we can proceed with data handling, as well as with configuring and installing TF. Amazon S3 is used as the initial storage for the data: the dataset is retrieved from S3 and staged into HDFS once the Hadoop cluster is available, although the data can come from any external source.
After deploying the Hadoop cluster, the distributed TF script is prepared, installed and executed using another Ansible role. The set of parameters for the PS node is similar to that of the MR nodes, with some changes according to their function, such as the node type or whether a GPU is used. The role involves four parts: first, the preparation of the environment, setting paths and additional variables; second, the conditional installation of a specific TF version depending on whether the node has a GPU; third, cloning the actual training scripts and running them according to the role of the node in the cluster; and finally, uploading the resulting model to an S3 bucket so that it persists beyond the lifecycle of the dynamically provisioned Hadoop cluster, although any external platform or service could be selected to store the model.
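As a rough sketch of this final step, the exported model directory can be pushed to S3 with the AWS SDK for Python (boto3); the local export path and bucket name below are placeholders:

import os
import boto3

# Placeholder locations; the actual bucket and export path depend on the deployment.
EXPORT_DIR = "/tmp/model_export"
BUCKET = "my-model-bucket"

s3 = boto3.client("s3")

# Upload every file of the exported model, preserving its directory layout.
for root, _, files in os.walk(EXPORT_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, EXPORT_DIR)
        s3.upload_file(local_path, BUCKET, key)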
3.3 Distributed TensorFlow Training
This section describes how the distributed TF code was adapted to be used in our pipeline. Distributed training in TF follows a scheme where one or more nodes act as PS nodes and the rest act as MR nodes. The code is available on GitHub (https://github.com/JJorgeDSIC/DistributedTensorFlowCodeForIM).
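At its lowest level, this scheme corresponds to a TF cluster specification in which every process knows the addresses of all jobs and its own role. The following minimal sketch, with hypothetical host names, illustrates the scheme; in our pipeline this wiring is handled through the Estimator API and the TF_CONFIG variable described below:

import tensorflow as tf

# Hypothetical cluster layout: one parameter server and two workers (MR nodes).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.org:2222"],
    "worker": ["worker0.example.org:2222", "worker1.example.org:2222"],
})

# Each process starts a server for its own job name and task index.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # A PS node simply serves variables until the job finishes.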
This work uses the Estimator API in TF, which provides high-level functions that encapsulate the different parts of the machine learning pipeline: model, input, training, validation and evaluation. Because these steps share common signatures, training and inference become easier to manage, as the whole pipeline is decomposed into isolated functions. By defining the pipeline in these terms, it is entirely managed by TF, without having to run iterations, evaluation steps, logging or model saving manually.
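A minimal sketch of this decomposition is shown below. The model and input functions are placeholders (in our setting the input data would come from HDFS), but the overall structure corresponds to what the Estimator API expects:

import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder model: a single dense layer over the input features.
    logits = tf.layers.dense(features["x"], units=10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)

def input_fn():
    # Placeholder input pipeline with random features and dummy labels.
    dataset = tf.data.Dataset.from_tensor_slices(
        ({"x": tf.random_uniform([1000, 32])}, tf.zeros([1000], dtype=tf.int32)))
    return dataset.repeat().batch(64)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/model")
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn)

# TF drives training, evaluation, checkpointing and logging from these definitions.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)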
Another significant advantage is having a single program that can run on a single node with CPU or GPU, with multiple GPUs, or in a distributed fashion, simply by describing the desired setup in an environment variable called TF_CONFIG.
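As an illustration, the following sketch shows the kind of TF_CONFIG value that a worker (MR) process would receive in a distributed run; host names and ports are placeholders, and in our pipeline this variable is set per node by the Ansible role:

import json
import os

# Hypothetical cluster description injected into each process before launching the script.
tf_config = {
    "cluster": {
        "chief":  ["chief0.example.org:2222"],
        "worker": ["worker0.example.org:2222", "worker1.example.org:2222"],
        "ps":     ["ps0.example.org:2222"],
    },
    # This particular process acts as the first worker (MR node).
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)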
4 EXPERIMENTATION
In this section, we use the configuration files, code and scripts presented in previous sections to evaluate our deployment and execution in terms of flexibility and timing. To do so, we have selected a public Cloud provider on which to deploy our infrastructure and perform the evaluation. First, we execute the training on single nodes, both physical and virtual machines, to obtain a baseline for the resources at hand. Second, we study the use of a Cloud provider to deploy Virtual Machines with and without specialized hardware, comparing the results against both single-node and multi-node configurations.
For the local experimentation, we used a physical node running Ubuntu 16.04 with NVIDIA CUDA 9, cuDNN 7 and TF r1.10. Regarding the hardware, the node includes an Intel(R) Xeon(R) CPU E5-1620 v3 (4 cores @ 3.50 GHz) with 128 GB of RAM, along with a GeForce GTX 1080 Ti GPU (11 GB, 3.5K cores @ 1.6 GHz).
We selected AWS as the public Cloud provider for the evaluation. An analysis of the instance types and pricing was carried out in order to choose cost-effective computing resources. To reduce costs, we selected an Amazon Machine Image (AMI) that accelerates the deployment and avoids time-consuming tasks, such as updating packages on a fresh Ubuntu 16.04 VM. The AMI used is the Deep Learning Base AMI (Ubuntu) Version 10.0 (https://aws.amazon.com/marketplace/pp/B077GCZ4GR), which is available in most AWS regions. This AMI comes with NVIDIA CUDA 9 and NVIDIA cuDNN 7 and can be used with several instance types, from small CPU-only instances to the latest high-powered multi-GPU instances. Information on these instance families is available on the AWS website (https://aws.amazon.com/ec2/instance-types/). We selected p2 instances, in particular the p2.xlarge, which are composed of Intel Xeon E5-2686 v4 (Broadwell) CPUs and NVIDIA K80 GPUs (12 GiB, 2.5K cores). It is important to note the years of development between the vir-