Accelerating Deep Learning Model Training on Cloud Tensor Processing Unit

Cristiano A. Künas [1, a], Edson L. Padoin [2, b] and Philippe O. A. Navaux [1, c]

[1] Informatics Institute, Federal University of Rio Grande do Sul, Porto Alegre, Brazil

[2] Regional University of Northwestern Rio Grande do Sul, Ijuí, Brazil

Keywords:

Cloud Computing, TPU, High-Performance Computing, Diabetic Retinopathy.

Abstract:

Deep learning techniques have grown rapidly in recent years due to their success in image classiﬁcation,

speech recognition, and natural language understanding. These techniques have the potential to solve complex

problems and are being applied in various ﬁelds, such as agriculture, medicine, and administration. However,

training large and complex models requires high-performance computational platforms, making accelerator

hardware an essential tool and driving up its cost. An alternative solution is to use cloud computing, where

users only pay for usage and have access to a wide range of computing resources and services. In this paper,

we adapt a Diabetic Retinopathy neural network model for TPU-based training in the cloud and observe

promising results, including reduced training time without code optimization. This demonstrates the potential

of cloud computing in reducing the burden on local systems that are often overwhelmed by multiple running

applications. This allows for training larger and more advanced models at a lower cost than local computational

centers.

1 INTRODUCTION

Machine learning techniques have grown rapidly fol-

lowing their success in image classiﬁcation. Cur-

rently, models for speech recognition and natural

language understanding improve the functionality of

smartphones, while autonomous vehicles are being

tested and robotic consultants work in the ﬁnancial

market (Hatcher and Yu, 2018; Abiodun et al., 2018).

These techniques have the potential to solve com-

plex problems and are being applied in different ar-

eas, such as agriculture, medicine, forensic science,

theoretical and applied physics, administration, and

management.

Studies in machine learning and deep neural net-

works indicate a link between the size/complexity of

models and their ability to generalize, learn and ob-

tain efﬁcient results in complex tasks (Amodei et al.,

2016; Hestness et al., 2019; Sun et al., 2017). Hence,

there is an increasing demand for ever bigger deep

learning models with more parameters, trained with

more data, and high-resolution data.

[a] https://orcid.org/0000-0003-1080-7230

[b] https://orcid.org/0000-0002-4015-5619

[c] https://orcid.org/0000-0002-9957-5861

The training of increasingly large and complex models is making high-performance computational platforms an important tool. Training is one of the tasks that consume the most computational resources and can take several days to complete without proper hardware support, so hardware accelerators have become essential for speeding it up. However, the growing demand for this type of hardware is increasing its price, making it unaffordable for many researchers.

Cloud Computing is an alternative to high hard-

ware costs for training deep learning models. With it,

you only pay for usage (Roloff, 2013), and have ac-

cess to a wide range of computing resources and ser-

vices. In addition, cloud providers regularly update

their resources, including GPUs and TPUs, which al-

lows for training larger and more advanced models at

a lower cost. This type of beneﬁt is difﬁcult to obtain

with local computational centers since this equipment

is costly, and the renewal of the computational center

in small companies and universities does not occur

with the frequency that new versions of this equip-

ment are released.

In this context, we suggest offloading model training to the cloud, reducing the burden on local systems that are often overwhelmed by multiple running applications. As a case for evaluation, we adapted a Diabetic Retinopathy neural network model for TPU-based training and observed promising results, including reduced training time even without code optimization.

Künas, C., Padoin, E. and Navaux, P.

Accelerating Deep Learning Model Training on Cloud Tensor Processing Unit.

DOI: 10.5220/0012017300003488

In Proceedings of the 13th International Conference on Cloud Computing and Services Science (CLOSER 2023), pages 316-323

ISBN: 978-989-758-650-7; ISSN: 2184-5042

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

The rest of this paper is organized as follows. Sec-

tion 2 covers existing work on cloud-based training

of large neural networks using TPUs and TPU Pods.

Section 3 provides an overview of the TPUv3 archi-

tecture used in the experiments and explains the topic

of Diabetic Retinopathy (DR), our use case. In Sec-

tion 4, we describe the methodology, detail the appli-

cation, and outline the hardware and software setup

used. The results of the performance evaluation are

presented in Section 5. Finally, Section 6 concludes

the paper and outlines future work.

2 RELATED WORK

You et al. investigate the capability of supercomputers to speed up DNN training (You et al., 2019b). Their approach is to use a large batch size, powered by the Layerwise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. They empirically evaluate its effectiveness on five neural networks (AlexNet, AlexNet-BN, GNMT, ResNet-50, and ResNet-50-v2) trained with large datasets while preserving state-of-the-art test accuracy. Using 2,048 Intel Xeon Phi 7250 processors, they reduced the 90-epoch ResNet-50 training time from hours to 20 minutes. They also implemented the approach on Google's cloud Tensor Processing Unit (TPU) platform, which confirms their previous success on CPUs and GPUs (You et al., 2018). They scaled the batch size of ResNet-50-v2 to 32K and achieved 76.3 percent accuracy. Finally, they applied the approach to Google's Neural Machine Translation (GNMT) application, achieving a 4x speedup on the cloud TPUs.

Wongpanich et al. explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by the speedups that can be achieved when training at such scales (Wongpanich et al., 2021). EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks, and they can currently take on the order of days to train. The authors discuss the optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules and utilizing distributed evaluation and batch normalization techniques. Additionally, they present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset to analyze the behavior of EfficientNets at scale. With these optimizations, they trained EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.

Projects in progress on the SDumont Supercomputer: https://sdumont.lncc.br/projects statistics.php

Deep learning is computationally intensive, and hardware vendors have responded by building faster accelerators in large clusters. Training deep learning models at this scale requires overcoming both algorithmic and systems software challenges. Ying et al. discuss three systems-related optimizations: (1) distributed batch normalization to control per-replica batch sizes, (2) input pipeline optimizations to sustain model throughput, and (3) 2-D torus all-reduce to speed up gradient summation (Ying et al., 2018). They combined these optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes on a 1024-chip TPU v3 Pod, with a training throughput of over 1.05 million images/second and no accuracy drop.

Jouppi et al. evaluate a Tensor Processing Unit (TPU) (Jouppi et al., 2017), comparing it to a server-class Intel Haswell CPU and an Nvidia K80 GPU. The workload, written in the high-level TensorFlow framework, uses production neural network (NN) applications (MLPs, CNNs, and LSTMs) that represent 95% of NN inference demand. Despite low utilization for some applications, the TPU is, on average, about 15X to 30X faster than the GPU or CPU.

There is an industry-wide trend toward hardware specialization to improve performance, principally for deep learning models, which are compute-intensive. To systematically benchmark deep learning platforms, Wang et al. introduce ParaDnn, a benchmark suite for deep learning that generates models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks (Wang et al., 2019a). Using ParaDnn along with six real-world models, they benchmarked Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. They take a deep dive into the TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. They also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models.

You et al. studied a principled layerwise adaptation strategy to accelerate the training of deep neural networks using large mini-batches (You et al., 2019a). Using this strategy, they developed a new layerwise adaptive large batch optimization technique called LAMB. The empirical results demonstrate the superior performance of LAMB across various tasks, such as BERT and ResNet-50 training, with very little hyperparameter tuning. In particular, for BERT training, their optimizer enables the use of huge batch sizes of 32,868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes.

Most of these works address training large deep-learning models using TPU Pods. Unlike the related works, we used a single TPU with eight cores, and although we did not explore optimization techniques, we achieved interesting results. In addition, we performed a cost analysis and showed that a preemptible TPU can achieve better cost efficiency than the local cluster.

3 BACKGROUND

This Section provides an overview of the TPUv3 ar-

chitecture used in the experiments and presents con-

cepts about Diabetic Retinopathy (DR).

3.1 The Google Cloud TPU

In this section, we introduce the TPU architecture developed by Google, which is used in our experiments. TPUs, or Tensor Processing Units, are application-specific integrated circuits (ASICs) designed to accelerate machine learning workloads. As shown in Figure 1, a TPUv3 device has four internal chips, each of which comprises two cores. Each core is equipped with scalar, vector, and matrix units (MXU) connected to the on-chip high-bandwidth memory (HBM), with 16 GB per TPUv3 core. The TPUv3 offers a peak floating-point throughput of 420 TFlops (Ying et al., 2018). The cores of the TPU device perform calculations independently, and high-bandwidth interconnects enable the chips to communicate with one another within the device. TPUs can be used to train and run large machine learning models as well as for other high-performance computing tasks (Google, 2023a).

Figure 1: The architecture of a TPUv3 device with four chips, 420 TFlops of peak floating-point throughput and 128 GB of HBM.

When training a model on the Cloud TPU, it is important to configure it correctly in order to take advantage of the distributed training capabilities of the device. One strategy is to scale the batch size by the number of available TPU cores. For example, if the per-core batch size is 32, the global batch size will be 256 (8 cores × 32 = 256) (Künas et al., 2021). The global batch is automatically split across all replicas, so each core processes a batch of 32 examples, and the results are combined across all cores to form the final output. This approach processes multiple examples in parallel, which can greatly speed up the training process.
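The batch-scaling rule described above can be written down as a minimal sketch. The function names are hypothetical helpers introduced here for illustration, and the even split across replicas is an assumption about how the global batch is divided, not a description of the framework's internals.

```python
from typing import List


def global_batch_size(per_core_batch: int, num_cores: int) -> int:
    """Global batch size when every core processes `per_core_batch` examples."""
    if per_core_batch <= 0 or num_cores <= 0:
        raise ValueError("batch size and core count must be positive")
    return per_core_batch * num_cores


def split_across_replicas(global_batch: int, num_cores: int) -> List[int]:
    """Split a global batch evenly, one share per core (assumed even split)."""
    if global_batch % num_cores != 0:
        raise ValueError("global batch must be divisible by the number of cores")
    return [global_batch // num_cores] * num_cores


TPU_V3_CORES = 8     # one TPUv3 device: 4 chips x 2 cores
PER_CORE_BATCH = 32  # per-core batch size used in the text

global_batch = global_batch_size(PER_CORE_BATCH, TPU_V3_CORES)
per_replica = split_across_replicas(global_batch, TPU_V3_CORES)
print(global_batch)  # 256
print(per_replica)   # eight shares of 32 examples
```

Keeping the per-core batch fixed and letting the global batch grow with the core count means the same per-core setting can be reused when moving between a single GPU and the eight-core TPU.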

3.2 Diabetic Retinopathy

Diabetes mellitus is a metabolic disorder characterized by an abnormal increase in blood sugar levels. When not appropriately treated, patients are subject to complications such as heart attack, stroke, kidney failure, hard-to-heal injuries, and vision problems (Zheng et al., 2018). Vision problems occur because diabetes affects the circulatory system, leading to progressive vascular ruptures, and they can develop regardless of the severity of the patient's diabetes. One specific vision problem caused by diabetes is diabetic retinopathy (DR) (Janghorbani et al., 2000). Figure 2 compares a healthy retina with a retina affected by the disease.

Figure 2: Comparison of a healthy retina and retina with

diabetic retinopathy.

The Global Diabetic Retinopathy Project Group (Wilkinson et al., 2003) has proposed a five-stage classification protocol covering non-proliferative diabetic retinopathy (NPDR) and proliferative diabetic retinopathy (PDR). The stages are as follows:

• No Apparent Retinopathy: no abnormalities are present.

• Mild Non-Proliferative Diabetic Retinopathy: the presence of retinal microaneurysms.

• Moderate Non-Proliferative Diabetic Retinopathy: more than just microaneurysms, but less severe than severe NPDR (stage IV).

• Severe Non-Proliferative Diabetic Retinopathy: the presence of more than 20 intra-retinal hemorrhages in each of the four quadrants, venous beading in at least two quadrants, and intra-retinal microvascular abnormalities in at least one quadrant, in the absence of PDR.

• Proliferative Diabetic Retinopathy: characterized by neovascularization and vitreous or pre-retinal hemorrhage.

Regular eye examinations are crucial in tracking the

severity level of diabetic retinopathy (DR) for people

with diabetes mellitus. Timely diagnosis and treat-

ment of DR are essential (Network, 2010), as the con-

dition can progress to advanced stages without pro-

ducing any immediate symptoms, thereby increasing

the risk of vision loss (Stitt et al., 2016).

4 METHODOLOGY

The goal of this research is to offload DL model training to the cloud. Our codebase is inspired by the Voets reproduction (Voets et al., 2019). Our model applies transfer learning to the Inception v3 architecture: we initialized the network with ImageNet weights. Figure 3 details the 42 layers of the Inception v3 architecture.

After loading the ImageNet weights, we add a Global Average Pooling 2D layer and two Dense layers: the first is fully connected with 1024 units and uses the Rectified Linear Unit (ReLU) activation function, and the second uses the Softmax activation function with five units, one for each of the five classes.
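To make the size of the added classification head concrete, the following sketch counts its trainable parameters. It assumes that the last Inception v3 feature map has 2048 channels, which is a property of the standard architecture rather than a figure stated in this paper; the layer labels are hypothetical.

```python
def dense_params(inputs: int, units: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return inputs * units + units


BACKBONE_CHANNELS = 2048  # assumed: channels of the final Inception v3 feature map
HIDDEN_UNITS = 1024       # first Dense layer (ReLU)
NUM_CLASSES = 5           # second Dense layer (Softmax), one unit per DR stage

# Global average pooling only averages each feature map into one value,
# so it contributes no trainable parameters.
head_layers = [
    ("global_average_pooling_2d", 0),
    ("dense_relu", dense_params(BACKBONE_CHANNELS, HIDDEN_UNITS)),
    ("dense_softmax", dense_params(HIDDEN_UNITS, NUM_CLASSES)),
]

total_head_params = sum(count for _, count in head_layers)
for name, count in head_layers:
    print(f"{name:28s} {count:>10,d}")
print(f"{'total':28s} {total_head_params:>10,d}")
```

Under this assumption the head adds roughly 2.1 million parameters on top of the pretrained backbone.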

The model uses the Adam optimizer, a gradient descent algorithm based on the adaptive estimation of first-order and second-order moments. The learning rate was 0.0014. Accuracy was the metric collected to evaluate the model. The loss function calculates the logarithmic loss between the actual and predicted labels; in this paper, we use the Sparse Categorical Crossentropy function.
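As an illustration of what this loss measures, the sketch below is a plain-Python version of softmax followed by sparse categorical cross-entropy, where labels are integer class indices rather than one-hot vectors. It is a didactic reimplementation written for this text, not the framework's code, and the example logits are made up.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(
    labels: Sequence[int], probs: Sequence[Sequence[float]], eps: float = 1e-7
) -> float:
    """Mean of -log(p[true class]) over the batch; labels are class indices."""
    losses = [-math.log(max(p[y], eps)) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


# Hypothetical logits for two images over the five DR classes.
uniform = softmax([0.0, 0.0, 0.0, 0.0, 0.0])    # no preference among classes
confident = softmax([6.0, 0.0, 0.0, 0.0, 0.0])  # strongly predicts class 0

loss_uniform = sparse_categorical_crossentropy([0], [uniform])
loss_confident = sparse_categorical_crossentropy([0], [confident])
print(loss_uniform, loss_confident)
```

A uniform prediction over five classes gives a loss of ln 5, and the loss shrinks toward zero as the probability assigned to the true class approaches one.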

As input, we used Kaggle's EyePACS dataset. This database is commonly used for deep learning applications in DR detection and is divided into two subsets, train and test. The training subset contains 35,126 images, of which 25,810 show no signs of disease, 2,443 present mild retinopathy, 5,292 present moderate retinopathy, 873 present severe retinopathy, and 708 present proliferative retinopathy. We split these images into training and validation datasets and perform a 5-class classification.

Voets reproduction code: https://github.com/mikevoets/jama16-retina-replication

EyePACS dataset on Kaggle: https://www.kaggle.com/competitions/diabetic-retinopathy-detection

First, we process all images by locating the center and radius of the eye fundus and resizing every picture to 256x256 pixels. The image dataset was converted to TFRecord format and then uploaded to a bucket on Google Storage for training on the TPU device. Each TFRecord file contains 2,000 images (except the last one, which has 1,126 images). We split the TFRecord files into training (80%) and validation (20%) datasets. Therefore, the training dataset consists of 28,000 images, and the validation dataset consists of 7,126 images. The dataset is public and available on Kaggle at https://kaggle.com/datasets/cristianokunas/diabetic-tfrecords256.
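The image counts above follow from splitting at the level of whole TFRecord files. The sketch below reproduces the arithmetic under the assumption that the first 80% of the files (rounded to a whole number of files) form the training set and that the partially filled last file ends up in the validation set; the text does not state which files were assigned to which split, and the helper names are hypothetical.

```python
from typing import List, Tuple

TOTAL_IMAGES = 35_126
IMAGES_PER_FILE = 2_000


def shard_sizes(total: int, per_file: int) -> List[int]:
    """Number of images in each file when files are filled in order."""
    full, rest = divmod(total, per_file)
    return [per_file] * full + ([rest] if rest else [])


def split_files(sizes: List[int], train_fraction: float) -> Tuple[List[int], List[int]]:
    """Assign the first round(n * fraction) files to training, the rest to validation."""
    n_train = round(len(sizes) * train_fraction)
    return sizes[:n_train], sizes[n_train:]


sizes = shard_sizes(TOTAL_IMAGES, IMAGES_PER_FILE)
train_files, val_files = split_files(sizes, 0.8)

print(len(sizes), len(train_files), len(val_files))  # 18 files: 14 train, 4 validation
print(sum(train_files), sum(val_files))              # 28000 and 7126 images
```

Because the split is made on files rather than on individual images, the realized proportions are close to, but not exactly, 80% and 20%.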

TFRecord, TensorFlow’s custom data format, is

a powerful tool. It’s natively supported by the

high-performance tf.data API, can handle distributed

datasets and takes advantage of parallel I/O. Working

with large datasets can greatly beneﬁt from using a

binary ﬁle format for storage. Binary data consumes

less space on disk, is faster to transfer, and can be read

more efﬁciently. Using a binary ﬁle format can lead

to a faster import pipeline and ultimately reduce the

training time for your model. In addition to perfor-

mance beneﬁts, the TFRecord ﬁle format is also opti-

mized for use with TensorFlow. It simpliﬁes combin-

ing multiple datasets, and seamlessly integrates with

the data import and preprocessing features of the li-

brary. This is particularly useful for datasets that are

too large to ﬁt in memory (88.29 GB for the EyePACS

dataset), as only the necessary data (e.g. a batch) is

loaded from the disk and processed at a time. Over-

all, the TFRecord ﬁle format provides a convenient

and efﬁcient way to work with large datasets in Ten-

sorFlow.

4.1 Software Setting

A recent survey showed that Python remains the top language for deploying, executing, and integrating ML/DL algorithms and related tasks like data transformation (Wang et al., 2019b). Its popularity stems from its ease of learning, fast implementation, and rich environment, including popular ML/DL frameworks like Caffe, TensorFlow, Torch, and MXNet. Our application was deployed using Python 3.7.3 and the embedded frameworks TensorFlow (2.6.0) and Keras (2.6.0). We used CUDA Toolkit 11.8 and cuDNN 8.7 for GPU versions, following each developer's recommended installation procedures.

Figure 3: Inception V3 architecture. Source: https://cloud.google.com/static/tpu/docs/images/inceptionv3onc–oview.png.

4.2 Experimental Platforms

The experiments described in this paper were conducted on computational resources available at Google Cloud, in the PCAD infrastructure at INF/UFRGS, and on the Santos Dumont Supercomputer (SDumont) at the National Laboratory for Scientific Computing (LNCC):

• Cloud TPUv3: We use a single TPUv3 with 8 cores and 128 GB of memory. The TPU device provides 420 Teraflops of performance. This environment is named TPUv3 throughout the rest of this paper.

• Blaise: A single compute node composed of two Intel Xeon E5-2699 v4 Broadwell (2.2 GHz) CPUs, 44 physical cores (22 per socket), 256 GB of RAM, and four NVIDIA Tesla P100-SXM2-16GB GPUs. All the experiments conducted used only one GPU. This environment is named P100 throughout the rest of this paper.

• Bull Sequana X1120 (GPU): A single compute node composed of two Intel Xeon Cascade Lake Gold 6252 (2.1 GHz) CPUs, 48 physical cores (24 per socket), 384 GB of RAM, and four NVIDIA Tesla V100-SXM2-32GB GPUs. All the experiments conducted also used only one GPU. This environment is named V100 throughout the rest of this paper.

Google Cloud: https://cloud.google.com/

PCAD infrastructure at INF/UFRGS: http://gppd-hpc.inf.ufrgs.br/

SDumont: https://sdumont.lncc.br

LNCC: https://www.lncc.br

5 RESULTS

In this section, we present the performance evaluation results obtained on the experimental platforms described in the previous section. We report execution time for the different architectures. The results presented are an average of at least 10 runs, with a relative error of less than 5% and a 95% confidence level using Student's t-distribution. We also present the accuracy achieved by our model and perform a cost-efficiency analysis of TPUs compared to a local cluster.
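The statistical criterion above can be sketched as follows. Two things here are assumptions made for illustration: the relative error is taken to be the half-width of the 95% confidence interval divided by the sample mean, and the run times are hypothetical values, not measurements from this paper. The constant 2.262 is the two-sided 95% quantile of Student's t-distribution for 9 degrees of freedom (10 runs).

```python
import math
import statistics
from typing import Sequence

T_CRITICAL_95_DF9 = 2.262  # two-sided 95% Student's t quantile, 9 degrees of freedom


def relative_error(samples: Sequence[float], t_critical: float) -> float:
    """Half-width of the confidence interval divided by the sample mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = t_critical * statistics.stdev(samples) / math.sqrt(n)
    return half_width / mean


# Hypothetical training times in seconds for 10 runs (illustrative only).
runs = [770.0, 775.0, 772.0, 768.0, 780.0, 771.0, 774.0, 769.0, 776.0, 773.0]

error = relative_error(runs, T_CRITICAL_95_DF9)
print(f"mean = {statistics.mean(runs):.2f} s, relative error = {error:.2%}")
print("criterion met" if error < 0.05 else "more runs needed")
```

If the relative error were above the 5% threshold, further runs would be added and the critical value replaced by the one matching the new number of degrees of freedom.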

5.1 Performance Evaluation

As mentioned previously, we use the Inception V3 architecture. After initializing the network with the ImageNet weights and adding all layers described in the previous section, we trained the model on 28,000 samples and validated it on 7,126 samples, with a batch size of 32 and a limit of 25 epochs. For the TPU, we use the strategy presented in Section 3.1, where each core processes 32 examples, resulting in a global batch size of 256.

The performance results of our study are presented in Fig. 4. The V100 outperformed the P100 in average training time, with a 1.63× improvement. The TPUv3, in turn, trained 3.48× faster than the V100 and 5.63× faster than the P100 on average. This gain is achieved without any code optimization. Furthermore, even when the time spent transferring the dataset to the cloud is included, the TPUv3 remains about 4.95× and 3.06× faster than the P100 and V100, respectively. The transfer measurements are presented in Table 1: the average throughput was ≈ 12.2 MBytes/second, with an average size of 20.73 MBytes per TFRecord file.

Figure 4: Neural network training times on different hardware (x-axis: Architecture; y-axis: Time (s); bar values: P100 4347.19, V100 2690.12, TPUv3 772.81).
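The speedups discussed above can be recomputed from the average training times shown in Figure 4 and the transfer time in Table 1. The sketch below does so; it assumes that the figures "including transfer" simply add the transfer time to the TPUv3 training time. Values recomputed this way may differ from the quoted ones in the last digit because of rounding.

```python
TRAIN_TIME_S = {"P100": 4347.19, "V100": 2690.12, "TPUv3": 772.81}  # Figure 4
TRANSFER_TIME_S = 106.510                                            # Table 1


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


v100_vs_p100 = speedup(TRAIN_TIME_S["P100"], TRAIN_TIME_S["V100"])
tpu_vs_v100 = speedup(TRAIN_TIME_S["V100"], TRAIN_TIME_S["TPUv3"])
tpu_vs_p100 = speedup(TRAIN_TIME_S["P100"], TRAIN_TIME_S["TPUv3"])

# Assumed model: the dataset upload is paid once, before training on the TPU.
tpu_total_s = TRAIN_TIME_S["TPUv3"] + TRANSFER_TIME_S
tpu_vs_v100_with_transfer = speedup(TRAIN_TIME_S["V100"], tpu_total_s)
tpu_vs_p100_with_transfer = speedup(TRAIN_TIME_S["P100"], tpu_total_s)

for label, value in [
    ("V100 vs P100", v100_vs_p100),
    ("TPUv3 vs V100", tpu_vs_v100),
    ("TPUv3 vs P100", tpu_vs_p100),
    ("TPUv3 + transfer vs V100", tpu_vs_v100_with_transfer),
    ("TPUv3 + transfer vs P100", tpu_vs_p100_with_transfer),
]:
    print(f"{label:26s} {value:.2f}x")
```

The transfer adds less than two minutes to a run of roughly thirteen minutes, which is why the TPUv3 advantage shrinks only moderately when it is included.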

Table 1: Dataset transfer measurements to the cloud. The edge location is in the PCAD infrastructure at INF/UFRGS. The cloud location is in the Google Cloud Storage - Iowa us-central1.

Parameter             Edge to Cloud
Time                  106.510 seconds
Average throughput    12.2 MBytes/second
Total transferred     932.2 MBytes

The results of this study are important in the ﬁeld

of deep learning, where the ability to process large

amounts of data in a timely and efﬁcient manner

is crucial. TPUs have a higher computation den-

sity, meaning they can perform more operations in a

smaller space. This allows for more efﬁcient use of

the chip’s power and results in faster training times.

Our ﬁndings demonstrate that the TPUv3 outperforms

the other devices in average training time, making it

an attractive option for researchers and practitioners

looking to improve their deep-learning models.

Increasing the global batch size is necessary to fully utilize the TPU cores when training deep learning models. This is because the TPU cores operate on the XLA memory layout (Google, 2023b), which requires each tensor's batch dimension to be a multiple of 8 (Jouppi et al., 2017) in order to make better use of the memory of each TPU core and increase throughput. However, it is important to note that training with large batch sizes can degrade model quality relative to models trained with smaller batch sizes, an effect known as the "generalization gap" (Keskar et al., 2016).
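One simple way to respect the multiple-of-8 constraint on the batch dimension is to round a requested batch size up to the next multiple. The helper below is a hypothetical illustration of that arithmetic; it is not taken from the paper's code or from any library.

```python
def round_up_to_multiple(value: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


for requested in (1, 30, 32, 250, 256):
    aligned = round_up_to_multiple(requested)
    print(f"requested batch {requested:3d} -> aligned batch {aligned:3d}")
```

The batch sizes used in this work, 32 per core and 256 globally, are already multiples of 8, so no adjustment is needed for them.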

Although we did not explore optimization tech-

niques for scaling training to large batch sizes, such

as selecting large batch optimizers and learning rate

schedules, as well as utilizing distributed evaluation

techniques and batch normalization, we still achieved

good performance on the TPUv3.

5.2 Accuracy Evaluation

This section compares the accuracy collected from ten executions of the neural network model on the P100, V100, and TPUv3 architectures. All values shown are averages, and the t-test (Kim, 2015) is used to compare them. The model's average accuracy is 85.45%, 81.59% and 81.43% for TPUv3, V100, and P100, respectively, with a standard deviation of less than 1% on all architectures. Fig. 5 depicts the accuracy obtained on each architecture.

Figure 5: Neural network accuracy on different hardware (x-axis: Architecture; y-axis: Accuracy; bar labels: 0.81432, 0.81529, 0.85454).

The 85% accuracy of the diabetic retinopathy detection model is deemed appropriate. This is reinforced by the fact that other studies have reported similar accuracy rates (Lin et al., 2018; Ghosh et al., 2017), including with the Inception V3 architecture (Mohammadian et al., 2017). It is important to highlight that the early detection of diabetic retinopathy is crucial to avoid serious vision complications. A model with an accuracy of 85% can provide reliable results and contribute significantly to the accurate diagnosis of the disease.

5.3 Cost Evaluation

We also analyze the cost efficiency of using the cloud for model training by scaling the performance value by the price per hour. Table 2 shows the cost per hour for a local cluster and the Google Cloud TPU. The value for the cluster was calculated as follows: we consider a hardware cost of $25,000 for one machine and assume that the machine will be used for one year, which gives a hardware cost per hour of $2.85. We do not measure the costs of facilities, personnel, and power consumption, so they are not included in the total price. The TPUv3 cost per hour is ≈ 2.8× higher than that of the local cluster. However, the performance achieved was ≈ 3.48× better.

Table 2: Cost (in dollars per hour) of each solution.

Device           Cost
TPUv3-8          $8.00
Local cluster    $2.85

On the local cluster, our estimated cost would be

$2.13 to train the model. On the other hand, TPUv3,

costing $8 per hour, gives us an average cost per train-

ing of $1.72. TPU is about 19% more efﬁcient, i.e., it

costs 19% less to perform the same amount of work in

the cloud than in the local cluster. This indicates that

Cloud TPU can be a good choice for training deep

learning models, especially for our case, the Diabetic

Retinopathy model.

Additionally, the cost per training can be further reduced by using preemptible TPUs, which cost about 70% less than non-preemptible ones. However, they can be interrupted at any time, so the application must be restart-resilient: it must save model checkpoints regularly and restore the most recent one upon restart. In our case study, the estimated cost of using preemptible TPUs is $0.52 per training. This is 3.3× lower than the on-demand TPU cost and represents a cost reduction of around 75% compared to the local cluster.
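The cost figures in this section can be reproduced with a few lines of arithmetic. The sketch below assumes that the local cluster corresponds to the V100 node (this matches the 3.48× performance ratio and the $2.13 estimate), that one year means 365 days of continuous use, and that the preemptible price is exactly 70% below the on-demand price; none of the variable names come from the paper's code.

```python
HOURS_PER_YEAR = 365 * 24            # assumed: machine used continuously for one year
HARDWARE_COST_USD = 25_000.0
TPU_ON_DEMAND_USD_PER_H = 8.00
PREEMPTIBLE_DISCOUNT = 0.70          # "about 70% less"

V100_TRAIN_S = 2690.12               # local cluster run, assumed to be the V100 node
TPU_TRAIN_S = 772.81


def cost_per_training(seconds: float, usd_per_hour: float) -> float:
    """Cost of one training run billed proportionally to its duration."""
    return seconds / 3600.0 * usd_per_hour


local_usd_per_h = HARDWARE_COST_USD / HOURS_PER_YEAR
local_cost = cost_per_training(V100_TRAIN_S, local_usd_per_h)
tpu_cost = cost_per_training(TPU_TRAIN_S, TPU_ON_DEMAND_USD_PER_H)
preemptible_cost = tpu_cost * (1.0 - PREEMPTIBLE_DISCOUNT)

saving_on_demand = 1.0 - tpu_cost / local_cost
saving_preemptible = 1.0 - preemptible_cost / local_cost

print(f"local cluster:   ${local_usd_per_h:.2f}/h, ${local_cost:.2f} per training")
print(f"TPUv3 on-demand: ${tpu_cost:.2f} per training ({saving_on_demand:.0%} cheaper)")
print(f"TPUv3 preempt.:  ${preemptible_cost:.2f} per training ({saving_preemptible:.0%} cheaper)")
```

The hourly TPU price is higher, but the shorter run more than compensates, and the preemptible discount widens the gap further, provided that checkpointing keeps the overhead of interruptions small.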

https://cloud.google.com/tpu/pricing

6 CONCLUSION AND FUTURE WORK

TPUs are specialized hardware devices designed to accelerate machine learning workloads, especially the matrix operations used in deep learning algorithms. They are well suited to deep learning tasks because they provide a high-performance, efficient, and cost-effective solution for accelerating these workloads.

In this paper, we sought to evaluate the performance of Diabetic Retinopathy model training by asynchronously offloading the training to the cloud using TPU devices. Such an approach helps alleviate contention for high-demand local HPC resources, allowing them to be focused on running applications. We adapted the neural network model to be trained on TPUs and observed encouraging outcomes, including shorter training time, with gains of up to 5.63× in the best case, even without any code optimization. Our results provide a good starting point for those interested in improving the performance of their deep-learning models.

Future work will extend the performance evalua-

tion to the Cloud TPUv4 and the TPU Pods, exploring

optimization techniques, such as selecting large batch

optimizers and learning rate schedules, to scale train-

ing to large batch sizes.

ACKNOWLEDGEMENTS

This study was partially supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001, by Petrobras grant n.º 2020/00182-5, by CNPq/MCTI/FNDCT - Universal 18/2021 under grants 406182/2021-3, MCTIC/CNPq - Universal 28/2018 under grants 436339/2018-8, by CIARS RITEs/FAPERGS project and by CI-IA FAPESP-MCTIC-CGI-BR project. Some experiments in this work used the PCAD infrastructure, http://gppd-hpc.inf.ufrgs.br, at INF/UFRGS. The authors acknowledge the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing HPC resources of the SDumont supercomputer, which have contributed to the research results reported within this paper. URL: http://sdumont.lncc.br.

REFERENCES

Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., and Arshad, H. (2018). State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11):e00938.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J.,

Battenberg, E., Case, C., Casper, J., Catanzaro, B.,

Cheng, Q., Chen, G., et al. (2016). Deep speech 2:

End-to-end speech recognition in english and man-

darin. In International conference on machine learn-

ing, pages 173–182. PMLR.

Ghosh, R., Ghosh, K., and Maitra, S. (2017). Automatic

detection and classiﬁcation of diabetic retinopathy

stages using cnn. In 2017 4th International Confer-

ence on Signal Processing and Integrated Networks

(SPIN), pages 550–554. IEEE.

Google (2023a). Cloud tpu system architecture. [Accessed

Jan. 23, 2023].

Google (2023b). Xla: Google’s accelerated linear algebra

library. [Accessed Jan. 26, 2023].

Hatcher, W. G. and Yu, W. (2018). A survey of deep learn-

ing: Platforms, applications and emerging research

trends. IEEE Access, 6:24411–24432.

Hestness, J., Ardalani, N., and Diamos, G. (2019). Beyond

human-level accuracy: Computational challenges in

deep learning. In Proceedings of the 24th Symposium

on Principles and Practice of Parallel Programming,

pages 1–14.

Janghorbani, M., Jones, R. B., and Allison, S. P. (2000). In-

cidence of and risk factors for proliferative retinopa-

thy and its association with blindness among diabetes

clinic attenders. Ophthalmic Epidemiology, 7(4):225–

241.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal,

G., Bajwa, R., Bates, S., Bhatia, S., Boden, N.,

Borchers, A., et al. (2017). In-datacenter performance

analysis of a tensor processing unit. In Proceedings of

the 44th annual international symposium on computer

architecture, pages 1–12.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M.,

and Tang, P. T. P. (2016). On large-batch training for

deep learning: Generalization gap and sharp minima.

arXiv preprint arXiv:1609.04836.

Kim, T. K. (2015). T test as a parametric statistic. Korean

journal of anesthesiology, 68(6):540.

Künas, C. A., Serpa, M. S., Bez, J. L., Padoin, E. L., and Navaux, P. O. (2021). Offloading the training of an i/o access pattern detector to the cloud. In 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pages 15–19. IEEE.

Lin, G.-M., Chen, M.-J., Yeh, C.-H., Lin, Y.-Y., Kuo, H.-Y.,

Lin, M.-H., Chen, M.-C., Lin, S. D., Gao, Y., Ran, A.,

et al. (2018). Transforming retinal photographs to en-

tropy images in deep learning to improve automated

detection for diabetic retinopathy. Journal of ophthal-

mology, 2018.

Mohammadian, S., Karsaz, A., and Roshan, Y. M. (2017).

Comparative study of ﬁne-tuning of pre-trained con-

volutional neural networks for diabetic retinopathy

screening. In 2017 24th National and 2nd Interna-

tional Iranian Conference on Biomedical Engineering

(ICBME), pages 1–6. IEEE.

Network, S. I. G. (2010). Management of obesity: a

national clinical guideline. Scottish Intercollegiate

Guidelines Network: Edinburgh, 20.

Roloff, E. (2013). Viability and performance of high-

performance computing in the cloud. Master’s thesis,

Federal University of Rio Grande do Sul.

Stitt, A. W., Curtis, T. M., Chen, M., Medina, R. J., McKay,

G. J., Jenkins, A., Gardiner, T. A., Lyons, T. J.,

Hammes, H.-P., Simo, R., et al. (2016). The progress

in understanding and treatment of diabetic retinopa-

thy. Progress in retinal and eye research, 51:156–186.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. (2017).

Revisiting unreasonable effectiveness of data in deep

learning era. In Proceedings of the IEEE international

conference on computer vision, pages 843–852.

Voets, M., Møllersen, K., and Bongo, L. A. (2019). Repro-

duction study using public data of: Development and

validation of a deep learning algorithm for detection

of diabetic retinopathy in retinal fundus photographs.

PloS one, 14(6):e0217541.

Wang, Y. E., Wei, G.-Y., and Brooks, D. (2019a). Bench-

marking tpu, gpu, and cpu platforms for deep learning.

arXiv preprint arXiv:1907.10701.

Wang, Z., Liu, K., Li, J., Zhu, Y., and Zhang, Y. (2019b).

Various frameworks and libraries of machine learning

and deep learning: a survey. Archives of computa-

tional methods in engineering, pages 1–24.

Wilkinson, C., Ferris III, F. L., Klein, R. E., Lee, P. P.,

Agardh, C. D., Davis, M., Dills, D., Kampik, A.,

Pararajasegaram, R., Verdaguer, J. T., et al. (2003).

Proposed international clinical diabetic retinopathy

and diabetic macular edema disease severity scales.

Ophthalmology, 110(9):1677–1682.

Wongpanich, A., Pham, H., Demmel, J., Tan, M., Le, Q.,

You, Y., and Kumar, S. (2021). Training efﬁcient-

nets at supercomputer scale: 83% imagenet top-1 ac-

curacy in one hour. In 2021 IEEE International Paral-

lel and Distributed Processing Symposium Workshops

(IPDPSW), pages 947–950. IEEE.

Ying, C., Kumar, S., Chen, D., Wang, T., and Cheng, Y.

(2018). Image classiﬁcation at supercomputer scale.

arXiv preprint arXiv:1811.06992.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli,

S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-

J. (2019a). Large batch optimization for deep learn-

ing: Training bert in 76 minutes. arXiv preprint

arXiv:1904.00962.

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer,

K. (2018). Imagenet training in minutes. In Proceed-

ings of the 47th International Conference on Parallel

Processing, pages 1–10.

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer,

K. (2019b). Fast deep neural network training on dis-

tributed systems and cloud tpus. IEEE Transactions

on Parallel and Distributed Systems, 30(11):2449–

2462.

Zheng, Y., Ley, S. H., and Hu, F. B. (2018). Global ae-

tiology and epidemiology of type 2 diabetes mellitus

and its complications. Nature reviews endocrinology,

14(2):88–98.
