The Application of Neural Networks for Facial Landmarking on Mobile Devices

Connah Kendrick(1), Kevin Tan(1), Kevin Walker(2) and Moi Hoon Yap(1)

(1) School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, John Dalton Building, Manchester, U.K.
(2) Image Metrics Ltd, City Tower, Piccadilly Plaza, Manchester, U.K.
Keywords: Facial Landmarking, Android, Deep Learning.
Abstract: Many modern mobile applications, such as Snapchat, beauty filters and camera auto-focusing systems, incorporate face detection and landmarking, typically implementing regression-based machine learning algorithms to detect facial landmarks accurately and allow manipulation of facial appearance. Mobile applications that incorporate machine learning must overcome issues such as lighting, occlusion, camera quality and false detections. A solution could be provided by the resurgence of deep learning with neural networks, which show significant improvements in accuracy and reliability in comparison to state-of-the-art machine learning. Here, we demonstrate the process of using trained networks on mobile devices and review its effectiveness. We also examine the effect of max-pooling layers as an efficient method of reducing the required processing power. We compared networks with three different numbers of max-pooling layers and ported one to a mobile device; the other two could not be ported due to memory restrictions. We will release all code to build, train and use the model in a mobile application. The results show that, despite the limited processing capability of mobile devices, neural networks can be used for difficult challenges while still working in real time. We show a network running on a mobile device on a live data stream and give a recommendation on the structure of the network.
1 INTRODUCTION
Face identification and landmarking are often used as the backbone of many camera-based applications. Features such as the auto-focus built into most modern cameras use a face detector to identify the regions that require focusing. Commercial applications such as Snapchat (Inc, 2017b), BeautyPlus (Co, 2017) and Makeup Genius (L'Oreal, 2017) are based around identifying the facial region and performing image manipulation on the facial features; these applications are extremely popular and have a high number of users world-wide. Many mobile applications rely on machine learning algorithms to identify patterns in image data. Machine learning uses pre-annotated data to learn the differences between data values, but it requires a large amount of annotated data, which is time consuming to produce even with the available tools (Kendrick et al., 2017). However, Mathias et al. (Mathias et al., 2014) showed that if a strict data annotation regime is implemented, a small dataset can match the accuracy of methods trained on large datasets. We aim to analyse how well neural networks can be integrated into mobile devices to increase the accuracy and reliability of such applications.
Deep learning is restricted and inefficient on low-powered machines, as it requires high-end machines with powerful Graphical Processing Units (GPUs) capable of processing the required quantities of data. However, after the model has been trained, running the network requires significantly less processing power, even without optimization, potentially allowing the models to run on low-powered mobile devices. In recent years, with the increase in power and affordability of GPUs, deep learning has become more commonly used, and in many recent tests it has outperformed machine learning algorithms significantly
(Grother and Ngan, 2016). The improved performance of deep learning architectures has the potential to improve available applications, but the method of implementing deep learning in mobile applications is
still in its infancy. Deep learning based software requires large amounts of memory and processing capability, which is not currently efficient for mobile devices. However, the processing power and memory of mobile devices are increasing at a rapid rate, allowing for multi-core processing of large quantities of data, which may allow even some of the deeper neural networks to run on the devices.
As deep learning is outperforming machine learning in real-world applications, such as medical healthcare (Goyal et al., 2017), skin assessment (Alarifi et al., 2017), security (Yi et al., 2014) and waste management (Sudha, 2016), we demonstrate the capabilities of implementing deep learning algorithms on mobile devices and analyse their effectiveness. This paper evaluates the current methods of integrating trained neural networks onto mobile devices.
2 RELATED WORKS
Deep learning and facial landmarking are well researched areas. Deep learning can perform many different tasks, such as object classification and segmentation; to do these, the network's structure and layer methodology must be suited to the task. Facial landmarking is best suited to a regression-based approach, which has commonly proven most effective, because the landmarks shift with different facial expressions. In deep learning, regression is performed by ensuring the network's layers do not perform any 'squashing' of the data. This means excluding methods such as Softmax (Krizhevsky et al., 2012) and using Root Mean Squared Error (RMSE) to calculate the loss, as it provides the distance from the ground truth, which is required for regression, rather than the traditional hit-or-miss measures used for classification (Litjens et al., 2017). Furthermore, consideration also has to be given to the type of activation function, to avoid restricting the network's outputs; the Rectified Linear Unit (ReLU), for example, can be used for regression. Other methods to perform regression are provided through APIs such as Keras (Chollet, 2016), which provides a regression function, but in our initial tests it showed little difference in performance while increasing training times.
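To make the distinction concrete, a minimal Keras sketch of a classification head versus a regression head (our illustration, not code from the released implementation):

from keras.models import Sequential
from keras.layers import Dense, Activation

# Classification head: softmax 'squashes' the outputs into a
# probability distribution, unsuitable for coordinate regression.
classification_head = Sequential([Dense(10, input_dim=128),
                                  Activation('softmax')])

# Regression head: a linear (un-squashed) output leaves the 136
# predicted coordinates unbounded; RMSE then measures the distance
# from the ground truth.
regression_head = Sequential([Dense(136, input_dim=128)])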
To make a neural network more efficient, there are many options, such as using smaller kernels in the convolutions or using fewer filters; a different output image is produced for each filter at each layer, each requiring more memory. Another method to reduce the strain on the network is to add max-pooling layers (Srivastava et al., 2014). A max-pooling layer splits the image into a number of regions; from each region the highest value is taken and placed into a new matrix, as illustrated in Fig. 2. This also helps with over-fitting, as the data becomes more abstract. The issue with max-pooling layers is that by shrinking the data, information is lost, possibly reducing accuracy while reducing processing time. For facial landmarking, the data should be preserved as much as possible, as it is easy to lose key facial features, such as the eye corners and nose tip, when employing max-pooling layers. Implementing max-pooling allows networks to run much faster on lower-powered devices. Max-pooling thus acts as a cost/reward system, and this paper analyses the trade-off between speed and accuracy.
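To make the pooling operation concrete, a minimal NumPy sketch of the 2x2 max-pooling illustrated in Fig. 2 (illustrative only; the experiments use the training framework's built-in pooling layers):

import numpy as np

def max_pool_2x2(image):
    # Keep the highest value from each non-overlapping 2x2 region,
    # halving both dimensions of the feature map.
    h, w = image.shape
    image = image[:h - h % 2, :w - w % 2]   # trim odd edges
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.rand(96, 96)
y = max_pool_2x2(x)
print(y.shape)   # (48, 48): a 96x96 map shrinks by half per pooling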
2.1 Deep Learning for Facial Landmarks
Deep learning for facial landmarking is a known area of research and is used in some commercial applications, such as Face++ (Megvii, 2015), which utilises deep learning to give highly accurate results. However, this application requires a connection to their servers and cannot run on devices in real time. Researchers have shown that deep learning can be used effectively to solve regression problems, such as predicting a point on an image (Zhang et al., 2016; Lai et al., 2015). A recent deep learning method for face landmarking is that of Zhang et al. (Zhang et al., 2016), who use auxiliary attributes to aid in deciphering the data. During the training stages of the model they provided additional parameters to identify, such as gender, glasses, expression and pose. These additional parameters help the network identify key pieces of information that can affect the results drastically; knowing whether the subject faces left or right, for example, can help the network identify when a point is occluded. The results showed increased accuracy in comparison to other systems.
Hochreiter and Schmidhuber (Hochreiter and Schmidhuber, 1997) proposed the recurrent Long Short-Term Memory (LSTM) network, which has been applied to the inherent regression problem in facial landmarking (Lai et al., 2015). The network defined by Lai et al. (Lai et al., 2015) uses deconvolution layers, which pad out the convolutions, effectively up-scaling the images. The mix of max-pooling, deconvolution and the nature of the LSTM helps avoid over-fitting while still learning the core features of the facial landmark positions. Other methods, such as that of Zhou et al. (Zhou et al., 2013), use a cascading-style detector to identify sections of the face and use the bounding-box region to align points to the face. Similarly, Luo et al. (Luo et al., 2012) also use a cascade-style classifier to retrieve the facial features, but then perform segmentation on the features instead of using the bounding boxes.
Bulat and Tzimiropoulos (Bulat and Tzimiropoulos, 2017a) focus on binarized convolutional neural networks, changing the hierarchical style of the networks to a parallel style. They reduce the amount of data being restricted when processing the previous convolution layers. Their method matches the accuracy of existing techniques while reducing the bottleneck and the total required parameters, allowing the network to run on lower-powered devices.
2.2 Neural Network Compression
Neural networks can be of varying sizes and scales. The types of layers used can drastically increase the amount of processing required to run the network; compare, for example, Inception (Szegedy et al., 2016) and LSTM (Hochreiter and Schmidhuber, 1997) layers with a plain convolution layer. This section describes work done to compress neural networks and to analyse their performance on mobile devices. Among the most commonly used methods to compress a neural network is the hashing trick, or HashedNets, by Chen et al. (Chen et al., 2015). Hashed networks function by grouping random connection weights together into a single 'bucket'; the grouped connection weights are all tuned by one parameter, reducing the network's total size and memory requirements (a toy sketch of this bucketing idea is given at the end of this section). However, joining multiple weight values into one can make the network less diverse, meaning its accuracy can be negatively affected. Another method called
One-Shot Whole Compression by Kim et al. (Kim
et al., 2016) focuses on shrinking the entire convo-
lution based network. Kim et al. (Kim et al., 2016)
splits the compression into three main stages:
Identification. Using Bayesian matrix factorisation, the sections of the neural network that contribute most to its success are identified. By identifying the highest-contributing sections, the method can better preserve accuracy.
Reduction. Using the ranking from the Bayesian matrix factorisation, the method applies one of two versions of Tucker's decomposition: the first analyses the core components and then merges them into a single tensor, whereas the second performs Singular Value Decomposition (SVD).
Fine Tuning. As the network has been modified, the results produced can differ significantly. Kim et al. (Kim et al., 2016) retrain the network with pre-built models to aid accuracy recovery.
Figure 1: An example of four 3x3 filters.

Overall, the method of Kim et al. (Kim et al., 2016) allows for large decreases in file size with only small losses in accuracy. Research has also been done on how different layers, such as recurrent layers, can be used to shrink the network size (Robinson, 1994).
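As a rough illustration of the hashing trick described earlier in this section, a toy NumPy sketch of the bucketing idea (dimensions and hashing are simplified; this is not the HashedNets implementation):

import numpy as np

rng = np.random.default_rng(7)
in_dim, out_dim, n_buckets = 512, 256, 1024

# Only n_buckets shared parameters are stored and tuned, instead of
# in_dim * out_dim independent weights.
buckets = rng.standard_normal(n_buckets) * 0.01
# A fixed pseudo-random hash maps every virtual weight to a bucket.
hash_idx = rng.integers(0, n_buckets, size=(in_dim, out_dim))

def hashed_layer(x):
    # Expand the virtual weight matrix on the fly from the buckets.
    return x @ buckets[hash_idx]

y = hashed_layer(rng.standard_normal(in_dim))
print(y.shape)   # (256,)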
3 METHODOLOGY
This section describes and analyses the networks designed for this experiment and gives the reasoning behind them, followed by a description of the dataset and an elaboration of the preprocessing steps.
To test the processing capability of the devices when implementing neural networks, three basic neural networks were defined, each with a similar layout designed to maximise the efficiency of the network. Each of the networks follows the same structure as the base network; as the purpose of this experiment is to isolate how the max-pooling layers affect the application's capabilities, a simple neural network (henceforth, the basic network) was chosen. The basic network consists of three blocks of convolution, max-pooling and activation layers, followed by two fully connected layers with intermediate activation layers and the final output layer.
the final output layer. The initial convolution layer
The Application of Neural Networks for Facial Landmarking on Mobile Devices
191
Figure 2: A example of a 2*2 max-pooling layer.
uses a 3×3 kernel, whereas the following convolu-
tions used a 2×2 kernel. The first two convolutions
use 32 filters on each image and the last convolution uses 64. Filters are used to widen the neural network, as illustrated in Fig. 1; they work by creating different convolution kernels to apply to the input image. Having multiple filters aids the network, as different filters can focus on the types of variation the network could encounter, such as a frontal face or a side face. By having filters account for variation, a more reliable network can be produced. The max-pooling layers were fixed to 2x2, effectively shrinking the convolved images by half, as shown in Fig. 2. The initial input image size was 96x96, meaning the final feature-map size entering the fully connected layers was 12x12; the final output layer produces the required 136 values (68 points x 2 for the X and Y). The two fully connected layers use 1000 and 500 neurons before connecting to the output layer. The goal of these networks is to give a clear view of the cost/reward of implementing max-pooling layers to reduce the workload of the network, and of how that impacts the network's accuracy. The max-pooling layers were removed starting with the first, as the longer the network maintains the full data, the more it can learn. The seed for generating the neurons' initial weights was fixed to 7 to improve reproducibility. Table 1 shows the number of max-pooling layers in each model.
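A minimal Keras sketch of the basic network described above; the layer order and sizes follow the text, while details such as the padding mode are assumptions on our part:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense

model = Sequential([
    # Block 1: 32 filters, 3x3 kernel, on the 96x96 grayscale input.
    Conv2D(32, (3, 3), padding='same', input_shape=(96, 96, 1)),
    MaxPooling2D((2, 2)),            # 96x96 -> 48x48
    Activation('relu'),
    # Block 2: 32 filters, 2x2 kernel.
    Conv2D(32, (2, 2), padding='same'),
    MaxPooling2D((2, 2)),            # 48x48 -> 24x24
    Activation('relu'),
    # Block 3: 64 filters, 2x2 kernel.
    Conv2D(64, (2, 2), padding='same'),
    MaxPooling2D((2, 2)),            # 24x24 -> 12x12
    Activation('relu'),
    Flatten(),                       # 12 * 12 * 64 features
    Dense(1000), Activation('relu'),
    Dense(500), Activation('relu'),
    Dense(136),                      # 68 landmarks x (x, y), linear output
])

The Two max-poolings and One max-pooling variants would drop the first one or two MaxPooling2D layers, leaving larger feature maps for the fully connected layers.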
A network with no max-pooling was not considered due to memory limitations. The memory requirements for training could be reduced by lowering the batch size; however, this would have a detrimental effect on accuracy and increase training times. The requirements compound because each filter produces an additional image at each layer, so from a single input image, 65536 images of 96x96 are produced (1x32x32x64), which would require 604MB, much higher than what current phones have. Even for the training stage, reasonable batch sizes were not possible on a Titan Pascal (12GB) card. The networks were trained over 300 epochs: the more data retained through the removal of max-pooling layers, the longer the networks need to converge, and a fixed number of epochs gave the fairest option for comparison, as it shows when convergence occurs.
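The 604MB figure can be reproduced with a back-of-the-envelope calculation, assuming one byte per pixel for the intermediate images:

# Without pooling, every layer keeps full-resolution feature maps
# and the filter counts compound: 1 input * 32 * 32 * 64.
images = 1 * 32 * 32 * 64            # 65536 intermediate images
bytes_per_image = 96 * 96            # one byte per pixel (assumption)
total = images * bytes_per_image
print(total / 1e6)                   # ~604.0 MB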
For each of the networks, ReLU was used as the activation, with RMSE (eq. 1) as the loss:
RMSE = \sqrt{\frac{\sum_{i=0}^{n} (y_i - y'_i)^2}{n}}    (1)
where:
n is the number of samples in the training batch,
y_i is the ground truth output for the training image,
y'_i is the predicted output for the training image.
These methods avoid any 'squashing' of the data to between 0 and 1, which would prevent a fully regressive learning system. The Adam optimizer (Kingma, 2015) was used. The accuracy of the network was calculated using binary accuracy (eq. 2), as with landmarking the chance of early networks getting an exact hit on the landmarks is slim. The distance of each point is therefore required, and this gives a clearer indication of how well the trained model is performing.
Accuracy = \frac{\sum_{i=0}^{n} (y_i = [y'_i])}{n}    (2)
where:
n is the number of samples in the training batch,
y_i is the ground truth output for the training image,
[y'_i] is the rounded predicted output for the training image.
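A sketch of how eqs. 1 and 2 could be wired up in Keras, reusing the model from the earlier listing; the paper does not list its loss code, so the function names here are ours:

import keras.backend as K

def rmse_loss(y_true, y_pred):
    # Eq. 1: root of the mean squared distance from the ground truth.
    return K.sqrt(K.mean(K.square(y_true - y_pred)))

def binary_accuracy_metric(y_true, y_pred):
    # Eq. 2: fraction of predicted coordinates that, once rounded,
    # land exactly on the ground-truth value.
    return K.mean(K.cast(K.equal(y_true, K.round(y_pred)), 'float32'))

model.compile(optimizer='adam', loss=rmse_loss,
              metrics=[binary_accuracy_metric])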
The networks were designed and trained using TensorFlow (Abadi et al., 2016); the tensors were therefore saved in graph form with both weights and structure in the same file. Optimization was then performed to remove the training, testing and non-required variables from the graph. The networks were trained using a GTX 1080Ti GPU (11GB) with a batch size of 120. The batch size was determined by modifying the basic network multiple times and comparing the accuracy, on which it has a high impact. To ensure consistency, all networks were trained with the same batch size, and the random weight initializer seed was set to 7 to increase reproducibility. We also split 10% from the training set to be used for validation; the validation split was performed by the Keras interface.
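A sketch of the freeze-and-strip step using the TensorFlow 1.x utilities of the period; the checkpoint path and node names are placeholders, not the names used in the released code:

import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('landmarks.meta')  # placeholder path
    saver.restore(sess, 'landmarks')
    # Bake the trained weights into the graph as constants and drop
    # every variable not needed to compute the named output tensor.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=['output'])

with tf.gfile.GFile('landmarks_frozen.pb', 'wb') as f:
    f.write(frozen.SerializeToString())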
In this section, we review the methodology behind the training of the neural networks and the preparation of the data. The network was trained on the Morph dataset (Albert and Ricanek Jr, 2008), using 49828 of the 55134 images available within the dataset. Images were excluded if they were taken in incorrect lighting, were deformed or unrecognisable, had most of the face obscured, or failed in both the OpenCV and Dlib (King, 2013) face detectors, both of which are commonly used.

Table 1: A comparison of the accuracy and loss for the three types of network.

Network            Max-pooling layers   Accuracy   Loss
Basic              3                    0.63       0.17
Two max-poolings   2                    0.64       0.12
One max-pooling    1                    0.69       0.38

Figure 3: A visualisation of the basic network used for this experiment.

Images that met the required standard were
passed to the next stage of annotation. To speed up the landmark annotation and maintain consistency, which has been shown to boost system accuracy (Mathias et al., 2014), an automatic approach was taken, similar to (Bulat and Tzimiropoulos, 2017b; Cao et al., 2014). The dataset was loaded into a purpose-built program that first transformed all the images to grayscale; the faces were then cropped out using Dlib's face detector and resized to 96x96. The images were passed into Dlib's face point detector to retrieve 68 facial landmarks. The points and 96x96 images were saved into a CSV file as input data for training. The format of the CSV file was the face point x coordinate and y coordinate for all 68 points on the face, with the final cell holding the raw 96x96 pixel information with a space between each pixel value. The Morph dataset consists of frontal faces only, with limited expressions.
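A sketch of the annotation pipeline described above, assuming Dlib's standard 68-point shape predictor file; the paths and image list are placeholders:

import csv
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

with open('morph_train.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for path in image_paths:          # list of Morph image paths (assumed)
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        faces = detector(gray)
        if len(faces) != 1:
            continue                  # skip failed detections
        r = faces[0]
        crop = cv2.resize(gray[r.top():r.bottom(), r.left():r.right()],
                          (96, 96))
        shape = predictor(crop, dlib.rectangle(0, 0, 96, 96))
        points = [c for p in shape.parts() for c in (p.x, p.y)]
        # Row: x1, y1, ..., x68, y68, then the raw pixels space-separated.
        writer.writerow(points + [' '.join(map(str, crop.flatten()))])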
The neural networks were trained to take in a 96x96 grayscale face image and process it to retrieve the facial landmarks. To retrieve the facial image, the OpenCV library (Corporation et al., 2000) was used, as it has high accuracy (Zhu and Ramanan, 2012) and is compatible with mobile platforms, making it suitable for this experiment. The OpenCV library uses a Viola and Jones (Viola and Jones, 2004) style cascade classifier to detect the faces. To prepare the images for the network, they were first converted to grayscale from RGBA (Red, Green, Blue, Alpha) to reduce processing time, and also resized to 96x96. The resulting image after cropping and rescaling was then processed by the network. This data preparation was integrated into the mobile application using the phone's rear-facing camera (facing away from the user) as the live image feed, allowing real-time cropping of the face and landmarking.
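For reference, a sketch of the runtime pre-processing using OpenCV's stock Haar cascade; the detection parameters are assumptions, as the paper does not list the values used on the device:

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def prepare_face(frame_rgba):
    # Convert a camera frame to the 96x96 grayscale crop the network expects.
    gray = cv2.cvtColor(frame_rgba, cv2.COLOR_RGBA2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(gray[y:y + h, x:x + w], (96, 96))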
4 NEURAL NETWORK INTEGRATION
Neural networks have seen an increase in use in recent years (Learning, 2017; Schmidhuber, 2015), but the focus has been on how accurate and effective the networks are. Another major focus should be the implementation of neural networks in consumer and industry systems. As neural networks are showing increased accuracy and reliability, many current consumer applications could be improved by using these techniques. Once the model is trained, it is possible to run it on devices with significantly less processing power, as the removal of the back-propagation and training layers allows faster computation. However, the memory for processing the image is still required, which is still a significant amount for a mobile device. Neural networks are usually stored as two separate files. The first file is the model: the network's layout and structure. The second file is the network's weights, which is larger and controls how the network processes the data. For porting to a mobile device, merging the weights and structure files is required, saving loading time and memory. To run a trained neural network, the device has to be able to load the file into memory, which is an issue on mobile devices, as even the most modern devices contain little RAM; the iPhone 7, for example, has 2GB. In addition, TensorFlow can have issues with file sizes over 68MB due to Android compression. To load the model onto the device, TensorFlow has a mobile-specific library (Abadi et al., 2016) for Android, which allows the device to open a tensor graph containing the model and weights. To build an Android Package (APK), Android Studio (Inc, 2017a) was used with a build targeting API level 25 on x86 and x64 CPU architectures; this was to ensure compatibility with the TensorFlow library and OpenCV. By using the TensorFlow library, the model can be loaded onto the device if sufficient memory is available. OpenCV is then used to access the mobile camera to detect faces and pre-process the images for the neural network.
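On the device this loading is done through the TensorFlow mobile Java library; the equivalent load-and-run flow, sketched here in Python against the frozen graph, with tensor names as placeholders:

import numpy as np
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('landmarks_frozen.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    face = np.zeros((1, 96, 96, 1), dtype=np.float32)  # stand-in crop
    points = sess.run('output:0',                      # placeholder name
                      feed_dict={'input:0': face})     # placeholder name
    print(points.shape)                                # (1, 136)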
5 RESULTS AND DISCUSSION
Using multiple networks, we showed that when tracking a large number of points on the face, max-pooling layers can have a detrimental effect on the results, as facial features and expressions are lost through the degradation of the data. We compare the accuracy and loss for different numbers of max-pooling layers in Table 1.
As the experiments are intended to determine how effectively deep learning can be implemented on a mobile device, the file sizes of the trained networks are compared in Table 2, together with the frames per second achieved on the mobile device and the image size before the fully connected layers.
Table 2: Resulting file sizes, feature-map sizes and frame rates.

Network              File size   Image size   FPS
Basic                32.5MB      12 x 12      15
Two max-poolings     120MB       24 x 24      N/A
One max-pooling      518MB       48 x 48      N/A
Face Detector Only   50KB        864 x 480    17
As shown in Table 2, the number of max-pooling layers has an incremental impact on the size of the resulting weights file. For reference as to how this would affect an application, the current Google Play store (Google, 2017) rules for publishing are relevant: an application on the Play store cannot exceed a file size of 100MB. Compression of the optimized file also cannot be performed, as standard methods negatively affect the reading and use of the files. As a result, the One max-pooling and Two max-poolings models could not be loaded onto the mobile device. Fig. 4 illustrates that as the network retains more data through the removal of max-pooling layers, it gets significant boosts in accuracy. The increase is much more prominent between the Two max-poolings and One max-pooling networks than between Basic and Two max-poolings, possibly because more data is lost through the additional pooling. Fig. 5 shows that more max-pooling layers lower the loss, though the effect is not as significant as the impact on accuracy. Fig. 6 illustrates that with max-pooling layers removed, the network's accuracy increases; this is much more prominent between the Basic and Two max-poolings networks. The reason behind the accuracy drops is that with max-pooling, only the neutral face and large movements, such as mouth open, can be learned, because subtle movements and deviations are lost through pooling. The method therefore fails to track expressions such as mouth widening and struggles with accurate placement of the eyes. However, with each improvement in accuracy comes a corresponding drop in runtime performance.

Figure 4: The networks' accuracy, both training and validation.
Figure 5: The networks' loss, both training and validation.

Figure 6: The networks' MAE, both training and validation.
We ported the models to an HTC 10 mobile phone (4GB RAM, quad-core (2 x 2.15GHz and 2 x 2.16GHz) CPU, Adreno 530 GPU) to test their effectiveness on live data. The live tests show that even though the basic network runs the fastest, it cannot distinguish between many of the facial movements, including mouth widening; the system favours the neutral face. As shown in Table 2, all trained models struggle to reach real-time landmarking, but the system also had to perform other tasks, such as face detection. Comparing the performance of the systems, we recommend implementing one max-pooling layer, as it aids performance significantly with little impact on accuracy, giving the best cost/reward trade-off, though further research is needed to optimize the size for mobile devices. However, if hardware is restricted and accuracy is not too important, two max-poolings will suffice. Images from the live capture are shown in Fig. 7, where we tested in both controlled lighting conditions and uncontrolled lighting (Fig. 8). We also performed a comparison with glasses on and off, as in Fig. 9. We make publicly available all code used to build the Android application (Android Studio project), the code to train and export the model (Python 3.5), and the models (.pb) used for this experiment.
Figure 7: Example output of the Android application.
6 DISCUSSION
We have demonstrated the effectiveness of basic neural networks on mobile devices and compared subtle changes in network design and their effect on performance, both on the device and on the accuracy of the network. More processor-intensive networks can achieve better accuracy, but their performance on mobile devices is unrecorded and an open area of research. Testing networks that require more resources on mobile devices, and showing their effectiveness in comparison to the one shown in this paper, will aid in benchmarking the progression of such systems in the future. A major requirement for future work is to optimize the application to increase the frame rate to real time; methods such as the fast Fourier transform face detector of Fagg et al. (Fagg et al., 2017), which can run at over 60fps, would free up memory and time for the neural network. Other methods could also be tested, such as that of Ranjan et al. (Ranjan et al., 2017), which allows a single neural network to handle everything from face detection to landmarking.
Figure 8: Example output of the Android application in a controlled environment (bottom) and wild environment (top), with both male and female faces.

Figure 9: Same scenario as Fig. 8, but with glasses.
7 FUTURE WORK
A future path of our research is to determine how the variation in phone hardware affects the neural networks. This includes the phone-specific chips designed for image processing, such as those that allow these "low power" devices to record and play 4K video in real time, which some modern computers have difficulty doing. From observation, the training times varied greatly with the different numbers of max-pooling layers used and whether the network implemented regression learning. This motivates further research into how the number of max-pooling layers affects the training time of deep learning networks. A limitation of the system is that it relies on a face detector to identify and segment the face for landmarking. Another limitation is that the network has not been trained on non-face images, meaning that false detections from the face detector cause the network to return an average set of face shape points, as the system does not understand non-face images. A future path of research would be to expand the network's capabilities and training set to include non-face images, so the system can handle false positives from the face detector. As even some of the networks shown in this paper could not be ported to the phone due to memory issues, future work could also test how the filter sizes affect the results of the network.
8 CONCLUSION
We have demonstrated that max-pooling layers have high potential to aid in compressing a network's size so that it can operate on mobile devices. Employing max-pooling layers allows networks to become 'lighter' and be trained faster, but with reduced accuracy. Networks that do not employ max-pooling layers achieve improved results. We have implemented a deep learning network on a mobile platform and tested its performance on real-world data. We have also shown the outputs of an array of varying neural networks to demonstrate how the networks were affected by the max-pooling layers. By comparing the cost/reward, we recommend the Basic network, as it has high accuracy with little impact on the phone's memory or processing capabilities on mobile platforms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow: A System for Large-Scale Machine Learning. OSDI.
Alarifi, J., Goyal, M., Davison, A., Dancey, D., Khan, R.,
and Yap, M. H. (2017). Facial Skin Classification Us-
ing Convolutional Neural Networks. In Image Anal-
ysis and Recognition: 14th International Conference,
ICIAR 2017, Montreal, QC, Canada, July 5–7, 2017,
Proceedings, volume 10317, page 479. Springer.
Albert, A. M. and Ricanek Jr, K. (2008). The MORPH
database: investigating the effects of adult craniofa-
cial aging on automated face-recognition technology.
Forensic Science Communications, 10(2).
Bulat, A. and Tzimiropoulos, G. (2017a). Binarized Con-
volutional Landmark Localizers for Human Pose Esti-
mation and Face Alignment with Limited Resources.
Bulat, A. and Tzimiropoulos, G. (2017b). How far are we
from solving the 2D & 3D Face Alignment problem?
(and a dataset of 230,000 3D facial landmarks).
Cao, C., Weng, Y., Zhou, S., Tong, Y., and Zhou, K. (2014).
FaceWarehouse: A 3D facial expression database for
visual computing. IEEE Transactions on Visualization
and Computer Graphics, 20(3):413–425.
Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and
Chen, Y. (2015). Compressing Neural Networks with
the Hashing Trick. Proceedings of The 32nd Inter-
national Conference on Machine Learning, 37:2285–
2294.
Chollet, F. (2016). Keras.
Co, X. M. T. (2017). BeautyPlus.
Corporation, I., Garage, W., and Itseez (2000). OpenCV.
Fagg, A., Lucey, S., and Sridharan, S. (2017). Fast, Dense Feature SDM on an iPhone. pages 95–102.
Google (2017). Google Play Store.
Goyal, M., Reeves, N., Rajbhandari, S., Spragg, J., and Yap,
M. H. (2017). Fully Convolutional Networks for Di-
abetic Foot Ulcer Segmentation. Systems, Man, and
Cybernetics (SMC), 2017 IEEE International Confer-
ence on.
Grother, P. and Ngan, M. (2016). The IJB-A Face Identification Challenge Performance Report.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Inc, G. (2017a). Android Studio.
Inc, S. (2017b). Snapchat.
Kendrick, C., Tan, K., Williams, T., and Yap, M. H. (2017).
An Online Tool for the Annotation of 3D Models.
pages 362–369.
Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., and Shin,
D. (2016). Compression of Deep Convolutional Neu-
ral Networks for Fast and Low Power Mobile Appli-
cations. Iclr, pages 1–16.
King, D. (2013). Dlib.
Kingma, D. P. (2015). Adam: A Method for Stochastic Optimization. pages 1–15.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
ageNet Classification with Deep Convolutional Neu-
ral Networks. Advances In Neural Information Pro-
cessing Systems, pages 1–9.
Lai, H., Xiao, S., Pan, Y., Cui, Z., Feng, J., Xu, C., Yin,
J., and Yan, S. (2015). Deep Recurrent Regression for
Facial Landmark Detection. pages 1–13.
Learning, D. (2017). Deep Learning for Consumer Devices
and Services. (APRIL).
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A. W. M., van Ginneken, B., and Sánchez, C. I. (2017). A Survey on Deep Learning in Medical Image Analysis.
L’Oreal (2017). Makeup genius.
Luo, P., Wang, X., and Tang, X. (2012). Hierarchical Face
Parsing via Deep Learning. pages 1–8.
Mathias, M., Benenson, R., Pedersoli, M., and Van Gool,
L. (2014). Face detection without bells and whis-
tles. Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 8692 LNCS(PART
4):720–735.
Megvii, I. (2015). Face Plus Plus.
Ranjan, R., Sankaranarayanan, S., Castillo, C. D., and Chel-
lappa, R. (2017). An All-In-One Convolutional Neural
Network for Face Analysis. pages 17–24.
Robinson, A. J. (1994). An Application of Recurrent Nets
to Phone Probability Estimation. IEEE Transactions
on Neural Networks, 5(2):298–305.
Schmidhuber, J. (2015). Deep Learning in neural networks:
An overview. Neural Networks, 61:85–117.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: A Simple
Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research, 15:1929–
1958.
Sudha, S. (2016). An Automatic Classification Method for Environment Friendly Waste Segregation Using Deep Learning. (Tiar):65–70.
Szegedy, C., Ioffe, S., and Vanhoucke, V. (2016). Inception-
v4, Inception-ResNet and the Impact of Residual Con-
nections on Learning. Arxiv, page 12.
Viola, P. and Jones, M. J. (2004). Robust real-time face
detection. International journal of computer vision,
57(2):137–154.
Yi, D., Lei, Z., Liao, S., and Li, S. Z. (2014). Learning Face
Representation from Scratch. arXiv.
Zhang, Z., Luo, P., Loy, C. C., and Tang, X. (2016). Learn-
ing Deep Representation for Face Alignment with
Auxiliary Attributes. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 38(5):918–930.
Zhou, E., Fan, H., Cao, Z., Jiang, Y., and Yin, Q. (2013).
Extensive facial landmark localization with coarse-to-
fine convolutional network cascade. Proceedings of
the IEEE International Conference on Computer Vi-
sion, pages 386–391.
Zhu, X. and Ramanan, D. (2012). Face detection, pose es-
timation, and landmark localization in the wild. Pro-
ceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages
2879–2886.