Convolutional Neural Network Face Recognition for Lecturer Attendance

Muhammad Rafi Muttaqin, Anshorulloh Nur Aziz, Dede Irmayanti and Sumanto

Informatics Engineering Study Program, Wastukancana College of Technology, Purwakarta, Indonesia

Universitas Bina Sarana Informatika, Jakarta, Indonesia


Keywords: Face Recognition, MobileNet V2, Attendance.

Abstract: Face recognition is a widely used field of research, but applying it in practice requires high accuracy so that the systems built on it do not make errors. This research investigates how MobileNet v2, one of the Convolutional Neural Network (CNN) architectures, can be used to recognize the faces of STT Wastukancana lecturers. The data were taken from each lecturer's social media, and the data were split using K-Fold Cross Validation. The MobileNet v2 architecture was trained for the classification task with different hyperparameter values to find the best performance. Among the configurations tested, the best accuracy is 85%, obtained with a dropout of 0.3 to reduce overfitting. Splitting the data with K-Fold Cross Validation improved accuracy, and adding a dropout layer reduced overfitting of the model.

1 INTRODUCTION

A face is one way to recognize a person's identity. Humans can recall someone's name by looking at their face if they have met that person before. Many computer applications and systems require a person's identity, and there are many ways to establish it. An attendance system is one example. Attendance can be recorded in various ways; one of the simplest is signing on paper, which is the method currently used for lecturers at STT Wastukancana. To simplify the attendance system, face recognition can replace the manual signature on paper. Essentially, face recognition is image classification specialized for faces. A convolutional neural network (CNN) is well suited to image classification because it is specialized to separate and detect patterns in input images, which makes the approach useful for face recognition (Farayola and Dureja, 2020). There are various CNN architectures, such as AlexNet, GoogLeNet, LeNet-5, and MobileNet. In this paper, the authors use the MobileNet v2 architecture because it was developed for efficiency and does not demand many resources (S. K. A. B. Singh, 2019). MobileNet is built on a depthwise separable convolution architecture for developing lightweight models (Howard, 2017). There are two versions of MobileNet, MobileNet v1 and MobileNet v2. The updates in MobileNet v2 are the addition of bottleneck layers and shortcut connections (Sandler et al., 2018). Convolutional neural networks have been used in previous research for face recognition classification. In that study, the dataset contained thirty-nine (39) classes, and a network consisting of convolutional, pooling, and fully connected layers, without any additional architecture, was trained and reached an accuracy of 86.71% (Abhirawan et al., 2017).

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a data mining process model (data mining framework) that was originally developed in 1996 by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation, and OHRA (Mauritsius and Binsar, 2020). Its advantage over other models is a clear definition of the Business Understanding phase, which other data mining models do not consider in detail (Chapman, 2020). Deep learning has been used in areas such as computer vision, natural language processing, and audio recognition, including face recognition. Deep learning is a multi-layer algorithm for extracting characteristics and identifying edges in inputs such as letters, numbers, and faces (Farayola and Dureja, 2020).

Muttaqin, M., Aziz, A., Irmayanti, D. and Sumanto.
Convolutional Neural Network Face Recognition for Lecturer Attendance.
DOI: 10.5220/0012447800003848
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 3rd International Conference on Advanced Information Scientific Development (ICAISD 2023), pages 255-261
ISBN: 978-989-758-678-1

A convolutional neural network is a subset of deep neural networks that was introduced to evaluate visual images. CNNs are specialized to isolate and recognize patterns in visual inputs, which is why the method is applied to face recognition (Farayola and Dureja, 2020). A known problem in face recognition is that the available data vary in light intensity and in pose (Zhao et al., 2003). In general, face recognition frameworks process the input face image with a feature extraction method, and the extracted features are then passed to a classifier for identification (Abhirawan et al., 2017).

Peryanto et al. (2020) used K-Fold Cross Validation and found that the way the data are divided can improve the accuracy of the model. In K-Fold Cross Validation, a given data set is divided into K parts (folds), and each fold is used as the test set in turn.
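The splitting scheme described above can be illustrated with a minimal sketch. It assumes the samples are simply indexed 0 to n-1 and split in order without shuffling; the function name `k_fold_indices` is hypothetical and is not taken from the paper's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, using each fold once for validation."""
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Everything outside the current fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Example: 10 samples split into 5 folds of 2 validation samples each.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold, val_idx)
```

In practice the indices would usually be shuffled before splitting so that each fold contains a mix of classes; that step is left out here to keep the sketch short.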

Figure 1: K-Fold Cross Validation (Krishni, 2018).

Data augmentation is a step in image data processing in which images are changed or modified in such a way that the computer treats the modified image as a different image, while humans can still tell that it shows the same thing (Mahmud et al., 2019).
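As a small illustration of this idea, the sketch below mirrors an image horizontally. The paper does not state which augmentations were applied, so the horizontal flip is only an assumed example, and the image is represented as plain nested lists rather than a real image tensor.

```python
def horizontal_flip(image):
    """Return a mirrored copy of an image stored as rows of pixel values."""
    return [list(reversed(row)) for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(image)
print(flipped)
```

To the model the flipped pixel grid is a new training sample, while a person would still recognize the same face.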

MobileNet is an efficient deep learning model that can be deployed on embedded or mobile devices such as smartphones without demanding a lot of resources (S. K. A. B. Singh, 2019). It is built on a depthwise separable convolution architecture to create lightweight models (Howard, 2017). There are two versions of MobileNet, MobileNet v1 and MobileNet v2. The updates in MobileNet v2 are the addition of bottleneck layers and shortcut connections (Sandler et al., 2018).

Figure 2: MobileNet v2 Architecture for object detection and classification (Wibowo et al., 2020).

ReLU is the default activation in MobileNet v2's activation layers. ReLU is an activation function introduced in 2000 by Hahnloser et al., a group that included H. Sebastian Seung. The activation function serves to activate and deactivate neurons (Agarap, 2018). Specifically, ReLU6 is used in every layer except the last convolution layer. The ReLU6 activation function is given in Equation 1.

$$f(x) = \min(\max(0, x), 6) \tag{1}$$

where f(x) is the result of the ReLU6 activation and x is the input value, which is clipped to the range [0, 6] (Wibowo et al., 2020).
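The ReLU6 function can be written directly from its definition. The following is a minimal scalar sketch; the function name `relu6` is chosen here for illustration.

```python
def relu6(x):
    """Clip the input to the range [0, 6]: negatives become 0, large values become 6."""
    return min(max(0.0, x), 6.0)


for value in (-2.0, 3.5, 10.0):
    print(value, relu6(value))
```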

1.1 Convolutional Layer

Convolution applies a filter to the input data (an image) and produces an activation result. It is a linear operation in which a set of weights is multiplied with the input. Convolutional layers are the layers that extract features from an input image (Farayola and Dureja, 2020).
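The filter operation can be sketched for a single-channel image as follows. This is a minimal illustration that assumes stride 1 and no padding, and it follows the deep learning convention of not flipping the kernel; the function name `conv2d_valid` and the example values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply each weight with the pixel it currently covers.
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4, 2], [-4, -4, 4]]
```

A 3x4 input and a 2x2 filter give a 2x3 feature map, since the filter fits in two vertical and three horizontal positions.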

Figure 3: Visualization of convolution layer (Biswas and Islam, 2021).

1.2 Pooling Layer

This layer is regularly used in CNNs to reduce the size of the input data and thereby increase the computational speed of the network. It operates on each feature map independently. Hence, when the input image is large, the pooling layer reduces the number of parameters that later layers need. Pooling comes in several types, namely sum pooling, average pooling, and max pooling, of which max pooling is the most commonly used. Max pooling is a sample-based discretization process. It down-samples the input data, reducing its dimensionality and allowing assumptions to be made about the features contained in the sub-regions (Farayola and Dureja, 2020).
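Max pooling on one feature map can be sketched as below. The 2x2 window with stride 2 is an assumed, commonly used setting rather than a value reported in the paper, and `max_pool2d` is a hypothetical helper name.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 4], [7, 9]]
```

The 4x4 input shrinks to 2x2, so later layers process a quarter of the values.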


Figure 4: Visualization of max-pooling layer (Biswas and Islam, 2021).

1.3 Global Average Pooling

Global average pooling generates one feature map for each category of the classification task in the last convolution layer. Instead of adding fully connected layers on top of the feature maps, it takes the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over fully connected layers is that it is more native to the convolution structure because it enforces a correspondence between feature maps and categories (Lin et al., 2014). Another advantage is that global average pooling has no parameters to optimize, so overfitting is avoided at this layer. In addition, global average pooling sums out the spatial information, so it is more robust to spatial translations of the input (Lin et al., 2014).
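The averaging step can be illustrated with a short sketch in which each feature map is a small 2-D list. The function name `global_average_pooling` and the example values are assumptions made for illustration.

```python
def global_average_pooling(feature_maps):
    """Reduce each 2-D feature map to its mean, giving one value per map."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
vector = global_average_pooling(feature_maps)
print(vector)  # [2.5, 2.0]
```

The function contains no weights, which reflects the point made above that this layer has nothing to optimize.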

1.4 Dropout

Dropout is a process that prevents overfitting and also speeds up the learning process. Dropout refers to removing neurons, in either hidden or visible layers of the network, during training. The neurons to be dropped are chosen randomly (Abhirawan et al., 2017).
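The random removal of neurons can be sketched as follows. This sketch assumes the "inverted dropout" variant, in which the surviving activations are rescaled by 1 / (1 - rate) so that their expected sum is unchanged; the paper does not say which variant was used, and the `dropout` helper is hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    # rng.random() is uniform on [0, 1), so values below `rate` are dropped.
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

Dropout is applied only during training; at inference time all neurons are kept.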

2 RESEARCH METHODS

The proposed architecture for face classification uses MobileNet v2. This section describes the data and the architecture in more detail.

2.1 Business Understanding

The purpose of this research is to create a system that recognizes a lecturer's face from a camera image so that it can be used in a lecturer attendance system. The system uses an artificial neural network, specifically a convolutional neural network, for face recognition. It receives a face image as input, processes it with the network, and outputs the name of the lecturer shown in the image.

2.2 Data Understanding

The data consist of facial images of STT Wastukancana lecturers. The raw data required are RGB color photos of a lecturer in which the face is visible. Each photo must be larger than 224x224 pixels, and the face must be clearly visible because only the face region is used.

Figure 5: Raw Data Example.

2.3 Data Preparation

For this face recognition research, photos of each STT Wastukancana lecturer were taken from the lecturer's social media accounts. There are 10 labels, one for each lecturer's name, as shown in Table 1.

Table 1: Total Datasets.

| No | Lecturer Name | Label | Total Data |
|---|---|---|---|
| 1 | Agus Sunandar | Agus | 10 |
| 2 | Syariful Alam | Alam | 63 |
| 3 | Chandra Dewi Lestari | Chandra | 99 |
| 4 | Dede Irmayanti | Dede | 18 |
| 5 | Irsan Jaelani | Irsan | 20 |
| 6 | Meriska Defriani | Meriska | 109 |
| 7 | Mochzen G. Resmi | Mochzen | 55 |
| 8 | M. Rafi Muttaqin | Raf | 16 |
| 9 | Rani Sri Wahyuni | Rani | 166 |
| 10 | Yusuf Muhyidin | Yusuf | 335 |
| | Total Data | | 891 |

Table 1 shows the names of the lecturers, the label used for each lecturer, the number of photos of each lecturer, and the total amount of data. The photos are first cropped, that is, part of the digital image is cut away so that only the necessary part, the face, remains visible. The photos are then resized, which changes their pixel dimensions. An example of a cropped photo is shown in Figure 6.

Figure 6: Example of Dataset photo after Cropping Process.

After the cropping process, the dataset photos that

are too large are resized. An example of a resized

photo can be seen in Figure 7.

Figure 7: Example of Dataset photo after Resize Process.

Then a dataframe is created with two columns, filename and label (Table 2). The filename column holds the file name of every image, and the label column holds the label of the image in the same row.

Table 2: Example of Dataframe in Use.

| No | File Name | Label |
|---|---|---|
| 1 | Agus001.jpg | Agus |
| 2 | Agus002.jpg | Agus |
| 3 | Agus003.jpg | Agus |
| 4 | Agus004.jpg | Agus |
| 5 | Agus005.jpg | Agus |
| ... | ... | ... |
| 887 | Yusuf332.jpg | Yusuf |
| 888 | Yusuf333.jpg | Yusuf |
| 889 | Yusuf334.jpg | Yusuf |
| 890 | Yusuf335.jpg | Yusuf |

2.4 Modeling

The model is built on the MobileNet v2 architecture using the TensorFlow framework. Several layers are added before MobileNet v2: an input layer, a data augmentation layer, and a preprocessing layer that rescales image data from the range 0 to 255 to the range -1 to 1. The layers added after MobileNet v2 are a Global Average Pooling layer and an output layer.
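The rescaling from the range 0 to 255 to the range -1 to 1 can be written as a linear map. The sketch below assumes the usual formula x / 127.5 - 1; the paper does not name the function it used, and `scale_pixels` is a hypothetical helper.

```python
def scale_pixels(pixels):
    """Map 8-bit intensities in [0, 255] linearly onto [-1.0, 1.0]."""
    return [p / 127.5 - 1.0 for p in pixels]


scaled = scale_pixels([0, 127.5, 255])
print(scaled)  # [-1.0, 0.0, 1.0]
```

Black (0) maps to -1, mid-gray to 0, and white (255) to 1, which matches the input range that MobileNet v2 expects according to the description above.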

Figure 8: Design Model.

3 RESULTS AND DISCUSSION

This study used Python 3.7.9 with the TensorFlow 2.3.1 framework on a laptop with an NVIDIA GeForce GTX 1050 3GB GPU and an AMD Ryzen 5 3550H processor. The proposed architecture was run in TensorFlow with several different parameter settings. Once the architecture was implemented, the model was trained. For training, the dataset was divided into two parts, training data and validation data, using k-fold cross validation with k = 5 to find the best data division. If overfitting occurred, dropout was added to the model to reduce it. Then, to improve accuracy, training experiments were run with a larger number of epochs. Training used Adam as the optimizer, cross-entropy as the loss function, and accuracy, which measures how often the prediction is correct, as the metric. The results of the training runs were compared across parameter values: the k-fold value, the dropout rate, and the number of epochs. This comparison of parameter values follows the approach of Rokhana (2019).
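The loss and the metric named above can be illustrated with a small sketch. It assumes categorical cross-entropy computed on softmax outputs, since the paper says only "Crossentropy Loss"; all function names and example values are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


probs = softmax([2.0, 0.5, -1.0])
loss = cross_entropy(probs, true_class=0)
acc = accuracy(["Agus", "Yusuf", "Rani", "Dede"], ["Agus", "Yusuf", "Rani", "Irsan"])
print(probs, loss, acc)
```

The loss falls as the probability of the correct class approaches 1, while accuracy only counts whether the most likely class is the right one.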

3.1 Effect of K-Fold Value

The data were split with k-fold cross validation using k = 5, and each split was trained for 50 epochs, resulting in five models with different accuracies. The effect of the k-fold value on model accuracy is shown in Table 3.

Table 3: Effect of K-Fold Value on Model Accuracy and Loss.

| K-Fold | Accuracy (Training) | Loss (Training) | Accuracy (Validation) | Loss (Validation) |
|---|---|---|---|---|
| 1 | 0.94 | 0.24 | 0.85 | 0.47 |
| 2 | 0.94 | 0.24 | 0.76 | 0.80 |
| 3 | 0.94 | 0.23 | 0.77 | 0.74 |
| 4 | 0.93 | 0.25 | 0.80 | 0.58 |
| 5 | 0.96 | 0.19 | 0.83 | 0.70 |

Based on Table 3, the best training accuracy is 0.96 in the fifth fold, and the best validation accuracy is 0.85 in the first fold. Comparing the amount of overfitting in these two folds, the first fold is better: the gap between training and validation accuracy is 0.09 in the first fold and 0.13 in the fifth fold. The first fold also has the smaller validation loss, 0.47 compared with 0.70 in the fifth fold. Therefore, of the data divisions produced by k-fold cross validation, the first fold is considered the best.
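The comparison above can be reproduced with a few lines of arithmetic on the accuracy values reported in Table 3. Measuring overfitting as training accuracy minus validation accuracy follows the text; the variable and function names are chosen here for illustration.

```python
# (training accuracy, validation accuracy) for each fold, copied from Table 3
table3 = {
    1: (0.94, 0.85),
    2: (0.94, 0.76),
    3: (0.94, 0.77),
    4: (0.93, 0.80),
    5: (0.96, 0.83),
}


def accuracy_gap(train_acc, val_acc):
    """Overfitting measured as training minus validation accuracy."""
    return round(train_acc - val_acc, 2)


gaps = {fold: accuracy_gap(t, v) for fold, (t, v) in table3.items()}
best_fold = max(table3, key=lambda fold: table3[fold][1])
print(gaps)       # {1: 0.09, 2: 0.18, 3: 0.17, 4: 0.13, 5: 0.13}
print(best_fold)  # 1
```

The first fold has both the highest validation accuracy and the smallest gap, which is consistent with it being selected.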

3.2 Effect of Dropout Rate

The comparison of data divisions showed that the first fold is the best. However, the model still overfits slightly. To reduce the overfitting, the first fold was retrained with a dropout layer added to the model, as shown in Figure 9.

The dropout rate, a value between 0 and 1, was increased from the smallest value upward. It starts at 0.1. If the overfitting is still large, the rate is increased in steps of 0.1 until the trained model shows no overfitting or the smallest possible overfitting. The resulting accuracy and loss are shown in Table 4.

Figure 9: Model Design with Dropout Added.

Table 4: Accuracy and Loss Value With the Addition of Dropout.

| Dropout | Accuracy (Training) | Loss (Training) | Accuracy (Validation) | Loss (Validation) |
|---|---|---|---|---|
| 0 (no dropout) | 0.94 | 0.24 | 0.85 | 0.47 |
| 0.1 | 0.92 | 0.26 | 0.87 | 0.48 |
| 0.2 | 0.91 | 0.34 | 0.84 | 0.52 |
| 0.3 | 0.85 | 0.42 | 0.85 | 0.55 |

3.3 Effect of Number of Epochs

To increase the accuracy of the model, it was retrained with a larger number of epochs, namely 100. The resulting accuracy and loss are shown in Table 5.

Table 5: Comparison of Model Accuracy and Loss Values With 50 and 100 Epochs.

| Epoch | Accuracy (Training) | Loss (Training) | Accuracy (Validation) | Loss (Validation) |
|---|---|---|---|---|
| 50 | 0.85 | 0.42 | 0.85 | 0.55 |
| 100 | 0.92 | 0.27 | 0.85 | 0.52 |

Table 5 shows that the validation accuracy does not improve. The training accuracy increases by 0.07, but this increase only causes the model to overfit. Therefore, raising the number of epochs to 100 does not improve the accuracy of the model. The training curves are shown in Figure 10 for 50 epochs and in Figure 11 for 100 epochs.

Figure 10: Graph of model training results with 50 epochs.

Figure 11: Graph of model training results with 100 epochs.

An example of face recognition using the trained model is shown in Figure 12. The red rectangle marks the face area that is used for classification. The photo shows one of the lecturers, labeled Yusuf. The trained model outputs Yusuf with a confidence value of 99.82%, so the classification is correct.

Figure 12: Example of Face Recognition Using a Model That Has Been Created.

The trained model reaches a highest accuracy of 85% on the validation data. The best model configuration consists of an input layer, a data augmentation layer, an image preprocessing layer that rescales image data from the range 0 to 255 to the range -1 to 1, MobileNet v2, a Global Average Pooling layer, and an output layer for ten-class classification. The initial dataset is resized to 224x224 pixels. The division into training and validation data uses K-Fold Cross Validation with k = 5, and the first fold is the division that gives the highest accuracy.

During training, an image enters the input layer with a size of 224x224x3. After the input layer, data augmentation is applied so that the model learns to recognize images from various points of view. The image data are then rescaled to the range -1 to 1 to match MobileNet v2. Next, the data pass through the MobileNet v2 architecture and then through the global average pooling layer with a dropout of 0.3, and the result is classified into 10 classes. The model was trained and validated on 891 lecturer photos. The accuracy obtained is 85% with 50 epochs.

Adding a dropout of 0.3 at the global average pooling layer is enough to reduce overfitting, bringing the training and validation accuracy to the same value of 85%. Increasing the number of epochs was not effective because it changed the accuracy only on the training data, not on the validation data. It should be noted, however, that this study used data from only 10 lecturers, not all lecturers. For use in the lecturer attendance system, data from all lecturers are needed.

4 CONCLUSIONS

The aim of this research is to create a deep learning model that recognizes lecturers' faces for use in the lecturer attendance system at STT Wastukancana. The research uses 891 photos of 10 lecturers taken from social media. The model architecture is MobileNet v2, and the best parameter configuration yields an accuracy of 85%. In this model, the MobileNet v2 architecture is the main component that is trained, and classification is carried out through the global average pooling layer. The best results use the first fold of a k-fold cross validation split with k = 5. A dropout of 0.3 added at the global average pooling layer is enough to reduce overfitting, so that the training and validation accuracy become the same at 85%.


REFERENCES

Abhirawan, H., Jondri, and Arifianto, A. (2017). Face recognition using convolutional neural networks (CNN). e-Proceeding Eng., 4(3):4907-4916.

Agarap, A. F. M. (2018). Deep learning using rectified linear units (ReLU).

Biswas, A. and Islam, M. (2021). An efficient CNN model for automated digital handwritten digit classification. J. Inf. Syst. Eng. Bus. Intell., 7(1):42.

Chapman (2020). CRISP-DM ready for Machine Learning Project.

Farayola, M. and Dureja, A. (2020). A proposed framework: Face recognition with deep learning.

Howard, A. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

Krishni (2018). Evaluating a Machine Learning model can... by Krishni. Data Driven Investor, Medium.

Lin, M., Chen, Q., and Yan, S. (2014). Network in network. In 2nd Int. Conf. Learn. Represent. ICLR 2014 - Conf. Track Proc., pages 1-10.

Mahmud, K., Adiwijaya, and Faraby, S. (2019). Multi-class image classification using convolutional neural network. e-Proceeding Eng., vol. 6, pages 2127-2136.

Mauritsius, T. and Binsar, F. (2020). Cross-Industry Standard Process for Data Mining (CRISP-DM). MMSI BINUS University.

Peryanto, A., Yudhana, A., and Umar, R. (2020). Image classification using convolutional neural network and k fold cross validation. J. Appl. Informatics Comput., 4(1):45-51.

Rokhana, R. (2019). Convolutional neural network for femur fracture detection in b-mode ultrasonic image. J. Nas. Tech. Electro and Technol. Inf., 8(1):59.

S. K. A. B. Singh, D. T. (2019). Shunt connection: An intelligent skipping of contiguous blocks for optimizing MobileNet-V2. vol. 118, pages 192-203.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pages 4510-4520.

Wibowo, A., Hartanto, C., and Wirawan, P. (2020). Android skin cancer detection and classification based on MobileNet v2 model. J. Adv. Intell. Informatics, 6(2):135-148.

Zhao, W., Chellappa, R., Phillips, P., and Rosenfeld, A. (2003). Face recognition: A literature survey. ACM Computing Surveys.
