Night Fatigue Driving Detection Technology Using Infrared Images and Convolutional Neural Networks

Huei-Yung Lin (1) and Kai-Chun Tu (2)

(1) Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 106, Taiwan
(2) Department of Electrical Engineering, National Chung Cheng University, Chiayi 621, Taiwan
Keywords:
Fatigue Driving Detection, IR Images, Convolutional Neural Network, Action Recognition.
Abstract:
Traffic accidents are among the top ten causes of death, and fatigue driving is one of their major causes. Fatigue reduces the driver's concentration and reaction speed, and is especially dangerous in certain situations at night. This work presents a real-time driving fatigue monitoring system. The proposed network architecture with Unbalanced Local CNNs can effectively draw attention to different face regions according to the driver's fatigue-related states. Based on SlowFast, the recognition accuracy of our method on IR image datasets is greatly improved compared to the original model. Moreover, an adversarial learning mechanism is incorporated to extract the common features of daytime RGB and nighttime IR images to increase the overall robustness. The experiments carried out on public datasets and road scene images demonstrate the effectiveness of the proposed technique. The code is available at https://github.com/KaiChun-Tu/slowfastDrowsyDriver
1 INTRODUCTION
According to the NHTSA (National Highway Traffic Safety Administration), fatigue driving is a major cause of traffic accidents in the United States. In 2019, 697 people died due to fatigue driving, and 2.5% of fatal traffic accidents and 2.0% of non-fatal traffic accidents were related to drowsy driving. Between 2009 and 2010, the CDC (Centers for Disease Control and Prevention) interviewed a total of 147,076 people in 19 states and the District of Columbia, and 4.2% of respondents admitted that they had fallen asleep while driving at least once. NHTSA's survey on the number of car accidents across age groups at different times of the day revealed that drivers under the age of 45 have far more car accidents between 11 p.m. and 8 a.m. than at other times.
The above information clearly shows that fatigue driving is a very dangerous situation. Even if the driver manages to stay awake, fatigue still reduces the reaction speed, concentration, and decision-making ability, and increases the chance of car accidents. Thus, fatigue driving is a serious traffic problem that needs to be solved urgently (Zhang and Lin, 2021).
Since the introduction of AlexNet (Krizhevsky
et al., 2012), deep neural networks have been devel-
oped rapidly and used in various transportation fields.
Deep learning approaches are applied to driving fa-
tigue detection and technical advances have been re-
ported (Sikander and Anwar, 2018). However, most
algorithms take RGB images captured during the day
as input for network training, and cannot be general-
ized to nighttime application scenarios. Since fatigue
driving is more likely to occur at night, it is necessary
to develop a system exclusively for nighttime use. In
addition, current methods consider the task as an image recognition problem and annotate training data frame-by-frame for fatigue driving detection. It can be difficult to distinguish behaviors such as falling asleep and blinking when only a single image is used.
In this work, we develop a nighttime driver fatigue
detection system using the images acquired by an in-
frared camera. The proposed network structures take
IR images as input, and model the problem as abnor-
mal behavior classification from an image sequence.
We collect both daytime RGB images and nighttime
IR images for training and testing. A network archi-
tecture based on multiple LANets is combined with
SE-blocks to focus on different areas according to dif-
ferent fatigue actions. Moreover, a modality-invariant
feature extraction module based on adversarial learning is added to improve the robustness of fatigue detection using an image-type-dependent discriminator. In
the experiments, ablation studies on real scene images
are carried out to verify the feasibility and effective-
ness of the proposed technique.
2 RELATED WORK
In the existing literature, the detection of drowsy driv-
ing is based on two types of information sources, in-
cluding biomedical signal sensing and external char-
acteristics extraction (Doudou et al., 2020). Biomed-
ical signals refer to the electric currents generated by
the human body. In recent works, there are three types
of signals commonly used to distinguish the states of
fatigue, namely EEG (electroencephalogram), ECG
(electrocardiogram) and EOG (electrooculogram).
EEG is the signal produced by measuring the po-
tential differences between groups of cerebral cortex
cells. In an early investigation, (Jap et al., 2009) collected four EEG frequency components in a fatigued driving state and found that the signals from some brain areas changed noticeably. Since brain activity is directly related to fatigue, EEG signals can effectively reflect the state of fatigue and provide accurate results. However, EEG reflects not only fatigue but also the activity of the arms, eyes, mouth, etc. Since many parts of the human body are controlled by the brain, EEG is very susceptible to such noise, and an additional preprocessing stage is therefore required.
ECG utilizes electrodes to detect and amplify the
signals produced by a very small potential change in
the skin caused by the heartbeat. To measure the ECG
signals during driving, early approaches placed elec-
trodes in the seat belt (Murugan et al., 2020). Never-
theless, this can be easily influenced by the dynamic
driving environment. In a later implementation (War-
necke et al., 2022), the electrodes are attached to the
steering wheel. However, this approach still suffers from low accuracy and is easily affected by noise.
EOG records the voltages between the retina and
the cornea of the eye. This kind of signal needs to be measured by electrodes attached to both sides of the eyes, which therefore seriously interferes with normal
driving. To cope with this problem, some techniques
proposed to place the electrodes on the forehead of the
driver (Zhang et al., 2015). However, the applicability
is still very limited in practical uses.
Figure 1: Dataset images captured in the daytime (left) and at nighttime (right).

For fatigue or drowsy driving detection, extrinsic characteristics refer to the visible appearance features related to the driver's state, such as nodding, yawning, and blinking frequency (Dong and Lin, 2021). Considering the fatigue characteristics related to the eyes, they are commonly evaluated by the duration of eye closure or its proportion per unit time. In addition, the mouth
also represents an important factor for the fatigue condition. There are many techniques combining the features extracted from the eyes and mouth (Savaş and Becerikli, 2020). These approaches first utilize convolutional neural networks to identify the eyes and mouth, and then evaluate the fatigue state from the eye closing time per unit period and the aspect ratio of the mouth.
In addition to the use of a single image for fatigue
driving detection, utilizing image sequences with the
temporal information can provide more robust results.
It is usually difficult to distinguish situations such as blinking versus falling asleep, or yawning versus talking, from a still image. Thus, several recent techniques have been developed based on optical flow, LSTM, and 3DCNN (Quddus et al., 2021). The optical flow computed from two consecutive images captures the movement of features and improves the recognition rate. A 3DCNN adds a sliding window over the time axis to a 2DCNN to acquire the temporal information between adjacent images. LSTM (long short-term memory), a specialized recurrent neural network, integrates the features produced by the network along the time axis.
There are only a limited number of datasets avail-
able for the fatigue driving evaluation. In the YawDD
dataset, the images were captured under the natural
daylight (Abtahi et al., 2020). It consists of four action states for the drivers: talking, singing, silence, and yawning. The dataset was collected from 47 male and 43 female drivers from different regions and of different ages. It provides only RGB images and therefore cannot be used to train neural networks for nighttime fatigue detection. Furthermore, YawDD only provides mouth-related features. Without the information related to the eyes and head, detecting certain behaviors, such as falling asleep and nodding, becomes more challenging. Another commonly used public dataset
is the DDDD dataset (Weng et al., 2016). The images
were captured using an IR camera in an indoor envi-
ronment under daytime and nighttime lighting condi-
tions. Each driver was filmed in five scenarios: wear-
ing sunglasses in daytime, not wearing glasses in day-
time, wearing glasses in daytime, not wearing glasses
in nighttime, and wearing glasses in nighttime. Since
the DDDD images were collected indoors, they do not reflect the actual lighting and illumination changes of outdoor scenes.
3 METHOD
NHTSA statistics indicate that most car accidents happen at night, but the majority of existing fatigue detection research is based on RGB images. Due to the lack of ambient light in night scenes, it is not feasible to adopt RGB images acquired at night for fatigue detection. In this work, we use IR images captured at night for processing. These images suffer less from glasses reflections than daytime RGB images, and they make it easier for the network to attend to the eyes and mouth. Thus, we propose a convolutional neural network that emphasizes the characteristics of IR images.
To collect nighttime IR images, we used a Garmin Dash Cam Tandem. It is equipped with an active infrared function to capture clear driving images in the absence of light sources at night. Our dataset was acquired to match the actual application scenarios as much as possible. Eight different drivers were each filmed for about four minutes in both fatigue and normal states, in a vehicle with no light source other than streetlights. For testing purposes, we also collected image data during the day. The camera automatically switches between the IR and RGB modes according to the ambient light. Figure 1 shows examples of RGB and IR images acquired in the daytime and nighttime, respectively.
To label the ground-truth data, we adopt a method different from the frame-by-frame approach used in most public datasets. In our technique, the fatigue detection task is treated as finding abnormal events, and the image annotation is carried out in the same way as for the Kinetics dataset (Carreira and Zisserman, 2017). The image sequence is marked in groups of 8 frames, with 3 frames shared between consecutive groups, to generate distinct samples while maintaining the overall data volume, as sketched below.
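As a rough illustration of this grouping scheme (not our released code), the following sketch generates overlapping 8-frame clips in which consecutive clips share 3 frames, i.e. the clip starts are 5 frames apart; the function name and defaults are ours.

```python
# A minimal sketch of grouping a frame sequence into 8-frame clips that
# share 3 frames with the previous clip, i.e. a stride of 5 frames.
from typing import List

def make_clips(num_frames: int, clip_len: int = 8, overlap: int = 3) -> List[List[int]]:
    """Return lists of frame indices; consecutive clips share `overlap` frames."""
    stride = clip_len - overlap          # 8 - 3 = 5 frames between clip starts
    clips = []
    for start in range(0, num_frames - clip_len + 1, stride):
        clips.append(list(range(start, start + clip_len)))
    return clips

# Example: a 20-frame sequence yields clips starting at frames 0, 5, and 10.
print(make_clips(20))  # [[0..7], [5..12], [10..17]]
```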
Figure 2: The network architecture of SlowFast modified for this work.

In the proposed method, fatigue detection is based on the facial features of the driver, such as eye closing, yawning, and head tilting or nodding. Thus, face images are first extracted using RetinaFace (Deng et al., 2020) to generate the input sequences for fatigue driving detection. The backbone of our proposed network
architecture is based on the SlowFast model (Feicht-
enhofer et al., 2019). Different from 3D convolution models that process the spatial and temporal dimensions uniformly, SlowFast assumes that the temporal motion information changes much more rapidly than the spatial semantic information. That is, it is unlikely that one frame shows yawning, the next frame talking, and the third frame yawning again. Since the facial expression changes rapidly, SlowFast establishes two separate pathways to extract the spatial semantics and the temporal action information.
3.1 SlowFast with Attention Module
Figure 2 depicts the SlowFast architecture adopted in this work. The slow pathway takes image input at a lower frame rate and allocates more parameters in the channel dimension for the extraction of spatial semantic information. In contrast, the fast pathway takes images at a higher frame rate, with a larger temporal dimension used to extract the motion information. By combining these two pathways, SlowFast fuses the motion information into the slow pathway at each stage, as sketched below.
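As a rough illustration, the sketch below shows how a single 8-frame clip could be split into the two pathway inputs; the temporal stride alpha and the tensor shapes are assumptions for illustration, not the exact configuration of our model.

```python
# Illustrative sketch of SlowFast input sampling: the fast pathway sees every
# frame of the clip, while the slow pathway subsamples along the time axis.
# The stride alpha below is an assumed value used only for this illustration.
import torch

def split_pathways(clip: torch.Tensor, alpha: int = 4):
    """clip: (C, T, H, W) video tensor; returns (slow_input, fast_input)."""
    fast = clip                      # full temporal resolution, e.g. T = 8
    slow = clip[:, ::alpha]          # every alpha-th frame, e.g. T = 2
    return slow, fast

clip = torch.randn(3, 8, 224, 224)    # an 8-frame face clip
slow, fast = split_pathways(clip)
print(slow.shape, fast.shape)          # (3, 2, 224, 224) and (3, 8, 224, 224)
```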
The main difference between IR and RGB images is that there is much less ambient light at night than in the daytime. Reflections on the glasses that appear in the image can seriously affect fatigue detection. Most existing techniques extract the eyes and mouth to improve the accuracy. However, this approach can be greatly degraded if the glasses reflect light and the eye features are lost. Since IR images record much less reflection on the glasses, they are more suitable for this approach. In addition to the extra effort required to extract the eye and mouth regions, such a network structure also increases the model size and processing time.
The above issues are generally unfavorable for practical use. To deal with these problems, the proposed method utilizes whole face images without further extraction of the eye and mouth regions. Instead, an additional module is added to the architecture so that the network model is able to focus more on the eyes and mouth. Our network structure adopts the SlowFast model. Since its feature maps contain information along the time axis, which existing attention modules do not take as input, we utilize an average pooling operation to compress the temporal dimension of the feature maps, as in the sketch below.
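The sketch below illustrates this temporal compression step under assumed feature map shapes; the one-channel map produced here is only a stand-in for the actual attention module described in Section 3.2.

```python
# Minimal sketch: averaging a SlowFast feature map over the time axis so that
# a 2D attention module can process it. Shapes and the dummy attention map are
# placeholders for illustration.
import torch

features = torch.randn(2, 256, 8, 14, 14)   # (N, C, T, H, W) from a 3D backbone
pooled = features.mean(dim=2)                # (N, C, H, W): temporal average pooling
# `pooled` can now be fed to a 2D spatial/channel attention module; the result is
# broadcast back over the time axis when re-weighting `features`.
attn = torch.sigmoid(pooled.mean(dim=1, keepdim=True))   # dummy 1-channel spatial map
reweighted = features * attn.unsqueeze(2)                 # broadcast over T
print(reweighted.shape)                                    # (2, 256, 8, 14, 14)
```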
3.2 Unbalanced Local CNNs
In recent years, facial expression recognition has been a mainstream research topic. Its input/output (i.e., image/category) modeling is very similar to the objective of this work. Hence, it is desirable to adopt its methods to improve the network performance. More specifically, the attention over facial regions used for expression recognition can be used to provide attention for our network. In previous work, an attention adjustment model, Local CNNs, was proposed (Xue et al., 2021). The basic idea is to first use multiple LANets (Ding et al., 2020) to generate attention maps that focus on different regions, and then to combine all attention maps by taking the maximum value at each pixel. The MAD module in that architecture, similar to dropout, randomly sets attention maps to zero to prompt the network to explore more potentially important areas.
For facial expression recognition, the image areas involved in each expression are different, so the attention maps generated by Local CNNs need to be evenly distributed and cannot be too imbalanced. However, for our driving fatigue detection, the network only has to pay attention to the eyes and mouth. There is no need to avoid imbalance among the generated attention maps, since the important regions may be different each time. For example, we expect the network model to focus exclusively on the mouth features when yawning.
In our modification of the original model, the MAD module is removed and an additional fully connected layer is added, which uses the input features to compute the importance of the attention map generated by each LANet. The original Local CNNs tend to produce a more evenly distributed attention map since they simply take the maximum value of each attention map. Our version can weight one or two attention maps much more strongly than the others based on the features computed by the fully connected layer. Consequently, the final attention map is dominated by only a few attention maps and provides a less averaged result. Thus, this architecture is referred to as Unbalanced Local CNNs (a sketch is given below).
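The following is a hedged sketch of the Unbalanced Local CNNs idea, assuming LANet-style 1x1-convolution branches and illustrative layer sizes; for simplicity it feeds the same feature map to both the branches and the fully connected weighting, whereas in our design the fast pathway features serve as the input for attention map generation, as described below.

```python
# A sketch of the "Unbalanced Local CNNs" idea: several local attention branches
# each yield a spatial map, and a fully connected layer predicts per-branch
# importance weights so that one or two maps can dominate the final result.
# Branch design and layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class UnbalancedLocalCNNs(nn.Module):
    def __init__(self, channels: int, num_branches: int = 4):
        super().__init__()
        # Each branch: a lightweight LANet-style head producing a 1-channel map.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
                          nn.Conv2d(channels // 4, 1, 1))
            for _ in range(num_branches)
        ])
        # FC layer that turns globally pooled features into per-branch weights.
        self.fc = nn.Linear(channels, num_branches)

    def forward(self, x):                      # x: (N, C, H, W)
        maps = torch.cat([b(x) for b in self.branches], dim=1)       # (N, B, H, W)
        weights = torch.softmax(self.fc(x.mean(dim=(2, 3))), dim=1)  # (N, B)
        # Weighted sum instead of a pixel-wise max: a few maps may dominate.
        attn = torch.sigmoid((maps * weights[:, :, None, None]).sum(dim=1, keepdim=True))
        return x * attn                        # re-weighted features, (N, C, H, W)

x = torch.randn(2, 256, 14, 14)
print(UnbalancedLocalCNNs(256)(x).shape)       # torch.Size([2, 256, 14, 14])
```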
Figure 3: Our final Local CNNs model.
In our original design, we inserted the Unbalanced Local CNNs in the slow pathway, which is responsible for spatial operations, expecting the spatial semantic information to be focused on positions such as the eyes and the mouth. However, the convolution operations of the previous layers had already concentrated the features on uninteresting areas, so when the module was inserted after the fifth layer, the attention of the model could not be effectively adjusted. Since the eyes and mouth are expected to exhibit larger movements than other regions, and the fast pathway is responsible for motion, the features of the fast pathway are used instead as the input for attention map generation. The resulting maps then adjust the attention of the spatial semantics in the slow pathway to improve the overall performance.
In a recent study (Woo et al., 2018), CBAM (Convolutional Block Attention Module) was proposed to perform a joint attention adjustment through channel and spatial attention modules. Since our Local CNNs structure is itself a spatial attention module, a channel attention module is incorporated ahead of it. By adding the ECA (Efficient Channel Attention) module (Wang et al., 2020) (sketched below), we obtain the final architecture of our Unbalanced Local CNNs, shown in Figure 3.
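For reference, the ECA operation can be summarized by the short sketch below; the kernel size is fixed to 3 here for simplicity, whereas ECA normally derives it adaptively from the number of channels.

```python
# Sketch of the ECA (Efficient Channel Attention) idea used ahead of the spatial
# module: global average pooling, a 1D convolution across channels, and a
# sigmoid gate that re-weights each channel.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                          # (N, C) channel descriptors
        y = self.conv(y.unsqueeze(1)).squeeze(1)        # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]   # per-channel re-weighting

x = torch.randn(2, 256, 14, 14)
print(ECA()(x).shape)                                   # torch.Size([2, 256, 14, 14])
```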
Figure 4 shows the attention maps of each model.
The first column is the input image, and the second to
fourth columns are attention maps of SlowFast+our
Local CNNs, SlowFast, and SlowFast+Local CNNs,
respectively. It can be observed that the attention map
of SlowFast+Local CNNs is quite messy and without
specific focus. The attention of SlowFast is more con-
centrated, but it also focuses on other parts of the face
besides the eyes and mouth. Our network model focuses on different regions depending on the driving status: it constantly observes the eye features, while the mouth receives the highest attention when yawning. Although yawning also causes changes over the entire face, the attention there is comparably less significant.
3.3 Modality Invariant Feature
Learning
Figure 4: The heatmaps of each network model.

Figure 5: The modality-invariant feature learning based on adversarial learning.

In general, nighttime and daytime images have very different characteristics. Even if both are recorded as IR images, it is still difficult for a single network to derive good results for day and night scenes simul-
taneously. The performance will be further degraded
if IR images are used for nighttime and RGB images
are used for daytime. Therefore, most of the current
fatigue detection techniques only work on either day-
time or nighttime images. In the recent research on
pedestrian recognition, the accuracy of the network
model is improved by learning the modality-invariant
features for pedestrian re-identification between RGB
and IR images. Based on a similar idea, we adopt the
adversarial learning utilizing modality-invariant fea-
tures (Lin et al., 2022) to construct our fatigue detec-
tion model, and the network architecture is shown in
Figure 5.
In the network modification, we insert a discriminator at the fifth layer of SlowFast. This discriminator
is responsible for determining whether the input fea-
tures are from IR images, RGB images, or modality-
invariant. Two loss functions

Loss_D = CE(out_D, Label_I)     (1)

and

Loss_E = CE(out_D, Label_M)     (2)

are used in the proposed network, where CE is the cross-entropy and out_D represents the output of the discriminator. In Eq. (1), Label_I indicates the true category of the image, and Loss_D represents the loss between the discriminator output and the real class. The discriminator is optimized according to the back-propagation gradient of Loss_D to enhance its ability to distinguish whether the input features come from IR or RGB images. In Eq. (2), Label_M represents the modality-invariant category, and Loss_E denotes the loss between the discriminator output and the modality-invariant class. The first to fifth convolutional layers in SlowFast are optimized according to the back-propagation gradient of Loss_E to improve the ability of feature extraction from both RGB and IR images.

Table 1: Ablation study of the proposed network model. A sequence of 8 IR images is used for evaluation.

Model                                   F1 score
SlowFast                                0.954
SlowFast + Attention Augmented          0.960
SlowFast + Local CNNs                   0.961
SlowFast + Unbalanced Local CNNs        0.965
Through the adversarial learning established by the discriminator and the two loss functions in SlowFast, the feature extraction capability is improved by this confrontation. The convolutional layers are effectively made to work on both IR and RGB image inputs, and common features are extracted so that the fatigue state can be accurately identified in the subsequent fully connected layers. A training sketch of this scheme is given below.
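The following is a minimal sketch of the adversarial scheme under assumed label indices, feature shapes, and optimizers; it only illustrates how Loss_D in Eq. (1) updates the discriminator while Loss_E in Eq. (2) updates the backbone.

```python
# A hedged sketch of the adversarial objective in Eqs. (1) and (2): the
# discriminator learns to tell IR from RGB features (Loss_D), while the
# backbone is pushed to make the discriminator output the "modality-invariant"
# class (Loss_E). Network shapes, labels, and optimizers are illustrative.
import torch
import torch.nn as nn

IR, RGB, INVARIANT = 0, 1, 2                       # assumed label indices
ce = nn.CrossEntropyLoss()

backbone = nn.Sequential(nn.Flatten(), nn.Linear(512, 256))   # stands in for SlowFast conv1-5
discriminator = nn.Sequential(nn.Linear(256, 3))               # IR / RGB / invariant
opt_b = torch.optim.SGD(backbone.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-3)

features = torch.randn(8, 512)                     # a batch of pooled clip features
label_i = torch.randint(0, 2, (8,))                # true modality of each clip (IR or RGB)
label_m = torch.full((8,), INVARIANT)              # target class for the backbone

feat = backbone(features)
loss_d = ce(discriminator(feat.detach()), label_i)      # Eq. (1): update discriminator only
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

loss_e = ce(discriminator(feat), label_m)                # Eq. (2): only the backbone is stepped
opt_b.zero_grad(); loss_e.backward(); opt_b.step()
```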
4 EXPERIMENTS
To verify the effectiveness of our proposed model, we first perform several ablation experiments, covering spatial attention at night, temporal feature compression, and channel attention. We use IR images to test the original SlowFast model with the following attention modules: Attention Augmented, Local CNNs, and Unbalanced Local CNNs (without the channel attention module), and the performance is tabulated in Table 1. The results indicate that SlowFast + Local CNNs improves the F1 score by 0.7% for night scenes. Our Unbalanced Local CNNs is 1.1% higher than the original SlowFast and 0.4% higher than SlowFast + Local CNNs. Moreover, the parameters of the proposed model are only about 40% of those of SlowFast + Attention Augmented. These experiments demonstrate that our method performs well in terms of accuracy, processing speed, and model parameters for IR image input.
For channel attention, we test the following four modules: SE-block, CBAM, Coordinate Attention, and ECA. Table 2 tabulates the evaluation results.

Table 2: Ablation study of different channel attention modules.

Module                  F1 score    FPS      Model Size
SE-block                0.966       78.62    136.8 MB
CBAM                    0.953       76.75    136.8 MB
Coordinate Attention    0.935       77.19    136.8 MB
ECA                     0.969       79.38    136.8 MB

Figure 6: Attention maps of different channel attention modules.

It can be seen that the performance of ECA is slightly higher than that of SE-block, and the gap to CBAM and Coordinate Attention is larger, with ECA 3.4% higher than Coordinate Attention. The attention maps derived from
different channel attention modules are shown in Figure 6. The results from the SE-block show almost no RoIs other than the eyes and mouth, and focus only on the mouth when yawning while driving. Nevertheless, its attention does not accurately cover the entire mouth region. For CBAM, some face images are focused on the mouth area and others on the eyes. As for Coordinate Attention, the overall attention distribution is quite uniform, with no region receiving particularly strong attention.
For feature compression in the time dimension, we evaluate the three approaches used for spatial feature compression in CBAM (Woo et al., 2018): average pooling, max pooling, and average pooling + max pooling, whose F1 scores are 0.969, 0.958, and 0.963, respectively. Finally, we compare our method with the work of (Liu et al., 2019). Taking 8 IR images as input, the F1 scores of the proposed SlowFast + Unbalanced Local CNNs and the baseline approach (Liu et al., 2019) are 0.969 and 0.941, respectively.
Table 3: The experimental results for the IR images in the NTHU-DDDD dataset.

Method                                  F1 score
SlowFast + our Local CNNs (train)       0.96
SlowFast + our Local CNNs               0.73
(Bai et al., 2021)                      0.854
(Lyu et al., 2018)                      0.9005
(Park et al., 2016)                     0.748
(Yu et al., 2016)                       0.683
(Liu et al., 2019)                      0.962
Table 4: The experimental results using the YawDD dataset.

Method                          F1 score
SlowFast + our Local CNNs       0.970
(Bai et al., 2021)              0.895
(You et al., 2020)              0.943
4.1 Public Datasets
The proposed fatigue driving detection technique is
tested on two public datasets, NTHU-DDDD (Weng
et al., 2016) and YawDD (Abtahi et al., 2014). In the
DDDD dataset, the images are labeled so that one image corresponds to one category. The annotation is modified to conform to our network input format of groups of eight images. Table 3 shows the results evaluated using IR images and compared with several different algorithms (Bai et al., 2021; Lyu et al., 2018; Park et al., 2016; Liu et al., 2019). It can be seen that the F1 score of our model is 73%, while its performance during training is 96%. This serious overfitting problem might be caused by labeling over long time spans instead of image by image. As a result, some abnormal images are very similar to normal images, in both the training and testing data. Our model performs poorly because the features learned from the training data do not generalize to the testing images.
The YawDD dataset has three types of annotations: Normal, Talking, and Yawning, labeled at the level of video clips of about 20 to 40 seconds. Since a driver cannot be yawning for an entire clip, Normal frames are mixed into the Yawning category. Thus, we use the annotation file for fine-grained classification of videos in the Yawning category, and also adopt it for data segmentation of Yawning videos. For Normal and Talking videos, we randomly select 14 drivers and divide them into training and testing data. The evaluation results and the comparison with other techniques are shown in Table 4.
4.2 Dynamic Images and RGB/IR Mix
Figure 7: The attention maps derived using the mixture of IR and RGB images.

In previous works, most experiments were carried out in indoor environments or using images captured in a stationary vehicle. However, in practical applica-
tions, the fatigue detection system will operate while the vehicle is in motion. Current approaches usually extract the face regions, so the impact of environmental illumination changes should be minimized. To further investigate the influence of real driving on fatigue detection, we perform an experiment using a driving video captured at night. The test result shows an F1 score of 94.1%, about 2.8% lower than in the static case. As observed previously, the false detections mostly happen during the transition between normal and abnormal states.
In addition to network training and testing both on
nighttime IR images, it is also desirable to make our
technique work on daytime application scenarios using
RGB images. Thus, in the last experiment we mix IR
images recorded at night with RGB images captured
during the day using the same hardware settings for
training. The network model is tested for both day-
time and nighttime scenes, and the results are shown
in Table 5.

Table 5: The evaluation results using the mixture of IR and RGB images.

Model            F1 score    FPS      Model Size
Without A. L.    0.941       77.60    135.1 MB
With A. L.       0.956       72.29    136.8 MB

It can be seen that adversarial learning effectively improves the recognition result, and the accuracy is about 1.5% higher. Since the discriminator
does not need to perform operations during inference,
the overall FPS does not change too much. Figure 7
shows the attention maps derived from this model. By adding adversarial learning, the attention is effectively focused on the eyes when no special action occurs. During yawning, the mouth receives good attention with or without adversarial learning. When the driver is nodding, however, neither case shows a particular area of interest.
5 CONCLUSIONS
In this paper, we develop a nighttime fatigue driving
detection technique using infrared images. Since the
commonly used public datasets for fatigue driving research provide only indoor or RGB images, they have many limitations for nighttime scenes. This work uses an IR
camera to collect fatigue driving images in the actual
application scenario. We propose a network architec-
ture which is able to effectively use action information
in the SlowFast model to change the spatial attention.
With limited parameters and computing time, the net-
work can focus on the eyes and mouth regions. More-
over, a modality invariant feature learning mechanism
based on adversarial learning is added to improve the
accuracy for both the daytime and nighttime scenes.
The experiments carried out on real application sce-
narios have demonstrated the effectiveness of the pro-
posed technique.
ACKNOWLEDGMENTS
This work was financially/partially supported by the
Ministry of Science and Technology of Taiwan under
Grant MOST 109-2221-E-027-126-MY3 and Create
Electronic Optical Co., LTD, Taiwan.
REFERENCES
Abtahi, S., Omidyeganeh, M., Shirmohammadi, S., and
Hariri, B. (2014). YawDD: A yawning detection dataset. In Proceedings of the 5th ACM Multimedia Systems Conference, pages 24–28.
Abtahi, S., Omidyeganeh, M., Shirmohammadi, S., and
Hariri, B. (2020). YawDD: Yawning detection dataset.
Bai, J., Yu, W., Xiao, Z., Havyarimana, V., Regan, A. C.,
Jiang, H., and Jiao, L. (2021). Two-stream spatial-
temporal graph convolutional networks for driver
drowsiness detection. IEEE Transactions on Cyber-
netics, pages 1–13.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6299–6308.
Deng, J., Guo, J., Ververas, E., Kotsia, I., and Zafeiriou, S.
(2020). RetinaFace: Single-shot multi-level face local-
isation in the wild. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Ding, L., Tang, H., and Bruzzone, L. (2020). LANet: Local
attention embedding to improve the semantic segmen-
tation of remote sensing images. IEEE Transactions
on Geoscience and Remote Sensing, 59(1):426–435.
Dong, B.-T. and Lin, H.-Y. (2021). An on-board monitor-
ing system for driving fatigue and distraction detec-
tion. In 2021 22nd IEEE International Conference on
Industrial Technology (ICIT), volume 1, pages 850–
855. IEEE.
Doudou, M., Bouabdallah, A., and Berge-Cherfaoui, V.
(2020). Driver drowsiness measurement technolo-
gies: Current research, market solutions, and chal-
lenges. International Journal of Intelligent Trans-
portation Systems Research, 18(2):297–319.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
SlowFast networks for video recognition. In Proceed-
ings of the IEEE/CVF international conference on
computer vision, pages 6202–6211.
Jap, B. T., Lal, S., Fischer, P., and Bekiaris, E. (2009). Using EEG spectral components to assess algorithms for detecting fatigue. Expert Systems with Applications,
36(2):2352–2359.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural
networks. In Pereira, F., Burges, C. J. C., Bottou, L.,
and Weinberger, K. Q., editors, Advances in Neural
Information Processing Systems, volume 25. Curran
Associates, Inc.
Lin, X., Li, J., Ma, Z., Li, H., Li, S., Xu, K., Lu, G.,
and Zhang, D. (2022). Learning modal-invariant and
temporal-memory for video-based visible-infrared
person re-identification. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 20973–20982.
Liu, W., Qian, J., Yao, Z., Jiao, X., and Pan, J. (2019). Con-
volutional two-stream network using multi-facial fea-
ture fusion for driver fatigue detection. Future Inter-
net, 11(5):115.
Lyu, J., Yuan, Z., and Chen, D. (2018). Long-term multi-
granularity deep framework for driver drowsiness de-
tection. arXiv preprint arXiv:1801.02325.
Murugan, S., Selvaraj, J., and Sahayadhas, A. (2020). Detection and analysis: driver state with electrocardiogram (ECG). Physical and Engineering Sciences in Medicine, 43(2):525–537.
Park, S., Pan, F., Kang, S., and Yoo, C. D. (2016). Driver
drowsiness detection system based on feature repre-
sentation learning using various deep networks. In
Asian Conference on Computer Vision, pages 154–
164. Springer.
Quddus, A., Zandi, A. S., Prest, L., and Comeau, F. J.
(2021). Using long short term memory and convolu-
tional neural networks for driver drowsiness detection.
Accident Analysis & Prevention, 156:106107.
Savaş, B. K. and Becerikli, Y. (2020). Real time driver fatigue detection system based on multi-task ConNN. IEEE Access, 8:12491–12498.
Sikander, G. and Anwar, S. (2018). Driver fatigue detection
systems: A review. IEEE Transactions on Intelligent
Transportation Systems, 20(6):2339–2352.
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., and Hu, Q.
(2020). ECA-Net: Efficient channel attention for deep
convolutional neural networks. In 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 11531–11539.
Warnecke, J. M., Ganapathy, N., Koch, E., Dietzel, A., Flor-
mann, M., Henze, R., and Deserno, T. M. (2022).
Printed and flexible ecg electrodes attached to the
steering wheel for continuous health monitoring dur-
ing driving. Sensors, 22(11):4198.
Weng, C.-H., Lai, Y.-H., and Lai, S.-H. (2016). Driver
drowsiness detection via a hierarchical temporal deep
belief network. In Asian Conference on Computer Vi-
sion, pages 117–133. Springer.
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). CBAM: Convolutional block attention module. In Proceed-
ings of the European conference on computer vision
(ECCV), pages 3–19.
Xue, F., Wang, Q., and Guo, G. (2021). TransFER: Learning relation-aware facial expression representations
with transformers. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
3601–3610.
You, F., Gong, Y., Tu, H., Liang, J., and Wang, H. (2020).
A fatigue driving detection algorithm based on fa-
cial motion information entropy. Journal of advanced
transportation, 2020.
Yu, J., Park, S., Lee, S., and Jeon, M. (2016). Represen-
tation learning, scene understanding, and feature fu-
sion for drowsiness detection. In Asian Conference on
Computer Vision, pages 165–177. Springer.
Zhang, J.-Z. and Lin, H.-Y. (2021). Driving behavior analy-
sis and traffic improvement using onboard sensor data
and geographic information. In VEHITS, pages 284–
291.
Zhang, Y.-F., Gao, X.-Y., Zhu, J.-Y., Zheng, W.-L., and Lu,
B.-L. (2015). A novel approach to driving fatigue detection using forehead EOG. In 2015 7th Interna-
tional IEEE/EMBS Conference on Neural Engineer-
ing (NER), pages 707–710. IEEE.