Early Diagnosis of Parkinson’s Disease via Pro-Saccadic

Eye Movement Analysis: Multimodal Intermediate Fusion Framework

Ji-Yun Han

1,3 a

, Dae-Yong Cho

1,3 b

Dallah Yoo

2 c

, Tae-Beom Ahn

2,3 d

and Min-Koo Kang

1,3 e

Intelligence and Interaction Research Center, Korea Institute of Science and Technology (KIST), Seoul 02792,

Republic of Korea

Department of Neurology, Kyung Hee University Hospital, Kyung Hee University College of Medicine, Seoul,

Republic of Korea

KHU-KIST, Department of Converging Science and Technology, Kyung Hee University, Seoul 02447, Republic of Korea

Keywords:

Parkinson’s Disease, Eye-Tracking, Automated Diagnosis, Early Diagnosis, Deep Learning.

Abstract:

Early detection and timely treatment are essential for improving patient outcomes, but the lack of reliable

biomarkers impedes early diagnosis of Parkinson’s Disease (PD). Consequently, eye movement abnormalities,

known as early symptoms of PD, are gaining attention as crucial clues for early diagnosis. This study proposes

a novel multimodal intermediate fusion framework for the early diagnosis of PD using eye-tracking data.

The proposed framework improves the performance of classifying abnormal eye movement patterns in PD by

integrating local features from time-series data and global features from encoded time-series images. Focusing

on pro-saccade eye movements, this framework captures signiﬁcant abnormalities like reduced peak saccadic

velocity and multi-step saccades frequently observed in PD. The experimental results show a precision of 82%

and a recall of 96% for PD, which demonstrates the effectiveness of the framework in minimizing missed

diagnoses during early detection. In addition, this study highlights the potential of eye-tracking data as a

biomarker for the early diagnosis of PD and predicts the advanced application of integrating wearable smart

glasses for daily monitoring of neurodegenerative diseases.

1 INTRODUCTION

Parkinson’s disease (PD) is a progressive neurode-

generative disorder characterized by motor and non-

motor symptoms. Early detection and timely treat-

ment are essential to improve patients quality of life,

as they can reduce symptoms, delay the need for L-

dopa therapy, and slow disease progression (Tinelli

et al., 2016). Despite these beneﬁts, the current lack

of reliable biomarkers with high sensitivity and speci-

ﬁcity in the early stages poses a challenge to the diag-

nosis of PD (Tolosa et al., 2021). Consequently, many

patients are diagnosed after motor symptoms and neu-

rophysiological damage become evident, which leads

to the loss of the opportunity for early treatment.

To overcome these challenges, eye-tracking tech-

nology is gaining attention as an alternative for early

https://orcid.org/0000-0009-8944-3409

https://orcid.org/0000-0002-6685-5306

https://orcid.org/0000-0002-9736-6118

https://orcid.org/0000-0002-7315-6298

https://orcid.org/0000-0003-1109-4818

diagnosis of PD (S¸tef

anescu et al., 2024). This tech-

nology has the potential to detect abnormal eye move-

ment before motor symptoms and provides critical

clues for the early detection. (Turcano et al., 2019).

Moreover, as a non-invasive method without discom-

fort for patients, it shows signiﬁcant potential as a

biomarker for long-term and cost-effective monitor-

ing of the progression of PD.

As a result, previous studies have focused on clas-

sifying PD from HC through analysis of eye move-

ments using machine learning techniques. They ex-

tract features such as saccade amplitude, reaction

time, and error rates from eye-tracking data collected

during speciﬁc visual tasks. However, feature ex-

traction reduces time-series data to summary values,

which limits the ability to capture complex temporal

patterns that are critical to understanding abnormali-

ties related to PD (Yang et al., 2024).

This study proposes a novel multimodal interme-

diate framework for classifying PD and HC by auto-

matically detecting eye movement patterns and cre-

ating features such as speed, acceleration, and eye

Han, J.-Y., Cho, D.-Y., Yoo, D., Ahn, T.-B. and Kang, M.-K.

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework.

DOI: 10.5220/0013321700003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 261-272

ISBN: 978-989-758-731-3; ISSN: 2184-4305

261

movement states. This enables the framework to

identify critical features, such as decreased peak sac-

cadic velocities and multi-step saccades (White et al.,

1983), but are difﬁcult to detect with positional data

alone. Subsequently, it improves performance by us-

ing a multimodal intermediate fusion network that in-

tegrates Convolutional Neural Networks (CNN) and

Transformer Networks.

This paper is structured as follows. Section 2 re-

views the literature on PD-related eye movement ab-

normalities and diagnostic methods. Section 3 pro-

vides the materials and methods, while Section 4

describes data acquisition, feature engineering, and

the proposed deep learning architectures. Section 5

presents the results, and Section 6 discusses implica-

tions, limitations, and future research. Finally, Sec-

tion 7 concludes the paper.

2 BACKGROUND

2.1 Eye Movement Abnormalities in PD

Excessive inhibition of the basal ganglia (BG) and

the superior culculus (SC) caused by dopamine deﬁ-

ciency leads to abnormalities in eye movements asso-

ciated with PD (Pretegiani and Optican, 2017). These

abnormalities vary according to the speciﬁc type of

eye movement. Figure 1 illustrates the representative

abnormalities in eye movement in PD, including ﬁxa-

tion, saccades, and smooth pursuit.

• Fixation: Patients with PD often experience chal-

lenges in maintaining a steady gaze, which leads

to instability or drift. Square Wave Jerks (SWJ),

small involuntary eye movements with deviations

of 0.5 to 5 ° from the ﬁxation point, are particu-

larly common. These movements are character-

ized by a rapid return to the original target (Lal

and Truong, 2019). Figure 1(a) compares ﬁxa-

tion in PD with HC, highlighting the characteristic

SWJ in the red area for PD.

• Saccade: Saccadic eye movements of patients

with PD become slower and less accurate, making

it difﬁcult to quickly refocus the gaze. Reduced

peak velocity and prolonged latencies are com-

mon (Rascol et al., 1989). In addition, PD patients

also frequently exhibit multiple-step saccades,

characterized by small-amplitude sequences. Fig-

ure 1(b) compares saccades in PD and HC, high-

lighting prolonged latencies in red and multiple-

step saccades in blue.

• Smooth pursuit: Patients with PD experience vi-

sual attention deﬁcits caused by saccadic intru-

Figure 1: Comparison of eye movements between HC

(green line) and PD (red line): (a) ﬁxation, (b) saccade, and

highlighted in red and blue areas.

sions during target ﬁxation, which makes tracking

moving objects challenging (Frei, 2020). In ad-

dition, they experience hypometria, in which the

eyes fail to reach the target, leading to premature

stops and affecting the smoothness and accuracy

of eye movements. Figure 1(c) compares smooth

pursuit in PD with HC, highlighting hypometria

in red and saccadic intrusions in blue.

2.2 Related Works

Abnormal eye movements have potential as early

biomarkers for diagnosing PD even before the onset

of motor symptoms (Haslwanter and Clarke, 2010).

These abnormalities include characteristics such as

HEALTHINF 2025 - 18th International Conference on Health Informatics

262

reduced peak saccadic velocity and multiple-step sac-

cades (Ma et al., 2022). As a result, the potential of

abnormal eye movements as biomarkers for the diag-

nosis of PD has driven advancements, such as VOG

(Video-oculography), further expanding its applica-

tions in both clinical and research settings.

VOG is a widely used eye-tracking technology

in clinical settings to analyze abnormalities in eye

movements. Using high-speed cameras and data pro-

cessing software, it non-invasively examines char-

acteristic eye movement abnormalities and is ap-

plied to diagnose neurological diseases such as PD,

Alzheimer’s disease, and progressive supranuclear

palsy (Haslwanter and Clarke, 2010). By recording

and analyzing eye movement waveforms in real time,

clinicians and researchers receive precise information

on patient symptoms.

Recent studies have applied machine learning

techniques to classify PD and HC using eye-tracking

data obtained from systems like VOG. For exam-

ple, Brien et al. (Brien et al., 2023) trained a vot-

ing classiﬁer that combined SVM, logistic regres-

sion, and random forests using both point estimates,

such as mean amplitudes, and functional estimates,

such as blink probabilities. de Villers-Sidani et al.

(de Villers-Sidani et al., 2023) identiﬁed key eye

movement metrics, including accuracy, velocity, and

latency from tasks involving ﬁxation, pro-saccades,

and anti-saccades. Jiang et al. (Jiang et al., 2024) ex-

plored eye movements through a virtual reality-based

game, extracting latency and velocity characteristics

and using models such as k-NN, SVM and random

forests for classiﬁcation. Zhang et al. (Zhang et al.,

2021) analyzed oculomotor parameters such as SWJ

frequency, vertical saccade latency, and smooth pur-

suit gain using VOG data to classify between PD and

HC. Przybyszewski et al. (Przybyszewski et al., 2023)

applied various machine learning algorithms, includ-

ing granular computing, naive bayes, decision trees,

and random forests, to classify the progression of PD

using eye movement parameters. Koch et al. (Koch

et al., 2024) utilized a tablet-based eye-tracking tool

to extract oculomotor parameters and used machine

learning techniques to assess cognitive abilities and

classify stages of disease in PD. The characteristics

and performance of the studies mentioned above are

summarized and compared in Table 4.

2.3 Challenges and Insight

Although machine learning-based approaches show

signiﬁcant diagnostic potential, critical limitations re-

main. Current models primarily rely on feature ex-

traction, which risks losing crucial information and

fails to fully capture the complex eye movement pat-

terns. (Fawaz et al., 2019) Eye-tracking data are

inherently sequential, with complex interactions be-

tween time points being essential. However, existing

models often fail to capture these interactions effec-

tively, which leads to reduced diagnostic accuracy.

To overcome these challenges, this study proposes

a novel deep learning framework to classify PD and

HC using eye-tracking data. It fuse the features from

the CNN and Transformer to effectively capture both

global context and local details simultaneously. En-

coding time-series as images provides a comprehen-

sive view of eye movement patterns, which reduces

information loss. The proposed framework efﬁciently

leverages raw time-series data to detect subtle abnor-

malities, which enhances the potential for early diag-

nosis of PD and maximizes the effectiveness of early

treatment and intervention strategies.

3 MATERIALS

3.1 Participants

Table 1: Demographic details of participants. Continuous

variables: t-test; categorical variables: chi-square test.

Characteristic Parkinson’s Disease Control P-value

Age 66.2 ± 8.8 68.7 ± 8.1 0.051

Sex (Male : Female) 28 : 56 33 : 66 1.000

This study followed the tenets of the Declaration

of Helsinki and was approved by the Institutional

Review Board of Kyung Hee University Hospital

(KHUH 2021-04-032). In this retrospective study,

we enrolled drug-naive patients with PD who vis-

ited the outpatient clinic at the Department of Neu-

rology, Kyung Hee University Hospital, between Jan-

uary 1, 2012, and April 5, 2021. Patients were diag-

nosed with PD based on the UK Parkinson’s Disease

Society Brain Bank diagnostic criteria, conﬁrmed

through expert examination (DY and TA). To be in-

cluded in the study, patients should have undergone

a VOG examination and demonstrated signiﬁcant

terminal dopaminergic loss identiﬁed by dopamine

transporter imaging, or N-3-[18F]ﬂuoropropyl-2β-

carbomethoxy-3β-4-iodophenyl nortropane positron

emission tomography. Patients were excluded if

they had a history of severe neurological or psy-

chiatric conditions requiring regular treatment, such

as dementia, stroke, traumatic brain injury, brain

tumor, history of brain surgery, or major depres-

sive disorder. Additionally, patients with neuro-

ophthalmologic or neuro-otologic comorbidities that

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework

263

could inﬂuence VOG examination results, or those

with distinct structural lesions on brain magnetic res-

onance imaging or computed tomography that could

account for neurological symptoms were excluded.

The control group comprised patients who pre-

sented headache or dizziness and underwent VOG

tests. The controls were matched by sex and age to the

patient group and subject to the same exclusion crite-

ria. In total, the study included 183 participants: 84

with PD and 99 controls. Although there were no sig-

niﬁcant differences in the sex distribution between the

groups (p=1.000), the mean age of the PD group was

66.2 years (standard deviation (SD)=8.8), compared

to 68.7 years (SD=8.1) in the control group. The age

difference was not statistically signiﬁcant (p=0.051).

Table 1 shows the demographic details.

3.2 Data Acquisition

Eye-tracking data are collected through VOG

(SLVNG, SLMED, Seoul, South Korea) during three

visual tasks: ﬁxation, saccade, and smooth pur-

suit. Participants wore cameras-equipped goggles to

record their eye movements, with visual stimuli pre-

sented on a 50-inch monitor positioned 1000mm from

the participants’ eyes. Figure 2 illustrates the experi-

mental setup for the VOG test.

A 5-point calibration is performed prior to each

test to ensure measurement accuracy. The VOG sys-

tem records eye movements at 30 Hz.

• Fixation Task: Participants maintain a gaze on a

stationary target to assess their ﬁxation ability.

• Saccade Task: Participants shift their gaze be-

tween two ﬁxed points to measure saccadic speed

and accuracy. The task includes pro-saccades,

where they look at a sudden target, and anti-

saccades, where they look at the opposite direc-

tion of the target either horizontally or vertically.

• Smooth Pursuit Task: Participants track a mov-

ing target to evaluate smoothness.

This study analyzes the horizontal pro-saccade

task, known to be useful for early diagnosis. (Li et al.,

2023) Eye-tracking data are extracted from VOG test

videos using a CNN-based pupil detection network

(Eivazi et al., 2019), which captures the x and y co-

ordinates and the width of the pupil ellipse at 30

Hz. Figure 3 illustrates the pupil detection results:

(a) shows detection during normal eye movements,

while (b) shows detection during a blink, where the

network misidentiﬁes the pupil, returning abnormally

small pupil ellipse widths. During blinks, the pupil

ellipse width often falls below 1.5 times the average,

Figure 2: Video oculography.

Figure 3: Eye-tracking results using the pupil detection net-

work. (a) Detection results for normal eye movements. (b)

Detection results during a blink.

and these segments are therefore deﬁned as blinks and

adjusted using linear interpolation.

4 METHODS

4.1 Preprocessing

Since raw time-series data may contain some noise

caused by A) participants’ head movements, B) the

resolution and positioning of the camera used to

record eye-tracking data, and C) demographic char-

acteristics such as participants’ age, we ﬁrst reﬁne the

data by performing outlier removal, coordinate nor-

malization, and data cleaning.

Outliers are identiﬁed with a z-score threshold of

2.5 and adjusted using linear interpolation to preserve

the continuity of the time-series data.

Since participant eye positions vary across VOG

recordings, comparing the data is challenging. To ad-

dress this, normalizing the eye center coordinates is

crucial. The central reference point is calculated by

averaging the maximum and minimum x and y coor-

HEALTHINF 2025 - 18th International Conference on Health Informatics

264

Figure 4: Overview of data cleaning using DTW. (a) Comparison of low and high DTW distance sequences with the reference.

(b) Histogram of HC age distribution before cleaning. (c) Age distribution of removed HC (blue) and remaining HC (red)

after cleaning. (d) Q-Q plot of HC age distribution before and after cleaning.

dinates. Subsequently, all coordinates were adjusted

relative to this point, representing deviations from the

center instead of absolute positions, ensuring consis-

tency across videos. After normalization, the adjusted

coordinates are scaled to a range between 0 and 1.

In HC data, factors such as age and lack of atten-

tion can make PD classifying HC patterns challenging

(Hindle, 2010). To identify and exclude such anoma-

lous data, dynamic time warping (DTW) is used.

DTW measures the similarity between time-series

data by minimizing the effects of temporal shifts,

which enables the detection of patterns even when se-

quences are misaligned (M

uller, 2007), (Senin, 2008).

A low similarity indicates a high DTW distance,

while a high similarity corresponds to a low DTW

distance. DTW-based data cleaning uses the HC par-

ticipant with the best task performance as a reference,

excluding the top 25% of HC data with high DTW

distances as anomalies to ensure consistency. Fig-

ure 4 demonstrates the DTW cleaning procedure. In

Figure 4(a), sequences with low (142) and high (353)

DTW distances relative to the reference are shown.

The high-distance sequence resembles the PD data,

so these high-distance HC sequences are excluded.

DTW cleaning removes anomalous patterns without

affecting demographic characteristics such as age and

gender. Figure 4(b)–(d) conﬁrm this, with age dis-

tribution histograms and a Q-Q plot showing similar

distributions before and after cleaning.

4.2 Feature Engineering

Clinical data sets contain physiological states, and

new characteristics from mathematical models or

medical knowledge enhance their value. (Sirocchi

et al., 2024) This study utilizes feature engineering to

generate new variables that capture information such

as speed and eye movement states and extend beyond

basic 2D eye-tracking data.

Patients with PD exhibit saccade abnormalities,

such as delayed onset, prolonged duration, reduced

speed, and decreased accuracy. In particular, reduced

saccade velocity and multiple-step saccades are key

early indicators, but are difﬁcult to accurately capture

in VOG data. To address this, new features are gener-

ated that represent the velocity, acceleration, and eye

movement states, which enhance the network’s ability

to detect PD-related impairments by capturing clini-

cally signiﬁcant abnormalities.

4.2.1 Velocity and Acceleration Features

To design features that effectively reﬂect the reduced

saccadic velocity of the peak in PD, velocity and ac-

celeration are calculated using the x and y coordi-

nates. The velocity variable is determined by measur-

ing the difference in the x- and y-coordinate values

over the time intervals between timestamps. This can

be expressed with the following 1, 2:

∆x

∆t

− x

−t

(1)

∆y

∆t

− y

−t

(2)

where v represents the velocity, ∆x is the change in

the x coordinate, and ∆t t is the time interval.

After creating the velocity variable, the acceler-

ation variable is derived from the rate of change in

velocity, calculated by dividing the change in the ve-

locity of eye movements by the time intervals. This

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework

265

can be represented by the following 3:

a =

∆v

∆t

− v

−t

(3)

where a denotes acceleration and ∆v is the change in

velocity over a speciﬁc time interval.

4.2.2 Eye Movement State Features

Designing features to capture multiple-step saccades

in PD provides crucial clinical insights that are of-

ten difﬁcult to extract from raw data. Based on

I-VT (identifying velocity threshold) (Birawo and

Kasprowski, 2022), Remodnav classiﬁes eye move-

ments as ﬁxation, saccade, post-saccadic oscillation,

and smooth pursuit. Movements above a velocity

threshold are labeled as saccades, while those below

are ﬁxations (Dar et al., 2021). Eye movements that

exceed a speciﬁc velocity threshold are labelled as

saccades, while those below the threshold are clas-

siﬁed as ﬁxations.

In multiple-step saccades, several short ﬁxation

periods typically occur during the saccade. These dis-

ruptions lead to an increased frequency of saccades

and ﬁxations in PD (Blanke and Seeck, 2003). To

analyze multiple-step saccades more effectively, we

modiﬁed the existing algorithm to enable binary de-

tection of only ﬁxation and saccade. By utilizing Re-

modnav, we detect the irregular, multiple short ﬁxa-

tions that occur between saccades.

Figure 5(a) shows the classiﬁcation of eye move-

ments into ﬁxations and saccades during the pro-

saccade task in PD, while Figure 5(b) presents the re-

sults for healthy controls (HC). Figure 5(c) compares

the two groups with a histogram, highlighting higher

counts of saccades and ﬁxations in PD due to frequent

multiple-step saccades.

The proposed framework integrates medical char-

acteristics of eye movements, effectively capturing

clinically signiﬁcant abnormalities and improving

classiﬁcation performance in PD. This feature engi-

neering process resulted in eight-dimensional data:

timestamp, eye position, eye velocity, eye accerler-

ation, and eye movement states.

4.3 Time-Series Data into Images

To leverage the image classiﬁcation capabilities of

CNN, approaches have been developed to transform

time-series data into images, enabling both visualiza-

tion and multi-perspective analysis. Wang and Oates

(Wang and Oates, 2015) encoded time-series data as

Gramian Angular Fields (GAF) and Markov Transi-

tion Fields (MTF), while Hatami et al. (Hatami et al.,

2017) used Recurrence Plots (RP), achieving higher

Figure 5: Results of eye movement state detection. (a) PD

results, (b) HC results, (c) Comparison of PD and HC.

accuracy than traditional methods. Each encoding

technique captures unique features: GAF preserves

temporal correlations, MTF reﬂects dynamic transfor-

mations, and RP captures texture and long-term cor-

relations (Quan et al., 2023).

In this study, we apply these encoding methods

to all variables and combine the resulting images to

represent multidimensional time-series data as a con-

catenated image. Figure 6 presents the encoded PD

and HC images, with encoding techniques detailed in

subsequent sections.

4.3.1 Gramian Angular Field (GAF)

GAF encodes time-series data into polar coordinates

to represent static relationships between data points.

First, the time-series data is normalized to ﬁt within

the range [-1, 1] or [0, 1]. Each data is then encoded

as the cosine of an angle, with time represented as the

radius in the polar coordinate system. The inner prod-

uct between data points is calculated to generate a ma-

trix, which is visualized as an image. GAF preserves

the temporal dependencies of the time-series data and

highlights key features through angular information,

facilitating the analysis of static characteristics. Fig-

ure 6 (a) shows the encoded time-series images of the

PD and HC data using GAF.

HEALTHINF 2025 - 18th International Conference on Health Informatics

266

Figure 6: Encoded time-series images of PD and HC. (a)

GAF. (b) MTF. (c) RP.

4.3.2 Markov Transition Field (MTF)

MTF emphasizes dynamic changes in time series by

encoding temporal transition probabilities. The time-

series data is divided into multiple quantiles, and the

transition probabilities between these quantiles are

calculated. The MTF matrix arranges the transition

probabilities along the time axis, visually encoding

the transitions over time intervals. MTF is effective

in capturing dynamic properties such as temporal pat-

terns and transitions. Figure 6 (b) shows the encoded

time-series images of the PD and HC data using MTF.

4.3.3 Recurrence Plot (RP)

RP encodes repetitive patterns in time series by gen-

erating a matrix based on time-point similarity. Cal-

culates the Euclidean distance and assigns 1 if it is

below a threshold and 0 otherwise. RP emphasizes

periodicity and is useful for analyzing nonlinearity of

dynamic systems. It can also identify various states

such as trends, laminar states, and drifts in the data.

Figure 6 shows the encoded time-series images of PD

and HC data using RP.

4.3.4 Image Concatenation

When all 8 variables are transformed into encoded

time-series images using the three techniques—GAF,

MTF, and RP—they are concatenated along the chan-

nel dimension, resulting in a tensor-shaped (32, 32,

24). The ﬁrst two dimensions represent the spatial

size of the images, while the last dimension corre-

sponds to the number of channels formed by the con-

catenation of images transformed using GAF, MTF,

and RP techniques. Figure 8 illustrates the structure

of the concatenated time-series images.

4.4 Proposed Multimodal Intermediate

Fusion Network

Converting one-dimensional time-series pupil data

into two-dimensional image data effectively reveals

both numerical and temporal information from the

time-series. This conversion provides insights into

correlations, similarities, and quantity of informa-

tion at various time points, thereby improving the

accuracy of classiﬁcation using networks. However,

some detailed information from the original one-

dimensional data may be lost during the encoding

process (Quan et al., 2023).

Therefore, the proposed network consists of two

main components: CNN for processing encoded time-

series image data and Transformer for analyzing time-

series data. The features extracted from both com-

ponents are concatenated, allowing the network to

fuse information from both modalities and enhancing

its ability to capture complex patterns, thus improv-

ing classiﬁcation performance. The overall network

structure is illustrated in Figure 7.

4.4.1 CNN for Encoded Time-Series Images

The ﬁrst component of the proposed network is a

CNN for processing encoded time-series image data.

CNN captures the global information from the eye-

tracking data from encoded time-series images.

After encoding the time-series data into images

using image encoding techniques, the encoded time-

series images serve as input to CNN. The ﬁrst con-

volutional layer uses 64 ﬁlters with a 3x3 kernel to

extract features from the images, applying the ReLU

activation function and L2 regularization to prevent

overﬁtting. A MaxPooling layer is then applied to

reduce spatial dimensions by half, focusing on im-

portant features while reducing computational costs.

The second convolutional layer also uses 32 ﬁlters to

extract additional features, followed by another Max-

Pooling layer to further reduce the dimensions. Sub-

sequently, the Flatten layer converts the output into a

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework

267

Figure 7: Multimodal intermediate fusion network of CNN and Transformer for detecting abnormal eye movements in PD.

Figure 8: All 8 variables are encoded into GAF, MTF, and

RP images of size (32x32), resulting in a total of 24 im-

ages. These 24 images are then concatenated, so each data

instance is a tensor of shape (32, 32, 24).

one-dimensional vector, while the dense and dropout

layers introduce non-linearity and prevent overﬁtting.

4.4.2 Transformer for Time-Series Data

The second component of the proposed network is a

Transformer for processing original time-series data,

focusing on effectively learning the temporal patterns

within the sequence data.

It starts with an input layer deﬁning the input

shape based on the sequence length and features per

time point. Positional encoding preserves tempo-

ral information, and the Multi-Head Attention layer

computes attention scores to focus on relevant parts

of the sequence. Dropout prevents overﬁtting, fol-

lowed by layer normalization and residual connec-

tions to enhance stability. The feed-forward network

includes two dense layers introducing non-linearity,

with dropout after the ﬁrst layer. Finally, a Global

Average Pooling layer summarizes sequence informa-

tion, and a dense layer captures complex features.

4.4.3 Concatenation of Features

The features extracted from CNN and Transformer

are combined through a concatenate layer. This inter-

mediate fusion structure combines information from

two distinct input modalities, helping to enhance the

network’s performance. The combined features are

then passed through the ﬁnal classiﬁcation layer to

generate the prediction outputs. Thus, the proposed

CNN-Transformer intermediate fusion structure ef-

fectively combines image and time-series data, har-

nessing the strengths of both modalities.

5 RESULTS

To verify the effectiveness of multimodal data, we

evaluated the proposed network with three data

modalities: time-series data, encoded time-series im-

age data, and their fusion. For encoded time-series

images and fused data, seven combinations of en-

coded image data were compared, as shown in Table

3, using accuracy, precision, recall, and F1-score.

The network using only time-series data showed

limited results, with 26% precision and 51% recall,

suggesting that temporal information alone is insufﬁ-

cient to capture complex eye movement patterns.

In the case of the network using encoded time-

HEALTHINF 2025 - 18th International Conference on Health Informatics

268

Table 2: Summary of classiﬁcation results by data modality and image encoding techniques, showing performance met-

rics—Accuracy (Acc), Precision (Prec), Recall (Rec), and F1-score (F1). The table illustrates the impact of various combina-

tions on classiﬁcation effectiveness, providing insights into the most effective approaches.

Data Modality Encoding Techniques Acc (%) Prec (%) Rec (%) F1 (%)

Time-series (Transformer) - 57 26 51 35

Encoded time-series image (CNN)

GAF 74 80 74 73

MTF 51 26 51 35

RP 72 72 72 72

GAF + MTF 74 75 74 74

MTF + RP 64 65 64 64

GAF + RP 72 72 72 72

GAF + MTF + RP 81 81 81 81

Fusion of time-series and

encoded time series image

(Multimodal Intermediate Fusion

Network)

GAF 77 (+3) 77 (-3) 77 (+3) 77 (+4)

MTF 57 (+6) 69 (+43) 57 (+6) 51 (+16)

RP 81 (+9) 78 (+6) 77 (+5) 76 (+4)

GAF + MTF 81 (+7) 81 (+6) 81 (+7) 81 (+7)

MTF + RP 77 (+13) 78 (+13) 77 (+13) 76 (+12)

GAF + RP 77 (+5) 77 (+5) 77 (+5) 77 (+5)

GAF + MTF + RP 87 (+6) 88 (+7) 87 (+6) 87 (+6)

Table 3: The classiﬁcation report of the best-performing

network (Fusion of time-series and encoded time-series im-

ages: GAF, MTF, RP).

Class Prec (%) Rec (%) F1 (%)

0.0 (PD) 82 96 88

1.0 (HC) 95 78 86

Accuracy - - 87

Macro avg 88 87 87

Weighted avg 88 87 87

series images only, MTF showed the lowest perfor-

mance (26% precision, 51% recall) among conditions

using a single type of encoded time-series image.

In contrast, GAF achieved the highest performance

(80% precision, 74% recall), indicating better visual

pattern recognition. In the case of RP showed mod-

erate performance (72% precision, 72% recall). The

results using two types of encoded image data did not

show signiﬁcant changes in performance. However,

when all three types of encoded image data were used,

the result showed the best and well-balanced perfor-

mances, and this surpasses all other combinations.

In the case of the intermediate fusion network us-

ing both time-series and encoded time-series images,

performance improved signiﬁcantly across all seven

encoded data combinations compared to the network

using only encoded image data. In particular, when

the three types of encoded image data were used, the

network achieved an accuracy, precision, recall, and

F1-score of 87%, 88%, 87%, and 87%, respectively,

showing an improvement of approximately 6%. This

represents the best performance among all conditions

presented in Table 2. The detailed classiﬁcation report

in Table 3 shows excellent performance in classifying

Figure 9: ROC Curve and confusion matrix of the best-

performing network (Fusion of time-series and encoded

time-series images: GAF, MTF, RP).

PD and HC, with a PD recall of 96%.

6 DISCUSSION

6.1 Performance Analysis

Based on the research ﬁndings, the proposed multi-

modal intermediate fusion network to diagnose PD

using eye movements offers the following insights.

• As shown in Table 2, the proposed frame-

work achieved a particularly high recall of 96%,

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework

269

Table 4: Comparison between Previous studies and Proposed framework. The classiﬁcation report for previous studies is

calculated using the confusion matrix provided in those studies.

(Brien et al., 2023) (de Villers-Sidani

et al., 2023)

(Jiang et al., 2024) Proposed framework

Participants 140 (PD:121, HC:106) 121 (PD:59, HC:62) 66 (PD:44, HC:22) 183 (PD:84, HC:99)

Eye Movements Pro-saccade,

Anti-saccade

Fixation, Pro-saccade,

Anti-saccade

Fixation,

Saccade, Synthetic

Pro-saccade

Experimental Setup 17-inch monitor,

600mm eye-screen

distance, 9-point cali-

bration grid

12.9-inch iPad Pro,

45cm eye-screen dis-

tance, calibration with

moving target

VR headset, seated in

detection range of a

3D locator, calibration

with scene image

50-inch monitor,

1000mm eye-screen

distance, 5-point cali-

bration grid

Data Analysis Method-

ologies

A voting classiﬁer

combining support

vector machine, lo-

gistic regression, and

random forest.

Logistic regression

with ridge regular-

ization and random

undersampling.

K-Nearest Neighbors,

Support Vector Ma-

chine and Random

Forest

Multimodal

Intermediate

Fusion Network

Classiﬁcation Report

Accuracy 81% 90% 83% 87%

PD Precision: 83%

Recall: 79%

F1-Score: 81%

Precision: 93%

Recall: 87%

F1-Score: 90%

Precision: 86%

Recall: 91%

F1-Score: 89%

Precision: 82%

Recall: 96%

F1-Score: 88%

HC Precision: 78%

Recall: 82%

F1-Score: 80%

Precision: 86%

Recall: 92%

F1-Score: 89%

Precision: N/A

Recall: N/A

F1-Score: 82%

Precision: 95%

Recall: 78%

F1-Score: 86%

demonstrating its ability to accurately identify pa-

tients with PD and minimize missed diagnoses.

This suggests that the proposed deep learning-

based approach for the diagnosis of PD can be

highly effective for the early detection of PD.

• In Table 3, the performance of the single network

using only time-series data was insufﬁcient. How-

ever, the multimodal intermediate fusion network

showed an average performance improvement of

over 6%. These results indicate that the fused net-

work effectively addresses the limitations of tra-

ditional machine learning techniques, which often

fail to capture complex interactions between time

points in eye-tracking data.

• In particular, under the condition of using a single

MTF, the fused network outperformed the single

network by using only the encoded image data by

43%. Furthermore, the fused network exhibited

balanced performance in all metrics, including ac-

curacy, precision, and recall, reﬂecting its ability

to generalize effectively.

These insights show that the proposed framework

not only improves diagnostic accuracy, but also pro-

vides a robust and generalized solution to analyze ab-

normalities in eye movement. In addition, the ROC

curve and the confusion matrix for the best perform-

ing model are presented in Figure 9. These demon-

strate that the proposed framework achieves excellent

classiﬁcation performance while maintaining a strong

balance between sensitivity and speciﬁcity, even with

a high recall for PD.

6.2 Limitations

This study demonstrated the effectiveness of using

abnormal eye movement information within a fused

deep learning framework for the early diagnosis of

PD. However, to enable the proposed method to be

used more effectively as a biomarker in practical ap-

plications, additional improvements are required ad-

dressing the following limitations.

• Ensuring consistent performance across diverse

environments and conditions requires more com-

prehensive data. The data used in this study were

collected under speciﬁc experimental conditions,

which may limit the generalizability to clinical

settings. In particular, the DTW-based data clean-

ing process can exclude HC with age-related ab-

normalities in eye movement. These could lead to

overﬁtting or reduced performance, highlighting

the need to integrate data from a broader range of

conditions and environments.

• Although the proposed framework demonstrated

HEALTHINF 2025 - 18th International Conference on Health Informatics

270

better recall to identify PD compared to previous

studies shown in Table 4, further enhancements

in overall performance and interpretability are es-

sential for clinical application. The ”black-box”

nature of deep learning complicates the under-

standing of its decision-making processes, mak-

ing further research necessary to improve the in-

terpretability of the model.

• Also, Table 4 highlights the types of eye move-

ment examined in previous studies versus the pro-

posed framework. While earlier research utilized

ﬁxation, pro-saccade, and anti-saccade move-

ments, this study restricted its focus to horizontal

pro-saccades. Future clinical applications will re-

quire models trained on a broader spectrum of eye

movement types.

6.3 Future Work

To address the limitations mentioned above, this study

proposes a future work utilizing smart glasses, as il-

lustrated in Figure 10. The proposed smart glasses

are equipped with an eye camera to collect eye move-

ment data and a scene camera to gather information

about the surrounding environment. This enables the

following complementary research directions:

• Eye movement data collected using VOG devices

are limited to predeﬁned eye movement tasks per-

formed in a controlled laboratory environment,

which may result in diagnostic errors due to in-

dividual differences such as tension and concen-

tration. However, by utilizing smart glasses, it be-

comes possible to track abnormal eye movements

without requiring outpatient visits. Furthermore,

data collection can occur in a relaxed environment

where users are naturally situated, allowing the

acquisition of purer eye movement information.

• While eye movement abnormalities are a hallmark

prodromal symptom of PD that reﬂects cognitive

mechanism impairments, other indicators of cog-

nitive dysfunction may also be observed. For ex-

ample, motor symptoms such as freezing of gait

and non-motor symptoms such as visual halluci-

nations are key diagnostic clues for PD. By lever-

aging smart glasses, it is possible not only to track

eye movement data but also to use the built-in

front-facing camera and IMU sensors to collect

data on these motor and non-motor symptoms.

By applying the framework of this study to

smart glasses, it becomes possible to develop digital

biomarkers capable of seamless data collection in ev-

eryday environments. This approach takes advantage

of the growing market for smart glasses to enhance

Figure 10: Conﬁguration of the smart glasses-based visual

perception analysis system for early PD diagnosis.

the practical applicability of early PD diagnostic tech-

nologies. Additionally, multimodal data analysis en-

compassing eye movements, motor and non-motor

symptoms can improve the speciﬁcity of PD-focused

biomarkers. This will enable the effective distinction

between PD and other neurodegenerative disorders,

contributing to advancements in related technologies.

7 CONCLUSION

This study presents a novel approach to classify PD

and HC using eye-tracking data obtained via VOG.

The proposed multimodal framework is an intermedi-

ate fusion method that combines a CNN for process-

ing encoded time-series images and a Transformer for

analyzing raw time-series data. The experimental re-

sults demonstrate the ability of the framework to de-

tect subtle abnormalities in eye movements, achiev-

ing a recall rate of 96% for PD. These ﬁndings sug-

gest that eye-tracking data could serve as a biomarker

for early-stage PD diagnosis. Looking ahead, as eye-

tracking technology becomes more common in aug-

mented reality (AR) or virtual reality (VR) devices

like smart glasses, this framework could support early

self-diagnosis, disease progression monitoring, and

remote detection of PD, enabling practical applica-

tions in real-world healthcare scenarios.

ACKNOWLEDGEMENTS

This work was ﬁnancially supported by the Ko-

rea Institute of Science and Technology Institutional

Program (Project No. 2E33841) and the National

Research Foundation of Korea (NRF) grant funded

by the Korea government (MSIT) (No. RS-2023-

00279304).

Early Diagnosis of Parkinson’s Disease via Pro-Saccadic Eye Movement Analysis: Multimodal Intermediate Fusion Framework

271

REFERENCES

Birawo, B. and Kasprowski, P. (2022). Review and eval-

uation of eye movement event detection algorithms.

Sensors.

Blanke, O. and Seeck, M. (2003). Direction of saccadic and

smooth eye movements induced by electrical stimu-

lation of the human frontal eye ﬁeld: effect of orbital

position. Experimental Brain Research, 150:174–183.

Brien, D. C. et al. (2023). Classiﬁcation and staging of

parkinson’s disease using video-based eye tracking.

Parkinsonism & Related Disorders, 110:105316.

Dar, A. H., Wagner, A. S., and Hanke, M. (2021). Remod-

nav: robust eye-movement classiﬁcation for dynamic

stimulation. Behavior Research Methods.

de Villers-Sidani, E. et al. (2023). A novel tablet-based

software for the acquisition and analysis of gaze and

eye movement parameters: a preliminary validation

study in parkinson’s disease. Frontiers in Neurology,

14:1204733.

Eivazi, S., Santini, T., Keshavarzi, A., K

ubler, T., and

Mazzei, A. (2019). Improving real-time cnn-based

pupil detection through domain-speciﬁc data augmen-

tation. In Proceedings of the 2019 Symposium on

Eye Tracking Research and Applications (ETRA ’19),

page 6, Denver, CO, USA. ACM.

Fawaz, H. I., Forestier, G., Weber, J., et al. (2019). Deep

learning for time series classiﬁcation: a review. Data

Mining and Knowledge Discovery.

Frei, K. (2020). Abnormalities of smooth pursuit in parkin-

son’s disease: A systematic review. Clinical Parkin-

sonism & Related Disorders, 4:100085.

Haslwanter, T. and Clarke, A. H. (2010). Chapter 5—eye

movement measurement: Electro-oculography and

video-oculography. In Elsevier, editor, Vertigo and

Imbalance: Clinical Neurophysiology of the Vestibu-

lar System, volume 9, pages 61–79. Amsterdam, The

Netherlands.

Hatami, N., Gavet, Y., and Debayle, J. (2017). Classiﬁ-

cation of time-series images using deep convolutional

neural networks. In Tenth International Conference

on Machine Vision (ICMV 2017), volume 10696, page

106960Y. International Society for Optics and Photon-

ics.

Hindle, J. V. (2010). Ageing, neurodegeneration and

parkinson’s disease. Age and Ageing, 39(2):156–161.

Jiang, M., Liu, Y., Cao, Y., Liu, Y., Wang, J., Li, P., Xia, S.,

et al. (2024). Auxiliary diagnostic method of parkin-

son’s disease based on eye movement analysis in a vir-

tual reality environment. Neuroscience Letters.

Koch, N. A., Voss, P., Cisneros-Franco, J. M., et al. (2024).

Eye movement function captured via an electronic

tablet informs on cognition and disease severity in

parkinson’s disease. Scientiﬁc Reports, 14:9082.

Lal, V. and Truong, D. (2019). Eye movement abnormalities

in movement disorders, volume 1, pages 54–63.

Li, H., Zhang, X., Yang, Y., and Xie, A. (2023). Abnormal

eye movements in parkinson’s disease: From experi-

mental study to clinical application. In Parkinsonism

& Related Disorders. Elsevier.

Ma, W., Li, M., Wu, J., Zhang, Z., Jia, F., Zhang, M.,

Bergman, H., Li, X., Ling, Z., and Xu, X. (2022).

Multiple step saccades in simply reactive saccades

could serve as a complementary biomarker for the

early diagnosis of parkinson’s disease. Frontiers in

Aging Neuroscience, 14:912967.

uller, M. (2007). Dynamic time warping. Springer.

Pretegiani, E. and Optican, L. M. (2017). Eye movements

in parkinson’s disease and inherited parkinsonian syn-

dromes. Frontiers in Neurology, 8:592.

Przybyszewski, A. W.,

Sledzianowski, A., Chudzik, A.,

Szluﬁk, S., and Koziorowski, D. (2023). Ma-

chine learning and eye movements give insights into

neurodegenerative disease mechanisms. Sensors,

23:2145.

Quan, S., Sun, M., Zeng, X., Wang, X., and Zhu, Z. (2023).

Time series classiﬁcation based on multi-dimensional

feature fusion. IEEE Access.

Rascol, O. et al. (1989). Abnormal ocular movements

in parkinson’s disease: Evidence for involvement of

dopaminergic systems. Brain, 112:1193–121.

Senin, P. (2008). Dynamic time warping algorithm review.

In Department of Information and Computer Science.

ResearchGate.

Sirocchi, C., Bogliolo, A., and Montagna, S. (2024).

Medical-informed machine learning: integrating prior

knowledge into medical decision systems. BMC Med-

ical Informatics and Decision Making.

Tinelli, M., Kanavos, P., and Grimaccia, F. (2016). The

value of early diagnosis and treatment in Parkinson’s

disease: A literature review of the potential clinical

and socioeconomic impact of targeting unmet needs

in Parkinson’s disease. London.

Tolosa, E., Garrido, A., Scholz, S. W., and Poewe, W.

(2021). Challenges in the diagnosis of parkinson’s dis-

ease. Lancet Neurology, 20:385–397.

Turcano, P., Chen, J. J., Bureau, B. L., and Savica, R.

(2019). Early ophthalmologic features of parkinson’s

disease: a review of preceding clinical and diagnostic

markers. Journal of Neurology, 266:2103–2111.

Wang, Z. and Oates, T. (2015). Encoding time series as

images for visual inspection and classiﬁcation using

tiled convolutional neural networks. In Workshops at

the Twenty-Ninth AAAI Conference on Artiﬁcial Intel-

ligence. cdn.aaai.org.

White, O. B., Saint-Cyr, J. A., Tomlinson, R. D., and

Sharpe, J. A. (1983). Ocular motor deﬁcits in parkin-

son’s disease. Brain, 106:571–587.

Yang, Z. Y., Zhang, Y., and Yu, L. N. (2024). Predicting

bank users’ time deposits based on lstm-stacked mod-

eling. Acadlore Transactions on Machine Learning.

Zhang, J., Zhang, B., Ren, Q., et al. (2021). Eye movement

especially vertical oculomotor impairment as an aid

to assess parkinson’s disease. Neurological Sciences,

42:2337–2345.

S¸ tef

anescu, E., Chelaru, V. F., Chira, D., and Mures¸anu,

D. (2024). Eye tracking assessment of parkinson’s

disease: a clinical retrospective analysis. Journal of

Medicine and Life.

HEALTHINF 2025 - 18th International Conference on Health Informatics

272