Classification of EEG Motor Imagery Tasks Utilizing 2D Temporal

Patterns with Deep Learning

Anup Ghimire and Kazim Sekeroglu

Department of Computer Science, Southeastern Louisiana University, Hammond, LA, U.S.A.

Keywords: Deep Learning, Brain Computer Interface, Spatiotemporal Deep Learning, Hierarchical Deep Learning, EEG

Activity Recognition, Motor Imagery Task Recognition, Machine Learning, Convolutional Neural Network.

Abstract: This study aims to explore the decoding of human brain activities using EEG signals for Brain Computer

Interfaces by utilizing a multi-view spatiotemporal hierarchical deep learning method. In this study, we ex-

plored the transformation of 1D temporal EEG signals into 2D spatiotemporal EEG image sequences as well

as we explored the use of 2D spatiotemporal EEG image sequences in the proposed multi-view hierarchical

deep learning scheme for recognition. For this work, the PhysioNet EEG Motor Movement/Imagery Dataset

is used. Proposed model utilizes Conv2D layers in a hierarchical structure, where a decision is made at each

level individually by using the decisions from the previous level. This method is used to learn the spatiotem-

poral patterns in the data. Proposed model achieved a competitive performance compared to the current state

of the art EEG Motor Imagery classification models in the binary classification paradigm. For the binary

Imagined Left Fist versus Imagined Right Fist classification, we were able to achieve 82.79% average vali-

dation accuracy. This level of validation accuracy on multiple test dataset proves the robustness of the pro-

posed model. At the same time, the models clearly show an improvement due to the use of the multi-layer

and multi-perspective approach.

1 INTRODUCTION

This work aims to create a deep learning method to

recognize spatial and temporal patterns in EEG sig-

nals generated by the brain. The trained model could

be utilized to make predictions about the motor move-

ments based on the signals received from the EEG

machine. EEG data has been used to analyze brain ac-

tivity to identify neurological disorders and to recog-

nize patterns in brain activities related to various mo-

tor movements or even imagination of such motor ac-

tivities. These signals from the brain are measured at

specific locations on the skull and the usual approach

is to apply signal processing techniques to that data

for classification. Instead of using the usual 1D signal

data in the conventional way, this work attempts to

combine the readings of these sensors to form an “im-

age” of the brain. This opens possibilities of using

computer vision techniques in order to recognize spe-

cific patterns in the brain activities.

Deep learning is a state-of-the-art (SoA) method

in terms of image classification (Voulodimos, Dou-

lamis, Doulamis, & Protopapadakis, 2018). Trans-

forming single dimensional EEG signal data into a 2D

signal data allows the use of various image classifica-

tion techniques like convolutions in order to form

generalized predictions. Furthermore, transforming

the data in this way still preserves the temporal infor-

mation. It has been shown that analyzing both spatial

and temporal information in signals can improve the

accuracy of classification models for time series data

(Saha and Fels, 2019). This work attempts to use a

spatiotemporal deep learning method in order to rec-

ognize brain activities using EEG signals.

A Brain Computer Interface (BCI) is a system that

communicates the patterns of activities of a user’s

brain to an interactive system. In other words, this

could be a system where the only input is the signals

coming from the user’s brain. As an example, this

could be a user controlling the mouse pointer using

only their brain, i.e., imagining the pointer going in a

specific direction in order to make it do so. This

makes BCI an important tool for motor-impaired us-

ers to be able to use assistive systems such as text in-

put, smart prosthetic devices, wheelchairs, etc. Motor

Imagery (MI) is the process of mentally simulating a

given action. For example, moving an arm is a phys-

ical task, whereas imagining or thinking of moving an

182

Ghimire, A. and Sekeroglu, K.

Classiﬁcation of EEG Motor Imagery Tasks Utilizing 2D Temporal Patterns with Deep Learning.

DOI: 10.5220/0011069400003209

In Proceedings of the 2nd International Conference on Image Processing and Vision Engineering (IMPROVE 2022), pages 182-188

ISBN: 978-989-758-563-0; ISSN: 2795-4943

arm is the corresponding MI task. The models trained

in this work can be used to recognize EEG signals for

BCI as well as for the diagnosis of neurological dis-

orders by learning patterns in the EEG MI task data.

There has been some significant work in the field

of EEG MI task classification using deep learning re-

cently. Some of the methods used for these classifica-

tion tasks have a consistent pattern in the use of pre-

processing techniques as well as the methodology for

the classification process. Whenever the dataset in-

cludes a significant number of subjects, it appears

there is minimal need for preprocessing. There is also

a consistent use of pattern recognition methods that

use both spatial and temporal pattern learning tech-

niques in a fusion architecture.

Roots et al. worked with the BCI Competition IV

dataset with 103 subjects (Roots, Muhammad, & Mu-

hammad, 2020). They used bandpass and notch filters

on their time series data and used a fusion architecture

to classify MI Right Fist versus MI Left Fist. Their

model, which uses fusion of spatial and temporal fea-

tures achieved 83% validation accuracy for the binary

model. Wang et al. used the PhysioNet dataset for

their 2-class, 3-class and 4-class classification models

(Wang, et al., 2020). This work used no preprocessing

on the full 109 subject dataset. Their model was based

on the EEGNet structure. It used Conv2D layers to

learn spatiotemporal information with fusion struc-

ture. Their models achieved 75.07% and 82.50% val-

idation accuracy on the 3-class and 2-class models re-

spectively on MI Right Fist, MI Left Fist and MI Feet

labels. Dose et al. also used the full PhysioNet dataset

with 109 subjects (Dose, Møller, & Iverson, 2018).

They used no preprocessing method either. Their

model was trained on the global dataset and then fine-

tuned for each subject separately. Their 3 class clas-

sification had 68.82% validation accuracy while their

binary classification had 80.38% accuracy on their

global classifier.

In this study, we used the EEG Motor Move-

ment/Imagery Dataset that is a collection of 14 exper-

imental runs (Schalk, McFarland, Hinterberger,

Birbaumer, & Wilpaw, 2004). Each run was a motor

imagery recording performed by 109 subjects. This

dataset provides more than 1,500 such EEG record-

ings and is considered the largest EEG motor move-

ments and imagery dataset available (Goldberger, et

al., 2000).The subjects’ brain activity was recorded

while performing each of the four tasks:

1. Open and close the right or left fist

2. Imagine opening and closing the right or left fist

3. Open and close both fists or both feet

4. Imagine opening and closing both fists or both

feet

This paper is organized as follows: the previous

section introduced the problem, described the dataset

and explained some related work performed in the area

of EEG task classification; followed by the next chap-

ter that goes over the tranformation of raw EEG signals

into 3 dimensional image sequences representing each

MI task. The next chapter also describes the structure

of the multi-view hierarchical fusion model. The third

chapter goes over the results and discussions. Finally

the last chapter draws a conclusions and describes

some possible future direction for this work.

2 METHODS

2.1 Creating 2D Spatiotemporal EEG

Image Sequences

The raw EEG signals consist of multiple 1D time se-

ries data that show the electrical activity at specific

locations on the skull. The placement of the elec-

trodes is based on the international 10-10 system as

shown in Figure 1.

This collection of 1D series data is then trans-

formed into a time series of 2D data. The signal ac-

quired over a period [t, t+N] from each channel of the

EEG system can be represented by

=ൣc

, c

t+1

, c

t+2

, c

t+3

, …, c

t+N

൧ (1)

where i is the index of the channel and is the EEG

data acquired from the ith channel at time t. EEG data

collected from n number of channels over a period [t,

t+N] can be represented by matrix S as provided in

Figure 2. Each row of the matrix S is corresponding

to EEG data collected from a single channel over the

period [t, t+N], and each column of the matrix S is

corresponding to EEG data collected through all

channels at time t.

These new spatiotemporal images were created by

transforming each column matrix S into a 2D image,

as shown in Figure 2. This was done by mapping c

into a 9x9 matrix based on the actual location of

the electrodes on the head where the data was ac-

quired, as shown in Figure 1. This is the standard 10-

10 system of placement of electrodes for recording

EEG data. For example, the data acquired from the

first channel at time t is placed in the 3rd row and the

2nd column of the matrix S, which is the same loca-

tion where the first electrode is placed on the skull. In

the same figure, the pixel values marked as x are

empty values as there are no electrodes corresponding

to them. These are placeholders. This transformation

process is illustrated in Figure 2.

Classiﬁcation of EEG Motor Imagery Tasks Utilizing 2D Temporal Patterns with Deep Learning

183

Figure 1: 10-10 system of electrodes.

As seen in Figure 2, a 2D spatiotemporal EEG im-

age sequence is created by transforming each column

of the matrix S into a 2D image. Each of the frames It

to It+N in the given sequences is temporally related.

Figure 2: Transformation of 1D EEG signals to 2D image

sequences.

2.2 Data Preparation

At this point, the data points ranged from -0.000655 to

0.000667. The data needed to be transformed a more

meaningful range. Thus, normalization was required.

Z-score normalization was applied to the data. A z-

score, also known as a standard score, is a measure of

how far from the mean any data point is (Hayes, 2021).

A z-score normalization is a data transformation

method where each data point is replaced by the z-

score. This normalization was performed such that the

placeholder values, shown in Figure 2 as x values, did

not skew the relevant data. Each EEG reading was re-

placed by its z-score, which is given by the formula:

z = (x - ) / 

(2)

In the beginning, the raw EEG data was a collec-

tion of 1D sequences of data recorded by electrodes

at various locations on the human head. Then those

1D sequences were transposed into 2D arrays such

that they became images of the human head looking

from the top. The 1D arrays were transformed in a

logical manner that would clearly represent the posi-

tion of the electrodes on the head where the readings

were taken. Transforming the 1D data into 2D data

was a way to preserve the features while creating new

features that would represent the spatial correlations

between nearby electrodes as well as preserve the

temporal information from the original sequences.

The image sequences now represent the brain ac-

tivity of the subjects while performing the specific

motor actions with relation to time. Those sequences

of images can be thought of as a video of the activity

of the brain while those motor movements were per-

formed or imagined. Each action was carried out by

the subjects for a short period of time, which means

each action that was performed would be represented

by a series of those images. These images would be

back-to-back, creating one cohesive chunk of se-

quences that would be corresponding to only one ac-

tion that was performed. Because of this, it would

make more sense to unify all the images that were part

of just one action. This would mean separating the

images into one more dimension that would represent

the activity from start to finish. Analyzing the dataset

and the rate of recording of the data, it seemed every

single action was represented by roughly 650 data

points. This meant a collection of 650 back-to-back

9x9 images represented one single action. This data

transformation is further illustrated in Figure 3.

Figure 3: Creating 3D temporal data from 2D sequences.

In Figure 3 (a) the image sequences with the same

label are back-to-back with each other. Those image

IMPROVE 2022 - 2nd International Conference on Image Processing and Vision Engineering

184

sequences are combined in a stack of 650 images. In-

stead of each of the 650 images individually having a

label of 0, the whole block now has the label 0 as

shown in Figure 3 (b). Each of those blocks now has

shape (650,9,9).

2.3 Multiview Spatiotemporal

Hierarchical Deep Fusion Learning

Model for BCI

Although the transformed EEG data consists of 2D

spatiotemporal EEG images, EEG data collected over

a period [t, t+N] can be considered as 3D data in

which two of the dimensions are on the spatial do-

main and the third dimension is on the time domain,

as illustrated in Figure 2. In order to learn the spatio-

temporal patterns in the image sequences of the EEG

dataset, we required Deep Learning models capable

of modeling 3-dimensional data. As seen in the intro-

duction section of this work, there are two common

approaches for deep learning models for classifica-

tion of EEG data. The first approach is models that

learn spatial and temporal patterns in the data. The

second approach is models that utilize the fusion ar-

chitecture where different parts of the system learn

different patterns. In this approach, all the learned

patterns are then added to make a more cohesive sys-

tem. This work seeks to make use of both approaches

for modeling our EEG data.

We propose a custom hierarchical model that con-

sists of several 2D Convolutional models working to-

gether to model data from different perspectives. The

idea for the custom hierarchical model was to be able

to learn spatial patterns in the data from 3 different

perspectives, which made the hierarchical system a

spatiotemporal model (Sekeroglu, Soysal and Li,

2019).

The proposed custom hierarchical model aims to

examine the data from 3 different perspectives as

shown in Figure 4. Until this point, the input data was

4-dimensional, which separates each action into 650

images. These 650 images are treated as one singular

data point. The new hierarchical model would create

2 new data points for the same action. These new data

points would be the same as above but with the axes

swapped. As shown in Figure 6, the view from S

plane provides the information regarding the col-

lected data from all channels at time t. However, the

views from TS

and TS

planes provide information

regarding the collected data from certain channels

over a period. Thus, the new proposed models will

learn patterns in the data from three different views:

the first view based on S

plane, the second view

based on TS

plane and the third view based on TS

plane. Since the number of frames in the first view,

which is based on S

plane, will be much greater

than the number of frames in the second and the third

view, we need to split the data collected over a period

of time [t, t+N] into smaller time slots by a sliding

window approach where the window size is 650.

Figure 4: Creating multiple perspectives from EEG data.

The S

data pipeline feeds in the “main view”

models where the spatiotemporal relationship in the

EEG data is learned by looking at each “frame” of the

data from the top. The other two “side view” models

where the EEG data is viewed from the side as well

as from below, represent complex temporal infor-

mation and create some distinct patterns in the data.

These two models represent the TS

and TS

planes

in the above figure. These three perspectives and

these 3 models should learn the features in the EEG

data in a cohesive manner that would make the mod-

els perform well.

As Figure 5 shows the full structure of the pro-

posed hierarchical model. It consists of multiple Con-

volutional Neural Networks arranged in terms of lev-

els. Each of these levels learn different patterns in the

data. The first one is called the MF (Module Frame)

level. The models in this level (MF1, MF2, and MF3)

learn patterns directly from the image sequences or

image frames. These are the convolutional deep

learning models. The next level is called MP (Module

Plane) level. This level does not directly learn pat-

terns in the EEG data but learns patterns in the pre-

dictions made by the MF models. Then the MT (Mod-

ule Temporal) level is a concatenation of the 2 MP

models that specifically learn the temporal patterns,

i.e., MP2 and MP3. Then the last level is MST (Mod-

ule Spatio-Temporal). This is a concatenation of the

predictions from the MT model and the spatial pre-

diction from the MP1 model.

After each level, the outputs of the models are

concatenated and those predictions are used as input

for the models of the next level. The detailed structure

and hyper parameters for all three levels are provided

in Table 1.

Classiﬁcation of EEG Motor Imagery Tasks Utilizing 2D Temporal Patterns with Deep Learning

185

Table 1: Model layers and hyperparameters.

MF Models MP Models MT/MST

ut In

Conv2D

(8 filters)

Dense

(128 neurons)

Dense

(64 neurons)

MaxPoolin

2D Dro

out

(

0.4

)

Dro

out

(

0.4

)

Conv2D

(12 filters)

Dense

(64 neurons)

Dense

MaxPooling2D Dropout (0.4) Output

Flatten Flatten

Dense Dense

Out

ut Out

3 RESULTS AND DISCUSSION

As shown in Table 2, the data consisted of 9 labels.

Label 0 was the resting (control) action. Each subject

was asked to rest by the on-screen prompt before each

task was performed. This means that the label 0 is rep-

resented in the dataset more than any other label.

Table 2: Labels and corresponding tasks.

Label Tas

0 Rest

1 Open/Close Right Fist

2 O

en/Close Left Fist

3 Ima

ine O

en/Close Ri

ht Fist

4 Ima

ine O

en/Close Left Fist

5 Open/Close Both Feet

6 Open/Close Both Fists

7 Imagine Open/Close Both Fists

8 Ima

ine O

en/Close Both Feet

First, a baseline model with the full dataset and all

labels was trained. Training in this way, almost 50%

of the labels were 0. Because of the large class imbal-

ance as well as the high intra-class similarity between

the labels for physical and imaginary tasks, the opti-

mization of the loss was impossible. Hence, the

model never learned any features.

After this, most of the focus was switched to the

Motor Imagery (MI) tasks. The classification of EEG

patterns while the subjects imagined the tasks being

performed was the primary concern for this work. The

applications of this work are more dependent on the

accuracy of classification of the MI tasks than the

physical tasks. This is also consistent with the current

trend with the research work that was discussed ear-

lier in the paper. So, for this goal, the three most im-

portant labels were Imagine Open/Close Right Fist

(label 3), Imagine Open/Close Left Fist (label 4) and

Imagine Open/Close Both Feet (label 8). For training,

the first 10% and last 10% subjects were separated for

validation alternatively and the average accuracy

scores from the two were recorded.

Table 3: Accuracy for classification of 3 labels.

Models Accuracy (%)

MF1 36.97

MF2 54.4

MF3 48.03

MP1 46.42

MP2 67.08

MP3 56.18

MT 68.39

MST 69.08

Figure 5: Hierarchical fusion model architecture.



IMPROVE 2022 - 2nd International Conference on Image Processing and Vision Engineering

186

First the models were trained for 3 labels, MI

Right Fist (R), MI Left Fist(L) and MI Feet (F). Table

3 shows the accuracy values for each of the models

for the 3-label softmax classification. This tops out at

69.07% but the performance improvement from the

MF models to the MP models can better be seen in

Figure 6. There is also an improvement of the accu-

racy at the fusion models MT and MST. It is evident

that the fusion architecture with the multiple perspec-

tives helps mitigate the lower performance of the

MF1 and MP1 models. There is a clear improvement

in accuracy scores as we go further in the hierarchy

of the models.

Figure 6: Accuracy chart for 3-label classification.

Then the binary classification for the MI Right

Fist(R) and MI Left Fist(L) classes was performed.

Similar to above, the first 10% and last 10% of the

subjects were separated as a validation dataset and the

average scores for both the training runs were rec-

orded. Table 4 shows the accuracy as well as the F1

scores for the R versus L classification. The binary

classification achieves an average global validation

accuracy of 82.79% at the last fusion level. This is a

respectable score for an EEG MI task classification

for a global model with more than 100 subjects.

In addition to the models showing good perfor-

mance, in Figure 7 we can also see the model to model

improvement from the MF models to the MP models.

Also, similar to the 3 class classification, there is also

an improvement in the performance at the fusion levels

of the models. The first model (MF1) starts at around

50% accuracy, but the fusion structure means that the

overall system is able to compensate for its poor per-

formance. At the end of the hierarchical structure, the

other models completely make up for its lost perfor-

mance. This is yet another validating argument for the

fusion structure using the multi perspective approach.

Table 4: Accuracy and F1 score for binary classification.

Models F1 Score Accuracy (%)

MF1 53.67 52.76

MF2 70.99 74.41

MF3 61.44 60.77

MP1 41.56 59.91

MP2 81.12 83.98

MP3 65.83 69.21

MT 80.37 83.26

MST 79.60 82.79

Figure 7: Accuracy and F1 score chart for binary classifica-

tion.

4 CONCLUSION AND FUTURE

WORK

In this work, we used hierarchical deep learning mod-

els that learned spatiotemporal patterns in EEG data

for classification of motor imagery tasks. This work

has achieved robust performance in terms of binary

and respectable performance in terms of multi class

classification of those MI tasks.

Moreover, this work aimed to investigate the use-

fulness of fusion architecture and the multi view ap-

proach of learning the spatiotemporal information

from EEG data. The performance of the models has

shown that in general, a fusion architecture performs

better than a stand-alone model. Furthermore, we

were able to demonstrate an improvement in perfor-

mance of the models in the lower levels of the hierar-

chy. This validates the effectiveness of the hierar-

chical approach of the models in this work. In the re-

sults, we can also see that the side view models almost

always perform better than the main view models.

This has validated the use of the multi-view approach.

Classiﬁcation of EEG Motor Imagery Tasks Utilizing 2D Temporal Patterns with Deep Learning

187

Table 5: Performance comparison of this work with recent related works.

Work Preprocessing Model Architecture Dataset Classification Accuracy

Roots

et al.

Notch Filter

Bandpass Filter

Conv2D with different Kernel Sizes

Features fused together for softmax

BCI Competition

103 subjects

2 classes 83.00%

Wang

et al.

preprocessing

Conv2D

Temporal and Spatial

Fused together

Based on EEGNET

PhysioNet

109 subjects

4 classes

3 classes

2 classes

65.07%

75.07%

82.50%

Dose

et al.

preprocessing

1D CNN on raw EEG signals

Learn Spatial and Temporal Features on global dataset

Finetune globally trained model for per subject training

PhysioNet

109 subjects

4 classes

3 classes

2 classes

58.58%

68.82%

80.38%

This

work

Z-Score

Normalization

Hierarichal 2D CNNs

PhysioNet

109 subjects

3 classes

2 classes

69.08%

82.79%

Table 5 shows the comparison between the per-

formance of the model in this work and some recent

works in the field of EEG Motor Imagery task recog-

nition. The binary classification of MI Right Fist ver-

sus MI Left Fist has achieved competitive results

compared to recently published works.

However, there is some room for improvement in

the approach used in this work. Here, we only used

Convolutions. Even for learning time-dependent pat-

terns, 2D Convolutions were used from different per-

spectives. A more complex form of convolutions

could be used to learn spatial and temporal infor-

mation at the same time without needing to break up

the dataset into individual frames. This could be ac-

complished by using the Conv3D layers. Effectively

using the multi view approach in the same manner,

but instead of analyzing each frame from the three

perspectives, 3D convolutions would look at all the

frames as one. Also, convolutions are not the only

way to learn patterns in data. They are not even the

most used method for time sensitive data. True spati-

otemporal models use a fusion of Conv layers for spa-

tial information and LSTM layers for temporal infor-

mation. So, a fusion architecture between a Conv2D

and an LSTM layer could be investigated. There is

also room for investigation with a sliding window ap-

proach using the TimeDistributed layers.

ACKNOWLEDGEMENTS

Research reported in this publication was supported

by the Louisiana Board of Regents Research Compet-

itiveness Subprogram (RCS) under the contract num-

ber LEQSF(2019-20)-RD-A-19.

REFERENCES

Dose, H., Møller, J. S., & Iverson, H. K. (2018). An End-

to-end Deep Learning Approach to MI-EEG Signal

Classification for BCIs. Expert Systems With Applica-

tions, 114, 532-542.

Goldberger, A., Amaral, L., Glass, L., Haussdorf, J.,

Ivanov, P. C., Mark, R., & Stanley, H. E. (2000). Phys-

ioBank, PhysioToolkit, and PhysioNet: Components of

a new research resource for complex physiologic sig-

nals. Circulation, 101(23), e215-e220.

Hayes, A. (2021, February 20). Z-Score: Definition. In-

vestopedia. Retrieved January 1, 2022, from

https://www.investopedia.com/terms/z/zscore.asp

Roots, K., Muhammad, N., & Muhammad, Y. (2020). Fu-

sion Convolutional Neural Network for Cross-Subject

EEG Motor Imagery Classification. Computers, 9(72).

Saha, P., & Fels, S. (2019). Hierarchical Deep Feature

Learning for Decoding Imagined Speech from EEG.

The Thirty-Third AAAI Conference on Artificial Intelli-

gence (AAAI-19). Honolulu.

Sekeroglu, K., Soysal, O. M. & Li X. (2019). Hierarchical

Deep-Fusion Learning Framework for Lung Nodule

Classification. 15th International Conference on Ma-

chine Learning and Data Mining (MLDM 2019).New

York, USA

Schalk, G., McFarland, D. J., Hinterberger, T., Birbaumer,

T., & Wilpaw, N. (2004). J.R. BCI2000: A General-

Purpose Brain-Computer Interface (BCI) System. IEEE

Transactions on Biomedical Engineering, 51(6), 1034-

1043.

Voulodimos, A., Doulamis, N., Doulamis, A., & Protopa-

padakis, E. (2018). Deep Learning for Computer Vi-

sion: A Brief Review. Computational intelligence and

neuroscience.

Wang, X., Hershe, M., Tömekce, B., Kaya, B., Magno, M.,

& Benini, L. (2020). An Accurate EEGNet-based Mo-

tor-Imagery Brain–Computer Interface for Low-Power

Edge Computing. 2020 IEEE International Symposium

on Medical Measurements and Applications (MeMeA).

Virtual.

IMPROVE 2022 - 2nd International Conference on Image Processing and Vision Engineering

188