MA-VAE: Multi-Head Attention-Based Variational Autoencoder

Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

Lucas Correia

1,2 a

, Jan-Christoph Goos

1 b

, Philipp Klein

, Thomas B

ack

2 c

and Anna V. Kononova

2 d

Mercedes-Benz AG, Stuttgart, Germany

Leiden University, Leiden, The Netherlands

Keywords:

Anomaly Detection, Multivariate, Time Series, Automotive, Test Bench, Variational Autoencoder, Bypass

Phenomenon.

Abstract:

A clear need for automatic anomaly detection applied to automotive testing has emerged as more and more

attention is paid to the data recorded and manual evaluation by humans reaches its capacity. Such real-world

data is massive, diverse, multivariate and temporal in nature, therefore requiring modelling of the testee be-

haviour. We propose a variational autoencoder with multi-head attention (MA-VAE), which, when trained

on unlabelled data, not only provides very few false positives but also manages to detect the majority of the

anomalies presented. In addition to that, the approach offers a novel way to avoid the bypass phenomenon, an

undesirable behaviour investigated in literature. Lastly, the approach also introduces a new method to remap

individual windows to a continuous time series. The results are presented in the context of a real-world in-

dustrial data set and several experiments are undertaken to further investigate certain aspects of the proposed

model. When conﬁgured properly, it is 9% of the time wrong when an anomaly is ﬂagged and discovers 67%

of the anomalies present. Also, MA-VAE has the potential to perform well with only a fraction of the training

and validation subset, however, to extract it, a more sophisticated threshold estimation method is required.

1 INTRODUCTION

Powertrain testing is an integral part of the wider auto-

motive powertrain development and is undertaken at

different stages of development. Each of these stages

is composed of many integration levels. These inte-

gration levels range from powertrain sub-component

testing, such as the electric drive unit (EDU) con-

troller or high-voltage battery (HVB) management

system, to whole vehicle powertrain testing. Each

of these has its special type of controlled environ-

ment, called a test bench. The use-case in this paper

is on an endurance powertrain test bench, where the

EDU and HVB on their own are tested under differ-

ent conditions and loads for longer periods to simulate

wear over time. Given the costly maintenance and up-

keep costs of such test benches, it is desirable to keep

https://orcid.org/0000-0002-3653-5934

https://orcid.org/0009-0002-7063-3615

https://orcid.org/0000-0001-6768-1478

https://orcid.org/0000-0002-4138-7024

downtime at a minimum and to avoid faulty measure-

ments. Also, it is desirable to detect problems early

to prevent damage to the testee. Given that evalua-

tion is done manually by inspection, it is not feasi-

ble to analyse every single measurement, also evalua-

tion tends to be delayed, only being undertaken days

after the measurement is recorded, hence there is a

clear need for automatic, fast and unsupervised eval-

uation methodology which can ﬂag anomalous mea-

surements before the next measurement is started.

To achieve this, we propose a multi-head attention

variational autoencoder (MA-VAE). MA-VAE con-

sists of a bidirectional long short-term memory (BiL-

STM) variational autoencoder architecture that maps

a time-series window into a temporal latent distribu-

tion (Park et al., 2018) (Su et al., 2019). Also, a multi-

head attention (MA) mechanism is added to further

enhance the sampled latent matrix before it is passed

on to the decoder. As shown in the ablation study, this

approach avoids the so-called bypassed phenomenon

(Bahuleyan et al., 2018), which is the ﬁrst contribu-

Correia, L., Goos, J., Klein, P., Bäck, T. and Kononova, A.

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to Automotive Endurance Powertrain Testing.

DOI: 10.5220/0012163100003595

In Proceedings of the 15th International Joint Conference on Computational Intelligence (IJCCI 2023), pages 407-418

ISBN: 978-989-758-674-3; ISSN: 2184-3236

407

tion. Furthermore, this paper offers a unique method-

ology for the reverse-window process. It is used

for remapping the ﬁxed-length windows the model is

trained on to continuous variable-length sequences.

This paper is structured as follows: First, a short

background is provided in Section 2 on the power-

train testing methodology speciﬁc to this use case,

as well as the theory behind VAE and MA mecha-

nisms. Then, related work in variational autoencoder-

based time-series anomaly detection is presented in

Section 3, followed by an in-depth introduction of

the real-world data set and the approach we propose

in Section 4. Then, several experiments testing dif-

ferent aspects of the proposed method are conducted

and discussed in Section 5, along with the ﬁnal re-

sults. Finally, conclusions from this work are drawn

and an outlook into future work is provided in Sec-

tion 6. The source code for the data pre-processing,

model training as well as evaluation can be found un-

der https://github.com/lcs-crr/MA-VAE.

2 BACKGROUND

2.1 Real-World Application

During endurance testing a portfolio of different driv-

ing cycles is run, where a cycle is a standardised driv-

ing pattern, which enables repeatability of measure-

ments. For this type of testing the portfolio con-

sists exclusively of proprietary cycles, which differ

from the public cycles used, for example, for vehicle

fuel/energy consumption certiﬁcation like the New

European Driving Cycle (NEDC) or the Worldwide

Harmonised Light Vehicles Test Cycle (WLTC). The

reason why proprietary cycles are used for endurance

runs is that they allow for more extensive loading of

the powertrain.

Given the presence of a battery in the testee, some

time has to be dedicated to battery soaking (sitting

idle) and charging. These procedures are also stan-

dardised using cycles, although, for the intents and

purposes of this paper, they are omitted. What is

left are the eight dynamic driving cycles representing

short, long, fast, slow and dynamic trips ranging from

5 to 30 minutes. There are multiple versions of the

same cycle, which mostly differ in starting conditions

such as state-of-charge (SoC) and temperature of the

battery.

On powertrain test benches, there are several con-

trol methods to ensure the testee maintains the given

driving cycle. In this particular test bench, the regu-

lation is done by the acceleration pedal and the EDU

revolutions-per-minute (rpm), which is nothing more

than a non-SI version of the angular velocity.

2.2 Data Set

This real-world data set consists of 3385 normal mea-

surement ﬁles, each of which contains hundreds of

(mostly redundant or empty) channels. A measure-

ment is considered normal when the testee behaviour

conforms to the norm. For this work, a list of d

= 13

channels was hand-picked in consultation with the

test bench engineers to choose a reasonable and rep-

resentative number of channels. This list includes

the vehicle speed, EDU torque, current, voltage, ro-

tor temperature and stator temperature, left and right

wheel shaft torque, HVB current, voltage, tempera-

ture and SoC and inverter temperature. Given that

some channels (such as torque) are sampled much

faster than others (like temperature and SoC), a com-

mon sampling rate of 2Hz is chosen. Channels sam-

pled slower than 2Hz are linearly interpolated, which

is seen as permissible due to the lower amplitude res-

olution of those channels. Channels sampled faster

than 2Hz are passed through a low-pass ﬁlter with a

cut-off frequency of 1Hz and then resampled to 2Hz,

as is consistent with the Whittaker–Nyquist–Shannon

theorem (Shannon, 1949). Then the driving cycles

are z-score normalised, i.e. transformed such that the

mean for each channel lies at 0 and the standard de-

viation at 1. Lastly, the driving cycles (generally re-

ferred to as sequences in this paper) are windowed

to create a set of ﬁxed-length sub-sequences, or win-

dows. First, each channel is auto-correlated to obtain

the number of lags of the slowest dynamic process

present in the signal. Then, the window size W is set

as the smallest power of two larger than the longest

lag, in this case, W = 256 time steps or 128 seconds.

Each window overlaps their preceding and succeed-

ing windows by half a window, i.e. the shift between

windows is W /2 = 128 time steps, in order to reduce

computational load compared to a shift of one time

step.

Due to the absence of labelled anomalies in the

dataset, realistic anomalous events are intentionally

simulated and recorded following the advice of test

bench engineers. To this end, ﬁve anomaly types were

recorded. In the ﬁrst type, the virtual wheel diameter

is changed, such that the resulting vehicle speed devi-

ates from the norm. The wheel diameter is a param-

eter as resistances are connected to the shafts rather

than actual wheels. The second type of anomaly in-

volves changing the driving mode from comfort to

sport, which leads to a higher HVB SoC drop over

the cycle and a different torque response. In the third

anomaly, the recuperation level is turned from maxi-

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

408

mum to zero, hence the minimum EDU torque is al-

ways non-negative and the HVB SoC experiences a

higher drop in SoC. In the case of the fourth anomaly,

the HVB is swapped for a battery simulator, where the

HVB voltage behaviour deviates from a real battery.

The inverter and EDU share a cooling loop, whose

cooling capacity is reduced at the beginning or mid-

dle of the cycle, leading to higher EDU rotor, EDU

stator and inverter temperatures than normal. Every

anomaly type is recorded during every cycle at least

once, leading to 60 anomalous driving cycles that are

all used as the anomalous subset of the test set.

A plot of one normal and one wheel-diameter

anomalous cycle is shown in Figure 1. Due to the

long channel names, the plot only shows the channel

indices, a table containing the legend is shown in Ta-

ble 1 for context. Visual inspection may suggest that

the red plot is anomalous, since the EDU and HVB

voltage, temperature and state of charge deviate from

the black plot. This deviation is to be expected be-

cause they depend on how charged the battery is and

on how much the battery is used previous to the cur-

rent cycle. In the case of this anomaly, the only chan-

nel that demonstrates anomalous behaviour is the ve-

hicle speed, since:

vehicle

= r ×ω (1)

where r is the wheel radius and ω the angular velocity.

Evidently, the anomalous behaviour is most visible at

higher speeds.

In an operative environment, it is desirable to ﬁnd

out whether the previously recorded sequence had any

problems to analyse before the next measurement is

recorded. Also, a model that performs as well as

possible with as little data as possible translates to

faster deployment. Good performance is indicated by

a model that can detect as many anomalies as possi-

ble and rarely labels normal measurements wrongly.

Table 1: Legend for the channel names in Figure 1.

No. Name

1 Vehicle Speed

2 EDU Torque

3 Left Axle Torque

4 Right Axle Torque

5 EDU Current

6 EDU Voltage

7 HVB Current

8 HVB Voltage

9 HVB Temperature

10 HVB State of Charge

11 EDU Rotor Temperature

12 EDU Stator Temperature

13 Inverter Temperature

0 200 400 600 800 1000

Time [s]

1 [-]2 [-]3 [-]4 [-]5 [-]6 [-]7 [-]8 [-]9 [-]10 [-]11 [-]12 [-]13 [-]

Figure 1: Features of a normal (black) and an anomalous

(red) cycle plotted with respect to time. The anomalous cy-

cle plotted represents a scenario where the wheel diameter

has not been set correctly. The amplitude axis is z-score

normalised to comply with conﬁdentiality guidelines.

To investigate the required training subset size of the

model, it is trained with 1h, 8h, 64h, and 512h worth

of dynamic testing data, which corresponds to the

ﬁrst 6, 44, 348, and 2785 driving cycles, respectively.

The results are also presented in Section 5. In each

of the above-mentioned cases, the training subset is

further split into a training (80%) and a validation

(20%) subsets. Both the training and validation sub-

sets are batched to sets of 512 windows. Given the

anomalous subset size of 60 driving cycles, 600 nor-

mal driving cycles recorded after the ones in the train-

ing subset are chosen to make up the normal test sub-

set. This would imply that 9% of measurements at

the test bench are anomalous, however in reality this

value is estimated to be much lower. This amount of

anomalous data in relation to normal data is used as

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

409

it approximately matches the anomaly ratio in public

data sets and because the data set is not large enough

to create a larger normal test subset.

2.3 Variational Autoencoders

The variational autoencoder (Kingma and Welling,

2014)(Rezende et al., 2014) is a generative model that

structurally resembles an autoencoder, but is theoreti-

cally derived from variational Bayesian statistics. As

opposed to the regular deterministic autoencoder, the

VAE uses the evidence lower bound (ELBO), which

is a lower bound approximation of the so-called log

evidence log p

(X), as its objective function. The

ELBO, Equation 2, can be expressed as the recon-

struction log-likelihood and the negative Kullback-

Leibler Divergence (D

) between the approximate

posterior q

(Z|X) and the prior p

(Z), which is typi-

cally assumed to be a Gaussian distribution (Goodfel-

low et al., 2016).

θ,φ

(X) = E

Z∼q

(Z|X)

[log p

(X|Z)]

−D

(Z|X)||p

(Z))

(2)

where Z ∈ R

W ×d

is the sampled latent matrix and

X ∈ R

W ×d

is the input window. W refers to the

window length, whereas d

and d

refer to the in-

put window and latent matrix dimensionality, respec-

tively. Gradient-based optimisation minimises an ob-

jective function and the goal is the maximisation of

the ELBO, hence the ﬁnal loss function is deﬁned as

the negative of Equation 2, shown in Equation 3.

VAE

= −L

θ,φ

(X) (3)

Finally, to enable the backpropagation through

the otherwise intractable gradient of the ELBO, the

reparametrisation trick (Kingma and Welling, 2014)

is applied, shown in Equation 4.

Z = µ

+ ε ·σ

(4)

where ε ∼ N (0,1) and (µ

,logσ

) = q

(X).

2.4 Multi-Head Attention Mechanism

To simplify the explanation of MA as employed in

this work, multi-head self-attention (MS) will be ex-

plained instead with the small difference between MA

and MS being pointed out at the end.

MS consists of two different concepts: self-

attention and its multi-head extension. Self-attention

is nothing more than scaled dot-product attention

(Vaswani et al., 2017) where the key, query and value

are the same. The scaled dot-product attention score

is the softmax (Bridle, 1990) of the product between

query matrix Q and key matrix K which is scaled by

√

. The product between the attention score and the

value matrix V yields the context matrix C, as shown

in Equation 5.

C = Softmax



√



V (5)

Compared to recurrent or convolutional layers, self-

attention offers a variety of beneﬁts, such as the re-

duction of computational complexity, as well as an in-

creased amount of operations that can be parallelised.

(Vaswani et al., 2017). Also, self-attention inherits an

advantage over Bahdanau-style attention (Bahdanau

et al., 2015) from the underlying scaled dot-product

attention mechanism: it can run efﬁciently in matrix

multiplication manner (Vaswani et al., 2017).

Multi-head self-attention then allows the attention

model to attend to different representation subspaces

(Vaswani et al., 2017), in addition to learning useful

projections rather than it being a stateless transforma-

tion (Chollet, 2021). This is achieved using weight

matrices W

, W

, which contain trainable pa-

rameters and are unique for each head i, as shown in

Equation6.

= QW

= KW

= VW

(6)

Once the query, key and value matrices are linearly

transformed via the weight matrices, the context ma-

trix C

for each head i is computed using Equation 7.

= Softmax



√



(7)

Then, for h heads, the different context matrices are

concatenated and linearly transformed again via the

weight matrix W

, resulting in the multi-head context

matrix C ∈ R

W ×d

, Equation 8.

C = [C

,...,C

(8)

The underlying mechanism of MA is identical to MS,

with the only difference being that K = Q ̸= V. The

beneﬁt of this alteration is discussed in Section 4.

3 RELATED WORK

MA-VAE belongs to the so-called generative model

class, which encompasses both variational autoen-

coders, as well as generative adversarial networks.

This section focuses solely on the work on VAE pro-

posed in the context of time-series anomaly detection.

In time-series anomaly detection literature, the

only other model that uses the combination of a VAE

and an attention mechanism is by (Pereira and Sil-

veira, 2018). For the purpose of our paper, it is named

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

410

VS-VAE. Their approach consists of a BiLSTM en-

coder and decoder, where, for an input window of

length W, the t = W encoder hidden states of each di-

rection are passed on to the variational self-attention

(VS) mechanism (Bahuleyan et al., 2018). The result-

ing context vector is then concatenated with the sam-

pled latent vector and then passed on to the decoder.

The author claims that applying VS to the VAE model

solves the bypass phenomenon, however, no evidence

for this claim is provided.

The ﬁrst published time-series anomaly detection

approach based on VAE was LSTM-VAE (Park et al.,

2018). One of the contributions is its use of a dynamic

prior N (µ

,1), rather than a static one N (0, 1). In

addition to that, they introduce a state-based thresh-

old estimation method consisting of a support-vector

regressor (SVR), which maps the latent distribution

parameters (µ

,σ

) to the resulting anomaly score us-

ing the validation data. Hence, the dynamic threshold

can be obtained through Equation 9.

= SVR(µ

z,t

,σ

z,t

) + c (9)

where c is a pre-deﬁned constant to control sensitivity.

OmniAnomaly (Su et al., 2019) attempts to cre-

ate a temporal connection between latent distributions

by applying a linear Gaussian state space model to

them. For the purpose of this paper, it is called (Om-

niA). Also, it concatenates the last gated recurrent

unit (GRU) hidden state with the latent vector sam-

pled in the previous time step. In addition to that,

it uses planar normalising ﬂow (Rezende and Mo-

hamed, 2015) by applying K transformations to the

latent vector in order to approximate a non-Gaussian

posterior, as shown in Equation 10.

k−1

) = u tanh(wz

k−1

) + b (10)

where u, w and b are trainable parameters.

A simpliﬁed VAE architecture (Pereira and Sil-

veira, 2019) based on BiLSTM layers is also pro-

posed. For the purpose of our paper, it is called

W-VAE. Unlike its predecessor (Pereira and Silveira,

2018), it drops the attention mechanism but provides

contributions elsewhere. It offers two strategies to de-

tect anomalies based on the VAE outputs. The ﬁrst in-

volves clustering the space characterised by the mean

parameter of the latent distribution into two clusters

and labelling the larger one as normal. This strategy

has a few weaknesses: it cannot be used in an opera-

tive environment as it requires some sort of history of

test windows to form the clusters and it assumes that

there are always anomalous samples present. The sec-

ond strategy ﬁnds the Wasserstein similarity measure

(hence the W in the name) between the latent mean

space mapping of the test window in question and the

respective mapping i resulting from a representative

data subset, such as the validation subset. Equation

11 shows how the Wasserstein similarity measure is

computed

test

) = ∥µ

test

−µ

∥

+ ∥Σ

1/2

test

−Σ

1/2

∥

(11)

where the ﬁrst term represents the L2-Norm between

the mean distribution parameters resulting from the

test window and each point of the representative sub-

set. The second term represents the Frobenius norm

between the covariance matrix resulting from the test

window and each point of the representative subset.

SWCVAE (Chen et al., 2020) is the ﬁrst that ap-

plies convolutional neural networks (CNN) to VAE

for multivariate time-series anomaly detection. Pe-

culiarly, 2D CNN layers are used with the justiﬁca-

tion of being able to process the input both spatially

and temporally. We, however, doubt the ability of

the model to properly detect anomalies through spa-

tial processing, as a kernel moving along the feature

axis can only capture features adjacent to each other.

To create a continuous anomaly score from windows

they append the last value of each window to the pre-

vious one. For the purpose of this paper, this process

is referred to as last-type reverse-windowing.

SISVAE (Li et al., 2021) tries to improve the mod-

elling robustness by the addition of a smoothing term

in the loss function which contributes to the reduction

of sudden changes in the reconstructed signal, making

it less sensitive to noisy time steps.

As part of the VASP framework (von Schleinitz

et al., 2021), a variational autoencoder architecture is

proposed to increase the robustness of time-series pre-

diction when faced with anomalies. While the main

contribution is attributed to the framework itself, not

the VAE, it should be noted that during inference only

the mean parameter of the latent distribution is passed

to the decoder.

4 PROPOSED APPROACH

4.1 Overview

To detect anomalies in multivariate time-series data,

we propose a variational autoencoder architecture

consisting of BiLSTM layers. The model architecture

is illustrated in Figure 2. During training, the encoder

maps multivariate input window X to a temporal

distribution with parameters µ

and log σ

in the for-

ward pass, Equation 12.

(µ

,logσ

) = q

(X) (12)

Given the latent distribution parameters µ

and

logσ

, the latent matrix is sampled from the result-

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

411

BiLSTM

Encoder

BiLSTM

Decoder

logσ

W ×d

Multi-head

Attention

Q,K

Figure 2: An illustration of the proposed MA-VAE model. Blue shapes designate trainable models, orange deterministic

tensors and green distribution parameters. The shape of each tensor is designated below it. During training Z is used as the

value matrix, denoted by the solid arrow, whereas during inference µ

is used as the value matrix, denoted by the traced arrow.

ing distribution, as shown in Equation 13.

Z ∼ N (µ

,logσ

) (13)

Then, the input window X is linearly transformed to

obtain the query matrices Q

and key matrices K

for

each head i . Likewise, the sampled latent matrix Z is

also transformed to the value matrix V

, as shown in

Equation 14.

= XW

= ZW

(14)

To output the context matrix C

for each head i, the

softmax of the through

√

normalised query and

key product is multiplied with the value matrix, Equa-

tion 15.

= Softmax



√



(15)

The ﬁnal context matrix C is the result of the linearly-

transformed concatenation of each head-speciﬁc con-

text matrix C

, as expressed in Equation 16.

C = [C

,...,C

(16)

The decoder p

then maps the context matrix C to an

output distribution with parameters µ

and log σ

, as

shown in Equation 17.

(µ

,logσ

) = p

4.2 Inference Mode

Despite the generative capabilities of VAE, MA-VAE

does not leverage generation for anomaly detection.

Rather than sampling a latent matrix as shown in

Equation 13 during inference, sampling is disabled

and only µ

is taken as the input for the multi-head

attention mechanism, like in (von Schleinitz et al.,

2021). Equation 13 in the forward pass, therefore, is

replaced by Equation 18.

Z = µ

(18)

This not only accelerates inference by eliminating the

sampling process but is also empirically found to be

a good approximation of an averaged latent matrix if

it were sampled several times like in (Pereira and Sil-

veira, 2018). The MA-VAE layout during inference

is shown in Figure 2, where the traced arrow desig-

nates the information ﬂow from the encoder to the

MA mechanism.

4.3 Threshold Estimation Method

Anomalies are by deﬁnition very rare events, hence

an ideal anomaly detector only ﬂags measurements

very rarely but accurately. Test bench engineers pre-

fer an algorithm that only ﬂags a sequence it is sure

is an anomaly, in other words, an algorithm that out-

puts very few to no false positives. A high false posi-

tive count would lead to a lot of stoppages and there-

fore lost testing time and additional cost. Of course,

the vast majority of measurements evaluated will be

normal and hence it is paramount to classify them

correctly, naturally leading to a high precision value.

Also, there is no automatic evaluation methodology

currently running at test benches, other than rudi-

mentary rule-based methods, therefore a solution that

plugs into the existing system that automatically de-

tects some or most anomalies undetectable by rules is

already a gain. To achieve this, the threshold τ is set

as the maximum log probability observed when the

model is fed with validation data.

4.4 Bypass Phenomenon

VAE, when combined with an attention mecha-

nism, can exhibit a behaviour called the bypass phe-

nomenon (Bahuleyan et al., 2018). When the bypass

phenomenon happens the latent path between encoder

and decoder is ignored and information ﬂow occurs

mostly or exclusively through the attention mecha-

nism, as it has deterministic access to the encoder hid-

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

412

den states and therefore avoids regularisation through

the D

term. In an attempt to avoid this, (Bahuleyan

et al., 2018) propose variational attention, which, like

the VAE, maps the input to a distribution rather than a

deterministic vector. Applied to natural language pro-

cessing, (Bahuleyan et al., 2018) demonstrate that this

leads to a diversiﬁed generated portfolio of sentences,

indicating alleviation of the bypassing phenomenon.

As previously mentioned, only (Pereira and Silveira,

2018) applies this insight in the anomaly detection do-

main, however, they do not present any proof that it

alleviates the bypass phenomenon in their work. MA-

VAE on the other hand, cannot suffer from the bypass

phenomenon in the sense that information ﬂow ig-

nores the latent variational path between encoder and

decoder since the MA mechanism requires the value

matrix V from the encoder to output the context ma-

trix. Assuming the bypass phenomenon also applies

to a case where information ﬂow ignores the atten-

tion mechanism, one could claim that MA-VAE is not

immune. To disprove this claim, the attention mecha-

nism is removed from the model in an ablation study

to see if anomaly detection performance remains the

same. In this case, V is instead directly input into the

decoder. If it drops, it is evidence of the contribution

of the attention mechanism to the model performance

and hence is not bypassed. The results for this abla-

tion study are shown and discussed in Section 5.

4.5 Impact of Seed Choice

Given the stochastic nature of the VAE, the chosen

seed can impact the anomaly detection performance

as it can lead to a different local minimum during

training. To investigate the impact the seed choice

has on model training, MA-VAE is trained on three

different seeds, the respective results are also shown

in Section 5.

4.6 Reverse-Window Process

Since the model is trained to reconstruct ﬁxed-length

windows, the same applies during inference. How-

ever, to decide whether a given measurement se-

quence S ∈ R

T ×d

is anomalous, a continuous recon-

struction of the measurement is required. The easiest

way to do so would be to window the input measure-

ment using a shift of 1, input the windows into the

model and chain the last time step from each output

window to obtain a continuous sequence (Chen et al.,

2020). Considering the BiLSTM nature of the en-

coder and decoder, the ﬁrst and last time steps of a

window can only be computed given the states from

one direction, making these values, in theory, less ac-

Algorithm 1: Anomaly Detection Process.

Input: Sequence S ∈ R

T ×d

, Threshold τ

Result: Label l

windows

← T −W + 1

X,temp

← zeros(n

windows

,T,d

) + NaN

X,temp

← zeros(n

windows

,T,d

) + NaN

for i = 1 → n

windows

X ← S [i : W + i]

(µ

,log σ

) ← q

(X)

C ← MA(X, X,µ

)

(µ

,log σ

) ← p

(C)

X,temp

[i,i : i +W ] ← µ

X,temp

[i,i : i +W ] ← σ

end for

X,seq

← nanmean(µ

X,temp

)

X,seq

← nanmean(σ

X,temp

)

s ← −log p(X|µ

X,seq

,σ

X,seq

)

l ← max(s) > τ

curate, however. To overcome this, we propose aver-

aging matching time steps in overlapping windows,

which is called mean-type reverse-window method.

This is done by pre-allocating an array with NaN val-

ues, ﬁlling it, and taking the mean for each time step

while ignoring the NaN values. This process and the

general anomaly detection process are described in

Algorithm 1. This reverse-window process is done for

the mean and variance parameters of the output distri-

bution, then the variance is converted to standard de-

viation since two distributions cannot be combined by

averaging the standard deviations. With a continuous

mean and standard deviation, the continuous negative

log probability, i.e. the anomaly score s, is computed

for the respective measurement. A comparison be-

tween the mean, last and ﬁrst reverse-window process

is provided in Section 5.

5 RESULTS

5.1 Setup

The encoder and decoder both consist of two BiLSTM

layers, with the outer ones having 512 hidden- and

cell-state sizes and the inner ones 256. All other pa-

rameters are left as the default in the TensorFlow API.

During training only, input windows are corrupted

using Gaussian noise using 0.01 standard deviation to

increase robustness to noise.

Key factors that are investigated in Section 5 are

given a default value which applies to all experiments

unless otherwise speciﬁed. These factors are train-

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

413

ing/validation subset size, which is set to 512h, seed

choice, which has been kept at 1, reverse-window

method, where the mean-type is used, the latent di-

mension size, which is set to d

= 16 and the MA

mechanism, which is set up as proposed in (Vaswani

et al., 2017) with a head count of h = 8 and a key

dimension size d

= ⌊d

/h⌋ = 1.

The optimiser used is the AMSGrad optimiser

with the default parameters in the TensorFlow API.

Cyclical D

annealing (Fu et al., 2019) is ap-

plied to the training of MA-VAE, to avoid the D

vanishing problem. The D

vanishing problem oc-

curs when regularisation is too strong at the beginning

of training, i.e. the Kullback-Leibler divergence term

has a larger magnitude in relation to the reconstruc-

tion term. Cyclical D

annealing allows the model

to weigh the Kullback-Leibler divergence lower than

the reconstruction term in a cyclical manner through a

weight β. This callback is conﬁgured with a grace pe-

riod of 25 epochs, where β is linearly increased from

0 to 10

−8

. After the grace period, β is set to 10

−8

and is gradually increased linearly to 10

−2

throughout

the following 25 epochs, representing one loss cycle.

This loss cycle is repeated until the training stops.

All priors in this work are set as standard Gaussian

distributions, i.e. p = N (0, 1).

To prevent overﬁtting, early stopping is imple-

mented. It works by monitoring the log probability

component of the validation loss during training and

stopping if it does not improve for 250 epochs. Logi-

cally, the model weights at the lowest log probability

validation loss are saved.

Training is done on a workstation conﬁgured with

an NVIDIA RTX A6000 GPU. The library used for

model training is TensorFlow 2.10.1 on Python 3.10

on Windows 10 Enterprise LTSC version 21H2.

The results provided are given in the form of the

calibrated and uncalibrated anomaly detection perfor-

mance, i.e. with and without consideration of thresh-

old τ, respectively. Recall that the threshold used is

the absolute maximum negative log probability ob-

tained from the validation set. Calibrated metrics are

the precision, recall and F

score. Precision P rep-

resents the ratio between correctly identiﬁed anoma-

lies (true positives) and all positives (true and false),

shown in Equation 19, recall R represents the ratio

between true positives and all anomalies, shown in

Equation 19, and F

score represents the harmonic

mean of the precision and recall, shown in Equation

20. The underlying metrics used to calculate all of

the below are the true positives (T P), false negatives

(FN) and false positives (FP).

P =

T P

T P + FP

R =

T P

T P + FN

(19)

T P

T P + 0.5 ∗(FP + FN)

= 2 ∗

P ∗R

P + R

(20)

The theoretical maximum F

score, F

1,best

, is also pro-

vided to aid discussion. This represents the best pos-

sible score achievable by the approach if the ideal

threshold were known, i.e. the point on the precision-

recall curve that comes closest to the P = R = 1 point,

though, in reality, this value is not observable and

hence cannot be obtained in an unsupervised manner.

The uncalibrated anomaly detection performance,

i.e. the performance for a range of thresholds, each

0.1 apart, is represented by the area under the contin-

uous precision-recall curve A

cont

PRC

, Equation 21.

cont

PRC

PdR (21)

As the integral cannot be computed for the continuous

function, the area under the discrete precision-recall

curve A

disc

PRC

is used which is done using the trape-

zoidal rule, Equation 22.

disc

PRC

∑

k=1

f (R

k−1

) + f (R

)

∆R

(22)

where N is the number of discrete sub-intervals, k

the index of sub-intervals and ∆R

the sub-intervals

length at index k. Precision is a function of recall, i.e.

P = f (R).

5.2 Ablation Study

MA-VAE is tested without the MA mechanism and

with a direct connection from the encoder to the de-

coder to observe whether it impacts results.

The anomaly detection performance of MA-VAE

and its counterpart without MA, henceforth referred

to as No MA model, are shown in Table 2. While

the precision value of the No MA model is slightly

higher than the MA-VAE, the recall value on the other

hand is much lower. Overall, MA-VAE has a higher

score, as well as a higher theoretical maximum F

score, although both values are so close enough to

each other that one could claim the threshold is near

ideal. The uncalibrated performance is also higher in

the case of the MA-VAE, as evident in the precision-

recall plot in Figure 3. Interestingly, MA-VAE may

feature a lower precision value for the chosen un-

supervised threshold but has the potential to have a

higher maximum recall at P = 1.

The results hence point towards an improvement

brought about by the addition of the MA mechanism

and therefore the bypass phenomenon can be ruled

out.

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

414

Table 2: Precision P, recall R, F

score, theoretical best F

score F

1,best

and area under the precision-recall curve A

PRC

results for the model variant without the MA mechanism

and MA-VAE. The best values for each metric are given in

bold.

Model P R F

1,best

PRC

No MA 1.00 0.35 0.52 0.54 0.52

MA-VAE 0.92 0.55 0.69 0.70 0.66

Figure 3: Precision-recall curves for the model variation

without MA and MA-VAE.

5.3 Data Set Size Requirements

To evaluate how much data is required to train MA-

VAE to a point of adequate anomaly detection perfor-

mance, it has been trained with 1h, 8h, 64h, and 512h

worth of dynamic testing data.

The results for this experiment are presented in

Table 3. On the one hand, as the training/validation

subset increases in size, the precision value improves,

with the largest jump occurring when the dynamic

testing time goes from 1h to 8h. The recall value on

the other hand decreases as the subset grows. This can

be attributed to the fact that smaller subset sizes lead

to a small validation set and therefore less data to ob-

tain a threshold from. With a limited amount of data

to obtain a threshold from, it is more difﬁcult to get

a representative error distribution, leading to a thresh-

old that is very small and hence marks most anoma-

lies correctly but also leads to a lot of false positives.

score reaches a point of diminishing returns with

the 8h subset onwards, this can also be observed in

the case of the theoretical maximum F

score, F

1,best

as well as in the A

PRC

value, further supported by

the precision-recall plot in Figure 4. Lastly, the F

score seems to approach the F

1,best

score as the subset

grows, also backing the fact that with a small subset

size, a good threshold cannot easily be obtained.

Table 3: Precision P, recall R, F

score, threoretical best F

score F

1,best

and area under the precision-recall curve A

PRC

results for the different training/validation subset sizes. The

best values for each metric are given in bold.

Size P R F

1,best

PRC

1h 0.09 0.88 0.17 0.55 0.49

8h 0.66 0.63 0.64 0.72 0.69

64h 0.71 0.57 0.63 0.69 0.68

512h 0.92 0.55 0.69 0.70 0.66

Figure 4: Precision-recall curves for the model trained on

different training/validation subset sizes.

Therefore, for application at the test bench, the

largest subset size is desirable due to the higher pre-

cision value and a closer-to-ideal threshold value.

5.4 Impact of Seed Choice

To illustrate the impact it has on the performance met-

rics, they are presented using three different seeds.

Table 4 shows that while the precision values are

roughly in the same range for all seeds, the recall val-

ues vary more signiﬁcantly, which also reﬂects on the

score. However, by inspecting the F

1,best

and A

PRC

values it becomes clear that the seeds are not as far

apart as the recall value suggests and that the issue

may lie with the threshold choice. Figure 5 further

supports this, as all lines have roughly the same path,

with the exception of seed 3 at very high precision

values. The plot clearly shows that a more suitable

(lower) threshold would lead to seed 3 having a com-

parable recall value to the other seeds while maintain-

ing high precision.

Some differences can be observed between the

seeds, especially in the recall values, however, this

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

415

can be attributed to the unsupervised threshold choice.

Table 4: Precision P, recall R, F

score, threoretical best

score F

1,best

and area under the precision-recall curve

PRC

results for the different seeds. The best values for

each metric are given in bold.

Seed P R F

1,best

PRC

1 0.92 0.55 0.69 0.70 0.66

2 0.90 0.60 0.72 0.73 0.67

3 0.96 0.40 0.56 0.70 0.64

Figure 5: Precision-recall curves for the model trained on

different seeds.

5.5 Reverse-Window Process

To investigate the effect of the mean-type reverse-

window method, it is compared with the ﬁrst-type and

last-type methods where the ﬁrst and last values of

each window are carried over, respectively.

The results in this subsection, Table 5 and Figure

6, tell a similar story to the previous subsection. The

metrics independent of the chosen threshold are very

similar regardless of the reverse-window method, im-

plying that they are comparable and that any differ-

ences in the calibrated metrics can be attributed to

the chosen threshold. The mean-type reverse-window

method results in a higher computational load, though

negligible. For a rather long sequence of 4000 time

steps, i.e. around 33 minutes long, the mean-type

method only takes around 2 seconds longer. One

source of delay that can appear, however, is during

online anomaly detection. An online anomaly detec-

tion algorithm is deﬁned as an algorithm which eval-

uates the sequence as it is being recorded. To obtain

time step t using the mean-type (or the ﬁrst-type) you

have to wait for time step t +W while t < W . This

Table 5: Precision P, recall R, F

score, threoretical best

score F

1,best

and area under the precision-recall curve

PRC

results for the different reverse-window types. The

best values for each metric are given in bold.

Type P R F

1,best

PRC

ﬁrst 0.97 0.48 0.64 0.69 0.64

last 0.88 0.58 0.71 0.71 0.67

mean 0.92 0.55 0.69 0.70 0.66

Figure 6: Precision-recall curves for different reverse-

window methods.

translates to a delay of around 2 minutes in the real

world, given the chosen window size. If the evalua-

tion is done ofﬂine, i.e. when t = W , then this delay

is eliminated since the last value does not have other

overlapping values to compute the mean.

5.6 Hyperparameter Optimisation

As part of the hyperparameter optimisation of MA-

VAE, a list of latent dimension sizes d

in combina-

tion with a list of key dimension sizes d

is tested.

Despite the larger learning capacity associated with a

higher d

, the concatenation is always transformed to

a matrix of size d

= d

. For the two variables, values

of 1, 4, 16, and 64 are tested.

The best result is achieved with d

= d

= 64.

Given that they are the respective highest values of d

and d

, even higher values should be experimented

with in the future, though they will lead to higher

model complexity and training/inference time. The

attention head count h was also experimented with us-

ing the same range of values as for d

and d

, how-

ever, none performed better than the h = 8 conﬁgu-

ration. The results are presented in Table 6, the cor-

responding precision-recall plot is shown in Figure 7.

91% of the sequences marked as normal were actually

normal and 67% of the total number of anomalous se-

quences in the test set were detected. One example of

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

416

Table 6: Precision P, recall R, F

score, threoretical best F

score F

1,best

and area under the precision-recall curve A

PRC

result for the best d

, d

and h values.

h P R F

1,best

PRC

64 64 8 0.91 0.67 0.77 0.79 0.74

Figure 7: Precision-recall curve for the ﬁnal MA-VAE.

the anomalous cycles and the respective reconstruc-

tions is plotted in Figure 8.

5.7 Benchmarking

Of course, MA-VAE is not the ﬁrst model proposed

for time-series anomaly detection. To underline its

anomaly detection performance, it is compared with

a series of other models based on variational autoen-

coders. The chosen subset of models is based on

the work discussed in Section 3 which either linked

source code or contained enough information for im-

plementation. The models are implemented using hy-

perparameters speciﬁed in their respective publica-

tions. To even the playing ﬁeld, the models are trained

on the 512h subset with early stopping, which is

parametrised equally across all models. The anomaly

detection process speciﬁed in Algorithm 1 is also ap-

plied to all models, along with the threshold estima-

tion method. The results can be seen in Table 7.

Table 7: Precision P, recall R, F

score, threoretical best F

score F

1,best

and area under the precision-recall curve A

PRC

results for competing models and MA-VAE (Ours). The

best values for each metric are given in bold.

Model P R F

1,best

PRC

VS-VAE 1.00 0.33 0.50 0.56 0.51

W-VAE 1.00 0.30 0.46 0.46 0.41

OmniA 0.96 0.37 0.53 0.58 0.53

SISVAE 1.00 0.30 0.46 0.50 0.51

MA-VAE 0.91 0.67 0.77 0.79 0.74

Figure 8: Wheel diameter anomaly plotted in black and the

output distribution in red, as well as anomaly score plotted

in blue and the threshold as a straight line in orange.

As is evident, MA-VAE outperforms all other

models in every metric, except for precision. As

stated in Section 4 a high precision ﬁgure is impor-

tant in this type of powertrain testing, however, the

reduced precision is still considered tolerable. Also,

it comes at the beneﬁt of a much higher recall ﬁgure,

which is reﬂected in the superior F

ﬁgure. Further-

more, the F

1,best

ﬁgure, which is obtained at P = 0.98

and R = 0.67, suggests that MA-VAE has the poten-

tial to achieve even higher precision without sacriﬁc-

ing recall if the threshold were optimised. The higher

PRC

also shows that MA-VAE has a higher range of

thresholds at which it performs well.

MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to

Automotive Endurance Powertrain Testing

417

6 CONCLUSION AND OUTLOOK

In this paper, a multi-head attention variational au-

toencoder (MA-VAE) for anomaly detection in auto-

motive testing is proposed. It not only features an

attention conﬁguration that avoids the bypass phe-

nomenon but also introduces a novel method of

remapping windows to whole sequences. A num-

ber of experiments are conducted to demonstrate the

anomaly detection performance of the model, as well

as to underline the beneﬁts of key aspects introduced

with the model.

From the results obtained, MA-VAE clearly ben-

eﬁts from the MA mechanism, indicating the avoid-

ance of the bypass phenomenon. Moreover, the

proposed approach only requires a small train-

ing/validation subset size but fails to obtain a suit-

able threshold, as with increasing subset size only the

calibrated anomaly detection performance increases.

Training with different seeds also is shown to have

little impact on the anomaly detection metrics, pro-

vided the threshold is chosen suitably, further under-

lining the previous point. Moreover, mean-type re-

verse windowing fails to signiﬁcantly outperform its

ﬁrst-type and last-type counterparts, while introduc-

ing additional lag if it is applied to online anomaly

detection. Lastly, the hyperparameter optimisation re-

vealed that the MA-VAE variant with the largest latent

dimension and attention key dimension resulted in the

best anomaly detection performance. It is only 9% of

the time wrong when an anomaly is ﬂagged and man-

ages to discover 67% of the anomalies present in the

test data set. Also, it outperforms all other competing

models it is compared with.

In the future, a method of threshold choice involv-

ing active learning will be investigated, which can use

user feedback to hone in on a better threshold. Also,

MA-VAE is set to be tested in the context of online

anomaly detection, i.e. during the driving cycle mea-

surement.

REFERENCES

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Ma-

chine Translation by Jointly Learning to Align and

Translate. In International Conference on Learning

Representations (ICLR).

Bahuleyan, H., Mou, L., Vechtomova, O., and Poupart,

P. (2018). Variational Attention for Sequence-to-

Sequence Models. In International Conference on

Computational Linguistics (COLING).

Bridle, J. S. (1990). Probabilistic Interpretation of Feedfor-

ward Classiﬁcation Network Outputs, with Relation-

ships to Statistical Pattern Recognition. Neurocom-

puting, pages 227–236.

Chen, T., Liu, X., Xia, B., Wang, W., and Lai, Y.

(2020). Unsupervised Anomaly Detection of In-

dustrial Robots Using Sliding-Window Convolutional

Variational Autoencoder. IEEE Access, 8:47072–

47081.

Chollet, F. (2021). Deep Learning with Python. Manning

Publications.

Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., and Carin,

L. (2019). Cyclical Annealing Schedule: A Simple

Approach to Mitigating. In Conference of the Associa-

tion for Computational Linguistics: Human Language

Technologies (NAACL-HLT).

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep

Learning. MIT Press.

Kingma, D. P. and Welling, M. (2014). Auto-Encoding

Variational Bayes. In International Conference on

Learning Representations (ICLR).

Li, L., Yan, J., Wang, H., and Jin, Y. (2021). Anomaly

Detection of Time Series With Smoothness-Inducing

Sequential Variational Auto-Encoder. Transactions on

Neural Networks and Learning Systems, 32(3):1177–

1191.

Park, D., Hoshi, Y., and Kemp, C. C. (2018). A Multimodal

Anomaly Detector for Robot-Assisted Feeding Us-

ing an LSTM-Based Variational Autoencoder. IEEE

Robotics and Automation Letters, 3(3):1544–1551.

Pereira, J. and Silveira, M. (2018). Unsupervised Anomaly

Detection in Energy Time Series Data Using Varia-

tional Recurrent Autoencoders with Attention. In In-

ternational Conference on Machine Learning and Ap-

plications (ICMLA).

Pereira, J. and Silveira, M. (2019). Unsupervised repre-

sentation learning and anomaly detection in ECG se-

quences. International Journal of Data Mining and

Bioinformatics, 22(4):389.

Rezende, D. J. and Mohamed, S. (2015). Variational Infer-

ence with Normalizing Flows. In International Con-

ference on Machine Learning (ICML).

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014).

Stochastic Backpropagation and Approximate Infer-

ence in Deep Generative Models. In International

Conference on Machine Learning (ICML).

Shannon, C. (1949). Communication in the Presence of

Noise. Proceedings of the IRE, 37(1):10–21.

Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., and Pei, D.

(2019). Robust Anomaly Detection for Multivariate

Time Series through Stochastic Recurrent Neural Net-

work. In International Conference on Knowledge Dis-

covery & Data Mining (KDD).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I.

(2017). Attention Is All You Need. In Conference

on Neural Information Processing Systems (NIPS).

von Schleinitz, J., Graf, M., Trutschnig, W., and Schr

oder,

A. (2021). VASP: An autoencoder-based approach

for multivariate anomaly detection and robust time

series prediction with application in motorsport.

Engineering Applications of Artiﬁcial Intelligence,

104:104354.

NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications

418