MA-VAE: Multi-Head Attention-Based Variational Autoencoder Approach for Anomaly Detection in Multivariate Time-Series Applied to Automotive Endurance Powertrain Testing

Lucas Correia (1,2,a), Jan-Christoph Goos (1,b), Philipp Klein (1), Thomas Bäck (2,c) and Anna V. Kononova (2,d)

1: Mercedes-Benz AG, Stuttgart, Germany
2: Leiden University, Leiden, The Netherlands

a: https://orcid.org/0000-0002-3653-5934
b: https://orcid.org/0009-0002-7063-3615
c: https://orcid.org/0000-0001-6768-1478
d: https://orcid.org/0000-0002-4138-7024
Keywords: Anomaly Detection, Multivariate, Time Series, Automotive, Test Bench, Variational Autoencoder, Bypass Phenomenon.
Abstract: A clear need for automatic anomaly detection applied to automotive testing has emerged as more and more attention is paid to the data recorded and manual evaluation by humans reaches its capacity. Such real-world data is massive, diverse, multivariate and temporal in nature, therefore requiring modelling of the testee behaviour. We propose a variational autoencoder with multi-head attention (MA-VAE), which, when trained on unlabelled data, not only produces very few false positives but also detects the majority of the anomalies presented. In addition, the approach offers a novel way to avoid the bypass phenomenon, an undesirable behaviour investigated in the literature. Lastly, the approach introduces a new method to remap individual windows to a continuous time series. The results are presented in the context of a real-world industrial data set, and several experiments are undertaken to further investigate certain aspects of the proposed model. When configured properly, it is wrong only 9% of the time when an anomaly is flagged and discovers 67% of the anomalies present. MA-VAE also has the potential to perform well with only a fraction of the training and validation subset; however, to realise this potential, a more sophisticated threshold estimation method is required.
1 INTRODUCTION
Powertrain testing is an integral part of the wider auto-
motive powertrain development and is undertaken at
different stages of development. Each of these stages
is composed of many integration levels. These inte-
gration levels range from powertrain sub-component
testing, such as the electric drive unit (EDU) con-
troller or high-voltage battery (HVB) management
system, to whole vehicle powertrain testing. Each
of these has its special type of controlled environ-
ment, called a test bench. The use-case in this paper
is on an endurance powertrain test bench, where the
EDU and HVB on their own are tested under differ-
ent conditions and loads for longer periods to simulate
wear over time. Given the high maintenance and upkeep costs of such test benches, it is desirable to keep
downtime at a minimum and to avoid faulty measure-
ments. Also, it is desirable to detect problems early
to prevent damage to the testee. Given that evaluation is done manually by inspection, it is not feasible to analyse every single measurement. Evaluation also tends to be delayed, only being undertaken days after the measurement is recorded. Hence, there is a clear need for an automatic, fast and unsupervised evaluation methodology which can flag anomalous measurements before the next measurement is started.
To achieve this, we propose a multi-head attention
variational autoencoder (MA-VAE). MA-VAE con-
sists of a bidirectional long short-term memory (BiL-
STM) variational autoencoder architecture that maps
a time-series window into a temporal latent distribu-
tion (Park et al., 2018) (Su et al., 2019). Also, a multi-
head attention (MA) mechanism is added to further
enhance the sampled latent matrix before it is passed
on to the decoder. As shown in the ablation study, this
approach avoids the so-called bypass phenomenon (Bahuleyan et al., 2018), which is the first contribu-
tion. Furthermore, this paper offers a unique method-
ology for the reverse-window process. It is used
for remapping the fixed-length windows the model is
trained on to continuous variable-length sequences.
This paper is structured as follows: First, a short
background is provided in Section 2 on the power-
train testing methodology specific to this use case,
as well as the theory behind VAE and MA mecha-
nisms. Then, related work in variational autoencoder-
based time-series anomaly detection is presented in
Section 3, followed by an in-depth introduction of
the real-world data set and the approach we propose
in Section 4. Then, several experiments testing dif-
ferent aspects of the proposed method are conducted
and discussed in Section 5, along with the final re-
sults. Finally, conclusions from this work are drawn
and an outlook into future work is provided in Sec-
tion 6. The source code for the data pre-processing,
model training as well as evaluation can be found un-
der https://github.com/lcs-crr/MA-VAE.
2 BACKGROUND
2.1 Real-World Application
During endurance testing a portfolio of different driv-
ing cycles is run, where a cycle is a standardised driv-
ing pattern, which enables repeatability of measure-
ments. For this type of testing the portfolio con-
sists exclusively of proprietary cycles, which differ
from the public cycles used, for example, for vehicle
fuel/energy consumption certification like the New
European Driving Cycle (NEDC) or the Worldwide
Harmonised Light Vehicles Test Cycle (WLTC). The
reason why proprietary cycles are used for endurance
runs is that they allow for more extensive loading of
the powertrain.
Given the presence of a battery in the testee, some
time has to be dedicated to battery soaking (sitting
idle) and charging. These procedures are also stan-
dardised using cycles, although, for the intents and
purposes of this paper, they are omitted. What is
left are the eight dynamic driving cycles representing
short, long, fast, slow and dynamic trips ranging from
5 to 30 minutes. There are multiple versions of the
same cycle, which mostly differ in starting conditions
such as state-of-charge (SoC) and temperature of the
battery.
On powertrain test benches, there are several con-
trol methods to ensure the testee maintains the given
driving cycle. In this particular test bench, the regu-
lation is done by the acceleration pedal and the EDU
revolutions-per-minute (rpm), which is nothing more
than a non-SI version of the angular velocity.
2.2 Data Set
This real-world data set consists of 3385 normal mea-
surement files, each of which contains hundreds of
(mostly redundant or empty) channels. A measure-
ment is considered normal when the testee behaviour
conforms to the norm. For this work, a list of $d_X = 13$ channels was hand-picked in consultation with the test bench engineers to choose a reasonable and representative number of channels. This list includes
the vehicle speed, EDU torque, current, voltage, ro-
tor temperature and stator temperature, left and right
wheel shaft torque, HVB current, voltage, tempera-
ture and SoC and inverter temperature. Given that
some channels (such as torque) are sampled much
faster than others (like temperature and SoC), a com-
mon sampling rate of 2Hz is chosen. Channels sam-
pled slower than 2Hz are linearly interpolated, which
is seen as permissible due to the lower amplitude res-
olution of those channels. Channels sampled faster
than 2Hz are passed through a low-pass filter with a
cut-off frequency of 1Hz and then resampled to 2Hz,
as is consistent with the Whittaker–Nyquist–Shannon
theorem (Shannon, 1949). Then the driving cycles
are z-score normalised, i.e. transformed such that the
mean for each channel lies at 0 and the standard de-
viation at 1. Lastly, the driving cycles (generally re-
ferred to as sequences in this paper) are windowed
to create a set of fixed-length sub-sequences, or win-
dows. First, each channel is auto-correlated to obtain
the number of lags of the slowest dynamic process
present in the signal. Then, the window size W is set
as the smallest power of two larger than the longest
lag, in this case, W = 256 time steps or 128 seconds.
Each window overlaps its preceding and succeeding windows by half a window, i.e. the shift between windows is $W/2 = 128$ time steps, in order to reduce the computational load compared to a shift of one time step.
Due to the absence of labelled anomalies in the
dataset, realistic anomalous events are intentionally
simulated and recorded following the advice of test
bench engineers. To this end, five anomaly types were
recorded. In the first type, the virtual wheel diameter
is changed, such that the resulting vehicle speed devi-
ates from the norm. The wheel diameter is a param-
eter as resistances are connected to the shafts rather
than actual wheels. The second type of anomaly in-
volves changing the driving mode from comfort to
sport, which leads to a higher HVB SoC drop over
the cycle and a different torque response. In the third
anomaly, the recuperation level is turned from maxi-
mum to zero, hence the minimum EDU torque is always non-negative and the HVB experiences a higher drop in SoC. In the case of the fourth anomaly,
the HVB is swapped for a battery simulator, where the
HVB voltage behaviour deviates from a real battery.
The fifth anomaly concerns the cooling loop shared by the inverter and EDU, whose cooling capacity is reduced at the beginning or middle of the cycle, leading to higher EDU rotor, EDU stator and inverter temperatures than normal. Every
anomaly type is recorded during every cycle at least
once, leading to 60 anomalous driving cycles that are
all used as the anomalous subset of the test set.
A plot of one normal and one wheel-diameter
anomalous cycle is shown in Figure 1. Due to the
long channel names, the plot only shows the channel
indices, a table containing the legend is shown in Ta-
ble 1 for context. Visual inspection may suggest that
the red plot is anomalous, since the EDU and HVB
voltage, temperature and state of charge deviate from
the black plot. This deviation is to be expected be-
cause they depend on how charged the battery is and
on how much the battery is used previous to the cur-
rent cycle. In the case of this anomaly, the only chan-
nel that demonstrates anomalous behaviour is the ve-
hicle speed, since

$$v_{\text{vehicle}} = r \times \omega \quad (1)$$

where $r$ is the wheel radius and $\omega$ the angular velocity.
Evidently, the anomalous behaviour is most visible at
higher speeds.
In an operative environment, it is desirable to find out whether the previously recorded sequence exhibited any problems before the next measurement is recorded. Also, a model that performs as well as
possible with as little data as possible translates to
faster deployment. Good performance is indicated by
a model that can detect as many anomalies as possi-
ble and rarely labels normal measurements wrongly.
Table 1: Legend for the channel names in Figure 1.
No. Name
1 Vehicle Speed
2 EDU Torque
3 Left Axle Torque
4 Right Axle Torque
5 EDU Current
6 EDU Voltage
7 HVB Current
8 HVB Voltage
9 HVB Temperature
10 HVB State of Charge
11 EDU Rotor Temperature
12 EDU Stator Temperature
13 Inverter Temperature
[Figure 1: thirteen stacked subplots, one per channel (indices 1-13, see Table 1), each showing z-score-normalised amplitude against Time [s].]
Figure 1: Features of a normal (black) and an anomalous
(red) cycle plotted with respect to time. The anomalous cy-
cle plotted represents a scenario where the wheel diameter
has not been set correctly. The amplitude axis is z-score
normalised to comply with confidentiality guidelines.
To investigate the required training subset size of the
model, it is trained with 1h, 8h, 64h, and 512h worth
of dynamic testing data, which corresponds to the
first 6, 44, 348, and 2785 driving cycles, respectively.
The results are also presented in Section 5. In each
of the above-mentioned cases, the training subset is
further split into training (80%) and validation (20%) subsets. Both the training and validation sub-
sets are batched to sets of 512 windows. Given the
anomalous subset size of 60 driving cycles, 600 nor-
mal driving cycles recorded after the ones in the train-
ing subset are chosen to make up the normal test sub-
set. This would imply that 9% of measurements at
the test bench are anomalous; in reality, however, this value is estimated to be much lower. This amount of
anomalous data in relation to normal data is used as
it approximately matches the anomaly ratio in public
data sets and because the data set is not large enough
to create a larger normal test subset.
2.3 Variational Autoencoders
The variational autoencoder (Kingma and Welling,
2014)(Rezende et al., 2014) is a generative model that
structurally resembles an autoencoder, but is theoreti-
cally derived from variational Bayesian statistics. As
opposed to the regular deterministic autoencoder, the
VAE uses the evidence lower bound (ELBO), which
is a lower-bound approximation of the so-called log evidence $\log p_\theta(X)$, as its objective function. The ELBO, Equation 2, can be expressed as the reconstruction log-likelihood and the negative Kullback-Leibler divergence ($D_{KL}$) between the approximate posterior $q_\phi(Z|X)$ and the prior $p_\theta(Z)$, which is typically assumed to be a Gaussian distribution (Goodfellow et al., 2016).

$$\mathcal{L}_{\theta,\phi}(X) = \mathbb{E}_{Z \sim q_\phi(Z|X)}\left[\log p_\theta(X|Z)\right] - D_{KL}\left(q_\phi(Z|X)\,\|\,p_\theta(Z)\right) \quad (2)$$
where $Z \in \mathbb{R}^{W \times d_Z}$ is the sampled latent matrix and $X \in \mathbb{R}^{W \times d_X}$ is the input window. $W$ refers to the window length, whereas $d_X$ and $d_Z$ refer to the input window and latent matrix dimensionality, respectively. Gradient-based optimisation minimises an objective function, while the goal is the maximisation of the ELBO; hence the final loss function is defined as the negative of Equation 2, shown in Equation 3.

$$\mathcal{L}_{\text{VAE}} = -\mathcal{L}_{\theta,\phi}(X) \quad (3)$$
Finally, to enable backpropagation through the otherwise intractable gradient of the ELBO, the reparametrisation trick (Kingma and Welling, 2014) is applied, shown in Equation 4.

$$Z = \mu_Z + \varepsilon \cdot \sigma_Z \quad (4)$$

where $\varepsilon \sim \mathcal{N}(0,1)$ and $(\mu_Z, \log\sigma_Z^2) = q_\phi(X)$.
2.4 Multi-Head Attention Mechanism
To simplify the explanation of MA as employed in
this work, multi-head self-attention (MS) will be ex-
plained instead with the small difference between MA
and MS being pointed out at the end.
MS consists of two different concepts: self-
attention and its multi-head extension. Self-attention
is nothing more than scaled dot-product attention
(Vaswani et al., 2017) where the key, query and value
are the same. The scaled dot-product attention score
is the softmax (Bridle, 1990) of the product between
query matrix $Q$ and key matrix $K$, which is scaled by $\sqrt{d_K}$. The product between the attention score and the value matrix $V$ yields the context matrix $C$, as shown in Equation 5.

$$C = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_K}}\right)V \quad (5)$$
Compared to recurrent or convolutional layers, self-
attention offers a variety of benefits, such as the re-
duction of computational complexity, as well as an in-
creased number of operations that can be parallelised (Vaswani et al., 2017). Also, self-attention inherits an advantage over Bahdanau-style attention (Bahdanau et al., 2015) from the underlying scaled dot-product attention mechanism: it can be computed efficiently using matrix multiplications (Vaswani et al., 2017).
Multi-head self-attention then allows the attention
model to attend to different representation subspaces
(Vaswani et al., 2017), in addition to learning useful
projections rather than it being a stateless transforma-
tion (Chollet, 2021). This is achieved using weight
matrices $W_i^Q$, $W_i^K$, $W_i^V$, which contain trainable parameters and are unique for each head $i$, as shown in Equation 6.

$$Q_i = QW_i^Q \qquad K_i = KW_i^K \qquad V_i = VW_i^V \quad (6)$$
Once the query, key and value matrices are linearly
transformed via the weight matrices, the context ma-
trix $C_i$ for each head $i$ is computed using Equation 7.

$$C_i = \text{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_K}}\right)V_i \quad (7)$$
Then, for h heads, the different context matrices are
concatenated and linearly transformed again via the
weight matrix $W^O$, resulting in the multi-head context matrix $C \in \mathbb{R}^{W \times d_Z}$, Equation 8.

$$C = [C_1, \ldots, C_h]W^O \quad (8)$$
The underlying mechanism of MA is identical to MS,
with the only difference being that $K = Q \neq V$. The
benefit of this alteration is discussed in Section 4.
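As a small sketch of Equations 5-8 with the $K = Q \neq V$ variant described above, the following uses the standard Keras MultiHeadAttention layer, which applies the per-head projections of Equation 6 internally; the shapes are illustrative only:

import tensorflow as tf

w, d_x, d_z, heads = 256, 13, 16, 8
mha = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=2, output_shape=d_z)

x = tf.random.normal((4, w, d_x))  # plays the role of query Q and key K
z = tf.random.normal((4, w, d_z))  # plays the role of value V

# K = Q != V: attention scores are computed from x, attended content comes from z
c = mha(query=x, value=z, key=x)
print(c.shape)  # (4, 256, 16), the multi-head context matrix C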
3 RELATED WORK
MA-VAE belongs to the so-called generative model
class, which encompasses both variational autoen-
coders, as well as generative adversarial networks.
This section focuses solely on the work on VAE pro-
posed in the context of time-series anomaly detection.
In time-series anomaly detection literature, the
only other model that uses the combination of a VAE
and an attention mechanism is by (Pereira and Sil-
veira, 2018). For the purpose of our paper, it is named
VS-VAE. Their approach consists of a BiLSTM en-
coder and decoder, where, for an input window of
length W, the t = W encoder hidden states of each di-
rection are passed on to the variational self-attention
(VS) mechanism (Bahuleyan et al., 2018). The result-
ing context vector is then concatenated with the sam-
pled latent vector and then passed on to the decoder.
The authors claim that applying VS to the VAE model solves the bypass phenomenon; however, no evidence for this claim is provided.
The first published time-series anomaly detection
approach based on VAE was LSTM-VAE (Park et al.,
2018). One of the contributions is its use of a dynamic
prior $\mathcal{N}(\mu_p, 1)$, rather than a static one $\mathcal{N}(0, 1)$. In addition to that, they introduce a state-based threshold estimation method consisting of a support-vector regressor (SVR), which maps the latent distribution parameters $(\mu_z, \sigma_z)$ to the resulting anomaly score using the validation data. Hence, the dynamic threshold can be obtained through Equation 9.

$$\eta_t = \text{SVR}(\mu_{z,t}, \sigma_{z,t}) + c \quad (9)$$
where c is a pre-defined constant to control sensitivity.
OmniAnomaly (Su et al., 2019) attempts to cre-
ate a temporal connection between latent distributions
by applying a linear Gaussian state space model to
them. For the purpose of this paper, it is called OmniA. Also, it concatenates the last gated recurrent unit (GRU) hidden state with the latent vector sampled in the previous time step. In addition to that, it uses planar normalising flow (Rezende and Mohamed, 2015) by applying $K$ transformations to the latent vector in order to approximate a non-Gaussian posterior, as shown in Equation 10.

$$f_k(z_t^{k-1}) = u \tanh(w z_t^{k-1}) + b \quad (10)$$

where $u$, $w$ and $b$ are trainable parameters.
A simplified VAE architecture (Pereira and Sil-
veira, 2019) based on BiLSTM layers is also pro-
posed. For the purpose of our paper, it is called
W-VAE. Unlike its predecessor (Pereira and Silveira,
2018), it drops the attention mechanism but provides
contributions elsewhere. It offers two strategies to de-
tect anomalies based on the VAE outputs. The first in-
volves clustering the space characterised by the mean
parameter of the latent distribution into two clusters
and labelling the larger one as normal. This strategy
has a few weaknesses: it cannot be used in an opera-
tive environment as it requires some sort of history of
test windows to form the clusters and it assumes that
there are always anomalous samples present. The sec-
ond strategy finds the Wasserstein similarity measure
(hence the W in the name) between the latent mean
space mapping of the test window in question and the
respective mapping i resulting from a representative
data subset, such as the validation subset. Equation 11 shows how the Wasserstein similarity measure is computed:

$$W_i(z_{\text{test}}, z_i) = \left\| \mu_{z_{\text{test}}} - \mu_{z_i} \right\|_2^2 + \left\| \Sigma_{z_{\text{test}}}^{1/2} - \Sigma_{z_i}^{1/2} \right\|_F^2 \quad (11)$$
where the first term represents the L2-Norm between
the mean distribution parameters resulting from the
test window and each point of the representative sub-
set. The second term represents the Frobenius norm
between the covariance matrix resulting from the test
window and each point of the representative subset.
SWCVAE (Chen et al., 2020) is the first that ap-
plies convolutional neural networks (CNN) to VAE
for multivariate time-series anomaly detection. Pe-
culiarly, 2D CNN layers are used with the justifica-
tion of being able to process the input both spatially
and temporally. We, however, doubt the ability of
the model to properly detect anomalies through spa-
tial processing, as a kernel moving along the feature
axis can only capture features adjacent to each other.
To create a continuous anomaly score from windows
they append the last value of each window to the pre-
vious one. For the purpose of this paper, this process
is referred to as last-type reverse-windowing.
SISVAE (Li et al., 2021) tries to improve the mod-
elling robustness by the addition of a smoothing term
in the loss function which contributes to the reduction
of sudden changes in the reconstructed signal, making
it less sensitive to noisy time steps.
As part of the VASP framework (von Schleinitz
et al., 2021), a variational autoencoder architecture is
proposed to increase the robustness of time-series pre-
diction when faced with anomalies. While the main
contribution is attributed to the framework itself, not
the VAE, it should be noted that during inference only
the mean parameter of the latent distribution is passed
to the decoder.
4 PROPOSED APPROACH
4.1 Overview
To detect anomalies in multivariate time-series data,
we propose a variational autoencoder architecture
consisting of BiLSTM layers. The model architecture
is illustrated in Figure 2. During training, the encoder
$q_\phi$ maps the multivariate input window $X$ to a temporal distribution with parameters $\mu_Z$ and $\log\sigma_Z^2$ in the forward pass, Equation 12.

$$(\mu_Z, \log\sigma_Z^2) = q_\phi(X) \quad (12)$$
Given the latent distribution parameters $\mu_Z$ and $\log\sigma_Z^2$, the latent matrix is sampled from the resulting distribution, as shown in Equation 13.
[Figure 2 diagram: X (W x d_X) -> BiLSTM Encoder -> (µ_Z, log σ²_Z) (W x d_Z) -> Z -> Multi-head Attention (Q, K from X; V from Z) -> C (W x d_Z) -> BiLSTM Decoder -> (µ_X, log σ²_X) (W x d_X).]
Figure 2: An illustration of the proposed MA-VAE model. Blue shapes designate trainable models, orange deterministic
tensors and green distribution parameters. The shape of each tensor is designated below it. During training Z is used as the
value matrix, denoted by the solid arrow, whereas during inference $\mu_Z$ is used as the value matrix, denoted by the traced arrow.
$$Z \sim \mathcal{N}(\mu_Z, \log\sigma_Z^2) \quad (13)$$
Then, the input window $X$ is linearly transformed to obtain the query matrices $Q_i$ and key matrices $K_i$ for each head $i$. Likewise, the sampled latent matrix $Z$ is also transformed to the value matrix $V_i$, as shown in Equation 14.

$$Q_i = XW_i^Q \qquad K_i = XW_i^K \qquad V_i = ZW_i^V \quad (14)$$
To output the context matrix $C_i$ for each head $i$, the softmax of the query-key product normalised by $\sqrt{d_K}$ is multiplied with the value matrix, Equation 15.

$$C_i = \text{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_K}}\right)V_i \quad (15)$$
The final context matrix $C$ is the result of the linearly-transformed concatenation of each head-specific context matrix $C_i$, as expressed in Equation 16.

$$C = [C_1, \ldots, C_h]W^O \quad (16)$$
The decoder $p_\theta$ then maps the context matrix $C$ to an output distribution with parameters $\mu_X$ and $\log\sigma_X^2$, as shown in Equation 17.

$$(\mu_X, \log\sigma_X^2) = p_\theta(C) \quad (17)$$
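A rough sketch of this forward pass (Equations 12-18) using Keras building blocks is shown below; the layer sizes, names and the way the encoder outputs are produced are illustrative assumptions rather than the exact architecture, which is available in the linked repository:

import tensorflow as tf

W, d_x, d_z, heads = 256, 13, 16, 8

# Encoder q_phi: window X -> (mu_Z, log sigma^2_Z), each of shape (W, d_z)
enc_in = tf.keras.Input((W, d_x))
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(enc_in)
mu_z = tf.keras.layers.Dense(d_z)(h)
logvar_z = tf.keras.layers.Dense(d_z)(h)
encoder = tf.keras.Model(enc_in, [mu_z, logvar_z])

# Decoder p_theta: context matrix C -> (mu_X, log sigma^2_X), each of shape (W, d_x)
dec_in = tf.keras.Input((W, d_z))
g = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(dec_in)
mu_x = tf.keras.layers.Dense(d_x)(g)
logvar_x = tf.keras.layers.Dense(d_x)(g)
decoder = tf.keras.Model(dec_in, [mu_x, logvar_x])

# Multi-head attention: query/key come from X, value from the latent matrix
mha = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=2, output_shape=d_z)

def forward(x, training=True):
    mu_z, logvar_z = encoder(x)
    if training:                                    # Equation 13: sample Z
        eps = tf.random.normal(tf.shape(mu_z))
        z = mu_z + eps * tf.exp(0.5 * logvar_z)
    else:                                           # Equation 18: Z = mu_Z at inference
        z = mu_z
    c = mha(query=x, value=z, key=x)                # Equations 14-16
    return decoder(c)                               # Equation 17

mu_x_out, logvar_x_out = forward(tf.random.normal((2, W, d_x)))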
4.2 Inference Mode
Despite the generative capabilities of VAE, MA-VAE
does not leverage generation for anomaly detection.
Rather than sampling a latent matrix as shown in
Equation 13 during inference, sampling is disabled
and only $\mu_Z$ is taken as the input for the multi-head attention mechanism, like in (von Schleinitz et al., 2021). Equation 13 in the forward pass, therefore, is replaced by Equation 18.

$$Z = \mu_Z \quad (18)$$
This not only accelerates inference by eliminating the
sampling process but is also empirically found to be
a good approximation of an averaged latent matrix if
it were sampled several times like in (Pereira and Sil-
veira, 2018). The MA-VAE layout during inference
is shown in Figure 2, where the traced arrow desig-
nates the information flow from the encoder to the
MA mechanism.
4.3 Threshold Estimation Method
Anomalies are by definition very rare events, hence
an ideal anomaly detector only flags measurements
very rarely but accurately. Test bench engineers pre-
fer an algorithm that only flags a sequence when it is confident the sequence is anomalous, in other words, an algorithm that outputs very few to no false positives. A high false posi-
tive count would lead to a lot of stoppages and there-
fore lost testing time and additional cost. Of course,
the vast majority of measurements evaluated will be
normal and hence it is paramount to classify them
correctly, naturally leading to a high precision value.
Also, no automatic evaluation methodology is currently running at the test benches other than rudimentary rule-based methods; therefore, a solution that plugs into the existing system and automatically detects some or most of the anomalies undetectable by rules is already a gain. To achieve this, the threshold $\tau$ is set
as the maximum log probability observed when the
model is fed with validation data.
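In code, this threshold choice reduces to taking the maximum anomaly score observed on the normal validation data; a minimal sketch, assuming an anomaly_score function (the negative log probability of Section 4.6) is already available:

import numpy as np

def estimate_threshold(validation_sequences, anomaly_score) -> float:
    """Unsupervised threshold: largest anomaly score seen on normal validation data."""
    return max(float(np.max(anomaly_score(seq))) for seq in validation_sequences)

# A test sequence is then flagged as anomalous if its score ever exceeds tau:
# is_anomalous = np.max(anomaly_score(test_seq)) > tau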
4.4 Bypass Phenomenon
VAE, when combined with an attention mecha-
nism, can exhibit a behaviour called the bypass phe-
nomenon (Bahuleyan et al., 2018). When the bypass
phenomenon happens the latent path between encoder
and decoder is ignored and information flow occurs
mostly or exclusively through the attention mecha-
nism, as it has deterministic access to the encoder hid-
den states and therefore avoids regularisation through
the $D_{KL}$ term. In an attempt to avoid this, (Bahuleyan
et al., 2018) propose variational attention, which, like
the VAE, maps the input to a distribution rather than a
deterministic vector. Applied to natural language pro-
cessing, (Bahuleyan et al., 2018) demonstrate that this
leads to a diversified generated portfolio of sentences,
indicating alleviation of the bypassing phenomenon.
As previously mentioned, only (Pereira and Silveira,
2018) applies this insight in the anomaly detection do-
main, however, they do not present any proof that it
alleviates the bypass phenomenon in their work. MA-
VAE, on the other hand, cannot suffer from the bypass phenomenon in the sense that information flow ignores the latent variational path between encoder and decoder, since the MA mechanism requires the value matrix $V$ from the encoder to output the context matrix. Assuming the bypass phenomenon also applies
to a case where information flow ignores the atten-
tion mechanism, one could claim that MA-VAE is not
immune. To disprove this claim, the attention mecha-
nism is removed from the model in an ablation study
to see if anomaly detection performance remains the
same. In this case, V is instead directly input into the
decoder. If performance drops, it is evidence of the contribution of the attention mechanism to the model performance and hence that it is not bypassed. The results for this abla-
tion study are shown and discussed in Section 5.
4.5 Impact of Seed Choice
Given the stochastic nature of the VAE, the chosen
seed can impact the anomaly detection performance
as it can lead to a different local minimum during
training. To investigate the impact the seed choice
has on model training, MA-VAE is trained on three
different seeds, the respective results are also shown
in Section 5.
4.6 Reverse-Window Process
Since the model is trained to reconstruct fixed-length
windows, the same applies during inference. How-
ever, to decide whether a given measurement se-
quence $S \in \mathbb{R}^{T \times d_X}$ is anomalous, a continuous recon-
struction of the measurement is required. The easiest
way to do so would be to window the input measure-
ment using a shift of 1, input the windows into the
model and chain the last time step from each output
window to obtain a continuous sequence (Chen et al.,
2020). Considering the BiLSTM nature of the en-
coder and decoder, the first and last time steps of a
window can only be computed given the states from
one direction, making these values, in theory, less ac-
Algorithm 1: Anomaly Detection Process.

Input: sequence $S \in \mathbb{R}^{T \times d_X}$, threshold $\tau$
Result: label $l$

  $n_{windows} \leftarrow T - W + 1$
  $\mu_{X,temp} \leftarrow \mathrm{zeros}(n_{windows}, T, d_X) + \mathrm{NaN}$
  $\sigma^2_{X,temp} \leftarrow \mathrm{zeros}(n_{windows}, T, d_X) + \mathrm{NaN}$
  for $i = 1 \to n_{windows}$ do
      $X \leftarrow S[i : W + i]$
      $(\mu_Z, \log\sigma^2_Z) \leftarrow q_\phi(X)$
      $C \leftarrow \mathrm{MA}(X, X, \mu_Z)$
      $(\mu_X, \log\sigma^2_X) \leftarrow p_\theta(C)$
      $\mu_{X,temp}[i, i : i + W] \leftarrow \mu_X$
      $\sigma^2_{X,temp}[i, i : i + W] \leftarrow \sigma^2_X$
  end for
  $\mu_{X,seq} \leftarrow \mathrm{nanmean}(\mu_{X,temp})$
  $\sigma^2_{X,seq} \leftarrow \mathrm{nanmean}(\sigma^2_{X,temp})$
  $s \leftarrow -\log p(S \mid \mu_{X,seq}, \sigma_{X,seq})$
  $l \leftarrow \max(s) > \tau$
curate, however. To overcome this, we propose aver-
aging matching time steps in overlapping windows,
which we call the mean-type reverse-window method.
This is done by pre-allocating an array with NaN val-
ues, filling it, and taking the mean for each time step
while ignoring the NaN values. This process and the
general anomaly detection process are described in
Algorithm 1. This reverse-window process is done for
the mean and variance parameters of the output distri-
bution, then the variance is converted to standard de-
viation since two distributions cannot be combined by
averaging the standard deviations. With a continuous
mean and standard deviation, the continuous negative
log probability, i.e. the anomaly score s, is computed
for the respective measurement. A comparison be-
tween the mean, last and first reverse-window process
is provided in Section 5.
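A NumPy sketch of the mean-type reverse-window step of Algorithm 1 is given below, assuming the per-window model outputs are already available; the helper name is illustrative:

import numpy as np

def reverse_window_mean(windows: np.ndarray, seq_len: int) -> np.ndarray:
    """Remap overlapping windows (n_windows, W, d) back to a sequence (seq_len, d).

    Window i is assumed to start at time step i (shift of 1); matching time
    steps from overlapping windows are averaged, ignoring unfilled NaN entries.
    """
    n_windows, w, d = windows.shape
    buf = np.full((n_windows, seq_len, d), np.nan)
    for i in range(n_windows):
        buf[i, i:i + w] = windows[i]
    return np.nanmean(buf, axis=0)

# Example: 10 windows of length 4 over a sequence of length 13, one channel
out = reverse_window_mean(np.ones((10, 4, 1)), seq_len=13)
print(out.shape)  # (13, 1)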
5 RESULTS
5.1 Setup
The encoder and decoder both consist of two BiLSTM
layers, with the outer ones having 512 hidden- and
cell-state sizes and the inner ones 256. All other pa-
rameters are left as the default in the TensorFlow API.
During training only, input windows are corrupted with Gaussian noise with a standard deviation of 0.01 to increase robustness to noise.
Key factors that are investigated in Section 5 are
given a default value which applies to all experiments
unless otherwise specified. These factors are train-
ing/validation subset size, which is set to 512h, seed
choice, which has been kept at 1, reverse-window
method, where the mean-type is used, the latent dimension size, which is set to $d_Z = 16$, and the MA mechanism, which is set up as proposed in (Vaswani et al., 2017) with a head count of $h = 8$ and a key dimension size $d_K = d_X/h = 1$.
The optimiser used is the AMSGrad optimiser
with the default parameters in the TensorFlow API.
Cyclical $D_{KL}$ annealing (Fu et al., 2019) is applied to the training of MA-VAE to avoid the $D_{KL}$ vanishing problem. The $D_{KL}$ vanishing problem occurs when regularisation is too strong at the beginning of training, i.e. the Kullback-Leibler divergence term has a larger magnitude in relation to the reconstruction term. Cyclical $D_{KL}$ annealing allows the model to weigh the Kullback-Leibler divergence lower than the reconstruction term in a cyclical manner through a weight $\beta$. This callback is configured with a grace period of 25 epochs, where $\beta$ is linearly increased from 0 to $10^{-8}$. After the grace period, $\beta$ is set to $10^{-8}$ and is gradually increased linearly to $10^{-2}$ throughout the following 25 epochs, representing one loss cycle. This loss cycle is repeated until the training stops.
All priors in this work are set as standard Gaussian
distributions, i.e. p = N (0, 1).
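A rough sketch of the cyclical annealing schedule described above; the β values follow the description and the function itself is an assumption, not the actual callback from the repository:

def kl_weight(epoch: int, grace: int = 25, cycle: int = 25,
              beta_min: float = 1e-8, beta_max: float = 1e-2) -> float:
    """Cyclical KL-annealing weight: warm-up to beta_min, then repeated linear ramps."""
    if epoch < grace:
        # grace period: linear increase from 0 to beta_min
        return beta_min * epoch / grace
    # afterwards: repeat a linear ramp from beta_min to beta_max every `cycle` epochs
    phase = (epoch - grace) % cycle
    return beta_min + (beta_max - beta_min) * phase / cycle

# e.g. kl_weight(0) == 0.0, kl_weight(25) == 1e-8, kl_weight(49) is close to 1e-2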
To prevent overfitting, early stopping is imple-
mented. It works by monitoring the log probability
component of the validation loss during training and
stopping if it does not improve for 250 epochs. Accordingly, the model weights at the lowest log-probability validation loss are saved.
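With the standard Keras callback this corresponds to something like the following; the monitored metric name is an assumption:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_log_prob",        # log-probability component of the validation loss (name assumed)
    mode="min",                    # stop when it no longer decreases
    patience=250,                  # as described above
    restore_best_weights=True)     # keep the weights from the best epoch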
Training is done on a workstation configured with
an NVIDIA RTX A6000 GPU. The library used for
model training is TensorFlow 2.10.1 on Python 3.10
on Windows 10 Enterprise LTSC version 21H2.
The results provided are given in the form of the
calibrated and uncalibrated anomaly detection perfor-
mance, i.e. with and without consideration of thresh-
old τ, respectively. Recall that the threshold used is
the absolute maximum negative log probability ob-
tained from the validation set. Calibrated metrics are
the precision, recall and $F_1$ score. Precision $P$ represents the ratio between correctly identified anomalies (true positives) and all positives (true and false), recall $R$ represents the ratio between true positives and all anomalies, both shown in Equation 19, and the $F_1$ score represents the harmonic mean of the precision and recall, shown in Equation 20. The underlying metrics used to calculate all of the below are the true positives ($TP$), false negatives ($FN$) and false positives ($FP$).

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \quad (19)$$

$$F_1 = \frac{TP}{TP + 0.5\,(FP + FN)} = 2\,\frac{P \cdot R}{P + R} \quad (20)$$
The theoretical maximum $F_1$ score, $F_{1,\text{best}}$, is also provided to aid discussion. This represents the best possible score achievable by the approach if the ideal threshold were known, i.e. the point on the precision-recall curve that comes closest to the $P = R = 1$ point; in reality, this value is not observable and hence cannot be obtained in an unsupervised manner.
The uncalibrated anomaly detection performance, i.e. the performance for a range of thresholds, each 0.1 apart, is represented by the area under the continuous precision-recall curve $A_{\text{PRC}}^{\text{cont}}$, Equation 21.

$$A_{\text{PRC}}^{\text{cont}} = \int_0^1 P \, dR \quad (21)$$

As the integral cannot be computed for the continuous function, the area under the discrete precision-recall curve $A_{\text{PRC}}^{\text{disc}}$ is used, which is computed using the trapezoidal rule, Equation 22.

$$A_{\text{PRC}}^{\text{disc}} = \sum_{k=1}^{N} \frac{f(R_{k-1}) + f(R_k)}{2} \, \Delta R_k \quad (22)$$

where $N$ is the number of discrete sub-intervals, $k$ the index of the sub-intervals and $\Delta R_k$ the sub-interval length at index $k$. Precision is a function of recall, i.e. $P = f(R)$.
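For reference, a generic sketch of Equations 20 and 22 is given below; this is not the paper's evaluation script, and the counts in the example call are illustrative only:

import numpy as np

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation 20."""
    return tp / (tp + 0.5 * (fp + fn))

def area_under_pr_curve(precisions, recalls) -> float:
    """Discrete area under the precision-recall curve (Equation 22), trapezoidal rule."""
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    return float(np.trapz(p, r))

# Illustrative counts only: 60 anomalies with 40 detected and 4 false alarms
print(round(f1_score(tp=40, fp=4, fn=20), 2))                  # 0.77
print(area_under_pr_curve([1.0, 0.9, 0.5], [0.1, 0.5, 0.9]))   # ~0.66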
5.2 Ablation Study
MA-VAE is tested without the MA mechanism and
with a direct connection from the encoder to the de-
coder to observe whether it impacts results.
The anomaly detection performance of MA-VAE
and its counterpart without MA, henceforth referred
to as No MA model, are shown in Table 2. While
the precision value of the No MA model is slightly
higher than the MA-VAE, the recall value on the other
hand is much lower. Overall, MA-VAE has a higher
$F_1$ score, as well as a higher theoretical maximum $F_1$ score, although both values are close enough to each other that one could claim the threshold is near ideal. The uncalibrated performance is also higher in
the case of the MA-VAE, as evident in the precision-
recall plot in Figure 3. Interestingly, MA-VAE may
feature a lower precision value for the chosen un-
supervised threshold but has the potential to have a
higher maximum recall at P = 1.
The results hence point towards an improvement
brought about by the addition of the MA mechanism
and therefore the bypass phenomenon can be ruled
out.
NCTA 2023 - 15th International Conference on Neural Computation Theory and Applications
414
Table 2: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC results for the model variant without the MA mechanism and MA-VAE. The best values for each metric are given in bold.

Model   |  P   |  R   |  F1  | F1,best | A_PRC
No MA   | 1.00 | 0.35 | 0.52 |  0.54   | 0.52
MA-VAE  | 0.92 | 0.55 | 0.69 |  0.70   | 0.66
Figure 3: Precision-recall curves for the model variation
without MA and MA-VAE.
5.3 Data Set Size Requirements
To evaluate how much data is required to train MA-
VAE to a point of adequate anomaly detection perfor-
mance, it has been trained with 1h, 8h, 64h, and 512h
worth of dynamic testing data.
The results for this experiment are presented in
Table 3. On the one hand, as the training/validation
subset increases in size, the precision value improves,
with the largest jump occurring when the dynamic
testing time goes from 1h to 8h. The recall value on
the other hand decreases as the subset grows. This can
be attributed to the fact that smaller subset sizes lead
to a small validation set and therefore less data to ob-
tain a threshold from. With a limited amount of data
to obtain a threshold from, it is more difficult to get
a representative error distribution, leading to a thresh-
old that is very small and hence marks most anoma-
lies correctly but also leads to a lot of false positives.
The $F_1$ score reaches a point of diminishing returns from the 8h subset onwards; this can also be observed in the case of the theoretical maximum $F_1$ score, $F_{1,\text{best}}$, as well as in the $A_{\text{PRC}}$ value, further supported by the precision-recall plot in Figure 4. Lastly, the $F_1$ score seems to approach the $F_{1,\text{best}}$ score as the subset grows, also backing the fact that with a small subset size, a good threshold cannot easily be obtained.
Table 3: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC results for the different training/validation subset sizes. The best values for each metric are given in bold.

Size  |  P   |  R   |  F1  | F1,best | A_PRC
1h    | 0.09 | 0.88 | 0.17 |  0.55   | 0.49
8h    | 0.66 | 0.63 | 0.64 |  0.72   | 0.69
64h   | 0.71 | 0.57 | 0.63 |  0.69   | 0.68
512h  | 0.92 | 0.55 | 0.69 |  0.70   | 0.66
Figure 4: Precision-recall curves for the model trained on
different training/validation subset sizes.
Therefore, for application at the test bench, the
largest subset size is desirable due to the higher pre-
cision value and a closer-to-ideal threshold value.
5.4 Impact of Seed Choice
To illustrate the impact it has on the performance met-
rics, they are presented using three different seeds.
Table 4 shows that while the precision values are
roughly in the same range for all seeds, the recall val-
ues vary more significantly, which also reflects on the
$F_1$ score. However, by inspecting the $F_{1,\text{best}}$ and $A_{\text{PRC}}$ values it becomes clear that the seeds are not as far
apart as the recall value suggests and that the issue
may lie with the threshold choice. Figure 5 further
supports this, as all lines have roughly the same path,
with the exception of seed 3 at very high precision
values. The plot clearly shows that a more suitable
(lower) threshold would lead to seed 3 having a com-
parable recall value to the other seeds while maintain-
ing high precision.
Some differences can be observed between the
seeds, especially in the recall values; however, this
can be attributed to the unsupervised threshold choice.
Table 4: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC results for the different seeds. The best values for each metric are given in bold.

Seed |  P   |  R   |  F1  | F1,best | A_PRC
1    | 0.92 | 0.55 | 0.69 |  0.70   | 0.66
2    | 0.90 | 0.60 | 0.72 |  0.73   | 0.67
3    | 0.96 | 0.40 | 0.56 |  0.70   | 0.64
Figure 5: Precision-recall curves for the model trained on
different seeds.
5.5 Reverse-Window Process
To investigate the effect of the mean-type reverse-
window method, it is compared with the first-type and
last-type methods where the first and last values of
each window are carried over, respectively.
The results in this subsection, Table 5 and Figure
6, tell a similar story to the previous subsection. The
metrics independent of the chosen threshold are very
similar regardless of the reverse-window method, im-
plying that they are comparable and that any differ-
ences in the calibrated metrics can be attributed to
the chosen threshold. The mean-type reverse-window
method results in a higher computational load, though
negligible. For a rather long sequence of 4000 time
steps, i.e. around 33 minutes long, the mean-type
method only takes around 2 seconds longer. One
source of delay that can appear, however, is during
online anomaly detection. An online anomaly detec-
tion algorithm is defined as an algorithm which eval-
uates the sequence as it is being recorded. To obtain
time step $t$ using the mean-type (or the first-type) method, one has to wait for time step $t + W$ while $t < W$. This
Table 5: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC results for the different reverse-window types. The best values for each metric are given in bold.

Type  |  P   |  R   |  F1  | F1,best | A_PRC
first | 0.97 | 0.48 | 0.64 |  0.69   | 0.64
last  | 0.88 | 0.58 | 0.71 |  0.71   | 0.67
mean  | 0.92 | 0.55 | 0.69 |  0.70   | 0.66
Figure 6: Precision-recall curves for different reverse-
window methods.
translates to a delay of around 2 minutes in the real
world, given the chosen window size. If the evalua-
tion is done offline, i.e. when t = W , then this delay
is eliminated since the last value does not have other
overlapping values to compute the mean.
5.6 Hyperparameter Optimisation
As part of the hyperparameter optimisation of MA-
VAE, a list of latent dimension sizes $d_Z$ in combination with a list of key dimension sizes $d_K$ is tested. Despite the larger learning capacity associated with a higher $d_K$, the concatenation is always transformed to a matrix of size $d_O = d_Z$. For the two variables, values of 1, 4, 16, and 64 are tested.
The best result is achieved with $d_Z = d_K = 64$. Given that these are the respective highest values of $d_Z$ and $d_K$ tested, even higher values should be experimented with in the future, though they will lead to higher model complexity and training/inference time. The attention head count $h$ was also experimented with using the same range of values as for $d_Z$ and $d_K$; however, none performed better than the $h = 8$ configuration. The results are presented in Table 6, and the corresponding precision-recall plot is shown in Figure 7.
With this configuration, 91% of the sequences flagged as anomalous were actually anomalous, and 67% of the total number of anomalous sequences in the test set were detected. One example of
Table 6: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC result for the best d_Z, d_K and h values.

d_Z | d_K | h |  P   |  R   |  F1  | F1,best | A_PRC
64  | 64  | 8 | 0.91 | 0.67 | 0.77 |  0.79   | 0.74
Figure 7: Precision-recall curve for the final MA-VAE.
the anomalous cycles and the respective reconstruc-
tions is plotted in Figure 8.
5.7 Benchmarking
Of course, MA-VAE is not the first model proposed
for time-series anomaly detection. To underline its
anomaly detection performance, it is compared with
a series of other models based on variational autoen-
coders. The chosen subset of models is based on
the work discussed in Section 3 which either linked
source code or contained enough information for im-
plementation. The models are implemented using hy-
perparameters specified in their respective publica-
tions. To even the playing field, the models are trained
on the 512h subset with early stopping, which is
parametrised equally across all models. The anomaly
detection process specified in Algorithm 1 is also ap-
plied to all models, along with the threshold estima-
tion method. The results can be seen in Table 7.
Table 7: Precision P, recall R, F1 score, theoretical best F1 score F1,best and area under the precision-recall curve A_PRC results for competing models and MA-VAE (Ours). The best values for each metric are given in bold.

Model   |  P   |  R   |  F1  | F1,best | A_PRC
VS-VAE  | 1.00 | 0.33 | 0.50 |  0.56   | 0.51
W-VAE   | 1.00 | 0.30 | 0.46 |  0.46   | 0.41
OmniA   | 0.96 | 0.37 | 0.53 |  0.58   | 0.53
SISVAE  | 1.00 | 0.30 | 0.46 |  0.50   | 0.51
MA-VAE  | 0.91 | 0.67 | 0.77 |  0.79   | 0.74
Figure 8: Wheel diameter anomaly plotted in black and the
output distribution in red, as well as anomaly score plotted
in blue and the threshold as a straight line in orange.
As is evident, MA-VAE outperforms all other models in every metric except precision. As stated in Section 4, a high precision figure is important in this type of powertrain testing; however, the reduced precision is still considered tolerable. Also, it comes at the benefit of a much higher recall figure, which is reflected in the superior $F_1$ figure. Furthermore, the $F_{1,\text{best}}$ figure, which is obtained at $P = 0.98$ and $R = 0.67$, suggests that MA-VAE has the potential to achieve even higher precision without sacrificing recall if the threshold were optimised. The higher $A_{\text{PRC}}$ also shows that MA-VAE performs well over a wider range of thresholds.
6 CONCLUSION AND OUTLOOK
In this paper, a multi-head attention variational au-
toencoder (MA-VAE) for anomaly detection in auto-
motive testing is proposed. It not only features an
attention configuration that avoids the bypass phe-
nomenon but also introduces a novel method of
remapping windows to whole sequences. A num-
ber of experiments are conducted to demonstrate the
anomaly detection performance of the model, as well
as to underline the benefits of key aspects introduced
with the model.
From the results obtained, MA-VAE clearly ben-
efits from the MA mechanism, indicating the avoid-
ance of the bypass phenomenon. Moreover, the
proposed approach only requires a small train-
ing/validation subset size but fails to obtain a suit-
able threshold, as with increasing subset size only the
calibrated anomaly detection performance increases.
Training with different seeds also is shown to have
little impact on the anomaly detection metrics, pro-
vided the threshold is chosen suitably, further under-
lining the previous point. Moreover, mean-type re-
verse windowing fails to significantly outperform its
first-type and last-type counterparts, while introduc-
ing additional lag if it is applied to online anomaly
detection. Lastly, the hyperparameter optimisation re-
vealed that the MA-VAE variant with the largest latent
dimension and attention key dimension resulted in the
best anomaly detection performance. It is wrong only 9% of the time when an anomaly is flagged and manages to discover 67% of the anomalies present in the
test data set. Also, it outperforms all other competing
models it is compared with.
In the future, a method of threshold choice involv-
ing active learning will be investigated, which can use
user feedback to hone in on a better threshold. Also,
MA-VAE is set to be tested in the context of online
anomaly detection, i.e. during the driving cycle mea-
surement.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Ma-
chine Translation by Jointly Learning to Align and
Translate. In International Conference on Learning
Representations (ICLR).
Bahuleyan, H., Mou, L., Vechtomova, O., and Poupart,
P. (2018). Variational Attention for Sequence-to-
Sequence Models. In International Conference on
Computational Linguistics (COLING).
Bridle, J. S. (1990). Probabilistic Interpretation of Feedfor-
ward Classification Network Outputs, with Relation-
ships to Statistical Pattern Recognition. Neurocom-
puting, pages 227–236.
Chen, T., Liu, X., Xia, B., Wang, W., and Lai, Y.
(2020). Unsupervised Anomaly Detection of In-
dustrial Robots Using Sliding-Window Convolutional
Variational Autoencoder. IEEE Access, 8:47072–
47081.
Chollet, F. (2021). Deep Learning with Python. Manning
Publications.
Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., and Carin,
L. (2019). Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing. In Conference of the Associa-
tion for Computational Linguistics: Human Language
Technologies (NAACL-HLT).
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press.
Kingma, D. P. and Welling, M. (2014). Auto-Encoding
Variational Bayes. In International Conference on
Learning Representations (ICLR).
Li, L., Yan, J., Wang, H., and Jin, Y. (2021). Anomaly
Detection of Time Series With Smoothness-Inducing
Sequential Variational Auto-Encoder. Transactions on
Neural Networks and Learning Systems, 32(3):1177–
1191.
Park, D., Hoshi, Y., and Kemp, C. C. (2018). A Multimodal
Anomaly Detector for Robot-Assisted Feeding Us-
ing an LSTM-Based Variational Autoencoder. IEEE
Robotics and Automation Letters, 3(3):1544–1551.
Pereira, J. and Silveira, M. (2018). Unsupervised Anomaly
Detection in Energy Time Series Data Using Varia-
tional Recurrent Autoencoders with Attention. In In-
ternational Conference on Machine Learning and Ap-
plications (ICMLA).
Pereira, J. and Silveira, M. (2019). Unsupervised repre-
sentation learning and anomaly detection in ECG se-
quences. International Journal of Data Mining and
Bioinformatics, 22(4):389.
Rezende, D. J. and Mohamed, S. (2015). Variational Infer-
ence with Normalizing Flows. In International Con-
ference on Machine Learning (ICML).
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014).
Stochastic Backpropagation and Approximate Infer-
ence in Deep Generative Models. In International
Conference on Machine Learning (ICML).
Shannon, C. (1949). Communication in the Presence of
Noise. Proceedings of the IRE, 37(1):10–21.
Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., and Pei, D.
(2019). Robust Anomaly Detection for Multivariate
Time Series through Stochastic Recurrent Neural Net-
work. In International Conference on Knowledge Dis-
covery & Data Mining (KDD).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention Is All You Need. In Conference
on Neural Information Processing Systems (NIPS).
von Schleinitz, J., Graf, M., Trutschnig, W., and Schröder,
A. (2021). VASP: An autoencoder-based approach
for multivariate anomaly detection and robust time
series prediction with application in motorsport.
Engineering Applications of Artificial Intelligence,
104:104354.