Extracting Dynamics from Multi-dimensional Time-evolving Data
using a Bag of Higher-order Linear Dynamical Systems
Kosmas Dimitropoulos, Panagiotis Barmpoutis, Alexandros Kitsikidis and Nikos Grammalidis
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Keywords: Linear Dynamical Systems, Human Action Recognition, Dynamic Texture Analysis, Higher Order
Decomposition.
Abstract: In this paper we address the problem of extracting dynamics from multi-dimensional time-evolving data. To
this end, we propose a linear dynamical system (LDS) model based on the higher-order decomposition of
the observation data. In this way, we are able to extract a new descriptor for analyzing data of multiple
elements coming from the same or different data sources. Each data sequence is modeled as a
collection of higher-order LDS descriptors (h-LDSs), which are estimated in equally sized temporal
segments of the data. Finally, each sequence is represented as a term-frequency histogram following a bag-of-
systems approach, in which the h-LDSs are used as feature descriptors. To evaluate the ability of the
proposed methodology to extract dynamics from time-evolving multidimensional data and use them for
classification in various applications, we consider two different cases: dynamic
texture analysis and human motion recognition. Experimental results with two datasets for dynamic texture
analysis and two datasets for human action recognition demonstrate the great potential of the proposed
method.
1 INTRODUCTION
Machine learning problems often involve sequences
of real-valued multivariate observations. To model
the statistical properties of such data, it is assumed
that each observation is correlated to the value of an
underlying latent variable that is evolving over the
course of the sequence. If the state is real-valued and
the noise terms are assumed to be Gaussian, the
model is called a linear dynamical system (LDS)
(Boots, 2009). Thus, a linear dynamical system is
associated with a first-order ARMA process with
white, zero-mean, IID Gaussian input (Doretto et al.,
2003). Linear dynamical systems are an important
tool for modeling time series in engineering,
control and economics, as well as in the physical and
social sciences, and they have been successfully used
in the past for various vision tasks such as dynamic
texture analysis, synthesis, segmentation,
registration and categorization (Soatto et al., 2001).
They have also been employed for the categorization
of video sequences in multimedia databases and
more recently in human action recognition tasks.
More specifically, in the field of video
categorization a lot of methods have adopted LDSs
focusing mainly on the definition of a suitable
distance or kernel between the model parameters of
two dynamical systems (Doretto et al., 2003); (Chan
and Vasconcelos, 2005); (Chan and Vasconcelos,
2007); (Vishwanathan et al., 2007). In addition,
Turaga et al. (2011) showed that the parameters of
linear dynamic models are finite dimensional linear
subspaces that can be described using the unified
framework of Grassmann and Stiefel manifolds and
proposed algorithms for supervised and
unsupervised clustering for activity recognition, face
recognition and video clustering. More recently, a
new method was introduced by Ravichandran et al.
(2013) aiming to model video sequences with a
collection of LDSs, which are then used as features
in a bag-of-systems approach, while Luo et al.
proposed modelling motion dynamics with
robust LDSs, using the model parameters as motion
descriptors.
Nevertheless, a limitation of linear dynamical
systems is that they exploit information from only
one element, i.e., channel; thus, in the case of
multidimensional data, the concatenation of the different
components into one single element is required. To
this end, in this paper we propose a more efficient
Dimitropoulos, K., Barmpoutis, P., Kitsikidis, A. and Grammalidis, N.
Extracting Dynamics from Multi-dimensional Time-evolving Data using a Bag of Higher-order Linear Dynamical Systems.
DOI: 10.5220/0005844006830688
In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, pages 683-688
ISBN: 978-989-758-175-5
Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
way to model dynamics by taking advantage of the
multidimensionality of data. More specifically, we
present a higher-order LDS model in order to extract
a new descriptor for analyzing data coming from
multiple elements, e.g., channels in the case of video
sequences or joint coordinates in the case of skeleton
animation data. The proposed model is based on the
higher order decomposition of the multidimensional
data and enables the analysis of dynamic time-series
using information from the same or different data
sources, e.g., colour visible range cameras, infrared
sensors of various spectral ranges, or even
synthesized images.
The proposed h-LDS descriptors are estimated in
equally sized temporal segments, while a bag of
systems approach is adopted, in which the h-LDSs
are used as feature descriptors. For the formation of
the codebook, a k-medoids (Kaufman and
Rousseeuw, 1987) clustering method is applied,
where the K codewords correspond to K
representative higher-order LDSs. Each data
sequence is then represented as a Term Frequency
(TF) histogram over the predefined codebook of h-LDSs
and is provided to an SVM classifier.
To evaluate the ability of the proposed
methodology to extract dynamics from time series
and use them for classification, in this paper we
deal with the problems of dynamic texture analysis
and human action recognition.
2 HIGHER-ORDER LINEAR
DYNAMICAL ANALYSIS
2.1 Estimation of the h-LDS Descriptor
As mentioned above, a linear dynamical system
is associated with a first-order ARMA process with
white, zero-mean, IID Gaussian input. More
specifically, the stochastic modeling of both
dynamics and appearance is encoded by two
stochastic processes, in which the dynamics are
represented as a time-evolving hidden state process
x(t) ∈ R^n and the observed data y(t) ∈ R^d as a linear
function of the state vector:

    x(t + 1) = A x(t) + B v(t)                       (1)
    y(t) = ȳ + C x(t) + w(t)                         (2)

where A ∈ R^(n×n) is the transition matrix of the
hidden state (n is the dimension of the hidden state, with
n << d), while C ∈ R^(d×n) is the mapping matrix of the
hidden state to the output of the system. The
quantities w(t) and Bv(t) are the measurement and
process noise respectively, with w(t) ~ N(0, R) and
Bv(t) ~ N(0, Q). The main advantage of the LDS
descriptor M = (A, C) is that it contains both the
appearance information of the data segment, which
is modeled by C, and its dynamics, which are
represented by A.
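To make the model concrete, the two equations above can be sketched as a short simulation. This is a purely illustrative toy: the function name, the noise variances and the example matrices below are our assumptions, not taken from the paper.

```python
import numpy as np

def simulate_lds(A, C, y_mean, x0, T, q=0.01, r=0.01, seed=0):
    """Simulate x(t+1) = A x(t) + B v(t), y(t) = y_mean + C x(t) + w(t),
    with isotropic Gaussian process noise (variance q) and
    measurement noise (variance r)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape[0], C.shape[0]
    x, ys = x0, []
    for _ in range(T):
        ys.append(y_mean + C @ x + rng.normal(0.0, np.sqrt(r), d))  # eq. (2)
        x = A @ x + rng.normal(0.0, np.sqrt(q), n)                  # eq. (1)
    return np.array(ys)   # observations, shape (T, d)

# toy example: a slowly rotating 2-D hidden state observed in 3 dimensions
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
Y = simulate_lds(A, C, y_mean=np.zeros(3), x0=np.array([1.0, 0.0]), T=50)
```

The rotation matrix A keeps the state on a stable orbit, so the observed sequence oscillates: this is the kind of low-dimensional dynamics the descriptor M = (A, C) captures.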
In the case of multidimensional data,
observations can be represented by a tensor
Y ∈ R^(d_1×d_2×...×d_n) of order n, where d_1, d_2, ..., d_n are integers
indicating the number of elements in each
dimension. For instance, if we consider a colour
video sequence of F frames, the order of tensor Y (in
the rest of the paper the term 'tensor' is used to
indicate a matrix of order higher than two) is four,
where d_1 and d_2 indicate the width and height of the
image respectively, d_3 is the number of image
elements (d_3 = 3) and d_4 is the number of frames. In
order to estimate the matrices A and C containing the
dynamics and appearance information respectively
(see Figure 1), we need to decompose the n-order
tensor Y so that the columns of the mapping matrix
C are orthonormal (Doretto et al., 2003).
Figure 1: A graphical representation of the h-LDS model.
To satisfy the aforementioned requirement, we
use the HOSVD (Kuo, 2013), which is a
generalization of the singular value decomposition
to higher-order tensors. More specifically, we first
subtract from Y the temporal data average ȳ in order
to obtain a tensor that is zero-mean along the time axis,
where the temporal average is computed as:

    ȳ = (1/d_n) Σ_(t=1..d_n) Y(:, ..., :, t)         (3)

and then we decompose tensor Y as follows:

    Y = S ×_1 U^(1) ×_2 U^(2) ... ×_n U^(n)          (4)
where U^(1), U^(2), ..., U^(n) are orthogonal matrices
containing the orthonormal vectors spanning the
column space of the i-mode matrix unfolding Y_(i), and
×_i denotes the i-mode product between a tensor and
a matrix (Kuo, 2013), with i = 1, 2, ..., n. Since the
choice of matrices A, C and Q in equations (1) and
(2) is not unique, in the sense that there are infinitely
many such matrices that give rise to exactly the
same sample paths starting from suitable initial
conditions (Doretto et al., 2003), we can consider
C = U^(n), where U^(n) is an orthogonal matrix, and

    X = S ×_1 U^(1) ×_2 U^(2) ... ×_(n-1) U^(n-1)    (5)

Hence, equation (4) can be reformulated as follows:

    Y = X ×_n C                                      (6)
The n-mode product of tensor X ∈ R^(d_1×d_2×...×d_n) with
matrix C ∈ R^(d_n×d_n) can be defined via the n-mode
unfoldings as:

    Y = X ×_n C  ⇔  Y_(n) = C X_(n)                  (7)
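The unfolding and the i-mode product are mechanical operations, so a short numpy sketch may help; the helper names `unfold`, `fold` and `mode_product` are ours, not part of any library.

```python
import numpy as np

def unfold(T, mode):
    """Mode-i unfolding: the mode-i fibres of T become the columns of a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold, for a tensor of the given full shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def mode_product(T, M, mode):
    """i-mode product T x_i M: multiply every mode-i fibre of T by M."""
    shape = list(T.shape)
    shape[mode] = M.shape[0]
    return fold(M @ unfold(T, mode), mode, shape)

X = np.arange(24.0).reshape(2, 3, 4)
C = np.random.default_rng(0).normal(size=(5, 4))
Y = mode_product(X, C, mode=2)        # Y = X x_3 C, shape (2, 3, 5)
# the equivalence in equation (7): the product becomes C @ X_(n) after unfolding
assert np.allclose(unfold(Y, 2), C @ unfold(X, 2))
```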
The transition matrix A, containing the dynamics of
the multidimensional data, can then be easily
computed using least squares:

    A = X_2 X_1^†                                    (8)

where the matrices X_1 = [x_1, x_2, ..., x_(d_n-1)] and
X_2 = [x_2, x_3, ..., x_(d_n)] are formed from the
unfolding X_(n) of tensor X along the n-th dimension,
and (·)^† denotes the Moore-Penrose pseudoinverse.
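The estimation pipeline of equations (3)-(8) can be sketched end-to-end. This is one plausible realization under our own assumptions: the last tensor mode is taken to index time, the appearance bases come from truncated SVDs of the non-temporal mode unfoldings (the paper's notation folds the appearance information into a single mapping matrix C = U^(n)), and the rank truncation and function name `hlds_fit` are illustrative, not the authors' implementation.

```python
import numpy as np

def unfold(T, mode):
    """Mode-i unfolding of a tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hlds_fit(Y, ranks):
    """Sketch of h-LDS estimation for a segment tensor Y whose LAST mode
    indexes time.  Appearance: truncated orthonormal bases of the
    non-temporal mode unfoldings (HOSVD factors).  Dynamics: least-squares
    fit of the transition matrix A, as in equation (8)."""
    F = Y.shape[-1]
    Y = Y - Y.mean(axis=-1, keepdims=True)         # temporal zero-mean, eq. (3)
    Us = [np.linalg.svd(unfold(Y, i), full_matrices=False)[0][:, :ranks[i]]
          for i in range(Y.ndim - 1)]
    X = Y                                          # project onto the bases
    for i, U in enumerate(Us):
        X = np.moveaxis(np.tensordot(U.T, X, axes=(1, i)), 0, i)
    Xt = X.reshape(-1, F)                          # state vectors x_t as columns
    A = Xt[:, 1:] @ np.linalg.pinv(Xt[:, :-1])     # A = X_2 X_1^+, eq. (8)
    return Us, A

# e.g. one 16-frame RGB patch sequence of spatial size 8x8
Y = np.random.default_rng(1).normal(size=(8, 8, 3, 16))
Us, A = hlds_fit(Y, ranks=(4, 4, 2))               # A is 32x32 (= 4*4*2 states)
```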
2.2 Codebook Creation and
Classification
For the formation of the codebook, k-medoids
clustering is applied; however, before that we need
to define a similarity metric between two descriptors
M_1 = (A_1, C_1) and M_2 = (A_2, C_2) that is
applicable to the non-Euclidean space of h-LDSs.
Since the h-LDS descriptor consists of a pair of two-dimensional
matrices, we can easily use as a
similarity metric the Martin distance between M_1
and M_2:

    d(M_1, M_2)^2 = -ln Π_(i=1..n) cos^2 θ_i         (9)

where θ_i are the subspace angles (Cock and Moor,
2002) between the two models. The cosine of θ_i can
be calculated as the square root of the i-th
eigenvalue of the matrix P_11^(-1) P_12 P_22^(-1) P_21:

    cos θ_i = sqrt( λ_i( P_11^(-1) P_12 P_22^(-1) P_21 ) )   (10)

where the estimation of the matrix

    P = [ P_11  P_12 ]
        [ P_21  P_22 ]

is performed by solving the Lyapunov equation
A^T P A - P = -C^T C, where

    A = [ A_1   0  ]    and    C = [ C_1  C_2 ]
        [  0   A_2 ]
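A possible computation of the Martin distance with scipy is sketched below; `solve_discrete_lyapunov` solves the equation above when called with A^T and C^T C. Equal hidden-state dimensions for the two models are assumed for simplicity.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def martin_distance(A1, C1, A2, C2):
    """Squared Martin distance between two LDS descriptors via subspace
    angles, following equations (9)-(10).  Assumes stable systems (spectral
    radius of A below 1) and equal hidden-state dimensions."""
    n = A1.shape[0]
    A = np.block([[A1, np.zeros((n, n))], [np.zeros((n, n)), A2]])
    C = np.hstack([C1, C2])
    # solve the Lyapunov equation A^T P A - P = -C^T C
    P = solve_discrete_lyapunov(A.T, C.T @ C)
    P11, P12 = P[:n, :n], P[:n, n:]
    P21, P22 = P[n:, :n], P[n:, n:]
    M = np.linalg.solve(P11, P12) @ np.linalg.solve(P22, P21)
    cos2 = np.clip(np.real(np.linalg.eigvals(M)), 1e-12, 1.0)  # cos^2(theta_i)
    return -np.log(np.prod(cos2))                              # eq. (9)

# sanity check: the distance between a model and itself is (numerically) zero
A1 = 0.5 * np.eye(3)
C1 = np.random.default_rng(0).normal(size=(6, 3))
d_self = martin_distance(A1, C1, A1, C1)
```

For identical systems all block components of P coincide, every subspace angle vanishes, and the distance collapses to zero, which is why this quantity behaves like a metric over h-LDS descriptors.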
The codebook can then be created by using the h-LDSs
as feature descriptors. Specifically, the training
dataset of h-LDS descriptors is fed into the k-medoids
algorithm for the creation of a codebook of
K codewords corresponding to K representative h-LDSs,
as shown in Figure 2. Finally, each data
segment is represented as a Term Frequency
(TF) histogram over the predefined codebook of h-LDSs.
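The codebook stage can be sketched as follows: a basic k-medoids over a precomputed distance matrix (which, in the paper's pipeline, would hold pairwise Martin distances between h-LDS descriptors) followed by a TF histogram. The toy 1-D distance matrix below is purely illustrative.

```python
import numpy as np

def k_medoids(D, K, iters=50, seed=0):
    """Basic k-medoids on a precomputed distance matrix D; K codewords."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), K, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest-medoid assignment
        new = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size:                        # keep old medoid if empty
                cost = D[np.ix_(members, members)].sum(axis=0)
                new[k] = members[np.argmin(cost)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

def tf_histogram(labels, K):
    """Term-frequency histogram of codeword assignments for one sequence."""
    h = np.bincount(labels, minlength=K).astype(float)
    return h / h.sum()

# toy demo: two well-separated 1-D clusters stand in for h-LDS descriptors
x = np.array([0.0, 1, 2, 10, 11, 12])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, K=2)
hist = tf_histogram(labels, K=2)
```

Because k-medoids only needs the distance matrix, it works directly with the non-Euclidean Martin distance, which is why it is preferred here over k-means.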
Figure 2: h-LDS codebook creation and classification.
3 EXPERIMENTAL RESULTS
To evaluate the performance of the proposed
methodology in extracting dynamics from time-evolving
multidimensional data and using them for
classification, we consider two different
applications: i) dynamic texture analysis and ii)
human motion recognition. In the former case, the
proposed higher-order LDS descriptor is used to model
the temporal evolution of pixel intensities, while in
the latter the evolution of skeleton joint positions
during the performance of a motion is treated as
a multidimensional time series.
3.1 Dynamic Texture Analysis
In this section we present experimental results using
two video datasets for dynamic texture analysis.
More specifically, the first dataset (Dimitropoulos et
al., 2015) contains videos with flame and flame-colored
objects, while the second one contains
videos with smoke and non-smoke frames
(Barmpoutis et al., 2014). In both cases, each frame
of the video sequence is divided into image patches
of size 16x16, which is a typical approach in video-based
fire detection systems, and then a pre-processing
step is applied aiming to identify
candidate image patches, i.e., patches containing a
sufficient number of flame- or smoke-colored moving
pixels. To this end, we initially apply an Adaptive
Median algorithm (McFarlane and Schofield, 1995),
(Dimitropoulos et al., 2012), which is a fast and very
efficient algorithm for detecting moving pixels, and
then we use a fire probability model (Dimitropoulos
et al., 2015) or an HSV smoke color model
(Avgerinakis et al., 2012) to identify candidate flame
or smoke image patches respectively.
Figure 3: Comparison of LDS and h-LDS with grayscale,
RGB and RGBH data using the dataset containing flame
and flame colored objects.
Figure 4: Comparison of LDS and h-LDS with grayscale,
RGB and RGBH data using the dataset containing smoke
and smoke colored objects.
For each candidate image patch we estimate an h-LDS
descriptor using a temporal length of 16
frames. In addition, for the classification of each
frame, we create histogram representations
corresponding to the sub-sequences of the T previous
frames (in our experiments T = 100). In the
experimental results, we estimated the number of
correctly detected frames out of the total number of
frames in each dataset. As can be seen in Figures 3
and 4, the proposed descriptor outperforms standard
LDSs in both cases, i.e., flame and smoke
identification respectively. In order to validate the
performance of both descriptors with different
numbers of elements, apart from the use of grayscale
and RGB data (i.e., three elements), we also created
a fourth channel by visualizing the feature space of
the HOG descriptor as in (Vondrick et al., 2013) (i.e.,
RGBH data). Especially in the case of smoke, the
fourth channel seems to improve the results
significantly, while the LDS descriptor does not
show a significant change in its detection rate.
3.2 Human Action Recognition
In this section we deal with the problem of human
action recognition in game-like applications. More
specifically, depth sensors are used for capturing
the human motion, while the evolution of skeleton
joint positions during the performance of a motion
is treated as a multidimensional time series. To
extract the dynamics of the body motion, we
segment the multidimensional signal into equally
sized elementary segments using a sliding time
window of 16 frames. In this way we achieve a
better representation of the human motion than
using the whole non-linear sequence of data, as each
elementary segment can be efficiently modelled by a
linear dynamical system.
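The segmentation step above can be sketched in a few lines. The paper specifies only the 16-frame window; a non-overlapping step equal to the window length is our assumption, as are the function name and the (joints, coordinates, frames) layout.

```python
import numpy as np

def sliding_segments(seq, win=16, step=16):
    """Split a (joints, coords, frames) skeleton sequence into equally sized
    temporal segments; each segment is then modelled by one h-LDS."""
    return [seq[:, :, t:t + win]
            for t in range(0, seq.shape[-1] - win + 1, step)]

# e.g. a 100-frame sequence of 20 skeleton joints with 3-D coordinates
seq = np.zeros((20, 3, 100))
segments = sliding_segments(seq)   # 16-frame segment tensors
```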
Experimental results with two datasets for human
action recognition show that the proposed method
outperforms the different variants of LDSs on the
body motion recognition task. For the validation of
the proposed method we created a new Kinect
gesture dataset consisting of 360 motions, and we
also used the well-known MSRC-12 dataset
(Fothergill et al., 2012). The new dataset contains 6
actions (bend forward, left kick, right kick, raise
hands, hand wave, push with hands) performed by 6
subjects, each repeated 10 times (360 motions in
total), while the Microsoft Research Cambridge-12
Kinect gesture dataset (MSRC-12) comprises 594
sequences collected from 30 people performing 12
gestures. The MSRC-12 dataset is partitioned
according to the method of instruction given to the
subjects, such as text and video; we used the part of
the dataset where only video instructions were given.
Both datasets contain tracks of 20 skeleton joint
position coordinates estimated using the Kinect pose
estimation pipeline.
Figure 5: Comparison of LDS and h-LDS descriptor
performance on our dataset.
Figure 6: Comparison of LDS and h-LDS descriptor
performance on MSRC-12 dataset.
As seen in Figures 5 and 6, the histogram of
LDSs offers an improvement in classification results
compared to using a single descriptor for the whole
motion. Additionally, the h-LDS descriptor clearly
outperforms the simple LDS descriptor in each case;
the same behavior can be observed for the histogram
of LDSs.
4 CONCLUSIONS
In this paper, we introduced a higher-order linear
dynamical system (h-LDS) descriptor for extracting
dynamics from multidimensional time-evolving data.
By applying higher-order decomposition to the
observation data, we showed that we can achieve
higher detection rates than standard linear dynamical
systems in both dynamic texture analysis
and human action recognition. In the future, we plan
to use data from different sources, e.g.,
multispectral imaging in the case of flame detection,
or skeletal and depth data in the case of human
action recognition.
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Community's Seventh
Framework Programme (FP7-ICT-2011-9) under
grant agreement no FP7-ICT-600676 ''i-Treasures:
Intangible Treasures - Capturing the Intangible
Cultural Heritage and Learning the Rare Know-How
of Living Human Treasures''.
REFERENCES
Avgerinakis, K., Briassouli, A., Kompatsiaris, I., 2012.
"Smoke Detection Using Temporal HOGHOF
Descriptors and Energy Colour Statistics from Video,"
in Int'l Workshop on Multi-Sensor Systems and
Networks for Fire Detection and Management.
Barmpoutis, P., Dimitropoulos, K., Grammalidis, N.,
2014. "Smoke Detection Using Spatio-Temporal
Analysis, Motion Modeling and Dynamic Texture
Recognition", 22nd European Signal Processing
Conference (EUSIPCO 2014), Lisbon, Portugal, 1-5
September.
Boots, B., 2009. Learning stable linear dynamical systems.
M.S. Thesis in Machine Learning, Carnegie Mellon
University.
Chan, A., Vasconcelos, N., 2005. "Probabilistic Kernels
for the Classification of Auto-Regressive Visual
Processes," in IEEE Conf. Computer Vision and
Pattern Recognition.
Chan, A., Vasconcelos, N., 2007. "Classifying Video with
Kernel Dynamic Textures," in IEEE Conf. Computer
Vision and Pattern Recognition.
Cock, K. D., Moor, B. D., 2002. "Subspace angles and
distances between ARMA models," System and
Control Letters, vol. 4, pp. 265-270.
Dimitropoulos, K., Tsalakanidou, F., Grammalidis, N.,
2012. "Flame detection for video-based early fire
warning systems and 3D visualization of fire
propagation," in 13th IASTED Int'l Conf. on Computer
Graphics and Imaging.
Dimitropoulos, K., Barboutis, P., Grammalidis, N., 2015.
"Spatio-Temporal Flame Modeling and Dynamic
Texture Analysis for Automatic Video-Based Fire
Detection", IEEE Transactions on Circuits and
Systems for Video Technology, vol. 25, no. 2, pp.
339-351.
Doretto, G., Chiuso, A., Wu, Y. N., Soatto, S., 2003.
"Dynamic Textures," Int'l J. of Computer Vision, vol.
51, no. 2, pp. 91-109.
Fothergill, S., Mentis, H. M., Kohli, P., Nowozin, S.,
2012. Instructing people for training gestural
interactive systems. In J. A. Konstan, E. H. Chi, and
K. Hook, editors, CHI, pages 1737–1746. ACM.
Kaufman, L., Rousseeuw, P.J., 1987. Clustering by means
of Medoids. In Statistical Data Analysis Based on the
L1–Norm and Related Methods, edited by Y. Dodge,
North-Holland, 405–416.
Kuo, C. T., 2013. "Higher order SVD: theory and
algorithms".
McFarlane, N., Schofield, C., 1995. "Segmentation and
tracking of piglets in images," British Machine Vision
and Applications, vol. 8, pp. 187-193.
Ravichandran, A., Chaudhry, R., Vidal, R., 2013.
"Categorizing dynamic textures using a bag of
dynamical systems," IEEE Trans. on Pattern Analysis
and Machine Intelligence, vol. 35, no. 2, pp. 342-353,
February.
Soatto, S., Doretto, G., Wu, Y., 2001. Dynamic Textures.
Intl. Conf. on Computer Vision.
Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa,
R., 2011. "Statistical Computations on Grassmann and
Stiefel Manifolds for Image and Video based
Recognition," IEEE Trans. on Pattern Analysis and
Machine Intelligence, November.
Vishwanathan, S., Smola, A., Vidal, R., 2007. "Binet-
Cauchy Kernels on Dynamical Systems and Its
Application to the Analysis of Dynamic Scenes," Int'l
J. Computer Vision, vol. 73, no. 1, pp. 95-119.
Vondrick, C., Khosla, A., Malisiewicz, T., Torralba, A.,
2013. "HOGgles: Visualizing Object Detection
Features," in Int'l Conf. on Computer Vision, Sydney,
Australia, December.