two major bottlenecks of current methods by using
a divisive approach to reduce the effects of Gaussian
Processes' cubic runtime complexity and by employing
a purposive strategy to generate fewer candidate
models.
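To illustrate the complexity argument behind a divisive approach (this is a minimal sketch with hypothetical function names and parameters, not the paper's TAGR implementation): exact GP inference is dominated by a Cholesky factorization costing O(n³), so splitting a series into k segments of n/k points reduces the total cost to k·(n/k)³ = n³/k².

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two 1-D input sets."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_log_likelihood(x, y, noise=1e-2):
    """Log marginal likelihood of a zero-mean GP.

    The Cholesky factorization of the n x n covariance matrix is the
    O(n^3) bottleneck this sketch is about.
    """
    K = rbf_kernel(x, x) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2 * np.pi))

def divisive_log_likelihood(x, y, num_segments=8, noise=1e-2):
    """Sum of per-segment likelihoods: k segments of n/k points cost
    k * (n/k)^3 = n^3 / k^2 instead of n^3."""
    return sum(gp_log_likelihood(xs, ys, noise)
               for xs, ys in zip(np.array_split(x, num_segments),
                                 np.array_split(y, num_segments)))

# Toy data: a noisy sine wave with n = 2000 samples.
x = np.linspace(0, 10, 2000)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(len(x))
print(divisive_log_likelihood(x, y))
```

With 8 segments the example above performs eight 250-point factorizations instead of one 2000-point factorization, i.e. roughly a 64-fold reduction in Cholesky work, at the price of ignoring cross-segment covariance.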
Our performance evaluation has revealed that
GPMs resulting from the proposed TAGR algorithm
deliver model quality similar to that of models
produced by state-of-the-art algorithms. In
addition, the runtime of the retrieval process is reduced
significantly, especially for larger time series, where
we achieve a speed-up factor of approximately 500
with respect to existing methods such as CKS, ABCD,
and SKC.
As for future work, we consider global approximations
an opportunity for further optimizing our approach.
We therefore plan to investigate low-rank
approximations such as the Nyström method
(Hensman et al., 2013).
In addition, we plan to develop GPM retrieval algo-
rithms for big data processing frameworks in order to
scale GPM retrieval to very large and even multidi-
mensional datasets.
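The low-rank idea behind the Nyström method can be sketched as follows (an illustrative sketch under our own naming, not the planned implementation): approximate the full n × n kernel matrix as K ≈ K_nm K_mm⁻¹ K_mn using m ≪ n inducing (landmark) points, so that storage drops from O(n²) to O(nm) and solves from O(n³) to O(nm²).

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential covariance between two 1-D input sets."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def nystrom_approximation(x, num_inducing=50, jitter=1e-6, seed=0):
    """Rank-m Nystrom approximation K ~ K_nm K_mm^{-1} K_mn built from
    m randomly chosen inducing (landmark) points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=num_inducing, replace=False)
    z = x[idx]                                    # inducing inputs
    K_mm = rbf_kernel(z, z) + jitter * np.eye(num_inducing)
    K_nm = rbf_kernel(x, z)                       # n x m cross-covariance
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)

# Compare the low-rank reconstruction against the exact kernel matrix.
x = np.linspace(0, 10, 500)
K_full = rbf_kernel(x, x)
K_approx = nystrom_approximation(x, num_inducing=50)
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"relative Frobenius error: {rel_err:.4f}")
```

Because the eigenvalues of a smooth RBF kernel decay quickly, even a few dozen landmarks typically reconstruct the full matrix with small relative error on data like the above.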
REFERENCES
Abrahamsen, P. (1997). A review of Gaussian random
fields and correlation functions. Technical Report
917.
Aminikhanghahi, S. and Cook, D. J. (2017). A survey
of methods for time series change point detection.
Knowl. Inf. Syst., 51(2):339–367.
Berns, F. and Beecks, C. (2020a). Automatic Gaussian pro-
cess model retrieval for big data. In CIKM. ACM.
Berns, F. and Beecks, C. (2020b). Towards large-scale Gaus-
sian process models for efficient Bayesian machine
learning. In DATA, pages 275–282. SciTePress.
Berns, F., Schmidt, K. W., Grass, A., and Beecks, C. (2019).
A new approach for efficient structure discovery in IoT.
In BigData, pages 4152–4156. IEEE.
Cheng, C. and Boots, B. (2017). Variational inference for
Gaussian process models with linear complexity. In
NIPS, pages 5184–5194.
Chollet, F. (2018). Deep Learning with Python. Manning
Publications Co., Shelter Island, New York.
Csató, L. and Opper, M. (2000). Sparse representation for
Gaussian process models. In NIPS, pages 444–450.
MIT Press.
Duvenaud, D., Lloyd, J. R., Grosse, R. B., Tenenbaum,
J. B., and Ghahramani, Z. (2013). Structure dis-
covery in nonparametric regression through composi-
tional kernel search. In ICML (3), volume 28 of JMLR
Workshop and Conference Proceedings, pages 1166–
1174. JMLR.org.
Gittens, A. and Mahoney, M. W. (2016). Revisiting the Nys-
tröm method for improved large-scale machine learn-
ing. J. Mach. Learn. Res., 17:117:1–117:65.
Hebrail, G. and Berard, A. (2012). Individual house-
hold electric power consumption data set.
https://archive.ics.uci.edu/ml/datasets/individual+
household+electric+power+consumption. Accessed:
09/01/2020.
Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaus-
sian processes for big data. In UAI. AUAI Press.
Hong, T., Pinson, P., and Fan, S. (2014). Global energy
forecasting competition 2012. International Journal
of Forecasting, 30(2):357–363.
Iliev, A. I., Kyurkchiev, N., and Markov, S. (2017). On the
approximation of the step function by some sigmoid
functions. Math. Comput. Simul., 133:223–234.
Kim, H. and Teh, Y. W. (2018). Scaling up the automatic
statistician: Scalable structure discovery using Gaus-
sian processes. In AISTATS, volume 84 of Proceed-
ings of Machine Learning Research, pages 575–584.
PMLR.
Li, S. C. and Marlin, B. M. (2016). A scalable end-to-end
Gaussian process adapter for irregularly sampled time
series classification. In NIPS, pages 1804–1812.
Liu, H., Ong, Y., Shen, X., and Cai, J. (2020). When Gaus-
sian process meets big data: A review of scalable GPs.
IEEE Transactions on Neural Networks and Learning
Systems, pages 1–19.
Lloyd, J. R., Duvenaud, D., Grosse, R. B., Tenenbaum,
J. B., and Ghahramani, Z. (2014). Automatic con-
struction and natural-language description of nonpara-
metric regression models. In AAAI, pages 1242–1250.
AAAI Press.
Low, K. H., Yu, J., Chen, J., and Jaillet, P. (2015). Paral-
lel Gaussian process regression for big data: Low-rank
representation meets Markov approximation. In AAAI,
pages 2821–2827. AAAI Press.
Malkomes, G., Schaff, C., and Garnett, R. (2016). Bayesian
optimization for automated model selection. In NIPS,
pages 2892–2900.
Max Planck Institute for Biogeochemistry (2019). Weather
Station Beutenberg / Weather Station Saaleaue: Jena
Weather Data Analysis. https://www.bgc-jena.mpg.
de/wetter/. Accessed: 09/01/2020.
Park, C. and Apley, D. W. (2018). Patchwork kriging
for large-scale gaussian process regression. J. Mach.
Learn. Res., 19:7:1–7:43.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian
processes for machine learning. Adaptive computa-
tion and machine learning. MIT Press.
Resende, M. G. and Ribeiro, C. C. (2016). Optimization by
GRASP: Greedy Randomized Adaptive Search Proce-
dures. Springer New York, New York, NY.
Rivera, R. and Burnaev, E. (2017). Forecasting of com-
mercial sales with large scale Gaussian processes. In
ICDM Workshops, pages 625–634. IEEE Computer
Society.
Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N.,
and Aigrain, S. (2013). Gaussian processes for time-
series modelling. Philosophical transactions. Series
Large-scale Retrieval of Bayesian Machine Learning Models for Time Series Data via Gaussian Processes
79