Kernel Completion for Learning Consensus Support Vector Machines in
Bandwidth-limited Sensor Networks
Sangkyun Lee and Christian Pölitz
Fakultät für Informatik, LS VIII, Technische Universität Dortmund, 44221 Dortmund, Germany
Keywords:
Sensor Networks, Support Vector Machines, Gaussian Kernels, Sampling, Matrix Completion.
Abstract:
Recent developments in sensor technology allow for capturing dynamic patterns in vehicle movements, tem-
perature changes, and sea-level fluctuations, just to name a few. A usual way for decision making on sensor
networks, such as detecting exceptional surface level changes across the Pacific ocean, involves collecting
measurement data from all sensors to build a predictor in a central processing station. However, data col-
lection becomes challenging when communication bandwidth is limited, due to communication distance or
low-energy requirements. Also, such settings will introduce unfavorable latency for making predictions on
unseen events. In this paper, we propose an alternative strategy for such scenarios, aiming to build a consensus
support vector machine (SVM) in each sensor station by exchanging a small amount of sampled information
from local kernel matrices amongst peers. Our method is based on decomposing a “global” kernel defined
with all features into “local” kernels defined only with attributes stored in each sensor station, sampling few
entries of the decomposed kernel matrices that belong to other stations, and filling in unsampled entries in
kernel matrices by matrix completion. Experiments on benchmark data sets illustrate that a consensus SVM
can be built in each station using limited communication, which is competitive in prediction performance with an
SVM built with access to all features.
1 INTRODUCTION
Sensors can monitor many different kinds of dynam-
ics in nature, generating numerous data, and thereby
embodying research challenges in machine learning
and data mining (Whittaker et al., 1997; Lippi et al.,
2010; Morik et al., 2012). There is a wide spectrum of
sensing devices available today, but they share a com-
mon property: communication is costly and should
be avoided whenever possible, due to restrictions in
bandwidth or in energy consumption. This is a clear
barrier for global decision making, for which it is typ-
ically required to agglomerate all local sensor mea-
surements into a central location for processing.
On the other hand, many sensors are stationed
within devices equipped with surprisingly powerful
and energy-efficient computation units. This has mo-
tivated us to use computation to save communication.
Specifically, we aim to build a support vector machine
(SVM) (Boser et al., 1992) in each of such devices,
called sensor stations, using local measurements and
a small amount of sampled information transmitted
from other stations. The goal is to obtain a consensus
SVM in each station that behaves similarly to a global
SVM that could be constructed if we collect informa-
tion from all stations for central processing.
Our work is closely related to learning SVMs in
distributed environments, which can be split into two
categories. Case I: examples are distributed (fea-
tures are not distributed). In such cases a global
SVM can be trained using a distributed optimiza-
tion algorithm (Boyd et al., 2011), or separate SVMs
can be trained locally for data partitions with extra
constraints to produce similar models (Forero et al.,
2010). Alternatively, local SVMs can be trained in
their primal form independently on data partitions
and then combined to produce a model with a re-
duced variance (Lee and Bockermann, 2011; Cram-
mer et al., 2012). Case II: features are distributed (ex-
amples are not distributed). In such cases a central
coordination of local SVM training has been consid-
ered to improve global prediction performance (Lee
et al., 2012; Stolpe et al., 2013). Our work focuses on
the second case where features are distributed, con-
sidering communication-efficient approximations to a
global kernel matrix (which could be built by access-
ing all features) in each station, without any central
coordination.
Our suggested method is based on decompositions
of a (global) kernel into separate parts, where each of
them is another kernel defined with attributes stored
locally in each sensor station. Each decomposed ker-
nel matrix is stored in a sensor station where corre-
sponding attributes are stored. Each station receives
few sampled entries of the decomposed kernel matri-
ces stored in remote stations, and then applies matrix
completion to approximate the values of unobserved
entries. Using these altogether, a consensus SVM is
created in each station, which can be applied for pre-
dicting future events using local and remote informa-
tion in a similar fashion.
We denote the Euclidean norm by $\|\cdot\|$ and the cardinality of a finite set $A$ by $|A|$ throughout the paper.
2 SUPPORT VECTOR MACHINES WITH DECOMPOSED KERNELS
Let us consider sensor stations represented as nodes $n = 1, 2, \ldots, N$ in a network, where each node stores measurements from its own sensors, in a feature vector $x_i[n] \in \mathbb{R}^{p_n}$, of sensing targets $i = 1, 2, \ldots, m$. For simplicity we assume that communication between any pair of nodes is allowed. A collection of all these vectors, $x_i = (x_i[1]^T, x_i[2]^T, \ldots, x_i[N]^T)^T$, can be seen as an input vector of length $p = \sum_{n=1}^{N} p_n$.
2.1 Support Vector Machines
The dual formulation of SVMs is described as follows (Shawe-Taylor and Sun, 2011),

$$ \min_{\alpha \in \mathbb{R}^m} \; \frac{1}{2}\alpha^T Q\alpha - \mathbf{1}^T\alpha, \quad \text{subject to } y^T\alpha = 0, \;\; 0 \le \alpha \le C\mathbf{1}. \qquad (1) $$

Here $\mathbf{1} := (1, 1, \ldots, 1)^T$ and $y := (y_1, y_2, \ldots, y_m)^T$ are column vectors of length $m$, and $C$ is a given constant. (Without loss of generality, we focus on the case of classification; our method can be generalized for other types.) The matrix $Q \in \mathbb{R}^{m\times m}$ is a scaled kernel matrix, that is, $Q := YKY$ for a positive semidefinite kernel matrix $K$, where $Y := \mathrm{diag}(y)$ is the diagonal matrix whose elements are given by the vector $y$. SVMs have been successful in many applications, including multitask multiclass learning problems (Ji and Sun, 2013) for example.
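As an illustration of how a (possibly estimated) kernel matrix enters the dual problem (1), the sketch below trains an SVM from a precomputed kernel matrix. It is not the paper's C++ implementation; scikit-learn's SVC with kernel="precomputed" and the synthetic data are assumptions made only for this example.

```python
# Minimal sketch: an SVM trained directly from a precomputed kernel matrix,
# which is how the estimated kernels of later sections are plugged into (1).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # synthetic inputs
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)    # synthetic labels in {-1, +1}

gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)                       # Gaussian kernel matrix

svm = SVC(C=10.0, kernel="precomputed")
svm.fit(K, y)                                       # the solver sees only K and y
print("training accuracy:", svm.score(K, y))
```

For prediction, the same kind of solver expects the kernel values between test and training points, which is why Section 3 also builds a completed test kernel matrix.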
2.2 Decomposition of Kernels
We consider two different decompositions of the kernel matrix $K$, especially those obtainable from the popular Gaussian kernel. We refer to them as "MULTIPLICATIVE" and "ADDITIVE", defined as follows for $i, j = 1, 2, \ldots, m$,

$$ \text{(MULTIPLICATIVE)}\quad [K]_{ij} = \prod_{n=1}^{N} \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big), \text{ and} $$
$$ \text{(ADDITIVE)}\quad [K]_{ij} = \frac{1}{N} \sum_{n=1}^{N} \exp\big(-\gamma_n \|x_i[n] - x_j[n]\|^2\big). \qquad (2) $$

The MULTIPLICATIVE kernel is indeed the same as the standard Gaussian kernel (Schölkopf and Smola, 2001), but our description above reveals that it can be constructed by multiplying "local" Gaussian kernels defined with attributes stored locally in sensor stations. The construction of ADDITIVE is similar, except that local Gaussian kernels are averaged, not multiplied. ADDITIVE resembles how kernels are used in multiple kernel learning (Lanckriet et al., 2002): the connection is further discussed in Section 4.3. Note that MULTIPLICATIVE has a single parameter $\gamma > 0$, but ADDITIVE has a separate parameter $\gamma_n > 0$ for each local kernel.
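As a small sketch of the two constructions in (2), the following assumes the features have already been split into per-node blocks, and checks numerically that the MULTIPLICATIVE kernel coincides with the standard Gaussian kernel on the concatenated features; all variable names and sizes are illustrative.

```python
# Sketch of the MULTIPLICATIVE and ADDITIVE kernel constructions in (2).
import numpy as np

def local_gaussian(Xn, gamma_n):
    """Local Gaussian kernel (G_n or H_n) from one node's feature block."""
    d2 = ((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_n * d2)

rng = np.random.default_rng(1)
m, dims = 100, [3, 4, 2]                          # m examples, p_n features per node
blocks = [rng.normal(size=(m, p_n)) for p_n in dims]
gamma = 0.2
gammas = [gamma] * len(blocks)                    # ADDITIVE allows a separate gamma_n

K_mult = np.ones((m, m))
for Xn in blocks:                                 # product of local kernels
    K_mult *= local_gaussian(Xn, gamma)

K_add = np.mean([local_gaussian(Xn, g) for Xn, g in zip(blocks, gammas)], axis=0)

# MULTIPLICATIVE equals the Gaussian kernel on the concatenated features.
print(np.allclose(K_mult, local_gaussian(np.hstack(blocks), gamma)))   # True
```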
2.3 Local and Remote Parts in Decomposition
From the definitions in (2), we identify the parts that can be computed with attributes stored locally in each node (local parts), and those that need to be transferred from other nodes in a sensor network (remote parts).

First, the expression of MULTIPLICATIVE can be rewritten in the following way,

$$ [K]_{ij} = \exp\Big(-\sum_{n=1}^{N} \gamma \|x_i[n] - x_j[n]\|^2\Big) = \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big) \prod_{n' \neq n} \exp\big(-\gamma \|x_i[n'] - x_j[n']\|^2\big) = [G_n]_{ij}\,[G_{\bar n}]_{ij}, \qquad (3) $$

where the "local" Gaussian kernel $G_n$ for a node $n$ and the product $G_{\bar n}$ of all "remote" kernels are defined entrywise respectively by

$$ [G_n]_{ij} := \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big) \quad \text{(local)}, \qquad [G_{\bar n}]_{ij} := \prod_{n' \neq n} [G_{n'}]_{ij} \quad \text{(remote)}. $$
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
114
Similarly, ADDITIVE can be written as

$$ [K]_{ij} = \frac{1}{N}\Big([H_n]_{ij} + \sum_{n' \neq n} [H_{n'}]_{ij}\Big) = \frac{1}{N}\,[H_n + H_{\bar n}]_{ij}, \qquad (4) $$

where $H_n$ is the local part and $H_{\bar n}$ is the remote part for node $n$, defined respectively by

$$ [H_n]_{ij} := \exp\big(-\gamma_n \|x_i[n] - x_j[n]\|^2\big) \quad \text{(local)}, \qquad [H_{\bar n}]_{ij} := \sum_{n' \neq n} [H_{n'}]_{ij} \quad \text{(remote)}. $$

For a node $n$, the computation of the local part $G_n$ (or $H_n$) is done exactly using local attributes, whereas the remote part $G_{\bar n}$ (or $H_{\bar n}$) is to be approximated.
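Continuing the sketch above (it reuses local_gaussian, blocks, gamma, gammas, K_mult, and K_add from the previous block and is not runnable on its own), the local/remote split of (3) and (4) for one node can be written as follows; in the actual setting only the local part is computable on the node, and the remote part is what Section 3 approximates from sampled entries.

```python
# Local vs. remote parts for node n, following (3) and (4);
# continuation of the previous sketch.
n = 0
G_loc = local_gaussian(blocks[n], gamma)                  # local part G_n
G_rem = np.ones((m, m))
for k, Xn in enumerate(blocks):
    if k != n:
        G_rem *= local_gaussian(Xn, gamma)                # product of remote kernels
assert np.allclose(K_mult, G_loc * G_rem)                 # equation (3)

H_loc = local_gaussian(blocks[n], gammas[n])              # local part H_n
H_rem = sum(local_gaussian(Xn, g)
            for k, (Xn, g) in enumerate(zip(blocks, gammas)) if k != n)
assert np.allclose(K_add, (H_loc + H_rem) / len(blocks))  # equation (4)
```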
3 KERNEL COMPLETION
Let us denote the kernel matrix to be estimated in the $n$th node by $\widetilde K_n$, which is computed by

$$ [\widetilde K_n]_{ij} := \begin{cases} [G_n]_{ij}\,[\widetilde G_{\bar n}]_{ij} & \text{(MULTIPLICATIVE)} \\ \frac{1}{N}\,[H_n + \widetilde H_{\bar n}]_{ij} & \text{(ADDITIVE)}. \end{cases} \qquad (5) $$

Here $\widetilde G_{\bar n}$ (or $\widetilde H_{\bar n}$) is an estimate of the remote part $G_{\bar n}$ (or $H_{\bar n}$). Once we have $\widetilde K_n$, it can be plugged into (1), replacing $Q$ in the form of $\widetilde Q_n := Y \widetilde K_n Y$.
In order to obtain the estimate $\widetilde G_{\bar n}$ (or $\widetilde H_{\bar n}$), we make use of matrix completion (Candès and Recht, 2009), which is a method to reconstruct a matrix from only a few entries sampled from it. The purpose of using matrix completion is (i) to reduce the number of entries required to be sampled from remote kernel parts in bandwidth-limited situations (matrix completion would not be required if all nodes provided complete information), and (ii) to avoid the complexity of defining an optimal sampling strategy: a simple uniform random sampling strategy is enough for matrix completion to guarantee the perfect recovery of the original kernel matrix with high probability.

We first discuss the extra constraints we need to add to matrix completion, so that the resulting matrix is a valid kernel matrix.
3.1 Constraints on $\widetilde G_{\bar n}$ and $\widetilde H_{\bar n}$

First, $\widetilde G_{\bar n}$ has to be a symmetric matrix whose diagonal entries are all ones. It is a valid kernel matrix if and only if it is positive semidefinite (Schölkopf and Smola, 2001), that is, $z^T \widetilde G_{\bar n} z \ge 0$ for all $z \in \mathbb{R}^m$. Since each entry of a local Gaussian kernel $G_{n'}$ is in the range $(0, 1]$ by definition, the product of such entries in $\widetilde G_{\bar n}$ should be in the same range as well. Next, $\widetilde H_{\bar n}$ shares the same properties as $\widetilde G_{\bar n}$, except that each diagonal element of $\widetilde H_{\bar n}$ is $(N-1)$, not one, by construction.
There is another possible way to decompose the MULTIPLICATIVE kernel,

$$ [K]_{ij} = \exp\Big(-\gamma \|x_i[n] - x_j[n]\|^2 - \gamma \sum_{n' \neq n} \|x_i[n'] - x_j[n']\|^2\Big) = \exp\big([D_n + D_{\bar n}]_{ij}\big), $$

with

$$ [D_n]_{ij} := -\gamma \|x_i[n] - x_j[n]\|^2, \qquad [D_{\bar n}]_{ij} := \sum_{n' \neq n} [D_{n'}]_{ij}. $$

Then our task becomes making an estimate $\widetilde D_{\bar n}$ of a distance matrix $D_{\bar n}$, which has zero diagonal entries. The estimate defines a valid distance matrix if and only if it is conditionally positive semidefinite, that is, $z^T \widetilde D_{\bar n} z \ge 0$ for all $z \in \mathbb{R}^m$ with $z^T \mathbf{1} = 0$ (Schoenberg, 1938). This implies that $\widetilde D_{\bar n}$ is positive semidefinite, or it has a single negative eigenvalue. It turned out that our kernel completion in the form of (3) performed better in our experiments, so we did not pursue this direction further.
3.2 Low-rank Matrix Completion
For the description of matrix completion, we follow the line of discussion in (Recht and Ré, 2011). Matrix completion reconstructs a full matrix from only a few entries sampled from the original matrix. In general, matrix completion works with matrices of any shape, but we focus on square matrices here.

Suppose that $X \in \mathbb{R}^{m\times m}$ is a matrix we wish to recover, and that the entries at $(i,j) \in \Omega$ of $X$ are revealed and stored in another matrix $M$. Matrix completion solves the following convex optimization problem to recover $X$,

$$ \min_{X} \sum_{(i,j)\in\Omega} (X_{ij} - M_{ij})^2 + \lambda \|X\|_*, \qquad \|X\|_* := \sum_{k=1}^{m} \sigma_k(X). $$

Here $\|X\|_*$ is the nuclear norm of $X$, which is the summation of the singular values $\sigma_k(X)$ of $X$ and penalizes the rank of $X$.

The nuclear norm simplifies when we assume that the matrix $X$ has rank $r$, and consider a factorization of $X$ into $LR^T$ for some $L \in \mathbb{R}^{m\times r}$ and $R \in \mathbb{R}^{m\times r}$. This leads to $X_{ij} = [LR^T]_{ij} = L_{i\cdot} R_{j\cdot}^T$, and

$$ \|X\|_* = \min_{X = LR^T} \frac{1}{2}\|L\|_F^2 + \frac{1}{2}\|R\|_F^2, $$
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
115
Figure 1: A schematic of kernel completion with the MULTIPLICATIVE kernel. Each node $n$ (i) collects and summarizes the samples corresponding to $\Omega$ from the remote kernel matrices $G_{n'}$ as the known entries of the matrix $M_n$, then (ii) fills in the unknown entries of $M_n$ via matrix completion, producing $\widetilde G_{\bar n}$, and (iii) forms an estimate of the kernel matrix together with the exact local kernel $G_n$.
where $\|\cdot\|_F$ is the Frobenius norm. The equivalence can be understood by taking a singular value decomposition $X = U\Sigma V^T$ and setting $L = U\Sigma^{1/2}$ and $R = V\Sigma^{1/2}$. Then $\|X\|_* = \mathrm{tr}(\Sigma)$, $\|L\|_F^2 = \mathrm{tr}(L^T L) = \mathrm{tr}(\Sigma)$, and $\|R\|_F^2 = \mathrm{tr}(\Sigma)$, so the equality holds. For details, we refer to (Recht et al., 2010; Recht and Ré, 2011).

Using the property of the nuclear norm on rank-$r$ matrices, we can reformulate the matrix completion optimization as

$$ \min_{L,R} \sum_{(i,j)\in\Omega} (L_{i\cdot} R_{j\cdot}^T - M_{ij})^2 + \frac{\lambda}{2}\|L\|_F^2 + \frac{\lambda}{2}\|R\|_F^2. \qquad (6) $$

To obtain solutions, we use the JELLYFISH algorithm (Recht and Ré, 2011), which is a highly parallel incremental gradient descent procedure to find the minimizers, making use of the fact that the gradient of the above objective for a term $(i,j)$ depends only on $L_{i\cdot}$ and $R_{j\cdot}$, and therefore the computation of each iteration can be easily distributed over the pairs $(i,j) \in \Omega$.
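Below is a minimal sketch of solving (6) by incremental gradient descent over the observed pairs, in the spirit of JELLYFISH but without its parallel scheduling; the rank, step size, regularization weight, and epoch count are illustrative assumptions rather than values from the paper.

```python
# Sketch: factored matrix completion (6) via incremental gradient descent.
# Each update touches only row L[i] and row R[j] of the factors.
import numpy as np

def complete(M_obs, omega, m, r=10, lam=0.1, step=0.05, epochs=50, seed=0):
    """Return factors L, R with L @ R.T fitting the entries of M_obs on omega."""
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=1.0 / np.sqrt(r), size=(m, r))
    R = rng.normal(scale=1.0 / np.sqrt(r), size=(m, r))
    pairs = np.asarray(list(omega))
    # Spread the Frobenius regularizer of a row over its number of observations.
    row_cnt = np.bincount(pairs[:, 0], minlength=m).astype(float) + 1e-12
    col_cnt = np.bincount(pairs[:, 1], minlength=m).astype(float) + 1e-12
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            i, j = pairs[idx]
            resid = L[i] @ R[j] - M_obs[i, j]
            gL = 2.0 * resid * R[j] + lam * L[i] / row_cnt[i]
            gR = 2.0 * resid * L[i] + lam * R[j] / col_cnt[j]
            L[i] -= step * gL
            R[j] -= step * gR
    return L, R
```

In the distributed setting of this paper, M_obs would play the role of a node's matrix $M_n$, whose entries are known only on $\Omega$.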
3.2.1 Constrained Matrix Completion
To incorporate the constraints discussed in Section 3.1, we need to find a matrix $\widetilde X$ that is close to $L^*(R^*)^T$, where $L^*$ and $R^*$ are the solutions of (6) (both are $m$ by $r$, $r \in (0, m]$), belonging to the set $\mathcal{K}$ of symmetric positive semidefinite rank-$r$ matrices,

$$ \mathcal{K} := \{\widetilde X : \widetilde X \succeq 0,\; \widetilde X^T = \widetilde X,\; \mathrm{rank}(\widetilde X) = r\}. $$

The following lemma shows that the description of this set can be simplified.
Lemma 3.1. The elements $\widetilde X$ in $\mathcal{K}$ must have the form $\widetilde X = ZZ^T$, where $Z \in \mathbb{R}^{m\times r}$.

Proof. Suppose that $\widetilde X$ is in $\mathcal{K}$. Since $\widetilde X$ is symmetric and positive semidefinite, from the eigendecomposition of $\widetilde X$ there exists a factor $U \in \mathbb{R}^{m\times m}$ such that $\widetilde X = U\Sigma U^T$, where $\Sigma \succeq 0$ is the diagonal matrix of eigenvalues. Removing the columns of $U$ and the part of $\Sigma$ corresponding to the zero eigenvalues, we obtain $\widetilde U \in \mathbb{R}^{m\times r}$ and $\widetilde\Sigma \in \mathbb{R}^{r\times r}$. Then $Z = \widetilde U \widetilde\Sigma^{1/2}$ can be constructed so that $\widetilde X = ZZ^T$.

Conversely, any $\widetilde X$ of the form $\widetilde X = ZZ^T$ satisfies $\widetilde X^T = \widetilde X$ (symmetric) and $z^T \widetilde X z = \|z^T Z\|_2^2 \ge 0$ for all $z \in \mathbb{R}^m$ (positive semidefinite). Therefore $\widetilde X = ZZ^T$ is an element of $\mathcal{K}$.
This lemma indicates that the set $\mathcal{K}$ can be rewritten simply as

$$ \mathcal{K} = \{ZZ^T : Z \in \mathbb{R}^{m\times r}\}. $$

The next step is to find a matrix $Z$ such that $ZZ^T$ is close to $L^*(R^*)^T$. An $\ell_2$ projection of $L^*(R^*)^T$ onto $\mathcal{K}$ requires an iterative procedure which is as costly as finding $L^*$ and $R^*$. Therefore we consider an alternative projection for which we have a closed-form solution,

$$ Z^* = \operatorname*{argmin}_{Z \in \mathbb{R}^{m\times r}} \; \frac{1}{2}\|Z - L^*\|_F^2 + \frac{1}{2}\|Z - R^*\|_F^2. $$

From the KKT conditions, the solution is obtained by

$$ Z^* = \frac{L^* + R^*}{2}. $$

Then a projection $\widetilde X^*$ is obtained by $\widetilde X^* = Z^*(Z^*)^T$, which has a guarantee on its quality as stated in the next lemma:
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
116
Lemma 3.2. The trace-norm distance between $\widetilde X^* = Z^*(Z^*)^T$, where $Z^* = (L^* + R^*)/2$, and $L^*(R^*)^T$ is bounded, that is,

$$ \mathrm{tr}\big(\widetilde X^* - L^*(R^*)^T\big) \le \frac{1}{4}\|L^* - R^*\|_F^2. $$

Proof. Using $\widetilde X^* = Z^*(Z^*)^T$, the result can be derived as follows,

$$ \mathrm{tr}\big(\widetilde X^* - L^*(R^*)^T\big) = \frac{1}{4}\mathrm{tr}\big\{(L^* + R^*)(L^* + R^*)^T - 4L^*(R^*)^T\big\} = \frac{1}{4}\mathrm{tr}\big\{(L^* - R^*)(L^* - R^*)^T\big\} = \frac{1}{4}\|L^* - R^*\|_F^2. $$

Here we have used the properties of the trace that $\mathrm{tr}(X+Y) = \mathrm{tr}(X) + \mathrm{tr}(Y) = \mathrm{tr}(X) + \mathrm{tr}(Y^T) = \mathrm{tr}(X+Y^T)$ and $\mathrm{tr}(XX^T) = \|X\|_F^2$.
The above lemma tells us that the distance between $\widetilde X^*$ and $L^*(R^*)^T$ becomes small whenever $L^* \approx R^*$, which is likely to happen in our case since we define $M$ and $\Omega$ in such a way that if $(i,j) \in \Omega$ then $(j,i) \in \Omega$, and $M_{ij} = M_{ji}$.
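Given the factors from the completion step, the projection of Section 3.2.1 has a closed form; the sketch below (with hypothetical inputs) forms $Z^* = (L^* + R^*)/2$ and $\widetilde X^* = Z^*(Z^*)^T$, and checks the symmetry and positive semidefiniteness required by Lemma 3.1 as well as the trace bound of Lemma 3.2.

```python
# Sketch: closed-form symmetric PSD projection of Section 3.2.1.
import numpy as np

def project_psd(L, R):
    """Return Z = (L + R) / 2 and the symmetric PSD estimate X = Z Z^T."""
    Z = 0.5 * (L + R)
    return Z, Z @ Z.T

rng = np.random.default_rng(0)
L_star = rng.normal(size=(50, 5))
R_star = L_star + 0.01 * rng.normal(size=(50, 5))   # nearly equal, as when M is symmetric
Z, X_tilde = project_psd(L_star, R_star)

assert np.allclose(X_tilde, X_tilde.T)              # symmetric
assert np.linalg.eigvalsh(X_tilde).min() > -1e-10   # PSD up to rounding
gap = np.trace(X_tilde - L_star @ R_star.T)
print(gap <= 0.25 * np.linalg.norm(L_star - R_star, "fro") ** 2 + 1e-12)  # Lemma 3.2
```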
3.2.2 Sample Index Pair Subset
In our method, we assume that a single sample index pair set $\Omega \subseteq \{(i,j) : 1 \le i, j \le m\}$ is fixed across all nodes. It is more efficient than using multiple sample sets, since otherwise we have to store and complete each remote matrix $G_{n'}$, $n' \in \{1,\ldots,N\}\setminus\{n\}$, separately. Using a pre-defined $\Omega$ across nodes can be implemented as using a fixed random seed for a pseudo-random number generator, so that $\Omega$ does not have to be transferred at all.

Given $\Omega$, each node $n$ receives information from the other nodes $n'$ and stores it in $M_n$ as follows, for all $(i,j) \in \Omega$,

$$ [M_n]_{ij} = \begin{cases} \prod_{n' \in \{1,\ldots,N\}\setminus\{n\}} [G_{n'}]_{ij} & \text{(MULTIPLICATIVE)} \\ \frac{1}{N-1}\sum_{n' \in \{1,\ldots,N\}\setminus\{n\}} [H_{n'}]_{ij} & \text{(ADDITIVE)}. \end{cases} $$

That is, the communication cost for each node $n$ is $O((N-1)|\Omega|)$. The use of matrix completion makes it possible to choose an $\Omega$ of relatively small size ($O(m^{1.2} r \log m)$ when $M_n$ is a rank-$r$ matrix, see Theorem 4.1 for details) in a simple way, that is, via uniform random sampling.
Once the matrix $M_n$ is obtained, node $n$ solves the matrix completion problem (6) with $[M_n]_\Omega$ to obtain $Z_n = (L_n + R_n)/2$, and then computes $\widetilde G_{\bar n} = Z_n Z_n^T$ or $\widetilde H_{\bar n} = (N-1) Z_n Z_n^T$, based on Lemmas 3.1 and 3.2. An estimate of the kernel $K$, obtained by (5), is then used for training an SVM.

After training the SVMs, we apply the same technique to new test examples to build the test kernel matrix. This usually involves smaller matrix completion problems corresponding to the support vectors and test examples.
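A sketch of Section 3.2.2 under simple assumptions: every node reproduces the same symmetric $\Omega$ from a shared random seed, and node $n$ combines the sampled entries received from the other nodes into $M_n$ by an elementwise product (MULTIPLICATIVE) or an average (ADDITIVE); the function and argument names are hypothetical.

```python
# Sketch: shared sample set Omega from a fixed seed, and assembly of M_n.
import numpy as np

def sample_omega(m, ratio, seed=12345):
    """Symmetric index pairs, identical in every node thanks to the fixed seed."""
    rng = np.random.default_rng(seed)
    mask = np.triu(rng.random((m, m)) < ratio, 1)
    mask = mask | mask.T                        # (i, j) in Omega implies (j, i) in Omega
    return np.argwhere(mask)                    # array of (i, j) pairs

def assemble_M(remote_samples, m, omega, mode="multiplicative"):
    """remote_samples: one vector of sampled kernel entries per remote node."""
    M = np.ones((m, m)) if mode == "multiplicative" else np.zeros((m, m))
    rows, cols = omega[:, 0], omega[:, 1]
    for vals in remote_samples:                 # vals[k] corresponds to omega[k]
        if mode == "multiplicative":
            M[rows, cols] *= vals
        else:
            M[rows, cols] += vals
    if mode == "additive":
        M[rows, cols] /= max(len(remote_samples), 1)    # divide by N - 1
    return M
```

Only the $|\Omega|$ sampled values per remote node travel over the network; $\Omega$ itself never has to be transmitted, matching the $O((N-1)|\Omega|)$ communication cost above.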
3.3 Extra Saving with ADDITIVE
The description of the matrix completion optimization (6) involves all training examples. However, if a (super-)set of the support vectors (SVs), which fully determine the prediction function, is known a priori, then we can solve the completion problem only for that set, reducing the cost of matrix completion.

Let us consider the SVs of the "global" SVM problem (1) equipped with the exact ADDITIVE kernel (2), which is constructed by accessing all features in a central location. We denote this set of SVs by $S^*$. Note that $S^*$ is never obtained, since we do not solve such a global problem.
We try to estimate $S^*$ from the sets of "local" SVs. These local SVs are obtained from solving an individual SVM (1) in each node $n$, using only the local features, that is, setting the scaled kernel matrix as $Q = Y H_n Y$ for the local kernel matrix $[H_n]_{ij} = \exp(-\gamma_n \|x_i[n] - x_j[n]\|^2)$. We denote the set of SVs in node $n$ obtained in this way by $S_n$.

In the next theorem, we show that the union of the local SV sets $S_n$ encompasses the global SV set $S^*$. To shorten the length of our proof, we show here the case of SVMs without an intercept, that is, the constraint $y^T\alpha = 0$ is removed (the same result holds for the case with intercepts).
Theorem 3.3. Consider the global SVM problem with the ADDITIVE kernel and its set of SVs $S^*$,

$$ \alpha^* := \operatorname*{argmin}_{0 \le \alpha \le C\mathbf{1}} \; \frac{1}{2}\alpha^T Y\Big(\frac{1}{N}\sum_{n=1}^{N} H_n\Big) Y\alpha - \mathbf{1}^T\alpha, \qquad S^* := \{i : [\alpha^*]_i > 0\}, $$

and the corresponding local SVM problem and its SVs for each node $n$, $n = 1, 2, \ldots, N$,

$$ \alpha_n^* := \operatorname*{argmin}_{0 \le \alpha \le C\mathbf{1}} \; \frac{1}{2}\alpha^T Y H_n Y\alpha - \mathbf{1}^T\alpha, \qquad S_n := \{i : [\alpha_n^*]_i > 0\}. $$

Then we have

$$ S^* \subseteq \bigcup_{n=1}^{N} S_n. $$
Proof. Let us consider an index $i \in S^*$ of an SV of the global SVM problem, such that $[\alpha^*]_i > 0$. Suppose that the $i$th component of the gradient of all local
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
117
SVM problems at $\alpha^*$ is strictly positive, that is,

$$ [Y H_n Y \alpha^* - \mathbf{1}]_i > 0, \quad \forall n \in \{1, 2, \ldots, N\}. \qquad (7) $$

Let us look into the optimality condition of the global SVM regarding the $i$th component of the optimizer $\alpha^*$. From the KKT conditions, we have

$$ \frac{1}{N}\sum_{n=1}^{N} [Y H_n Y \alpha^* - \mathbf{1}]_i - [p^*]_i + [q^*]_i = 0, \qquad [p^*]_i [\alpha^*]_i = 0, \qquad [q^*]_i [C\mathbf{1} - \alpha^*]_i = 0, $$

where $p^* \in \mathbb{R}^m_+$ and $q^* \in \mathbb{R}^m_+$ are the Lagrange multipliers for the constraints $\alpha \ge 0$ and $\alpha \le C\mathbf{1}$, respectively. Then $[\alpha^*]_i > 0$ implies $[p^*]_i = 0$, and therefore

$$ \frac{1}{N}\sum_{n=1}^{N} [Y H_n Y \alpha^* - \mathbf{1}]_i + [q^*]_i = 0. $$

If (7) were true, we would have a contradiction here, since the first term above would be strictly positive while the second term satisfies $[q^*]_i \ge 0$, so the equality could not hold. This implies that there exists at least one node $n$ for which the condition in (7) is not satisfied, that is, $[Y H_n Y \alpha^* - \mathbf{1}]_i \le 0$. This means that if we search for the local SVM solution at node $n$ starting from $\alpha^*$, we must increase the value of the $i$th component from $[\alpha^*]_i$ to reach the minimizer $[\alpha_n^*]_i$ of this local SVM problem, since otherwise we would increase the objective function value. That is,

$$ [\alpha_n^*]_i \ge [\alpha^*]_i > 0. $$

This implies that the index $i$ is also an SV of at least one local SVM problem. Therefore, $i \in \bigcup_{n=1}^{N} S_n$, which implies the claim.
Theorem 3.3 enables us to restrict our attention to the union SV set without losing any information in the case of ADDITIVE, where the size of the union SV set is typically much smaller than that of the entire training example index set. In effect, this makes solving the matrix completion problem (6) more efficient, by reducing the number of variables from $O(m^2)$ to $O(|\bigcup_n S_n|^2)$.
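For ADDITIVE, Theorem 3.3 lets each node restrict the completion problem to the union of the local SV index sets. A rough sketch of the bookkeeping, assuming local dual solutions alpha_n (from any SVM dual solver) are already available; the sparsity pattern below is synthetic and only illustrates the reduction in size.

```python
# Sketch: union of local SV index sets, used to shrink the completion problem.
import numpy as np

def union_support_vectors(local_alphas, tol=1e-8):
    """Union of S_n = {i : [alpha_n]_i > 0} over all nodes (Theorem 3.3)."""
    union = set()
    for alpha_n in local_alphas:
        union.update(np.flatnonzero(alpha_n > tol).tolist())
    return np.array(sorted(union))

rng = np.random.default_rng(0)
m, n_nodes = 5000, 3
local_alphas = []
for _ in range(n_nodes):                       # synthetic sparse dual solutions
    a = np.zeros(m)
    idx = rng.choice(m, size=400, replace=False)
    a[idx] = rng.random(400)
    local_alphas.append(a)

sv = union_support_vectors(local_alphas)
print(len(sv) ** 2, "completion variables instead of", m ** 2)
```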
3.4 Algorithm
Our kernel completion method for training SVMs is summarized in Algorithm 1. There, we use the symbol $\odot$ to represent elementwise multiplication between matrices.

We have implemented our algorithm as open source in C++, based on the JELLYFISH code (Recht and Ré, 2011) for matrix completion (available at http://hazy.cs.wisc.edu/hazy/victor/jellyfish/), and SVM-LIGHT (Joachims, 1999) for solving SVMs (available at http://svmlight.joachims.org/). Our implementation makes use of the union SV set theorem (Theorem 3.3) for the ADDITIVE approach to reduce kernel completion time, but not for MULTIPLICATIVE, since the theorem does not apply in that case.
4 RELATED WORK
Here we present existing methods that are closely re-
lated to our development.
4.1 Separable Approximate Optimization of SVMs
Lee, Stolpe, and Morik (Lee et al., 2012) have investigated the primal formulation of SVMs in a setting close to ours. In their work, the distributed nature of input features is considered via making an individual approximate feature mapping $\varphi_n$ for each node $n$, such that for a given local kernel function $k_n$, it approximates kernel evaluations,

$$ \langle \varphi_n(x_i), \varphi_n(x_j) \rangle \approx k_n(x_i, x_j), \quad \forall i, j. $$
Using this mapping, each node solves its own local SVM in the primal, producing a decision vector $w[n]$. Based upon the local solutions, a "global" SVM is explicitly constructed in a central node, which is defined with the collection of local decision vectors and local feature mappings (weighted by $\mu_n \ge 0$), that is,

$$ w := \begin{pmatrix} w[1] \\ \vdots \\ w[N] \end{pmatrix}, \qquad \varphi(x) := \begin{pmatrix} \mu_1 \varphi_1(x[1]) \\ \vdots \\ \mu_N \varphi_N(x[N]) \end{pmatrix}. $$
An interesting characteristic of this central SVM is that if we have optimized the local SVMs using specific forms of loss functions $\ell_n$ whose weighted summation forms an upper bound of the original loss function $\ell$, that is,

$$ \ell\big(w^T \varphi(x), y\big) \le \sum_{n=1}^{N} \mu_n \,\ell_n\big(w[n]^T \varphi_n(x[n]), y\big), $$

then it can be shown that this central SVM minimizes an upper bound of the standard SVM objective with the original loss function. The nonnegative weights $\mu_1, \mu_2, \ldots, \mu_N$ are optimized in the central node, which requires transferring $O(m)$ numbers from each local node $n = 1, 2, \ldots, N$.
Algorithm 1: Kernel Completion for SVMs.

input: a data set $\{(x_i, y_i)\}_{i=1}^m$, a sample set $\Omega$, and parameters $\gamma$, $\{\gamma_n\}_{n=1}^N$.

(parallel: in each node n = 1, 2, ..., N)
  input: local measurements/labels $\{(x_i[n], y_i)\}_{i=1}^m$.
  Compute the local kernel matrix $G_n$ for MULTIPLICATIVE (or $H_n$ for ADDITIVE);
  if ADDITIVE then
    // Make the union of SV index sets (ADDITIVE only)
    Solve the SVM (1) with $Q \leftarrow Y H_n Y$, to obtain the SV index set $S_n$;
    Receive $S_{n'}$ from all other nodes $n'$;
    Trim $\Omega$ to fit $\bigcup_{n=1}^{N} S_n$;
  end
  // Collect samples from remote kernel matrices
  Initialize: $[M_n]_\Omega \leftarrow \mathbf{1}\mathbf{1}^T$ (MULTIPLICATIVE) or $[M_n]_\Omega \leftarrow 0$ (ADDITIVE);
  for $n' \in \{1, 2, \ldots, N\}\setminus\{n\}$ do
    Receive $[G_{n'}]_\Omega$ (MULTIPLICATIVE), or $[H_{n'}]_\Omega$ (ADDITIVE);
    $[M_n]_\Omega \leftarrow [M_n]_\Omega \odot [G_{n'}]_\Omega$ (MULTIPLICATIVE), or $[M_n]_\Omega \leftarrow [M_n]_\Omega + [H_{n'}]_\Omega$ (ADDITIVE);
  end
  For ADDITIVE, scale $[M_n]_\Omega \leftarrow [M_n]_\Omega / (N-1)$;
  // Kernel completion for $\widetilde K_n$
  Solve the matrix completion problem (6) with the observed entries $[M_n]_\Omega$, to obtain $L_n$ and $R_n$;
  Compute the projection, to obtain $Z_n \leftarrow (L_n + R_n)/2$;
  Compute the estimated kernel matrix $\widetilde K_n$ by (5):
    $\widetilde G_{\bar n} \leftarrow Z_n Z_n^T$ (MULTIPLICATIVE) or $\widetilde H_{\bar n} \leftarrow (N-1)\, Z_n Z_n^T$ (ADDITIVE),
    $\widetilde K_n \leftarrow G_n \odot \widetilde G_{\bar n}$ (MULTIPLICATIVE) or $\widetilde K_n \leftarrow \frac{1}{N}(H_n + \widetilde H_{\bar n})$ (ADDITIVE);
  // Obtain an estimated consensus SVM
  Solve the SVM problem (1), replacing $Q$ with $\widetilde Q_n \leftarrow Y \widetilde K_n Y$;
(end)
The kernel function of this central SVM is indeed a weighted approximation of our ADDITIVE kernel (4), when each local feature mapping approximates a Gaussian kernel (parametrized by $\gamma_n$) with local features, and the weights are fixed to $\mu_n = 1/\sqrt{N}$. However, our work differs from this approach in several ways. First, we do not require a special node to build a central SVM, thereby avoiding a communication complexity of $O(mN)$. Moreover, to classify a test point in the central SVM approach, $O(N)$ elements have to be transferred to a central node for each test point, whereas in our case testing can be done in any node, although it also requires some communication. Second, in our method estimation happens only in kernel completion, whereas both kernels and loss functions are approximated in the central SVM approach. Lastly, we can use both ADDITIVE and MULTIPLICATIVE kernels, but only ADDITIVE kernels are allowed in the central SVM approach.
4.2 Consensus-based Distributed SVMs
Another closely related study is done by Forero,
Cano, and Giannakis (Forero et al., 2010). The mo-
tivation of this work is very similar to ours, in the
sense that it tries to construct a consensus SVM in a
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
119
distributed fashion, without having a central process-
ing location. They have developed a fully distributed
SVM training algorithm based on the alternating di-
rection method of multipliers (Bertsekas and Tsitsik-
lis, 1997).
However, the consensus-based distributed SVM considers situations where examples, not features, are distributed over connected nodes, unlike our work. Moreover, the consensus requirements are expressed therein as extra constraints in a distributed SVM optimization problem; in our case, consensus SVMs are obtained by making approximations in each node to the "global" kernel matrix that would have been constructed if we had collected all features in a central location.
4.3 Multiple Kernel Learning
Our ADDITIVE kernel is closely related to the multiple kernel learning (MKL) approach. In MKL, we consider a convex combination of $N$ kernel matrices:

$$ k(x_i, x_j) = \sum_{n=1}^{N} \beta_n k_n(x_i, x_j), \qquad \beta_n \ge 0, \quad \sum_{n=1}^{N} \beta_n = 1. $$

MKL searches for the optimal mixing coefficients $\beta_1, \beta_2, \ldots, \beta_N$, as well as the optimal values of the SVM dual variables. This requires solving a semidefinite program (Lanckriet et al., 2002), a quadratically constrained quadratic program (Lanckriet et al., 2004) when we normalize kernels so that $k_n(x_i, x_i) = 1$, or a quadratic program (Rakotomamonjy et al., 2007) with further modifications.
In our ADDITIVE approach (4), we fix the mixing coefficients to $\beta_n = 1/N$, in order to avoid storing and completing individual local kernel matrices. We could replace our SVM training with an MKL problem, and it might help identify unimportant nodes that could be excluded from future communication, but MKL would impose overhead in computation and communication which may not be affordable.
4.4 Theory of Matrix Completion
Matrix completion provides guarantees, under certain conditions, to recover the original full matrix using only a few entries from it. Here we introduce the idea following Candès and Recht (Candès and Recht, 2009; Candès and Recht, 2012).

Going back to the matrix completion problem (6), we have defined a matrix $M \in \mathbb{R}^{m\times m}$ with rank $r$, and a sample set $\Omega$ such that for $(i,j) \in \Omega$ the components $M_{ij}$ are known to us. The goal is to recover the rest of the matrix $M$. Let us consider the reduced singular value decomposition of $M$,

$$ M = U\Sigma V^T, \qquad U^T U = I, \quad V^T V = I, $$

where $\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix with singular values. The columns of $U \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{m\times r}$ compose orthonormal bases of $\mathcal{R}(M)$ and $\mathcal{R}(M^T)$, respectively, where $\mathcal{R}(X)$ denotes the range (column space) of a matrix $X$. Based on these, we define a measure called the coherence of $\mathcal{R}(M)$ (Candès and Recht, 2009):

Definition. For $M = U\Sigma V^T$, the coherence of $\mathcal{R}(M)$ is defined by

$$ \mathrm{co}(\mathcal{R}(M)) := \frac{m}{r} \max_{i=1,2,\ldots,m} \|UU^T e_i\|^2 \;\in\; [1, m/r], $$

where $e_i$ is the $i$th standard unit vector.
Here $UU^T$ is the projection matrix onto $\mathcal{R}(M)$. The coherence $\mathrm{co}(\mathcal{R}(M))$ measures the alignment between the range space of $M$ and the standard unit vectors. That is, the maximal coherence $m/r$ is achieved whenever $\mathcal{R}(M)$ contains any of the standard basis vectors $e_i$, $i = 1, 2, \ldots, m$. On the other hand, the coherence decreases as the basis vectors of $\mathcal{R}(M)$ become more like random vectors. For example, suppose that $U$ contains uniformly random orthonormal column vectors, i.e., the value of each entry is $O(1/\sqrt{m})$ in magnitude, satisfying $U^T U = I$. Then we have $\|UU^T e_i\|^2 = \|U^T e_i\|^2 = O(r/m)$ for any $i$, which gives the minimum coherence value, using the fact that $U^T U = I$ and $U^T e_i \in \mathbb{R}^r$. Repeating the same argument for $V$, we see that $M = U\Sigma V^T$ is likely to be a dense matrix if both $\mathrm{co}(\mathcal{R}(M))$ and $\mathrm{co}(\mathcal{R}(M^T))$ are small. That is, it becomes harder for many entries of $M$ to be zero, which is a necessary property for matrix completion so that recovery is possible from only a few sampled entries (otherwise the samples would contain many zero entries, which are non-informative).
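The quantities appearing in Theorem 4.1 below (and later reported in Table 2) can be computed directly from an eigendecomposition; a small sketch for a symmetric PSD kernel matrix, with the effective rank taken as the number of eigenvalues above a threshold as in Section 5.1:

```python
# Sketch: coherence co(R(M)) and max |[U V^T]_ij| for a symmetric PSD kernel matrix.
import numpy as np

def completion_stats(K, eig_tol=0.01):
    w, V = np.linalg.eigh(K)                   # eigendecomposition of symmetric K
    order = np.argsort(w)[::-1]
    r = int((w > eig_tol).sum())               # numerically effective rank
    U = V[:, order[:r]]                        # top-r eigenvectors span R(K)
    m = K.shape[0]
    coherence = (m / r) * np.max(np.sum(U ** 2, axis=1))  # (m/r) max_i ||U U^T e_i||^2
    max_uv = np.max(np.abs(U @ U.T))           # for symmetric PSD K the SVD has V = U
    return r, coherence, max_uv
```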
The next theorem states the required conditions on $M$ and the estimated size of the sample set $\Omega$, so that matrix completion will succeed with high probability.

Theorem 4.1 (Candès and Recht, 2009). For a matrix $M = U\Sigma V^T \in \mathbb{R}^{m\times m}$ of rank $r$, suppose that there exist constants $\delta_0 > 0$ and $\delta_1 > 0$ such that

(i) $\max\{\mathrm{co}(\mathcal{R}(M)), \mathrm{co}(\mathcal{R}(M^T))\} \le \delta_0$,
(ii) $\max_{i,j} |[UV^T]_{ij}| \le \delta_1 \sqrt{r}/m$.

If we sample $|\Omega|$ elements of $M$ uniformly at random, as many as

$$ |\Omega| \ge \psi \max\big(\delta_1^2, \delta_0^{0.5}\delta_1, \delta_0 m^{0.25}\big)\, m\, r\, (\beta\log m) $$

for some constants $\psi$ and $\beta > 2$, then the minimizer of the matrix completion problem (6) is unique and equal to the original $M$ with probability at least $1 - zm^{-\beta}$ for some constant $z$. If the rank is small, that is, $r \le m^{0.2}/\delta_0$, then the requirement reduces to

$$ |\Omega| \ge \psi\, \delta_0\, m^{1.2}\, r\, (\beta\log m). $$
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
120
A natural conjecture from this theorem is that Gaussian kernels would fit well for matrix completion, as they typically produce dense and numerically low-rank matrices (note that they are always full-rank in theory), whose entries are bounded above by 1. We use this theorem in the following section to check how well kernel matrices constructed from various data sets satisfy the required conditions for matrix completion, and how they affect the prediction performance of the resulting SVMs.
5 EXPERIMENTS
For experiments, we used five benchmark data sets
from the UCI machine learning repository (Bache and
Lichman, 2013), summarized in Table 1, and also
their subset composed of 5000 training and 5000 test
examples (denoted by 5k/5k) to study characteristics
of algorithms under various circumstances.
Table 1: Data sets and their training parameters. Different values of C were used for the full data sets (column C) and the smaller 5k/5k sets (column C(5k/5k)).

Name      m (train)   test     p       C     C(5k/5k)   γ
ADULT     40701       8141     124     10    10         0.001
MNIST     58100       11900    784     0.1   1162       0.01
CCAT      89702       11574    47237   100   156        1.0
IJCNN     113352      28339    22      1     2200       1.0
COVTYPE   464809      116203   54      10    10         1.0
For all experiments, we split the original input feature vectors into subvectors of almost equal lengths, one for each node, using N = 3 nodes (for the 5k/5k sets) and N = 10 nodes (for the full data sets). The tuning parameters C and γ were determined by cross validation for the full sets, and the C values for the 5k/5k subsets were determined on independent validation subsets, both with SVMLIGHT. The results of SVMLIGHT were included for comparison with non-distributed SVM training.
non-distributed SVM training. Following (Lee et al.,
2012), the local Gaussian kernel parameters for AD-
DITIVE were adjusted to γ
n
=
p
p
n
Nγ for a given γ,
so that γ
n
kx
i
[n] x
j
[n]k will have the same order of
magnitude O(γp) as γkx
i
x
j
k.
Throughout the experiments, we imposed that if $(i,j) \in \Omega$ then $(j,i) \in \Omega$ as well, and $M_{ij} = M_{ji}$.
5.1 Characteristics of Kernel Matrices
The first set of experiments verifies how well kernel matrices fit matrix completion. For this, we computed the two types of exact kernel matrices defined in (2), MULTIPLICATIVE and ADDITIVE, accessing all features of the small 5k/5k subsets of the five UCI data sets (the MULTIPLICATIVE kernels were equivalent to the usual Gaussian kernels).
The important characteristics of the kernel matrices with respect to matrix completion are the rank (r), the coherence ($\delta_0 \in [1, m/r]$), and the maximal value of $|[UV^T]_{ij}|$ (where U and V are the left and right factors from the singular value decomposition), as discussed in Theorem 4.1. When $\delta_0$ is close to its smallest value of one, and $|[UV^T]_{ij}|$ is bounded above by a small value, then matrix completion becomes well-posed. Further, if the rank is small as well, then the theorem indicates that we can recover the original matrix from even fewer samples.
Table 2 summarizes these characteristics. Clearly, the rank (the numerically effective rank, counting eigenvalues larger than a threshold of 0.01) and the coherence values were much smaller in the case of ADDITIVE, indicating potential benefits of using this approach compared to MULTIPLICATIVE. All values of $|[UV^T]_{ij}|$ appeared to be small, especially for the ADDITIVE kernels of ADULT, IJCNN, and COVTYPE. The kernel matrices of these three sets also had much lower ranks than the rest. For MNIST and CCAT, the numbers hinted that matrix completion would suffer from difficulties unless the sample size $|\Omega|$ was large.
5.2 The Effect of Sampling Size
Next, we used the 5k/5k data sets to investigate how the prediction performance of the SVMs changed over several different sizes of the sample set $\Omega$. We define the sampling ratio as

$$ \text{Sampling Ratio} := |\Omega| / m^2, $$

where the value of m is 5000 in this section. We compared the prediction performance of MULTIPLICATIVE and ADDITIVE to that of SVMLIGHT.
Figure 2 illustrates the test accuracy values for five sampling ratios of up to 10%. The statistics are over N = 3 nodes and over random selections of $\Omega$. The performance on ADULT, IJCNN, and COVTYPE was close to that of SVMLIGHT, and it kept increasing with the growth of $|\Omega|$. This behavior was expected from the previous section, as their kernel matrices had good conditions for matrix completion. On the other hand, the performance on MNIST and CCAT was far inferior to that of SVMLIGHT, as also expected.
The bottom-right corner of Figure 2 shows the concentration of the eigenvalue spectrum in the five kernel matrices. The height of each box represents the magnitude of the corresponding normalized eigenvalue, so that the height of a stack of boxes represents the proportion of the entire spectrum concentrated in the top 10 eigenvalues.
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
121
Table 2: The density, rank, coherence ($\delta_0$), and maximal values of $|[UV^T]_{ij}|$ of the kernel matrices. Effective ranks are shown, counting eigenvalues larger than a threshold (0.01). Coherence values are bounded by $1 \le \delta_0 \le m/r$. Smaller values of $\delta_0$, r, and $\max|[UV^T]_{ij}|$ are indicative of better conditions for matrix completion.

                       MULTIPLICATIVE                            ADDITIVE
          density   r      δ_0    m/r    max|[UV^T]_ij|   density   r      δ_0    m/r     max|[UV^T]_ij|
ADULT     1.0       789    5.54   6.34   0.87             1.0       222    8.32   22.52   0.37
MNIST     1.0       4782   1.03   1.05   0.99             1.0       4568   1.07   1.10    0.98
CCAT      1.0       4984   1.00   1.00   1.00             1.0       4982   1.00   1.00    1.00
IJCNN     1.0       1516   3.19   3.30   0.97             1.0       698    1.75   7.16    0.25
COVTYPE   1.0       1423   3.32   3.51   0.95             1.0       424    1.56   11.79   0.13
Figure 2: Prediction accuracy on the test sets for the 5k/5k subsets of the five UCI data sets (ADULT, MNIST, CCAT, IJCNN, COVTYPE), over different sampling ratios ($|\Omega|/m^2$) in kernel completion. The average and standard deviation over multiple trials with random $\Omega$ and N = 3 nodes are shown. The bottom-right plot illustrates the proportion of the entire eigen-spectrum concentrated in the top ten eigenvalues.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
122
Table 3: Test prediction performance on the full data sets (mean and standard deviation). Two sampling ratios (2% and 10%) are tried for our method. The SVMLIGHT results are from using the classical Gaussian kernels with matching parameters. $|\bigcup_n S_n|/m$ is the fraction of the union support vector set to its corresponding training set.

          |∪_n S_n|/m   ADDITIVE (2%)   ADDITIVE (10%)   ASSET         SVMLIGHT
ADULT     0.61          81.4 ± 1.00     84.2 ± 0.18      80.0 ± 0.02   84.9
MNIST     0.99          78.9 ± 1.69     87.0 ± 0.20      88.9 ± 0.39   98.9
CCAT      0.84          87.2 ± 1.00     92.0 ± 0.35      73.7 ± 1.00   95.8
IJCNN     0.56          96.0 ± 0.35     96.5 ± 0.23      90.9 ± 0.88   99.3
The plot shows that 90% of the spectrum in ADULT is concentrated in the top 10 eigenvalues, indicating that its kernel matrix has a very small numerically effective rank. This would be the reason why our method performed as well as SVMLIGHT for ADULT.
Comparing MULTIPLICATIVE to ADDITIVE, both showed similar prediction performance. However, the higher concentration of the eigen-spectrum of ADDITIVE indicated that it would make a good alternative to MULTIPLICATIVE, also considering the extra saving with ADDITIVE discussed in Section 3.3.
5.3 Performance on Full Data Sets
In the last experiment, we used the full data sets to compare our method to one of the closely related approaches, ASSET (Lee et al., 2012), introduced in Section 4. Since ASSET admits only ADDITIVE kernels, we omitted MULTIPLICATIVE from the comparison. Among the several versions of ASSET in (Lee et al., 2012), we used the "Separate" version with central optimization. COVTYPE was excluded due to the extra-long runtimes of SVMLIGHT and our method.
The results are in Table 3. The second column shows the ratio between a union SV set and an entire training set. The square of these numbers indicates the saving we achieved by the union SV trick; for example, the size of the matrix is reduced to 37% of the original size for ADULT. The saving was substantial for ADULT and IJCNN. In terms of prediction performance, we achieved test accuracy approaching that of SVMLIGHT (within 1% point (ADULT), 3.8% points (CCAT), and 2.8% points (IJCNN) on average) with a 10% sampling ratio, except in the case of MNIST, where the gap was significantly larger (11.9%): this result was consistent with the discussion in Sections 5.1 and 5.2. Our method (with 10% sampling) also outperformed ASSET (by 4.2%, 18.3%, and 5.6% on average for ADULT, CCAT, and IJCNN, respectively) except in the case of MNIST, with a small but not negligible margin (1.9%). We conjecture that the approximation of the kernel mapping in ASSET fitted particularly well for MNIST, but it remains to be investigated further.
6 CONCLUSIONS
We have proposed a simple algorithm for learning
consensus SVMs in sensor stations connected with
band-limited communication channels. Our method
makes use of decompositions of kernels, together
with kernel completion to approximate unobserved
entries of remote kernel matrices. The resulting
SVMs performed well with relatively small numbers
of sampled entries, when kernel matrices satisfied re-
quired conditions. A property of support vectors also
helped us further reduce computational cost.
Using matrix completion, there is no need to iden-
tify and execute an optimal sampling strategy to have
similar performance guarantees. Although sample
complexity could be reduced by a small factor by
identifying specific sample sets for a given situa-
tion, such sets will depend on network topology and
cost/noise models, perhaps with the need for central
coordination.
Several aspects of our method remain to be inves-
tigated further. First, different types of kernels may
involve different types of decomposition, having dis-
similar characteristics in terms of matrix completion.
Second, although parameters of SVMs can be tuned
using small aggregated data, it would be desirable to
tune parameters locally, or to consider parameter-free
methods instead of SVMs. Also, despite the bene-
fits of the ADDITIVE kernel, it requires more kernel
parameters to be specified compared to the MULTI-
PLICATIVE kernel. Therefore when the budget for pa-
rameter tuning is limited, MULTIPLICATIVE would be
preferred to ADDITIVE. Finally, it would be worth-
while to analyze the characteristics of the suggested
algorithm in real communication systems to make it
more practical, considering non-uniform communica-
tion cost, for instance.
Considering kernel completion in the context of privacy-preserving learning would be an interesting direction, if the number of entries required for kernel completion to build a good classifier is smaller than the number required to recover private information, or if we can make kernel completion fail unless one has the right credentials, possibly by tweaking the coherence
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
123
of kernel matrices.
ACKNOWLEDGEMENTS
The authors acknowledge the support of Deutsche
Forschungsgemeinschaft (DFG) within the Collabo-
rative Research Center SFB 876 “Providing Informa-
tion by Resource-Constrained Analysis”, projects A1
and C1.
REFERENCES
Bache, K. and Lichman, M. (2013). UCI machine learning
repository.
Bertsekas, D. P. and Tsitsiklis, J. N. (1997). Parallel
and Distributed Computation: Numerical Methods.
Athena Scientific.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A
training algorithm for optimal margin classifiers. In
Proceedings of the fifth Annual Workshop on Compu-
tational Learning Theory, pages 144–152.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.
(2011). Distributed optimization and statistical learn-
ing via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–
122.
Candès, E. J. and Recht, B. (2009). Exact matrix comple-
tion via convex optimization. Foundations of Compu-
tational Mathematics, 9(6):717–772.
Candès, E. J. and Recht, B. (2012). Exact matrix comple-
tion via convex optimization. Communications of the
ACM, 55(6):111–119.
Crammer, K., Dredze, M., and Pereira, F. (2012).
Confidence-weighted linear classification for natural
language processing. Journal of Machine Learning
Research, 13:1891–1926.
Forero, P. A., Cano, A., and Giannakis, G. B. (2010).
Consensus-based distributed support vector machines.
Journal of Machine Learning Research, 11:1663–
1707.
Ji, Y. and Sun, S. (2013). Multitask multiclass support vec-
tor machines: Model and experiments. Pattern Recog-
nition, 46(3):914–924.
Joachims, T. (1999). Making large-scale support vector ma-
chine learning practical. In Schölkopf, B., Burges, C.,
and Smola, A., editors, Advances in Kernel Methods -
Support Vector Learning, chapter 11, pages 169–184.
MIT Press, Cambridge, MA.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. I. (2002). Learning the kernel matrix with semidefinite programming. In Proceedings of the 19th International Conference on Machine Learning.
Lanckriet, G. R. G., De Bie, T., Cristianini, N., Jor-
dan, M. I., and Noble, W. S. (2004). A statistical
framework for genomic data fusion. Bioinformatics,
20(16):2626–2635.
Lee, S. and Bockermann, C. (2011). Scalable stochastic
gradient descent with improved confidence. In Big
Learning – Algorithms, Systems, and Tools for Learn-
ing at Scale, NIPS Workshop.
Lee, S., Stolpe, M., and Morik, K. (2012). Separable ap-
proximate optimization of support vector machines
for distributed sensing. In Flach, P., Bie, T., and
Cristianini, N., editors, Machine Learning and Knowl-
edge Discovery in Databases, volume 7524 of Lecture
Notes in Computer Science, pages 387–402. Springer.
Lippi, M., Bertini, M., and Frasconi, P. (2010). Collective
traffic forecasting. In Proceedings of the 2010 Euro-
pean conference on Machine learning and knowledge
discovery in databases: Part II, pages 259–273.
Morik, K., Bhaduri, K., and Kargupta, H. (2012). Intro-
duction to data mining for sustainability. Data Mining
and Knowledge Discovery, 24(2):311–324.
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y.
(2007). More efficiency in multiple kernel learning. In
Proceedings of the 24th International Conference on
Machine Learning.
Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaran-
teed minimum-rank solutions of linear matrix equa-
tions via nuclear norm minimization. SIAM Review,
52(3):471–501.
Recht, B. and Ré, C. (2011). Parallel stochastic gradient al-
gorithms for large-scale matrix completion. Technical
report, University of Wisconsin-Madison.
Schoenberg, I. J. (1938). Metric spaces and positive definite
functions. Transactions of the American Mathemati-
cal Society, 44(3):522–536.
Schölkopf, B. and Smola, A. J. (2001). Learning with Ker-
nels: Support Vector Machines, Regularization, Opti-
mization, and Beyond. MIT Press, Cambridge, MA,
USA.
Shawe-Taylor, J. and Sun, S. (2011). A review of optimiza-
tion methodologies in support vector machines. Neu-
rocomputing, 74(17):3609–3618.
Stolpe, M., Bhaduri, K., Das, K., and Morik, K. (2013).
Anomaly detection in vertically partitioned data by
distributed core vector machines. In Machine Learn-
ing and Knowledge Discovery in Databases - Euro-
pean Conference, ECML PKDD 2013.
Whittaker, J., Garside, S., and Lindveld, K. (1997). Track-
ing and predicting a network traffic process. Interna-
tional Journal of Forecasting, 13(1):51–61.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
124