Kernel Completion for Learning Consensus Support Vector Machines in
Bandwidth-limited Sensor Networks
Sangkyun Lee and Christian Pölitz
Fakultät für Informatik, LS VIII, Technische Universität Dortmund, 44221 Dortmund, Germany
Keywords:
Sensor Networks, Support Vector Machines, Gaussian Kernels, Sampling, Matrix Completion.
Abstract:
Recent developments in sensor technology allow for capturing dynamic patterns in vehicle movements, tem-
perature changes, and sea-level fluctuations, just to name a few. A usual way for decision making on sensor
networks, such as detecting exceptional surface level changes across the Pacific ocean, involves collecting
measurement data from all sensors to build a predictor in a central processing station. However, data col-
lection becomes challenging when communication bandwidth is limited, due to communication distance or
low-energy requirements. Also, such settings will introduce unfavorable latency for making predictions on
unseen events. In this paper, we propose an alternative strategy for such scenarios, aiming to build a consensus
support vector machine (SVM) in each sensor station by exchanging a small amount of sampled information
from local kernel matrices amongst peers. Our method is based on decomposing a “global” kernel defined
with all features into “local” kernels defined only with attributes stored in each sensor station, sampling few
entries of the decomposed kernel matrices that belong to other stations, and filling in unsampled entries in
kernel matrices by matrix completion. Experiments on benchmark data sets illustrate that a consensus SVM
can be built in each station using limited communication, which is competitive in prediction performance with an
SVM built with access to all features.
1 INTRODUCTION
Sensors can monitor many different kinds of dynam-
ics in nature, generating numerous data, and thereby
embodying research challenges in machine learning
and data mining (Whittaker et al., 1997; Lippi et al.,
2010; Morik et al., 2012). There is a wide spectrum of
sensing devices available today, but they share a com-
mon property: communication is costly and should
be avoided whenever possible, due to restrictions in
bandwidth or in energy consumption. This is a clear
barrier for global decision making, for which it is typ-
ically required to agglomerate all local sensor mea-
surements into a central location for processing.
On the other hand, many sensors are stationed
within devices equipped with surprisingly powerful
and energy-efficient computation units. This has mo-
tivated us to use computation to save communication.
Specifically, we aim to build a support vector machine
(SVM) (Boser et al., 1992) in each of such devices,
called sensor stations, using local measurements and
a small amount of sampled information transmitted
from other stations. The goal is to obtain a consensus
SVM in each station that behaves similarly to a global
SVM that could be constructed if we collect informa-
tion from all stations for central processing.
Our work is closely related to learning SVMs in
distributed environments, which can be split into two
categories. Case I: examples are distributed (fea-
tures are not distributed). In such cases a global
SVM can be trained using a distributed optimiza-
tion algorithm (Boyd et al., 2011), or separate SVMs
can be trained locally for data partitions with extra
constraints to produce similar models (Forero et al.,
2010). Alternatively, local SVMs can be trained in
their primal form independently on data partitions
and then combined to produce a model with a re-
duced variance (Lee and Bockermann, 2011; Cram-
mer et al., 2012). Case II: features are distributed (ex-
amples are not distributed). In such cases a central
coordination of local SVM training has been consid-
ered to improve global prediction performance (Lee
et al., 2012; Stolpe et al., 2013). Our work focuses on
the second case where features are distributed, con-
sidering communication-efficient approximations to a
global kernel matrix (which could be built by access-
ing all features) in each station, without any central
coordination.
Our suggested method is based on decompositions
of a (global) kernel into separate parts, where each of
them is another kernel defined with attributes stored
locally in each sensor station. Each decomposed ker-
nel matrix is stored in a sensor station where corre-
sponding attributes are stored. Each station receives
few sampled entries of the decomposed kernel matri-
ces stored in remote stations, and then applies matrix
completion to approximate the values of unobserved
entries. Using these altogether, a consensus SVM is
created in each station, which can be applied for pre-
dicting future events using local and remote informa-
tion in a similar fashion.
We denote the Euclidean norm by $\|\cdot\|$ and the cardinality of a finite set $A$ by $|A|$ throughout the paper.
2 SUPPORT VECTOR MACHINES WITH DECOMPOSED KERNELS
Let us consider sensor stations represented as nodes $n = 1, 2, \ldots, N$ in a network, where each node stores measurements from its own sensors, in a feature vector $x_i[n] \in \mathbb{R}^{p_n}$, of sensing targets $i = 1, 2, \ldots, m$. For simplicity we assume that communication between any pair of nodes is allowed. A collection of all these vectors, $x_i = (x_i[1]^T, x_i[2]^T, \ldots, x_i[N]^T)^T$, can be seen as an input vector of length $p = \sum_{n=1}^{N} p_n$.
2.1 Support Vector Machines
The dual formulation of SVMs is described as follows (Shawe-Taylor and Sun, 2011),

$$ \min_{\alpha \in \mathbb{R}^m} \; \frac{1}{2}\alpha^T Q\alpha - \mathbf{1}^T\alpha, \quad \text{subject to } y^T\alpha = 0, \;\; 0 \le \alpha \le C\mathbf{1}. \qquad (1) $$

Here $\mathbf{1} := (1, 1, \ldots, 1)^T$ and $y := (y_1, y_2, \ldots, y_m)^T$ are column vectors of length $m$, and $C$ is a given constant. (Without loss of generality, we focus on the case of classification; our method can be generalized for other types.) The matrix $Q \in \mathbb{R}^{m\times m}$ is a scaled kernel matrix, that is, $Q := YKY$ for a positive semidefinite kernel matrix $K$, where $Y := \mathrm{diag}(y)$ is the diagonal matrix whose elements are given by the vector $y$. SVMs have been successful in many applications, including multitask multiclass learning problems (Ji and Sun, 2013) for example.
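As an illustration of how a (possibly estimated) kernel matrix enters the dual problem (1), the sketch below trains an SVM from a precomputed kernel matrix. It is not the paper's C++ implementation; scikit-learn's SVC with kernel="precomputed" and the synthetic data are assumptions made only for this example.

```python
# Minimal sketch: an SVM trained directly from a precomputed kernel matrix,
# which is how the estimated kernels of later sections are plugged into (1).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # synthetic inputs
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)    # synthetic labels in {-1, +1}

gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)                       # Gaussian kernel matrix

svm = SVC(C=10.0, kernel="precomputed")
svm.fit(K, y)                                       # the solver sees only K and y
print("training accuracy:", svm.score(K, y))
```

For prediction, the same kind of solver expects the kernel values between test and training points, which is why Section 3 also builds a completed test kernel matrix.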
2.2 Decomposition of Kernels
We consider two different decompositions of the kernel matrix $K$, especially those obtainable from the popular Gaussian kernel. We refer to them as "MULTIPLICATIVE" and "ADDITIVE", defined as follows for $i, j = 1, 2, \ldots, m$,

$$ \text{(MULTIPLICATIVE)}\quad [K]_{ij} = \prod_{n=1}^{N} \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big), \text{ and} $$
$$ \text{(ADDITIVE)}\quad [K]_{ij} = \frac{1}{N} \sum_{n=1}^{N} \exp\big(-\gamma_n \|x_i[n] - x_j[n]\|^2\big). \qquad (2) $$

The MULTIPLICATIVE kernel is indeed the same as the standard Gaussian kernel (Schölkopf and Smola, 2001), but our description above reveals that it can be constructed by multiplying "local" Gaussian kernels defined with attributes stored locally in sensor stations. The construction of ADDITIVE is similar, except that local Gaussian kernels are averaged, not multiplied. ADDITIVE resembles how kernels are used in multiple kernel learning (Lanckriet et al., 2002): the connection is further discussed in Section 4.3. Note that MULTIPLICATIVE has a single parameter $\gamma > 0$, but ADDITIVE has a separate parameter $\gamma_n > 0$ for each local kernel.
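As a small sketch of the two constructions in (2), the following assumes the features have already been split into per-node blocks, and checks numerically that the MULTIPLICATIVE kernel coincides with the standard Gaussian kernel on the concatenated features; all variable names and sizes are illustrative.

```python
# Sketch of the MULTIPLICATIVE and ADDITIVE kernel constructions in (2).
import numpy as np

def local_gaussian(Xn, gamma_n):
    """Local Gaussian kernel (G_n or H_n) from one node's feature block."""
    d2 = ((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_n * d2)

rng = np.random.default_rng(1)
m, dims = 100, [3, 4, 2]                          # m examples, p_n features per node
blocks = [rng.normal(size=(m, p_n)) for p_n in dims]
gamma = 0.2
gammas = [gamma] * len(blocks)                    # ADDITIVE allows a separate gamma_n

K_mult = np.ones((m, m))
for Xn in blocks:                                 # product of local kernels
    K_mult *= local_gaussian(Xn, gamma)

K_add = np.mean([local_gaussian(Xn, g) for Xn, g in zip(blocks, gammas)], axis=0)

# MULTIPLICATIVE equals the Gaussian kernel on the concatenated features.
print(np.allclose(K_mult, local_gaussian(np.hstack(blocks), gamma)))   # True
```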
2.3 Local and Remote Parts in Decomposition
From the definitions in (2), we identify the parts that can be computed with attributes stored locally in each node (local parts), and those that need to be transferred from other nodes in a sensor network (remote parts).

First, the expression of MULTIPLICATIVE can be rewritten in the following way,

$$ [K]_{ij} = \exp\Big(-\sum_{n=1}^{N} \gamma \|x_i[n] - x_j[n]\|^2\Big) = \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big) \prod_{n' \neq n} \exp\big(-\gamma \|x_i[n'] - x_j[n']\|^2\big) = [G_n]_{ij}\,[G_{\bar n}]_{ij}, \qquad (3) $$

where the "local" Gaussian kernel $G_n$ for a node $n$ and the product $G_{\bar n}$ of all "remote" kernels are defined entrywise respectively by

$$ [G_n]_{ij} := \exp\big(-\gamma \|x_i[n] - x_j[n]\|^2\big) \quad \text{(local)}, \qquad [G_{\bar n}]_{ij} := \prod_{n' \neq n} [G_{n'}]_{ij} \quad \text{(remote)}. $$
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
114
Similarly, ADDITIVE can be written as

$$ [K]_{ij} = \frac{1}{N}\Big([H_n]_{ij} + \sum_{n' \neq n} [H_{n'}]_{ij}\Big) = \frac{1}{N}\,[H_n + H_{\bar n}]_{ij}, \qquad (4) $$

where $H_n$ is the local part and $H_{\bar n}$ is the remote part for node $n$, defined respectively by

$$ [H_n]_{ij} := \exp\big(-\gamma_n \|x_i[n] - x_j[n]\|^2\big) \quad \text{(local)}, \qquad [H_{\bar n}]_{ij} := \sum_{n' \neq n} [H_{n'}]_{ij} \quad \text{(remote)}. $$

For a node $n$, the computation of the local part $G_n$ (or $H_n$) is done exactly using local attributes, whereas the remote part $G_{\bar n}$ (or $H_{\bar n}$) is to be approximated.
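Continuing the sketch above (it reuses local_gaussian, blocks, gamma, gammas, K_mult, and K_add from the previous block and is not runnable on its own), the local/remote split of (3) and (4) for one node can be written as follows; in the actual setting only the local part is computable on the node, and the remote part is what Section 3 approximates from sampled entries.

```python
# Local vs. remote parts for node n, following (3) and (4);
# continuation of the previous sketch.
n = 0
G_loc = local_gaussian(blocks[n], gamma)                  # local part G_n
G_rem = np.ones((m, m))
for k, Xn in enumerate(blocks):
    if k != n:
        G_rem *= local_gaussian(Xn, gamma)                # product of remote kernels
assert np.allclose(K_mult, G_loc * G_rem)                 # equation (3)

H_loc = local_gaussian(blocks[n], gammas[n])              # local part H_n
H_rem = sum(local_gaussian(Xn, g)
            for k, (Xn, g) in enumerate(zip(blocks, gammas)) if k != n)
assert np.allclose(K_add, (H_loc + H_rem) / len(blocks))  # equation (4)
```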
3 KERNEL COMPLETION
Let us denote the kernel matrix to be estimated in the $n$th node by $\widetilde K_n$, which is computed by

$$ [\widetilde K_n]_{ij} := \begin{cases} [G_n]_{ij}\,[\widetilde G_{\bar n}]_{ij} & \text{(MULTIPLICATIVE)} \\ \frac{1}{N}\,[H_n + \widetilde H_{\bar n}]_{ij} & \text{(ADDITIVE)}. \end{cases} \qquad (5) $$

Here $\widetilde G_{\bar n}$ (or $\widetilde H_{\bar n}$) is an estimate of the remote part $G_{\bar n}$ (or $H_{\bar n}$). Once we have $\widetilde K_n$, it can be plugged into (1), replacing $Q$ in the form of $\widetilde Q_n := Y \widetilde K_n Y$.
In order to obtain the estimate $\widetilde G_{\bar n}$ (or $\widetilde H_{\bar n}$), we make use of matrix completion (Candès and Recht, 2009), which is a method to reconstruct a matrix from only a few entries sampled from it. The purpose of using matrix completion is (i) to reduce the number of entries required to be sampled from remote kernel parts in bandwidth-limited situations (matrix completion would not be required if all nodes provided complete information), and (ii) to avoid the complexity of defining an optimal sampling strategy: a simple uniform random sampling strategy is enough for matrix completion to guarantee the perfect recovery of the original kernel matrix with high probability.

We first discuss the extra constraints we need to add to matrix completion, so that the resulting matrix is a valid kernel matrix.
3.1 Constraints on $\widetilde G_{\bar n}$ and $\widetilde H_{\bar n}$

First, $\widetilde G_{\bar n}$ has to be a symmetric matrix whose diagonal entries are all ones. It is a valid kernel matrix if and only if it is positive semidefinite (Schölkopf and Smola, 2001), that is, $z^T \widetilde G_{\bar n} z \ge 0$ for all $z \in \mathbb{R}^m$. Since each entry of a local Gaussian kernel $G_{n'}$ is in the range $(0, 1]$ by definition, the product of such entries in $\widetilde G_{\bar n}$ should be in the same range as well. Next, $\widetilde H_{\bar n}$ shares the same properties as $\widetilde G_{\bar n}$, except that each diagonal element of $\widetilde H_{\bar n}$ is $(N-1)$, not one, by construction.
There is another possible way to decompose the MULTIPLICATIVE kernel,

$$ [K]_{ij} = \exp\Big(-\gamma \|x_i[n] - x_j[n]\|^2 - \gamma \sum_{n' \neq n} \|x_i[n'] - x_j[n']\|^2\Big) = \exp\big([D_n + D_{\bar n}]_{ij}\big), $$

with

$$ [D_n]_{ij} := -\gamma \|x_i[n] - x_j[n]\|^2, \qquad [D_{\bar n}]_{ij} := \sum_{n' \neq n} [D_{n'}]_{ij}. $$

Then our task becomes making an estimate $\widetilde D_{\bar n}$ of a distance matrix $D_{\bar n}$, which has zero diagonal entries. The estimate defines a valid distance matrix if and only if it is conditionally positive semidefinite, that is, $z^T \widetilde D_{\bar n} z \ge 0$ for all $z \in \mathbb{R}^m$ with $z^T \mathbf{1} = 0$ (Schoenberg, 1938). This implies that $\widetilde D_{\bar n}$ is positive semidefinite, or it has a single negative eigenvalue. It turned out that our kernel completion in the form of (3) performed better in our experiments, so we did not pursue this direction further.
3.2 Low-rank Matrix Completion
For the description of matrix completion, we follow the line of discussion in (Recht and Ré, 2011). Matrix completion reconstructs a full matrix from only a few entries sampled from the original matrix. In general, matrix completion works with matrices of any shape, but we focus on square matrices here.

Suppose that $X \in \mathbb{R}^{m\times m}$ is a matrix we wish to recover, and that the entries at $(i,j) \in \Omega$ of $X$ are revealed and stored in another matrix $M$. Matrix completion solves the following convex optimization problem to recover $X$,

$$ \min_{X} \sum_{(i,j)\in\Omega} (X_{ij} - M_{ij})^2 + \lambda \|X\|_*, \qquad \|X\|_* := \sum_{k=1}^{m} \sigma_k(X). $$

Here $\|X\|_*$ is the nuclear norm of $X$, which is the summation of the singular values $\sigma_k(X)$ of $X$ and penalizes the rank of $X$.

The nuclear norm simplifies when we assume that the matrix $X$ has rank $r$, and consider a factorization of $X$ into $LR^T$ for some $L \in \mathbb{R}^{m\times r}$ and $R \in \mathbb{R}^{m\times r}$. This leads to $X_{ij} = [LR^T]_{ij} = L_{i\cdot} R_{j\cdot}^T$, and

$$ \|X\|_* = \min_{X = LR^T} \frac{1}{2}\|L\|_F^2 + \frac{1}{2}\|R\|_F^2, $$
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
115
Figure 1: A schematic of kernel completion with the MULTIPLICATIVE kernel. Each node $n$ (i) collects and summarizes the samples corresponding to $\Omega$ from the remote kernel matrices $G_{n'}$ as the known entries of the matrix $M_n$, then (ii) fills in the unknown entries of $M_n$ via matrix completion, producing $\widetilde G_{\bar n}$, and (iii) forms an estimate of the kernel matrix together with the exact local kernel $G_n$.
where $\|\cdot\|_F$ is the Frobenius norm. The equivalence can be understood by taking a singular value decomposition $X = U\Sigma V^T$ and setting $L = U\Sigma^{1/2}$ and $R = V\Sigma^{1/2}$. Then $\|X\|_* = \mathrm{tr}(\Sigma)$, $\|L\|_F^2 = \mathrm{tr}(L^T L) = \mathrm{tr}(\Sigma)$, and $\|R\|_F^2 = \mathrm{tr}(\Sigma)$, so the equality holds. For details, we refer to (Recht et al., 2010; Recht and Ré, 2011).

Using the property of the nuclear norm on rank-$r$ matrices, we can reformulate the matrix completion optimization as

$$ \min_{L,R} \sum_{(i,j)\in\Omega} (L_{i\cdot} R_{j\cdot}^T - M_{ij})^2 + \frac{\lambda}{2}\|L\|_F^2 + \frac{\lambda}{2}\|R\|_F^2. \qquad (6) $$

To obtain solutions, we use the JELLYFISH algorithm (Recht and Ré, 2011), which is a highly parallel incremental gradient descent procedure to find the minimizers, making use of the fact that the gradient of the above objective for a term $(i,j)$ depends only on $L_{i\cdot}$ and $R_{j\cdot}$, and therefore the computation of each iteration can be easily distributed over the pairs $(i,j) \in \Omega$.
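Below is a minimal sketch of solving (6) by incremental gradient descent over the observed pairs, in the spirit of JELLYFISH but without its parallel scheduling; the rank, step size, regularization weight, and epoch count are illustrative assumptions rather than values from the paper.

```python
# Sketch: factored matrix completion (6) via incremental gradient descent.
# Each update touches only row L[i] and row R[j] of the factors.
import numpy as np

def complete(M_obs, omega, m, r=10, lam=0.1, step=0.05, epochs=50, seed=0):
    """Return factors L, R with L @ R.T fitting the entries of M_obs on omega."""
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=1.0 / np.sqrt(r), size=(m, r))
    R = rng.normal(scale=1.0 / np.sqrt(r), size=(m, r))
    pairs = np.asarray(list(omega))
    # Spread the Frobenius regularizer of a row over its number of observations.
    row_cnt = np.bincount(pairs[:, 0], minlength=m).astype(float) + 1e-12
    col_cnt = np.bincount(pairs[:, 1], minlength=m).astype(float) + 1e-12
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            i, j = pairs[idx]
            resid = L[i] @ R[j] - M_obs[i, j]
            gL = 2.0 * resid * R[j] + lam * L[i] / row_cnt[i]
            gR = 2.0 * resid * L[i] + lam * R[j] / col_cnt[j]
            L[i] -= step * gL
            R[j] -= step * gR
    return L, R
```

In the distributed setting of this paper, M_obs would play the role of a node's matrix $M_n$, whose entries are known only on $\Omega$.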
3.2.1 Constrained Matrix Completion
To incorporate the constraints discussed in Section 3.1, we need to find a matrix $\widetilde X$ that is close to $L^*(R^*)^T$, where $L^*$ and $R^*$ are the solutions of (6) (both are $m$ by $r$, $r \in (0, m]$), belonging to the set $\mathcal{K}$ of symmetric positive semidefinite rank-$r$ matrices,

$$ \mathcal{K} := \{\widetilde X : \widetilde X \succeq 0,\; \widetilde X^T = \widetilde X,\; \mathrm{rank}(\widetilde X) = r\}. $$

The following lemma shows that the description of this set can be simplified.
Lemma 3.1. The elements $\widetilde X$ in $\mathcal{K}$ must have the form $\widetilde X = ZZ^T$, where $Z \in \mathbb{R}^{m\times r}$.

Proof. Suppose that $\widetilde X$ is in $\mathcal{K}$. Since $\widetilde X$ is symmetric and positive semidefinite, from the eigendecomposition of $\widetilde X$ there exists a factor $U \in \mathbb{R}^{m\times m}$ such that $\widetilde X = U\Sigma U^T$, where $\Sigma \succeq 0$ is the diagonal matrix of eigenvalues. Removing the columns of $U$ and the part of $\Sigma$ corresponding to the zero eigenvalues, we obtain $\widetilde U \in \mathbb{R}^{m\times r}$ and $\widetilde\Sigma \in \mathbb{R}^{r\times r}$. Then $Z = \widetilde U \widetilde\Sigma^{1/2}$ can be constructed so that $\widetilde X = ZZ^T$.

Conversely, any $\widetilde X$ of the form $\widetilde X = ZZ^T$ satisfies $\widetilde X^T = \widetilde X$ (symmetric) and $z^T \widetilde X z = \|z^T Z\|_2^2 \ge 0$ for all $z \in \mathbb{R}^m$ (positive semidefinite). Therefore $\widetilde X = ZZ^T$ is an element of $\mathcal{K}$.
This lemma indicates that the set $\mathcal{K}$ can be rewritten simply as

$$ \mathcal{K} = \{ZZ^T : Z \in \mathbb{R}^{m\times r}\}. $$

The next step is to find a matrix $Z$ such that $ZZ^T$ is close to $L^*(R^*)^T$. An $\ell_2$ projection of $L^*(R^*)^T$ onto $\mathcal{K}$ requires an iterative procedure which is as costly as finding $L^*$ and $R^*$. Therefore we consider an alternative projection for which we have a closed-form solution,

$$ Z^* = \operatorname*{argmin}_{Z \in \mathbb{R}^{m\times r}} \; \frac{1}{2}\|Z - L^*\|_F^2 + \frac{1}{2}\|Z - R^*\|_F^2. $$

From the KKT conditions, the solution is obtained by

$$ Z^* = \frac{L^* + R^*}{2}. $$

Then a projection $\widetilde X^*$ is obtained by $\widetilde X^* = Z^*(Z^*)^T$, which has a guarantee on its quality as stated in the next lemma:
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
116
Lemma 3.2. The trace-norm distance between $\widetilde X^* = Z^*(Z^*)^T$, where $Z^* = (L^* + R^*)/2$, and $L^*(R^*)^T$ is bounded, that is,

$$ \mathrm{tr}\big(\widetilde X^* - L^*(R^*)^T\big) \le \frac{1}{4}\|L^* - R^*\|_F^2. $$

Proof. Using $\widetilde X^* = Z^*(Z^*)^T$, the result can be derived as follows,

$$ \mathrm{tr}\big(\widetilde X^* - L^*(R^*)^T\big) = \frac{1}{4}\mathrm{tr}\big\{(L^* + R^*)(L^* + R^*)^T - 4L^*(R^*)^T\big\} = \frac{1}{4}\mathrm{tr}\big\{(L^* - R^*)(L^* - R^*)^T\big\} = \frac{1}{4}\|L^* - R^*\|_F^2. $$

Here we have used the properties of the trace that $\mathrm{tr}(X+Y) = \mathrm{tr}(X) + \mathrm{tr}(Y) = \mathrm{tr}(X) + \mathrm{tr}(Y^T) = \mathrm{tr}(X+Y^T)$ and $\mathrm{tr}(XX^T) = \|X\|_F^2$.
The above lemma tells us that the distance between $\widetilde X^*$ and $L^*(R^*)^T$ becomes small whenever $L^* \approx R^*$, which is likely to happen in our case since we define $M$ and $\Omega$ in such a way that if $(i,j) \in \Omega$ then $(j,i) \in \Omega$, and $M_{ij} = M_{ji}$.
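Given the factors from the completion step, the projection of Section 3.2.1 has a closed form; the sketch below (with hypothetical inputs) forms $Z^* = (L^* + R^*)/2$ and $\widetilde X^* = Z^*(Z^*)^T$, and checks the symmetry and positive semidefiniteness required by Lemma 3.1 as well as the trace bound of Lemma 3.2.

```python
# Sketch: closed-form symmetric PSD projection of Section 3.2.1.
import numpy as np

def project_psd(L, R):
    """Return Z = (L + R) / 2 and the symmetric PSD estimate X = Z Z^T."""
    Z = 0.5 * (L + R)
    return Z, Z @ Z.T

rng = np.random.default_rng(0)
L_star = rng.normal(size=(50, 5))
R_star = L_star + 0.01 * rng.normal(size=(50, 5))   # nearly equal, as when M is symmetric
Z, X_tilde = project_psd(L_star, R_star)

assert np.allclose(X_tilde, X_tilde.T)              # symmetric
assert np.linalg.eigvalsh(X_tilde).min() > -1e-10   # PSD up to rounding
gap = np.trace(X_tilde - L_star @ R_star.T)
print(gap <= 0.25 * np.linalg.norm(L_star - R_star, "fro") ** 2 + 1e-12)  # Lemma 3.2
```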
3.2.2 Sample Index Pair Subset
In our method, we assume that a single sample index pair set $\Omega \subseteq \{(i,j) : 1 \le i, j \le m\}$ is fixed across all nodes. It is more efficient than using multiple sample sets, since otherwise we have to store and complete each remote matrix $G_{n'}$, $n' \in \{1,\ldots,N\}\setminus\{n\}$, separately. Using a pre-defined $\Omega$ across nodes can be implemented as using a fixed random seed for a pseudo-random number generator, so that $\Omega$ does not have to be transferred at all.

Given $\Omega$, each node $n$ receives information from the other nodes $n'$ and stores it in $M_n$ as follows, for all $(i,j) \in \Omega$,

$$ [M_n]_{ij} = \begin{cases} \prod_{n' \in \{1,\ldots,N\}\setminus\{n\}} [G_{n'}]_{ij} & \text{(MULTIPLICATIVE)} \\ \frac{1}{N-1}\sum_{n' \in \{1,\ldots,N\}\setminus\{n\}} [H_{n'}]_{ij} & \text{(ADDITIVE)}. \end{cases} $$

That is, the communication cost for each node $n$ is $O((N-1)|\Omega|)$. The use of matrix completion makes it possible to choose an $\Omega$ of relatively small size ($O(m^{1.2} r \log m)$ when $M_n$ is a rank-$r$ matrix, see Theorem 4.1 for details) in a simple way, that is, via uniform random sampling.
Once the matrix $M_n$ is obtained, node $n$ solves the matrix completion problem (6) with $[M_n]_\Omega$ to obtain $Z_n = (L_n + R_n)/2$, and then computes $\widetilde G_{\bar n} = Z_n Z_n^T$ or $\widetilde H_{\bar n} = (N-1) Z_n Z_n^T$, based on Lemmas 3.1 and 3.2. An estimate of the kernel $K$, obtained by (5), is then used for training an SVM.

After training the SVMs, we apply the same technique to new test examples to build the test kernel matrix. This usually involves smaller matrix completion problems corresponding to the support vectors and test examples.
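A sketch of Section 3.2.2 under simple assumptions: every node reproduces the same symmetric $\Omega$ from a shared random seed, and node $n$ combines the sampled entries received from the other nodes into $M_n$ by an elementwise product (MULTIPLICATIVE) or an average (ADDITIVE); the function and argument names are hypothetical.

```python
# Sketch: shared sample set Omega from a fixed seed, and assembly of M_n.
import numpy as np

def sample_omega(m, ratio, seed=12345):
    """Symmetric index pairs, identical in every node thanks to the fixed seed."""
    rng = np.random.default_rng(seed)
    mask = np.triu(rng.random((m, m)) < ratio, 1)
    mask = mask | mask.T                        # (i, j) in Omega implies (j, i) in Omega
    return np.argwhere(mask)                    # array of (i, j) pairs

def assemble_M(remote_samples, m, omega, mode="multiplicative"):
    """remote_samples: one vector of sampled kernel entries per remote node."""
    M = np.ones((m, m)) if mode == "multiplicative" else np.zeros((m, m))
    rows, cols = omega[:, 0], omega[:, 1]
    for vals in remote_samples:                 # vals[k] corresponds to omega[k]
        if mode == "multiplicative":
            M[rows, cols] *= vals
        else:
            M[rows, cols] += vals
    if mode == "additive":
        M[rows, cols] /= max(len(remote_samples), 1)    # divide by N - 1
    return M
```

Only the $|\Omega|$ sampled values per remote node travel over the network; $\Omega$ itself never has to be transmitted, matching the $O((N-1)|\Omega|)$ communication cost above.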
3.3 Extra Saving with ADDITIVE
The description of the matrix completion optimization (6) involves all training examples. However, if a (super-)set of the support vectors (SVs), which fully determine the prediction function, is known a priori, then we can solve the completion problem only for that set, reducing the cost of matrix completion.

Let us consider the SVs of the "global" SVM problem (1) equipped with the exact ADDITIVE kernel (2), which is constructed by accessing all features in a central location. We denote this set of SVs by $S^*$. Note that $S^*$ is never obtained, since we do not solve such a global problem.
We try to estimate $S^*$ from the sets of "local" SVs. These local SVs are obtained from solving an individual SVM (1) in each node $n$, using only the local features, that is, setting the scaled kernel matrix as $Q = Y H_n Y$ for the local kernel matrix $[H_n]_{ij} = \exp(-\gamma_n \|x_i[n] - x_j[n]\|^2)$. We denote the set of SVs in node $n$ obtained in this way by $S_n$.

In the next theorem, we show that the union of the local SV sets $S_n$ encompasses the global SV set $S^*$. To shorten the length of our proof, we show here the case of SVMs without an intercept, that is, the constraint $y^T\alpha = 0$ is removed (the same result holds for the case with intercepts).
Theorem 3.3. Consider the global SVM problem with the ADDITIVE kernel and its set of SVs $S^*$,

$$ \alpha^* := \operatorname*{argmin}_{0 \le \alpha \le C\mathbf{1}} \; \frac{1}{2}\alpha^T Y\Big(\frac{1}{N}\sum_{n=1}^{N} H_n\Big) Y\alpha - \mathbf{1}^T\alpha, \qquad S^* := \{i : [\alpha^*]_i > 0\}, $$

and the corresponding local SVM problem and its SVs for each node $n$, $n = 1, 2, \ldots, N$,

$$ \alpha_n^* := \operatorname*{argmin}_{0 \le \alpha \le C\mathbf{1}} \; \frac{1}{2}\alpha^T Y H_n Y\alpha - \mathbf{1}^T\alpha, \qquad S_n := \{i : [\alpha_n^*]_i > 0\}. $$

Then we have

$$ S^* \subseteq \bigcup_{n=1}^{N} S_n. $$
Proof. Let us consider an index $i \in S^*$ of an SV of the global SVM problem, such that $[\alpha^*]_i > 0$. Suppose that the $i$th component of the gradient of all local
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
117
SVM problems at $\alpha^*$ is strictly positive, that is,

$$ [Y H_n Y \alpha^* - \mathbf{1}]_i > 0, \quad \forall n \in \{1, 2, \ldots, N\}. \qquad (7) $$

Let us look into the optimality condition of the global SVM regarding the $i$th component of the optimizer $\alpha^*$. From the KKT conditions, we have

$$ \frac{1}{N}\sum_{n=1}^{N} [Y H_n Y \alpha^* - \mathbf{1}]_i - [p^*]_i + [q^*]_i = 0, \qquad [p^*]_i [\alpha^*]_i = 0, \qquad [q^*]_i [C\mathbf{1} - \alpha^*]_i = 0, $$

where $p^* \in \mathbb{R}^m_+$ and $q^* \in \mathbb{R}^m_+$ are the Lagrange multipliers for the constraints $\alpha \ge 0$ and $\alpha \le C\mathbf{1}$, respectively. Then $[\alpha^*]_i > 0$ implies $[p^*]_i = 0$, and therefore

$$ \frac{1}{N}\sum_{n=1}^{N} [Y H_n Y \alpha^* - \mathbf{1}]_i + [q^*]_i = 0. $$

If (7) were true, we would have a contradiction here, since the first term above would be strictly positive while the second term satisfies $[q^*]_i \ge 0$, so the equality could not hold. This implies that there exists at least one node $n$ for which the condition in (7) is not satisfied, that is, $[Y H_n Y \alpha^* - \mathbf{1}]_i \le 0$. This means that if we search for the local SVM solution at node $n$ starting from $\alpha^*$, we must increase the value of the $i$th component from $[\alpha^*]_i$ to reach the minimizer $[\alpha_n^*]_i$ of this local SVM problem, since otherwise we would increase the objective function value. That is,

$$ [\alpha_n^*]_i \ge [\alpha^*]_i > 0. $$

This implies that the index $i$ is also an SV of at least one local SVM problem. Therefore, $i \in \bigcup_{n=1}^{N} S_n$, which implies the claim.
Theorem 3.3 enables us to restrict our attention to the union SV set without losing any information in the case of ADDITIVE, where the size of the union SV set is typically much smaller than that of the entire training example index set. In effect, this makes solving the matrix completion problem (6) more efficient, by reducing the number of variables from $O(m^2)$ to $O(|\bigcup_n S_n|^2)$.
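For ADDITIVE, Theorem 3.3 lets each node restrict the completion problem to the union of the local SV index sets. A rough sketch of the bookkeeping, assuming local dual solutions alpha_n (from any SVM dual solver) are already available; the sparsity pattern below is synthetic and only illustrates the reduction in size.

```python
# Sketch: union of local SV index sets, used to shrink the completion problem.
import numpy as np

def union_support_vectors(local_alphas, tol=1e-8):
    """Union of S_n = {i : [alpha_n]_i > 0} over all nodes (Theorem 3.3)."""
    union = set()
    for alpha_n in local_alphas:
        union.update(np.flatnonzero(alpha_n > tol).tolist())
    return np.array(sorted(union))

rng = np.random.default_rng(0)
m, n_nodes = 5000, 3
local_alphas = []
for _ in range(n_nodes):                       # synthetic sparse dual solutions
    a = np.zeros(m)
    idx = rng.choice(m, size=400, replace=False)
    a[idx] = rng.random(400)
    local_alphas.append(a)

sv = union_support_vectors(local_alphas)
print(len(sv) ** 2, "completion variables instead of", m ** 2)
```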
3.4 Algorithm
Our kernel completion method for training SVMs is summarized in Algorithm 1. There, we use the symbol $\odot$ to represent elementwise multiplication between matrices.

We have implemented our algorithm as open source in C++, based on the JELLYFISH code (Recht and Ré, 2011) for matrix completion (available at http://hazy.cs.wisc.edu/hazy/victor/jellyfish/), and SVM-LIGHT (Joachims, 1999) for solving SVMs (available at http://svmlight.joachims.org/). Our implementation makes use of the union SV set theorem (Theorem 3.3) for the ADDITIVE approach to reduce kernel completion time, but not for MULTIPLICATIVE, since the theorem does not apply in that case.
4 RELATED WORK
Here we present existing methods that are closely re-
lated to our development.
4.1 Separable Approximate Optimization of SVMs
Lee, Stolpe, and Morik (Lee et al., 2012) have investigated the primal formulation of SVMs in a setting close to ours. In their work, the distributed nature of input features is considered via making an individual approximate feature mapping $\varphi_n$ for each node $n$, such that for a given local kernel function $k_n$, it approximates kernel evaluations,

$$ \langle \varphi_n(x_i), \varphi_n(x_j) \rangle \approx k_n(x_i, x_j), \quad \forall i, j. $$
Using this mapping, each node solves its own local SVM in the primal, producing a decision vector $w[n]$. Based upon the local solutions, a "global" SVM is explicitly constructed in a central node, which is defined with the collection of local decision vectors and local feature mappings (weighted by $\mu_n \ge 0$), that is,

$$ w := \begin{pmatrix} w[1] \\ \vdots \\ w[N] \end{pmatrix}, \qquad \varphi(x) := \begin{pmatrix} \mu_1 \varphi_1(x[1]) \\ \vdots \\ \mu_N \varphi_N(x[N]) \end{pmatrix}. $$
An interesting characteristic of this central SVM is that if we have optimized the local SVMs using specific forms of loss functions $\ell_n$ whose weighted summation forms an upper bound of the original loss function $\ell$, that is,

$$ \ell\big(w^T \varphi(x), y\big) \le \sum_{n=1}^{N} \mu_n \,\ell_n\big(w[n]^T \varphi_n(x[n]), y\big), $$

then it can be shown that this central SVM minimizes an upper bound of the standard SVM objective with the original loss function. The nonnegative weights $\mu_1, \mu_2, \ldots, \mu_N$ are optimized in the central node, which requires transferring $O(m)$ numbers from each local node $n = 1, 2, \ldots, N$.
Algorithm 1: Kernel Completion for SVMs.

input: a data set $\{(x_i, y_i)\}_{i=1}^m$, a sample set $\Omega$, and parameters $\gamma$, $\{\gamma_n\}_{n=1}^N$.

(parallel: in each node n = 1, 2, ..., N)
  input: local measurements/labels $\{(x_i[n], y_i)\}_{i=1}^m$.
  Compute the local kernel matrix $G_n$ for MULTIPLICATIVE (or $H_n$ for ADDITIVE);
  if ADDITIVE then
    // Make the union of SV index sets (ADDITIVE only)
    Solve the SVM (1) with $Q \leftarrow Y H_n Y$, to obtain the SV index set $S_n$;
    Receive $S_{n'}$ from all other nodes $n'$;
    Trim $\Omega$ to fit $\bigcup_{n=1}^{N} S_n$;
  end
  // Collect samples from remote kernel matrices
  Initialize: $[M_n]_\Omega \leftarrow \mathbf{1}\mathbf{1}^T$ (MULTIPLICATIVE) or $[M_n]_\Omega \leftarrow 0$ (ADDITIVE);
  for $n' \in \{1, 2, \ldots, N\}\setminus\{n\}$ do
    Receive $[G_{n'}]_\Omega$ (MULTIPLICATIVE), or $[H_{n'}]_\Omega$ (ADDITIVE);
    $[M_n]_\Omega \leftarrow [M_n]_\Omega \odot [G_{n'}]_\Omega$ (MULTIPLICATIVE), or $[M_n]_\Omega \leftarrow [M_n]_\Omega + [H_{n'}]_\Omega$ (ADDITIVE);
  end
  For ADDITIVE, scale $[M_n]_\Omega \leftarrow [M_n]_\Omega / (N-1)$;
  // Kernel completion for $\widetilde K_n$
  Solve the matrix completion problem (6) with the observed entries $[M_n]_\Omega$, to obtain $L_n$ and $R_n$;
  Compute the projection, to obtain $Z_n \leftarrow (L_n + R_n)/2$;
  Compute the estimated kernel matrix $\widetilde K_n$ by (5):
    $\widetilde G_{\bar n} \leftarrow Z_n Z_n^T$ (MULTIPLICATIVE) or $\widetilde H_{\bar n} \leftarrow (N-1)\, Z_n Z_n^T$ (ADDITIVE),
    $\widetilde K_n \leftarrow G_n \odot \widetilde G_{\bar n}$ (MULTIPLICATIVE) or $\widetilde K_n \leftarrow \frac{1}{N}(H_n + \widetilde H_{\bar n})$ (ADDITIVE);
  // Obtain an estimated consensus SVM
  Solve the SVM problem (1), replacing $Q$ with $\widetilde Q_n \leftarrow Y \widetilde K_n Y$;
(end)
The kernel function of this central SVM is indeed a weighted approximation of our ADDITIVE kernel (4), when each local feature mapping approximates a Gaussian kernel (parametrized by $\gamma_n$) with local features, and the weights are fixed to $\mu_n = 1/\sqrt{N}$. However, our work differs from this approach in several ways. First, we do not require a special node to build a central SVM, thereby avoiding a communication complexity of $O(mN)$. Moreover, to classify a test point in the central SVM approach, $O(N)$ elements have to be transferred to a central node for each test point, whereas in our case testing can be done in any node, although it also requires some communication. Second, in our method estimation happens only in kernel completion, whereas both kernels and loss functions are approximated in the central SVM approach. Lastly, we can use both ADDITIVE and MULTIPLICATIVE kernels, but only ADDITIVE kernels are allowed in the central SVM approach.
4.2 Consensus-based Distributed SVMs
Another closely related study is done by Forero,
Cano, and Giannakis (Forero et al., 2010). The mo-
tivation of this work is very similar to ours, in the
sense that it tries to construct a consensus SVM in a
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
119
distributed fashion, without having a central process-
ing location. They have developed a fully distributed
SVM training algorithm based on the alternating di-
rection method of multipliers (Bertsekas and Tsitsik-
lis, 1997).
However, the consensus-based distributed SVM considers situations where examples, not features, are distributed over connected nodes, unlike our work. Moreover, the consensus requirements are expressed therein as extra constraints in a distributed SVM optimization problem; in our case, consensus SVMs are obtained by making approximations in each node to the "global" kernel matrix that would have been constructed if we had collected all features in a central location.
4.3 Multiple Kernel Learning
Our ADDITIVE kernel is closely related to the multiple kernel learning (MKL) approach. In MKL, we consider a convex combination of $N$ kernel matrices:

$$ k(x_i, x_j) = \sum_{n=1}^{N} \beta_n k_n(x_i, x_j), \qquad \beta_n \ge 0, \quad \sum_{n=1}^{N} \beta_n = 1. $$

MKL searches for the optimal mixing coefficients $\beta_1, \beta_2, \ldots, \beta_N$, as well as the optimal values of the SVM dual variables. This requires solving a semidefinite program (Lanckriet et al., 2002), a quadratically constrained quadratic program (Lanckriet et al., 2004) when we normalize kernels so that $k_n(x_i, x_i) = 1$, or a quadratic program (Rakotomamonjy et al., 2007) with further modifications.
In our ADDITIVE approach (4), we fix the mixing coefficients to $\beta_n = 1/N$, in order to avoid storing and completing individual local kernel matrices. We could replace our SVM training with an MKL problem, and it might help identify unimportant nodes that could be excluded from future communication, but MKL would impose overhead in computation and communication which may not be affordable.
4.4 Theory of Matrix Completion
Matrix completion provides guarantees, under certain conditions, to recover the original full matrix using only a few entries from it. Here we introduce the idea following Candès and Recht (Candès and Recht, 2009; Candès and Recht, 2012).

Going back to the matrix completion problem (6), we have defined a matrix $M \in \mathbb{R}^{m\times m}$ with rank $r$, and a sample set $\Omega$ such that for $(i,j) \in \Omega$ the components $M_{ij}$ are known to us. The goal is to recover the rest of the matrix $M$. Let us consider the reduced singular value decomposition of $M$,

$$ M = U\Sigma V^T, \qquad U^T U = I, \quad V^T V = I, $$

where $\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix with singular values. The columns of $U \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{m\times r}$ compose orthonormal bases of $\mathcal{R}(M)$ and $\mathcal{R}(M^T)$, respectively, where $\mathcal{R}(X)$ denotes the range (column space) of a matrix $X$. Based on these, we define a measure called the coherence of $\mathcal{R}(M)$ (Candès and Recht, 2009):

Definition. For $M = U\Sigma V^T$, the coherence of $\mathcal{R}(M)$ is defined by

$$ \mathrm{co}(\mathcal{R}(M)) := \frac{m}{r} \max_{i=1,2,\ldots,m} \|UU^T e_i\|^2 \;\in\; [1, m/r], $$

where $e_i$ is the $i$th standard unit vector.
Here $UU^T$ is the projection matrix onto $\mathcal{R}(M)$. The coherence $\mathrm{co}(\mathcal{R}(M))$ measures the alignment between the range space of $M$ and the standard unit vectors. That is, the maximal coherence $m/r$ is achieved whenever $\mathcal{R}(M)$ contains any of the standard basis vectors $e_i$, $i = 1, 2, \ldots, m$. On the other hand, the coherence decreases as the basis vectors of $\mathcal{R}(M)$ become more like random vectors. For example, suppose that $U$ contains uniformly random orthonormal column vectors, i.e., the value of each entry is $O(1/\sqrt{m})$ in magnitude, satisfying $U^T U = I$. Then we have $\|UU^T e_i\|^2 = \|U^T e_i\|^2 = O(r/m)$ for any $i$, which gives the minimum coherence value, using the fact that $U^T U = I$ and $U^T e_i \in \mathbb{R}^r$. Repeating the same argument for $V$, we see that $M = U\Sigma V^T$ is likely to be a dense matrix if both $\mathrm{co}(\mathcal{R}(M))$ and $\mathrm{co}(\mathcal{R}(M^T))$ are small. That is, it becomes harder for many entries of $M$ to be zero, which is a necessary property for matrix completion so that recovery is possible from only a few sampled entries (otherwise the samples would contain many zero entries, which are non-informative).
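The quantities appearing in Theorem 4.1 below (and later reported in Table 2) can be computed directly from an eigendecomposition; a small sketch for a symmetric PSD kernel matrix, with the effective rank taken as the number of eigenvalues above a threshold as in Section 5.1:

```python
# Sketch: coherence co(R(M)) and max |[U V^T]_ij| for a symmetric PSD kernel matrix.
import numpy as np

def completion_stats(K, eig_tol=0.01):
    w, V = np.linalg.eigh(K)                   # eigendecomposition of symmetric K
    order = np.argsort(w)[::-1]
    r = int((w > eig_tol).sum())               # numerically effective rank
    U = V[:, order[:r]]                        # top-r eigenvectors span R(K)
    m = K.shape[0]
    coherence = (m / r) * np.max(np.sum(U ** 2, axis=1))  # (m/r) max_i ||U U^T e_i||^2
    max_uv = np.max(np.abs(U @ U.T))           # for symmetric PSD K the SVD has V = U
    return r, coherence, max_uv
```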
The next theorem states the required conditions on $M$ and the estimated size of the sample set $\Omega$, so that matrix completion will succeed with high probability.

Theorem 4.1 (Candès and Recht, 2009). For a matrix $M = U\Sigma V^T \in \mathbb{R}^{m\times m}$ of rank $r$, suppose that there exist constants $\delta_0 > 0$ and $\delta_1 > 0$ such that

(i) $\max\{\mathrm{co}(\mathcal{R}(M)), \mathrm{co}(\mathcal{R}(M^T))\} \le \delta_0$,
(ii) $\max_{i,j} |[UV^T]_{ij}| \le \delta_1 \sqrt{r}/m$.

If we sample $|\Omega|$ elements of $M$ uniformly at random, as many as

$$ |\Omega| \ge \psi \max\big(\delta_1^2, \delta_0^{0.5}\delta_1, \delta_0 m^{0.25}\big)\, m\, r\, (\beta\log m) $$

for some constants $\psi$ and $\beta > 2$, then the minimizer of the matrix completion problem (6) is unique and equal to the original $M$ with probability at least $1 - zm^{-\beta}$ for some constant $z$. If the rank is small, that is, $r \le m^{0.2}/\delta_0$, then the requirement reduces to

$$ |\Omega| \ge \psi\, \delta_0\, m^{1.2}\, r\, (\beta\log m). $$
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
120
A natural conjecture from this theorem is that Gaussian kernels would fit well for matrix completion, as they typically produce dense and numerically low-rank matrices (note that they are always full-rank in theory), whose entries are bounded above by 1. We use this theorem in the following section to check how well kernel matrices constructed from various data sets satisfy the required conditions for matrix completion, and how they affect the prediction performance of the resulting SVMs.
5 EXPERIMENTS
For experiments, we used five benchmark data sets
from the UCI machine learning repository (Bache and
Lichman, 2013), summarized in Table 1, and also
their subset composed of 5000 training and 5000 test
examples (denoted by 5k/5k) to study characteristics
of algorithms under various circumstances.
Table 1: Data sets and their training parameters. Different values of C were used for the full data sets (column C) and the smaller 5k/5k sets (column C(5k/5k)).

Name      m (train)   test     p       C     C(5k/5k)   γ
ADULT     40701       8141     124     10    10         0.001
MNIST     58100       11900    784     0.1   1162       0.01
CCAT      89702       11574    47237   100   156        1.0
IJCNN     113352      28339    22      1     2200       1.0
COVTYPE   464809      116203   54      10    10         1.0
For all experiments, we split the original input feature vectors into subvectors of almost equal lengths, one for each node, using N = 3 nodes (for the 5k/5k sets) and N = 10 nodes (for the full data sets). The tuning parameters C and γ were determined by cross validation for the full sets, and the C values for the 5k/5k subsets were determined on independent validation subsets, both with SVMLIGHT. The results of SVMLIGHT were included for comparison with non-distributed SVM training.
non-distributed SVM training. Following (Lee et al.,
2012), the local Gaussian kernel parameters for AD-
DITIVE were adjusted to γ
n
=
p
p
n
Nγ for a given γ,
so that γ
n
kx
i
[n] x
j
[n]k will have the same order of
magnitude O(γp) as γkx
i
x
j
k.
Throughout the experiments, we imposed that if $(i,j) \in \Omega$ then $(j,i) \in \Omega$ as well, and $M_{ij} = M_{ji}$.
5.1 Characteristics of Kernel Matrices
The first set of experiments verifies how well kernel matrices fit matrix completion. For this, we computed the two types of exact kernel matrices defined in (2), MULTIPLICATIVE and ADDITIVE, accessing all features of the small 5k/5k subsets of the five UCI data sets (the MULTIPLICATIVE kernels were equivalent to the usual Gaussian kernels).
The important characteristics of the kernel matrices with respect to matrix completion are the rank (r), the coherence ($\delta_0 \in [1, m/r]$), and the maximal value of $|[UV^T]_{ij}|$ (where U and V are the left and right factors from the singular value decomposition), as discussed in Theorem 4.1. When $\delta_0$ is close to its smallest value of one, and $|[UV^T]_{ij}|$ is bounded above by a small value, then matrix completion becomes well-posed. Further, if the rank is small as well, then the theorem indicates that we can recover the original matrix from even fewer samples.
Table 2 summarizes these characteristics. Clearly, the rank (the numerically effective rank, counting eigenvalues larger than a threshold of 0.01) and the coherence values were much smaller in the case of ADDITIVE, indicating potential benefits of using this approach compared to MULTIPLICATIVE. All values of $|[UV^T]_{ij}|$ appeared to be small, especially for the ADDITIVE kernels of ADULT, IJCNN, and COVTYPE. The kernel matrices of these three sets also had much lower ranks than the rest. For MNIST and CCAT, the numbers hinted that matrix completion would suffer from difficulties unless the sample size $|\Omega|$ was large.
5.2 The Effect of Sampling Size
Next, we used the 5k/5k data sets to investigate how the prediction performance of the SVMs changed over several different sizes of the sample set $\Omega$. We define the sampling ratio as

$$ \text{Sampling Ratio} := |\Omega| / m^2, $$

where the value of m is 5000 in this section. We compared the prediction performance of MULTIPLICATIVE and ADDITIVE to that of SVMLIGHT.
Figure 2 illustrates the test accuracy values for five sampling ratios of up to 10%. The statistics are over N = 3 nodes and over random selections of $\Omega$. The performance on ADULT, IJCNN, and COVTYPE was close to that of SVMLIGHT, and it kept increasing with the growth of $|\Omega|$. This behavior was expected from the previous section, as their kernel matrices had good conditions for matrix completion. On the other hand, the performance on MNIST and CCAT was far inferior to that of SVMLIGHT, as also expected.
The bottom-right corner of Figure 2 shows the concentration of the eigenvalue spectrum in the five kernel matrices. The height of each box represents the magnitude of the corresponding normalized eigenvalue, so that the height of a stack of boxes represents the proportion of the entire spectrum concentrated in the top 10 eigenvalues.
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
121
Table 2: The density, rank, coherence ($\delta_0$), and maximal values of $|[UV^T]_{ij}|$ of the kernel matrices. Effective ranks are shown, counting eigenvalues larger than a threshold (0.01). Coherence values are bounded by $1 \le \delta_0 \le m/r$. Smaller values of $\delta_0$, r, and $\max|[UV^T]_{ij}|$ are indicative of better conditions for matrix completion.

                       MULTIPLICATIVE                            ADDITIVE
          density   r      δ_0    m/r    max|[UV^T]_ij|   density   r      δ_0    m/r     max|[UV^T]_ij|
ADULT     1.0       789    5.54   6.34   0.87             1.0       222    8.32   22.52   0.37
MNIST     1.0       4782   1.03   1.05   0.99             1.0       4568   1.07   1.10    0.98
CCAT      1.0       4984   1.00   1.00   1.00             1.0       4982   1.00   1.00    1.00
IJCNN     1.0       1516   3.19   3.30   0.97             1.0       698    1.75   7.16    0.25
COVTYPE   1.0       1423   3.32   3.51   0.95             1.0       424    1.56   11.79   0.13
Figure 2: Prediction accuracy on the test sets for the 5k/5k subsets of the five UCI data sets (ADULT, MNIST, CCAT, IJCNN, COVTYPE), over different sampling ratios ($|\Omega|/m^2$) in kernel completion. The average and standard deviation over multiple trials with random $\Omega$ and N = 3 nodes are shown. The bottom-right plot illustrates the proportion of the entire eigen-spectrum concentrated in the top ten eigenvalues.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
122
Table 3: Test prediction performance on the full data sets (mean and standard deviation). Two sampling ratios (2% and 10%) are tried for our method. The SVMLIGHT results are from using the classical Gaussian kernels with matching parameters. $|\bigcup_n S_n|/m$ is the fraction of the union support vector set to its corresponding training set.

          |∪_n S_n|/m   ADDITIVE (2%)   ADDITIVE (10%)   ASSET         SVMLIGHT
ADULT     0.61          81.4 ± 1.00     84.2 ± 0.18      80.0 ± 0.02   84.9
MNIST     0.99          78.9 ± 1.69     87.0 ± 0.20      88.9 ± 0.39   98.9
CCAT      0.84          87.2 ± 1.00     92.0 ± 0.35      73.7 ± 1.00   95.8
IJCNN     0.56          96.0 ± 0.35     96.5 ± 0.23      90.9 ± 0.88   99.3
The plot shows that 90% of the spectrum in ADULT is concentrated in the top 10 eigenvalues, indicating that its kernel matrix has a very small numerically effective rank. This would be the reason why our method performed as well as SVMLIGHT for ADULT.
Comparing MULTIPLICATIVE to ADDITIVE, both showed similar prediction performance. However, the higher concentration of the eigen-spectrum of ADDITIVE indicated that it would make a good alternative to MULTIPLICATIVE, also considering the extra saving with ADDITIVE discussed in Section 3.3.
5.3 Performance on Full Data Sets
In the last experiment, we used the full data sets to compare our method to one of the closely related approaches, ASSET (Lee et al., 2012), introduced in Section 4. Since ASSET admits only ADDITIVE kernels, we omitted MULTIPLICATIVE from the comparison. Among the several versions of ASSET in (Lee et al., 2012), we used the "Separate" version with central optimization. COVTYPE was excluded due to the extra-long runtimes of SVMLIGHT and our method.
The results are in Table 3. The second column shows the ratio between a union SV set and an entire training set. The square of these numbers indicates the saving we achieved by the union SV trick; for example, the size of the matrix is reduced to 37% of the original size for ADULT. The saving was substantial for ADULT and IJCNN. In terms of prediction performance, we achieved test accuracy approaching that of SVMLIGHT (within 1% point (ADULT), 3.8% points (CCAT), and 2.8% points (IJCNN) on average) with a 10% sampling ratio, except in the case of MNIST, where the gap was significantly larger (11.9%): this result was consistent with the discussion in Sections 5.1 and 5.2. Our method (with 10% sampling) also outperformed ASSET (by 4.2%, 18.3%, and 5.6% on average for ADULT, CCAT, and IJCNN, respectively) except in the case of MNIST, with a small but not negligible margin (1.9%). We conjecture that the approximation of the kernel mapping in ASSET fitted particularly well for MNIST, but it remains to be investigated further.
6 CONCLUSIONS
We have proposed a simple algorithm for learning
consensus SVMs in sensor stations connected with
band-limited communication channels. Our method
makes use of decompositions of kernels, together
with kernel completion to approximate unobserved
entries of remote kernel matrices. The resulting
SVMs performed well with relatively small numbers
of sampled entries, when kernel matrices satisfied re-
quired conditions. A property of support vectors also
helped us further reduce computational cost.
Using matrix completion, there is no need to iden-
tify and execute an optimal sampling strategy to have
similar performance guarantees. Although sample
complexity could be reduced by a small factor by
identifying specific sample sets for a given situa-
tion, such sets will depend on network topology and
cost/noise models, perhaps with the need for central
coordination.
Several aspects of our method remain to be inves-
tigated further. First, different types of kernels may
involve different types of decomposition, having dis-
similar characteristics in terms of matrix completion.
Second, although parameters of SVMs can be tuned
using small aggregated data, it would be desirable to
tune parameters locally, or to consider parameter-free
methods instead of SVMs. Also, despite the bene-
fits of the ADDITIVE kernel, it requires more kernel
parameters to be specified compared to the MULTI-
PLICATIVE kernel. Therefore when the budget for pa-
rameter tuning is limited, MULTIPLICATIVE would be
preferred to ADDITIVE. Finally, it would be worth-
while to analyze the characteristics of the suggested
algorithm in real communication systems to make it
more practical, considering non-uniform communica-
tion cost, for instance.
Considering kernel completion in the context of privacy-preserving learning would be an interesting direction, if the number of entries required for kernel completion to build a good classifier is smaller than the number required to recover private information, or if we can make kernel completion fail unless one has the right credentials, possibly by tweaking the coherence
KernelCompletionforLearningConsensusSupportVectorMachinesinBandwidth-limitedSensorNetworks
123
of kernel matrices.
ACKNOWLEDGEMENTS
The authors acknowledge the support of Deutsche
Forschungsgemeinschaft (DFG) within the Collabo-
rative Research Center SFB 876 “Providing Informa-
tion by Resource-Constrained Analysis”, projects A1
and C1.
REFERENCES
Bache, K. and Lichman, M. (2013). UCI machine learning
repository.
Bertsekas, D. P. and Tsitsiklis, J. N. (1997). Parallel
and Distributed Computation: Numerical Methods.
Athena Scientific.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A
training algorithm for optimal margin classifiers. In
Proceedings of the fifth Annual Workshop on Compu-
tational Learning Theory, pages 144–152.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.
(2011). Distributed optimization and statistical learn-
ing via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–
122.
Candès, E. J. and Recht, B. (2009). Exact matrix comple-
tion via convex optimization. Foundations of Compu-
tational Mathematics, 9(6):717–772.
Candès, E. J. and Recht, B. (2012). Exact matrix comple-
tion via convex optimization. Communications of the
ACM, 55(6):111–119.
Crammer, K., Dredze, M., and Pereira, F. (2012).
Confidence-weighted linear classification for natural
language processing. Journal of Machine Learning
Research, 13:1891–1926.
Forero, P. A., Cano, A., and Giannakis, G. B. (2010).
Consensus-based distributed support vector machines.
Journal of Machine Learning Research, 11:1663–
1707.
Ji, Y. and Sun, S. (2013). Multitask multiclass support vec-
tor machines: Model and experiments. Pattern Recog-
nition, 46(3):914–924.
Joachims, T. (1999). Making large-scale support vector ma-
chine learning practical. In Schölkopf, B., Burges, C.,
and Smola, A., editors, Advances in Kernel Methods -
Support Vector Learning, chapter 11, pages 169–184.
MIT Press, Cambridge, MA.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. I. (2002). Learning the kernel matrix with semidefinite programming. In Proceedings of the 19th International Conference on Machine Learning.
Lanckriet, G. R. G., De Bie, T., Cristianini, N., Jor-
dan, M. I., and Noble, W. S. (2004). A statistical
framework for genomic data fusion. Bioinformatics,
20(16):2626–2635.
Lee, S. and Bockermann, C. (2011). Scalable stochastic
gradient descent with improved confidence. In Big
Learning – Algorithms, Systems, and Tools for Learn-
ing at Scale, NIPS Workshop.
Lee, S., Stolpe, M., and Morik, K. (2012). Separable ap-
proximate optimization of support vector machines
for distributed sensing. In Flach, P., Bie, T., and
Cristianini, N., editors, Machine Learning and Knowl-
edge Discovery in Databases, volume 7524 of Lecture
Notes in Computer Science, pages 387–402. Springer.
Lippi, M., Bertini, M., and Frasconi, P. (2010). Collective
traffic forecasting. In Proceedings of the 2010 Euro-
pean conference on Machine learning and knowledge
discovery in databases: Part II, pages 259–273.
Morik, K., Bhaduri, K., and Kargupta, H. (2012). Intro-
duction to data mining for sustainability. Data Mining
and Knowledge Discovery, 24(2):311–324.
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y.
(2007). More efficiency in multiple kernel learning. In
Proceedings of the 24th International Conference on
Machine Learning.
Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaran-
teed minimum-rank solutions of linear matrix equa-
tions via nuclear norm minimization. SIAM Review,
52(3):471–501.
Recht, B. and Ré, C. (2011). Parallel stochastic gradient al-
gorithms for large-scale matrix completion. Technical
report, University of Wisconsin-Madison.
Schoenberg, I. J. (1938). Metric spaces and positive definite
functions. Transactions of the American Mathemati-
cal Society, 44(3):522–536.
Schölkopf, B. and Smola, A. J. (2001). Learning with Ker-
nels: Support Vector Machines, Regularization, Opti-
mization, and Beyond. MIT Press, Cambridge, MA,
USA.
Shawe-Taylor, J. and Sun, S. (2011). A review of optimiza-
tion methodologies in support vector machines. Neu-
rocomputing, 74(17):3609–3618.
Stolpe, M., Bhaduri, K., Das, K., and Morik, K. (2013).
Anomaly detection in vertically partitioned data by
distributed core vector machines. In Machine Learn-
ing and Knowledge Discovery in Databases - Euro-
pean Conference, ECML PKDD 2013.
Whittaker, J., Garside, S., and Lindveld, K. (1997). Track-
ing and predicting a network traffic process. Interna-
tional Journal of Forecasting, 13(1):51–61.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
124