A Recursive Approach For Multiclass Support Vector Machine

Application to Automatic Classiﬁcation of Endomicroscopic Videos

Alexis Zubiolo

, Gr

egoire Malandain

, Barbara Andr

and

Eric Debreuve

Team Morpheme (Lab I3S/Inria SA-M/iBV), University of Nice-Sophia Antipolis/CNRS/Inria, Sophia Antipolis, France

Mauna Kea Technologies, 9 rue d’Enghien, 75010 Paris, France

Keywords:

Multiclass classiﬁcation, Supervised Learning, Hierarchical Approach, Graph Minimum-cut, Support Vector

Machine (SVM).

Abstract:

The two classical steps of image or video classiﬁcation are: image signature extraction and assignment of a

class based on this image signature. The class assignment rule can be learned from a training set composed

of sample images manually classiﬁed by experts. This is known as supervised statistical learning. The well-

known Support Vector Machine (SVM) learning method was designed for two classes. Among the proposed

extensions to multiclass (three classes or more), the one-versus-one and one-versus-all approaches are the most

popular ones. This work presents an alternative approach to extending the original SVM method to multiclass.

A tree of SVMs is built using a recursive learning strategy, achieving a linear worst-case complexity in terms

of number of classes for classiﬁcation. During learning, at each node of the tree, a bi-partition of the current

set of classes is determined to optimally separate the current classiﬁcation problem into two sub-problems.

Rather than relying on an exhaustive search among all possible subsets of classes, the partition is obtained by

building a graph representing the current problem and looking for a minimum cut of it. The proposed method

is applied to classiﬁcation of endomicroscopic videos and compared to classical multiclass approaches.

1 INTRODUCTION

The problem of automatic image (or video, or object)

classiﬁcation is to ﬁnd a function that maps an im-

age to a class or category among a number of pre-

deﬁned classes. An image can be viewed as a vec-

tor of high-dimension. In practice, it is preferable

to deal with a synthetic signature of lower dimen-

sion. Therefore, the two classical steps of image clas-

siﬁcation are: image signature extraction (Oliva and

Torralba, 2001) and signature-based image classiﬁca-

tion (Cortes and Vapnik, 1995). (Note that the signa-

ture extraction step can itself be the result of a learn-

ing process (Sivic and Zisserman, 2003).) The clas-

siﬁcation rule can be learned from a set of training

sample images manually classiﬁed by experts. This

is known as supervised statistical learning where sta-

tistical refers to the use of samples and supervised

refers to the sample classes being provided. In this

paper, we are interested in the learning aspect of the

multiclass

problem when using a binary classiﬁca-

tion approach as a building block. The Support Vec-

Traditionally in classiﬁcation, multiclass means “three

classes or more” while the two-class case is referred to as

binary classiﬁcation.

tor Machine (SVM) (Cortes and Vapnik, 1995) is a

well-known binary classiﬁer that will be used in the

following. In its original, linear version, it makes it

possible to separate two sets of d-dimensional sam-

ples (the training sample signatures for two classes)

by ﬁnding a particular hyperplane, i.e., by determin-

ing its two parameters b ∈ R and w ∈ R

. It does

so by maximizing the half-margin 1/

which rep-

resents the distance to the nearest training sample of

any class. Whenever the two classes are not linearly

separable, the soft margin strategy can be used to ac-

count for misclassiﬁcations and the kernel trick can be

applied to extend the SVM method to nonlinear sep-

aration (Scholkopf and Smola, 2001). The parameter

tuning the soft margin tolerance is usually denoted by

C (see Fig. 3).

Among the proposed extensions of binary clas-

siﬁcation methods (such as the SVM) to multiclass

(three classes or more), the one-versus-one and one-

versus-all approaches are the most popular ones. Let

us suppose that there are p ≥ 3 classes. The idea

of the one-versus-all strategy is to oppose to any of

the classes the union of the remaining p − 1 classes.

Then, p SVM classiﬁers are determined, each one

441

Zubiolo A., Malandain G., André B. and Debreuve É..

A Recursive Approach For Multiclass Support Vector Machine - Application to Automatic Classiﬁcation of Endomicroscopic Videos.

DOI: 10.5220/0004654704410447

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 441-447

ISBN: 978-989-758-003-1

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

scoring, say, positively for one of the classes. To clas-

sify a new image, its signature is tested against all the

SVMs and it is assigned to the class with the high-

est score (largest distance to the SVM hyperplane).

The one-versus-one strategy opposes the classes by

pair for all possible pairs. Therefore,

p(p−1)

SVMs

are determined. For classiﬁcation, a new image sig-

nature is tested against all the SVMs, each SVM

votes in favor of one of the two classes it corresponds

to, and the image is assigned to the highest voted

class. Other methods also learn all the pairwise SVMs

(as for one-versus-one) but use a different scheme

to predict the class during the classiﬁcation step. It

is the case for the Decision Directed Acyclic Graph

(DDAG (Platt et al., 2000)) and the Adaptive Directed

Acyclic Graph (ADAG (Kijsirikul et al., 2002)) meth-

ods.

As an alternative to these aforementioned strate-

gies, hierarchical methods can be designed. For ex-

ample, the work done in (Tibshirani and Hastie, 2006)

applies clustering techniques to the different classes

and considers the widths of the one-versus-one SVM

margins to deﬁne linkage criteria. This paper presents

a recursive learning strategy to extend a binary classi-

ﬁcation method to multiclass. A tree of SVMs is built

using a recursive learning strategy in such a way that

a linear worst-case complexity is achieved for clas-

siﬁcation. During learning, at each node of the tree,

a bi-partition of the set of classes is found to deter-

mine an optimal separation of the current classiﬁca-

tion problem into two sub-problems. This decision re-

lies on building a graph representing the current prob-

lem and looking for a minimum cut of it. The pro-

posed method is applied to classiﬁcation of endomi-

croscopic videos and compared to classical multiclass

approaches.

2 WHY A RECURSIVE

STRATEGY?

2.1 Motivations

When learning is performed ofﬂine (as described

in the present context), it is interesting to design a

method with a low classiﬁcation complexity, even if

we have to pay the price of a high learning complex-

ity for it. The classiﬁcation complexities (in terms of

number of classes) of the one-versus-one and the one-

versus-all strategies are quadratic and linear, respec-

tively. When thinking about a complexity lower than

linear, the logarithmic one comes to mind. Recur-

sive (or, equivalently, hierarchical) approaches natu-

rally lead to such performances. Hence, we propose

to decompose the original multiclass problem with p

classes into two sub-problems (of “similar size”, ide-

ally), i.e., involving q

and q

classes, respectively,

with q

+ q

= p. Let us denote by virtual class the

union of classes involved in a sub-problem. Decid-

ing which virtual class a given signature belongs to is

a classical binary classiﬁcation. Then, as long as the

sub-problems involve three classes or more, they can

be further decomposed into smaller sub-problems.

The question is thus to optimally decompose a given

p-class problem, p ≥ 3, into two sub-problems (see

Section 3.1).

Another motivation for such a recursive approach

is the fair balance between the sub-problems. In-

deed, as already mentioned, the two virtual classes

resulting from the decomposition of a p-class prob-

lem should each gather the same (or almost the same)

number of classes, ideally. If all the classes have

roughly the same number of training signatures, so

will have the virtual classes. It is certainly desirable

for the determination of a reliable binary classiﬁcation

rule, as opposed to the case where one virtual class

contains much less samples than the other one. This

fair balance property also holds for the one-versus-

one strategy (unfortunately, as already mentioned, it

has a quadratic classiﬁcation complexity). However,

it does not for the one-versus-all strategy which relies

on virtual classes gathering either one class or p − 1

classes.

Finally, with the proposed recursive approach, the

successive binary classiﬁcations into virtual classes

progressively narrow the classiﬁcation decision down

to the assignment of a unique label among the prede-

ﬁned classes. The one-versus-one and one-versus-all

approaches do not exhibit such a coherence since sev-

eral predeﬁned classes can receive votes when test-

ing a signature against the different SVMs. The ﬁnal

classiﬁcation decision must deal with competing par-

tial decisions. Although the practical solutions

make

sense, the principle is not fully satisfying. With the

one-versus-one strategy, for a signature belonging to,

say, class i, it can be further noted that all the SMVs

learned to distinguish between class j and class k,

j 6= i and k 6= i, will be used to decide whether the sig-

nature belongs to class j or class k, and these uninfor-

mative partial decisions will be accounted for in the

ﬁnal decision. This is know as the non-competence

problem.

Maximum number of votes for one-versus-one or max-

imum positive score for one-versus-all.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

442

2.2 Optimizing the Classiﬁcation

Complexity Alone

Let us suppose that there are p ≥ 3 predeﬁned classes.

Let us recall that a q-subset is a set containing q ele-

ments taken from a larger set containing p elements.

Let us remind that the number of q-subsets is given

by the binomial coefﬁcient





As mentioned in Section 2.1, the general idea

is to decompose a p-class problem into a q

-class

sub-problem and a q

-class sub-problem where q

= p, and continuing recursively with the two sub-

problems. Each decomposition relies on a binary

classiﬁer separating q

classes for the q

other ones.

Thus, a tree of binary classiﬁer is built. For classiﬁ-

cation, this tree has to be traversed from the root to a

leaf, following a branch depending on the responses

of the node classiﬁers. To optimize the classiﬁca-

tion complexity in terms of the number of classiﬁers

that are tested, the tree must be of minimal depth or,

equivalently, as close as possible to a perfect binary

tree. This is achieved by enforcing the following con-

straints:



+ q

= p (partition constraint)

− q

| ≤ 1 (balance constraint)

. (1)

The classiﬁcation complexity is then equal to the tree

depth, i.e. log

(p). However, the limiting factor of

such an approach is the combinatorial learning com-

plexity in terms of the number of binary classiﬁers

that must be determined at each node of the tree. Sec-

tion 3 proposes another strategy involving graph the-

ory in order to overcome this issue.

3 PROPOSED METHOD: GRAPH

CUT BASED SVM TREE

(GC-SVM)

3.1 Trade-off between Learning and

Classiﬁcation Complexities

The following description is valid for any binary clas-

siﬁer framework. However, this paper focuses on the

SVM example. First, similarly to the one-versus-

one approach, we compute the SVMs between each

pair of classes among the p ≥ 3 predeﬁned classes.

Now, following the recursive strategy described in

Sections 2.2 and 2.1, let us assume that we are about

to deal with a node containing more than three classes

(originally, the root node contains all p classes).

The idea is to use graph theory tools to determine

a bi-partition of this set of classes. This bi-partition

will not necessarily be balanced (i.e. the second con-

dition of Eq. 1 will no longer be taken into account).

A labeled, weighted, undirected, complete graph G is

built such that:

• Each node represents a class (i.e., node i ≡ class

i);

• The weight c

i j

of the edge linking nodes i and

j is equal to the inverse of the margin of the

SVM computed between classes i and j, i.e., c

i j



i j



The minimum cut of G will result in a bi-partition

of the nodes such that the sum of the weights of

the edges cut is minimal. It is computed using

Stoer-Wagner’s min-cut algorithm (Stoer and Wag-

ner, 1997). Thanks to the chosen graph deﬁnition,

this will also correspond to separating the classes of

pairs that have a large SVM margin. Hence, the two

virtual classes, union of the classes on each side of the

partition, will tend to have a large margin too. We just

have to compute the corresponding SVM and assign

it to the current node. Therefore, for the whole tree

building, p(p − 1)/2 SVMs are computed ﬁrst, then

additional SVMs are computed for each non-leaf tree

node. There are p − 1 such nodes. Actually, for the

nodes that are parents of two leaves, the SVM needs

not be computed since it corresponds to one of the

SVMs ﬁrst computed for each pair of classes. There-

fore, there are at most p − 1 additional SVMs to be

computed. As a result, O(p

) SVMs are computed in

total. The proposed learning method is described in

algorithm 1. It calls the procedure MINCUT which is

entirely deﬁned in (Stoer and Wagner, 1997).

Once the tree T of SVMs and virtual classes has

been built, the classiﬁcation of a new image signature

is simply performed by following a unique branch

based on the successive decisions of the nodes’

SVMs. The branch leaf contains the label of the iden-

tiﬁed predeﬁned class. Figure 1 presents an instance

of SVM tree for ﬁve classes labeled 1 to 5.

{1, 2, 4}, {3, 5}, SVM

{1, 4}, {2}, SVM

{1}, {4}, SVM

1 4

{3}, {5}, SVM

3 5

Figure 1: Type of tree that the proposed method builds dur-

ing the learning stage. An example of classiﬁcation of a

new image signature is also illustrated by showing the vis-

ited nodes in boldface (read from root to leaf).

ARecursiveApproachForMulticlassSupportVectorMachine-ApplicationtoAutomaticClassificationof

EndomicroscopicVideos

443

Algorithm 1: The GC-SVM algorithm.

1: procedure GCSVM(L, p)

2: T ← empty tree

3: if p == 1 then

4: Create a node with label l

and add it to T

5: else if p == 2 then

6: Create two nodes with labels l

and l

and

add them to T

7: else

8: a ← an arbitrary node of G

9: (L

) = MINCUT(G,a)

10: p

← the number of classes in L

11: p

← the number of classes in L

12: T

= GCSVM(L

, p

)

13: T

= GCSVM(L

, p

)

14: let T

be the left child of T

15: let T

be the right child of T

16: end if

17: return T

18: end procedure

3.2 Complexity

Let us start with the learning step.

Proposition 1. The number of SVM computed during

the learning step is O(p

Proof. The number of SVM computed is equal to the

sum of:

• The number of binary SVM we compute to build

the graph O(p

• The number of nodes in the graph built, i.e. O(p).

As a result, the total number of SVM computed is

O(p

As for the classiﬁcation step, the number of SVM

used depends on the way the binary tree is built: it

is equal to the depth of the tree. This is the reason

why we distinguish the worst-case complexity and the

best-case complexity.

Proposition 2 (Worst-case complexity for classiﬁca-

tion step). The worst-case complexity for the classiﬁ-

cation step is O(p).

Proof. The worst-case scenario happens when the

tree built is a degenerate tree, i.e. when for each par-

ent node, there is only one associated child node. In

this case, the depth of the tree is O(p). So is the com-

plexity.

Proposition 3 (Best-case complexity for classiﬁca-

tion step). The best-case complexity for the classiﬁ-

cation step is O(log(p)).

Proof. The best-case happens when the tree is bal-

anced, i.e. when for each parent node, there are two

associated child nodes whenever it is possible. In such

a case, the depth is logarithmic, and as a result the

complexity is O(log(p)).

Let us compare the methods above-mentionned

with the most commonly used algorithms. Table 1

presents the complexities in terms of number of

classes for both learning (ofﬂine task performed only

once) and classiﬁcation (performed on-demand), for

one-versus-one, one-versus-all, the Directed Acyclic

Graph SVM method (DAGSVM) (Platt et al., 2000),

and the proposed methods.

The proposed method offers a lower classiﬁcation

complexity and still a reasonable learning complex-

ity. Moreover, since testing a decomposition means

learning a binary classiﬁer, and since the computer

time needed for a learning depends on the number

of training samples, a straightforward way to reduce

the learning time is to subsample cleverly the training

set (Bakir et al., 2005).

4 EXPERIMENTAL RESULTS

The proposed method has been implemented in Mat-

lab using the SVM-KM (Support Vector Machine and

Kernel Methods) toolbox (Canu et al., 2005) which

implements the binary SVM classiﬁer as well as the

one-versus-one and one-versus-all strategies. It has

been applied to automatic classiﬁcation of endomi-

croscopic videos.

The medical database to which we apply the

classiﬁcation methods contains 116 endomicroscopic

videos, each of them showing a colonic polyp in vivo

at the cellular level. These videos were acquired at the

Mayo Clinic in Jacksonville during endoscopy proce-

dures on 65 patients, using a technology called probe-

based Confocal Laser Endomicroscopy (pCLE) de-

veloped by Mauna Kea Technologies. Each video is

assigned to a pathological class which is the histo-

logical diagnosis established by an expert pathologist

from a biopsy on the imaged polyp. The 5 patho-

logical classes are: purely benign (14 videos), hy-

perplastic (21 videos), tubular adenoma (62 videos),

tubulovillous adenoma (15 videos) and adenocarci-

noma (4 videos) (See Fig. 2). In (Andr

e et al., 2011),

a bag-of-visual-words method was proposed to build

the visual signatures of these videos, based on mosaic

images associated with stable video subsequences. To

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

444

Purely benign Hyperplastic Tubular adenoma Tubulovillous adenoma

Adenocarcinoma

Figure 2: Examples of annotated endomicroscopic videos.

0 2 4 6 8 10 12 14 16 18 20

0.52

0.54

0.56

0.58

0.6

0.62

0.64

0.66

0.68

0.7

log

accuracy

One vs. One

One vs. All

DAGSVM

GC−SVM

(a) Gaussian radial basis function kernel : k(x

) = exp





− x





0 2 4 6 8 10 12 14 16 18 20

0.52

0.54

0.56

0.58

0.6

0.62

0.64

0.66

0.68

0.7

log

accuracy

One vs. One

One vs. All

DAGSVM

GC−SVM

(b) Polynomial kernel : k(x

) =





+ 1



Figure 3: Accuracies (vertical axis) as functions of the parameter log

the proposed methods. Let us remind that C is the soft margin parameter (see Section 1).

ARecursiveApproachForMulticlassSupportVectorMachine-ApplicationtoAutomaticClassificationof

EndomicroscopicVideos

445

Table 1: Number of SVMs computed for p classes.

Method Learning Classiﬁcation

One-versus-one O(p

) O(p

)

One-versus-all O(p) O(p)

DAGSVM O(p

) O(p)

GC-SVM O(p

) Between O(log

(p)) and O(p)

Table 2: Best average accuracies for the method (Andr

e et al., 2011) and the multiclass SVM methods with their own optimal

value of C (see Fig. 3).

Method Best accuracy

(Andr

e et al., 2011) with adapted signatures 62.9%

One-versus-one 63.5%

One-versus-all 65.0%

DAGSVM 62.7%

GC-SVM (proposed) 67.1%

make the videos signatures easily exploitable with-

out learning bias, we adapted this signature extraction

method by considering as visual words 100 descrip-

tion vectors randomly selected in 5 endomicroscopic

videos of colonic polyps that do not belong to the

database. Because the size of the database is relatively

small, the classiﬁcation methods were applied to the

adapted video signatures using leave-one-patient-out

cross validation (Andr

e et al., 2011). For multiclass

classiﬁcation method comparison, we added, as ref-

erence method for endomicroscopic video classiﬁca-

tion, the k-Nearest Neighbors (k-NN) classiﬁcation

method of (Andr

e et al., 2011) that uses a weighted

majority vote based on the χ

similarity distance be-

tween the adapted video signatures. The video classi-

ﬁcation results are given in Fig. 3 and Table 2.

In this experiment, the proposed method performs

the best despite having a lower classiﬁcation com-

plexity. On Fig 3, it can be noted that the range of

good values for the parameter C is roughly the same

for all four SVM-based methods. However, above

this range, the accuracy signiﬁcantly drops for all the

methods.

5 CONCLUSION AND

PERSPECTIVES

The results of Section 4 are encouraging since the

classiﬁcation of the GC-SVM algorithm slightly out-

performs the standard methods on our dataset, while

having a lower classiﬁcation complexity (see Table 1).

However, as shown in Figure 3, the accuracy of the

classiﬁcation depends on various parameters:

• The choice of the kernel,

• The parameters deﬁning the kernel (degree of the

polynomial, variance of the Gaussian radial basis

function, . . . ),

• The soft margin parameter, C.

This issue affects all the multiclass SVM methods

stated in this paper. In most cases, the choice of the

kernel and its parameters is left to the user or com-

puted through a cross validation. The kernel and its

parameters could either be computed automatically or

learned (Cortes et al., 2008) to avoid a cross valida-

tion.

Another perspective which could be taken into ac-

count is to change the capacity c of the min-cut al-

gorithm (Stoer and Wagner, 1997). As a matter of

fact, we could try to have the binary tree built by the

GCSVM algorithm as balanced as possible in order to

lower its depth and consequently the number of SVM

used during the classiﬁcation step (see Section 3.2).

This can be done by deﬁning an energy term e equal

to the sum of the capacity c and an “imbalance term”

as suggested in (Dell’Amico and Trubian, 1998). If

The cut of the graph is (A,

A), then a possible deﬁni-

tion of this energy term could be

e(A,

A) = c(A,

A) + λ



|A| − |



where λ is the “imbalance factor”, which depends on

how hard we want the tree to be balanced.

Finally, the method should be tested on bigger

data sets, particularly composed of more classes, in

order to better evaluate the performance gain com-

pared to methods with linear and quadratic complexi-

ties.

REFERENCES

Andr

e, B., Vercauteren, T., Buchner, A. M., Wallace, M. B.,

and Ayache, N. (2011). A Smart Atlas for Endomi-

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

446

croscopy using Automated Video Retrieval. Medical

Image Analysis, 15(4):460–476.

Bakir, G. H., Planck, M., Bottou, L., and Weston, J. (2005).

Breaking svm complexity with cross training. In In

Proceedings of the 17 th Neural Information Process-

ing Systems Conference.

Canu, S., Grandvalet, Y., Guigue, V., and Rakotomamonjy,

A. (2005). Svm and kernel methods matlab toolbox.

Perception Syst

emes et Information, INSA de Rouen,

Rouen, France.

Cortes, C., Mohri, M., and Rostamizadeh, A. (2008). Learn-

ing sequence kernels. Machine Learning for Signal

Processing.

Cortes, C. and Vapnik, V. (1995). Support-vector networks.

In Machine Learning, pages 273–297.

Dell’Amico, M. and Trubian, M. (1998). Solution of large

weighted equicut problems. European Journal of Op-

erational Research, 106(2-3):500–521.

Kijsirikul, B., Ussivakul, N., and Meknavin, S. (2002).

Adaptive directed acyclic graphs for multiclass clas-

siﬁcation. In Proceedings of the 7th Paciﬁc Rim Inter-

national Conference on Artiﬁcial Intelligence: Trends

in Artiﬁcial Intelligence, PRICAI ’02, pages 158–168,

London, UK, UK. Springer-Verlag.

Oliva, A. and Torralba, A. (2001). Modeling the shape

of the scene: A holistic representation of the spatial

envelope. International Journal of Computer Vision,

42:145–175.

Platt, J. C., Cristianini, N., and Shawe-taylor, J. (2000).

Large margin dags for multiclass classiﬁcation. In Ad-

vances in Neural Information Processing Systems 12,

pages 547–553.

Scholkopf, B. and Smola, A. J. (2001). Learning with Ker-

nels: Support Vector Machines, Regularization, Opti-

mization, and Beyond. MIT Press, Cambridge, MA,

USA.

Sivic, J. and Zisserman, A. (2003). Video Google: A text

retrieval approach to object matching in videos. In

Proceedings of the International Conference on Com-

puter Vision, volume 2, pages 1470–1477.

Stoer, M. and Wagner, F. (1997). A simple min-cut algo-

rithm. Journal of the ACM, 44(4):585–591.

Tibshirani, R. and Hastie, T. (2006). Margin trees for high-

dimensional classiﬁcation. Journal of Machine Learn-

ing Research, 8:2007.

ARecursiveApproachForMulticlassSupportVectorMachine-ApplicationtoAutomaticClassificationof

EndomicroscopicVideos

447