IMAGE MATTING USING SVM AND NEIGHBORING
INFORMATION
Tadaaki Hosaka†, Takumi Kobayashi† and Nobuyuki Otsu†‡
†National Institute of Advanced Industrial Science and Technology, 1-1-1 Umezono, Tsukuba, Japan
‡University of Tokyo, 7-3-1 Hongo, Tokyo, Japan
Keywords:
Image matting, Markov random field, support vector machine, belief propagation.
Abstract:
Image matting is a technique for extracting a foreground object in a static image by estimating the opacity
at each pixel in the foreground image layer. This problem has recently been studied in the framework of
optimizing a cost function. The common drawback of previous approaches is the decrease in performance
when the foreground and background contain similar colors. To solve this problem, we propose a cost function
considering not only a single pixel but also its neighboring pixels, and utilizing the SVM classifier to enhance
the discrimination between the foreground and background. Optimization of the cost function can be achieved
by belief propagation. Experimental results show favorable matting performance for many images.
1 INTRODUCTION
1.1 Image Matting
Image matting is one of the primary processing tech-
niques in image and video editing. In this problem,
an image is assumed to be a composite of foreground
and background image layers. Let a given image,
the foreground image, and the background image be
denoted by $\tilde{c} = (c_1, c_2, \ldots, c_N)$, $\tilde{f} = (f_1, f_2, \ldots, f_N)$, and $\tilde{b} = (b_1, b_2, \ldots, b_N)$, respectively. Each element $c_i$, $f_i$, and $b_i$ ($i = 1, 2, \ldots, N$) is the RGB value (a 3-dimensional vector; each channel ranges from 0 to 255) of pixel $i$, and $N$ is the number of pixels. Then,
the observed image $\tilde{c}$ is modeled as the linear combination of a foreground image $\tilde{f}$ and a background image $\tilde{b}$ at each pixel:
$$c_i = \alpha_i f_i + (1 - \alpha_i) b_i, \qquad (1)$$
where $\alpha_i \in [0, 1]$ is the mixing rate, called the alpha value
or opacity. The task of image matting is to estimate the opacity $\tilde{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_N)$, the foreground colors $\tilde{f}$, and the background colors $\tilde{b}$ for each pixel in a given image $\tilde{c}$.
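Equation (1) is easy to verify numerically. The following sketch (NumPy assumed; the three pixels and their colors are invented for illustration) composites a red foreground over a blue background at three opacities:

```python
import numpy as np

# Invented example: a red foreground over a blue background at three opacities.
f = np.array([[255, 0, 0], [255, 0, 0], [255, 0, 0]], dtype=float)  # foreground colors f_i
b = np.array([[0, 0, 255], [0, 0, 255], [0, 0, 255]], dtype=float)  # background colors b_i
alpha = np.array([1.0, 0.5, 0.0])                                   # opacities alpha_i

# Eq. (1): c_i = alpha_i * f_i + (1 - alpha_i) * b_i, applied per pixel.
c = alpha[:, None] * f + (1.0 - alpha[:, None]) * b
print(c[1])  # the mixed pixel: (127.5, 0.0, 127.5)
```

Pixel 0 reproduces the pure foreground, pixel 2 the pure background, and pixel 1 an even mixture.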
This task is inherently an under-constrained problem, since the number of constraints in Eq. (1) is much less than the number of variables to be estimated ($\tilde{\alpha}$, $\tilde{f}$, and $\tilde{b}$). Moreover, as the foreground object a
user intends to extract is unknown, the user is usually
required to impose constraints, by indicating parts of
the foreground and background, which provide clues
for classifying the remaining pixels (Figure 1). In this
paper, we utilize this user-input information, as previous approaches do.
1.2 Previous Work
Blue screen matting (Smith and Blinn, 1996) was de-
veloped as a technique for motion picture photogra-
phy, which is well known as chroma-key composit-
ing. Recent approaches attempted to extract fore-
ground mattes directly from natural images without
assuming a constant background. Several methods re-
quired a user to prepare a trimap, which is a roughly
segmented map consisting of three regions: definitely
foreground, definitely background, and unknown re-
gions (Figure 1 (b)). Knockout 2 (Berman et al.,
2000) extrapolates the known foreground and back-
ground colors into the unknown region. Ruzon and
Tomasi first introduced a probabilistic view to image
matting, and estimated alpha mattes using foreground
and background distributions around unknown pixels
(Ruzon and Tomasi, 2000). Chuang et al. solved the
matting problem based on the Bayesian framework
and maximum a posteriori estimation (Chuang et al., 2001). Sun et al. obtained the alpha matte by solving the Poisson equation between the gradients of the alpha values and the color intensities (Sun et al., 2004). Grady et al. formulated image matting from the viewpoint of transition probabilities in random walks (Grady et al., 2005).

[Hosaka T., Kobayashi T. and Otsu N. (2007). IMAGE MATTING USING SVM AND NEIGHBORING INFORMATION. In Proceedings of the Second International Conference on Computer Vision Theory and Applications - IFP/IA, pages 344-349. © SciTePress.]
For high-quality matting, users need to carefully
generate the trimap, which is a troublesome and time-
consuming task. This problem was partially solved
by (Wang and Cohen, 2005). In their approach, a user draws a few strokes on the foreground object and
the background, as illustrated in Figure 1 (c), where
pixels on the red strokes are in the foreground, and
those on the blue strokes are in the background. They
defined a cost function for alpha estimation on the
Markov random field (MRF), and minimized it us-
ing the belief propagation (BP) (Pearl, 1988). Re-
cently, under the assumption that the foreground and
background colors lie on a straight line in RGB color space, a closed-form solution to image matting has
been derived, and the alpha value was analytically ob-
tained (Levin et al., 2006).
Figure 1: Methods of indicating the target object. (a) Orig-
inal image. (b) Trimap. A user roughly segments the im-
age into definitely foreground (painted white), definitely
background (painted black), and unknown regions (painted
gray). (c) Input strokes. A user marks the foreground (red
strokes) and background (blue strokes).
1.3 Objective of this Paper
The common drawback of the aforementioned algo-
rithms is that the performance tends to deteriorate
when the foreground and background regions contain
similar colors. One solution is to provide an interac-
tive user interface to modify imperfections, which has
been adopted by Poisson matting (Sun et al., 2004).
In this paper, we aim to improve the performance itself by using neighboring information around each pixel, whereas traditional algorithms use only the information of a single pixel. This exten-
sion, to a certain extent, incorporates texture-like in-
formation into the image matting. Furthermore, we
enhance the discrimination between foreground and
background with support vector machine (SVM).
2 COST FUNCTION
2.1 Formulation
The formulation of our cost function is partially sim-
ilar to (Wang and Cohen, 2005). They considered
two terms in their cost function: the local smooth-
ing term, and the likelihood term which expresses the
sufficiency level of the matting equation (1) when the
alpha value is estimated. However, their formulation is rather complicated, which obscures its essence.
In this paper, we incorporate three factors into a
cost function for high-quality matting: fidelity to the
matting equation (1), local smoothness, and discrim-
ination based on user inputs. Thus, our cost function
is expressed as
$$U(\tilde{\alpha}, \tilde{f}, \tilde{b}; \tilde{c}) = \lambda_M \sum_{i \in \mathcal{P}} U_M(\alpha_i, f_i, b_i; c_i) + \sum_{(i,j) \in \mathcal{N}} U_S(\alpha_i, \alpha_j; c_i, c_j) + \lambda_D \sum_{i \in \mathcal{P}} U_D(\alpha_i; g_i), \qquad (2)$$
where $U_M$, $U_S$, and $U_D$ express the matting, smoothing, and discrimination terms, respectively. The introduction of the discrimination term is novel to image matting, and $g_i$ is the 15-dimensional color vector defined below. The symbols $\mathcal{P}$ and $\mathcal{N}$ represent the set of pixels and the set of adjacent pixel pairs, respectively. The positive parameters $\lambda_M$ and $\lambda_D$ control the balance between these three terms. We specify these terms below.
2.2 Matting Term
Since the basic assumption of image matting is described by Eq. (1), the desirable alpha matte should satisfy this equation. Here, we explicitly introduce the fitness of this model using the squared error:
$$U_M(\alpha_i, f_i, b_i; c_i) = \| c_i - \alpha_i f_i - (1 - \alpha_i) b_i \|^2. \qquad (3)$$
2.3 Smoothing Term
The smoothing term is defined as
$$U_S(\alpha_i, \alpha_j; c_i, c_j) = \frac{1}{\|c_i - c_j\| + 1} \cdot (\alpha_i - \alpha_j)^2. \qquad (4)$$
This expression means that smoothness in the given image $\tilde{c}$ also enforces smoothness in the alpha matte.
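The matting term (3) and the smoothing term (4) translate directly into code. The sketch below (NumPy assumed; the sample colors are invented) evaluates both terms for a single pixel and pixel pair:

```python
import numpy as np

def matting_term(alpha_i, f_i, b_i, c_i):
    """Eq. (3): squared error of the compositing model at pixel i."""
    r = c_i - alpha_i * f_i - (1.0 - alpha_i) * b_i
    return float(r @ r)

def smoothing_term(alpha_i, alpha_j, c_i, c_j):
    """Eq. (4): alpha differences are penalized less across strong color edges."""
    weight = 1.0 / (np.linalg.norm(c_i - c_j) + 1.0)
    return weight * (alpha_i - alpha_j) ** 2

# Invented colors: c_i is exactly the 50/50 mix of f_i and b_i.
c_i = np.array([130.0, 10.0, 120.0])
f_i = np.array([250.0, 20.0, 10.0])
b_i = np.array([10.0, 0.0, 230.0])
print(matting_term(0.5, f_i, b_i, c_i))    # 0.0: the model fits perfectly
print(smoothing_term(0.5, 0.6, c_i, c_i))  # ~0.01: identical colors give full weight
```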
2.4 Discrimination Term
2.4.1 Extension of Image Vector
Traditional approaches focused only on the RGB vec-
tor of each pixel. However, when similar colors appear in both the foreground object and the background, it is difficult to classify the two regions based on pixel-wise RGB colors alone. One solution is to incorporate
neighboring information with pixel-wise colors, and
extract effective features from the local image for nat-
ural image matting.
Based on this perspective, we use the informa-
tion of each pixel and its four nearest neighbors as
one of the straightforward extensions. Although there
are several alternatives for color information, such
as HSV colors and SIFT (Lowe, 2004), we adopt
standard RGB colors to facilitate comparison of our
method with previous work. Therefore, we construct
a 15-dimensional vector consisting of the RGB inten-
sities of each pixel and its four nearest neighbors for
the discrimination term. The array of these vectors is denoted by $\tilde{g} = (g_1, g_2, \ldots, g_N)$, where $g_i$ is the 15-dimensional vector at pixel $i$.
We expect this configuration to capture some texture information. Naturally, the RGB color combinations among five pixels exhibit more variation than those of a single pixel; therefore, extending the 3-dimensional RGB colors to 15-dimensional vectors provides additional information for more accurate classification.
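As a sketch, the 15-dimensional vectors can be built with simple array shifts (NumPy assumed). Border handling is not specified above, so replicating the edge pixels is used here as one reasonable assumption:

```python
import numpy as np

def neighborhood_features(img):
    """Stack the RGB values of each pixel and its 4 nearest neighbors
    into a 15-dimensional vector per pixel (returns H x W x 15)."""
    # Replicate the border so that every pixel has four neighbors (an assumption).
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    center = padded[1:-1, 1:-1]
    up     = padded[:-2, 1:-1]
    down   = padded[2:,  1:-1]
    left   = padded[1:-1, :-2]
    right  = padded[1:-1, 2:]
    return np.concatenate([center, up, down, left, right], axis=-1)

img = np.random.randint(0, 256, size=(4, 5, 3)).astype(float)
g = neighborhood_features(img)
print(g.shape)  # (4, 5, 15)
```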
2.4.2 Classification by SVM
We enhance discrimination between foreground and
background by using the 15-dimensional vectors to
extract effective information for image matting. The support vector machine (SVM) with the kernel trick provides a scheme for carrying out this task. An input vector $\vec{x}$ is classified by $y = \Theta[f_{\mathrm{SVM}}(\vec{x})]$, where $y \in \{0, 1\}$ is the class label, $f_{\mathrm{SVM}}(\cdot)$ is the SVM output function, and $\Theta[z]$ is 1 for $z \geq 0$ and 0 otherwise.
We construct the discrimination term based on the
outputs of the SVM classifier. Note that the train-
ing data consists of the proposed 15-dimensional vec-
tors at user-marked pixels, and class labels express the
foreground (y = 1) and background (y = 0). For pix-
els that a user does not mark, the discrimination term
is defined as
$$U_D(\alpha_i; g_i) = \alpha_i d_i^0 + (1 - \alpha_i) d_i^1. \qquad (5)$$
In this expression, $d_i^1$ and $d_i^0$ represent the affinities of pixel $i$ to the foreground and the background, respectively. They are defined by the SVM output function $f_{\mathrm{SVM}}(g_i)$ as
$$d_i^{k_i} = \frac{1}{1 + \exp\{-a_{k_i} \, |f_{\mathrm{SVM}}(g_i)|\}}, \qquad d_i^{1-k_i} = 1 - d_i^{k_i}, \qquad (6)$$
where $k_i \equiv \Theta[f_{\mathrm{SVM}}(g_i)]$ is the classification result. The coefficients $a_1$ and $a_0$ should be determined appropriately; here, we empirically set them as $a_{k_i} = 4/J_{k_i}$, where $J_1$ and $J_0$ denote the average values of the SVM output function for the foreground and background training data, respectively. In this study, we adopt the Gaussian kernel (Müller et al., 2001)
$$K(\vec{x}, \vec{x}') = \exp\left( -\frac{\|\vec{x} - \vec{x}'\|^2}{2\sigma^2} \right), \qquad (7)$$
where $\sigma$ is a parameter fixed as $\sigma^2 = 1000$ throughout the paper.
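The discrimination term can be sketched end to end with scikit-learn's SVC as a stand-in classifier (an assumption; the paper does not name an implementation). The toy 15-dimensional samples below are invented, and `gamma` encodes the Gaussian kernel (7) with $\sigma^2 = 1000$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Invented stand-ins for the 15-dimensional vectors at user-marked pixels.
fg = rng.normal(loc=2.0, size=(50, 15))    # foreground training samples (y = 1)
bg = rng.normal(loc=-2.0, size=(50, 15))   # background training samples (y = 0)
X = np.vstack([fg, bg])
y = np.concatenate([np.ones(50), np.zeros(50)])

# Gaussian kernel, Eq. (7): exp(-||x - x'||^2 / (2 sigma^2)) -> gamma = 1/(2 sigma^2).
svm = SVC(kernel="rbf", gamma=1.0 / (2.0 * 1000.0)).fit(X, y)

f_out = svm.decision_function(X)           # f_SVM(g_i) for the training pixels
# J_1, J_0: average SVM outputs over foreground / background training data.
J = {1: np.abs(f_out[y == 1]).mean(), 0: np.abs(f_out[y == 0]).mean()}

def affinities(fx):
    """Eq. (6): returns (d_i^1, d_i^0) from the SVM output fx = f_SVM(g_i)."""
    k = 1 if fx >= 0 else 0                # k_i = Theta[f_SVM(g_i)]
    a_k = 4.0 / J[k]                       # a_{k_i} = 4 / J_{k_i}
    d_k = 1.0 / (1.0 + np.exp(-a_k * abs(fx)))
    d = {k: d_k, 1 - k: 1.0 - d_k}
    return d[1], d[0]

d1, d0 = affinities(f_out[0])              # a foreground training sample
print(d1 > d0)                             # higher affinity to the foreground
```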
Figure 2 shows the effectiveness of the 15-
dimensional vectors and classification by the SVM.
This figure shows the value of $d_i^1$ in 256 gray levels, when using the standard 3-dimensional RGB vectors ((b), (e), and (h)) and the 15-dimensional extended vectors (our method; (c), (f), and (i)). Red and blue strokes indicate user inputs for the foreground and background, respectively. Figure 2(a) is an artificial graphic produced to help illustrate the effectiveness of the proposed 15-dimensional vectors, in which a foreground object (the yellow ball) lies on a background texture of stripes one pixel wide. Since similar colors exist in both the foreground and background, the performance of the pixel-wise method degrades (b), while our method (15-dimensional vectors classified by SVM) provides favorable discrimination results ((c), as well as (f) and (i)).
3 ALGORITHM
It is difficult to minimize the cost function (2) with respect to $\tilde{\alpha}$, $\tilde{f}$, and $\tilde{b}$ simultaneously. This difficulty was also faced by (Wang and Cohen, 2005). As they did, we minimize the cost function for the alpha values using belief propagation (BP) while keeping $\tilde{f}$ and $\tilde{b}$ fixed, and minimize it for the foreground and background colors by a sampling method while keeping $\tilde{\alpha}$ fixed.
3.1 Estimation of Alpha Values by BP
Finding optimal alpha mattes with minimum cost cor-
responds to the MAP estimation problem, which is
generally computationally difficult. Thus, we have to
employ practically tractable algorithms that generate
(sub)optimal solutions.
Belief propagation (Pearl, 1988) is a promising approach for such discrete combinatorial optimization tasks. BP has recently been exploited for vari-
ous computer vision problems (e.g., stereo matching
(Sun et al., 2003)) as well as image matting (Wang
and Cohen, 2005). Therefore, we quantize the alpha value to 11 levels (at intervals of 0.1 between 0 and 1) in order to transform the current problem into a discrete combinatorial optimization.

Figure 2: Comparison among the discrimination terms. A pixel in white indicates an affinity for the foreground, and a pixel in black indicates an affinity for the background. The traditional 3-dimensional RGB vectors (b) are insufficient to separate the foreground object from the background texture in the toy example shown in (a), whereas the proposed 15-dimensional vectors (c) provide excellent classification. In the example of the stuffed rabbit, although the difference is not necessarily clear, there are some places in the background where the 15-dimensional case (f) is superior to the 3-dimensional case (e). In the soccer ball image, the 3-dimensional vectors misclassify pixels to the left of the ball (h).
On the current MRF, BP is represented as a message passing algorithm between neighboring pixels:
$$m_{ij}^{t}(\alpha_i) = \min_{\alpha_j} \bigg[ \lambda_M U_M(\alpha_j, f_j, b_j; c_j)/Z_j + \lambda_D U_D(\alpha_j; g_j) + U_S(\alpha_i, \alpha_j; c_i, c_j) + \sum_{k \in \mathcal{N}(j)\setminus i} m_{jk}^{t-1}(\alpha_j) \bigg], \qquad (8)$$
where $\mathcal{N}(j)\setminus i$ denotes the set of nearest neighbors of pixel $j$ other than $i$, and $t = 1, 2, \ldots$ is the iteration index. The matting term $U_M$ is normalized by a factor $Z_j \equiv \sum_{\alpha} \|c_j - \alpha f_j - (1 - \alpha) b_j\|^2$ to restrict this term to the range $[0, 1]$, as with the other two terms ($U_D$ and $U_S$), and to facilitate the adjustment of the parameters $\lambda_M$ and $\lambda_D$.
Note that the messages $m_{ij}^t$ and $m_{ji}^t$ are different variables. After the convergence of the iterations, a belief vector is computed for each pixel as
$$b_i(\alpha_i) = \lambda_M U_M(\alpha_i, f_i, b_i; c_i)/Z_i + \lambda_D U_D(\alpha_i; g_i) + \sum_{j \in \mathcal{N}(i)} m_{ij}^{*}(\alpha_i), \qquad (9)$$
where the superscript $*$ represents the value at convergence. The optimal label at pixel $i$, denoted $\alpha_i^*$, is estimated as
$$\alpha_i^* = \arg\min_{\alpha_i} b_i(\alpha_i). \qquad (10)$$
As in (Wang and Cohen, 2005), we employ the techniques proposed by (Felzenszwalb and Huttenlocher, 2004) to facilitate the calculation of Eq. (8).
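A minimal sketch of the min-sum update (8) for a single directed edge, with the 11 quantized alpha levels and precomputed unary costs (the toy costs and the edge weight are invented for illustration):

```python
import numpy as np

levels = np.linspace(0.0, 1.0, 11)   # alpha quantized at 0.1 intervals

def message_update(unary_j, pairwise_ij, incoming_to_j):
    """Min-sum message m_ij(alpha_i) of Eq. (8).

    unary_j       : (11,) lambda_M*U_M/Z_j + lambda_D*U_D evaluated at pixel j
    pairwise_ij   : (11, 11) U_S(alpha_i, alpha_j) for this edge
    incoming_to_j : list of (11,) messages from j's neighbors other than i
    """
    total_j = unary_j + sum(incoming_to_j)        # cost as a function of alpha_j
    return (pairwise_ij + total_j[None, :]).min(axis=1)

# Toy unary cost: pixel j strongly prefers alpha_j = 1 (invented for illustration).
unary_j = (1.0 - levels) ** 2
w = 1.0 / (10.0 + 1.0)                            # Eq. (4) weight, ||c_i - c_j|| = 10 assumed
pairwise = w * (levels[:, None] - levels[None, :]) ** 2
m = message_update(unary_j, pairwise, [np.zeros(11)])
print(levels[m.argmin()])  # 1.0: the message also favors high alpha_i
```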
3.2 Sampling for Foreground and
Background Colors
We must estimate the foreground and background colors, $\tilde{f}$ and $\tilde{b}$, as well as the alpha values. The foreground and background colors appear only in the matting term. We determine these values by a sampling approach. Let the current value of the matting term at pixel $i$ be denoted by $v_i \equiv \|c_i - \alpha_i f_i - (1 - \alpha_i) b_i\|^2$. For each pixel $i$, we sequentially search for the optimal foreground and background colors among its neighboring pixels $j$, starting from the nearest neighbors, within a radius of 20 pixels. We consider $f_j$ if $\alpha_j > \alpha_i$ (or $b_j$ if $\alpha_j < \alpha_i$), and replace the foreground (background) color $f_i$ ($b_i$) with $f_j$ ($b_j$) if the matting term is reduced, i.e., if $\|c_i - \alpha_i f_j - (1 - \alpha_i) b_i\|^2 < v_i$ (respectively $\|c_i - \alpha_i f_i - (1 - \alpha_i) b_j\|^2 < v_i$).
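The sampling step can be sketched as follows. This is a simplified brute-force version (the paper searches neighbors nearest-first within a radius of 20 pixels; the tiny 2x2 example is invented):

```python
import numpy as np

def sample_colors(c, alpha, f, b, radius):
    """One pass of the color-sampling update of Section 3.2 (simplified:
    candidates are scanned over a square window rather than nearest-first)."""
    H, W = alpha.shape

    def cost(i, j, fi, bi):
        r = c[i, j] - alpha[i, j] * fi - (1.0 - alpha[i, j]) * bi
        return r @ r                                   # matting term, Eq. (3)

    for i in range(H):
        for j in range(W):
            v = cost(i, j, f[i, j], b[i, j])           # current matting cost v_i
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < H and 0 <= nj < W):
                        continue
                    if alpha[ni, nj] > alpha[i, j]:    # candidate foreground color f_j
                        if cost(i, j, f[ni, nj], b[i, j]) < v:
                            f[i, j] = f[ni, nj]
                            v = cost(i, j, f[i, j], b[i, j])
                    elif alpha[ni, nj] < alpha[i, j]:  # candidate background color b_j
                        if cost(i, j, f[i, j], b[ni, nj]) < v:
                            b[i, j] = b[ni, nj]
                            v = cost(i, j, f[i, j], b[i, j])
    return f, b

# Invented 2x2 example: the mixed pixel (0,0) starts with a wrong foreground color.
alpha = np.array([[0.5, 1.0], [0.0, 0.0]])
c = np.zeros((2, 2, 3)); f0 = np.zeros((2, 2, 3)); b0 = np.zeros((2, 2, 3))
c[0, 0] = [100, 0, 100]                    # 50/50 mix of red and blue
c[0, 1] = [200, 0, 0]                      # pure foreground pixel
f0[0, 1] = [200, 0, 0]
b0[:] = [0, 0, 200]
f_new, b_new = sample_colors(c, alpha, f0, b0, radius=1)
print(f_new[0, 0])  # the wrong foreground color is replaced by (200, 0, 0)
```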
3.3 Algorithm Flow
We use the multiscale technique proposed by (Felzen-
szwalb and Huttenlocher, 2004) to facilitate the com-
putation and obtain better results. We begin with an
estimation for the coarsest image, and use the results
as initial values for the finer image. The final alpha
matte is obtained as a result of the original scale. Our
entire algorithm is described below.
1) Generation of multiscale images: Multiscale images for the original image and the user strokes are generated by the standard quad-tree method.
2) Classification by SVM: The elements of the discrimination term, $d_i^1$ and $d_i^0$, are calculated.
3) Initialization: We set $f_i = c_i$ for user-marked foreground pixels, and $b_i = c_i$ for user-marked background pixels. Unmarked pixels take over the values of the corresponding pixels at the previous coarser scale.
Figure 3: Examples of experimental results. The first column shows original images with user-specified strokes. The other columns show the results of (Wang and Cohen, 2005), (Levin et al., 2006), and our method. The parameters $\lambda_M$ and $\lambda_D$, as well as those of the two previous approaches, are adjusted so that the performance is visually optimal.
4) Estimation of alpha values: The alpha values are estimated by BP with the foreground and background colors fixed.
5) Estimation of foreground and background colors: The foreground and background colors are estimated by sampling from neighboring pixels with the alpha values fixed.
6) Repeat steps 4 and 5 until the values of $\tilde{\alpha}$, $\tilde{f}$, and $\tilde{b}$ remain constant.
7) Return to step 3 and start the estimation for the next finer scale.
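The coarse-to-fine flow of steps 1-7 can be sketched as a loop over an image pyramid. The down/upsampling below are simple stand-ins for the quad-tree method, and the per-scale solver is a placeholder for steps 4-6:

```python
import numpy as np

def downsample(x):
    """Quad-tree coarsening: average 2x2 blocks (a simple stand-in)."""
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def upsample(x):
    """Replicate each coarse value into its 2x2 block of children."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def solve_scale(image, alpha_init):
    """Placeholder for steps 4-6 (BP plus color sampling) at one scale."""
    return alpha_init  # a real implementation iterates BP and sampling here

image = np.random.rand(16, 16)
pyramid = [image]
for _ in range(3):                      # four scales in total, as in Section 4
    pyramid.append(downsample(pyramid[-1]))

alpha = np.zeros_like(pyramid[-1])      # start at the coarsest scale
for scale in reversed(pyramid):
    alpha = solve_scale(scale, alpha)   # steps 4-6
    if scale is not pyramid[0]:
        alpha = upsample(alpha)         # initialize the next finer scale
print(alpha.shape)  # (16, 16)
```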
4 EXPERIMENTAL RESULTS
The proposed approach has been tested on various images. Figure 3 shows several results obtained by
our method, compared to other methods, (Wang and
Cohen, 2005) and (Levin et al., 2006). The results
of these previous works were obtained using the pro-
grams provided on their websites. Four scales are used for every image. The upper three examples were also used in the previous works, and we
set user-marked inputs in places similar to those studies. The parameters $\lambda_M$ and $\lambda_D$ in Eq. (2) were determined manually for each image so that the performance was visually optimal, and the parameters of the other methods were also tuned by hand.
It is generally difficult to obtain ground truth and to evaluate matting performance quantitatively. Therefore, we resort to subjective evaluation. Previous ap-
proaches work well on the images of a peacock and
a face, and our approach also compares favorably on
those images. In the latter two images which con-
tain similar colors in the foreground and background,
our method extracts the foreground object better than
the other algorithms on the whole, which indicates
that the proposed 15-dimensional color vectors and
classification by SVM are effective for image mat-
ting. However, in some instances, our method does
not necessarily capture the details as well as the other
methods. Figure 4 shows an example of a compos-
ite image, the stuffed rabbit extracted by our method
with a blue background. The enlarged details in the
red square are relatively reasonable, while those in the
green square are missing in the composite image.
The performance of these matting algorithms de-
pends on the positions and the quantity of the user
inputs. In particular, when a user draws only a few
strokes, the performance can deteriorate drastically.
An example of the calculation time is as follows.
Using a 2.66 GHz CPU with 3 GB RAM, an image
size of 341×455 pixels (the stuffed rabbit in Figure 3)
requires about 23 sec for the classification by SVM
and about 17 sec for the subsequent estimation by
BP and sampling without specific programming op-
timization.
5 CONCLUSION
This paper has proposed an improved cost function for image matting. A key contribution is the
use of neighboring information in terms of higher di-
mensional vectors, instead of considering the infor-
mation in a single pixel. In addition, we enhanced the
discrimination between foreground and background
with SVM. We obtained high-quality matting results
even when a foreground object and background had
similar colors.
Our future work includes further improvements
to the cost function and estimation process for fore-
ground and background colors, in order to obtain
more desirable results. Setting the parameter values
also influences matting results. In this study, we manually set optimal values for $\lambda_M$ and $\lambda_D$, which may not be practical. Statistical inference methods, such as maximization of the marginal likelihood (Tanaka, 2002), could be used for this parameter estimation. Another problem is the optimal setting of the parameters $\sigma$ and $a_{k_i}$ in the SVM formulation. Cross-validation is one promising solution to this problem.
Figure 4: An example of composite images with blue back-
ground. It can be seen that some details in the original im-
age are missing in the composite image.
ACKNOWLEDGEMENTS
The study was supported by the advanced surveil-
lance technology project of MEXT. The authors thank
T. Kurita and N. Ichimura for their helpful discus-
sions.
REFERENCES
Berman, A., Vlahos, P., and Dadourian, A. (2000). Comprehensive method for removing from an image the background surrounding a selected object. U.S. Patent 6,134,345.
Chuang, Y. Y., Curless, B., Salesin, D., and Szeliski, R.
(2001). A bayesian approach to digital matting. In
Proc. of IEEE CVPR. pp. 264-271.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient
belief propagation for early vision. In Proc. of IEEE
CVPR. pp. 261-268.
Grady, L., Schiwietz, T., Aharon, S., and Westermann, R.
(2005). Random walks for interactive alpha-matting.
In Proc. of VIIP05. pp. 423-429.
Levin, A., Lischinski, D., and Weiss, Y. (2006). A closed form solution to natural image matting. In Proc. of IEEE CVPR. pp. 61-68.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, vol. 60, pp. 91-110.
Müller, K. R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks, vol. 12, pp. 181-201.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Sys-
tems. Morgan-Kauffman, San Francisco.
Ruzon, M. A. and Tomasi, C. (2000). Alpha estimation in
natural images. In Proc. of IEEE CVPR. pp. 18-25.
Smith, A. R. and Blinn, J. F. (1996). Blue screen matting. In
Proc. of the 23rd annual conf. on Computer graphics
and interactive techniques. pp. 259-268.
Sun, J., Jia, J., Tang, C. K., and Shum, H. Y. (2004). Poisson
matting. In Proc. of ACM SIGGRAPH. pp. 315-321.
Sun, J., Zheng, N. N., and Shum, H. Y. (2003). Stereo matching using belief propagation. IEEE Trans. on PAMI, vol. 25, pp. 787-800.
Tanaka, K. (2002). Theoretical study of hyperparameter estimation by maximization of marginal likelihood in image restoration by means of cluster variation method. Electronics and Communications in Japan, vol. 85, pp. 50-62.
Wang, J. and Cohen, M. F. (2005). An iterative optimization
approach for unified image segmentation and matting.
In Proc. of IEEE ICCV. pp. 936-943.