An Accurate Hand Segmentation Approach using a Structure based

Shape Localization Technique

Jose M. Saavedra

1,2

, Benjamin Bustos

and Violeta Chang

University of Chile, Department of Computer Science, Santiago, Chile

ORAND S.A., Santiago, Chile

Keywords:

Hand Segmentation, Hand Localization, Color based Segmentation, Local Descriptors.

Abstract:

Hand segmentation is an important stage for a variety of applications such as gesture recognition and biomet-

rics. The accuracy of the hand segmentation process becomes more critical in applications that are based on

hand measurements as in the case of biometrics. In this paper, we present a very accurate hand segmentation

technique, relying on both hand localization and color information. First, our proposal locates a hand on an

input image, the hand location is then used to extract a training region which will play a critical role for seg-

menting the whole hand in an accurate way. We use a structure-based method (STELA), originally proposed

for 3D model retrieval, for the hand localization stage. STELA exploits not only locality but also structural

information of the hand image and does not require a large image collection for training. Second, our proposal

separates the hand region from the background using the color information captured from the training region.

In this way, the segmentation depends only on the user skin color. This segmentation approach allows us to

handle a variety of skin colors and illumination conditions. In addition, our proposal is characterized by being

fully automatic, where a user calibration stage is not required. Our results show a 100% in the hand localiza-

tion process under different kinds of images and a very accurate hand segmentation achieving over 90% of

correct segmentation at the expense of having only 5% for false positives..

1 INTRODUCTION

Hand segmentation has become very important for

many applications such as those related to vision-

based virtual reality (Yuan et al., 2008), gesture

recognition (Wachs et al., 2011), and biometric recog-

nition (Huang et al., 2008; Y

uk et al., 2006). Fur-

thermore, the hand segmentation step turns more crit-

ical in cases where a depth analysis of the hand is

required. This kind of analysis could imply getting

some measures from the hand image. For instance, in

the case of hand biometrics getting reliable hand mea-

surements is essential. Commonly, hand-based bio-

metrics require a special device to capture the hand

properties limiting its usability. This problem could

be solved if we could get hand measurements directly

from a digital cam. Of course, this would require

an accurate segmentation stage. Yoruk et al. (Y

et al., 2006) proposed a method to carry out this pro-

cess. However, in their work, they consider that the

image background is dark and homogenous leading

to a trivial segmentation stage.

Commonly, hand segmentation is carried out by

skin color based techniques (Jones and Rehg, 2002;

Lew et al., 2002). These techniques build a generic

model (e.g. a statistical model) from a large collec-

tion of training skin images. The model will decide

whether a pixel in the image is actually a hand point

or not. An important study on this kind of model

proposed by Jones et al. (Jones and Rehg, 2002) is

based on a Gaussian mixture model. A critical draw-

back of a skin color based technique is its low perfor-

mance under illumination variations. Moreover, this

kind of technique performs poorly in presence of skin

color-like regions. Figure 1 shows a segmentation re-

sult using the statistical model proposed by Jones et

al. (Jones and Rehg, 2002) for skin segmentation.

Instead of using a generic color model one could

use a model dependent on the user skin color leading

to an adaptive approach. The adaptive term arises due

to the fact that the color model will be adapted to the

user skin color. This approach allows us not only to

tackle the diversity of skin color but also to handle

diverse illumination conditions.

In the adaptive segmentation approach, a hand

portion called the training region needs to be previ-

321

Saavedra J., Bustos B. and Chang V..

An Accurate Hand Segmentation Approach using a Structure based Shape Localization Technique.

DOI: 10.5220/0004288703210326

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 321-326

ISBN: 978-989-8565-47-1

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: A low accurate segmentation result using a skin

color based approach (Jones and Rehg, 2002). The white

pixel on the right binary are the pixels detected as hand

points.

ously marked on an input image. The pixels inside

the marked region are used to build a color model

that will be used to segment the rest of the hand by

means of color similarity. Following this idea, Yuan

et al. (Yuan et al., 2008) proposed an algorithm that

makes up color clusters using a training region and

then labels the clusters as hand or background de-

pending on the size of each cluster. Finally, the image

points are classiﬁed depending on what cluster they

belong to. The problem with this algorithm is that it

requires the user to mark the training region manually,

an undesired task in automatic environments.

To get a training region automatically, we could

locate the hand on the input image. Using this loca-

tion we could mark an appropriate hand region, com-

monly using the location coordinates as the center of

such a region.

For the localization problem, the Viola and Jones

detector (Viola and Jones, 2002) may be used. The

problem with this approach is that it requires large

training image collection and it is time consuming for

the training. Because a hand is a simple object with

a well deﬁned shape, an expensive training stage is

unnecessary. Moreover, the localization problem may

be carried out using a local structure-based approach

exploiting not only locality information but also the

structure of the hand shape.

Our contribution in this work is to present a very

accurate hand segmentation technique composed of

two main steps: (1) estimate the hand location on

an image, and (2) separate the hand region from the

background. For the localization stage, we use a lo-

cal structure-based approach exploiting both struc-

tural and locality information of a hand. Structural

information is related to the components forming a

hand and locality information is related to the spatial

relationship between these components. To this end,

we use the STELA (StrucTurE-based Local Aproach)

method proposed by Saavedra et al. (Saavedra et al.,

2011). For the segmentation stage we extend the idea

of Yuan et al. (Yuan et al., 2008) proposing strategies

to compute the underlying parameters. In this case,

we make up color clusters from a training region ob-

Figure 2: A example of hand segmentation using our ap-

proach. The blue contour deﬁnes the segmented region.

tained directly from the localization stage (a manual

localization is no longer required). For color repre-

sentation we use only the chromatic channels of the

L*a*b* color space as suggested by Yuan et al. (Yuan

et al., 2008). The segmentation stage ends with a

post-procesing phase to reduce imperfections caused

by noise. An example of our results is shown in Fig-

ure 2 where the segmentation is speciﬁed by a blue

contour.

The remaining part of this document is organized

as follows. Section 2 describes the local structure

based approach (STELA) which our proposed method

relies on. Section 3 describes in detail the hand seg-

mentation process. Section 4 presents the experimen-

tal evaluation, and ﬁnally, Section 5 discusses some

conclusions.

2 STELA

STELA is a structure-based approach proposed by

Saavedra et al. (Saavedra et al., 2011) for retrieving

3D models when the query is a line-based sketch. A

STELA descriptor is invariant to translation, scaling,

and rotation transformation. The main property of

this approach is that it is based not only on the struc-

tural information but also on the locality information

of an image.

For getting structural information, an image is

decomposed into simple shapes. This method uses

straight lines as primitive shapes which are named

keyshapes. For getting locality information, local de-

scriptors taking into account the spatial relationship

between keyshapes are used. An interesting property

of keyshapes is that these allow us to represent an ob-

ject in a higher semantic level.

STELA consists of the following steps: (1) get an

abstract image, (2) detect keyshapes, (3) compute lo-

cal descriptors, and (4) match local descriptors.

1. Abstract Image. The abstract image allows us to

reduce the effect of noise, keeping only relevant

edges. To this end, STELA applies the canny op-

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

322

Figure 3: A synthetic representation of the partitioning to

make up the STELA descriptor.

erator (Canny, 1986) over the image in order to

have an edge map representation. Then, to ap-

proximate strokes, the method uses the edgelink

operator (Kovesi, 2000) which returns a set of

edge lists.

2. Detecting Keyshapes. Detecting simple shapes

that compose a more complex object allows us

to take into account structural information in the

similarity measurement. STELA detects only

straight lines which form an object. Lines can

be detected easily and still maintain enough dis-

criminative information. These lines are re-

ferred as keyshapes following the same idea of

keypoints (Mikolajczyk and Schmid, 2004) even

though keyshapes are more representative in terms

of the object structure.

Finally, the center of each line is taken as the rep-

resentative point of each keyshape. In STELA,

each keyshape L is represented as a 5-tuple

[(x

),(x

), (x

),s,φ], where (x

) is

the start point, (x

) is the end point, (x

) is

the representative point, s is the line length, and φ

is the corresponding slope.

3. The Local Descriptor. A local descriptor is com-

puted over each keyshape. STELA uses an ori-

ented angular 8-partitioning descriptor. Figure 3

depicts a graphical representation of this descrip-

tor.

Having a keyshape L as reference, this descriptor

works as follows:

• Create a vector h, containing 8 cells. Initially,

h(i) = 0, i = 1...8.

• Let L be the reference keyshape represented as

L = [(x

),(x

),s,φ

]. (1)

• Let f

: R

→ R

be a rotation function around

the point (x

) with rotation angle β = −φ.

• Let ( ˆx

, ˆy

) = f

) be the normalized ver-

sion of (x

• For each keyshape Q 6= L represented by

[(x

), (x

),s

,φ

– Get ( ˆx

, ˆy

) = f

– Determine the corresponding bin of ( ˆx

, ˆy

) on

the oriented partitioning scheme.

– h(bin) = h(bin) + s

/MAX LEN, where

MAX LEN is the length of the abstract image

diagonal.

• Finally, h(bin) =

h(bin)

∑

i=1

h(i)

, bin = 1 ...8.

4. Matching. An image is treated as a set of de-

scriptors. This set captures the object shape. The

matching problem may be regarded as an instance

of the Bipartite Graph Matching. STELA re-

solves the assignment problem applying the Hun-

garian Method. The cost of matching two local

descriptors p and q could be thought of as the dis-

tance between p and q. In this way, the less sim-

ilar the descriptors are, the more expensive the

match becomes. As the STELA descriptor is a

probability distribution, it uses the χ

statistic test

as distance function.

For estimating the pose transformation, this

method uses the Hough Transform, where each

candidate match must vote just for three pa-

rameters (scale, translation in x-axis and y-axis).

The set of parameters with the highest score is

retained. This set of transformation parame-

ters characterizes the estimated pose. Only the

matches which agree with the estimated pose are

retained for the next process, the others are dis-

carded.

Finally, the dissimilarity between two images is

computed as the average cost of the matched de-

scriptors. The cost of the unmatched descriptors

is set to 1.

3 PROPOSED HAND

SEGMENTATION METHOD

We divide our approach in two stages, the ﬁrst one

corresponds to the hand localization, where STELA

is applied to estimate the location of the occurrence

of a hand. After this, we extract a training region

from the center of the located region and proceed to

segment the hand using the color information given

by the training region. Finally we carry out a post-

processing stage to make our result more robust to en-

vironment conditions like illumination or skin color-

AnAccurateHandSegmentationApproachusingaStructurebasedShapeLocalizationTechnique

323

Figure 4: Proposal framework.

Figure 5: Two examples of hand localization.

like regions near to the hand region. A framework of

our proposal is presented in Figure 4.

3.1 Hand Localization

Unlike the machine learning approach for detecting

objects, our approach only requires a simple hand pro-

totype. In our implementation we use a 80 × 80 hand

prototype image.

We use STELA to estimate the hand location in

an image. Since we are interested in detecting a hand

shape we could make STELA faster reducing the im-

age size. We resize the input image to a 100 × 120

image. To appropriately determine the location of a

hand, we apply the sliding window strategy. Each re-

gion inside a 80 × 80 window is compared with the

hand prototype by STELA. The center of the window

keeps the dissimilarity value between the windowed

region and the prototype.

Let d

i j

be the STELA dissimilarity value for each

pixel (i, j) in the resized input image. We deﬁne D as

a set of points where the dissimilarity value is close to

the minimum dissimilarity value (mdv). That is:

D = {(i, j) : d

i j

−mdv ≤ 0.1,i = 1 ...100, j = 1 ...120},

Finally, the occurrence of a hand is located in the

centroid (i

, j

) of D only if the mdv ≤ T H

. This

means that if we have mdv > T H

the method will re-

port a hand not found message. In our experiments

we set T H

= 0.65. If a hand is located, the cen-

troid (i

, j

) needs to be rescaled to have the real lo-

cation. Two examples of hand localization are shown

in Figure 5 where we notice that the proposed method

works even when the hand undergoes rotation varia-

tions.

3.2 Hand Segmentation

After locating the hand, we extract a 150x100 region

where the upper left corner corresponds to the hand

location point (see Figure 4). A great number of pix-

els inside this region must correspond to the hand. We

will refer to this region as the training region.

We proceed to make up color clusters. In this way,

each pixel of the training region must fall within one

of the built clusters. We represent the image by the

L*a*b* color space. We only use the chromatic chan-

nels a∗ and b∗ similiar as the proposal of Yuan et

al. (Yuan et al., 2008). Though we have conducted

experiments using different color spaces, we got bet-

ter results using L*a*b*.

Each color cluster q needs to keep two values, the

ﬁrst one corresponds to the number of pixels falling

inside it (N

), and the second one is a representa-

tive point of the cluster (this is not necessarily a real

pixel), expressed in terms of its corresponding a* b*

color information. We represent this point as [r

The representative pixel corresponds to the average of

the a* b* components of all pixels falling inside q.

We make up the clusters following this given al-

gorithm:

• SoC =

0 (set of clusters)

• N = 0 (number of clusters)

• For each pixel p in the training region

– Let [a,b] the corresponding a*b* color informa-

tion of p.

– m = min

1≤q≤N

([a,b],[r

]))

– q

∗

= argmin

1≤q≤N

([a,b],[r

]))

– If m < T H, update [r

] and increase N

by 1.

– Otherwise, add a new cluster w to SoC using p,

increasing N by 1. In this case, [r

] = [a,b]

and N

= 1.

• return SoC

In the previous algorithm T H = 0.01 and L

corre-

sponds to the Euclidean distance.

After building the set of clusters SoC, we segment

a hand starting from the pixel corresponding to the

hand location point expanding the process to the rest

of the pixels using the breadth ﬁrst search strategy

through the pixels detected as hand point. This strat-

egy avoids detecting objects that are far from the hand

region, minimizing the false positive points. The al-

gorithm determines what cluster a pixel p must be-

long to. An appropriate cluster q? for the pixel p must

satisfy two criteria:

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

324

1. The number of pixels in q? must be greater than

T H N. Clusters with few pixels may correspond

to the background.

2. Considering only clusters satisfying the ﬁrst crite-

rion, the cluster q? corresponds to that with min-

imum distance (md) between p and each cluster

representative point. Here, the distance function

is the Euclidean distance.

If md > T H

the pixel is marked as background,

otherwise it is marked as a hand point.

In our implementation T H N is the median of the

cluster sizes. T H N = median(N

,...,N

). In the

case of T H

we conduct a different strategy. The

main idea is that T H

has to be computed depending

on the distances between hand points from the train-

ing region. Therefore T H

is a distance that allows us

to discard the 10% of the training region points with

higher distance value whit respect to the cluster they

belong to.

3.3 Post-processing

After the segmentation stage we have a binary rep-

resentation where detected hand points are set to 1

and background points are set to 0. To have an accu-

rate segmentation we apply two post-processing op-

erations. First, the method discards a point detected

as hand point if the number of hand points in a local

region around the point is less than 50% of the local

region size. In this case, we use a 21x21 local region.

Second, the method applies morphological operations

to ﬁll holes in the hand region (Soille, 1999).

4 EXPERIMENTAL EVALUATION

In this section we show results of the automatic hand

segmentation using our proposal. To test our ap-

proach, we used a collection of 25 640 × 480 im-

ages containing a hand captured with different kinds

of illumination. In terms of pixels, our collection is

composed of 5,761,985 hand pixels (positive set) and

18,084,503 non-hand pixels (negative set). We com-

pare our result against a renowned skin color model,

speciﬁcally we use as baseline as proposed by Jones

et al. (Jones and Rehg, 2002).

A good tool to assess the performance of our seg-

mentation results is the ROC curve. In this way, we

have a mask indicating the hand region for each im-

age. ROC curve analysis takes into account the area

under the curve (AUC) as a quality measure. The

higher the AUC value is, the better the quality of our

method is. Furthermore ROC curves show us the cost

Figure 6: ROC curves comparing our method with the skin

color model proposed by Jones (Jones and Rehg, 2002).

The AUC value is indicated in the legend.

Table 1: Correct detection rate (CD) vs. false positive

rate for our method and the skin color model proposed by

Jones (Jones and Rehg, 2002).

CD FP (Our method) FP (Skin color model)

0.95 0.146 0.641

0.90 0.052 0.490

0.85 0.046 0.334

0.80 0.009 0.234

we have paid in terms of false positives when a high

correct detection is desired. The ROC curve compar-

ing our method with the skin color model is presented

in Figure 6.

Our method achieves an AUC value of 0.96 as

the skin color model only achieves 0.85. In addition

to the overall good performance of our method, it is

worth pointing out that our proposal achieves a cor-

rect detection rate over 90% at the expense of having

only 5% for false positives. The skin color model re-

sults in 49% of false positives to achieve a comparable

result. In Table 1 the relationship between false posi-

tive and correct detection is shown for both evaluated

methods.

Figure 7 shows how well our method segments

a hand in an image in comparison with the baseline

method. Additionally, six examples of hand segmen-

tation using our method are depicted in Figure 8.

Finally, we have to say that our method correctly

locates a hand for different images. In our experi-

ments we get 100% of correct localization. There-

fore, any application requiring a hand localization

step could take advantage of our proposal, hand track-

ing and hand biometrics are two potential applica-

tions.

AnAccurateHandSegmentationApproachusingaStructurebasedShapeLocalizationTechnique

325

(a) (b)

Figure 7: Hand segmentation comparison. (a) Input image,

(b) target segmentation, (c) output using our method, and

(d) output using the skin color model.

Figure 8: Examples of hand segmentation using our pro-

posed approach.

5 CONCLUSIONS

We have presented a novel approach for the hand seg-

mentation task. Our method estimates the hand loca-

tion to capture a training hand region. Then, we pro-

pose an adaptive color model based on the image’s

skin color to segment the hand. This allows us to han-

dle illumination changes and diverse skin colors. We

compare our method with a skin color based model,

achieving notable improvement in the accurate seg-

mentation. One advantage of our method is that it

segments the hand in an accurate way without requir-

ing a lengthy time consumption for the training stage.

Additionally we have presented a novel approach

for hand localization which could be used in other ap-

plications like hand tracking for instance.

We would like to extend our proposal in order to

focus on the segmentation of other kinds of objects as

well as that within the context of hand biometrics.

REFERENCES

Canny, J. (1986). A computational approach to edge detec-

tion. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 8(6).

Huang, D.-S., Jia, W., and Zhang, D. (2008). Palmprint ver-

iﬁcation based on principal lines. Pattern Recognition,

41.

Jones, M. J. and Rehg, J. M. (2002). Statistical color models

with application to skin detection. Int. Journal Com-

put. Vision, 46.

Kovesi, P. D. (2000). MATLAB and Octave functions

for computer vision and image processing. Available

from: <http://www.csse.uwa.edu.au/∼pk/research/

matlabfns/>.

Lew, Y., Ramli, A., S.Y.Koay, Ali, R., and Prakash, V.

(2002). A hand segmentation scheme using cluster-

ing technique in homogeneous background. In Proc.

of 2nd Student Conference on Research and Develop-

ment.

Mikolajczyk, K. and Schmid, C. (2004). Scale & afﬁne in-

variant interest point detectors. International Journal

of Computer Vision, 60(1):63–86.

Saavedra, J. M., Bustos, B., Scherer, M., and Schreck, T.

(2011). STELA: Sketch-based 3d model retrieval us-

ing a structure-based local approach. Submitted to

ACM-ICMR.

Soille, P. (1999). Morphological Image Analysis: Princi-

ples and Applications. Springer-Verlag Telos.

Viola, P. and Jones, M. (2002). Robust real-time object de-

tection. Int. Journal of Computer Vision.

Wachs, J., Klsch, M., Stern, H., and Edan, Y. (2011).

Vision-based hand gesture interfaces: Chall. and in-

nov. Communications of the ACM.

uk, E., Duta

gaci, H., and Sankur, B. (2006). Hand bio-

metrics. Image and Vision Computing, 24:483–497.

Yuan, M., Farbiz, F., Manders, C. M., and Tang, K. Y.

(2008). Robust hand tracking using a simple color

classiﬁcation technique. In Proc. of The 7th ACM SIG-

GRAPH Int. Conf. on Virtual-Reality Continuum and

Its Applications in Industry.

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

326