The weight of the most similar patch is one, and the weights gradually decrease with distance; the weight of the K-th patch is zero.
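The exact weighting function is not restated here, but a minimal sketch in Python, assuming a simple linear decrease between the distance of the most similar patch and that of the K-th patch (the function and variable names are our own, for illustration):

```python
import numpy as np

def knn_weights(distances):
    """Linearly decreasing weights for the K most similar patches.

    The most similar patch gets weight 1, weights decrease with distance,
    and the K-th (least similar retained) patch gets weight 0.
    `distances` is assumed to be sorted in ascending order.
    """
    d = np.asarray(distances, dtype=float)
    d_min, d_max = d[0], d[-1]
    if d_max == d_min:                    # degenerate case: all patches equally distant
        return np.ones_like(d)
    return (d_max - d) / (d_max - d_min)  # 1 for the nearest patch, 0 for the K-th

# Example with K = 5 retrieved patches:
print(knn_weights([0.2, 0.5, 0.9, 1.4, 2.0]))  # [1.0, 0.833, 0.611, 0.333, 0.0]
```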
2.3 Information Integration
In this section, we explain how local and global information are integrated. The probabilities from the local and the global information are calculated at every pixel. Since we assume that the local and global information are independent, we integrate the two probabilities by taking their product. The integrated probability is therefore defined as
$$P_i(c) = P_i^{\mathrm{local}}(c) \cdot P_i^{\mathrm{global}}(c) \tag{5}$$

where $P_i(c)$, $P_i^{\mathrm{local}}(c)$ and $P_i^{\mathrm{global}}(c)$ express the probability of the $c$-th class for the $i$-th pixel in an image. After integration, the class label $L_i$ of the $i$-th pixel is defined as

$$L_i = \operatorname*{arg\,max}_{c} P_i(c). \tag{6}$$
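A minimal sketch of Eqs. (5) and (6) in Python, assuming the per-pixel class probabilities are stored as H × W × C arrays (the array layout and the names below are assumptions for illustration, not part of the paper):

```python
import numpy as np

def integrate_and_label(p_local, p_global):
    """Integrate local and global class probabilities (Eq. 5) and pick
    the most probable class per pixel (Eq. 6).

    p_local, p_global: arrays of shape (H, W, C) holding P_i^local(c)
    and P_i^global(c) for every pixel i and class c.
    Returns an (H, W) map of class labels L_i.
    """
    p = p_local * p_global            # element-wise product (independence assumption)
    return np.argmax(p, axis=-1)      # L_i = argmax_c P_i(c)
```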
3 EXPERIMENTS
We show the experimental results of our method. In section 3.1, we describe the MSRC21 dataset (Shotton et al., 2006) and how we evaluate accuracy. In section 3.2, we show a preliminary experiment. In section 3.3, we show the accuracy of our method. In section 3.4, we compare our method with related work.
3.1 How to Evaluate Accuracy
We use the MSRC21 dataset (Shotton et al., 2006) in
the following experiments. The dataset has 591
images and contains 21 classes (building, grass, tree,
cow, sheep, sky, aeroplane, water, face, car, bicycle,
flower, sign, bird, book, chair, road, cat, dog, body,
boat). In this paper, we use 276 images for training,
59 images for validation, and 256 images for testing.
Since seven patches are extracted from each image, the number of training patches is 1932 (= 276 × 7), the number of validation patches is 413 (= 59 × 7), and the number of test patches is 1792 (= 256 × 7).
We use pixel-wise accuracy and class average accuracy for evaluation. Class average accuracy is the average, over the 21 classes, of the percentage of correctly labeled pixels in each class. Pixel-wise accuracy is the percentage of correctly labeled pixels over all pixels. Since the number of pixels differs between classes, the two measures take different values.
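A short sketch of how the two measures can be computed from flattened ground-truth and predicted label maps; treating label 255 as "void" and excluding such pixels is our assumption (common in MSRC-style evaluation), not something stated in the paper:

```python
import numpy as np

def evaluate(gt, pred, num_classes=21, void_label=255):
    """Return (pixel-wise accuracy, class average accuracy).

    gt, pred: 1-D integer arrays of ground-truth / predicted labels.
    Pixels labeled `void_label` in the ground truth are excluded
    (an assumption, see the lead-in above).
    """
    valid = gt != void_label
    gt, pred = gt[valid], pred[valid]

    pixel_wise = np.mean(gt == pred)                  # over all labeled pixels

    per_class = []
    for c in range(num_classes):
        mask = gt == c
        if mask.any():                                # skip classes absent from gt
            per_class.append(np.mean(pred[mask] == c))
    class_average = np.mean(per_class)                # average of per-class recalls

    return pixel_wise, class_average
```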
3.2 Preliminary Experiment for Global
Information
We show class average and pixel-wise accuracy after integration in Figure 4. These accuracies were obtained on the validation images. Beyond K = 40, both class average and pixel-wise accuracy are almost unchanged, while a larger K increases the computational cost. Therefore, we set K = 40 in the following experiments.
Figure 4: Class average and pixel-wise accuracy (%) after integration for different values of K in the K-NN search.
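The choice of K can be reproduced with a simple sweep over the validation images; the sketch below is only illustrative and assumes a hypothetical `predict_labels(image, K)` callable that runs the whole labeling pipeline for a given K, together with the `evaluate` sketch from section 3.1:

```python
import numpy as np

def sweep_K(validation_images, validation_gt, predict_labels,
            candidate_K=(10, 20, 30, 40, 50, 60)):
    """Compute both accuracies on the validation set for several K.

    predict_labels(image, K) -> (H, W) label map  (hypothetical helper)
    Returns {K: (pixel-wise accuracy, class average accuracy)}.
    """
    gt = np.concatenate([g.ravel() for g in validation_gt])
    results = {}
    for K in candidate_K:
        pred = np.concatenate([predict_labels(img, K).ravel()
                               for img in validation_images])
        results[K] = evaluate(gt, pred)   # see the evaluate() sketch above
    return results
```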
3.3 Results on the MSRC21 Dataset
Table 1 shows the accuracies of our method. We see that integrating local and global information is effective for image labeling. The accuracy of the cow class is 48.6% with only local information, while it is 57.9% with only global information. For the building class, the accuracy is 41.1% with only local information and 33.4% with only global information. Thus, local and global information complement each other, and the accuracy of our method is improved by using both.
We show examples of labeling results in Figure 5. As shown in Figure 5, chair, boat and bird are not labeled well. These classes have two points in common: their within-class variance is large and the number of training samples is small. The boat class in the MSRC21 dataset includes various vessels, e.g. small craft and large passenger ships. The chair class includes various kinds of chairs, e.g. plastic and wooden chairs. We consider that classes with large within-class variance cannot be characterized well by local color and texture features. The label from global information is estimated by voting with the ground-truth labels attached to the similar patches. Hence, a class with a small number of training samples receives few votes and is not easily classified. For example, the grass class, which is classified with high accuracy, has 2,574,052 pixels in
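A rough, simplified sketch of the weighted voting behind the global cue (a single vote per retrieved patch is shown here for brevity; in the paper the votes come from the ground-truth labels attached to the similar patches, and the weights follow the distance-based scheme of section 2.2). It also illustrates why classes with few training samples collect few votes:

```python
import numpy as np

def global_class_probabilities(neighbor_labels, neighbor_weights, num_classes=21):
    """Per-class probability from weighted voting over the K most similar
    training patches (simplified: one vote per retrieved patch).

    neighbor_labels:  (K,) ground-truth class labels of the retrieved patches
    neighbor_weights: (K,) distance-based weights (nearest -> 1, K-th -> 0),
                      e.g. produced by the knn_weights() sketch in section 2.2
    """
    votes = np.zeros(num_classes, dtype=float)
    for label, weight in zip(neighbor_labels, neighbor_weights):
        votes[int(label)] += weight
    total = votes.sum()
    # Classes with few training samples rarely appear among the retrieved
    # patches, so they collect few votes (e.g. chair, boat, bird).
    return votes / total if total > 0 else np.full(num_classes, 1.0 / num_classes)
```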