FOOD REGION DETECTION USING

BAG-OF-FEATURES REPRESENTATION AND COLOR FEATURE

Ruiko Miyano, Yuko Uematsu and Hideo Saito

Keio University, 3-14-1 Hiyoshi, 223-8522 Kohoku-ku, Japan

Keywords:

Food Region Detection, Bag-of-Features, Visual Words, Color Feature, SURF, Support Vector Machine.

Abstract:

Food image processing has recently attracted attention, because many people take photos of food. For food

image processing, recognition of captured food is an essential technology, but region detection of the food

area from captured photos is also very important procedure for food recognition. In this paper, we propose a

novel method for automatic region detection of food from photos using two kinds of features in input image.

To detect food regions, we use a method which is widely used in generic object recognition. We divide an

image into small subregions and represent each subregion as Bag-of-Features representation using local feature

descriptors and color feature. Using two features, we recognize food subregions and ﬁnally connect them as

food regions. Our experiments show that the proposed method can detect food region in high accuracy.

1 INTRODUCTION

In recent years, the number of camera users has in-

creased because digital cameras are very popular con-

sumer products. Since a human face is the most pop-

ular subject captured with the consumer digital cam-

eras, techniques for face detection have been exten-

sively developed. As well as human faces, food is

also targeted when recording food logs or memoriz-

ing special moments of travel. Additionally, there are

some applications which make food appearance more

delicious in a photo. For these reasons, food is the

subject frequently taken in everyday life.

Some researches dealing with food images have

attracted attention in recent years. Especially, au-

tomatic food image recognition is important as the

generic object recognition. Such systems aim to clas-

sify unknown food images and estimate volume, nu-

trition and so on. Using these systems, people who

often take a photo of food can easily analyze their

photos, and people can easily record everyday food

from photos.

However, these researches have some problems.

In these researches, food is located in the entire screen

or food region is estimated based on circle detection.

When there are several plates in the captured image,

therefore, they have to cut out the regions of food in

advance. Moreover, if the plate does not have circle

shape, plate detection based on circle detection will

fail. Therefore, we aim to automatically and precisely

detect food regions from single image.

2 RELATED RESEARCH

Joutou et al. proposed a food image recognition sys-

tem for 50 kinds of food images (Joutou and Yanai,

2009). They extract Bag-of-Features (BOF), color

histogram and Gabor texture feature, and then ap-

ply Multiple Kernel Learning (MKL) method to those

features in order to classify query images into 50 cat-

egories. Yang et al. proposed a fast food image recog-

nition system (Yang et al., 2010). They use pair-

wise statistics which represents geometric relation-

ship such as distance and orientation between many

pairs of local features. Puri et al. proposed a food in-

take assessment system which recognizes food types

and estimates volumes and nutrition information (Puri

et al., 2009). They recognize food types by using

color feature and texture feature, and estimate vol-

umes by 3D reconstruction.

There are many researches about object detec-

tion. Most popular research is face detection (Vi-

ola and Jones, 2004). Face detection is implemented

in most digital cameras, which is used to lock fo-

cus and adjust ﬂash. To detect face region, they use

Haar-like feature which considers characteristic pat-

terns of human faces. On the other hand, food images

do not have such particular characteristic patterns be-

cause foods have several colors, shapes and ingredi-

ents. Therefore, we focus on a method used in generic

object recognition. To detect object region based on

generic object recognition, it is useful to use subwin-

dow (Lampert et al., 2009; Wei and Tao, 2010).

709

Miyano R., Uematsu Y. and Saito H..

FOOD REGION DETECTION USING BAG-OF-FEATURES REPRESENTATION AND COLOR FEATURE.

DOI: 10.5220/0003854507090713

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 709-713

ISBN: 978-989-8565-03-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

In generic object recognition, we need to com-

pare a query image to pre-trained images. To effec-

tively compare them, the images have to be repre-

sented as low-dimensional data. The BOF representa-

tion (Csurka et al., 2004) using local feature descrip-

tors and color information (Swain and Ballard, 1991)

are most general methods.

In our proposed method, we do not recognize

types of foods but detect regions where the food is

located in a single photo. We employ sliding window

approach that slides target subregion to detect food

regions. For recognizing a subregion as food or not,

BOF representation and color feature are applied.

3 PROPOSED METHOD

In this paper, we propose a method to automatically

detect food regions from a single image. Fig. 1 shows

the overview of our proposed method. Our method

consists of two phases: training and detection.

Training ImageTraining Image

Feature Extraction

HistogramHistogram

Training phase

Detection phase

Yes

Subregion

Query Image

Query Subregion

Result

Connect Food Subregion

Histogram

Feature Extraction

Vote

Food Subregion?

Result

Connect Food Subregion

Vote

Result

Connect Food Subregion

Food Subregion?

Vote

Result

Connect Food Subregion

Histogram

Food Subregion?

Vote

Result

Connect Food Subregion

Figure 1: Overview.

In training phase, we extract two kinds of image

features, which are BOF representation and color fea-

ture, from training images (730 food images). The

food categories which can be detected depend on

training images. We collected eight kinds of food,

which are pasta, salad, curry, hamburger, ramen,

pizza, rice omelet and okonomiyaki.

In detection phase, we cut out a subregion from

a query image as a query subregion, and extract two

image features from the query subregion in the same

way as training phase. The size of a subregion is

200 × 200 pixels and we slide a subregion with 20

pixels. The subregion recognized as food using the

image features is called a food subregion. We slide

a query subregion around the query image and con-

nect only food subregions as food region in the query

image.

3.1 Features

In this section, we explain two kinds of image fea-

tures. In contrast with human face images, food im-

ages do not have characteristic patterns. Therefore

we combine BOF representation using local feature

descriptors, and color feature in the image.

3.1.1 Bag-of-Features Representation

Local characteristic patterns in images are important

in generic object recognition. To utilize local char-

acteristic patterns, the BOF representation is widely

used. In this representation, an image is represented

as the distribution of feature descriptor vectors.

In training phase, we extract SURF from training

images, which are 100 food images and 100 non food

images , and cluster all feature points into 500 clus-

ters using K-Means++ method. Then we deﬁne the

centroids of 500 clusters as Visual Words.

We compute histogram of Visual Words in both

training and detection phase. We extract SURF from

training image or query subregion. Then we quantize

each feature point into 500 Visual Words (Fig. 2(a))

and count the number of occurrences (Fig. 2(b)). In

Fig. 2, the different visual words are represented by

different colors for the detected feature points. Fi-

nally, we normalize histogram using t f-id f weight-

ing. In this way, we compute Visual Word histogram,

which is so-called BOF representation.

(a) Visual Words

visual word

frequency

(b) Histogram of visual words

Figure 2: BOF representation.

3.1.2 Color Feature

Not only local feature points but also color informa-

tion is important in food images, because most of

foods consists of ingredients that have speciﬁc colors.

Figure 3: Three parts for color feature.

To extract color feature, we divide an image into

3 parts concentrically as shown in Fig. 3. This is be-

cause most foods are placed in a circle. We also divide

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

710

RGB brightness into 4 parts. Then we create 64-bin

color histogram from each parts. Totally, we extract a

192 dimensional color feature from each image.

3.2 Recognition

In this section, we explain the method to recognize

subregions as food in a query image. For the recog-

nition, our method compares two image features be-

tween each query subregion and training images. We

propose two methods to compare these image fea-

tures. One is histogram intersection and the other is

Support Vector Machine (SVM).

3.2.1 Histogram Intersection

First, we explain the method to compute similarity of

image features using histogram intersection (Swain

and Ballard, 1991). We compute the histogram in-

tersection for each histogram between the query sub-

region and the training images, and compute the aver-

age of two histogram intersections. If the average is

larger than a pre-deﬁned threshold, these two images

are regarded as similar images.

We calculate histogram intersection between one

query subregion and all training images, and count

the number of similar images. If the number of sim-

ilar images is larger than a pre-deﬁned threshold, the

query subregion is recognized as food subregion.

3.2.2 Support Vector Machine

Second, we explain the method to classify query sub-

regions into categories of food using SVM. SVM is

one of the common classiﬁcation methods that has

high generalization performance, even the dimension

of data is very high. The simplest SVM can classify

data into two classes. It needs known positive data

and negative data, and estimates hyperplane which

correctly divides these data into two classes.

In training phase, we create 2-class SVM classiﬁer

for each food category and each feature using training

images, which are 730 food images and 100 non food

images. In detection phase, we predict class of query

subregion using SVM classiﬁers created in training

phase.

3.3 Connection of the Food Subregions

In this section, we explain the method to connect the

food subregions.

In section 3.2, we recognize several food subre-

gions by the sliding window recognition. We vote for

all pixels included in the food subregions as show in

Fig. 4(a). If the voted count of each pixel is larger

1 1 1 1

1 2 2 2 1

1 2 3 3 2 1

1 1 2 2 1 1

1 1 1 1

(a)

1 1 1 1

1 1

1 1 1

(b) (c)

Figure 4: Connection of the food subregions.

than a pre-deﬁned threshold, that pixel is regarded as

inside the food region, which is labeled as shown in

Fig. 4(b). Finally, as shown in Fig. 4(c), we draw the

bounding box of those pixels.

4 EXPERIMENTS

We carried out 2 experiments, which are evaluation of

two features and detection of food regions, to evaluate

our proposed method.

4.1 Evaluation of Two Features

In this experiment, we use 100 food images and 50

non food images as query images in order to evaluate

the effectiveness of two features; BOF representation

and color feature. Here, we do not cut out query sub-

regions via sliding window but create histograms of a

whole image and recognize a query image as food or

non food.

Table 1: The recognition results.

Correct answers

Food Non food

Accuracy

rate

BOF (HI) 91 / 100 40 / 50 87.3 %

Color (HI) 90 / 100 34 / 50 82.7 %

Both (HI) 96 / 100 38 / 50 89.3 %

BOF (SVM) 98 / 100 30 / 50 85.3 %

Color (SVM) 100 / 100 33 / 50 88.7 %

Both (SVM) 98 / 100 50 / 50 98.7 %

Table 1 shows the recognition results using both

approaches. It seems the combination of BOF rep-

resentation and color feature is effective because the

accuracy rates using both two features is 89.3% and

98.7 % which is larger than the results with a single

feature. We can also ﬁnd that SVM approach is better

than histogram intersection approach because it has

general versatility. Histogram intersection approach

depends on training images because we compare a

query subregion and all training images. Therefore

it cannot be applied to data which is not included in

training images.

FOOD REGION DETECTION USING BAG-OF-FEATURES REPRESENTATION AND COLOR FEATURE

711

4.2 Detection of Food Regions

In this section, we show the result of food region

detection. We compare three methods, which are

circle detection, histogram intersection and SVM.

Histogram intersection and SVM are based on the

method explained in section 3.2. Circle detection is

used in Puri’s research (Puri et al., 2009). We recog-

nize detected circular regions as food after detecting

circle by Hough transform.

To compare them, we compute difference between

ground truth images and each detected image. Fig.

5(b) shows manually created ground truth image of

Fig. 5(a). Fig. 5(c) shows a mask of detection result.

And Fig. 5(d) shows the difference between Fig. 5(b)

and Fig. 5(c). Red region represents false negative,

and green region represents false positive.

(a) (b) (c) (d)

Figure 5: Computation of difference.

We computed false detection rate R according to

following equation

R =

false negative + false positive(pixel)

width× height(pixel)

. (1)

If ground truth image is same as detection result, dif-

ference is 0 %.

Table 2 shows the average of false detection rates

between each ground truth image and each detection

result (389 food images and 15 non food images).

Table 2: False detection rate.

Circle

detection

Histogram

intersection SVM

R 35.5 % 29.2 % 25.4 %

Fig. 6 shows the examples of detection result. Fig.

6(a) are based on histogram intersection approach.

Fig. 6(b), 6(c) are based on SVM approach. We can

see that food regions of our eight categories are al-

most detected from these images. We compare three

methods in Fig. 7. The food region can be detected

using histogram intersection and SVM, but it cannot

be detected using circle detection. In circle detection,

it depends on a shape of plate. For example, if the

plate is square, or whole plate is not in query image

(a) (b) (c)

Figure 6: Food region detection.

(a) Circle detection (b) Histogram in-

tersection

Figure 7: Comparison of three methods.

like Fig. 7(a), detection will be fail. Therefore, our

sliding window approach is efﬁcient, even the plate

has non circular shape.

5 CONCLUSIONS

We proposed an automatic detection method of food

regions from a single image. By using BOF repre-

sentation and color feature, we recognized subregions

as food or non food, and then all food subregions are

connected to create food regions.

We carried out two experiments to evaluate our

proposed method. First, we recognized whole query

image to evaluate the effectiveness of two features.

Because of combining two features, we achieved 98.7

% accuracy rate based on SVM approach. Second, we

applied our method to query images. We can auto-

matically detect food regions from query images due

to sliding window approach, and false detection rate

is 25.4 % based on SVM approach.

As future works, we plan to add more features

to increase recognition accuracy, improve the method

for connecting food subregions and extend training

images to increase categories of food.

ACKNOWLEDGEMENTS

A part of this research was supported by the Research

Grant of Keio Leading-edge Laboratory of Science &

Technology.

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

712

REFERENCES

Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and

Bray, C. (2004). Visual categorization with bags of

keypoints. In Workshop on statistical learning in com-

puter vision, ECCV, volume 1, pages 1–22.

Joutou, T. and Yanai, K. (2009). A food image recognition

system with multiple kernel learning. In Proc. of ICIP.

Lampert, C. H., Blaschko, M. B., and Hofmann, T. (2009).

Efﬁcient subwindow search: A branch and bound

framework for object localization. IEEE Trans. on

PAMI, 31(12):2129–2142.

Puri, M., Zhu, Z., Yu, Q., Divakaran, A., and Sawhney, H. S.

(2009). Recognition and volume estimation of food

intake using a mobile device. In Proc. of WACV, pages

1–8.

Swain, M. J. and Ballard, D. H. (1991). Color indexing. In-

ternational Journal of Computer Vision, 7(1):11–32.

Viola, P. A. and Jones, M. J. (2004). Robust real-time face

detection. International Journal of Computer Vision,

57(2):137–154.

Wei, Y. and Tao, L. (2010). Efﬁcient histogram-based slid-

ing window. In Proc. of CVPR, pages 3003–3010.

Yang, S., Chen, M., Pomerleau, D., and Sukthankar, R.

(2010). Food recognition using statistics of pairwise

local features. In Proc. of CVPR, pages 2249–2256.

FOOD REGION DETECTION USING BAG-OF-FEATURES REPRESENTATION AND COLOR FEATURE

713