An early example is the work of Schmid and Mohr (Schmid et al., 1997), in which geometric invariance was still limited to image rotations; scaling was handled by using circular regions of several sizes. Lowe (Lowe, 1999) extended these ideas to true scale invariance. More general affine invariance has been achieved in the work of Baumberg (Baumberg, 2000), which uses an iterative scheme and a combination of multiple scales, and in the more direct, constructive methods of Tuytelaars & Van Gool (Tuytelaars et al., 1999; Tuytelaars and Gool, 2000), Matas et al. (Matas et al., 2002), and Mikolajczyk & Schmid (Mikolajczyk and Schmid, 2002). Although these methods are capable of finding high-quality correspondences, most of them are too slow for use in a real-time application such as the one we envision here. Moreover, none of these methods is especially suited for implementation on an embedded computing system, where both memory and computing power must be kept as low as possible to ensure reliable operation at the lowest possible cost.
The classic recognition scheme with local features, presented in (Lowe, 1999; Tuytelaars and Gool, 2000) and used in many applications such as our previous work on robot navigation (Goedemé et al., 2005; Goedemé et al., 2006), is based on finding one-to-one matches: between the query image and a model image of the object to be recognised, bijective matches are established. For each local feature in one image, the most similar feature in the other image is selected.
This scheme has a fundamental drawback, namely its inability to detect matches when multiple identical features are present in an image. In that case, there is no guarantee that the most similar feature is the correct correspondence. Such pattern repetitions are quite common in the real world, though, especially in man-made environments. To reduce the number of incorrect matches due to this phenomenon, classic matching techniques use a criterion such as comparing the distance to the most similar and the second most similar feature (Lowe, 1999). Of course, in the presence of pattern repetitions this practice throws away many good matches.
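For concreteness, the following is a minimal sketch of such a nearest-neighbour matcher with the distance-ratio criterion; the descriptor arrays and the 0.8 threshold are illustrative assumptions, not values from our system.

import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    # For each feature of one image, find the most similar feature in the
    # other image, and accept it only if it is clearly closer than the
    # second most similar one (Lowe, 1999).
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every feature
        j1, j2 = np.argsort(dists)[:2]              # two nearest neighbours
        if dists[j1] < ratio * dists[j2]:           # distance-ratio criterion
            matches.append((i, j1))                 # keep as a candidate match
    return matches

When a pattern is repeated, the first and second nearest neighbours have nearly equal distances, so the criterion rejects the match even when one of the candidates is correct.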
In this paper, we present a possible solution to this problem by making use of the visual word concept. Visual words were introduced (Sivic and Zisserman, 2003; Li and Perona, 2005; Zhang and Schmid, 2005) in the context of object classification. Local features are grouped into a large number of clusters, with features having similar descriptors assigned to the same cluster. By treating each cluster as a visual word that represents the specific local pattern shared by the keypoints in that cluster, we obtain a visual word vocabulary describing all kinds of such local image patterns. With its local features mapped to visual words, an image can be represented as a bag of visual words, i.e. as a vector containing the (weighted) count of each visual word in that image, which is used as the feature vector in the classification task.
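As a minimal sketch of this idea, assuming k-means clustering (as in (Sivic and Zisserman, 2003)) and the availability of scikit-learn; the vocabulary size of 1000 words and the function names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    # Cluster the descriptors of many training features; each cluster
    # centre acts as one visual word of the vocabulary.
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bag_of_words(vocabulary, descriptors):
    # Map an image's descriptors to their nearest visual word and count
    # the occurrences of each word.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / hist.sum()  # normalised (weighted) word counts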
In contrast to the bag-of-words concept often used in categorisation, in this paper we present the constellation-of-words model. The main difference is that not only the presence of a number of visual words is tested, but also their relative positions.
3 ALGORITHM
Figure 1 gives an overview of the algorithm. It consists of two phases, namely the model construction phase (upper row) and the matching phase (bottom row).
First, in a model photograph (a), local features are extracted (b). Then, a vocabulary of visual words is formed by clustering these features based on their descriptors. The corresponding visual words in the image (c) are used to form the model description. The relative location of the image centre (the anchor) is stored for each visual word instance (d).
The bottom row depicts the matching procedure. In a query image, local features are extracted (e). Matching with the vocabulary yields a set of visual words (f). For each visual word in the model description, a vote is cast at the relative location of the anchor (g). The location of the object can then be found from these votes as local maxima in a voting Hough space (h). Each of the following subsections describes one step of this algorithm in detail.
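A minimal sketch of this model construction and voting scheme follows, under simplifying assumptions: votes are cast for translation only, on a coarse accumulator grid with a cell size of 8 pixels; positions and the anchor are 2D numpy arrays in (y, x) order, and rotation and scale are ignored.

import numpy as np
from collections import defaultdict

def build_model(words, positions, anchor):
    # Model phase: for every visual word instance, store the offset from
    # its position to the image centre (the anchor).
    word_to_offsets = defaultdict(list)
    for w, pos in zip(words, positions):
        word_to_offsets[w].append(anchor - pos)
    return word_to_offsets

def vote(words, positions, word_to_offsets, shape, cell=8):
    # Matching phase: every visual word instance votes for the anchor
    # position it predicts; object locations show up as local maxima in
    # the accumulator (the Hough space).
    acc = np.zeros((shape[0] // cell + 1, shape[1] // cell + 1))
    for w, pos in zip(words, positions):
        for off in word_to_offsets.get(w, ()):
            y, x = pos + off
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                acc[int(y) // cell, int(x) // cell] += 1
    return acc

Because every instance of a repeated visual word casts its own vote, repeated patterns reinforce the correct object location instead of being discarded, which is precisely the advantage over the one-to-one matching scheme discussed above.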
Local Feature Extraction. We chose SURF as the local feature detector instead of the more commonly used SIFT detector. SURF (Bay et al., 2006; Fasel and Gool, 2007) was developed to be substantially faster than SIFT, while performing at least as well.
Interest Point Detector. In contrast to SIFT (Lowe, 1999), which approximates the Laplacian of Gaussian (LoG) with a Difference of Gaussians (DoG), SURF approximates second-order Gaussian derivatives with box filters, see figure 2. Image convolutions with these box filters can be computed rapidly using integral images as defined in (Viola and Jones, 2001).
Interest points are localised in scale and image space by applying non-maximum suppression in a 3 × 3 × 3 neighbourhood. Finally, the detected maxima of the determinant of the approximated Hessian matrix are interpolated in scale and image space.
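The following is a minimal sketch of the integral-image trick of (Viola and Jones, 2001): after a single pass over the image, the sum of pixel values inside any upright rectangle, and hence any box-filter response, can be evaluated with four array look-ups regardless of the filter size. The function names are illustrative.

import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all pixels above and to the left of
    # (y, x); a zero row and column are prepended to simplify box_sum.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    # Sum of the image over rows y0..y1-1 and columns x0..x1-1,
    # computed from four corner values of the integral image.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

A box filter response at any scale thus costs the same constant number of operations, which is what makes the SURF detector fast compared to explicit Gaussian convolutions.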