Robust Method of Vote Aggregation and Proposition Veriﬁcation for

Invariant Local Features

Grzegorz Kurzejamski, Jacek Zawistowski and Grzegorz Sarwas

Lingaro Sp. z o.o., Puławska 99a, 02-595 Warsaw, Poland

Keywords:

Computer Vision, Image Analysis, Multiple Object Detection, Object Localization, Pattern Matching.

Abstract:

This paper presents method for analysis of the vote space created from the local features extraction process

in a multi-detection system. The method is opposed to the classic clustering approach and gives a high level

of control over the clusters composition for further veriﬁcation steps. Proposed method comprises of the

graphical vote space presentation, the proposition generation, the two-pass iterative vote aggregation and the

cascade ﬁlters for veriﬁcation of the propositions. Cascade ﬁlters contain all of the minor algorithms needed for

effective object detection veriﬁcation. The new approach does not have the drawbacks of the classic clustering

approaches and gives a substantial control over process of detection. Method exhibits an exceptionally high

detection rate with conjunction with a low false detection chance in comparison with alternative methods.

1 INTRODUCTION

Object detection based on local features is well known

in the computer vision ﬁeld. Many different re-

searches brought about different features and meth-

ods of scene analysis in search of a particular object.

Recently developed feature points are proven to be

well suited for a speciﬁc object detection, rather than

a generalised object’s class identiﬁcation. As many

applications of local features has been evaluated, the

ability to describe selective elements of rich graph-

ics is the main purpose of invariant local features in

many ﬁelds. Amongst the most popular local features

are e.g. SIFT, SURF, BRISK, FREAK, MSER. These

are commonly described as feature points or feature

regions. They are easy to manipulate and to match.

Use of such local features gives not only the ability to

describe the object in many ways, but also to achieve

invariance for basic object and image transformation,

as skew, rotation, blur, noise. Invariant local features

used in conjunction with invariant characteristic re-

gion detectors, such as Harris-Afﬁne or SIFT detector,

provide data for thorough scene-object analysis, lead-

ing to detection of an object in the scene. Giving an

appropriate set of features one can determine the ex-

act position of the object in the scene with a speciﬁed

scale, rotation and even minor, linear deformations.

The straightforward approach to a detection task

is to identify local features in the scene and the pat-

tern and match them against themselves, using an ap-

propriate metric. Common evaluations of such sys-

tems use brute KNN classiﬁcation or its derivatives

as FLANN KNN search or BBF search. These give

good approximation of, theoretically ideal, results

with substantially lower computational cost. Some

of the applications make use of a LSH hashing but

this approach does not provide practical distance data

needed for further analysis. Matching local features

provides set of correspondences, that can be ﬁltered

later. Under the assumption, that the scene contains

one or no instance of the object, correspondence can

be put into the object-related or noise-related class. To

distinguish which class a particular correspondence

belongs to, few methods have been developed. Fre-

quently used method for such purpose is RANSAC,

giving very good results, even for a high level of

noise-related correspondences. After the classiﬁca-

tion of correspondences the model can be assumed

and a homography can be calculated. Well designed

parameters and ﬁlters can lead to high detection rate

and low false detection chance. RANSAC and quan-

titative analysis of correspondence data become in-

efﬁcient, when a contribution of noise-related corre-

spondences in whole correspondence group grows.

Because of that such approach is not sufﬁcient for

scenes, where the objects occupy small share of the

image. Methods mentioned above will not work well

with multiple objects present in the image as well.

Systems for multi-detection purposes incorporate

divide and conquer approach. Each correspondence

252

Kurzejamski G., Zawistowski J. and Sarwas G..

Robust Method of Vote Aggregation and Proposition Veriﬁcation for Invariant Local Features.

DOI: 10.5220/0005267002520259

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 252-259

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

can be assigned to one of N+1 classes, where N is

the number of objects in the scene. There is no

known, straightforward method of assigning corre-

spondences. Most applications use correspondences

as votes in a multidimensional space. Vote space

can be clustered with common clustering algorithms.

Each cluster can be processed with a single-object de-

tection algorithm. Clustering approaches can be di-

vided into two groups. The ﬁrst one consists of sparse

clustering, where each cluster should contain all the

needed vote data of a speciﬁc class. Such methods

show low detection rate because of a far from ideal

clustering process and a high level of parametrization.

Second group is dense clustering, where clusters may

contain only small portion of a particular correspon-

dence class. Its analysis leads to creating or support-

ing hypothesis of the object’s occurrence in the scene.

The ﬂagships in that matter are Hough-like methods.

Such approaches show high detection rate but high

false positive rate as well. There are some works

that try to segment the scene with known context, as

shown in the work of Iwanowski et al. (Iwanowski

et al., 2014).

In our tests both clustering approaches lacked the

ability to attain a very high detection rate with a very

low false positive rate at the same time. For our test

cases a processing power is not a limitation and the

images are of a very high quality. Our test data con-

tains from zero to up to 100 objects per image and

presents different environmental conditions. Most of

the stat-of-the-art publications do not test detection

capabilities for such complex tasks. We found that

current approaches cannot maintain good detection

rate to false positive rate ratio on satisfactory level in

many real life applications.

This paper presents the method of vote space anal-

ysis, a part of invention shown in (Kurzejamski et al.,

2014). The method can be adjusted to a vast variety of

object detection purposes, where the effectiveness and

a low false positive rate is crucial. The method has

been developed to work well with huge amount of fea-

ture data, extracted from high quality images. Most of

the algorithms used in the new approach come with

a logical justiﬁcation. The new method uses the it-

erative vote aggregation, starting from proposition’s

positions. Propositions are generated from a graph-

ical vote space analysis. Aggregated data undergoes

analysis and ﬁltration. Whole process has a two-pass

model, that makes the method robust to some speciﬁc

object positioning in the scene. Cascade of a specially

selected set of ﬁlter algorithms has been utilized to re-

ject most of the false positive detections.

2 RELATED WORK

Local features in the image can be tracked a long way

in the literature. We present state-of-the-art feature

points extracting and describing methods, that can be

used in our method, and similar frameworks for multi-

object detection purposes developed through the last

years.

2.1 Feature Points

Our method should be used with conjunction with

scale-invariant and rotation-invariant features for the

best results. Usage of local features lacking any of

this characteristics may come with a need for rejec-

tion of some parts of our method, but can be imple-

mented nevertheless.

The best known feature points, up to this point, are

SIFT points developed by Lowe (Lowe, 1999), which

became a model for various local features bench-

marks. The closest alternative to SIFT is SURF (Bay

et al., 2008), that comes with a lower dimension-

ality and, in the result, a higher computing efﬁ-

ciency. There are also known attempts to incorpo-

rate additional enhancements into SIFT and SURF

as PCA-SIFT (Ke and Sukthankar, 2004) or Afﬁne-

SIFT (Morel and Yu, 2009). SIFT and SURF and

its derivatives are computationally demanding dur-

ing matching process. In last years there has been

big development in feature points based on binary

test pairs, that can be matched and described in a

very fast manner. The ﬂagships of this approach

are BRIEF (Calonder et al., 2010), ORB (Rublee

et al., 2011), BRISK (Leutenegger et al., 2011) and

FREAK (Alahi et al., 2012) features. Most of the

cited algorithms can be used to create dense and

highly discriminative voting space, which holds sub-

stantial object correspondence data needed to accom-

plish many of the real-world detection tasks.

2.2 Frameworks

There are few approaches to conduct multi-object

multi-detection, meaning detecting multiple differ-

ent objects on the scene, where any object can be

visible in multiple places. Viola and Jones (Viola

and Jones, 2001) developed cascade of boosted fea-

tures, that can efﬁciently detect multiple instances of

the same object in a one pass of the detection pro-

cess. The method needs a time consuming, learn-

ing process with thousands of images. Method has

been mostly tested on general objects, as people,

cars, faces. Most straightforward method for multi-

detection is using all of the sliding windows as used,

RobustMethodofVoteAggregationandPropositionVerificationforInvariantLocalFeatures

253

for example, in Sarwas’ and Skoneczny’s work (Sar-

was and Skoneczny, 2015). Most of them are un-

fortunately computationally expensive. High effec-

tiveness can be achieved with Histogram of Oriented

Gradients (Dalal and Triggs, 2005) and Deformable

Part Models (Felzenszwalb et al., 2010). The biggest

drawbacks for our application is that Deformable Part

Models needs learning stage and Histogram of Ori-

ented Gradients is not rotation invariant. Blaschko

and Lampert in (Blaschko and Lampert, 2008) uses

SVM to enhance the sliding window process. Ef-

ﬁcient subwindows search has been used in (Lam-

pert et al., 2008). In addition, branch-and-bound

approaches, as in (Yeh et al., 2009), are promising

for multi-detection purposes with conjunction with

Bag-of-words descriptors. Lowe (Lowe, 2004) pro-

posed generalized Hough Transform for clustering

vote space with SIFT correspondence data. Authors

of (Azad et al., 2009) created a 4D voting space

and used combination of Hough, RANSAC and Least

Squares Homography Estimation in order to detect

and accept potential object instances. Zickler in et

al. (Zickler and Efros, 2007) used angle differences

criterion in addition to RANSAC mechanisms and

vote number threshold. Zickler et al. in (Zickler and

Veloso, 2006) used custom probabilistic model in ad-

dition to Hough algorithm.

In our system’s application we could use only one

generic pattern image per object so we rejected most

of the learning-based global descriptors.

3 ALGORITHM

The algorithm presented by authors is built upon two

mechanisms: the vote spaces creation and a vote ag-

gregation for each of the vote spaces created. The

vote space is created for each pattern. Its adjacency

data is projected onto the (X, Y) plane, creating vote

images (one for each vote space). The vote images

are analysed in search for object’s position proposi-

tions. This mechanism is shown in the Algorithm 1.

The aggregation process is performed for each vote

space and for its each proposition, starting from the

proposition with the highest adjacency value. Aggre-

gation consists of two passes with slightly different

vote gathering approaches. The ﬁrst pass is needed

to estimate the detected object’s area in the scene, so

the second aggregation pass would gather only votes

considered to be from that particular object’s instance.

The structure of each pass is presented in Algorithm

Algorithm 1: Vote Data, Vote Image and proposi-

tions creation.

Data: Original Patterns (OPT), Scene Image

(SCN)

Result: Vote Data and propositions for

object’s centres for each pattern.

1 Feature points extraction on OPT and SCN;

2 foreach pattern in OPT do

3 Find correspondences (COR) between

pattern and SCN feature points;

4 foreach correspondence in COR do

5 Reject if has low distance value;

6 Reject if has high hue difference value;

7 Calculate adjacency value;

8 end

9 Creation of vote space (VS) from COR;

10 Creation of vote image (VI) from VS;

11 Search for propositions (PR) in VI;

12 Sort PR list;

13 end

Algorithm 2: Vote aggregation and detection accep-

tance.

Data: VI, VS, PR

Result: Occurrences (OCR) in the SCN for a

particular pattern

1 foreach Proposition in PR do

2 Gather all votes in local area from VS;

3 Unique ﬁltering for gathered votes (V);

4 Cascade ﬁltering for V;

5 if not rejected by cascade ﬁltering then

6 Estimate object’s area;

7 Gather all votes with a Flood Fill

algorithm;

8 Unique ﬁltering for new V;

9 Cascade ﬁltering for new V;

10 if not rejected by cascade ﬁltering then

11 Calculate object’s area;

12 Create occurrence entry in OCR;

13 Erase all vote data in occurrence’s

area in VS and VI;

14 else

15 reject proposition

16 end

17 else

18 reject proposition

19 end

20 end

3.1 Vote Image Creation

First part of our method is the vote space and vote

image creation (lines 9 and 10 of Algorithm 1). Vote

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

254

space consists of multiple dimensions: X, Y, Scale,

Rotation and Distance. Each vote contains speciﬁc

X and Y position of the center of the object. The

Distance may be the result of using speciﬁc metric

for particular feature points. For SIFT the standard

procedure is to use L2 distance for its feature vector,

which contains gradient data in the area around the

characteristic point. One may use some additional in-

formation in distance calculation, as color difference.

Someone can use ranking method as the LSH hashing

instead of the L2 metric as well.

Vote image is the projection of the adjacency data

available in vote space onto X and Y dimensions.

Vote image has one intensity channel, created by

normalizing adjacency sum cue. Another approach

would be to use the distance value instead of adja-

cency as the main cue. We found L2 metric, as well

as many other distance-based approaches, as insufﬁ-

cient.

Votes in the vote spaces are built upon ﬁltered cor-

respondence sets. The distance threshold used in line

5 of Algorithm 1 was calculated as:

thr =

MIN(V ) + MAX(V )

, (1)

where V is the votes group and the MIN and MAX

operators return the value of a vote with minimal and

maximal distance value from the group. The rejection

function D is presented in equation 2.

D(v) =



accept, dist(v) ≤ thr

re ject, dist(v) > thr

. (2)

We transform distance value into normalized adja-

cency value in range from 0 to 1 (line 7 of Alg. 1). 1

indicates perfect match. 0 indicates near to rejection

difference between feature points. We transformed

distance values into adjacency (adj) values with a spe-

ciﬁc function:

adj(v) = 1 −



dist(v)

thr



. (3)

Adjacency values are gathered in a single chan-

nel, gray vote image. Vote image can be optionally

normalized for visualization purposes. Such normal-

ized image has been shown in Figure 1. If the feature

extraction and matching process are highly discrim-

inative, the object instances in the scene should be

recognizable by a human. Manual veriﬁcation of vote

image gives some level of valuable insight into votes

intersperse in the image and a level of a false votes

groupings recognizable by a human.

Last step of Algorithm 1 is the search for propo-

sitions in a vote image. Proposition is a point in

the vote image and corresponding part of vote space,

where the potential object’s center is located. We used

Good Features To Track by Tomasi and Shi (Shi and

Tomasi, 1994) to detect multiple local maximas in the

vote image and used them as the propositions. The

number of propositions should be much higher than

a number of objects in the image. It is trivial to set

the Good Features To Track to ﬁnd all the important

points in the image, but it leads to generation of thou-

sands of propositions. Number of propositions will

signiﬁcantly impact the algorithm’s processing time,

so it is not possible to ignore the need for a trade-off

in detector’s parameter adjustment. For each propo-

sition’s X and Y position the adjacency sum for cor-

responding votes in vote space is calculated and used

for sorting purposes in line 12. The highest adjacency

value proposition should be the ﬁrst taken later into

a vote aggregation process. As the adjacency sum is

proportional to the channel value in the vote image,

the cue for sorting stage is easy to compute. Sort-

ing the propositions ensures, that the strongest vote

grouping will be processed ﬁrst. In case of the posi-

tive object recognition, the vote data corresponding to

object’s detection area will be erased from vote space

and image.

3.2 Vote Aggregation

Second part of our approach contains a vote aggre-

gation mechanism. Vote aggregation starts from a

proposition’s position, which should be the center of

a local vote grouping in the vote image. Data of the

vote groupings can signiﬁcantly vary for different ob-

ject instances in the scene. The best instances can be

represented by hundreds of votes, when the weakest

positive object response can be connected with only

a few. Generic clustering may ignore such clusters

and merge it with the bigger ones. Generic clustering

algorithms has generic parameters, that are hard to ad-

just with object-oriented logic or even intuition. Some

clustering approaches tend to cluster all the available

vote data, even if the noise (false correspondences)

ﬁlls most of the vote space.

We propose an iterative 2-pass vote aggregation

process for selective clustering purposes. In each

pass the unique ﬁltering and the cascade ﬁltering take

place, which reject false positive detections. Two pass

design prevents situations in which aggregation area

contains multiple objects. Pass one of the aggregation

collects all the votes in local area of proposition’s po-

sition (line 2 of Alg. 2). The size of a local area may

be a function of a corresponding pattern size. After

gathering of all the votes in the local area, the unique

ﬁltering is performed and the resulting group of votes

RobustMethodofVoteAggregationandPropositionVerificationforInvariantLocalFeatures

255

(a) Vote image (b) Blurred and normalized vote image

Figure 1: Sample of vote image generated while localizing red, herbal tea casing.

is tested with a cascade of ﬁlters (lines 3 and 4 of Alg.

2).

In the second pass of the process the aggregation is

conducted with Flood Fill algorithm, starting from the

proposition’s position (line 7 of Alg. 2). The Flood

Fill range is limited to a scaled down object’s area.

Such limitation can be constructed with a scale and

rotation estimation from the ﬁrst pass of the aggre-

gation. The limitation ensures, that the aggregation

process will not collect the votes from neighbouring

object instances. Second pass of the algorithm con-

tains unique ﬁltering and cascade ﬁltering as well, as

the vote collection may be different in this pass.

For each group of votes, a unique ﬁltering should

be performed in each pass (lines 3 and 8 of Alg.

2). Unique ﬁltering preserves only one vote with the

highest adjacency corresponding to the same feature

point in the pattern. We can do so, because we want

the aggregated votes to be connected with only one

object. If multiple votes are connected with one spe-

ciﬁc feature in the pattern we can assume that only the

strongest vote isn’t the noise.

Most of the feature point detectors incorporate

mechanisms of rejecting the points located along the

edges. Unfortunately, this mechanisms work only in a

micro scale. In high resolution some graphical struc-

tures, that for human seem as a straight edge, has

a very complicated, uneven shape for characteristic

points detector. Characteristic points located along

the edges have similar features, so may be matched

with the same feature in the pattern. It leads to gen-

eration of many false propositions, which can some-

times be accepted by a cascade ﬁlters.

Some of the false positive detections in our ex-

periments were initiated as a bunch of feature points

placed along a simple, steep gradients and edges. For

instance, when the scene presented product shelves,

more than a half of false detections contained edge

of the shelf near the center and its vote data present

mostly along the shelf’s edge.

3.3 Cascade Filtering

Cascade ﬁltering (lines 4 and 9 of Alg. 2) is a pro-

cess of validating vote group with a cascade of ﬁl-

ters. Each ﬁlter can accept aggregated votes or reject

them. Any rejection will result in dropping the aggre-

gation process and removing the processed proposi-

tion from the propositions sorted queue. No vote data

is removed from vote space or vote image in that sit-

uation. If all the ﬁlters in ﬁrst pass accepts the vote

group, process may estimate the size and rotation of

the object represented by the majority of votes (line

11 of Alg. 2).

Cascade ﬁlters comprise of: (1) vote count thresh-

olding, (2) adjacency sum thresholding, (3) scale vari-

ance thresholding, (4) rotation variance thresholding,

(5) feature points binary test, (6) global normalised

luminance cross correlation thresholding. First pass

of the vote aggregation uses ﬁlters: (1), (2), (3) and

(4). Second pass of the process uses ﬁlters: (3), (4),

(5) and (6).

Vote count thresholding is a simple ﬁlter, thresh-

olding number of votes in the aggregated group.

Lowe in his work (Lowe, 2004) proposed generalised

Hough transform for object detection. In this method

he has assumed that only three votes are enough to

identify the object. Unfortunately, such assumption

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

256

leads to many false positive detections. Three local

features are not enough to describe complex, generic

graphics. We tested vote count thresholding for val-

ues from 3 up to 20. We found 6 as the optimal value

for ﬁltering out too weak responses. If the vote group-

ing represents real object instance and has less than 6

votes, it means that the prior algorithm processes has

too low effectiveness.

Adjacency sum thresholding rejects all the groups

of votes with sum of adjacency values less than a

threshold value. This ﬁlter in certain circumstances

can be used instead of the vote count thresholding.

Nevertheless the rejection data from these two ﬁlters

may give an insight into vote certainty levels of the

detection. Even huge vote groupings with more than

100 votes may have a very low adjacency sum value.

Scale variance thresholding rejects all the groups

of votes with a scale value variance higher than the

threshold value. One may rebuild this ﬁlter into mech-

anism separating noise signal from positive detection

signal with a Gaussian model. For our purposes such

method is computationally too expensive. Simple

variance thresholding rejects many false detections

and is easy to compute.

Rotation variance thresholding rejects all the

groups of votes with rotation value’s variance higher

than the threshold value. Rotation variance threshold-

ing works analogically to scale variance thresholding

but using the rotation values. A rotation variance is

not straightforward to compute. We set twelve buck-

ets for rotation values and choose the three buckets

with the highest count number. Its resultant was taken

as an average rotation. All the values has been rotated

so the average rotation was assigned to 180 degrees.

Then the variation in regards to 180 degrees has been

computed and used for thresholding.

Feature binary test uses feature points correspon-

dence data preserved in each vote. We created mul-

tiple luminance binary tests for random feature pairs

in the scene, which are represented by votes in aggre-

gated votes group. We created identical tests for cor-

responding feature points on the pattern side. Each

set of binary tests provided a binary string that can be

compared with a hamming distance. The normalised

distance can be thresholded.

Normalised luminance cross correlation is used as

a last ﬁlter. It needs the exact object’s graphics patch

extracted from the scene. It’s computationally expen-

sive, but can ﬁlter out many false positive detections,

that cannot be ﬁltered by previous ﬁlters. The im-

ages are resized to the size of 50x50 pixels before the

calculation of the cross correlation. The ﬁltering is

conducted only in the second pass of the aggregation

process, where the theoretical object’s frame can be

calculated from the data from the ﬁrst pass.

4 EXPERIMENTS

Our testing platform, incorporating method described

in this paper, has been developed to search product lo-

gos and casings on scenes presenting market shelves

and displays. The database used for the test for this

paper consists of 120 shelf photos taken in 12MPx

resolution and scaled down to 3MPx for testing pur-

poses. The pattern group consist of 60 generic pat-

terns of logos and product wrappings. Each shelf

photo was tested with each one of the patterns, giving

7200 detection processes. The photos contained usu-

ally three classes of products so most of the patterns

could generate only false positive detections. Aver-

age number of products presented in the scenes was

emph23.6. Patterns were scaled to have the bigger

size between 512 and 256 pixels. In the application

of product search on the market shelves we describe

high quality images as photos bigger than 2MPx, with

a minimum of ten thousands pixels for the smallest

searched object and all of the logos text readable for

a human. Our aggregation approach bases its effec-

tiveness upon chosen local features. We used SIFT

implementation for main experiments. Main advan-

tage of our method lays in ﬁltering out false detec-

tions and processing all possible occurrences. SIFT is

a state-of-the-art features detector and descriptor. Our

test showed that 100% of the actual object instances

were processed through our cascade ﬁltering with a

proper proposition’s location. That’s thanks to dense

proposition detections and a straightforward vote im-

age creation.

Detection effectiveness lays in proper vote group

ﬁltering. Amount of positive detections rejected dur-

ing cascade ﬁltering results from all the computer

vision algorithms incorporated into detection system

and can be hardly used to measure aggregation ef-

fectiveness without proper comparisons with similar

methods in the same application ﬁeld. False detection

rate yields more analytical data. We found no false

positive detections during our tests, that were fault

of not sufﬁcient description capability of feature de-

scriptor. All of false detections were the result of too

loose parameters, that were needed for very high pos-

itive detection rate. Nevertheless we came across 203

false detections in 129 of 7200 detection processes,

resulting in more than 1% (Table 1) false detection

chance per detection process. This result seems low,

but at the same time means emph66.2% chance, that

the false detection will take place when looking for

any product instance from our patterns database.

RobustMethodofVoteAggregationandPropositionVerificationforInvariantLocalFeatures

257

(a) Pattern

(b) Scene

Figure 2: Sample of detection results.

Our method has been compared to the method us-

ing the HOG descriptor. For the training stage we

generated set of 60 derivative images for each pat-

tern through small afﬁne transformations. We used

all other patterns as a negative images. We used im-

plementation of HOG method, called Classiﬁer Tool

For OpenCV and FANN (HOG, 2014). Our method

achieved only slightly better detection rate, but sig-

niﬁcantly lower chance for false detections. The aver-

age number of false detections were almost two times

higher for the HOG approach (Table 2).

Table 1: Detection rate and false detection chance for our

tests.

Method Detection Rate False Detection Chance

Ours 81.3% 1.79%

HOG 73.6% 21.42%

Table 2: Average number of false detections for process,

where the false detection occurred.

Method Average Number of False Detections

Ours 1.57

HOG 3.18

During experiments with product casings we en-

countered number of problems with association of

detections to a speciﬁc result group. Some prod-

ucts are very similar, with only slight local graphi-

cal differences. This is particularly true for the same

brand with different aromas or casing sizes. Figure 2

presents one of such cases, where tea casing has iden-

tical logo for its few variations with one being visu-

ally very different from the others. We decided to in-

terpret only the visually off tea as a false detection.

In retail ﬁeld the rest of the detections should be pro-

cessed further to discriminate different variations of

the products. One can use partial patterns with a bag-

of-words approach on top of our aggregation method

to do so.

5 CONCLUSIONS

In this paper the method of vote aggregation designed

for use in multi-object multi-detection systems has

been introduced. Aggregation process yields promis-

ing results in tests, leading to analysis of each poten-

tial object in the image. The unique ﬁltering leaves

out many false object occurrence propositions and the

cascade ﬁltering rejects most of the false positive de-

tections, that is crucial for presented application. Sys-

tem built upon the aggregation method can achieve

more than 80% detection rate with the false detection

chance below 2%. It is still far from industrial stan-

dards, but there are many places for improvement as

well.

Presented method is designed to analyze very high

quality images. Images processed in tests were taken

by a hand, resulting in high amount of blurred and

skewed visual data. The method of image acquisition

should be analyzed further. In future work we will

incorporate estimates of the best parameters for pre-

sented method as well as solve simple parametriza-

tion dependencies. We are going to test the system

with a two-phase approach, where second phase of

the detection would use the patterns extracted directly

from the scene. The pattern size has too much impact

on the detection rate, as the feature points approach

works the best, when the objects in the scene and in

pattern images have the same size. We are going to

evaluate resizing options for better detection results.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

258

ACKNOWLEDGEMENTS

This work was co-ﬁnanced by the European Union

within the European Regional Development Fund.

REFERENCES

(2014). Classiﬁer tool for opencv and fann v. 4.11.8.

http://classiﬁeropencv.codeplex.com/. Accessed:

2014-12-18.

Alahi, A., Ortiz, R., and Vandergheynst, P. (2012). FREAK:

Fast retina keypoint. In Computer Vision and Pat-

tern Recognition (CVPR), 2012 IEEE Conference on,

pages 510–517.

Azad, P., Asfour, T., and Dillmann, R. (2009). Combin-

ing Harris interest points and the SIFT descriptor for

fast scale-invariant object recognition. In Intelligent

Robots and Systems, 2009. IROS 2009. IEEE/RSJ In-

ternational Conference on, pages 4275–4280.

Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008).

Speeded-up robust features (SURF). Computer Vision

and Image Understanding, 110(3):346 – 359. Simi-

larity Matching in Computer Vision and Multimedia.

Blaschko, M. and Lampert, C. (2008). Learning to localize

objects with structured output regression. In Forsyth,

D., Torr, P., and Zisserman, A., editors, Computer Vi-

sion ECCV 2008, volume 5302 of Lecture Notes in

Computer Science, pages 2–15. Springer Berlin Hei-

delberg.

Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010).

BRIEF: Binary robust independent elementary fea-

tures. In Daniilidis, K., Maragos, P., and Paragios,

N., editors, Computer Vision ECCV 2010, volume

6314 of Lecture Notes in Computer Science, pages

778–792. Springer Berlin Heidelberg.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-

dients for human detection. In Computer Vision and

Pattern Recognition, 2005. CVPR 2005. IEEE Com-

puter Society Conference on, volume 1, pages 886–

893 vol. 1.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ra-

manan, D. (2010). Object detection with discrim-

inatively trained part-based models. Pattern Analy-

sis and Machine Intelligence, IEEE Transactions on,

32(9):1627–1645.

Iwanowski, M., Zieli

nski, B., Sarwas, G., and Stygar, S.

(2014). Identiﬁcation of products on shop-racks by

morphological preprocessing and feature-based detec-

tion. In Chmielewski, L., Kozera, R., Shin, B.-S.,

and Wojciechowski, K., editors, Computer Vision and

Graphics, volume 8671 of Lecture Notes in Computer

Science, pages 286–293. Springer International Pub-

lishing.

Ke, Y. and Sukthankar, R. (2004). PCA-SIFT: a more

distinctive representation for local image descriptors.

In Computer Vision and Pattern Recognition, 2004.

CVPR 2004. Proceedings of the 2004 IEEE Computer

Society Conference on, volume 2, pages II–506–II–

513 Vol.2.

Kurzejamski, G., Zawistowski, J., and Sarwas, G. (2014).

Apparatus and method for multi-object detection in a

digital image. EU Patent 14461566.3.

Lampert, C. H., Blaschko, M., and Hofmann, T. (2008). Be-

yond sliding windows: Object localization by efﬁcient

subwindow search. In Computer Vision and Pattern

Recognition, 2008. CVPR 2008. IEEE Conference on,

pages 1–8.

Leutenegger, S., Chli, M., and Siegwart, R. (2011). BRISK:

Binary robust invariant scalable keypoints. In Com-

puter Vision (ICCV), 2011 IEEE International Con-

ference on, pages 2548–2555.

Lowe, D. (1999). Object recognition from local scale-

invariant features. In Computer Vision, 1999. The Pro-

ceedings of the Seventh IEEE International Confer-

ence on, volume 2, pages 1150–1157 vol.2.

Lowe, D. (2004). Distinctive image features from scale-

invariant keypoints. volume 60, pages 91–110.

Kluwer Academic Publishers.

Morel, J.-M. and Yu, G. (2009). ASIFT: A new framework

for fully afﬁne invariant image comparison. SIAM J.

Img. Sci., 2(2):438–469.

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.

(2011). ORB: An efﬁcient alternative to SIFT or

SURF. In Computer Vision (ICCV), 2011 IEEE In-

ternational Conference on, pages 2564–2571.

Sarwas, G. and Skoneczny, S. (2015). Object localization

and detection using variance ﬁlter. In Chora

s, R. S.,

editor, Image Processing & Communications Chal-

lenges 6, volume 313 of Advances in Intelligent Sys-

tems and Computing, pages 195–202. Springer Inter-

national Publishing.

Shi, J. and Tomasi, C. (1994). Good features to track. In

Computer Vision and Pattern Recognition, 1994. Pro-

ceedings CVPR ’94., 1994 IEEE Computer Society

Conference on, pages 593–600.

Viola, P. and Jones, M. (2001). Rapid object detection using

a boosted cascade of simple features. In Computer Vi-

sion and Pattern Recognition, 2001. CVPR 2001. Pro-

ceedings of the 2001 IEEE Computer Society Confer-

ence on, volume 1, pages 511–518.

Yeh, T., Lee, J., and Darrell, T. (2009). Fast concurrent ob-

ject localization and recognition. In Computer Vision

and Pattern Recognition, 2009. CVPR 2009. IEEE

Conference on, pages 280–287.

Zickler, S. and Efros, A. (2007). Detection of multiple de-

formable objects using PCA-SIFT. In Proceedings

of the 22Nd National Conference on Artiﬁcial Intelli-

gence - Volume 2, AAAI’07, pages 1127–1132. AAAI

Press.

Zickler, S. and Veloso, M. M. (2006). Detection and local-

ization of multiple objects. In Humanoid Robots, 2006

6th IEEE-RAS International Conference on, pages

20–25.

RobustMethodofVoteAggregationandPropositionVerificationforInvariantLocalFeatures

259