A FRAMEWORK TO IMPROVE MATCHING RESULTS OF
WIDELY SEPARATED VIEWS
Cosmin Ancuti, Codruta Orniana Ancuti and Philippe Bekaert
Hasselt University - tUL -IBBT, Expertise Center for Digital Media, Belgium
Keywords:
Local feature points, Matching, SIFT, Color, Wide-baseline.
Abstract:
Matching images is a crucial step in many computer vision applications. In this paper we present an alternative
strategy, built on the SIFT operator, to solve the problem of wide-baseline matching. We first show how to add
color information to the SIFT descriptors of the extracted keypoints. In practice, the SIFT descriptor vector
is blended with the main parameters (contrast, correlation and energy) of the color co-occurrence histogram
computed in the same image patch. Afterward, in order to further improve the matching results for images taken
under large variations of the camera viewpoint angle, the valid matches obtained by the previous strategy are
employed to estimate the geometry between patches of corresponding keypoints. This overcomes the lack of
affine invariance of the existing operators (including SIFT), allowing a more appropriate region shape to be used
when the descriptors are computed. In our experiments the proposed method shows a substantial improvement of
the matching results compared with those obtained by the original local operator.
1 INTRODUCTION
Matching images that represent projections of the
same 3D scene/object is a fundamental task in com-
puter vision. Several important applications such as
3D reconstruction, camera calibration, panoramic im-
ages, texture and object classifying, image retrieval,
robot localization rely on the accuracy of this task.
The problem of image matching is in general solved based on local feature points. The feature points (keypoints, interest points) are those locations where the image has significant variation in at least two directions. First, a certain number of local feature points are extracted independently in both images. For efficiency, these locations are filtered by employing an invariant detector in order to extract feature points with a high repeatability ratio. Secondly, the extracted feature points are described as distinctively as possible based on the information contained in their neighboring regions. Finally, the corresponding feature points are found by computing different distance criteria (e.g. Euclidean, Mahalanobis) between descriptor vectors.
There has been much research on image matching in the last decades. Probably the best-known detector was introduced by Harris (C.G.Harris and Stephens, 1988). This basic detector is invariant only to rotation and translation and fails for more complex geometric transformations such as scale changes or true affine transformations. Lindeberg addressed the scale invariance of detectors by introducing the automatic scale selection principle (Lindeberg, 1999). He searched for 3D maxima in the Laplacian of Gaussian (LoG) scale space. Lowe (Lowe, 2004) approximates the LoG scale space through the Difference of Gaussians (DoG). More recently, the basic Harris and Hessian detectors have been adapted to scale and affine spaces (Mikolajczyk and Schmid, 2004b).
In order to find correct matches between images of the same scene, the feature points have to be described as distinctively as possible. The simplest method is to compute the cross-correlation between vectors of pixels from certain regions. Unfortunately, this approach fails in real cases where corresponding patches of the images are related by complex geometric transformations. Various descriptors (Freeman and Adelson, 1991; Belongie et al., 2002; Mikolajczyk and Schmid, 2004a; Lowe, 2004) have been introduced in the literature. SIFT (Lowe, 2004) computes a histogram of gradient locations and orientations and has been shown to outperform the other descriptors (Mikolajczyk and Schmid, 2004a).
More recently, SURF (Bay et al., 2006) uses the Hessian matrix and Haar wavelet responses combined with the properties of integral images in order to speed up the processing time.
The most challenging problem is to match images of the same scene taken under significant variation of the camera viewpoint position. The region-based methods (Tuytelaars and Gool, 2000; Matas et al., 2002; Tuytelaars and Gool, 2004; Forssén and Lowe, 2007) identify salient corresponding image regions. In general, the local feature-point based approaches (Pritchett and Zisserman, 1998; Baumberg, 2000; Xiao and Shah, 2003) are more robust to occlusion and clutter, find a higher number of correct matches, and require less processing time than region-based methods.
Recent studies (Mikolajczyk and Schmid, 2004a; Moreels and Perona, 2007) revealed that none of the existing operators is fully invariant to affine changes. Moreels and Perona (Moreels and Perona, 2007) observed that when the difference between camera viewpoint angles is significant (higher than 25-30°) the state-of-the-art detectors/descriptors fail. This is mainly due to the shape of the region in which the descriptors are computed, which does not adapt to the viewpoint change.
This paper presents an alternative strategy, built on local feature points, that aims to improve the matching results for extreme cases where the difference between camera viewpoint angles is significant. Our approach is built on the well-known SIFT operator. Since SIFT has been designed only for grayscale images and neglects color, we first show how to effectively add the color information in order to increase the descriptor distinctness. Additionally, to increase the affine invariance of the local feature points, after extracting several valid matches, an approximate geometry is computed between their neighboring patches. The estimated geometric parameters are used to define a new shape of the region where the descriptor vectors will be computed.
The method has the advantage that it does not employ expensive refinements (e.g. RANSAC, epipolar constraints). Moreover, the method is general and can be used in combination with other operators as well. The comparative results demonstrate the utility of our method, which is able to find a considerably larger number of correct matches.
The remainder of the paper is organized as follows. The next section presents the strategy to effectively add color information to the SIFT descriptors. Then, we show how the geometry that relates the corresponding patches of valid matches can be estimated. Finally, before concluding, we present and discuss several comparative results.
Figure 1: Adding color to the SIFT descriptor. The top part of the figure shows the results obtained by SIFT (13 valid matches), while by adding color (bottom part of the figure) we obtained 19 valid matches.
2 INCREASE DISTINCTNESS OF
DESCRIPTORS BY COLOR
Recent studies (Mikolajczyk et al., 2005; Moreels and Perona, 2007) have shown that the most effective local operator for matching images is SIFT (Scale Invariant Feature Transform) (Lowe, 2004). Several attempts (Ke and Sukthankar, 2004; Abdel-Hakim and Farag, 2006) tried to improve parts of the original implementation, but the original version still remains the most reliable. In the following we give a brief presentation of how this operator is computed. The feature points are scale invariant and are searched in the DoG (Difference of Gaussians) scale space. The DoG is built by subtracting images that have previously been convolved (blurred) with a Gaussian function whose standard deviation increases monotonically. A keypoint is extracted only if its value is greater or smaller than all of its 26 neighbors in the scale space. Additionally, keypoints with a strong edge response or low contrast are rejected.
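As an illustration, a minimal sketch of this 26-neighbor extremum test is given below; it assumes a precomputed DoG stack of shape (scales, height, width), and the function name and contrast threshold are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def is_scale_space_extremum(dog, s, y, x, contrast_thresh=0.03):
    """Return True if dog[s, y, x] is larger or smaller than all 26 neighbors
    in the 3x3x3 scale-space cube around it (interior points only)."""
    value = dog[s, y, x]
    # Reject low-contrast candidates early (threshold value is illustrative).
    if abs(value) < contrast_thresh:
        return False
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    # The candidate is part of the cube, so >= max (or <= min) means extremum.
    return value >= cube.max() if value > 0 else value <= cube.min()

# The edge-response rejection (based on the Hessian of the DoG) is omitted here.
```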
The signature (descriptor) computation is based on the image gradient magnitudes and orientations calculated in the circular neighboring regions of the feature points. The image pyramid level is determined by the computed characteristic scale of the respective feature point. For every feature point, a 16x16 region centered on the point is divided into a 4x4 grid of sub-regions, and an orientation histogram is constructed in each sub-region. Each histogram has 8 bins, one for every 45°, which yields a 4x4x8 = 128-dimensional descriptor vector.
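For reference, such keypoints and 128-dimensional descriptors can be obtained with a standard implementation; the sketch below uses OpenCV (assuming a build that ships SIFT, e.g. OpenCV 4.4 or later) and an example image name of our choosing.

```python
import cv2

# SIFT operates on grayscale images; the color extension is added separately.
img = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries its location (kp.pt), characteristic scale (kp.size)
# and dominant orientation (kp.angle); each descriptor row has 4*4*8 = 128 entries.
print(len(keypoints), descriptors.shape)
```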
However, SIFT was designed only for grayscale images, neglecting the important information carried by color. Therefore, in this section we describe a
strategy to increase the distinctness of the local feature point descriptors by color. Our approach is built on the color co-occurrence histogram (Chang and Krumm, 1999), an extension to the color space of the well-known co-occurrence matrix (Haralick et al., 1973) that estimates the spatial gray-level dependencies of the pixels. Given a color pixel $p_1$ in the image, the color co-occurrence histogram (CCH) counts the number of occurrences of the color pixel pair $(p_1, p_2)$, with $p_2$ representing an adjacent color pixel located at the displacement $d = (\Delta x, \Delta y)$. The color co-occurrence histogram can be seen as a function of the color pixel values and the displacement vector between them. For a given image patch $P$ of size $N \times N$, the CCH counts the number of times the pixel pair $(p_1, p_2)$ matches the color combination $(c_1, c_2)$:

$$CCH(\Delta x, \Delta y, c_1, c_2) = \sum_{x=1}^{N} \sum_{y=1}^{N} \Psi_{c_1}(x, y)\, \Psi_{c_2}(x + \Delta x, y + \Delta y) \qquad (1)$$

which represents the number of pixel pairs located at the displacement $(\Delta x, \Delta y)$ that show the color combination $(c_1, c_2)$, while the function $\Psi$ is given by the following expression:

$$\Psi_{c_k}(x, y) = \begin{cases} 1, & c(x, y) = c_k \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $c(x, y)$ is the color level of the pixel located at $(x, y)$ and $c_k$ is a color level. We compute the CCH in the same neighboring patches of the extracted keypoints. In our experiments the number of color levels is reduced by a standard k-means quantization to $n_c = 256$, while the CCH is computed only for an offset distance $d = (1, 1)$.
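A minimal sketch of Eq. (1)-(2) follows; it assumes the patch has already been quantized to $n_c$ color levels (e.g. by k-means), and the function name and the vectorized formulation are ours.

```python
import numpy as np

def color_cooccurrence_histogram(labels, n_colors=256, dx=1, dy=1):
    """Color co-occurrence histogram for a patch of quantized color labels.
    Counts pairs (c1, c2) of pixels separated by the displacement d = (dx, dy),
    which corresponds to Eq. (1) with the indicator function of Eq. (2)."""
    cch = np.zeros((n_colors, n_colors), dtype=np.float64)
    h, w = labels.shape
    c1 = labels[:h - dy, :w - dx]   # pixel p1 at (x, y)
    c2 = labels[dy:, dx:]           # pixel p2 at (x + dx, y + dy)
    np.add.at(cch, (c1.ravel(), c2.ravel()), 1)
    return cch
```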
After the color co-occurrence histogram matrix is normalized, its elements are referred to as $P_{i,j}$. We compute its main parameters, the contrast $C = \sum_{i,j} P_{i,j}\,(i - j)^2$, the correlation $Cor = \frac{1}{\sigma_i \sigma_j} \sum_{i,j} (i - \mu_i)(j - \mu_j)\, P_{i,j}$ and the energy $E = \sqrt{\sum_{i,j} P_{i,j}^2}$, in order to build the new descriptor vector that blends the SIFT descriptor with the three parameters of the CCH computed in the same surrounding region of the filtered keypoints. In our experiments we consider that the original SIFT descriptor has a weight of 0.7, while each of the CCH parameters contributes to the final descriptor vector with a weight factor of 0.1. To find the valid matches we use the same strategy as Lowe (Lowe, 2004), evaluating the ratio between the distances to the first best match and the second best match.
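The following sketch shows one way to realize this step: the three parameters of a normalized CCH, a blended descriptor that concatenates the weighted SIFT vector (0.7) with the weighted CCH parameters (0.1 each), and Lowe-style ratio matching. The concatenation is our reading of the blending; function names and the ratio threshold are illustrative.

```python
import numpy as np

def cch_parameters(cch):
    """Contrast, correlation and energy of a normalized co-occurrence matrix."""
    p = cch / cch.sum()
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sigma_i = np.sqrt(np.sum(p * (i - mu_i) ** 2))
    sigma_j = np.sqrt(np.sum(p * (j - mu_j) ** 2))
    correlation = np.sum((i - mu_i) * (j - mu_j) * p) / (sigma_i * sigma_j)
    energy = np.sqrt(np.sum(p ** 2))
    return np.array([contrast, correlation, energy])

def blended_descriptor(sift_desc, cch_params, w_sift=0.7, w_cch=0.1):
    """Blend the 128-dim SIFT descriptor with the three CCH parameters."""
    return np.concatenate([w_sift * np.asarray(sift_desc), w_cch * cch_params])

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Lowe-style matching: keep a match only when the best distance is
    clearly smaller than the second-best distance."""
    matches = []
    for idx, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        best, second = np.argsort(dists)[:2]
        if dists[best] < ratio * dists[second]:
            matches.append((idx, best))
    return matches
```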
3 MORE EFFECTIVE
MATCHING
As observed in our experiments and shown in Figure 1, adding the color information to the SIFT descriptors only partially improves the matching results. In order to find a substantially larger number of valid matches we adopt the following method. Assuming that several correct corresponding points have been found by the previously presented strategy, we estimate the geometry that relates the patches of two corresponding feature points in order to find the optimal shape of the region where descriptors are computed. In the worst-case scenario, when no valid corresponding points have been found, two corresponding locations in the input images are selected manually.
Limiting the geometry to an affine transformation, the relation between the surrounding regions $P_1$, $P_2$ of two corresponding keypoints is expressed as follows:

$$\gamma\, P_1(A\,x + d) + \eta = P_2(x) \qquad (3)$$

where $A$ is a 2D affine matrix, $d$ is the translation vector, $\gamma$ accounts for the reflection of the light source and $\eta$ represents the camera gain. Since in our experiments we consider only small photometric variations between images, the last two parameters $(\gamma, \eta)$ are approximated by unit values.
Finding the optimal geometric transformation of the corresponding patches can be seen as a minimization problem. In order to find the optimal parameters, the following energy function is minimized:

$$E = \left\| P_1\big(A(\alpha, s, h_1, h_2)\, x\big) - P_2(x) \right\|^2 \qquad (4)$$

where the parameters of the affine transformation are the rotation angle $\alpha$, the scale ratio $s$, the shearing $h_1$ and the stretching $h_2$. The convergence process is very sensitive and, to obtain decent results, the initial values of the affine matrix parameters need to be as close as possible to their real values. The camera rotation is contained in the matrix $R(\alpha)$, the isotropic scale $S$ is specified by the parameter $s$, and the shearing and stretching matrix $\Gamma$ is expressed by an expansion factor in a considered direction and a contraction factor in the perpendicular direction. In our approach, the initial value from which the minimization process starts is approximated by $A_0 = R(\alpha_0)\, S(s_0)$.
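A sketch of this energy function, under our assumptions about the parameterization (the exact form of the shear/stretch matrix $\Gamma$ is not spelled out above, so the one below is illustrative), could look as follows; the two patches are assumed to be same-sized arrays centered on the keypoints.

```python
import numpy as np
from scipy.ndimage import affine_transform

def build_affine(alpha, s, h1, h2):
    """A = R(alpha) * S(s) * Gamma(h1, h2): rotation, isotropic scale and an
    expansion/contraction plus shear part (this Gamma form is illustrative)."""
    r = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    gamma = np.array([[1.0 + h1, h2],
                      [0.0, 1.0 / (1.0 + h1)]])
    return r @ (s * np.eye(2)) @ gamma

def patch_energy(params, p1, p2):
    """E = || P1(A x) - P2(x) ||^2 of Eq. (4), with A applied about the patch
    center; scipy's affine_transform maps output coordinates into p1."""
    a = build_affine(*params)
    center = np.array(p1.shape) / 2.0
    offset = center - a @ center
    warped = affine_transform(p1, a, offset=offset, order=1)
    return float(np.sum((warped - p2) ** 2))
```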
The initial value of the scale $s_0$ is determined based on the automatic scale selection principle (Lindeberg, 1999). Lindeberg postulates that, in the absence of additional evidence, the selected scale (characteristic scale) is the scale at which a function of some combination of normalized derivatives attains a local maximum.
Figure 2: Comparative results. From left to right: Graffiti - SIFT found only 13 valid matches (yellow circle) while our
approach found 52 correct matches (green crosses), Wall - SIFT found only 17 valid matches (yellow circle) while our
approach found 44 correct matches (green crosses), Wadham College - SIFT found only 20 valid matches (yellow circle)
while our approach found 60 correct matches (green crosses).
Since we extract feature points using DoG, which is also based on the automatic scale selection principle, every keypoint has a characteristic scale value attached. Therefore, the initial value of the scale between images is computed as the average of the ratios between the characteristic scales of all the matched keypoints.
For the initial value of the rotation angle parameter $\alpha_0$, we rely on the distribution of the image gradient orientations and magnitudes. An orientation histogram is built in the surroundings of each feature point, with every bin counting the contributions of the gradient orientations weighted by their gradient magnitudes and by a Gaussian-weighted circular window over the respective region. The dominant orientation is determined by the highest peak of the histogram (Lowe, 2004). We estimate the initial state of the rotation parameter $\alpha_0$ by averaging the differences between the dominant orientations of the corresponding feature points.
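Putting the two initializations together with the energy of Eq. (4), the estimation step can be sketched as follows; it reuses the patch_energy function from the previous sketch, assumes OpenCV-style keypoints (with .size and .angle), and uses a derivative-free optimizer as one possible choice.

```python
import numpy as np
from scipy.optimize import minimize

def initial_parameters(matched_keypoints):
    """s0: average ratio of characteristic scales; alpha0: average difference
    of dominant orientations over the already matched keypoint pairs."""
    s0 = np.mean([kp2.size / kp1.size for kp1, kp2 in matched_keypoints])
    alpha0 = np.mean([np.deg2rad(kp2.angle - kp1.angle)
                      for kp1, kp2 in matched_keypoints])
    return alpha0, s0

def estimate_patch_geometry(p1, p2, matched_keypoints):
    """Minimize Eq. (4) starting from A0 = R(alpha0) S(s0), no shear/stretch."""
    alpha0, s0 = initial_parameters(matched_keypoints)
    x0 = np.array([alpha0, s0, 0.0, 0.0])
    result = minimize(patch_energy, x0, args=(p1, p2), method="Nelder-Mead")
    return result.x   # refined (alpha, s, h1, h2)
```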
In practice, the estimated parameters determine the new shape of the region where the descriptors will be computed. The descriptor computation procedure presented in the previous section is then repeated, but employing the new shape of the neighboring regions around the feature points, determined from the estimated geometry.
4 EXPERIMENTAL RESULTS
We tested our method on real images for which the relating geometry is known beforehand. Our approach is compared with the original SIFT using the well-known INRIA database (http://www.robots.ox.ac.uk/~vgg/research/affine/) as well as several other images taken under large variations of the camera viewpoint angle. To evaluate our method we assume that we know the geometry that relates the tested pairs of images. The evaluation of the results is done using the repeatability criterion for the planar scenes. This criterion, introduced by Schmid et al. (Schmid et al., 2000), takes into account the locations as well as the detected scales of the points. The repeatability score for a pair of images represents the ratio between the number of point-to-point matches and the minimum number of points detected in the two images.
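For planar scenes this validation and scoring step can be sketched as follows; the known homography H maps points from the first image to the second, and the pixel tolerance is an assumption of ours.

```python
import numpy as np

def valid_under_homography(pt1, pt2, H, tol=3.0):
    """Accept a match if pt1 mapped through the known homography H lands
    within `tol` pixels of pt2 (planar-scene criterion)."""
    p = H @ np.array([pt1[0], pt1[1], 1.0])
    p = p[:2] / p[2]
    return np.linalg.norm(p - np.asarray(pt2, dtype=float)) < tol

def repeatability(num_correspondences, num_points_img1, num_points_img2):
    """Repeatability score: correspondences over the minimum number of points
    detected in the two images (Schmid et al., 2000)."""
    return num_correspondences / min(num_points_img1, num_points_img2)
```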
The left side of Figure 2 presents the matching results for the Graffiti images. In this case the difference of the camera viewpoint angle is approximately 50°. Note that our method is able to find 52 valid matches, while SIFT is able to find only 13 correct matches. By only adding the color we obtained 19 valid matches (not shown). The middle of Figure 2 shows two images of the Wall INRIA data set, where the difference of the angle is 60°. Again our method outperforms SIFT, finding 44 valid
matches against only 17 found by SIFT.
The right side of Figure 2 presents two images of the Wadham College scene of the Oxford data set. In this case, the homography-based criterion is no longer valid for assessing the matching results. Instead, we make use of the fundamental matrix that characterizes the geometry between the views. As can be observed, our method again outperforms SIFT, finding a considerably larger number of correct matches.
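One way to implement this check is to measure the distance of each matched point to the epipolar line induced by the known fundamental matrix; the sketch below is our own illustration of this criterion.

```python
import numpy as np

def epipolar_error(pt1, pt2, F):
    """Distance (in pixels) of pt2 to the epipolar line F x1 of pt1; matches
    with a small error are consistent with the known fundamental matrix."""
    x1 = np.array([pt1[0], pt1[1], 1.0])
    x2 = np.array([pt2[0], pt2[1], 1.0])
    line = F @ x1                                  # epipolar line in image 2
    return abs(x2 @ line) / np.hypot(line[0], line[1])
```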
5 SUMMARY AND
CONCLUSIONS
This paper introduces an alternative strategy for wide-
baseline image matching. The method is built on the
widely-used SIFT operator. We first show how the
distinctness of the SIFT descriptor vectors can be in-
creased by adding the color information. Then, by
estimating the geometry that relates patches of corre-
sponding feature points found in the previous stage
we are able to define a new shape of the regions
where descriptors are computed more accurately. Our framework is shown to considerably improve the matching results compared with those obtained by the original SIFT operator. For future work, we would like to take into consideration more significant photometric variations between images and to demonstrate the utility of the method in several practical computer vision applications.
REFERENCES
Abdel-Hakim, A. E. and Farag, A. A. (2006). CSIFT: A SIFT descriptor with color invariant characteristics. IEEE CVPR.
Baumberg, A. (2000). Reliable feature matching across
widely separated views. IEEE Conf. on Computer Vi-
sion and Pattern Recog., CVPR.
Bay, H., Tuytelaars, T., and Gool, L. V. (2006). SURF:
Speeded up robust features. in Proceedings of Euro-
pean Conference on Computer Vision, pages 404–417.
Belongie, S., Malik, J., and Puzicha, J. (2002). Shape
matching and object recognition using shape contexts.
IEEE Tran. on Patt. Anal. and Mach. Intell.
C.G.Harris and Stephens, M. (1988). A combined corner
and edge detector. in Proceedings of Fourth Alvey Vi-
sion Conference, 18:147–151.
Chang, P. and Krumm, J. (1999). Object recognition with
color cooccurrence histograms. IEEE Conf. on Com-
puter Vision and Pattern Recog., CVPR.
Forssén, P.-E. and Lowe, D. (2007). Shape descriptors for maximally stable extremal regions. In IEEE International Conference on Computer Vision.
Freeman, W. and Adelson, E. (1991). The design and use of
steerable filters. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 13:891–906.
Haralick, R. M., Shanmugam, K., and Dinstein, I. (1973).
Textural features for image classification. IEEE
Transactions on Systems, Man and Cybernetics.
Ke, Y. and Sukthankar, R. (2004). PCA-SIFT: A more
distinctive representation for local image descriptors.
IEEE CVPR.
Lindeberg, T. (1999). Feature detection with automatic
scale selection. International Journal of Computer Vi-
sion, 30(2):77–116.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
Matas, J., Chum, O., Martin, U., and Pajdla, T. (2002). Ro-
bust wide baseline stereo from maximally stable ex-
tremal regions. In BMVC.
Mikolajczyk, K. and Schmid, C. (2004a). A performance
evaluation of local descriptors. IEEE Conf. on Comp.
Vision and Pattern Recog.
Mikolajczyk, K. and Schmid, C. (2004b). Scale & affine in-
variant interest point detectors. Int. J. Comput. Vision,
60(1).
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Gool, L. V. (2005). A comparison of affine region detectors. Int. J. Comp. Vision.
Moreels, P. and Perona, P. (2007). Evaluation of features
detectors and descriptors based on 3d objects. Int. J.
Comput. Vision, 73(3):263–284.
Pritchett, P. and Zisserman, A. (1998). Wide baseline stereo
matching. In IEEE ICCV.
Schmid, C., Mohr, R., and Bauckhage, C. (2000). Evalua-
tion of interest point detectors. International Journal
of Computer Vision, pages 151–172.
Tuytelaars, T. and Gool, L. V. (2000). Wide baseline stereo matching based on local, affinely invariant regions. In Proceedings of British Machine Vision Conference.
Tuytelaars, T. and Gool, L. V. (2004). Matching widely
separated views based on affine invariant regions. Int.
J. Comput. Vision, 59(1).
Xiao, J. and Shah, M. (2003). Two-frame wide baseline
matching. IEEE Int. Conf. on Comp. Vision.