the idea to detect and describe 2D features at non-linear scale-space extrema to obtain better localization accuracy and distinctiveness. The Gaussian blurring used in other object recognition algorithms (e.g. SIFT) does not respect the natural boundaries of objects, since image details and noise are smoothed to the same degree at all scale levels. To make blurring adaptive to image features, KAZE makes use of non-linear diffusion filtering together with the AOS (Additive Operator Splitting) scheme. With this filtering, image noise is reduced while object boundaries are preserved (Andersson and Marquez, 2016).
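As an illustration, one explicit time step of Perona-Malik-type non-linear diffusion (the PDE family underlying KAZE's scale space) can be sketched in Python; the conductivity function and the `k` and `tau` values here are illustrative choices, not KAZE's actual parameters:

```python
import numpy as np

def perona_malik_step(L, k=0.05, tau=0.2):
    """One explicit step of Perona-Malik non-linear diffusion.

    Conductivity g = 1 / (1 + |grad L|^2 / k^2) slows diffusion at
    strong edges, so noise is smoothed while boundaries are kept.
    (Illustrative sketch; KAZE/AKAZE solve this type of PDE with the
    faster AOS and FED schemes.)
    """
    gx = np.gradient(L, axis=1)
    gy = np.gradient(L, axis=0)
    g = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)
    # Divergence of g * grad(L), again via central differences.
    div = np.gradient(g * gx, axis=1) + np.gradient(g * gy, axis=0)
    return L + tau * div

# Noisy step edge: diffusion should reduce noise in the flat regions
# while the edge itself stays sharp.
rng = np.random.default_rng(0)
img = np.hstack([np.zeros((32, 16)), np.ones((32, 16))])
noisy = img + rng.normal(0, 0.05, img.shape)
L = noisy.copy()
for _ in range(20):
    L = perona_malik_step(L)
```

At the edge the gradient is large, so g is close to zero and almost no smoothing crosses it; in the flat regions g stays near its maximum and the noise diffuses away.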
Because solving the series of PDEs required by non-linear diffusion filtering is computationally costly, an accelerated version of KAZE was created, called Accelerated-KAZE or AKAZE (Alcantarilla and Solutions, 2011). The algorithm works in the same way as KAZE, with some differences (Andersson and Marquez, 2016):
• it uses a faster method, called Fast Explicit Diffusion (FED), to build the non-linear scale space
• it uses a binary descriptor (a modified version of the Local Difference Binary (LDB) descriptor) to further increase speed
2.3 Feature Matching
Feature matching is the process of finding corresponding points in different images. It depends heavily on the feature extraction step: if the extracted features are not as distinctive as possible, some features may be matched even though they do not represent the same segment or part of an image. It is also important to balance the number of features extracted, because the time complexity of matching grows with the number of features in the images.
2.3.1 FLANN
FLANN (https://www.cs.ubc.ca/research/flann/) is a library for performing fast approximate nearest neighbor searches in high-dimensional spaces. It contains a collection of algorithms that work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset (Muja and Lowe, 2009). This library was used in our system.
2.3.2 Matching Techniques
As the authors argue in (Tareen and Saleem, 2018), the choice of the feature detector-descriptor is a critical decision in feature-matching applications. They
present a comprehensive comparison and analysis of SIFT, SURF, KAZE, AKAZE, ORB and BRISK, which are among the fundamental scale-, rotation- and affine-invariant feature detectors, each having a designated feature descriptor with its own advantages and disadvantages.
The performance of feature detector-descriptors
on matching was evaluated on the following transfor-
mations: scaled versions (5% to 500%), rotated ver-
sions (0 to 360 degrees), viewpoint changes and affine
invariance.
Regarding the accuracy of image matching, SIFT was found to be the most accurate overall, with AKAZE and BRISK as runners-up.
The authors of (Pusztai and Hajder, 2016) quantitatively compared the well-known feature detector-descriptors implemented in OpenCV 3. Based on their analysis, the most accurate feature extraction algorithm is SURF, which outperforms the other methods in all test cases; KAZE and AKAZE are the runners-up and are also very accurate.
2.3.3 Lowe’s Ratio Test
Lowe proposed in (Lowe, 1999) to use a distance ratio
test to eliminate false matches.
The author explains that the best candidate match
for each keypoint is found by identifying its nearest
neighbor in the database of keypoints from training
images. It is possible that some features from an im-
age will not have any correct match in the training
database, leading to invalid or incorrect matches. This
could happen when they come from background clut-
ter or were not detected in the training images.
An effective measure is obtained by comparing
the distance of the closest neighbor to that of the
second-closest neighbor (Lowe, 1999). This mea-
sure achieves reliable matching because the correct
matches should have the closest neighbor much closer
than the closest incorrect match.
Finally, matches in which the distance ratio is greater than 0.7-0.8 are rejected, eliminating 90% of the false matches while discarding less than 5% of the correct matches for the dataset presented in (Lowe, 1999).
3 PROPOSED SOLUTION
3.1 Overview
We used 360° panorama images from Google Street View™ to create a dataset. The dataset and the user-
VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications