Environment Recognition based on Images using Bag-of-Words
Taurius Petraitis
, Rytis Maskeliūnas
, Robertas Damaševičius
, Dawid Połap
, Marcin Woźniak
and Marcin Gabryel
Department of Multimedia Engineering, Kaunas University of Technology, Kaunas, Lithuania
Institute of Mathematics, Faculty of Applied Mathematics, Silesian University of Technology, Gliwice, Poland
Institute of Computational Intelligence, Czestochowa University of Technology, Czestochowa, Poland
Keywords: Object Recognition, Scene Recognition, Image Processing, Bag-of-Words, SIFT, SURF.
Abstract: Object and scene recognition solutions have a wide application field, from entertainment apps and medical
tools to security systems. In this paper, scene recognition methods and applications are analysed, and the
Bag of Words (BoW), a local image feature based scene classification model is implemented. In the BoW
model every picture is encoded by a bag of visual features, which shows the quantities of different visual
features of an image, but disregards any spatial information. Five different feature detectors and two feature
descriptors were analyzed, and the two approaches that were experimentally found to be most effective at
classifying images into eight outdoor categories were chosen: forced feature detection with a grid and description using
the SIFT descriptor, and feature detection with SURF and description with U-SURF. Support vector machines
were used for classification. We also found that for the task of scene recognition not just the distinct
features which are found by common feature detectors are important, but also the features that are
uninteresting for them. Indoor scenes were experimentally classified into five categories and worse results
were achieved. This shows that indoor scene classification is a much harder task and a model which does
not take into account any mid-level scene information like objects of the scene is not sufficient for the task.
A computer application was written in order to demonstrate the algorithm, which allows training new
classifiers with different parameters and using the trained classifiers to predict the classes of new images.
According to the World Health Organization
(Pascolini and Mariotti, 2011), in 2010 there were
285 million partially sighted people in the world, of
whom 39 million were blind. Efforts to integrate
these people into society have long been relevant;
e.g., the Braille writing system has spread
throughout the world as an analogue of ordinary
alphabets for the blind. With the advancement of
technology and an ever deeper understanding of the
functioning of the human brain, new possibilities for
the integration of the partially sighted and the blind
into society are emerging.
Computer vision aims to provide computers with
a sophisticated sense of sight. A system capable of
extracting semantic information from a digital video
signal is useful in the real world, for example in
providing help for the disabled (the blind) and in
applications such as photo album management.
Automatic retrieval of meaningful information is
also an important step in the development of
artificial intelligence, as well as the foundation for
more complex computer vision systems.
Computer-based interpretation of visual information
can be used to help people with disabilities (the
blind) to understand the environment, to choose the
best travel routes and to avoid obstacles while
moving. For example, an auxiliary real-time
navigation system (Mann et al., 2011) uses a
Microsoft Kinect sensor mounted on a helmet,
calculates the user's distance to an obstacle and, if
necessary, vibrates to warn the user about an
obstacle.
The information extracted from images of
environment can be used to extract textual
information (Ezaki et al., 2004). Such information
would particularly help the visually impaired people
to orient themselves in artificial environments such
as shops, etc.
Environment recognition technology, although
widely used in the social sphere, can help people
with disabilities to integrate into society. Correctly
recognizing an environment in the image is an
important task in most computer vision systems
because it provides contextual information. Objects
are easier to detect and recognize when they are
portrayed in their environment. Knowledge of the
context helps to simplify the object detection task by
narrowing the search field, the categories of objects
to be searched, etc. (Oliva and Torralba, 2007).
Petraitis T., Maskeliūnas R., Damaševičius R., Połap D., Woźniak M. and Gabryel M.
Environment Recognition based on Images using Bag-of-Words.
DOI: 10.5220/0006585601660176
In Proceedings of the 9th International Joint Conference on Computational Intelligence (IJCCI 2017), pages 166-176
ISBN: 978-989-758-274-5
2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
Human performance by far exceeds that of
computer systems in environment or object
recognition tasks. However, human visual abilities
degrade in a dark environment or after a long
observation time, and under certain working
conditions it is dangerous or impossible for a person
to work (Chan et al., 2002). Computer-based
systems can detect objects of interest in the dark, or
by using ultrasound or X-rays that penetrate opaque
materials, providing information that cannot be seen
by the naked eye. All of this not only extends the
limits of human visual capabilities for environment
recognition and perception, but also complements
them with new capabilities.
The aim of this paper is to present a method for
classifying digital images into specific categories
(e.g., forest, city), which may be usable in an
environment recognition system for partially sighted
or blind people. The concept is relevant for Assisted
Living Environments (ALE) (Dobre et al., 2016),
which aim to provide devices and services to enable
independent living of disabled people. We analyze
different object and scene recognition algorithms,
compare methods for image feature extraction and
present the results of experiments.
The environment recognition methods can be
categorized into two groups: global and local
information-based methods.
Global information-based methods analyze each
scene as an individual object and classify the scenes
according to their global characteristics. Each scene
can be described by a small set of properties derived
from the spectral information of the picture. The
global scene can be characterized using the Spatial
Envelope (Oliva and Torralba, 2001) features
(naturalness, openness, roughness, expansion,
ruggedness) that represent the dominant spatial
structure of a scene. First, for each of these
properties, discriminant spectral templates are
generated. Then, by multiplying the corresponding
template by the energy spectrum of the image, a
characteristic value for that image is obtained. The
classification is performed using a K-Nearest
Neighbors (KNN) classifier and reaches, on
average, 86% accuracy when classifying images into
8 categories. The relatively high accuracy achieved
shows that in order to classify environment images,
specific information about the objects contained
therein is not needed and global information about
the scene is enough. Dutt et al. (2009) used tree
structure classification. First, the picture is classified
as natural or artificial, then, depending on the
category, further classification is performed until the
end of the tree is reached. The authors argue that the
structure of the tree is intuitive, e.g., the street and
motorway categories are separated only at the end,
because their characteristics are very similar.
Local information-based methods analyze the local
properties of each scene, so the analysis of an image
begins from fine details and their local properties
(quantity, position, composition) when deciding
which category the scene belongs to. Vogel and
Schiele (2004) categorized scenes by means of a
semantic assessment of typicality. These categories
reflect the most general categories of scenes that are
used as the starting point for describing each image.
In reality, however, most natural scenes can be
described ambiguously, depending on the subjective
point of view. In this case, the accuracy of the
classification is not an objective indicator, since it
only shows agreement with the subjective annotation
of a picture. That is why the authors suggest
focusing not on the accuracy of scene assignment,
but on the degree of typicality. In each category,
both typical and less representative examples can
be found, and the differences in typicality are the
most effective feature in classification. The
assessment of typicality should directly reflect the
similarity of the image to the prototype image of the
category, i.e. less typical images should have a
lower degree of representativeness than typical
images of that category. Representativeness is
calculated as the Mahalanobis distance between
the image feature vector and the category prototype
vector, normalized to a range from 0 to 1. Then, the
classification is accomplished by assigning an image
to the category with the largest value of
representativeness. The concept of
representativeness in categorization is an important
topic in the research of visual psychophysics (Lu
and Dosher, 2013). The method reached 89.3%
classification accuracy using images with manually
marked local properties. When both the first and the
second best match were counted, 98% accuracy was
achieved, which shows that misclassified scenes are
semantically similar to both categories.
The aforementioned methods have proven their
importance in environment recognition, but they
require human intervention (Dutt et al., 2009), which
implies possible inaccuracies due to the subjective
approach of people and requires additional labour.
In contrast, in (Fei-Fei and Perona, 2005)
intermediate information is used, which includes both
local and global scene features. The so-called
localized regions have a similar semantic meaning or
visual appearance. First, local properties are
clustered into regions, and then these are
categorized. This yields a hierarchy based on the
statistical distribution of local properties in regions
and the distribution of regions in categories.
Classification into 13 image categories using 40
regional topics achieves 64% accuracy.
Csurka et al. (2004) also propose a fully
automated scene classification model that requires
no human intervention, similar to (Fei-Fei and
Perona, 2005), but the classification is performed
without the use of intermediate information. The
main steps are: 1) automatic detection of distinctive
visual image features and their description with
descriptors, 2) assignment of these descriptors to
clusters (visual dictionary), 3) creation of a bag of
keypoints, which counts how many features are
assigned to each cluster, and 4) using the bag of
keypoints as an input vector to a classifier, which
assigns the image to the predicted category.
In order to achieve the best results, the descriptors
obtained in the first step should be resistant to image
transformations, lighting variations and occlusion,
and at the same time be able to describe the
information necessary for categorization. The cluster
mentioned in the second step is a pool of similar
distinctive features. These pools are produced by
vector quantization algorithms from a large set of
features. Clusters have their own centers, which are
used as the words of a visual dictionary: each new
feature is assigned to the nearest cluster center, i.e.
to a visual word. This illustrates the analogy
between a language dictionary made of words and a
visual feature dictionary that consists of vectors
representing the centers of clusters. The authors
assume that the visual dictionary must be large
enough (at least 1000 visual words) to have all the
important distinctive features attributed to different
clusters, but not too large, so as not to create clusters
of noise from the images. They solve this problem
by creating several such dictionaries, using different
descriptors for each dictionary. Then the best
dictionary is selected by trial. Given a visual
dictionary, scene images can be described by the
histograms of the visual words in them. The task of
categorization is reduced to a simple template
matching task. The method achieved 85%
accuracy for 7 different categories when using the
Support Vector Machine (SVM) classifier.
Gabryel and Damaševičius (2017) presented a
modified Bag-of-Words (BoW) algorithm. The
modification involves using two different types of
image features: the descriptor of a keypoint and the
colour histogram, which can be obtained from the
surroundings of a keypoint. Using this additional
image feature significantly improves the image
classification results of the BoW algorithm. In
(Gabryel and Capizzi, 2017), the method was
extended with an evolutionary algorithm, which
analyses the visual words’ dictionary and modifies
histogram values before storing them in a database.
Nature-inspired optimization algorithms have
been used for preprocessing of images and
extraction of keypoints, which can be used further
for image segmentation and scene recognition.
Examples are Artificial Bee Colony (Wozniak et al.,
2015), Ant Colony (Polap et al., 2015), Firefly
Algorithm (Napoli et al., 2014), and Cuckoo Search
Algorithm (Wozniak and Polap, 2014).
Further, we review four image feature detection
methods: Scale-Invariant Feature Transform (SIFT),
Speeded-Up Robust Features (SURF), Features from
Accelerated Segment Test (FAST) and Maximally
Stable Extremal Regions (MSER).
SIFT (Lowe, 2004) detects distinctive features
regardless of the scale and orientation of the image,
and reliably detects the same features even in
slightly distorted images, e.g., after adding noise or
changing the lighting and / or the viewpoint. SIFT
first detects candidate features, then measures their
stability, determines their scale and eliminates the
unstable ones. Then, according to the local gradient
direction, one or more orientations are calculated
and assigned to each feature. With this information,
image data can be normalized with respect to scale,
position and orientation, so the features become
invariant to these transformations. The method also
includes a descriptor, which describes each detected
feature by a 128-dimensional vector. The gradient
magnitudes and orientations are first calculated in
the neighbourhood of the feature, using the
Gaussian-filtered image. Then the descriptor's
coordinates and gradient orientations are rotated
relative to the orientation of the feature, so the
descriptor is normalized with respect to the
orientation of the feature.
SURF (Bay et al., 2006) approximates second-order
Gaussian derivatives with box filters, thus losing
some accuracy, but significantly shortening the
calculation time. To detect features at different
image scales, SURF, in contrast to SIFT, does not
need to apply a Gaussian filter; it is enough to
change the size of the box filter, again avoiding
time-consuming calculations. The SURF descriptor
uses only a 64-dimensional vector, which is faster to
generate and compare, but stores less information
that might otherwise be useful.
FAST is a method for corner detection (Rosten
and Drummond, 2006). The main feature of this
detector is its speed. The FAST method can detect
corners in PAL format video in real time (using only
7% of the time available for processing a single
frame). The algorithm is fast, but it is not robust to
high levels of noise in pictures, and its results
depend on the choice of the threshold value.
MSER (Matas et al., 2004) detects specific
regions of an image, which form a set of
interconnected image points that make up the
contour after the thresholding of the image. The
intensity of all points within these regions is either
lighter or darker than the points on the contour. Such
regions are invariant to scaling, lighting, orientation
and viewing point transforms.
For environment recognition we apply a method
known by several names in the literature: Bag-of-
Words (Gabryel and Capizzi, 2017), Bag of Features
(Lazebnik et al., 2006), Bag of Keypoints (Csurka et
al., 2004).
Figure 1: Outline of Bag-of-Words model.
Figure 2: Image preprocessing. (a) input image (from
dataset (Oliva and Torralba, 2001)), (b) grayscale image,
(c) grayscale image with normalized histogram, (d)
scaled image.
This model is fairly widely used and has proven its
effectiveness in solving image classification tasks
(Vogel and Schiele, 2004; Fei-Fei and Perona,
2005). The model covers almost an entire process of
recognition, but different methods can be used for
each task of the model (Figure 1).
Before applying the method, the images are
preprocessed (see Figure 2): an image is converted
to grayscale, then histogram normalization is applied
and the image is scaled down so that its size does
not exceed a predefined value.
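The preprocessing steps can be illustrated with a minimal Python sketch. It is not the C++ / OpenCV code used in the experiments. The function names, the luminance weights and the interpretation of "histogram normalization" as a linear contrast stretch are assumptions made for illustration only; the original implementation may use histogram equalization instead.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of luminance values.

    The 0.299 / 0.587 / 0.114 weights are a common convention and an
    assumption here, not a detail taken from the paper.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def stretch_histogram(gray, low=0, high=255):
    """Linearly map the darkest pixel to `low` and the brightest to `high`."""
    flat = [p for row in gray for p in row]
    darkest, brightest = min(flat), max(flat)
    if brightest == darkest:
        # A constant image cannot be stretched; return a copy unchanged.
        return [row[:] for row in gray]
    scale = (high - low) / (brightest - darkest)
    return [[round(low + (p - darkest) * scale) for p in row] for row in gray]


def target_size(width, height, max_side):
    """Size after proportional down-scaling so that no side exceeds max_side.

    Images that are already small enough keep their size.
    """
    factor = min(1.0, max_side / max(width, height))
    return max(1, round(width * factor)), max(1, round(height * factor))
```

For example, under these assumptions `target_size(256, 256, 240)` gives the 240 × 240 size used in the outdoor experiments.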
The first stage of the method is the detection of
features in the picture. In this step, small patches of
the image that are likely to be significant for
classification are found. The features found are then
described in such a way that they can be compared
with each other. Next, each feature is assigned to the
most similar "visual word" from the previously
generated dictionary. The dictionary of visual words
is derived from clusters of similar features. Then the
image is encoded by a vector representing the
frequency of each word in the image. The vector is
used as the input of a classifier. We use and analyze
four feature detection methods: SIFT, SURF, FAST,
and MSER.
The SIFT descriptor describes each feature
using a 128-dimensional vector, which is
composed of histograms of the regions around image
keypoints in 8 different orientations. Depending on
the distance to the keypoint, a weight is assigned to
each calculated orientation. The weights are
calculated using a Gaussian function with a standard
deviation equal to half of the scale of the feature.
The resulting vector is normalized to unit length, a
threshold function is applied to it, and the vector is
normalized again.
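The final normalization of the descriptor (normalize, threshold, normalize again) can be sketched as follows. The function name is hypothetical, and the default threshold of 0.2 is the value given by Lowe (2004); whether the same value was used in the experiments reported here is an assumption.

```python
import math


def normalize_sift(vector, clip=0.2):
    """Normalize to unit length, clamp large components, normalize again.

    `clip` defaults to 0.2 as in Lowe (2004); this is an assumption about
    the setting used in the paper, not a reported value.
    """
    def unit(values):
        norm = math.sqrt(sum(x * x for x in values))
        # An all-zero vector has no direction; return it unchanged.
        return [x / norm for x in values] if norm > 0 else list(values)

    clipped = [min(x, clip) for x in unit(vector)]
    return unit(clipped)
```

Clamping reduces the influence of a few very large gradient magnitudes, so that the distribution of orientations matters more than individual strong edges.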
The SURF descriptor describes each feature using
a 64-dimensional vector as follows. First, the
dominant orientation of the keypoint is calculated.
Then, to describe the region around the keypoint, a
square region is extracted, centered on the keypoint
and oriented along the dominant orientation. The
region is split into 4 × 4 smaller square sub-regions,
and for each one, the Haar wavelet responses are
extracted.
Figure 3: Example: a subset of a dictionary of visual
words.
A variation of the SURF descriptor is U-SURF. In
this variation, the step of calculating the dominant
orientation of features is skipped, thus optimizing
the algorithm's performance, but losing resistance to
orientation transforms.
The dictionary of visual words is created from a
large collection of images by automatically detecting
their features and clustering them. We use the
k-means method for clustering. To improve the
algorithm's performance and results, we use an
improved k-means initialization method (Arthur and
Vassilvitskii, 2007), which, when choosing each new
starting center, takes into account the distance of
every data point from the centers already chosen.
Different numbers of visual words can be derived.
We use 350 words (selected heuristically). As the
k-means algorithm does not always converge, or
converges only after a very large number of
iterations, we set the maximum number of iterations
to 20000. Clustering is repeated twice and the
clusters with the smallest variation are selected. An
example of visual words is given in Figure 3.
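The seeding idea of Arthur and Vassilvitskii (2007) can be sketched in a few lines of Python. This is an illustrative sketch of the k-means++ seeding step only, with hypothetical function names; it is not the OpenCV routine used in the experiments, and the subsequent k-means iterations are omitted.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Choose k starting centers with k-means++ style seeding.

    The first center is drawn uniformly; every further center is drawn
    with probability proportional to the squared distance of a point
    from its nearest already chosen center.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers)
                   for p in points]
        if sum(weights) == 0:
            # All points coincide with chosen centers; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Points far from all existing centers are more likely to be picked, which spreads the initial centers over the descriptor space and usually reduces the number of iterations needed.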
For mapping keypoints to clusters, the Fast
Approximate Nearest Neighbor Search Based
Matcher (Muja and Lowe, 2009) is used. A histogram
is obtained by counting how many features of an
image are assigned to each visual word (Figure 4).
Each histogram is normalized so that the sum of all
its column values is equal to 1.
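The encoding of an image as a normalized histogram of visual words can be sketched as follows. For simplicity the sketch uses an exhaustive nearest-neighbour search instead of the approximate matcher of Muja and Lowe (2009); the function names and the toy vocabulary are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor.

    Exhaustive search; the paper uses an approximate matcher instead.
    """
    return min(range(len(vocabulary)),
               key=lambda i: sum((d - w) ** 2
                                 for d, w in zip(descriptor, vocabulary[i])))


def bow_histogram(descriptors, vocabulary):
    """Count the visual words of an image and normalize the counts to sum to 1."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # No features detected: return an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Toy usage with a 3-word vocabulary of 2-dimensional "descriptors".
vocabulary = [(0, 0), (10, 10), (0, 10)]
histogram = bow_histogram([(1, 1), (9, 9), (0, 1), (1, 9)], vocabulary)
```

Because of the normalization, images with different numbers of detected features yield comparable histograms.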
For classification we use the Support Vector
Machine (SVM) (Vapnik 1998). SVM aims to find
the optimal hyperplane, which separates two classes
in a multidimensional space. The optimality is
estimated from the distance from the hyperplane to
the data of both classes. Since not all data can be
separated linearly, the kernel trick is used: the data
is projected into a higher-dimensional space, where
it may be possible to separate them. We use the χ2
kernel. The gamma parameter of this kernel,
determined by trial and error, is 0.50625. For
training, the number of iterations is bounded to
70000. Since SVM is a binary classification method,
classifying data into more than two classes requires
several binary classifiers, whose results are put to a
vote. According to the voting results, the winner is
determined.
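The kernel and the voting step can be sketched as follows. The kernel is written in the exponential form exp(-gamma * sum((x - y)^2 / (x + y))), which is how OpenCV documents its χ2 SVM kernel. The voting function assumes a one-vs-one scheme, in which one binary classifier is trained for every pair of classes; this scheme and the `pairwise_decision` callable are assumptions for illustration, since the text does not state the exact arrangement.

```python
import math
from collections import Counter
from itertools import combinations


def chi2_kernel(x, y, gamma=0.50625):
    """Exponential chi-square kernel between two histograms."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * chi2)


def one_vs_one_predict(sample, classes, pairwise_decision):
    """Let every pair of classes vote and return the class with most votes.

    `pairwise_decision(a, b, sample)` is a hypothetical callable standing
    in for a trained binary SVM; it must return either `a` or `b`.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decision(a, b, sample)] += 1
    return votes.most_common(1)[0][0]
```

Under the one-vs-one assumption, eight categories require 8 · 7 / 2 = 28 binary classifiers, and fifteen categories require 105.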
Figure 4: Calculation of histograms.
4.1 Hardware and Software
For the implementation of the methods and
experiments, we used a portable computer with an
Intel Core i7-3630M processor operating at 3.4 GHz.
The C++ programming language and the OpenCV
3.1 open source library
(https://github.com/Itseez/opencv) were used to
implement this project. In this version, some of the
required methods are not available in the main
library, but they are available in the optional
opencv_contrib module
(https://github.com/Itseez/opencv_contrib). The
CMake 3.5.0-rc3 software and the Microsoft Visual
Studio 2015 compiler were used to compile the
OpenCV library and the extra module into binary
files on the Windows 10 OS.
4.2 Dataset
We use the dataset from (Oliva and Torralba, 2001).
The dataset consists of 8 categories of environmental
imagery: coast, forest, highway, city, mountain,
open country, street, high buildings. Each category
contains more than 250 annotated images each with
256 × 256 pixels resolution. Since the number of
images in each category is different, only the first
250 pictures of each category are used for the study:
200 for training and 50 for testing. Figure 5 provides
an example of images in each category.
To extend the study, we extended the original
dataset with indoor environment image categories
from (Lazebnik et al., 2006), which added two new
categories to a set of categories used in (Fei-Fei and
Perona, 2005). The new dataset has 15 indoor and
outdoor categories: coast, forest, highway, inside
city, mountain, open country, street, high buildings,
bedroom, industrial, kitchen, living room, office,
shop and suburban.
Figure 5: Examples of image categories (Oliva and
Torralba, 2001).
In Figure 6, examples of pictures from the additional
categories are shown. There are 200-300 images in
each category of this set, so the first 200 of each
category are used for the tests.
Figure 6: Examples of images in additional 7 categories.
4.3 Results
First, we compare different feature detectors and
descriptors by analyzing various combinations of
them. The experiments use pictures from 8 outdoor
categories. The size of the pictures is reduced to 240 ×
240 pixels. Accuracy is calculated by dividing the
number of correctly categorized images by the
number of images used for testing. We compare
three combinations: SIFT / SIFT, SURF / SURF and
SURF / U-SURF. The first word denotes the detector,
the second the descriptor. The results are presented in
Figure 7. Using the SURF detector and the U-SURF
descriptor, the best accuracy (84%) was obtained,
which is on average 8.43% higher than that of the
SURF / SURF combination. This is probably because
the Bag of Words model itself is sufficiently resistant
to changes in the orientation of features, so no
additional calculation of orientation is required.
Figure 7: Comparison of SIFT and SURF methods.
Next, the effectiveness of the descriptors was
analyzed using a grid as the detector. The grid step is
12, and the feature size is 6. The results are presented
in Figure 8. The SIFT descriptor (82% accuracy)
gives the best results when describing the features
placed by the grid. The U-SURF descriptor again
turned out to be better than the classic SURF, so it
can be said that the model does not require
orientation information in the descriptor.
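Grid-based detection places keypoints at regular positions regardless of the image content. A minimal sketch is given below; the function name and the half-step offset of the first keypoint from the image border are assumptions, since the exact placement is not specified in the text.

```python
def grid_keypoints(width, height, step=12, size=6):
    """Return (x, y, size) keypoints on a regular grid covering the image.

    The first keypoint is placed half a step from the border; this offset
    is an assumption made for illustration.
    """
    offset = step // 2
    return [(x, y, size)
            for y in range(offset, height, step)
            for x in range(offset, width, step)]


# With the parameters reported above, a 240 x 240 image yields a
# 20 x 20 grid, i.e. 400 keypoints per image.
keypoints = grid_keypoints(240, 240, step=12, size=6)
```

Every image of the same size therefore contributes the same number of descriptors, including those from uniform regions that a conventional detector would skip.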
Figure 8: Classification results using grid-based feature
Finally, we compare the FAST and MSER detectors.
An important FAST detector parameter, the threshold
value, is indicated by the number in the name, e.g.,
FAST30. Figure 9 depicts their results using different
descriptors. As in previous experiments, the SURF and
U-SURF descriptors appear to be worse than SIFT
when they describe features detected by a detector
other than their own. The best accuracy (79.75%)
was obtained using the FAST detector with a threshold
value of 30 and the SIFT descriptor. The MSER
region detector was not effective at detecting
distinctive features.
Figure 9: Comparison of detectors: FAST and MSER.
We have also compared the performance in terms of
the mean time required for encoding one image (which
includes detecting image features, describing them
with the descriptor and then describing the image by
a histogram). The results are shown in Figure 10.
Using the SURF detector with the U-SURF
descriptor, an image is encoded on average 33%
faster than with the SIFT / SIFT combination. Both
combinations yield similar classification results, so
the SURF / U-SURF combination is more cost
effective in terms of time. FAST30 / SIFT was the
slowest, because with a threshold value equal to 30
the FAST algorithm detects a very large number of
features.
Figure 10: Mean time of image encoding.
Since the same 200 images of each category were
always used for training and the remaining 50 images
for testing, the reliability of the obtained accuracy
is not high. In order to obtain
more accurate and reliable results, the classifier was
trained 100 times with the two best combinations
(SURF / U-SURF and Grid / SIFT), randomly
selecting 200 images for training and 50 for testing
in each category. The SURF / U-SURF combination
has an average accuracy of 83.51 ± 1.67% and the
Grid / SIFT combination an average accuracy of
84.99 ± 1.45%. In Figure 11, the confusion matrices
for outdoor environment categories are presented.
The vertical axis shows the real class, and the
horizontal axis the predicted class. Correctly
categorized pictures lie on the diagonal. The matrices
show the averages of the predictions over 100 tests,
in each of which 50 images of each class were
classified. Both confusion matrices are very similar
and share general features: the pictures of forest and
high buildings are classified most accurately, while
open country pictures are often confused with coast
and mountain views. It should be taken into account
that the dataset used is not perfect and may contain
some ambiguous images. Also,
some categories are essentially semantically similar,
e.g., street imagery sometimes appears in urban
Figure 11: Confusion matrices for outdoor environment categories.
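The construction of such a confusion matrix, with real classes as rows and predicted classes as columns, and the accuracy derived from its diagonal can be sketched as follows. The function names and the toy labels are hypothetical and serve only as an illustration.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are real classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for real, predicted in zip(true_labels, predicted_labels):
        matrix[index[real]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Share of correctly categorized images (the diagonal of the matrix)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy usage with three categories and five test images.
classes = ["coast", "forest", "open country"]
real = ["coast", "coast", "forest", "open country", "open country"]
predicted = ["coast", "open country", "forest", "coast", "open country"]
matrix = confusion_matrix(real, predicted, classes)
```

In this toy example one coast image is predicted as open country and one open country image as coast, which appears as two off-diagonal entries.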
These experiments show that, for the Bag of Visual
Words model, the most effective combinations of
detectors and descriptors are SURF / U-SURF and
Grid / SIFT, both achieving over 83% accuracy.
The SURF descriptor produced good results only
when used with the SURF detector, and the SIFT
descriptor was most effective in describing the
features placed by the grid technique. The
FAST detector with a low threshold value
turned out to detect many features, and
although it yielded a good result, it took a relatively
long time. The MSER detector proved to be
ineffective at detecting distinctive features.
For classification of indoor scenes, we used the
second (extended) dataset containing 15 categories
of images, five of which are indoor scenes. Since
this set contains images of different sizes, for the
experiment they were scaled down proportionally so
as not to exceed 200 × 200 pixels.
The Grid / SIFT and SURF / U-SURF combinations
were used to detect and describe the features. For
the grid method, the step and the feature size were
also reduced proportionally, to 10 and 5. Because
there are fewer images in each category of this
dataset, 200 are used for each category: 150 for
training and 50 for testing.
First, the classification accuracy was tested on
the five indoor scene categories. The test was
performed 50 times with randomly selected training
and testing images, and an average accuracy of 55.85
± 2.81% with the SURF / U-SURF combination and of
58.16 ± 2.22% with the Grid / SIFT combination was
achieved. The results are presented as confusion
matrices in Figure 12. From the results we can see
that with the Grid / SIFT combination there is a
better separation between bedroom and kitchen
images, but basically all indoor images are confused
with one another. It is noteworthy that images of the
bedroom, the kitchen and the living room are
visually quite similar; a person distinguishes them
by the objects they contain. The store images are
separated best, probably because the store
environment is not visually similar to home rooms,
as it has many similar and repetitive objects, little
furniture, and a small amount of open space.
Figure 12: Confusion matrices for indoor environment
categories.
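The repeated random-split protocol behind the "mean ± deviation" figures can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import random
import statistics


def random_split(images_by_category, n_train, n_test, rng):
    """Randomly draw disjoint training and test sets from every category."""
    train, test = [], []
    for category, images in images_by_category.items():
        chosen = rng.sample(images, n_train + n_test)
        train += [(image, category) for image in chosen[:n_train]]
        test += [(image, category) for image in chosen[n_train:]]
    return train, test


def summarize(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

In each repetition a classifier would be trained on `train` and evaluated on `test`; the accuracies collected over all repetitions are then passed to `summarize`.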
Finally, classification of all 15 categories of
environment scenes was performed. The test
was performed 50 times with a random selection
of 150 training and 50 test images per category,
using the Grid / SIFT combination with the same
parameters, and an average accuracy of 67.49 ±
1.50% was obtained.
The confusion matrix is presented in Figure 13.
We can see that indoor scenes are not often confused
with outdoor scenes; mostly they are confused with
one another. Two new scene categories are included:
the industrial environment and the suburbs. Pictures of
the industrial environment include both outdoor and
indoor scenes. The pictures of suburban scenes were
classified quite accurately, while the industrial
environment was often confused with most other
categories, especially with store scenes: on average,
7.86 out of 50 pictures of the industrial environment
were categorized as stores. As can be seen from
Figure 14, industrial scenes are not visually very
similar to other scenes, so their poor classification
is probably due to the lack of training data,
given that the pictures in this category are both
outdoor and indoor scenes.
The classification of indoor scenes using detected
local features proved to be a much more difficult
task than the classification of outdoor
scenes. This is partly because indoor scenes are
man-made, and scenes of different categories are
similar in their visual features.
We have analysed the use of the Bag of Words
(BoW) model for recognition of environment scenes
in digital images. The BoW model is often used
to classify environment images based on their local
features; therefore, different variants of the methods
used for its constituent stages have been analysed.
A demonstration application was developed to
analyse the operation of the algorithm, which allows
us to train the classifier (we used Support Vector
Machine (SVM) with the χ2 kernel) with new
images or to select an already trained classifier and
identify the environment category of a new image.
In our experiments, 200 images of each category
were used for the training of classifiers, and it was
observed that increasing the amount of training
data reduces the classification errors, but the
accuracy did not reach a plateau because the data
set used is too small. The speed and efficiency of the
algorithm also depend on the methods used for
detecting and describing the distinctive image
features, so the SIFT, SURF, FAST and MSER
feature detection methods have been investigated,
along with the SIFT and SURF description methods.
Figure 13: Confusion matrices for all (outdoor and indoor) image categories.
Figure 14: Examples of incorrect classification.
We also analysed feature detection using an
artificial grid, which places features without
reference to any local image information. After the
grid parameters (step and feature size) were
optimized experimentally, this feature detector
proved to be particularly effective with the SIFT
descriptor. With a grid step of 12 and a feature size
of 6, on images reduced to 240 × 240 pixels, an
accuracy of 84.99 ± 1.45% was achieved when
classifying images of outdoor environments into
eight categories. The features placed by the
artificial grid yielded better results than the
features found by the SIFT detector: with the same
set of images used for training and testing, grid
features described by the SIFT descriptor achieved
10% higher accuracy. We therefore argue that scene
recognition depends not only on the distribution of
distinctive features, but also on the features that
common detectors treat as uninteresting.
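Grid-based detection can be sketched in a few lines. The sketch uses the reported step (12), feature size (6) and image size (240 × 240), and assumes that the grid starts at the top-left corner with no border margin, which is not specified in the text. The function name and the tuple layout are hypothetical.

```python
def grid_keypoints(width, height, step, size):
    """Return (x, y, size) tuples on a regular grid.

    The image content is never inspected: keypoints are placed
    purely by position.
    """
    return [(x, y, size)
            for y in range(0, height, step)
            for x in range(0, width, step)]


keypoints = grid_keypoints(240, 240, step=12, size=6)
```

Under these assumptions every 240 × 240 image receives the same 20 × 20 = 400 keypoints, each of which would then be described with the SIFT descriptor.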
The SURF descriptor without orientation
information (U-SURF) worked better in the BoW
model than the classic SURF descriptor. Using the
SURF detector with the U-SURF descriptor gave an
average accuracy improvement of 8.43% over the
classic SURF descriptor. This indicates that feature
orientation information is not required for the
recognition of the environment by this model, and
that it only complicates the recognition process.
The SURF detector with the U-SURF descriptor
operates faster (an image is encoded about 33%
faster than with the grid detector and the SIFT
descriptor, with an average encoding time of 0.4 s
per image), but a slightly lower accuracy
(83.51 ± 1.67%) was obtained.
We noticed that the SURF descriptor produces good
results only when describing features detected by
the SURF detector, while the SIFT descriptor works
well with various detectors. Other combinations of
detectors and descriptors were less effective than
the two described above; their accuracy varied from
65% to 79.75% when 200 images of each category
were used for training. The algorithm was also
tested on indoor images with the two most effective
detector and descriptor combinations and reached an
accuracy of 55.85% to 58.16% when classifying
images into five categories of indoor environment.
The store environment was distinguished most
reliably: on average, it was correctly recognized in
39 out of 50 images, whereas images of bedroom,
kitchen, living room and office scenes were often
confused with one another. When the algorithm was
tested on a data set containing 15 outdoor and
indoor categories, an overall accuracy of
67.49 ± 1.50% was obtained. Again, the indoor
images were often confused with each other, but
rarely with images of the outdoor environment
categories.
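The per-category counts reported above can be converted into rates as follows. This is only a worked illustration using figures quoted in the text (39 of 50 store images recognized correctly in the indoor experiment; on average 7.86 of 50 industrial images labelled as stores in the experiment with all categories). The helper name is hypothetical.

```python
def rate(count, total):
    """Fraction of a category's test images that received a given label."""
    return count / total


# Store images recognised as stores (indoor experiment).
store_recall = rate(39, 50)
# Industrial images labelled as stores (all-category experiment).
industrial_as_store = rate(7.86, 50)

print(f"store recall: {store_recall:.1%}")
print(f"industrial labelled as store: {industrial_as_store:.1%}")
```

Each value corresponds to one entry of a row-normalised confusion matrix: the first lies on the diagonal and the second off the diagonal.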
We have noticed that recognizing and separating
indoor scenes is more complicated because they are
artificially created environments with many
inter-category similarities, uniform shapes and
repetitive objects. As a result, images of different
categories produce similar distinctive features,
which leads to classification errors. The type of a
room could be determined more precisely by finding
specific objects in it; however, this is difficult for
a system based solely on the distribution of
distinctive features.
The results of the research presented in this paper
could be used by researchers as well as practitioners
developing environment scene recognition systems
for blind and partially sighted people.
Arthur, D., Vassilvitskii, S. 2007. K-Means: The
Advantages of Careful Seeding. Society for Industrial
and Applied Mathematics, pp. 1027-1035.
Bay, H., Tuytelaars, T., Van Gool, L. Surf: Speeded Up
Robust Features, Computer vision, ECCV 2006.
Springer, pp. 404-417. doi:
Bay, H., Ess, A., Tuytelaars, T., Van Gool, L. 2008.
Speeded-Up Robust Features (SURF). Comput. Vis.
Image Underst. 110, 3, (June 2008), pp. 346-359.
Chan, L.A., Der, S.Z., Nasrabadi, N.M. 2002. Image
Recognition and Classification. Marcel Dekker, Inc.
Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray,
C., 2004. Visual Categorization with Bags of
Keypoints. Workshop on Statistical Learning in
Computer Vision, ECCV Prague, pp. 1-22.
Dobre, C., Mavromoustakis, C., Garcia, N., Goleva, R.,
George Mastorakis, G., 2016. Ambient Assisted Living
and Enhanced Living Environments: Principles,
Technologies and Control (1st ed.). Butterworth-
Heinemann, Newton, MA, USA.
Dutt, B.S.R., Agrawal, P., Nayak, S. 2009. Scene
Classification in Images.
Ezaki, N., Bulacu, M., Schomaker, L., 2004. Text
Detection from Natural Scene Images: Towards a
System for Visually Impaired Persons. In 17th
International Conference on Pattern Recognition, vol.
2, 683-686. doi: 10.1109/ICPR.2004.1334351
Fei-Fei, L., Perona, P., 2005. A Bayesian Hierarchical
Model for Learning Natural Scene Categories. In 2005
IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, vol. 2, pp. 524-531.
doi: http://doi.org/10.1109/CVPR.2005.16
Gabryel, M., Capizzi, G. 2017. The Bag-of-Words Method
with Dictionary Analysis by Evolutionary Algorithm.
In 16th Int. Conference on Artificial Intelligence and
Soft Computing, ICAISC 2017, Part I. Lecture Notes in
Computer Science 10246, Springer, pp. 43-51.
Gabryel, M., Damasevicius, R. 2017. The Image
Classification with Different Types of Image Features.
In 16th Int. Conference on Artificial Intelligence and
Soft Computing, ICAISC 2017, Part I. Lecture Notes in
Computer Science 10245, pp. 497-506.
Lazebnik, S., Schmid, C., Ponce, J. 2006. Beyond Bags of
Features: Spatial Pyramid Matching for Recognizing
Natural Scene Categories. In 2006 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, vol. 2, pp. 2169-2178. doi:
Lowe, D.G. 2004. Distinctive Image Features from Scale-
Invariant Keypoints. International Journal of
Computer Vision, vol. 60, no. 2. pp. 91-110. doi:
Lu, Z.-L., Dosher, B., 2013. Visual psychophysics: From
laboratory to theory. Cambridge, MA: MIT Press.
Mann, S., Huang, J., Janzen, R., Lo, R., Rampersad, V.,
Chen, A., Doha, T, 2011. Blind Navigation with a
Wearable Range Camera and Vibrotactile Helmet. In
19th ACM international conference on Multimedia,
pp. 1325-1328. doi: 10.1145/2072298.2072005
Matas, J., Chum, O., Urban, M., Pajdla, T. 2004. Robust
Wide-Baseline Stereo from Maximally Stable
Extremal Regions. Image and Vision Computing. vol.
22, no. 10. pp. 761-767. doi:
Muja, M., Lowe, D.G. 2009. Fast Approximate Nearest
Neighbors with Automatic Algorithm Configuration.
In VISAPP International Conference on Computer
Vision Theory and Applications, vol. 2. pp. 331-340.
Napoli, C., Pappalardo, G., Tramontana, E., Marszalek, Z.,
Polap, D., Wozniak, M. 2014. Simplified firefly
algorithm for 2D image key-points search. In 2014
IEEE Symposium on Computational Intelligence for
Human-like Intelligence, CIHLI 2014, pp. 118-125.
Oliva, A., Torralba, A. 2001. Modeling the Shape of the
Scene: A Holistic Representation of the Spatial
Envelope. 2001. International Journal of Computer
Vision, 42, 3, pp. 145-175. DOI:
Oliva, A., Torralba, A. 2011. Modeling the Shape of the
Scene: A Holistic Representation of the Spatial
Envelope. International Journal of Computer Vision,
vol. 42, no. 3. pp. 145-175. doi:
Oliva, A., Torralba, A., 2007. The Role of Context in
Object Recognition. Trends in Cognitive Sciences, vol.
11, no. 12. pp. 520-527. doi:
Pascolini D., Mariotti, S.P., 2011. Global Estimates of
Visual Impairment: 2010. The British Journal of
Ophthalmology, vol. 96, no. 5, pp. 614-618. doi:
Polap, D., Wozniak, M., Napoli, C., Tramontana, E.,
Damasevicius, R. 2015. Is the Colony of Ants Able to
Recognize Graphic Objects? In Information and
Software Technologies - 21st International
Conference, ICIST 2015, pp. 376-387.
Rosten, E., Drummond, T. 2006. Machine Learning for
High-Speed Corner Detection. In 9th European
conference on Computer Vision - Volume Part I
(ECCV'06), pp. 430-443. doi:
Vapnik, V. 1998. Statistical learning theory. New York:
Vogel, J., Schiele, B. 2004. A Semantic Typicality
Measure for Natural Scene Categorization. Pattern
Recognition, pp. 195-203. doi:
Wozniak, M., Polap, D. 2014. Basic Concept of Cuckoo
Search Algorithm for 2D Images Processing with
Some Research Results - An Idea to Apply Cuckoo
Search Algorithm in 2D Images Key-points Search. In
11th International Conference on Signal Processing
and Multimedia Applications, SIGMAP 2014, pp. 157-
Wozniak, M., Polap, D., Gabryel, M., Nowicki, R.K.,
Napoli, C., Tramontana, E. 2015. Can We Process 2D
Images Using Artificial Bee Colony? In Artificial
Intelligence and Soft Computing - 14th International
Conference, ICAISC (1) 2015, pp. 660-671.