REAL-TIME ROAD SCENE CLASSIFICATION USING INFRARED IMAGES
David Forslund, Per Cronvall and Jacob Roll
Autoliv Electronics AB, Linköping, Sweden
Keywords:
Scene classification, Bag of words, Visual words.
Abstract:
This paper applies real-time scene classification to the two-class problem of separating city and rural scenes in images constructed from an infrared sensor mounted at the front of a vehicle. The 'Bag of Words' algorithm for image representation has been evaluated and compared to two low-level methods, 'Edge Direction Histograms' and 'Invariant Moments'. A method for fast scene classification within the Bag of Words framework is proposed, based on a grey patch representation of image elements and a modified floating search for visual word selection. It is also shown empirically that floating search for visual word selection outperforms the currently popular k-means clustering for small vocabulary sizes.
1 INTRODUCTION
In image processing, scene classification is a fun-
damental task. Providing semantic labels to image
scenes is beneficial as a preparatory step for further
processing, such as object recognition. In intelligent vehicle applications, real-time scene classification can be useful both during day and night time. For
the night time case a visual camera cannot be used and
an alternative imaging device, e.g. an infrared sensor,
is required. This paper applies the task of scene clas-
sification to the field of real-time infrared vision sys-
tems, but the proposed methods generalise well also
to grey-scale images. Emphasis is laid on proposing a
system suitable for the real-time two class application
of separating city and rural road scenes. There are
two major sides to scene classification: image repre-
sentation and classification. For image representation,
the Bag of Words (BoW) framework, which describes
the image through the distribution of small image el-
ements, visual words, has been employed and com-
pared to two low-level image representation methods,
Edge Direction Histograms (EDH) and Invariant Mo-
ments (IM). For classification we used two classifiers:
Support Vector Machines (SVM) using radial basis
kernels as implemented in (Chang and Lin, 2001), and
k-Nearest Neighbour (kNN). Due to its larger memory demand, kNN in its original formulation is not suitable for the real-time system, but is regarded as a reference for evaluation purposes. In the
BoW framework k-means clustering is traditionally
used for the formation of the visual vocabulary. This
paper uses a modified version of the floating search
algorithm initialised by k-means for this task, which
gives a vocabulary adapted to the specific classifica-
tion task at hand. Our contributions are, firstly, the treatment of scene classification in infrared images and, secondly, the emphasis on solving the real-time problem.
Contributions to the BoW algorithm are investigation
of the use of very small vocabularies and the use of
floating search for visual vocabulary construction.
2 RELATED WORK
Scene classification is a mature field in image pro-
cessing, and a variety of approaches to the task have
been investigated. However, few of these deal with
computationally constrained problems such as real-
time applications. Low-level methods are compu-
tationally cheap and are interesting in this context.
(Vailaya et al., 1998) uses a variety of global low-level
features based on colour histograms, frequency do-
main DCT coefficients and edges, applied to the two
class problem of separating city and landscape im-
ages. Edge based features showed best results. Their
work was extended to involve more than two classes
in (Vailaya et al., 2001). (Oliva and Torralba, 2003)
and (Oliva and Torralba, 2001) utilised the frequency
domain further by studying the statistical properties
of the Fourier spectra of image categories and apply-
ing PCA on the spectra to obtain a feature represen-
tation. (Szummer and Picard, 1998) considers tex-
ture features (MSAR) for the indoor-outdoor problem
and compares them to colour histograms and DCT.
They also conclude that performance can be gained
by combining features of different types. Another
low-level approach, invariant moments, was applied in (Devendran et al., 2007) to the two-class street-highway problem. The BoW image representa-
tion, which implies an additional abstraction level,
has been applied to the scene classification task with
great success. In particular, it has been shown to work well
when there are many classes to categorise. (Quelhas
et al., 2005) studied the three class problem of sep-
arating indoor, city and landscape scenes by apply-
ing a BoW representation using sparse SIFT descrip-
tors and applying probabilistic latent semantic analy-
sis (pLSA) to give a compact representation. Results
were compared to those of the low-level methods de-
fined in (Vailaya et al., 1998), where BoW was shown
to be superior. (Bosch et al., 2008) solves a multi-
class problem (13 scenes) also using the BoW algo-
rithm and pLSA. Several image element representa-
tions were evaluated: grey patches, colour patches,
dense grey SIFT, dense colour SIFT and sparse grey
SIFT. The dense SIFT was found to give best per-
formance. A promising recent approach to the BoW
framework is the Bag of Textons (Walker and Malik,
2003) which has been applied successfully to several
complex scene classification problems, e.g. (Battiato
et al., 2008). Textons are however left outside the
scope of this paper. A thorough review of previous
work in scene classification was carried out in (Bosch
et al., 2007).
3 IMAGE REPRESENTATION
The city-rural classes can be assumed to have large
intra-class variability. This allows employing less
complex algorithms while still achieving good results.
3.1 Bag of Words
BoW originates from text retrieval, but has also been
successfully applied to image processing (Sivic and
Zisserman, 2003). The method involves extracting
local patches, image elements, from each image, representing them by some descriptor. These are then
quantised by a set of representative descriptors, a vi-
sual vocabulary, where each member is called a vi-
sual word. Each image is represented by a feature
vector, constituted by the occurrence frequencies of
the V visual words. These are measured by matching
extracted image elements to the visual words by Euclidean nearest neighbour.

Figure 1: An example image (a) is processed (c) by DC level and std compensation (as in GP-DC). This is compared to the vocabulary quantisation (d) of (a) by a GP-DC vocabulary (b) with V = 64.

Each element of the feature
vector is normalised to range [0 1] on the set of train-
ing images to remove bias towards common words.
Normalisation coefficients are stored in a vector re-
ferred to as the scaling vector. BoW is employed in
this paper, while the recently popular pLSA is not, since it has been shown to give little benefit when the number of scene classes is small (Bosch et al., 2008).
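As an illustration of the representation just described, the following minimal Python sketch (with illustrative variable names, not the authors' implementation) builds the feature vector for one image from already-extracted patch descriptors, a visual vocabulary and a precomputed scaling vector:

import numpy as np

def bow_feature(descriptors, vocabulary, scaling=None):
    # descriptors: (P, D) patch descriptors extracted from one image
    # vocabulary:  (V, D) visual words
    # scaling:     optional (V,) per-word coefficients (the "scaling
    #              vector" estimated on the training images)
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)            # Euclidean nearest-neighbour matching
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    if scaling is not None:
        hist = hist / scaling               # normalise each word frequency towards [0, 1]
    return hist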
3.1.1 Representation of Image Elements
Image elements can be extracted densely by sampling
across the whole image with a fixed spatial interval or
sparsely by applying an interest point detector. Since
many image elements are extracted from each image,
a simple representation algorithm is desired. The high
abstraction level of the BoW algorithm allows even
such simple representations to result in powerful clas-
sifiers. A basic grey patch descriptor is obtained by
densely sampling square image regions of size n× n
and spacing m, giving a descriptor of length n² with
a strong bias to visual words describing pure grey-
levels. We denote it GP-Raw. To represent more dis-
criminative structures in the image than grey levels
we remove the DC component from each patch and
normalise the result to std 1. Adding the DC-level as
an extra descriptor gives a descriptor of length n² + 1
which we denote GP-DC. Quantisation of an image
using a vocabulary constructed by the GP-DC repre-
sentation is shown in Figure 1.
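A minimal sketch of dense GP-DC extraction, assuming a 2D grey-level image array; the default values of n and m and the small constant guarding against division by zero on flat patches are illustrative choices, not taken from the paper:

import numpy as np

def gp_dc_descriptors(image, n=11, m=5, eps=1e-8):
    # Densely sample n x n patches with spacing m and compute the
    # GP-DC descriptor: DC-removed, std-normalised pixels plus the
    # DC level appended, giving length n^2 + 1.
    H, W = image.shape
    descs = []
    for r in range(0, H - n + 1, m):
        for c in range(0, W - n + 1, m):
            patch = image[r:r + n, c:c + n].astype(float).ravel()
            dc = patch.mean()
            centred = patch - dc                    # remove the DC component
            centred /= (centred.std() + eps)        # normalise to std 1
            descs.append(np.append(centred, dc))    # keep the DC level as an extra element
    return np.array(descs)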
To further remove low dimensional structure we
developed two general methods to remove the gradi-
ent component from a patch. In the GP-PL, the patch
is seen as a surface z = f(x, y), where z is the pixel intensity. The mean gradient of the patch is removed by subtracting a plane z_a = ax + by + c obtained by a least-squares approximation of z. The result is then
normalised to std 1 and gradient information is kept
by adding coefficients a, b and c to the descriptor vec-
tor, giving length n² + 3. Another way to remove lin-
ear order structures is to remap the grey-levels based
on the histogram of the patch. The cumulative his-
togram q of an image is a monotonically increasing
function, thus the slope d of a least-squares linear ap-
proximation q̃ of q is positive, with magnitude depending on the dynamic range of the image. We discretise q̃
in terms of the histogram bins and subtract it from
the patch pixels. The result is normalised to std 1.
The patch DC level and the slope d are added to the
descriptor denoted GP-HGM, giving it length n² + 2.
Alternatively, q̃ can be discretised for each individual pixel. To avoid non-deterministic results, this requires
that the pixels are sorted in a controlled manner within
each histogram grey-level, taking the pixels' spatial lo-
cation in the patch into account. We denote this GP-
PS. Figure 2 shows the five GP based descriptors in-
troduced in this section applied to an image patch.
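The plane-removal step of GP-PL can be sketched as follows, assuming a rectangular grey-level patch; the coefficient ordering and the guard against zero standard deviation are our assumptions, not details given in the paper:

import numpy as np

def gp_pl_descriptor(patch, eps=1e-8):
    # Fit a plane z_a = a*x + b*y + c to the patch by least squares,
    # subtract it, normalise to std 1 and append (a, b, c),
    # giving a descriptor of length n^2 + 3.
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    z = patch.astype(float).ravel()
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)   # least-squares plane fit
    residual = z - A @ coeffs                        # remove the mean gradient
    residual /= (residual.std() + eps)               # normalise to std 1
    return np.concatenate([residual, coeffs])        # append a, b, c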
Sparse extraction has been employed using the
DoG detector and the SIFT descriptor as in (Lowe,
2004). The SIFT descriptor has also been applied
densely as in (Bosch et al., 2008). For each sampled
point, SIFT descriptors were then calculated on four
different scales, using circular support patches of radii
4, 8, 12 and 16 pixels.
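One possible way to reproduce such dense multi-scale SIFT extraction with OpenCV is sketched below; the grid spacing and the mapping from support radius to keypoint size are assumptions, and the input is expected as an 8-bit grey-level image:

import cv2

def dense_sift(image, spacing=5, radii=(4, 8, 12, 16)):
    # Compute SIFT descriptors on a regular grid, at four support radii
    # per grid point; `image` must be a single-channel uint8 array.
    sift = cv2.SIFT_create()
    H, W = image.shape
    margin = radii[-1]
    keypoints = [cv2.KeyPoint(float(x), float(y), float(2 * r))
                 for y in range(margin, H - margin, spacing)
                 for x in range(margin, W - margin, spacing)
                 for r in radii]
    _, descriptors = sift.compute(image, keypoints)
    return descriptors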
Figure 2: An image patch as represented by descriptors: (a) GP-Raw, (b) GP-DC, (c) GP-PL, (d) GP-HGM, (e) GP-PS. (f) shows the relative time consumption of the above descriptors.
3.1.2 Constructing a Visual Vocabulary
The vocabulary should be representative, approximat-
ing all possible image elements occurring in a sam-
ple image, and provide a representation that facilitates
separating the scene classes. To construct a vocabu-
lary, image elements are extracted from a subset of
the image dataset. From these (typically about 1 mil-
lion elements), a small set of a fixed size is created to
constitute the vocabulary. In literature, this has been
carried out by applying k-means clustering to the ex-
tracted image elements, defining the visual words as
the cluster midpoints. This strategy discards informa-
tion of the class membership of the image elements.
We wish to exploit this information to optimise the
vocabulary to the classification task. Thus we employ
floating search (Pudil et al., 1994), which is a feature
selection algorithm designed to select the best subset
of a predefined size out of a large set of features. The
subset quality is estimated based on some criterion
function. For the application of this paper, the only
sensible criterion function is the final classification
rate. To obtain an algorithm of manageable speed,
classification is carried out on a subset of 200 images,
and the criterion function is the mean value of a 4-
fold cross validation. To further increase speed, the
floating search algorithm is not allowed to pick from
all reference image elements, but from a set of 400
elements, obtained by k-means clustering the com-
plete set. Since words are matched by nearest neigh-
bour it is impossible to have a vocabulary of only one
word. Thus floating search needs to be initialised with
a two-word vocabulary, which can be found by ex-
haustive evaluation among the 400 candidates, or by
using some fitness measure on the individual image
elements.
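The selection procedure can be sketched as a simplified sequential floating forward search over the candidate words; the per-size bookkeeping and stopping rule below are our simplifications of (Pudil et al., 1994), and `criterion` is assumed to wrap the 4-fold cross-validated classification rate described above:

def floating_search(n_candidates, criterion, target_size, init_pair):
    # n_candidates: size of the candidate pool (e.g. 400 k-means centres)
    # criterion(subset): estimated classification rate for the vocabulary
    #                    given by a list of candidate indices
    # init_pair: the initial two-word vocabulary
    selected = list(init_pair)
    best_by_size = {len(selected): (criterion(selected), list(selected))}

    while len(selected) < target_size:
        # Forward step: add the candidate word that maximises the criterion
        remaining = [i for i in range(n_candidates) if i not in selected]
        best = max(remaining, key=lambda i: criterion(selected + [i]))
        selected.append(best)
        score = criterion(selected)
        if score > best_by_size.get(len(selected), (float("-inf"), None))[0]:
            best_by_size[len(selected)] = (score, list(selected))

        # Backward (floating) step: drop a word whenever the reduced subset
        # beats the best subset of that size found so far
        while len(selected) > 2:
            scores = [(criterion([j for j in selected if j != i]), i)
                      for i in selected]
            best_score, drop = max(scores)
            if best_score > best_by_size.get(len(selected) - 1, (float("-inf"), None))[0]:
                selected = [j for j in selected if j != drop]
                best_by_size[len(selected)] = (best_score, list(selected))
            else:
                break

    return best_by_size[target_size][1]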
3.2 Low-Level Algorithms
Low-Level features are fast to compute and, given
the problem complexity, might provide sufficient per-
formance. Edge direction histograms have been shown (Vailaya et al., 1998) to be effective for simpler scene classification tasks. They are well suited for the
city-rural problem, since they exploit the fact that
city scenes contain more vertical structures than ru-
ral scenes. EDH was implemented using both Canny
and Sobel edges, and adding the fraction of non-edge
pixels as an additional feature.
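A rough sketch of an EDH feature using Sobel edges follows; the number of direction bins and the edge threshold are illustrative assumptions, since the paper does not specify them:

import numpy as np
from scipy import ndimage

def edge_direction_histogram(image, n_bins=36, edge_thresh=None):
    # Histogram of gradient directions at edge pixels, plus the
    # fraction of non-edge pixels as an extra element.
    img = image.astype(float)
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    if edge_thresh is None:
        edge_thresh = mag.mean() + mag.std()        # simple automatic threshold
    edges = mag > edge_thresh

    angles = np.arctan2(gy[edges], gx[edges])       # directions at edge pixels
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    hist = hist / max(edges.sum(), 1)               # normalise over edge pixels

    non_edge_fraction = 1.0 - edges.sum() / edges.size
    return np.append(hist, non_edge_fraction)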
A set of seven central geometric image moments
proposed in (Hu, 1962), that in the discrete 2D case
can be shown to be translation, scaling and rotation
invariant, have been successfully used as features in
many image processing tasks, including scene clas-
sification (Devendran et al., 2007). These were im-
plemented as features by subdividing each image into
four regions and calculating Hu’s seven moments for
each region, giving a feature vector of length 28, nor-
malised by scaling the logarithm of each moment to
the range [0,1] over all training images.
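A sketch of this IM feature using OpenCV's Hu moments is given below; the log-magnitude is taken per moment, and the scaling to [0, 1] over the training images would be applied separately as described above:

import numpy as np
import cv2

def hu_moment_features(image):
    # Split the image into four quadrants and compute Hu's seven
    # moment invariants for each, giving a vector of length 28.
    H, W = image.shape
    regions = [image[:H // 2, :W // 2], image[:H // 2, W // 2:],
               image[H // 2:, :W // 2], image[H // 2:, W // 2:]]
    feats = []
    for region in regions:
        m = cv2.moments(region.astype(np.float32))
        hu = cv2.HuMoments(m).ravel()
        # log of the magnitude keeps the widely varying scales comparable
        feats.extend(np.log(np.abs(hu) + 1e-30))
    return np.array(feats)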
4 TIME AND MEMORY
For a real-time system, time and memory consump-
tion is crucial. The system consists of two stages: the
off-line stage of vocabulary construction and classi-
fier training, which is not severely restricted in com-
putational time, and the on-line stage of feature ex-
traction and classification which needs to be per-
formed at real-time speed. In the experiments clas-
sification time is consistently seen to be negligible
in comparison to feature extraction, which here in-
volves image element representation and matching to
visual words. Time demands depend on the repre-
sentation method (Figure 2f), the patch size n (to a
degree depending on the representation) and approxi-
mately inversely quadratically on the patch spacing m
when m is much smaller than the image width. Time demands of the visual
word matching depend linearly on the vocabulary size
V, inversely quadratically on m and on n to a degree
depending on the matching implementation.
The only data to store in the real-time system is the
visual vocabulary, the scaling vector and the classifier
model. In this implementation (for reasonable patch
sizes) it is the classifier model that limits the mem-
ory requirements. A simple kNN classifier requires
storage of all training vectors, while an SVM using the ν-SVM formulation requires storage of roughly
ν× N support vectors. Thus it is the length of the fea-
ture vector, V, that limits the memory requirements. In
fact, memory requirements of both SVMs and kNNs
increase linearly with V.
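As a rough illustration of this scaling (the figures below are assumptions for illustration, not numbers from the paper; e.g. 6 000 training images corresponds to three folds of a 4-fold split of an 8 000-image dataset):

def svm_memory_bytes(nu, n_train, V, bytes_per_value=4):
    # A nu-SVM stores roughly nu * n_train support vectors of length V
    return int(nu * n_train) * V * bytes_per_value

def knn_memory_bytes(n_train, V, bytes_per_value=4):
    # kNN stores all n_train training vectors of length V
    return n_train * V * bytes_per_value

# Illustrative: nu = 0.2, 6 000 training images, V = 16, single precision
# gives about 77 kB for the SVM model versus about 384 kB for kNN.
print(svm_memory_bytes(0.2, 6000, 16), knn_memory_bytes(6000, 16))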
5 RESULTS
The performance evaluation dataset consists of 8 000 images of 324 × 256 pixels, half from each category, ex-
tracted from video sequences recorded by a vehicle
mounted infrared sensor. These were gathered dur-
ing night time at various locations in Sweden and
Germany in varying weather conditions. The images
were sampled in the sequences with a constant spatial
interval of 20 meters in city environments and 100
meters in rural environments. Some pre-processing
was carried out to scale the IR intensities to appropri-
ate grey-levels. The ground truth was found by visual
inspection of the video sequences (not individual im-
ages). A few images from the dataset are shown in
Figure 3.
Evaluation was carried out by 4-fold cross vali-
dation on the whole dataset and performed in several
rounds. The primary, exhaustive round was carried
out for GP-Raw, GP-DC and GP-PS to find suitable
values of parameters such as V, n and m. V was var-
ied in the range 16-400, n in the range 5-15 pixels
and m in suitable ranges for each given patch size.
The floating search algorithm is very time consuming,
and was thus not used in this exhaustive evaluation.
Instead vocabularies were generated using k-means
clustering on image elements extracted from 50-300
images.

Figure 3: Three images from the city (a) and rural (b) dataset.

Also, classifier parameters were tuned for
optimal performance. Generally, classification per-
formance increases with increasing V and n, and with
decreasing m. It can be seen in Figure 4 that perfor-
mance is good already for vocabularies of size V = 16
(for the GP-DC and GP-PS) which is a much smaller
vocabulary size than what has been commonly used
in the literature. In fact, the vocabularies turn out to be saturated with information, introducing noisy visual
words already at surprisingly small sizes. The patch
spacing governs the number of patches extracted from
each image. Small patch spacing gives a larger num-
ber of extracted patches, and thus a more detailed de-
scription, increasing the performance at the cost of
increased time demands. The effect of the patch size
on the classification performance is not as transparent
as the other variables. It affects many different char-
acteristics of the algorithm such as the scale of the
detected objects, the accuracy of the visual word
matching and the maximum possible complexity of
the visual words. A further discussion is given in
(Forslund, 2008). The GP-PL, GP-CT and GP-HGM
were evaluated separately using suitable parameters.
Though good results were obtained, they were not
surpassing those of the GP-DC and GP-PS methods
when both performance and speed were considered.
Classification results of the different BoW em-
bodiments and the two low-level algorithms are sum-
marised in Table 1. A variety of parameters were
evaluated, but only the best results for each algorithm
are shown in the table. Due to the higher abstrac-
tion level of the BoW model, and the adaptation of
the visual vocabulary to the specific dataset, the BoW
algorithm consistently outperforms the low-level al-
gorithms. Invariant moments are not well suited for
this application since much vital information for the
task lies in the orientation of structures within the
image. The EDH features, which utilise this infor-
mation, perform much better. For varying V, sparse
SIFT consistently performs badly, due to the inability
of the interest point detector to detect the full content of the image scene.

Figure 4: Classification performance (%) as a function of the vocabulary size V for a few parameter settings (GP-Raw, GP-DC and GP-PS; n = 11 or n = 5, m = 5), using (a) an SVM and (b) a kNN classifier.

E.g. large uniform areas,
which are frequent in the rural IR images, are ne-
glected. This is in accordance with the results of
(Bosch et al., 2008), stating that sparse
extraction is not suitable for scene classification. The
SIFT descriptor, however, is very powerful and ap-
plied densely it demonstrates the best performance of
all methods evaluated in this paper. It is however too
time consuming for the real-time application. Grey
patch based BoW descriptors on the other hand, are
fast to compute and also show good performance. GP-
Raw gives a vocabulary with many redundant visual
words representing homogeneous grey-levels; the GP-
DC descriptor was introduced to overcome this issue,
and gives a very good trade-off between speed and
performance. The gradient removal approach, GP-
PL, gave interesting vocabularies, but no performance
increase compared to the GP-DC representation. The
histogram based processing GP-HGM and GP-PS,
however, improved the performance compared to GP-
DC for small V (This is not visible in Table 1 since
only the best results of each algorithm are shown).
The histogram based descriptors are however too time
consuming to justify the performance gain and are not
considered for the real-time system.
A suggested real-time vocabulary, denoted RTV
in Table 1, was selected using the GP-DC descrip-
tor to form a vocabulary of size V = 16 using 7 × 7
patches sampled at a spacing of m = 3 pixels. Us-
ing this parameter set, floating search was applied as
described in Section 3.1.2 up to a vocabulary size of
V = 33. Results are displayed, and compared to us-
ing k-means, in Figure 5. For the 16 word vocabulary
used in the RTV, there is a significant performance
gain. To boost the execution speed further, images
were downsampled by a factor s = 2 giving a per-
formance decrease of about 1.5 pp and a fourfold speedup.

Table 1: Summarised results. The best classification performance of each algorithm is given in % (std). Note that parameters vary between the different algorithms. The RTV is included for comparison.

Algo.     Classif. SVM   Classif. kNN
EDH       88.2 (1.7)     90.5 (0.7)
IM        81.0 (0.9)     72.0 (1.7)
GP-Raw    92.8 (1.0)     94.3 (0.9)
GP-DC     96.3 (0.6)     96.7 (0.3)
GP-PL     92.9 (0.7)     96.0 (0.4)
GP-PS     96.3 (0.4)     96.7 (0.4)
GP-HGM    95.4 (0.7)     96.0 (0.4)
S-SIFT    89.1 (1.3)     83.2 (1.5)
D-SIFT    96.3 (0.3)     97.0 (0.4)
RTV       92.7 (0.6)     -

With the SVM parameter ν tuned according
to this configuration to ν = 0.2, and support vectors
stored in single precision, this whole system requires
only 100 kB of memory. The RTV requires about 0.19 s per image (in a MATLAB implementation on an Intel(R) Core(TM)2 Duo CPU @ 2.33 GHz with 2 GB of RAM), yielding a maximum frame rate of about 5.3 Hz, thus within the limits of real-time per-
formance. The classification rate of the RTV is 92.7%
for static images.
Figure 5: Classification performance (%) as a function of vocabulary size for vocabularies generated using floating search compared to using k-means, with both kNN and SVM classifiers.
6 CONCLUSIONS
The aim of this paper was to develop a real-time sys-
tem able to separate scenes into the two categories
city and rural scene based on images acquired from
a vehicle mounted IR camera. We used a bag of words based method, utilising an intermediate semantic representation in the form of a vocabulary of visual words, and compared it to two low-level methods: edge direction histograms and invariant moments.
On a set of images gathered from video sequences,
very high classification performance was obtained for
static scenes when no real-time performance restric-
tions were made (97.0% using BoW with dense SIFT
image element descriptors and V = 400). A proposed
real-time system using BoW with GP-DC image el-
ements and V = 16 gave a performance of 92.7%.
For this, several compromises were made to minimise
time and memory consumption. The choice of GP-
DC as descriptor was made due to speed considera-
tions, but since the problem at hand is of limited com-
plexity, GP-DC turned out to provide excellent perfor-
mance, comparable to that of the most complex meth-
ods evaluated. The small vocabulary size was cho-
sen to comply with memory demands, but investiga-
tions showed that the performance converged towards
the maximum for quite small vocabulary sizes (Fig-
ure 4), due to information saturation in the vocabu-
laries. Thus, a very small vocabulary size did not in-
flict serious performance degradation. The quality of
the vocabulary in terms of ability to separate the two
classes was increased notably when floating search
was used to select visual words compared to the com-
monly used k-means clustering. When studying the
misclassified images, many of them (about 30%) were
found to be caused by temporally limited effects such
as passing cars, turns when close to buildings, trees
planted in the city and so on. Thus temporal filtering
of the classification results would increase the general
performance substantially. This is however left as an
issue for further research. Based on this investigation,
we conclude that a road scene classification system
that can be operated during night time at real-time
speed can be constructed to give satisfactory classi-
fication performance.
REFERENCES
Battiato, S., Farinella, G. M., Gallo, G., and Ravì, D.
(2008). Scene categorization using bag of textons on
spatial hierarchy. In ICIP, pages 2536–2539. IEEE.
Bosch, A., Muñoz, X., and Martí, R. (2007). Which is the
best way to organize/classify images by content? Im-
age and Vision Computing, 25(6):778–791.
Bosch, A., Zisserman, A., and Muñoz, X. (2008). Scene
classification using a hybrid generative/discriminative
approach. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(4):712–727.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Devendran, V., Thiagarajan, H., and Santra, A. K. (2007).
Scene categorization using invariant moments and
neural networks. In Proceedings of ICCIMA, vol-
ume 1, pages 164–168.
Forslund, D. (2008). Realtime scene analysis in infrared
images. Master’s thesis, Uppsala University, Sweden.
Hu, M.-K. (1962). Visual pattern recognition by moment
invariants. IRE Transactions on Information Theory,
8(2):179–187.
Lowe, D. (2004). Distinctive image features from scale-
invariant keypoints. Int. Journal of Computer Vision,
60(2):91–110.
Oliva, A. and Torralba, A. (2001). Modeling the shape of
the scene: A holistic representation of the spatial en-
velope. Int. Journal of Computer Vision, 42(3):145–
175.
Oliva, A. and Torralba, A. (2003). Statistics of natural im-
age categories. Network: Computation in Neural Sys-
tems, pages 391–412.
Pudil, P., Ferri, F., Novovicova, J., and Kittler, J. (1994).
Floating search methods for feature selection with
nonmonotonic criterion functions. In ICPR94, pages
279–283.
Quelhas, P., Monay, F., Odobez, J. M., Gatica-Perez, D.,
Tuytelaars, T., and Van Gool, L. (2005). Modeling
scenes with local descriptors and latent aspects. In
Tenth IEEE Int. Conf. on Computer Vision, 2005, vol-
ume 1, pages 883–890.
Sivic, J. and Zisserman, A. (2003). Video google: a text re-
trieval approach to object matching in videos. In Ninth
IEEE Int. Conf. on Computer Vision, 2003, pages
1470–1477.
Szummer, M. and Picard, R. W. (1998). Indoor-outdoor
image classification. In Proceedings of the 1998
Int. Workshop on Content-Based Access of Image and
Video Databases, page 42.
Vailaya, A., Figueiredo, M. A. T., Jain, A. K., and Zhang,
H.-J. (2001). Image classification for content-based
indexing. IEEE Transactions on Image Processing,
10(1):117–130.
Vailaya, A., Jain, A., and Zhang, H. J. (1998). On image
classification: City vs. landscape. In Proceedings of
the IEEE Workshop on Content - Based Access of Im-
age and Video Libraries, pages 3–8.
Walker, L. L. and Malik, J. (2003). When is scene
recognition just texture recognition. Vision Research,
44:2301–2311.