tos of different environments.
As future work, we plan to apply other feature extraction methods, such as the gray-level co-occurrence matrix and SURF, as well as other CNN architectures, to the classification task in order to improve the results. Furthermore, we intend to design a wearable device that incorporates the software.
ACKNOWLEDGEMENTS
The authors would like to thank all the people who contributed images to this dataset, as well as the Brazilian Federal Agency for Support and Evaluation of Graduate Education within the Ministry of Education of Brazil (Capes) and the National Council for Scientific and Technological Development (CNPq) for their financial support of this work.
VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications