ularies compared to text vocabularies has been proposed in order to clarify the conditions for applying text techniques to visual words. To present this study, the author described four methods for building a visual vocabulary from two image collections (Caltech-101 and Pascal), based on two low-level descriptors (SIFT and SURF) combined with two clustering algorithms: K-means and SOM (Self-Organizing Maps). The experiments showed that visual word distributions highly depend on the clustering method (Martinet, 2014).
In addition, ontology-based image retrieval approaches have been proposed in order to extract visual information guided by its semantic content (Hyvönen et al., 2003; Sarwara et al., 2013).
In (Kurtz and Rubin, 2014), a novel approach based on the semantic proximity of image content using relationships has been proposed. This method is composed of two steps: 1) annotating the query image with semantic terms extracted from an ontology and constructing a term vector modeling this image, and 2) comparing this query image to previously annotated images using a distance between term vectors that is coupled with an ontological measure.
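The coupling of a term-vector comparison with an ontological measure can be illustrated with a minimal sketch. The terms and similarity values below are hypothetical stand-ins, not taken from (Kurtz and Rubin, 2014); the sketch uses a soft-cosine-style similarity in which a term-term matrix derived from an ontology rewards pairs of distinct but ontologically close terms:

```python
import numpy as np

# Hypothetical term-term similarity matrix derived from an ontology
# (1.0 on the diagonal; off-diagonal values reflect ontological proximity).
terms = ["nodule", "mass", "cyst"]
S = np.array([
    [1.0, 0.7, 0.2],   # "nodule" assumed ontologically close to "mass"
    [0.7, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

def ontological_similarity(u, v, S):
    """Soft-cosine similarity: couples the term-vector comparison with
    pairwise term similarities taken from the ontology."""
    num = u @ S @ v
    den = np.sqrt(u @ S @ u) * np.sqrt(v @ S @ v)
    return num / den

query = np.array([1.0, 0.0, 0.0])   # query image annotated with "nodule"
other = np.array([0.0, 1.0, 0.0])   # candidate image annotated with "mass"

print(ontological_similarity(query, other, S))  # 0.7, whereas plain cosine is 0.0
```

With a plain cosine distance these two annotations would be orthogonal; the ontological coupling is what makes the candidate image retrievable.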
In the context of image retrieval based on visual words, once low-level features are extracted, the resulting visual words are gathered using only their appearance similarity in the clustering step. Consequently, similar visual words do not guarantee semantically similar meaning, which tends to reduce retrieval effectiveness with respect to the user. Moreover, in the interest point detection step, many detectors can lose some interest points and increase the vector quantization noise. This can result in a poor visual vocabulary that decreases search performance.
Our motivation is to build a visual vocabulary and ontologies based on image annotations in order to enhance image retrieval accuracy. The goal is to introduce an image retrieval system which integrates two image aspects: visual features and semantic contents based on image annotations. Our idea is to combine, during the image retrieval process, similarity between visual words with semantic similarity.
Moreover, we evaluate our proposal through two image retrieval strategies:
• A visual retrieval strategy based on visual similarity between visual words;
• A strategy integrating both visual and semantic similarities. In this case, semantic similarity is based on concepts provided by ontologies.
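The second strategy can be sketched as a score fusion. The linear weighting below and the mixing parameter alpha are assumptions for illustration, not the system's specified fusion rule:

```python
def combined_score(visual_sim, semantic_sim, alpha=0.5):
    """Linear fusion of visual-word similarity and ontology-based
    semantic similarity; alpha is an assumed mixing weight."""
    return alpha * visual_sim + (1.0 - alpha) * semantic_sim

# Hypothetical candidate images with precomputed (visual, semantic)
# similarities to a query image.
candidates = {
    "img_a": (0.9, 0.2),  # visually close, semantically distant
    "img_b": (0.5, 0.8),  # visually weaker, semantically close
}
ranked = sorted(candidates,
                key=lambda k: combined_score(*candidates[k], alpha=0.4),
                reverse=True)
print(ranked)  # ['img_b', 'img_a']
```

With alpha below 0.5 the semantically close image outranks the visually closest one, which is the behavior the second strategy aims for.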
3 VISUAL VOCABULARY AND
ONTOLOGIES-BASED IMAGE
RETRIEVAL SYSTEM
In this section, we define the architecture of our visual vocabulary and ontologies-based image retrieval system. Our idea is to build a visual vocabulary using low-level features and to build ontologies based on concepts extracted from image annotations.
As depicted in Figure 1, our image retrieval system is composed of two main phases: an offline phase and an online phase. The offline phase, which corresponds to the visual vocabulary and ontologies' building phase, is composed of two steps: (1) building the visual vocabulary and (2) building the ontology. The online phase, which corresponds to the image retrieval phase, is composed of two steps: (1) query image processing and (2) image retrieval.
In the following sections, the different steps of our image retrieval system are detailed.
3.1 Offline Phase: Visual Vocabulary
and Ontologies’ Building
Our main idea is to develop an image retrieval system
based on building the visual vocabulary and ontolo-
gies.
3.1.1 Building the Visual Vocabulary
This step generates the visual vocabulary in three sub-steps: interest point detection, descriptor computation, and clustering.
Interest Points Detection: Many interest point detectors have been developed in computer vision. In order to produce an effective vocabulary, we have used the SIFT detector to extract local interest points because, using this detector, a large number of interest points can be extracted from images.
Computing Descriptors (or Feature Extraction): This step consists in extracting features by computing a SIFT descriptor for each point detected in the previous step.
Clustering: This step consists in clustering the local descriptors computed in the previous step. The goal is to represent each feature by the centroid of the cluster it belongs to. In our case, we have used the K-means algorithm, which is the most widely used clustering algorithm for visual vocabulary generation.
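The clustering step above can be sketched as follows. Real SIFT descriptors are 128-dimensional vectors extracted at detected interest points; here random stand-in descriptors replace them, and a minimal K-means (written out rather than taken from a library) produces the centroids that serve as visual words:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means: returns the centroids (the visual vocabulary)
    and the visual-word index assigned to each descriptor."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Stand-in for SIFT descriptors (one 128-D vector per interest point).
rng = np.random.default_rng(42)
descriptors = rng.normal(size=(200, 128))

vocabulary, words = kmeans(descriptors, k=8)
print(vocabulary.shape)   # (8, 128): 8 visual words
print(words.shape)        # (200,): one word index per descriptor
```

Each descriptor is thereby represented by the centroid of the cluster it belongs to, as described above; the vocabulary size k is an assumed toy value, and in practice it is much larger.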
Towards Visual Vocabulary and Ontology-based Image Retrieval System
561