Video Object Recognition and Modeling by SIFT Matching

Optimization

Alessandro Bruno, Luca Greco and Marco La Cascia

Dipartimento di Ingegneria Chimica, Gestionale, Informatica, Meccanica,

Università degli studi di Palermo, Palermo, Italy

Keywords: Object Modeling, Video Query, Object Recognition.

Abstract: In this paper we present a novel technique for object modeling and object recognition in video. Given a set

of videos containing 360 degrees views of objects we compute a model for each object, then we analyze

short videos to determine if the object depicted in the video is one of the modeled objects. The object model

is built from a video spanning a 360 degree view of the object taken against a uniform background. In order

to create the object model, the proposed techniques selects a few representative frames from each video and

local features of such frames. The object recognition is performed selecting a few frames from the query

video, extracting local features from each frame and looking for matches in all the representative frames

constituting the models of all the objects. If the number of matches exceed a fixed threshold the

corresponding object is considered the recognized objects .To evaluate our approach we acquired a dataset

of 25 videos representing 25 different objects and used these videos to build the objects model. Then we

took 25 test videos containing only one of the known objects and 5 videos containing only unknown objects.

Experiments showed that, despite a significant compression in the model, recognition results are

satisfactory.

1 INTRODUCTION

The ever-increasing popularity of mobile devices

such as smartphones and digital cameras, enables

new classes of dedicated applications of image

analysis such as mobile visual search, image

cropping, object detection, object recognition, data

representation (object modeling), etc.... Object

modeling and object recognition are two of the most

important issues in the field of computer vision.

Object modeling aims to give a compact and

complete representation of an object. Object models

can be used for many computer vision applications

such as object recognition and object indexing in

large database.

Object recognition is the core problem of

learning visual object categories and visual object

instance. Researchers of computer vision considered

two types of recognition: the specific object case and

the generic category case. In the specific case the

goal is to identify instances of a particular object. In

the generic category case the goal is to recognize

different instances of objects as belonging to the

same conceptual class. In this paper we focused our

attention on the first case (the specific instance of a

particular object). More in details we developed a

new technique for video object recognition and

modeling (data representation).

Matching and learning visual objects is a

challenge on a number of fronts. The instances of

the same object can appear very differently

depending on variables such as illumination

conditions, object pose, camera viewpoint, partial

occlusions, backgroud clutter.

Object recognition is accomplished by finding a

correspondence between certain features of the

image and comparable features of the object model.

The two most important issues that a method must

address are what constitutes a feature, and how is the

correspondence found between image features and

model features. Some methods use global features,

which summarize information about the entire

visible portion of an object, other methods use local

features invariant to affine transforms such as local

keypoints descriptors (Lowe, 2004).

We focus our work on methods that use local

features, such as local keypoints descriptors such as

SIFT.

662

Bruno A., Greco L. and La Cascia M..

Video Object Recognition and Modeling by SIFT Matching Optimization.

DOI: 10.5220/0004828006620670

In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM-2014), pages 662-670

ISBN: 978-989-758-018-5

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

The contributes of our paper are: a new technique

for object modeling; a new method for video object

recognition based on objects matching; a new video

dataset that consists of 360 degree video collection

of thirty objects (CVIPLab, 2013).

We suppose to analyze the case in which a person

take a video of an object with a videocamera and

then wants to know information about the object.

The scenario of our system is to upload the video to

a system able to recognize the video object taken by

the videocamera.

We developed a new model for video objects by

giving a very compact and complete description of

the object. We also developed a new video object

recognition

based on object matching that achieves

very good results in terms of accuracy.

The rest of this papers is organized as follows:

in section 2 we describe the related work of the state

of the arts in object modeling and object recognition;

in section 3 a detailed description of the video object

models dataset is given; in section 4 we describe the

proposed method for object recognition; in section 5

we show the experimental results; the section 6 ends

the paper with some conclusions and future works.

2 RELATED WORKS

In this section we show the most popular method for

object modeling and object recognition with

particular attention to video oriented methods.

2.1 Object Modeling

The most important factors in object retrieval are the

data representation (modeling) and the search

(matching) strategy. In (Li, 1999) the authors use

multiresolution modeling because it preserves

necessary details when they are appropriate at

various scales. Features such as color, texture, shape

are used to build object models, more particularly

GHT (the Generalized Hough Transform) is adopted

over the others shape representations because it is

robust againts noise and occlusion. Moreover it can

be applied hierarchically to describe the object at

multiple resolution.

In recogniton kernel (Li,1996) based method, the

features of an object are extracted at levels that are

the most appropriate to yield only the necessary

details; in (Day,1995) the authors proposed a

graphical data model for specifying spatio-temporal

semantics of video data for object detection and

recognition. The most important information used in

(Chen, 2002) are the relative spatial relationships of

the objects in function of time evolution. The model

is based on capturing the video content in terms of

video objects. The authors differentiate foreground

video objects and background video objects. The

method includes the detection of background video

objects, foreground video objects, static video

objects, moving video objects, motion vectors. In

(Sivic, 2006) Sivic et al. developed an approach to

object retrieval which localizes all the occurrences

of an object in a video. Given a query image of the

object, this is represented by a set of viewpoint

invariant region descriptors.

2.2 Object Recognition

Object recognition is one of the most important issue

in computer vision community. Some works use

video to detect moving objects by motion. In

(Kavitha, 2007), for example, the authors use two

consecutive frames to first estimate motion vectors

and then they perform edge detection using canny

detector. Estimated moving objects are updated with

a watershed based transformation and finally merged

to prevent over-segmentation.

In geometric based approaches (Mundy, 2006)

the main idea is that the geometric description of a

3D object allows the projected shape to be

accurately analyzed in a 2D image under projective

projection, thereby facilitating recognition process

using edge or boundary information.

The most notable appearance-based algorithm is

the eigenface method (Turk, 1991) applied in face

recognition. The underlying idea of this algorithm is

to compute eigenvectors from a set of vectors where

each one represents one face image as a raster scan

vector of gray-scale pixel values. The central idea of

feature-based object recognition algorithms lies in

finding interesting points, often occurring at

intensity discontinuity, that are invariant to change

due to scale, illumination and affine transformation.

Object recognition algorithms based on views or

appearances, are still a hot research topic (Zhao,

2004) (Wang, 2007). In (Pontil,1998)) Pontil et al.

proposed a method that recognize the objects also if

the objects are overlapped. In recognition systems

based on view, the dimensions of the extracted

features may be of several hundreds. After obtaining

the features of 3D object from 2D images, the 3D

object recognition is reduced to a classification

problem and features can be considered from the

perspective of pattern recognition. In (Murase, 1995)

the recognition problem is formulated as one of

appearance matching rather than shape matching.

The appearance of an object depends on its

VideoObjectRecognitionandModelingbySIFTMatchingOptimization

663

shape, reflectance properties, pose in the scene and

the illumination conditions. Shape and reflectance

are intrinsic properties of the object, on the contrary

pose and illumination vary from scene to scene. In

(Murase, 1995) the authors developed a compact

representation of objects, parameterized by object

pose and illumination (parametric eigenspace,

constructed by computing the most prominent

eigenvectors of the set) and the object is represented

as a manifold. The exact position of the projection

on the manifold determines the object's pose in the

image. The authors suppose that the objects in the

image are not occluded by others objects and

therefore can be segmented from the remaining

scene.

In (Lowe, 1999) the author developed an object

recognition system based on SIFT descriptors

(Lowe, 2004), more particularly, the author used

SIFT keypoints and descriptors as input to a nearest-

neighbor indexing method that identifies candidate

object matches. The features of SIFT descriptors are

invariant to image scaling, translation and rotation,

partially invariant to illumination changes and affine

or 3D projection. The SIFT keypoints are used as

input to a nearest-neighbor indexing method, this

identifies candidate object matches.

In (Wu, 2011) the authors analyzed the features

which characterize the difference of similar views to

recognize 3D objects. Principal Component Analysis

(PCA) and Kernel PCA (KPCA) are used to extract

features and then classify the 3D objects with

Support Vector Machine (SVM). The performances

of SVM, tested on Columbia Object Image Library

(COIL-100) have been compared. The best

performance is achieved by SVM with KPCA.

KPCA is used for feature extraction in view-based

3D object recognition.

In (Wu, 2011) different algorithms are shown by

comparing the performances only for four angles of

rotation (10° 20° 45° 90°). Furthermore, the

experimental results are based only on images with

dimensions 128 x 128.

Chang et al. (Chang, 1999) used the color co-

occurrence histogram (that adds geometric

information to the usual color histogram) for

recognizing objects in images. The authors

computed model of color co-occurrence histogram

based on images of known objects taken from

different points of view. The models are then

matched to sub-regions in test images to find the

object. Moreover they developed a mathematical

probabilistic model for adjusting the number of

colors in color co-occurrence histogram.

In (Jinda-Apiraksa, 2013) the focus is on the

problem of near-duplicates (ND), that are similar

images that can be divided in identical (IND) and

non-identical (NIND). IND is formed by

transformed versions of an initial image (i.e. blurred,

cropped, filtered), NIND by pictures containing the

same scene or objects. In this case, the subjectivity

of “how much” two image are similar is a hard

problem to face off. They present a NIND ground

truth derived by asking directly to ten subjects and

they make it available on the web.

A high-speed and high-performance ND retrieval

system is presented in the work of (Dong, 2012).

They use an entropy-based filtering to eliminate

points that can lead to false positive, like those

associated to near-empty regions, and a sketch

representation for filtered descriptors. Then they use

a query expansion method based on graph cut.

Recognizing in video includes the problem of

detection and in same cases tracking of the object.

The paper of (Chau, 2013) is an overview on

tracking algorithms classification where the authors

divide the different approaches in point, appearance

and silhouette tracking.

In our method we use SIFT for obtaining the

object model from multiple views (multiple frames)

of the object in the video. In our method the

recognition of the object is performed by matching

the keypoints of the sampled frames from the video

with the keypoints of the objects models. Similarly

to the method of Peng Chang et al. (Chang, 1999)

we used object modeling for object recognition but

we preferred to extract local features (SIFT) rather

than global features such as the color co-occurrence

histogram.

3 DATASET CREATION AND

OBJECT MODELING

The recognition algorithm is based on a collection of

models built from videos of known objects. To test

the performance of the proposed method we first

constructed a

dataset of videos representing several

objects. Then the modeling method is proposed.

3.1 Dataset

3.1.1 Video Description of the Object

For each object of the dataset the related video

contain a 360 degree view of the object starting from

a frontal position. This is done using a turntable, a

fixed camera and a uniform background. Video

resolution is 1280 x 720p (HD) at 30 fps and the

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

664

lenght is approximately 15 seconds.

3.1.2 Relation with Real Applications

This type of dataset try to simulate a simple video-

acquisition that can be done with a mobile device

(i.e. a smartphone) circumnavigating an object that

have to be added to the known object database. In

real application the resulting video have to be re-

elaborated, for example trying to estimate motion

velocity and jitter. If a video contain a partial view

of the object (i.e. less than 360 degree) recognition

task can be still performed but only for the visible

part of the object.

3.1.3 Image Dataset

The constructed dataset is formed by videos of 25

different objects. As the angular velocity of the

turntable is constant, a subset of 36 frame is sampled

uniformly for each object so extracting views that

differ by 10 degrees of rotation (see fig. 1). So,

starting from the video dataset, an image dataset is

also constructed with these samples containing 900

views of the 25 objects. Although original

background is uniform, shadows, light changes or

camera noise can produce a slightly changing

resulting color. In the extracted views the original

background is segmented and replaced with a real

uniform background (i.e. white) that not produce

SIFT keypoint, so storing only the visual

information about the object (fig. 2).

3.2 Object Modeling

Starting from the image dataset of 900 images a

reduced version is extacted to have, for each object,

only a subset of the initial 36 images representing

the visual model to be used for recognition.

3.2.1 Overview

For each object, the model is extracted as follow:

1. SIFT descriptors and keypoints are

calculated for all views;

2. for each view, only SIFT points that match

with points in previous or next view are used

as view descriptors;

3. the number of point of each view is used as

discrete function and local maxima and

minima are extracted;

4. object model is obtained taking images

corresponding to maxima and minima.

Figure 1: Complete 360 degree view of the video object.

Figure 2: On the right, the video object frame, on the left

the video object, without background.

3.2.2 Maxima and Minima Extraction

Rotating an object by few degrees, most part of the

object that is visible starting the rotation generally is

still visible at the end. This is related to the object

geometry (shape, occluding parts, symmetrics) and

the pattern features (color change, edges).

Calculating SIFT descriptors of two consecutive

views (views that differ by 10 degrees of rotation), it

is expected that a large part of the descriptors will

match.

For each view, if only the keypoints matching

with the previous and the next view are considered

and the others are discarded, the remaining

keypoints are representative of the shared visual

informations in a three images range. Only repeated

and visible points in at least two views are present in

the resulting subset. The number of remaining points

is used like a discrete similarity function and local

maxima and minima are extracted. Taking local

minima of this function, the related images are the

most visually different in their neighborhood, so

these represent views that contain a visual change of

the object. Local maxima, on the other hand,

correspond to pictures that contain common details

in their neighborhood, so being representative of

this. Only views corresponding to local maxima and

minima are used to model the object, so taking the

images that contain “typical” views (maxima) and

visual breaking views (minima) such as in fig. 3. In

Fig. 4 and 6 we plot a curve that shows, for a given

view (x-axis), the number of SIFT points that match

(y-axis) with points in previous or next view.

VideoObjectRecognitionandModelingbySIFTMatchingOptimization

665

The curves shown in fig. 4 and 6 can be

characterized by a lot of local maxima and minima

that could correspond to views that are very close

each other. This would go in the opposite direction

from the objective of our method, that, on the

contrary, aims to represent the object with the fewest

possible views. This is the reason why we also apply

a 'smooth' interpolation function to the curves shown

in fig. 4 and 6. The results of 'smooth' interpolations

are depicted in fig. 5 and 7, showing curves very

close to the original ones (fig, 4 and 6). Furthermore

the curves in fig.5 and 7 have a number of local

maxima and minima lower than the curves in fig. 4

and 6. Since now on we call 'dataset model' the

model of the object that consists of 36 images/views

(that differ by 10 degree of rotation). We call 'full

model' the model that consists of the views that

correspond to local maxima and minima in not

smoothed curves (such as in fig. 4 and 6). We call

'smoothed model' the model that consists of the

views corresponding to local maxima and minima in

smoothed curves (as in fig. 5 and 7). In tab. 1 we

show, for each object, the size of full and smoothed

model and the model compression. The latter is the

ratio between the number of views composing the

current model (i.e. 'full model' or 'smoothed model')

and the number of views composing the 'dataset

model'.

Figure 3: On the left side, Panda Object View corresponds

to a local maxima (0 degree view) of the curve in fig. 4, on

the right side Panda Object View corresponds to a local

minima (110 degrees view) of the curve in fig.4.

4 PROPOSED RECOGNITION

METHOD

Given the dataset and the extracted object models,

we propose a method that performs recognition

using a video as query input. Input query video may

contain or not one of the known objects, the only

hypothesis on the video is that if it contains an

object of the database then the object is almost

always visible in the related video even if subject to

changes on scale and orientation.

4.1 Proposed Method

The proposed recognition follows this steps:

1. extract N frames from video query;

2. match every frame with all components of all

models;

3. counting the number of matching points for all

the views of the models and all frames of the

video, take the maximum value. The object

related to this match is the recognized object, if

the number of matches exceeds a fixed

threshold (10 in our experiments).

4.1.1 Refining Matches

If the models give a complete representation of the

appearance of the object, step two is crucial for

recognition task. Experimental results shows that

results can be corrupted in real-word query because

cluttered background can lead to incorrect or

multiple matches.results can be corrupted in real-

word query because cluttered background can lead

to incorrect or multiple matches. To make a more

robust matching phase, it is important to exclude

these noisy points. This can be done using RANSAC

Figure 4: The chart of matching keypoints for all views of

Panda Object. Yellow circles are local maxima and

minima.

Figure 5: The smoothed chart for Panda Object (blue line).

Yellow stars are local maxima and minima. Red dash line

is the original chart.

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

666

Figure 6: The chart of matching keypoints for all views of

Tour Eiffel Object. Yellow circles are local maxima and

minima.

(Fischler, 1981) in the matching operation, to

exclude points that don’t fit an homography

transformation (fig. 8). Furthermore, considering

that a single keypoint of an image can have more

than one match with keypoints of the second image,

we consider multiple matches of the same keypoint

as a single match.

Figure 7: The smoothed chart for Tour Eiffel Object.

Yellow stars are local maxima and minima. Red dash line

is the original chart.

5 RESULTS

Object recognition using video and image dataset

was done using the MATLAB implementation of

SIFT present in (Vedaldi, 2010) and RANSAC

implementation present in (Kovesi, 2003) following

the process described in section 4.1. To achieve

matches with less but more robust points the

Table 1: In this table, experimental results and statistical values about the video object modeling are shown: object id,

object name, the number of the views composing the object 'full model' and the object 'smoothed model', the compression

factor (i.e the ratio between the number of object model views and the number of all the object views in the dataset).

obj. ID

name full model compression smoothed model compression

1 Dancer 14 38.89% 10 27.78%

2 Bible 15 41.67% 9 25.00%

3 Beer 7 19.44% 5 13.89%

4 Cipster 12 33.33% 5 13.89%

5 Tour Eiffel 17 47.22% 10 27.78%

6 Energy Drink 17 47.22% 7 19.44%

7 Paper tissue 13 36.11% 13 36.11%

8 Digital camera 13 36.11% 7 19.44%

9 iPhone 13 36.11% 9 25.00%

10 Statue of Liberty 17 47.22% 11 30.56%

11 Motorcycle 9 25.00% 7 19.44%

12 Nutella 19 52.78% 9 25.00%

13 Sunglasses 23 63.89% 15 41.67%

14 Watch 16 44.44% 9 25.00%

15 Panda 15 41.67% 7 19.44%

16 Cactus 17 47.22% 11 30.56%

17 Plastic plant 19 52.78% 9 25.00%

18 Bottle of perfume 13 36.11% 5 13.89%

19 Shaving foam 10 27.78% 8 22.22%

20 Canned meat 20 55.56% 9 25.00%

21 Alarm clock (black) 15 41.67% 11 30.56%

22 Alarm clock (red) 15 41.67% 8 22.22%

23 Coffee cup 20 55.56% 11 30.56%

24 Cordless phone 15 41.67% 7 19.44%

25 Tuna can 17 47.22% 7 19.44%

Tot. 381 219

Mean Value 15.24 42.33% 8.76 24.33%

VideoObjectRecognitionandModelingbySIFTMatchingOptimization

667

Figure 8: The images show the matches with (lower) and

without (upper) RANSAC.

threshold of the match function used was 2 instead

of the default 1.5 value. The difference of the

resulting number of points can be seen in fig. 9. The

proposed method was tested with 30 different

videos. Each video contains one of the known object

except five videos that contain unknown objects.

Query videos have an average length of 4 seconds

and the first step of the method is performed with a

uniform frame sampling rate fixing N (the number

of the selected frames per video) at 4 (so

approximately one frame for second). In fig. 10

best match number is shown with relationship to the

number of experiments (step 3). In step 3 the

selection of an appropriate threshold (10) is

performed by statistical analysis of the correct

match. The chart in fig. 10 shows that the best

matches, for each object, are distributed into two

major groups. In tab.2 recognition correctness

results are shown for each test video query,

including the original id and name for the present

object (or NO OBJ# for unknown object). Total

recognition performance is shown in tab. 3, with an

average precision of the system of 83%. The number

of matches performed is 291, so only 24% of the full

dataset dimension of 900. In fig. 8 an example of

correct recognition is shown. Fig. 11 shows the

matches for an unrecognized object (dancer) and for

a correct not recognition of unknown object.

Figure 9: Matching results with different thresholds: 2

(lower) and default value, 1.5 (upper).

Table 2: Video object recognition correctness results.

obj. ID name result

1 Dancer uncorrect

2 Bible correct

3 Beer correct

4 Cipster correct

5 Tour Eiffel correct

6 Energy Drink correct

7 Paper tissue correct

8 Digital camera correct

9 iPhone correct

10 Statue of Liberty correct

11 Motorcycle correct

12 Nutella correct

13 Sunglasses uncorrect

14 Watch correct

15 Panda correct

16 Cactus uncorrect

17 Plastic plant uncorrect

18 Bottle of perfume correct

19 Shaving foam correct

20 Canned meat correct

21 Alarm clock (black) correct

22 Alarm clock (red) correct

23 Coffee cup uncorrect

24 Cordless phone correct

25 Tuna can correct

NO OBJ1 correct

NO OBJ2 correct

NO OBJ3 correct

NO OBJ4 correct

NO OBJ5 correct

Table 3: The precision of video object recognition system.

Testset size correct uncorrect precision

30 25 5 83.33%

Figure 10: The best matches results for each object and for

unknow objects (NO OBJ).

ACKNOWLEDGEMENTS

The authors wish to acknowledge Christian Caruso

for helping us in the implementation and

experimental phases.

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

668

6 CONCLUSIONS AND FUTURE

WORKS

In this paper we proposed a new method for video

object recognition based on video object models.

The results of video object recognition, in terms of

accuracy are very encouraging (83%). We created a

video dataset of 25 video object, it consists of 360

degree-views of the objects. From the video dataset

an image dataset is also constructed by sampling the

video frames. It contains 900 views of the 25

objects. Our method for object modeling gives, as

result, a compact and complete representation of the

objects, it achieves almost 76% data compression of

the models. With regard to object recognition

method, one of the possible improvement is to refine

the selection of the frames for the query in the

objects models database. Given a video, the camera

motion could be estimated and the frame samples

extracted according to motion, for example trying to

get a frame every fixed angular displacing. Best

results should be reached using a sampling rate that

approximate the rate used in the dataset creation. If

the video is long enough to have a high number of

selected frames, the same modeling process could be

used in the query to increase time performance of

the recognition, preserving the accuracy taking only

the most relevant views.

Figure 11: Two examples of results: a false negative (the

dancer) and a true negative (unknown object).

REFERENCES

Li, Z. N., Zaiane, O. R., Tauber, Z., 1999. Illumination

Invariance and Object Model Content-Based Image

and Video Retrieval. In Journal of Visual

Communication and Image Representation, vol 10, pp

219-224.

Z. Li and B. Yan., 1996 Recognition Kernel for content-

based search. In Proc. IEEE Conf. on Systems, Man,

and Cybernetics, pages 472-477.

Day, Y. F., Dagtas, S., Iino, M., Khokhar, A., Ghafoor, A.,

1995. Object-oriented conceptual modeling of video

data. In Proceedings of the Eleventh International

Conference on Data Engineering.

Chen, L., Ozsu, M. T., 2002. Modeling of video objects in

a video databases. In Proceedings of IEEE

International Conference on Multimedia and Expo.

Sivic, J., Zisserman, A., 2006. Video Google: Efficient

visual search of videos. In Toward Category-Level

Object Recognition, pp. 127-144, Springer.

Vedaldi, A., Fulkerson, B., 2010. VLFeat: An open and

portable library of computer vision algorithms. In

Proceedings of the International Conference on

Multimedia.

Kavitha, G., Chandra, M. D., Shanmugan, J., 2007. Video

Object Extraction Using Model Matching Technique:

A Novel Approach. In 14th IWSSIP, 2007 and 6th

EURASIP Conference focused on Speech and Image

Processing, Multimedia Communications and

Services, pp. 118-121.

Mundy, Joseph L. 2006. Object recognition in the

geometric era: A retrospective. Toward category-level

object recognition. pp.3-28.

Lowe, D.G., 2004. Distinctive Image Features from Scale-

Invariant Keypoints, In International Journal of

Computer Vision n. 60 vol.2 pp. 91-110, Springer.

Turk, M., Pentland, A., 1991. Eigenfaces for recognition.

In Journal of cognitive neuroscience vol.3, n.1, pp. 71-

86, MIT press.

Zhao, L. W., Luo, S. W., Liao, L. Z., 2004. 3D object

recognition and pose estimation using kernel PCA. In

Proceedings of 2004 International Conference on

Machine Learning and Cybernetics.

Wang, X. Z., Zhang, S. F., Li, J., 2007. View-based 3D

object recognition using wavelet multiscale singular-

value decomposition and support vector machine. In

ICWAPR.

Pontil, M., Verri, A., 1998. Support vector machines for

3D object recognition. In IEEE Transactions on

Pattern Analysis and Machine Intelligence, vol.20 n.6,

pp. 637-646.

Murase, H., Nayar, S. K., 1995. Visual learning and

recognition of 3-D objects from appearance. In

International journal of computer vision, vol.14 n.1,

pp. 5-24. Springer.

Lowe, D. G., 1999. Object recognition from local scale-

invariant features. In . The proceedings of the seventh

IEEE international conference on Computer vision.

Chang, P., Krumm, J., 1999. Object recognition with color

cooccurrence histograms. In IEEE Computer Society

Conference on Computer Vision and Pattern

Recognition.

Wu, Y. J., Wang, X. M., Shang, F. H., 2011. Study on 3D

Object Recognition Based on KPCA-SVM. In

International Conference on Information and

Intelligent Computing, vol.18 pp. 55-60. IACSIT

Press, Singapore.

Fischler, Martin A and Bolles, Robert C.,1981. Random

sample consensus: a paradigm for model fitting with

applications to image analysis and automated

cartography In Communications of the ACM, vol. 24,

num.6, pp. 381–395.

VideoObjectRecognitionandModelingbySIFTMatchingOptimization

669

Kovesi, P., 2003. MATLAB and Octave Functions for

Computer Vision and Image Processing. [online]

Available at: <http://www.csse.uwa.edu.au/~pk>

[Accessed September 2013]

Jinda-Apiraksa, A., Vonikakis, V., Winkler, S., 2013.

California-ND: An annotated dataset for near-

duplicate detection in personal photo collections. In

Proceedings of 5th International Workshop on Quality

of Multimedia Experience (QoMEX), Klagenfurt,

Austria.

CVIPLab, 2013. Computer Vision & Image Processing

Lab, Università degli studi di Palermo Available at:

<https://www.dropbox.com/sh/sqkq03tsembdu4m/N1

mCVCFxGQ>

Dong, W., Wang, Z., Charikar, M., Li, K., 2012. High-

confidence near-duplicate image detection. In

Proceedings of the 2nd ACM International Conference

on Multimedia Retrieval.

Chau, D. P., Bremond, F., Thonnat, M., 2013. Object

Tracking in Videos: Approaches and Issues. arXiv

preprint arXiv:1304.5212.

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

670