Next Generation TV through Automatic Multimedia Annotation Systems

A Hybrid Approach

Joël Dumoulin, Marco Bertini, Alberto Del Bimbo, Elena Mugellini, Omar Abou Khaled and Maria Sokhn

Department of Information Technologies, HES-SO, Fribourg, Switzerland

Media Integration and Communication Center (MICC), University of Florence, Florence, Italy

Keywords:

Automatic Video Annotation, High Level Content Annotation, Multimedia Information Retrieval, Social Video Retrieval, Tag Suggestion, User-generated Content, Smart TV.

Abstract:

After the advent of smartphones, it is time for television to undergo its next big evolution and become smart TV. But to provide a richer television user experience, multimedia content first has to be enriched. In recent years, the evolution of technology has made it easier to capture and store multimedia assets such as photographs and videos. This makes multimedia resources harder to retrieve, mainly because annotation systems and search engines lack methods that handle non-textual features. Moreover, multimedia sharing websites like Flickr or YouTube, together with the information provided by Wikipedia, offer a tremendous source of knowledge worth exploring. In this position paper, we address the automatic multimedia annotation problem by proposing a hybrid system approach. We want to use unsupervised methods to find relationships between multimedia elements, referred to as hidden topics, and then take advantage of social knowledge to label these relationships. The resulting enriched multimedia content will open new user experience possibilities for next generation television, for instance the creation of recommender systems that merge this information with user profiles and behavior analysis.

1 INTRODUCTION

Technology is changing television practices through the integration of computing capabilities and Internet connectivity. Despite the temporary setback Google encountered with Google TV, the effort the company has put in this direction suggests that this is the future of television. The increasing integration of television and computer technology can move traditional television applications towards richer, more sociable, computer-mediated user experiences. In particular, we believe that traditional television programs can be enriched by exploiting multimedia content available on the Internet. But this requires improving such multimedia content by developing new automatic annotation techniques that provide richer annotations than existing methods and reduce the so-called “semantic gap” (Smeulders et al., 2000).

Google TV: http://www.google.com/tv/

In the last few years, the number of video and image archives has dramatically increased. As a result, it has become very hard to search and retrieve images or videos effectively, and one reason is the lack of the semantic, content-based multimedia indexes needed for such retrieval (Tjondronegoro et al., 2005) (Yin et al., 2009). Research on multimedia content management systems has therefore become an important area (Lew et al., 2006). Progress in the field of linked data (Bizer, 2009) and semantic technologies has greatly helped to manage multimedia data efficiently, by providing tools for describing and linking concepts and data (Bertini et al., 2010) (Dong and Li, 2006) (Akrivas et al., 2007). In parallel, progress has also been made in computer vision tools, which can analyze and process huge amounts of multimedia data in a reduced amount of time (Smeaton et al., 2009); most of these solutions have exploited the bag-of-features approach to generate textual labels that represent the categories of the main and easiest to detect entities (objects and persons) in the video sequence keyframes. The TRECVid benchmark has shown a steady improvement in the performance of appropriately trained classifiers (Snoek et al., 2006) (Hauptmann et al., 2008), but despite these advances, automatic multimedia content annotation is still a problematic task. Moreover, the “semantic gap”, defined by Smeulders et al. (2000) as the lack of coincidence between the information that can be extracted from visual data and the interpretation that the same data have for a user in a given situation, is still an unsolved problem.

Dumoulin J., Bertini M., Del Bimbo A., Mugellini E., Abou Khaled O. and Sokhn M. Next Generation TV through Automatic Multimedia Annotation Systems - A Hybrid Approach. DOI: 10.5220/0004128101920197. In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems (SIGMAP-2012), pages 192-197. ISBN: 978-989-8565-25-9. © 2012 SCITEPRESS (Science and Technology Publications, Lda.)

On the one hand, statistical tools based on computer vision techniques provide an automatic means to process large-scale multimedia databases, but they lack the knowledge that would be required to enrich the resulting elements with additional information and relations, so human intervention is still required. On the other hand, social networks and ontologies, through manual annotation and social cooperation, convey a tremendous and largely unused amount of knowledge, although it is still difficult to exploit it profitably. One idea is to merge both domains through a hybrid automatic multimedia annotation system, in order to enrich multimedia content and bring new possibilities to next generation television, such as interactivity, personalisation and sharing.

In this paper we propose a hybrid approach for multimedia annotation systems that merges automatic processing with data stemming from manual annotation, and that could improve the next generation television experience. The rest of the paper is organized as follows. A brief state of the art of the multimedia annotation domain is given in Section 2. The proposed hybrid system approach is discussed in Section 3, and its application to the next generation television domain is proposed in Section 4. Finally, conclusions are drawn in Section 5.

2 MULTIMEDIA ANNOTATION

Two main visual tagging approaches are used to enrich or annotate multimedia content: manual annotation and automatic processing. Manual annotation, done by experts, is the most accurate and provides high-level concept annotation. But the time needed for such a task is huge, which makes it far too expensive. In recent years the tagging of images (e.g. Flickr) and videos (e.g. YouTube) on social media has become extremely common, resulting in a new source of human labeling, which can be more or less structured (ontology-based approaches such as FOAF vs collaborative systems such as Wikipedia). Human labeling is closer to users, resulting in a more precise understanding of the image (or video sequence). But when it comes to processing huge quantities of documents, automatic processing techniques are essential.

FOAF (Friend Of A Friend ontology): http://www.foaf-project.org/

Image classification methods currently considered state of the art require extensive training (e.g. SVM (Meyer, 2001), boosting (Schapire, 2003)) and do not scale well with an increasing number of concepts to be recognized. These concepts also have to be determined in advance, typically taken from a standard lexicon such as LSCOM or created ad hoc for the problem, while the interests of people may change over time or with the emergence of a new event or interest.

Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) (Hu, 2009) are statistical text modeling techniques that allow automated indexing of text documents (Cai et al., 2008) (Nguyen et al., 2009) (Phan et al., 2008). By defining a visual analogue of a word (a visual word) (Sivic and Zisserman, 2003), these techniques have recently been successfully applied to image content analysis tasks (both on pictures and videos), such as scene classification (Bosch and Zisserman, 2006), multimedia content retrieval (Hörster et al., 2007) (Lienhart and Slaney, 2007) or object localization in images (Andreetto et al., 2008) (Wang and Grimson, 2007). For multimedia content retrieval, these models can provide a low-dimensional descriptor, enabling efficient retrieval in very large databases (Hörster et al., 2007).
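To make the notion of a visual word concrete, the following minimal Python sketch builds a bag-of-visual-words histogram by assigning each local descriptor to its nearest codebook entry. The 2-D descriptors, the three-entry codebook and the function names are illustrative assumptions, not part of the cited systems, which typically rely on high-dimensional local descriptors and much larger codebooks learned by clustering.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        # Squared Euclidean distance is enough to find the nearest entry.
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def visual_word_histogram(descriptors, codebook):
    """Count how many local descriptors are assigned to each visual word."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_codeword(descriptor, codebook)] += 1
    return histogram


# Hypothetical toy data: a 3-word codebook and 5 local descriptors of one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8), (5.3, 4.9)]

histogram = visual_word_histogram(descriptors, codebook)
print(histogram)  # [1, 2, 2]
```

The resulting histogram plays the role of a word-count vector for a text document, which is what allows text models such as pLSA or LDA to be applied to images and keyframes.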

Several recent approaches try to bring manual annotation and automatic processing closer together, using social tags to help in the creation of supervised classifiers, typically to provide negative examples of the concepts to be trained (Setz and Snoek, 2009) (Li and Snoek, 2009), or to extend more easily the lexicon of concept detectors using user-generated videos as training material (Ulges et al., 2010). Nearest-neighbor methods have received considerable attention due to the increase of training data available in social media and to the appeal of using an unsupervised approach, e.g. to suggest tags for images (Li et al., 2009) based on their visual similarity with a visual neighborhood. A baseline approach that uses nearest-neighbor (NN) images to annotate new images was proposed in (Makadia et al., 2008), showing how a fusion of different features is fundamental for achieving the best performance, while a weighted neighborhood of keywords is used for tag propagation in (Guillaumin et al., 2009). One interesting approach (Tsai et al., 2011) is based on the use of “visual synsets”, clusters of visually similar images associated with sets of weighted tags that are semantically similar. Ballan et al. (2010) have proposed a system for video tag suggestion and temporal localization, based on the similarity of keyframes with images that have been manually annotated with the same tags on Flickr. In this approach, the tags of a video are used to download images from Flickr (annotated with these words), and these images are then used to evaluate the visual similarity with the keyframes of the video, in order to locate the visual concepts over time. Most of these methods rely on global and compact visual features and CBIR techniques to speed up the computation of visual neighborhoods.

LSCOM (Large Scale Concept Ontology For Multimedia): http://www.lscom.org/
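The nearest-neighbor idea can be sketched in a few lines of Python. This is a deliberately simplified plain-voting variant, assuming each image is already described by a small global feature vector; it is not the exact algorithm of any of the cited papers (for instance, it ignores tag priors and learned neighbor weights). The data and the names `suggest_tags`, `k` and `top` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def suggest_tags(query, annotated, k=3, top=2):
    """Suggest tags for `query` by voting among its k visually nearest neighbors."""
    neighbors = sorted(annotated, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter()
    for _, tags in neighbors:
        votes.update(set(tags))  # each neighbor votes at most once per tag
    # Rank by vote count, then alphabetically so the result is deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top]]


# Hypothetical socially tagged images: (global feature vector, user tags).
annotated = [
    ((0.90, 0.10), ["beach", "sea"]),
    ((0.80, 0.20), ["sea", "sunset"]),
    ((0.85, 0.15), ["beach", "sea", "holiday"]),
    ((0.10, 0.90), ["city", "night"]),
    ((0.20, 0.80), ["city", "street"]),
]

suggested = suggest_tags((0.88, 0.12), annotated, k=3, top=2)
print(suggested)  # ['sea', 'beach']
```

No classifier is trained here: adding newly tagged images to `annotated` immediately changes the suggestions, which is the main appeal of such methods when the tag vocabulary evolves.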

3 HYBRID APPROACH

Our goal is to enhance the television user experience by associating richly annotated multimedia resources, in an interactive and personalized way, with the program the user is watching. For this purpose, we propose a hybrid multimedia annotation system. As said before, on the one hand, statistical tools are able to process multimedia content quite efficiently, but do not scale well in terms of the size of, and changes in, the lexicon of concepts to be detected. On the other hand, a huge amount of user-generated knowledge is available through social networks and ontologies. We want to take advantage of both by merging them, as schematized in Figure 1.

Figure 1: Hybrid annotation system: hidden concepts - social networks and ontologies.

The association between visual topics and textual topics (Monay and Gatica-Perez, 2004) is still an unsolved problem (e.g. due to the sparsity of tags (Qi et al., 2012)). Solving it would serve our purpose: taking advantage of the knowledge available in both social networks and ontologies by associating the textual latent topics with the visual latent topics. The latter could also be used to reduce the dimensionality of the descriptors used in the bag-of-visual-words approach, e.g. by describing visual content with a histogram of topics instead of a larger histogram of visual words.
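As an illustration of how a histogram of topics can replace a histogram of visual words, the sketch below fits a small pLSA model with the EM algorithm on a toy document-by-visual-word count matrix. It is a minimal, unoptimized sketch under stated assumptions: the count matrix, the number of topics, the iteration count and the function name `plsa` are all hypothetical, and a real system would use a dedicated library and far larger vocabularies.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a document-by-word count matrix (list of lists).

    Returns (p_z_d, p_w_z): per-document topic mixtures P(z|d) and
    per-topic word distributions P(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalized(row):
        total = sum(row)
        return [value / total for value in row]

    # Random positive initialization of both sets of distributions.
    p_z_d = [normalized([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalized([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    eps = 1e-12  # avoids division by zero for empty documents or topics
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    expected = counts[d][w] * joint[z] / total
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts into distributions.
        p_z_d = [normalized([v + eps for v in row]) for row in new_z_d]
        p_w_z = [normalized([v + eps for v in row]) for row in new_w_z]
    return p_z_d, p_w_z


# Hypothetical visual-word counts: 4 keyframes described over 6 visual words.
counts = [
    [4, 3, 5, 0, 0, 0],
    [5, 4, 3, 0, 0, 0],
    [0, 0, 0, 4, 5, 3],
    [0, 0, 0, 3, 4, 5],
]

topic_histograms, topic_words = plsa(counts, n_topics=2)
for row in topic_histograms:
    print([round(value, 3) for value in row])
```

Each keyframe is then described by `n_topics` values instead of one value per visual word. In this toy case that means 2 numbers rather than 6; with codebooks of thousands of words the reduction is correspondingly larger.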

3.1 Unsupervised Techniques for Hidden Topics Analysis

Applied to a pool of multimedia data, unsupervised techniques such as multi-modal LDA extract relationships between the different multimedia items.

The idea of using such graphical models to discover hidden topics belongs to the domain of automatic video analysis based on unsupervised methods. For now, these methods are less accurate than semi-supervised or manual ones. But with the exponential growth of multimedia content (e.g. YouTube reported in May 2012 that more than 72 hours of video are uploaded every minute), the use of unsupervised methods appears essential to us, and enhancing these methods is therefore a critical task. Automatic techniques also perform tasks much faster than a person would with a manual system, which is important if we are one day to have real-time video analysis systems.
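A short back-of-the-envelope calculation shows the scale implied by the upload figure quoted above. The 72 hours per minute come from the text; the 8-hour shift and the assumption that a human annotator works through video at playback speed are hypothetical simplifications introduced only for this illustration.

```python
import math

UPLOAD_HOURS_PER_MINUTE = 72  # figure quoted in the text (YouTube, May 2012)


def required_speedup(upload_hours_per_minute):
    """Aggregate processing speed, as a multiple of real time, needed to keep pace."""
    # 72 hours of video arriving per minute is 72 * 60 minutes of video per minute.
    return upload_hours_per_minute * 60


def hours_uploaded_per_day(upload_hours_per_minute):
    """Total hours of video uploaded in one day at a constant rate."""
    return upload_hours_per_minute * 60 * 24


def annotators_needed(daily_hours, shift_hours=8):
    """Annotators required per day if each one watches video in real time (assumption)."""
    return math.ceil(daily_hours / shift_hours)


speedup = required_speedup(UPLOAD_HOURS_PER_MINUTE)
daily_hours = hours_uploaded_per_day(UPLOAD_HOURS_PER_MINUTE)
annotators = annotators_needed(daily_hours)

print(speedup)      # 4320
print(daily_hours)  # 103680
print(annotators)   # 12960
```

Under these assumptions, merely watching one day of uploads once would occupy about 13,000 annotators, and an automatic system would need an aggregate throughput of more than 4,000 times real time to keep up.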

3.2 Social Annotations and Knowledge

In recent years, social sites like Flickr or YouTube have seen a radical increase in the number of users, which has also brought an increasing amount of annotated multimedia content. In most cases, people add labels (also called tags) to their content, forming what is called a folksonomy and providing a huge amount of contextual and semantic information. Tags are mainly used by users to organize and access content. Although these tags are often overly personalized, imprecise and ambiguous (Kennedy et al., 2006), they convey rich information and can be very useful. Indeed, as presented in (Ballan et al., 2010), it is possible to use existing social annotations to build tag suggestion systems for video. Moreover, using visual similarity between frames, it is also possible to localize the suggested tags in time.
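The temporal localization idea can be sketched as follows: for each tag, keyframes are compared with example images carrying that tag, and the tag is attached to the timestamps whose keyframes are similar enough. This is only a rough sketch of the principle, not the method of (Ballan et al., 2010); the 3-D feature vectors, the cosine similarity measure, the threshold value and the function names are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def localize_tag(keyframes, tag_images, threshold=0.9):
    """Return the timestamps whose keyframe resembles at least one image of the tag."""
    hits = []
    for timestamp, features in keyframes:
        best = max((cosine(features, image) for image in tag_images), default=0.0)
        if best >= threshold:
            hits.append(timestamp)
    return hits


# Hypothetical keyframes: (timestamp in seconds, global visual features).
keyframes = [
    (0, (1.0, 0.0, 0.0)),
    (10, (0.9, 0.1, 0.0)),
    (20, (0.0, 1.0, 0.0)),
    (30, (0.0, 0.2, 1.0)),
]

# Hypothetical features of socially tagged images retrieved for each video tag.
tag_images = {
    "beach": [(1.0, 0.05, 0.0), (0.8, 0.0, 0.1)],
    "city": [(0.0, 1.0, 0.1)],
}

timeline = {tag: localize_tag(keyframes, images) for tag, images in tag_images.items()}
print(timeline)  # {'beach': [0, 10], 'city': [20]}
```

The keyframe at 30 seconds matches neither tag, which illustrates why a video-level tag does not necessarily apply to every segment of the video.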

While social annotations provide a good knowledge base, they lack structure. Ontologies could be another way to enrich the manual annotation approach, since the semantic relationships represented between concepts can bring new kinds of information. Social annotations and ontologies could each be used separately, but the idea of a folksontology (Damme et al., 2007) could be an interesting direction.

SIGMAP2012-InternationalConferenceonSignalProcessingandMultimediaApplications

194

4 THE NEXT GENERATION TELEVISION

While much work has gone into enhancing the graphical user interfaces of our computers over the last decades, this has not been the case for television, so the TV user experience has not really evolved. Indeed, until recently, the main efforts went into image and device quality. In parallel, the number of available channels and features has increased, making the experience less user friendly. With the advent of good-quality Internet connections, we have seen the appearance of connected TV, which allows users to browse the Web, use applications, or play small embedded games directly on the TV. But although this is the main improvement brought to TV, it is not really convenient, mainly because of the use of a standard remote control (Cooper, 2008), and few people use such features. This could explain the limited success of Google TV or Apple TV. But there is little doubt that it is just a question of timing, and that similar products will meet success very soon.

Figure 2 shows a possible application schema in which our proposed hybrid automatic multimedia annotation system, by enriching multimedia content with additional meta-information (resulting in a “multimedia content ++”), could enable various improvements of the TV experience.

Figure 2: Application: 1) Automatic multimedia content enrichment system; 2) TV User interaction.

Indeed, we plan to take advantage of this “multimedia content ++”, for example to improve navigation through the TV guide, but also within the program content itself. Additional information could be made available to users according to what they are watching, taking their interests into account. By providing new ways for communities to interact, we think that social aspects could take on new dimensions, also helping to build recommender or collaborative filtering systems, for example.

Smart TV: http://en.wikipedia.org/wiki/Smart_TV; Intel & Smart TV: http://www.intel.com/content/www/us/en/smart-tv/smart-tv-with-intel-inside.html

Actually, a recommender system based on social information could represent an interesting starting point in our vision of TV user experience enhancement. Although recommender systems are becoming more and more popular (e.g. Last.fm or Pandora for music, MovieLens or Netflix for movies), mainly because of the ever-increasing user-generated content on multimedia sharing websites like YouTube, they are generally not able to meet user needs perfectly. We think that a system that dynamically improves its understanding of the user's needs as it is used over time, and that requires the least amount of explicit user interaction by eliciting information from other sources or analyzing user behavior, could give better results. Nowadays, people do not like to spend time configuring systems. So, using existing information that defines the user's interests, such as a Facebook profile, is a good way to create an initial user profile. This could be done by analyzing the profile information (interests, etc.), but also the content the user has generated (tags, links, comments, etc.). As users evolve, so should their profiles. In addition to a textual user profile builder, we would like to integrate a behavioral user profile builder. It could use standard information, such as the way the user uses the system (zapping, stopping a programme before it ends, etc.), but it would be more interesting to take advantage of richer human behavioral cues, by estimating the user's attention: is the user talking with other people? Is the user looking attentively at the TV, or sleeping? The inclusion of visual sensors in TV sets, as in the gesture-based control system of the Samsung Smart TV, makes this type of analysis possible. Finally, the whole system, as schematized in Figure 3, could propose interesting content to the user. It will be able to efficiently select the most interesting videos thanks to their annotations, automatically added by the first part of the system.

[Figure 3 blocks: Facebook, YouTube, Video, Automatic video annotation, Textual user profile builder, User behavior analyzer, Behavioral user profile builder, Recommender system, Smart TV experience.]

Figure 3: The proposed recommender system: social media and video annotation (left), user profile and behavior analysis (right) contribute to the improvement of the TV user experience.
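One possible way to combine the blocks of Figure 3 is sketched below: a textual profile and a behavioral profile are both represented as tag weights, merged, and matched against the automatically produced video annotations. This is a hypothetical sketch, not a specification of the proposed system. The tag-weight representation, the cosine matching, the update rule and all names and parameters (`alpha`, `rate`, `watched_fraction`) are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse tag-weight dictionaries."""
    dot = sum(weight * b.get(tag, 0.0) for tag, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def update_behavioral(profile, video_tags, watched_fraction, rate=0.2):
    """Move the weights of a video's tags towards the fraction of it that was watched."""
    updated = dict(profile)
    for tag in video_tags:
        old = updated.get(tag, 0.0)
        updated[tag] = (1 - rate) * old + rate * watched_fraction
    return updated


def merge_profiles(textual, behavioral, alpha=0.5):
    """Blend the textual and behavioral profiles with weight alpha on the textual one."""
    tags = set(textual) | set(behavioral)
    return {tag: alpha * textual.get(tag, 0.0) + (1 - alpha) * behavioral.get(tag, 0.0)
            for tag in tags}


def recommend(profile, videos, top=2):
    """Rank annotated videos by similarity to the profile (ties broken by name)."""
    ranked = sorted(videos.items(), key=lambda kv: (-cosine(profile, kv[1]), kv[0]))
    return [name for name, _ in ranked[:top]]


# Hypothetical textual profile, e.g. derived from social network interests.
textual = {"football": 1.0, "cooking": 0.5}

# The user watches a football news programme to the end.
behavioral = update_behavioral({}, ["football", "news"], watched_fraction=1.0)

# Hypothetical annotations produced by the automatic annotation stage.
videos = {
    "match_highlights": {"football": 1.0, "news": 0.3},
    "recipe_show": {"cooking": 1.0},
    "documentary": {"nature": 1.0},
}

profile = merge_profiles(textual, behavioral)
recommended = recommend(profile, videos)
print(recommended)  # ['match_highlights', 'recipe_show']
```

Because the behavioral profile is updated after every viewing, the ranking changes over time without any explicit configuration by the user, which is the property argued for above. Attention cues from visual sensors could feed the same update, for example by lowering `watched_fraction` when the user is not looking at the screen.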

NextGenerationTVthroughAutomaticMultimediaAnnotationSystems-AHybridApproach

195

5 CONCLUSIONS

As image and video databases become huge, it is essential to find efficient, fast, but also accurate techniques to annotate and enrich multimedia content automatically. Both automatic processing and manual annotation methods have advantages, and we propose to merge them in a hybrid system that links user-generated annotations with visual content analysis techniques. The resulting enriched multimedia content will allow us to improve the current television experience (interaction, personalization, sharing, etc.), in order to meet the expectations of the users of next generation television systems. As a starting point in this vision, we would like to build a dynamic recommender system that exploits both video content annotations and a user profile builder based on behavioral analysis.

REFERENCES

Akrivas, G., Papadopoulos, G., Douze, M., Heinecke, J., O’Connor, N., Saathoff, C., and Waddington, S. (2007). Knowledge-based semantic annotation and retrieval of multimedia content. In Proc. of 2nd International Conference on Semantic and Digital Media Technologies, pages 5–6, Genoa, Italy.

Andreetto, M., Zelnik-Manor, L., and Perona, P. (2008). Unsupervised learning of categorical segments in image collections. Proc. of Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08., pages 1–

Ballan, L., Bertini, M., Bimbo, A. D., Meoni, M., and Serra, G. (2010). Tag suggestion and localization in user-generated videos based on social knowledge. In Proc. of second ACM SIGMM Workshop on Social Media (WSM), pages 3–8.

Bertini, M., Amico, G. D., Ferracani, A., Meoni, M., and Serra, G. (2010). Web-based Semantic Browsing of Video Collections using Multimedia Ontologies. In Proceedings of the international conference on Multimedia - MM’10, pages 1629–1632, Firenze, Italy. ACM.

Bizer, C. (2009). The Emerging Web of Linked Data. IEEE Intelligent Systems, 24(5):87–92.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Bosch, A. and Zisserman, A. (2006). Scene classification via pLSA. Proc. of ECCV.

Cai, D., Mei, Q., Han, J., and Zhai, C. (2008). Modeling hidden topics on document manifold. In Proc. of the 17th ACM conference on Information and knowledge management, pages 911–920.

Cooper, W. (2008). The interactive television user experience so far. Proc. of the 1st international conference on Designing interactive user experiences for TV and video (UXTV), 44:133.

Damme, C. V., Hepp, M., and Siorpaes, K. (2007). FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies. Social Networks, 2:57–70.

Dong, A. and Li, H. (2006). Multi-ontology Based Multimedia Annotation for Domain-specific Information Retrieval. IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing - Vol 2 - Workshops, 2:158–165.

Guillaumin, M., Mensink, T., Verbeek, J., and Schmid, C. (2009). Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 309–316.

Hauptmann, A. G., Christel, M. G., and Yan, R. (2008). Video retrieval based on semantic concepts. In Proceedings of the IEEE, volume 96, pages 602–622.

Hofmann, T. (2001). Unsupervised Learning by Probabilistic Latent Semantic Analysis. Mach. Learn., pages 177–196.

Hörster, E., Lienhart, R., and Slaney, M. (2007). Image retrieval on large-scale image databases. Proc. of Conference on Image and video retrieval, pages 17–24.

Hu, D. (2009). Latent Dirichlet Allocation for Text, Images, and Music. cseweb.ucsd.edu, pages 1–19.

Kennedy, L. S., Chang, S. F., and Kozintsev, I. V. (2006). To search or to label?: predicting the performance of search-based automatic image classifiers. Proc. of the 8th ACM international workshop on Multimedia Information Retrieval (MIR), pages 249–258.

Lew, M. S., Sebe, N., Djeraba, C., and Jain, R. (2006). Content-Based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1):1–19.

Li, X. and Snoek, C. (2009). Visual categorization with negative examples for free. In Proc. of ACM Multimedia, pages 661–664.

Li, X., Snoek, C. G. M., and Worring, M. (2009). Learning Social Tag Relevance by Neighbor Voting. IEEE Transactions on Multimedia, 11:1310–1322.

Lienhart, R. and Slaney, M. (2007). pLSA on large scale image databases. Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1217–1220.

Makadia, A., Pavlovic, V., and Kumar, S. (2008). A new baseline for image annotation. In Proc. ECCV, pages 316–329.

Meyer, D. (2001). Support Vector Machines. R News, 2(2):23–26.

Monay, F. and Gatica-Perez, D. (2004). PLSA-based image auto-annotation: constraining the latent space. In Proc. of ACM Multimedia, pages 348–351.

Nguyen, C., Phan, X., and Horiguchi, S. (2009). Web Search Clustering and Labeling with Hidden Topics. ACM Transactions on Asian Language Information Processing (TALIP), 8(3).

Phan, X., Nguyen, L., and Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proc. of the 17th international conference on World Wide Web, pages 91–100.

Qi, G.-J., Aggarwal, C., Tian, Q., Ji, H., and Huang, T. (2012). Exploring context and content links in social media: A latent space method. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(5):850–862.

Schapire, R. E. (2003). The Boosting Approach to Machine Learning: An Overview. MSRI Workshop on Nonlinear Estimation and Classification, 7(4):1–23.

Setz, A. and Snoek, C. (2009). Can social tagged images aid concept-based video search? In Proc. of IEEE International Conference on Multimedia and Expo (ICME), pages 1460–1463.

Sivic, J. and Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. Proceedings Ninth IEEE International Conference on Computer Vision, pages 1470–1477 vol.2.

Smeaton, A., Over, P., and Kraaij, W. (2009). High-level feature detection from video in TRECVid: a 5-year retrospective of achievements. Multimedia Content Analysis, pages 1–24.

Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., and Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380.

Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J.-M., and Smeulders, A. W. M. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proc. of ACM Multimedia.

Tjondronegoro, D., Chen, Y.-P. P., and Pham, B. (2005). Content-based video indexing for sports applications using integrated multi-modal approach. Proc. of ACM Multimedia, page 1035.

Tsai, D., Jing, Y., Liu, Y., Rowley, H., Ioffe, S., and Rehg, J. (2011). Large-Scale Image Annotation using Visual Synset. Proc. of ICCV.

Ulges, A., Schulze, C., Koch, M., and Breuel, T. M. (2010). Learning automatic concept detectors from online video. Computer Vision and Image Understanding, 114(4):429–438.

Wang, X. and Grimson, E. (2007). Spatial latent Dirichlet allocation. Proc. of Neural Information Processing Systems Conference, pages 1–8.

Yin, Z., Li, R., Mei, Q., and Han, J. (2009). Exploring social tagging graph for web object classification. Proc. of ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), page 957.

NextGenerationTVthroughAutomaticMultimediaAnnotationSystems-AHybridApproach

197