AUTOMATIC KEY-FRAME EXTRACTION
FROM BROADCAST SOCCER VIDEOS
Nielsen C. Simões¹, Neucimar J. Leite² and Beatriz Marcotegui³
¹University of Mato Grosso do Sul (UEMS), Dourados, MS, Brazil
²Institute of Computing, University of Campinas (UNICAMP), Campinas, SP, Brazil
³Centre de Morphologie Mathématique (CMM), Ecole des Mines de Paris, Fontainebleau, France
Keywords:
Key-frame, Video analysis, Shot classification, Broadcast soccer video, Visual rhythm.
Abstract:
This paper presents a new approach to broadcast soccer video navigation and summarization based on specific representative images of the video. It takes some particular features of soccer videos into account to better describe them. This work considers a special color reduction based on an HSV subquantization, and a shot classification approach for soccer videos that explores the dominant color related to the playground area.
1 INTRODUCTION
The manual analysis and annotation of video databases is a long and arduous task which can generate mistakes due, for example, to operator weariness. Automatic analysis is thus an important task for video semantic understanding. A digital video segment is defined as a sequence of images, or frames. A shot is an uninterrupted subset of frames recorded from the same camera, and shot detection is one of the first tasks in the automatic analysis of videos. Video editing introduces transitions between shots, and there are many approaches to shot transition detection concerned with different kinds of videos, such as commercials, news and movies.
Key-frames are usually defined to represent a shot and are commonly used in video editing tools which deal with shots along the video time-line. Some video indexing and analysis techniques also consider key-frame information in order to reduce the amount of data to be analyzed.
A key-frame is a representative image of a shot
and its extraction is an important tool for the pro-
cess of video semantic analysis (Ciocca and Schet-
tini, 2005; Doulamis et al., 2000; Dufaux, 2000; Wolf,
1996). In general, it is frequently used for video vi-
sualization (Arman et al., 1994; Komlodi and Mar-
chionini, 1998; Tse et al., 1998; Zhong et al., 1996)
and recovery (Arman et al., 1994; Liu et al., 2003;
Sze et al., 2005; Pardo, 2006), or even as a tool for
scenes detection (Xiong et al., 1997).
Most key-frame detection approaches consider, as
representative images, the frames obtained from a
predefined position in each shot (Arman et al., 1994;
Ueda et al., 1991; Zhang et al., 1995). This sim-
ple method can be efficient for short videos such as
TV advertisements. Other approaches are based on
dissimilarity measures between consecutive frames
(Ciocca and Schettini, 2005; Doulamis et al., 2000;
Yeung and Liu, 1995; Xiong et al., 1997) or also by
computing the local minimum related to image mo-
tion (Dufaux, 2000; Wolf, 1996).
This work presents a new approach for automatic key-frame extraction from TV broadcast soccer videos, useful for video browsing and navigation. We also propose a specific video representative image, a special color reduction, and a shot classification for soccer videos that explores the playground color frequency. Section 2 presents the new approach for key-frame extraction. Section 3 shows experimental results using some Brazilian TV broadcast soccer videos. Section 4 draws conclusions and points to future work.
2 KEY-FRAME EXTRACTION
(Sze et al., 2005) take into account the temporal his-
togram for each pixel in a group of frames related to
a given shot. In this case, the obtained key-frame is a composition of different frames, which yields a
key-frame representation that can be useful for video
retrieval but not for video browsing and navigation,
since this synthetic key-frame is not necessarily in-
cluded in the set of the original frames belonging to
the corresponding shot. (Pardo, 2006) also considers
a video segment or a group of frames to compute the
pixel-wise histogram which, again, can be useful for
video retrieval but not for video browsing and naviga-
tion.
A simple key-frame extraction can be obtained by
defining the first frame of each shot as representative
image. Since soccer videos have a lot of long shots,
some of these unique key-frames may not represent
them properly. Thus, it might be desirable to select
more than one key-frame associated with the content
of these long duration shots. In general, a shot de-
tection step must be considered before performing the
key-frame extraction method. There are different shot
detection approaches in the literature (Brunelli et al., 1999; Guimarães et al., 2003; Kim et al., 2001; Koprinska and Carrato, 2001; Ngo et al., 1998; Simões, 2004; Zhang et al., 1993), and new ones continue to be introduced, mainly due to the specific characteristics of the various existing types of videos.
This work focuses on the key-frame extraction of soccer video images. Since shot detection is beyond the scope of this paper, we assume the availability of a shot detection approach, such as the pixel-wise comparison (Brunelli et al., 1999; Koprinska and Carrato, 2001; Patel and Sethi, 1996; Zhang et al., 1993) adopted for this task in this work. The key-frame extraction proposed here attempts to define at least one key-frame for each single shot; for long-duration shots, more than one key-frame is defined to better describe them.
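As an illustration, a pixel-wise cut detector in the spirit of those references could be sketched as follows; the function name and both threshold values are illustrative assumptions, not taken from the cited works:

```python
import numpy as np

def detect_cuts(frames, pixel_thresh=30, frame_thresh=0.5):
    """Pixel-wise cut detection (a sketch; thresholds are illustrative).

    frames: iterable of grayscale frames as 2D numpy arrays. A cut is
    declared between frames t-1 and t when the fraction of pixels whose
    absolute difference exceeds pixel_thresh is larger than frame_thresh.
    """
    cuts = []
    prev = None
    for t, frame in enumerate(frames):
        if prev is not None:
            diff = np.abs(frame.astype(int) - prev.astype(int))
            if (diff > pixel_thresh).mean() > frame_thresh:
                cuts.append(t)  # shot boundary starts at frame t
        prev = frame
    return cuts
```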
In TV broadcast soccer videos most shots are rep-
resented by large playground regions. In this sense,
we propose a special color reduction approach, as
well as a specific visual rhythm transformation and a
method for shot classification by exploring this play-
ground color information of the shots. As we will
show in Section 2.3, this information will be used to
identify shots which are closely related to the field,
i.e., frames whose pixels yield a high playground color density. The next subsections discuss in detail each step of this key-frame extraction approach.
2.1 Color Reduction
A soccer match is played on a playground which, for
processing purposes, can be considered as the back-
ground of the video images. In TV broadcast videos,
in which a match is recorded through different cam-
eras and angles, the playground information is pre-
sented most of the time, making the color informa-
tion the representative or dominant feature in the cor-
responding shots. In order to estimate this color rep-
resentativeness of the frames, a simple histogram op-
eration can be used. Since this step consumes much
memory and is sensitive to brightness conditions, for
example, a special color reduction model is proposed
to avoid these problems and also discriminate the
playground color information.
By considering the HSV color space, the method
performs a subsampling of its main components. In-
formally, the proposed approach reduces the Hue
component to six values (primary and secondary col-
ors, i.e., red, green, blue, yellow, cyan and magenta)
while the Saturation component is reduced to eight
values. This same reduction procedure is applied to
the Value component (see Figure 1). All colors with
low Saturation are reduced to eight gray levels, and
all the other colors with low Values are represented
by black. For example, due to changes in lighting conditions over time, the playground color can present significant brightness variations. Moreover, the playground grass can be of different types, yielding changes to the playground color saturation as well. Thus, we cannot rely on brightness and saturation values for the playground color identification.
Here, the first, third and fifth Hue values of the color space are related to the primary colors (red, green and blue), and the second, fourth and sixth values are related to the secondary colors (yellow, cyan and magenta). After the analysis of matches under different time and weather conditions, the second Hue value, closely related to the actual colors of the playground, is associated with the playground color. The Saturation and Value components were kept fixed in the reduction approach for this Hue value, which results in only one fixed color for the playground information. The other five Hue values can be reduced by considering seven different values for the Saturation component (low saturations are represented as gray levels) and seven different values for the Value component (low values are represented as black). This maps all HSV values to one playground color, one black color, seven gray levels and 245 different colors: 5 (the 6 original Hue values minus the fixed playground Hue) × 7 (Saturation values) × 7 (Value values). As a result, the method reduces the 24-bit color representation to 254 colors, only one of which is related to the playground information.
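The following sketch illustrates this subquantization. The index layout, the concrete bin edges, and the mapping of the second hue bin (counting red, yellow, green, cyan, blue, magenta) to the playground index follow the description above, but the code itself is our assumption, not the paper's implementation:

```python
import numpy as np

PLAYGROUND, BLACK = 0, 1   # reserved indices
GRAY0 = 2                  # 7 gray levels occupy indices 2..8
COLOR0 = 9                 # 245 chromatic colors occupy 9..253

def reduce_colors(hsv):
    """Map an HSV frame (H in [0, 360), S and V in [0, 1]) to 254 indices.

    Sketch of the paper's subquantization: 6 hue bins, 8 saturation
    levels and 8 value levels. The lowest value level maps to black,
    the lowest saturation level to 7 gray levels, the second hue bin
    to a single playground index, and the remaining 5 hue bins give
    5 x 7 x 7 = 245 colors. Bin edges here are assumptions.
    """
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hue_bin = ((h + 30) % 360 // 60).astype(int)  # 0=red, 1=yellow, ...
    s_bin = np.minimum((s * 8).astype(int), 7)    # 8 saturation levels
    v_bin = np.minimum((v * 8).astype(int), 7)    # 8 value levels

    out = np.full(h.shape, BLACK, dtype=np.uint8)  # low value -> black
    gray = (s_bin == 0) & (v_bin > 0)              # low saturation
    out[gray] = GRAY0 + v_bin[gray] - 1            # 7 gray levels
    field = (hue_bin == 1) & (s_bin > 0) & (v_bin > 0)
    out[field] = PLAYGROUND                        # one fixed field color
    chrom = ~gray & ~field & (v_bin > 0)           # remaining 5 hue bins
    hue5 = hue_bin - (hue_bin > 1)                 # re-index hues to 0..4
    out[chrom] = (COLOR0 + hue5[chrom] * 49
                  + (s_bin[chrom] - 1) * 7 + (v_bin[chrom] - 1))
    return out
```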
Figure 1: HSV partition for color reduction.
2.2 Visual Rhythm Approach
Visual rhythm is a representative image of a whole video, proposed first as a tool for shot detection approaches (Chung et al., 1999; Bezerra and Leite, 2003; Guimarães et al., 2003; Kim et al., 2001). The original visual rhythm (Definition 2.1) uses a transformation function from the spatial domain of the sequence (2D + t) to a (1D + t) domain.
Definition 2.1 Visual Rhythm. Let $f_t(x,y)$ be the color value of the pixel $(x,y)$ of a frame at time $t$, from a digital video with $N$ frames. Let $H$ and $W$ be, respectively, the height and width of the frames. The visual rhythm image $I_{VR}$ is the result of the following transformation:

$$I_{VR}(t,z) = f_t(r_x \times z + a,\ r_y \times z + b),$$

where $z \in [0, H_{VR} - 1]$ and $t \in [0, N - 1]$; $H_{VR}$ and $N$ are, respectively, the height and width of the visual rhythm image; $r_x$ and $r_y$ represent a pixel subsampling rate, while $a$ and $b$ define a translation within each frame.
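A minimal sketch of Definition 2.1 for grayscale frames (function and parameter names are ours):

```python
import numpy as np

def visual_rhythm(frames, rx=1, ry=1, a=0, b=0, height=None):
    """Visual rhythm of Definition 2.1 (a sketch).

    frames: list of 2D (grayscale) numpy arrays of equal size.
    Column t of the output is a subsampled diagonal of frame t:
    I_VR(t, z) = f_t(rx*z + a, ry*z + b).
    """
    h, w = frames[0].shape
    if height is None:
        # largest height that keeps every sample inside the frame
        height = min((h - 1 - b) // ry, (w - 1 - a) // rx) + 1
    z = np.arange(height)
    vr = np.empty((height, len(frames)), dtype=frames[0].dtype)
    for t, f in enumerate(frames):
        vr[:, t] = f[ry * z + b, rx * z + a]  # numpy indexes (row, col)
    return vr
```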
Definition 2.1 corresponds to a subsampling of the video, resulting in a representative image which keeps some of its main temporal features. A more general definition of the visual rhythm is given in Definition 2.2, in which the transformation function can be defined considering both local and global information of the image sequence. Examples of typical local information for the visual rhythm representation are the central vertical or horizontal lines, or the main diagonal of the video frames.
Definition 2.2 General Visual Rhythm. Let $f_t$ be a frame at time $t$ of a digital video with $N$ frames. The general visual rhythm image $I_{GVR}$ is the result of the following transformation:

$$I_{GVR}(t,z) = \tau(f_t, z),$$

where $z \in [0, L - 1]$ ($L$ depends on the transformation function $\tau$) and $t \in [0, N - 1]$; $N$ corresponds to the width of the visual rhythm image.
As can be seen from Definition 2.1, only local in-
formation of a frame can be preserved when the vi-
sual rhythm for video content simplification is used.
For broadcast soccer videos, it is interesting to take
into account more global information on the frames,
mainly if it is necessary to identify the playground
color. For this purpose, the following statistical mode (Definition 2.3) is used as a transformation function. This function will be considered in a new visual rhythm representation (Definition 2.4), which deals with the values of the frame histograms after the color reduction procedure described in Section 2.1.
Definition 2.3 Mode. A mode is the most common element of a set of samples. Let $Hist[x]$ be the frequency of $x$ for a numerical data sample $S$ with $P$ elements. The mode $M$ is defined as follows:

$$M(S) = m \quad \text{such that} \quad Hist[m] \geq Hist[x],\ \forall x \in [0, P - 1].$$

In other words, the mode is the value for which the histogram reaches its maximum.
Definition 2.4 Mode Visual Rhythm. Let $I_{GVR}(t,z)$ be a general visual rhythm as in Definition 2.2, and $f_t(z)$ the line $z$ of the frame at time $t$. The mode visual rhythm image $I_{MVR}$ is defined by considering the following transformation function:

$$I_{MVR}(t,z) = \tau(f_t, z) = M(f_t(z)),$$

where $z \in [0, H - 1]$ and $t \in [0, N - 1]$; $H$ and $N$ are, respectively, the height and width of the mode visual rhythm image. $H$ also indicates the height of the video frames.
Since the playground information is present in most of the soccer video shots, the mode visual rhythm (Definition 2.4) alone cannot, for example, discriminate frames containing several players from others showing just one of them (e.g., during a camera zoom). Furthermore, it is important to know whether the playground color yields a representative mode, i.e., whether a line has few different colors, indicating that its mode is indeed representative. For this purpose, we also compute the mode rate visual rhythm given by the following definition.
Definition 2.5 Mode Rate Visual Rhythm. Let $I_{GVR}(t,z)$ be a general visual rhythm as in Definition 2.2, and $f_t(z)$ the line $z$ of the frame at time $t$. The mode rate visual rhythm $I_{MRVR}$ is the result of the following transformation function:

$$I_{MRVR}(t,z) = \tau(f_t, z) = 100 \times \frac{Hist_z[M(f_t(z))]}{W},$$

where, again, $z \in [0, H - 1]$ and $t \in [0, N - 1]$; $H$ and $N$ are, respectively, the height and width of the mode rate visual rhythm image. As before, $H$ and $W$ also correspond to the height and width of the video frames.
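The following sketch makes Definitions 2.3-2.5 concrete by computing one MVR column and one MRVR column per color-reduced frame; the function name is ours, and this is a straightforward, unoptimized rendering of the definitions:

```python
import numpy as np

def mode_visual_rhythms(frames, n_colors=254):
    """MVR and MRVR images of Definitions 2.4 and 2.5 (a sketch).

    frames: list of color-reduced frames, i.e., 2D integer arrays
    whose values lie in [0, n_colors). For frame t and line z, MVR
    stores the most frequent color index of the line (its mode,
    Definition 2.3) and MRVR stores that frequency as a percentage
    of the frame width W.
    """
    H, W = frames[0].shape
    mvr = np.empty((H, len(frames)), dtype=np.int32)
    mrvr = np.empty((H, len(frames)), dtype=np.float64)
    for t, f in enumerate(frames):
        for z in range(H):
            hist = np.bincount(f[z], minlength=n_colors)
            m = int(hist.argmax())
            mvr[z, t] = m                     # mode of line z
            mrvr[z, t] = 100.0 * hist[m] / W  # mode rate in percent
    return mvr, mrvr
```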
Figure 2 illustrates the definition of the mode vi-
sual rhythm (MVR) and mode rate visual rhythm
(MRVR) representations. As we can see from this
Figure, the color reduction proposed here constitutes
the first step in the definition of these images con-
taining both local and global information about the
whole video. For example, besides the detection of
the dominant color of the sequence considered in Fig-
ure 2, one can notice the vertical lines in the MVR
and MRVR images corresponding to shot transitions.
In this work, these images were used to perform key-
frame extraction as discussed in the next section.
Figure 2: MVR and MRVR images of a soccer video sequence. a) original frame; b) result of the color reduction; c) column with the mode of each line; d) MVR image; e) MRVR image.
2.3 Key-frame Extraction Method
During broadcast TV soccer matches, different cam-
era views are commonly applied to the whole scene.
In general, most of the match is transmitted by con-
sidering a long view camera in which it is possible to
have a global view with many players shown at the
same time. Long view cameras are usually related
to the match play time. A medium view is generally used to focus on some players' actions, showing no more than two or three players at the same time. In special events, a close-up is used to show a player's face or t-shirt, the referee, and also the audience and the coach. Close-ups are mainly related to the break intervals of the match. The shot classification described next takes into account the MVR and MRVR images during a given shot interval.
2.3.1 Shot Classification
By considering the MVR and MRVR images, it is
possible to identify significant shots according to
some predefined rules. In the literature, different shot
classification approaches are proposed aiming at dif-
ferent goals. (Dufaux, 2000) defines a single shot as
the representative shot of a whole video. Later, he
considers a key-frame selection on this shot to de-
fine one key-frame for the whole video. (Vendrig and
Worring, 2003) use a shot selection to help their video
annotation process by considering different classes of
features, such as characters.
In this work, four classes related to the TV broadcast standard for soccer games are proposed, as illustrated in Figure 3. All cameras, for this kind of video, use three predefined apertures and, during the match, these apertures do not change much. Initially, the large view camera focuses on a broad region of the field, related to class a described further below. A medium view camera records some close-ups of the players' actions, such as passing or shooting, and is associated with classes a and b. Class c is related to player close-ups showing, for instance, their t-shirts, numbers or faces. Class d is only considered to identify transmission errors. Basically, our shot classification approach takes into account the amount of playground color information, which, in turn, is used in our key-frame extraction method. More specifically, these four classes are:

Class a - High Dominant Green Shots: shots with a high number of pixels related to the playground color. In general, shots corresponding to global or medium views of the field (Figure 3a).
Class b - Medium Dominant Green Shots: shots with a medium number of pixels related to the playground color. They are commonly associated with close-up views inside the field or lateral views (Figure 3b).

Class c - Low Dominant Green Shots: shots with a low number of pixels related to the playground color. These shots represent mainly the audience, lateral close-up views, or other views outside the field (Figure 3c).

Class d - Non-representative Shots: in general, these shots have no significant information, or no information at all. They are related to transmission failures or no-signal segments (Figure 3d).
The classification method is based on the playground color in the MVR image and on the rate of this color indicated in the MRVR image. For each shot, the corresponding segment in the $I_{MVR}$ and $I_{MRVR}$ images is considered by computing the percentage of the playground color and the average rate of the playground color area.

Figure 3: Example of frames belonging to the four shot classes.
Let A be a selected shot area and PC define the set of points corresponding to the playground color in the MVR image. Also, let RA define the set of points in the MRVR image whose mode rate value is equal to or greater than 50%. This thresholding identifies all significant modes related to the MVR image and, consequently, to the background information in each line of the video frames. A basic shot classification is obtained as follows:
1. For each detected shot in the MVR image, consider the PC area. Let PCP define the PC percentage related to the shot area A, i.e., PCP = |PC| / A.

2. For each detected shot, consider the RA area.

3. Classify the corresponding shots using the PC and RA information as follows (a minimal sketch of these rules is given after the list):

Class a - High Dominant Green Shots: the PCP and the percentage of the subset PC ∩ RA related to the shot area are equal to or greater than a threshold $T_1$ (PCP $\geq T_1$ and $|PC \cap RA|/A \geq T_1$).

Class b - Medium Dominant Green Shots: the PCP is equal to or greater than a threshold $T_2$, and the percentage of the subset PC ∩ RA related to the shot area is lower than the threshold $T_1$ ($|PC \cap RA|/A < T_1$).

Class c - Low Dominant Green Shots: the PCP is lower than the threshold $T_2$, and the percentage of the subset PC ∩ RA related to the shot area is lower than the threshold $T_1$ ($|PC \cap RA|/A < T_1$).

Class d - Non-representative Shots: the PCP is lower than the threshold $T_2$, and the percentage of the subset PC ∩ RA related to the shot area is equal to or greater than the threshold $T_1$ ($|PC \cap RA|/A \geq T_1$).
Figure 4 shows a video segment with eight shots.
After applying a simple threshold in the MRVR im-
age, the area corresponding to the mode rate is se-
lected (Figure 5).
Figure 4: MVR and MRVR superimposed images with
eight shots (separated by black lines).
Figure 5: Selected area from the MRVR image in Fig-
ure 4, after applying a threshold corresponding to 50% of
the mode rate.
2.3.2 Key-frame Selection
Once all shots are discriminated according to the previous classes, the shot length and class are finally used to determine how the key-frames are selected as our final result. In this step, the duration of the shots classified as class a is taken into account in order to determine the number of key-frames to be selected.

Thus, shots previously classified as belonging to class b and class c, which have short duration, need only one key-frame to represent them. In this case, the frame located at the position corresponding to 10% of the shot length is selected. Shots from class a can be of short or long duration. When a long-duration shot is detected, we take its length into account and select at least one key-frame at the beginning (the same position as for the shots in class b and class c) and one at the end, corresponding to 90% of the shot length. Finally, shots belonging to class d have no key-frame, since these shots convey no significant information about the scene. A minimal sketch of this selection procedure follows.
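In the sketch below, the function name and the frame-count cutoff separating short from long class-a shots (`long_shot`) are assumptions, since the paper does not state a concrete threshold:

```python
def select_keyframes(shot_start, shot_end, shot_class, long_shot=150):
    """Key-frame indices for one shot (sketch of Section 2.3.2).

    shot_start, shot_end: first and last frame indices of the shot.
    long_shot: assumed frame-count cutoff separating short from long
    class-a shots (the paper does not state a concrete value).
    """
    if shot_class == "d":
        return []                               # nothing to represent
    length = shot_end - shot_start + 1
    first = shot_start + int(0.10 * length)     # frame at 10% of the shot
    if shot_class == "a" and length >= long_shot:
        last = shot_start + int(0.90 * length)  # and one at 90%
        return [first, last]
    return [first]
```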
3 EXPERIMENTAL RESULTS
Some experiments were performed using three dif-
ferent matches of the First Brazilian League (about
4.5 hours), from different TV channels. We considered a typical and simple shot detection approach to perform this first task, with a precision of at least 82%. A simple pixel-wise comparison was used, frame by frame, as discussed by (Brunelli et al., 1999).

The color subquantization approach presented a very good reduction property, without losing color information, and properly selected the playground color in all three video sequences. It is important to highlight that one video has players with green t-shirts, and that all videos were recorded at different times and under different weather conditions.
Table 1 shows the results concerning the shot classification step, in which all shots were correctly classified. The threshold $T_1$ was defined as 82% and $T_2$ as 18%, taking into account the three soccer videos used in the experiments. In short, the main task here is to classify shots according to their different playground occurrences and to identify those representing transmission failures. Using this shot classification, the next step is to apply the key-frame selection as discussed in Section 2.3.2.
Table 1: Shot classification results.

Classes                  a     b     c    d
a                      934     0     0    0
b                        0   887     0    0
c                        0     0   621    0
d                        0     0     0    1

Correctly classified  2443   (100.0%)
Falsely classified       0   (0.0%)
Class a                934   (38.23%)
Class b                887   (36.31%)
Class c                621   (25.42%)
Class d                  1   (0.04%)
Total of shots        2443   (100.0%)
The validation of a key-frame extraction is an arduous task, mainly due to the many different video characteristics. For soccer videos, it is important to keep the whole dynamics of the match in the set of defined key-frames. To illustrate the results of our approach, the key-frames extracted here are compared to those obtained from the IBM Multimedia Analysis and Retrieval System - Marvel Lite 3.2a (Smith et al., 2007). Table 2 presents the results of the proposed approach and the IBM Marvel Lite software with respect to the number of defined key-frames.
An example of detected key-frames is shown in
Table 2: Key-frame extraction results.

                        Key-frames
Total of shots                2443
Proposed approach             2913
IBM Marvel Lite 3.2a          1949
Figure 6 for a given video segment. Note that all the
obtained key-frames describe different camera views
or position, thus expressing the dynamic of the entire
game, as expected. The IBM Marvel Lite software
was used with its standard parameters and the cor-
responding key-frame extraction is presented in Fig-
ure 7 for the same video segment considered before.
This result shows that some key-frames are not very relevant and that a frame belonging to a dissolve shot transition was included in the set of extracted key-frames. Note that the shot detection method used in the IBM system is not the same as the one considered here.
Figure 6: An example of key-frames resulting from the extraction approach for a soccer video segment.
4 CONCLUSIONS
This work presented a new key-frame extraction approach for TV broadcast soccer videos. It also proposed an efficient color reduction which determines the dominant color of the soccer field playground, as well as a novel video representative image and a shot classification method based on this playground information.

Figure 7: Key-frames extracted by the IBM Marvel software for the same video segment as in Figure 6.

From the results illustrated above, we can see that the classification of the shots, based on their representative images, yields a good classification method. For the key-frame extraction, our approach considers more than one key-frame per shot which, for navigation purposes, can highlight interesting segments of a soccer video.
As future work, we plan to use the MVR and MRVR images for shot detection by exploring their temporal features, and to improve the shot classification in order to classify shots by camera view mode and not only by playground appearance. Finally, it is also interesting to perform, as a validation scheme, a user test for video browsing and navigation considering the key-frame extraction proposed here.
ACKNOWLEDGEMENTS
The authors are grateful to the National Council for
Scientific and Technological Development (CNPq),
CAPES and FAPESP for the financial support.
REFERENCES
Arman, F., Depommier, R., Hsu, A., and Chiu, M.-Y.
(1994). Content-based browsing of video sequences.
In MULTIMEDIA ’94: Proceedings of the second
ACM international conference on Multimedia, pages
97–103, New York, NY, USA. ACM Press.
Bezerra, F. N. and Leite, N. J. (2003). Video transition
detection using string matching: preliminary results.
In SIBGRAPI XVI Brazilian Symposium on Computer
Graphics and Image Processing, pages 339–346.
Brunelli, R., Mich, O., and Modena, C. M. (1999). A sur-
vey on the automatic indexing of video data. Journal
of Visual Communication and Image Representation,
10(2):78–112+.
Chung, M. G., Lee, J., Kim, H., Song, S. M.-H., and Kim,
W. M. (1999). Automatic video segmentation based
on spatio-temporal features. Korea Telecom Journal,
4(1):1–13.
Ciocca, G. and Schettini, R. (2005). Dynamic key-frame
extraction for video summarization. In Santini, S.,
Schettini, R., and Gevers, T., editors, Internet Imag-
ing VI, volume 5670, pages 137–142. SPIE.
Doulamis, A. D., Doulamis, N. D., and Kollias, S. D.
(2000). A fuzzy video representation for video sum-
marization and content-based retrieval. Signal Pro-
cessing, 80(6):1049–1067.
Dufaux, F. (2000). Key frame selection to represent a
video. In International Conference on Image Process-
ing, volume 2, pages 275–278.
Guimarães, S. J. F., Leite, N. J., Couprie, M., and de Albuquerque Araújo, A. (2003). Video segmentation based
on 2D image analysis. Pattern Recognition Letters,
24(7):947–957.
Kim, H., Lee, J., Yang, J.-H., Sull, S., Kim, W. M.,
and Song, S. M.-H. (2001). Visual rhythm and
shot verification. Multimedia Tools and Applications,
15(3):227–245.
Komlodi, A. and Marchionini, G. (1998). Key frame pre-
view techniques for video browsing. In DL ’98: Pro-
ceedings of the third ACM conference on Digital li-
braries, pages 118–125, New York, NY, USA. ACM
Press.
Koprinska, I. and Carrato, S. (2001). Temporal video seg-
mentation. Signal Processing: Image Communica-
tion, 16(5):477–500.
Liu, F., Dong, D., Miao, X., and Xue, X. (2003). A fast
video clip retrieval algorithm based on va-file. In
Yeung, M. M., Lienhart, R. W., and Li, C.-S., edi-
tors, Storage and Retrieval Methods and Applications
for Multimedia 2004, volume 5307, pages 167–176.
SPIE.
Ngo, C. W., Pong, T. C., and Chin, R. T. (1998). Survey of
video parsing and image indexing techniques in com-
pressed domain. Symposium on Image, Speech, Signal
Processing, and Robotics (Workshop on Computer Vi-
sion), 1:231–236.
Pardo, A. (2006). Pixel-wise histograms for visual segment
description and applications. In CIARP2006, Lecture
Notes in Computer Science, volume 4225/2006, pages
873–882. Springer.
Patel, N. V. and Sethi, I. K. (1996). Compressed video pro-
cessing for cut detection. Visual Image Signal Pro-
cessing, 143(5):315–323.
Simões, N. C. (2004). Detecção de algumas transições abruptas em sequências de imagens (in Portuguese). Master's thesis, Institute of Computing - UNICAMP.
Smith, J. R., Natsev, A. P., Tesic, J., Xie, L., Yan, R., Letz, F., Penz, C., Seidl, J., and Yang, J. (2007). IBM multimedia analysis and retrieval system - Marvel Lite 3.2a. http://www.alphaworks.ibm.com/tech/imars.
Sze, K.-W., Lam, K.-M., and Qiu, G. (2005). A new key
frame representation for video segment retrieval. Cir-
cuits and Systems for Video Technology, IEEE Trans-
actions on, 15(9):1148–1155.
Tse, T., Marchionini, G., Ding, W., Slaughter, L., and Kom-
lodi, A. (1998). Dynamic key frame presentation
techniques for augmenting video browsing. In AVI
’98: Proceedings of the working conference on Ad-
vanced visual interfaces, pages 185–194, New York,
NY, USA. ACM Press.
Ueda, H., Miyatake, T., and Yoshizawa, S. (1991). Impact:
an interactive natural-motion-picture dedicated multi-
media authoring system. In CHI ’91: Proceedings
of the SIGCHI conference on Human factors in com-
puting systems, pages 343–350, New York, NY, USA.
ACM Press.
Vendrig, J. and Worring, M. (2003). Interactive adaptive
movie annotation. IEEE MultiMedia, 10(3):30–37.
Wolf, W. (1996). Key frame selection by motion analysis. In
IEEE International Conference on Acoustics, Speech,
and Signal Processing.
Xiong, W., Lee, J. C.-M., and Ma, R.-H. (1997). Automatic
video data structuring through shot partitioning and
key-frame computing. Mach. Vision Appl., 10(2):51–
65.
Yeung, M. M. and Liu, B. (1995). Efficient matching and
clustering of video shots. In ICIP ’95: Proceedings
of the 1995 International Conference on Image Pro-
cessing (Vol. 1)-Volume 1, page 338, Washington, DC,
USA. IEEE Computer Society.
Zhang, H., Kankanhalli, A., and Smoliar, S. W. (1993). Au-
tomatic partitioning of full-motion video. ACM Mul-
timedia Systems, 1(1):10–28.
Zhang, H. J., Low, C. Y., Smoliar, S. W., and Wu, J. H.
(1995). Video parsing, retrieval and browsing: an in-
tegrated and content-based solution. In MULTIME-
DIA ’95: Proceedings of the third ACM international
conference on Multimedia, pages 15–24, New York,
NY, USA. ACM Press.
Zhong, D., Zhang, H., and Chang, S.-F. (1996). Clustering
methods for video browsing and annotation. In Sethi,
I. K. and Jain, R. C., editors, Storage and Retrieval
for Still Image and Video Databases IV, volume 2670,
pages 239–246. SPIE.