ROBUST KEY FRAME EXTRACTION FOR 3D RECONSTRUCTION
FROM VIDEO STREAMS
Mirza Tahir Ahmed, Matthew N. Dailey
School of Engineering and Technology, Asian Institute of Technology, Pathumthani, Thailand
Jose Luis Landabaso, Nicolas Herrero
Telefonica Research, Barcelona, Spain
Keywords:
3D reconstruction, Key frame extraction, 3D video player, Geometrical robust information criterion (GRIC),
3D reconstruction degeneracy.
Abstract:
Automatic reconstruction of 3D models from video sequences requires selection of appropriate video frames
for performing the reconstruction. We introduce a complete method for key frame selection that automatically
avoids degeneracies and is robust to inaccurate correspondences caused by motion blur. Our method combines
selection criteria based on the number of frame-to-frame point correspondences, Torr’s geometrical robust
information criterion (GRIC) scores for the frame-to-frame homography and fundamental matrix, and the
point-to-epipolar line cost for the frame-to-frame point correspondence set. In a series of experiments with
real and synthetic data sets, we show that our method achieves robust 3D reconstruction in the presence of
noise and degenerate motion.
1 INTRODUCTION
Reconstructing a 3D scene from video requires choos-
ing a number of representative (key) frames from
the video stream. Automatic 3D reconstruction from
snapshots and manually extracted video frames has
been a focus of the structure-from-motion research
community for a long time, but only a few researchers
have carefully considered automatic selection of key frames from a video prior to the reconstruction process.
Estimation of 3D camera poses and recovery of 3D scene geometry are very expensive in 3D reconstruction if performed with all frames in a video sequence; decimating the frames makes these processes much less expensive. Additionally,
consecutive frames may have baselines that are too
short for accurate triangulation. Another important
factor is that while the fundamental matrix provides
extremely useful information about the relationship
between two images of a general 3D structure related
by general camera motion, in degenerate cases, when
these generality assumptions do not hold, fundamental matrix estimation fails.
We introduce a method for automatic key frame
selection that takes all of these factors into account. It
is based on the number of frame-to-frame point corre-
spondences obtained, Torr’s geometrical robust infor-
mation criterion (GRIC, Torr, 1998), and the point-to-
epipolar line cost for the frame-to-frame correspon-
dence set to identify key frames. In a series of exper-
iments with real and synthetic data sets, we show that
our method achieves robust 3D reconstruction in the
presence of noise and degenerate motion.
2 REQUIREMENTS
There are three main reasons for extracting key frames
from video sequences: computational performance,
triangulation accuracy, and avoidance of degeneracy.
2.1 Computational Performance
The same level of 3D reconstruction can be achieved from a few well-chosen frames as from all the frames in a video sequence. Decimation not only improves overall performance but also makes estimation of the 3D camera poses and recovery of the 3D scene geometry more efficient.
2.2 Triangulation Accuracy
The baseline is the line between two camera centers.
The baseline length is typically very small in consec-
utive frames. Long baselines are required for accu-
rate triangulation. The size of a 3D point’s region
of uncertainty increases as the distance between two
frames decreases. Therefore, the frame selection pro-
cess should seek to maximize the baseline between
the camera positions for key frames, subject to the
constraint that a sufficient number of correspondences
are retained.
2.3 Degeneracy Avoidance
There are two conditions, non-general camera motion and non-general position of structure, known as degenerate cases, in which the epipolar geometry is not defined and methods based on estimation of the fundamental matrix will fail (although note that the frame pair may still be useful for resectioning, in which we estimate only the camera position from known 3D-2D correspondences):
Motion Degeneracy: If the camera rotates about its
center with no translation, the epipolar geometry
is not defined.
Structure Degeneracy: When all of the 3D points in
view are coplanar, the fundamental matrix cannot
be uniquely determined from image correspon-
dences alone.
3 PREVIOUS WORK
Here we provide an overview of the most relevant recent work in key frame selection. Seo et al. (2003) consider three fac-
tors: (a) the ratio of the number of point correspon-
dences found to the total number of point features
found, (b) the homography error, and (c) the spatial
distribution of corresponding points over the frames.
Hartley and Zisserman (2003) state that the homography error is small when there is little camera motion
between frames. Homography error is a good proxy
for the baseline distance between two views. Seo et
al. also encourage the use of evenly distributed corre-
spondences over the entire image to obtain the funda-
mental matrix. They derive a score function from these factors and select the pair of frames with the lowest score as key frames. The authors do not discuss any measure for degenerate cases.
Pollefeys and van Gool (2002) select key frames for structure and motion recovery based on a motion model selection mechanism (Torr et al., 1998), selecting the next key frame only once the epipolar geometry model explains the relationship between the pair of images better than the simpler homography model. The distinction between the homography and the fundamental matrix is based on the geometric robust information criterion (GRIC, Torr, 1998). They discard all frames involved in degenerate cases.
Seo et al. (2008) use the ratio of the number of correspondences to the total number of features found. If the ratio is close to one, the images overlap too much and the baseline length will be small; under these assumptions, a frame should not be selected as a key frame. The second measure is the reprojection error: the pair of frames with minimum reprojection error is selected as key frames. But as in their earlier work, no measures are taken for degenerate cases.
4 METHOD
We treat key frame selection as constrained optimiza-
tion. Given the first frame of a video sequence, we
seek to find the successor frame that 1) has a suffi-
ciently long baseline (via a correspondence ratio con-
straint), 2) does not lead to degenerate motion or
structure, and 3) has the best estimated epipolar ge-
ometry. We introduce our methods to achieve these
criteria in this section.
4.1 Correspondence Ratio Constraint
We use Seo et al.'s (2008) correspondence ratio $R_c$ as a proxy for baseline length:
$$R_c = \frac{T_c}{T_f}, \qquad (1)$$
where $T_c$ is the number of frame-to-frame point features in correspondence for the frame pair under consideration, and $T_f$ is the total number of point features considered for correspondence. $R_c$ is inversely correlated with camera motion: as the camera moves, features in view tend to leave the scene, and the appearance of objects in view tends to change with perspective distortion, occlusion, and so on.
Although a long baseline is desirable for triangu-
lation accuracy, if the number of corresponding fea-
tures is too low, camera pose estimation accuracy will
suffer. We therefore constrain candidate key frames to
those having a correspondence ratio $R_c$ between upper and lower thresholds $T_1$ and $T_2$. Currently, we set these thresholds through experimentation.
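For concreteness, the following is a minimal sketch of this constraint in Python with OpenCV. The ORB detector, brute-force matcher, and threshold values are illustrative assumptions; the feature pipeline and thresholds used in our experiments are not specified here.

import cv2

orb = cv2.ORB_create(nfeatures=2000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def correspondence_ratio(img_a, img_b):
    """R_c = T_c / T_f: matched features over total features considered."""
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matches = matcher.match(des_a, des_b)
    t_f = min(len(kp_a), len(kp_b))  # total point features considered
    t_c = len(matches)               # features in correspondence
    return t_c / t_f if t_f else 0.0

T_1, T_2 = 0.4, 0.9  # illustrative lower/upper thresholds, set experimentally

def satisfies_ratio_constraint(img_a, img_b):
    return T_1 <= correspondence_ratio(img_a, img_b) <= T_2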
4.2 Degeneracy Constraint
The relationship between a pair of images with gen-
eral camera motion and general structure is appropri-
ately defined by a fundamental matrix, whereas de-
generate camera motion is more fittingly defined by
a homography. We can thus use the relative quality
of fit to distinguish general motion from degenerate
motion. To assess the relative quality of fit, we use
Torr’s geometric robust information criterion (GRIC,
Torr, 1998). GRIC is based not only on goodness of
fit but also on the relative parsimony of the two types
of models. The score, summed over the point corre-
spondences, is
$$\mathrm{GRIC} = \sum_i \rho(e_i^2) + \lambda_1 d n + \lambda_2 k, \qquad (2)$$
where
$$\rho(e_i^2) = \min\!\left(\frac{e_i^2}{\sigma^2},\; \lambda_3 (r - d)\right)$$
is a robust function of the residual $e_i$, $d$ is the number of dimensions modeled ($d = 3$ for a fundamental matrix or $2$ for a homography), $n$ is the total number of features matched across the two frames, $k$ is the number of degrees of freedom in the model ($k = 7$ for a fundamental matrix or $8$ for a homography), $r$ is the dimension of the data ($r = 4$ for 2D correspondences between two frames), $\sigma^2$ is the assumed variance of the error, $\lambda_1 = \log(r)$, $\lambda_2 = \log(rn)$, and $\lambda_3$ limits the residual error.
Given a candidate key frame, we calculate the
GRIC score for the homography and the fundamental
matrix models. If the GRIC score for the homography
model is lower than the GRIC score for the fundamen-
tal matrix, we eliminate the frame as a candidate key
frame.
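A minimal sketch of the GRIC score (Equation 2) and the degeneracy test follows, in Python. The residuals for each model are taken as given here; the value of $\lambda_3$ and the noise level $\sigma^2$ are assumed, illustrative choices rather than the values used in our experiments.

import numpy as np

def gric(res_sq, d, k, sigma2, n, r=4):
    """Torr's GRIC (Equation 2): sum_i rho(e_i^2) + lambda1*d*n + lambda2*k."""
    lam1 = np.log(r)
    lam2 = np.log(r * n)
    lam3 = 2.0  # limits the residual error; an assumed value
    rho = np.minimum(np.asarray(res_sq) / sigma2, lam3 * (r - d))
    return rho.sum() + lam1 * d * n + lam2 * k

def is_degenerate(res_sq_h, res_sq_f, sigma2):
    """Eliminate the candidate frame when the homography model scores better."""
    n = len(res_sq_h)
    gric_h = gric(res_sq_h, d=2, k=8, sigma2=sigma2, n=n)
    gric_f = gric(res_sq_f, d=3, k=7, sigma2=sigma2, n=n)
    return gric_h <= gric_f  # lower GRIC is better; H winning signals degeneracy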
4.3 Key Frame Selection Criteria
We assume that the $i$th key frame has already been identified as the frame with index $k_i$ ($k_0$ is just the first frame of the video sequence). Here we describe our method to select the next key frame $k_{i+1}$. Let $\phi(k_i)$ be the set of frame indices succeeding $k_i$ for which the upper and lower bounds on the correspondence ratio $R_c$ are satisfied and for which the GRIC score for the fundamental matrix model is better than the GRIC score for the homography model. We let
$$k_{i+1} = \operatorname{argmax}_{j \in \phi(k_i)} f(k_i, j), \qquad (3)$$
where $f(i, j)$ is an objective function expressing one or more key frame goodness criteria for current key frame $i$ and candidate next key frame $j$. We consider two criteria, the GRIC difference and the point-to-epipolar line cost (PELC).
4.3.1 GRIC Difference Criterion
If the GRIC score of the fundamental matrix model is
much better than that of the homography model, the
relationship between the frames is much better repre-
sented by the fundamental matrix model, indicating
a good candidate key frame. We use the normalized
GRIC difference as one possible criterion for select-
ing the next key frame:
$$f_G(i, j) = \frac{\mathrm{GRIC}_H(i, j) - \mathrm{GRIC}_F(i, j)}{\mathrm{GRIC}_H(i, j)}, \qquad (4)$$
where $\mathrm{GRIC}_H(i, j)$ is the GRIC score from Equation (2) for the homography between frames $i$ and $j$, and $\mathrm{GRIC}_F(i, j)$ is the GRIC score for the fundamental matrix for frames $i$ and $j$. As we shall see, this measure is good for selecting key frames because it provides very low variation in reprojection error compared to uniformly sampled frames.
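As a small sketch, the normalized difference itself is a one-liner given the two GRIC scores computed as above:

def gric_difference(gric_h, gric_f):
    """Normalized GRIC difference f_G (Equation 4); large and positive
    when the fundamental matrix explains the pair far better than H."""
    return (gric_h - gric_f) / gric_h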
4.3.2 PELC Criterion
The GRIC difference method tends to stabilize varia-
tion in reprojection error, but as we shall see in the ex-
perimental results, it has little effect on the mean error
compared to uniformly sampled frames. We analyzed
the GRIC difference scores and the point-to-epipolar
line cost over many frames in real image sequences
and observed some frames in which the variation in
the GRIC difference was very small but the variation
in the point-to-epipolar line cost (PELC) was very
high, as shown for example in Figure 1. We found
that high PELC values tended to occur due to inaccu-
rate correspondences with blurry images in our video
sequences.
We thus consider PELC as an additional criterion
for key frame selection. As we shall see in the exper-
imental results, including both the GRIC difference
and the PELC in the key frame selection criteria helps
us find key frames that are both well explained by
the epipolar geometry and have highly accurate cor-
respondences. We therefore propose the alternative
key frame score
$$f_{GP}(i, j) = w_G\, f_G(i, j) + w_P\,(\sigma - \mathrm{PELC}(i, j)), \qquad (5)$$
where $\sigma$ is the assumed standard deviation of the error and PELC is the standard geometric reconstruction error measure for the fundamental matrix (Hartley and Zisserman, 2003). The weights $w_G$ and $w_P$ could be set automatically, but we currently set them experimentally.
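A sketch of the PELC term and the combined score follows, reusing the gric_difference helper above. It assumes PELC is computed as the mean symmetric point-to-epipolar-line distance; the default weights and $\sigma$ are illustrative, not the values used in our experiments.

import numpy as np

def pelc(F, pts_a_h, pts_b_h):
    """Mean symmetric point-to-epipolar-line distance for F.

    pts_a_h, pts_b_h: (n, 3) homogeneous image points in frames i and j."""
    l_b = pts_a_h @ F.T  # epipolar lines in image j: l' = F x
    l_a = pts_b_h @ F    # epipolar lines in image i: l = F^T x'
    d_b = np.abs(np.sum(l_b * pts_b_h, axis=1)) / np.hypot(l_b[:, 0], l_b[:, 1])
    d_a = np.abs(np.sum(l_a * pts_a_h, axis=1)) / np.hypot(l_a[:, 0], l_a[:, 1])
    return float(np.mean((d_a + d_b) / 2.0))

def f_gp(gric_h, gric_f, F, pts_a_h, pts_b_h, w_g=1.0, w_p=1.0, sigma=1.0):
    """Combined key frame score (Equation 5), with assumed weights."""
    return (w_g * gric_difference(gric_h, gric_f)
            + w_p * (sigma - pelc(F, pts_a_h, pts_b_h)))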
4.4 Algorithm Summary
The complete method for key frame selection is sum-
marized in Algorithm 1.
Figure 1: Variation of the GRIC difference, PELC, and GRIC+PELC scores over candidate frames. Frame 0 is an assumed previous key frame. The GRIC difference is maximal for frame 13, but PELC has a local minimum at frame 14. Since there is only a small change in GRIC difference between frames 13 and 14 but a much improved PELC, the GRIC+PELC method (Equation 5) selects frame 14 as the next key frame.
Algorithm 1: KEYFRAMEEXTRACTION.
1: Input: A video stream with $n$ frames.
2: Output: Key frame index sequence $k_0, k_1, \ldots$
3: $i \leftarrow 0$; $j^* \leftarrow 0$
4: while $j^* \neq$ nil do
5:   $k_i \leftarrow j^*$; $i \leftarrow i + 1$; $j^* \leftarrow$ nil
6:   for candidate frame $j \in k_i + 1, \ldots, n$ do
7:     Match keypoints between frames $k_i$ and $j$
8:     Compute $H$ and $F$ using RANSAC
9:     Discard outlier matches
10:    Calculate correspondence ratio $R_c$
11:    if $R_c < T_{\min}$ or $R_c > T_{\max}$ then
12:      continue
13:    end if
14:    if $\mathrm{GRIC}_H(k_i, j) \leq \mathrm{GRIC}_F(k_i, j)$ then
15:      continue
16:    end if
17:    if $f_{GP}(k_i, j)$ is best so far then
18:      $j^* \leftarrow j$
19:    end if
20:  end for
21: end while
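For illustration, the sketch below assembles the pieces above into an end-to-end version of Algorithm 1 in Python with OpenCV. It reuses the correspondence_ratio, orb, matcher, gric, and f_gp helpers sketched in Sections 4.1-4.3; match_points, the residual choices (homography transfer error and an algebraic epipolar error), and all parameter values are assumptions rather than our actual implementation.

import cv2
import numpy as np

def match_points(img_a, img_b):
    """Matched point arrays of shape (n, 2), using the assumed ORB/BF pipeline."""
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    matches = matcher.match(des_a, des_b)
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    return pts_a, pts_b

def homogeneous(pts):
    return np.hstack([pts, np.ones((len(pts), 1))])

def extract_key_frames(frames, t_min=0.4, t_max=0.9, sigma2=1.0):
    key_frames, k_i = [0], 0
    while True:
        best_j, best_score = None, -np.inf
        for j in range(k_i + 1, len(frames)):
            r_c = correspondence_ratio(frames[k_i], frames[j])
            if r_c < t_min or r_c > t_max:  # correspondence ratio constraint
                continue
            pts_a, pts_b = match_points(frames[k_i], frames[j])
            if len(pts_a) < 8:
                continue
            H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
            F, _ = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0)
            if H is None or F is None:
                continue
            # Squared homography transfer error per correspondence
            proj = cv2.perspectiveTransform(pts_a.reshape(-1, 1, 2), H).reshape(-1, 2)
            res_h = np.sum((proj - pts_b) ** 2, axis=1)
            # Squared algebraic epipolar error as a simple F residual proxy
            a, b = homogeneous(pts_a), homogeneous(pts_b)
            res_f = np.sum(b * (a @ F.T), axis=1) ** 2
            n = len(pts_a)
            g_h = gric(res_h, d=2, k=8, sigma2=sigma2, n=n)
            g_f = gric(res_f, d=3, k=7, sigma2=sigma2, n=n)
            if g_h <= g_f:  # degeneracy constraint: homography preferred
                continue
            score = f_gp(g_h, g_f, F, a, b)
            if score > best_score:
                best_score, best_j = score, j
        if best_j is None:  # no admissible successor frame: done
            return key_frames
        key_frames.append(best_j)
        k_i = best_j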
5 EXPERIMENTS AND RESULTS
We performed experiments with both synthetic and
real data. The synthetic data is useful because we
can precisely identify degenerate motion and struc-
ture; the real data is useful for validating the method's robustness to real-world noise.
5.1 Video Sequences
Here we provide details about each experimental
video sequence.
Synthetic Data: Church. We created a 930-frame
synthetic sequence with ground truth data using
Blender (Blender Community, 2009) and a sam-
ple 3D model of a church (Blender Artists, 2000).
The scene is outdoors, with sky in the back-
ground. We inserted degenerate cases of both
types by rotating the camera about its center or zooming in on planar surfaces. We ex-
tracted 3D points, projected 2D points, camera
projection matrices, and the ground truth depth for
every 2D point.
Indoor Data. These sequences were captured in-
doors at Telefonica Research, Barcelona, with a
Sony HDR camera. We performed manual cali-
bration of the camera intrinsic parameters using a
chessboard pattern.
Library. A 1500-frame sequence in the Telefon-
ica library.
Lunch Room. A 1500-frame sequence in the
Telefonica lunch room.
Imagenio. A 1500-frame sequence in the Image-
nio room at Telefonica.
Nico. A 1000-frame sequence of a person sitting
still in a chair.
Photocopy Machine. A 1200-frame sequence of
a photocopy machine.
5.2 Experiment 1: Degeneracy
In Experiment 1, we tested degenerate case identifi-
cation in the Church sequence. We processed every
10th frame as a candidate key frame. We manually
identified 26 frames consisting of degenerate motion
or structure.
Figure 2 shows the $\mathrm{GRIC}_H(i, i+10)$ and $\mathrm{GRIC}_F(i, i+10)$ scores for each frame $i$ considered. Frames 201-271 and 441-571 consist of pure camera rotations, and frames 761-881 contain only coplanar points.
Table 1 shows the detection rate and error rate
for the degenerate cases in the data set. The method
is able to identify the actual degenerate cases per-
fectly, with only 3 false positives among the 93 frames
tested.
Table 1: Detections and errors for degenerate motion and
structure detection in Experiment 1. FPs = false positives;
FNs = false negatives.
Sequence Positives FPs FNs
Church 26 3 0
Figure 2: GRIC scores for the synthetic Church sequence. The red dash-dot line shows $\mathrm{GRIC}_H(i, i+10)$; the green dash-dot line shows $\mathrm{GRIC}_F(i, i+10)$. The cyan areas mark the 26 degenerate cases, where the homography model is preferred; the light red areas indicate the 3 false positives.
5.3 Experiment 2: 3D Reconstruction
In Experiment 2, we compared uniformly sampled
key frames, key frames selected based on the GRIC
difference score, and key frames selected using PELC
as well as the GRIC difference score. For each
sequence, we performed key frame selection then
applied Telefonica’s structure from motion pipeline
(metric reconstruction from the essential matrix for
the first pair followed by resectioning and bundle ad-
justment for subsequent key frames) to obtain a 3D
point cloud from the key frames. We computed the
root mean reprojection error for each frame then com-
puted the min, max, mean, and standard deviation
statistics over the entire sequence. A numerical com-
parison of the three methods is shown in Table 2, and
the per-frame reprojection errors are shown for three
sequences in Figure 3.
The GRIC difference score method yields much lower reprojection error variance than uniform sampling in almost every case, but the mean reprojection error is not much better than that for uniform
sampling, due to a few outlier frames. A manual
inspection revealed that the outlier frames tended to
be those with significant blur, leading to inaccurate
correspondences, even for the inlier correspondences.
Including PELC in the objective function eliminates
these outlier frames and leads to lower mean repro-
jection error and lower variance for all of the real se-
quences. PELC does not help much on the noise-free
synthetic sequence, however.
Figure 3: Per-frame reprojection error for the Library, Imagenio, and Nico sequences. Red dotted lines: uniform sampling. Blue dashed lines: GRIC. Green dot-dashed lines: GRIC+PELC.
6 DISCUSSION AND
CONCLUSIONS
We have demonstrated the feasibility of automatic key
frame selection using a combination of constraints
based on the correspondence ratio and the GRIC
Table 2: Results of Experiment 2. The GRIC+PELC method obtains the lowest mean reprojection error and lowest error variance on all real video sequences.

Sequence            Method      Key frames   Reprojection error
                                             Min     Max     Mean    σ
Church (synthetic)  Uniform     30           0.2929  0.9608  0.4451  0.0221
                    GRIC        37           0.2627  0.5498  0.3878  0.0029
                    GRIC+PELC   29           0.2752  0.5164  0.3748  0.0026
Library             Uniform     43           0.4681  1.9595  0.6108  0.1434
                    GRIC        34           0.3362  0.7222  0.4738  0.0046
                    GRIC+PELC   37           0.2519  0.4852  0.3971  0.0022
Imagenio            Uniform     84           0.2016  1.8660  0.5415  0.0294
                    GRIC        79           0.2786  1.5150  0.4772  0.0363
                    GRIC+PELC   65           0.2493  0.6042  0.4275  0.0049
Nico                Uniform     30           0.3052  1.2125  0.4668  0.0290
                    GRIC        29           0.2760  0.5533  0.4032  0.0043
                    GRIC+PELC   29           0.2760  0.5155  0.3780  0.0031
Photocopy machine   Uniform     51           0.3210  1.7264  0.4649  0.0368
                    GRIC        51           0.3210  0.7955  0.4707  0.0116
                    GRIC+PELC   58           0.3274  0.5682  0.4324  0.0036
score for the homography and fundamental matrix,
followed by optimization of a criterion including the
GRIC difference and the point-to-epipolar line cost.
We find that the relative quality of the fundamental
matrix and homography models, represented by the
GRIC difference, is more important than the point-to-epipolar line cost, but both are useful in key frame
selection when some frames are corrupted by blur.
One possible limitation of our method is the need
to specify the thresholds and weights. We currently
set these free parameters experimentally. However,
since all of the parameters are relative to the number
of correspondences obtained or the overall residual er-
ror, in principle, it should be possible to find values
that work well for most sequences and allow the user
to adjust them when necessary.
By limiting further 3D reconstruction processing
to the most informative frames, our method helps to
minimize the overall compute time of the video pro-
cessing pipeline. Telefonica is deploying the method
in an upcoming product for video surfing, which is in
the last phase of development.
Future work will focus on enhancing the system
for robustness with arbitrary videos. The key frame
selection method may have to interact with other pro-
cesses such as moving object segmentation and auto-
calibration to achieve this goal.
ACKNOWLEDGEMENTS
MTA was supported by a graduate fellowship from
the Higher Education Commission of Pakistan. We
thank Telefonica Research, Barcelona, for providing the environment for this research, and we thank Guillermo Gallego and Jose Carlos for valuable suggestions.
REFERENCES
Blender Artists (2000). 3D church model. Available at
http://blenderartists.org/cms/.
Blender Community (2009). Blender [open source soft-
ware]. Available at http://www.blender.org/.
Hartley, R. and Zisserman, A. (2003). Multiple View Geom-
etry in Computer Vision. Cambridge University Press,
New York, NY, USA.
Pollefeys, M. and Van Gool, L. (2002). Visual modeling
with a hand held camera. Journal of Visualization and
Computer Animation (JVCA), 13:199–209.
Seo, Kim, Doo, and Choi (2008). Optimal keyframe se-
lection algorithm for three-dimensional reconstruction
in uncalibrated multiple images. Society of Photo-Optical Instrumentation Engineers (SPIE), Vol. 47.
Seo, Kim, Jho, and Hong (2003). 3D estimation and
keyframe selection for match move. International
Technical Conference on Circuits/Systems, Computers
and Communications (ITC-CSCC).
Torr, P., Fitzgibbon, A., and Zisserman, A. (1998). Main-
taining multiple motion model hypotheses over many
views to recover matching and structure. In IEEE International Conference on Computer Vision (ICCV), pages 485-491.
Torr, P. H. S. (1998). Geometric motion segmentation and
model selection. Phil. Trans. Royal Society of London
A, 356:1321–1340.