UNSUPERVISED LEARNING FOR TEMPORAL SEARCH SPACE
REDUCTION IN THREE-DIMENSIONAL SCENE RECOVERY
Tom Warsop and Sameer Singh
Research School of Informatics, Holywell Park, Loughborough University, Leicestershire, LE11 3TU, U.K.
Keywords:
Three-dimensional scene recovery, Search space reduction, Unsupervised learning.
Abstract:
Methods for three-dimensional scene recovery traverse scene spaces (typically along epipolar lines) to compute two-dimensional image feature correspondences. These methods ignore potentially useful temporal information presented by previously processed frames, which can be used to decrease search space traversal. In this work, we present a general framework which models relationships between image information and recovered scene information specifically for the purpose of improving efficiency of three-dimensional scene recovery. We further present three different methods implementing this framework using either a naive Nearest Neighbour approach or a more sophisticated collection of associated Gaussians. Whilst all three methods provide a decrease in search space traversal, it is the Gaussian-based method which performs best, as the other methods are subject to the (demonstrated) unwanted behaviours of convergence and oscillation.
1 INTRODUCTION
Recovering three-dimensional (3D) scene informa-
tion from two-dimensional (2D) image information
can be very useful. The work presented in this paper
is part of a larger project concerned with recovering
3D scene information from a train mounted, forward-
facing camera.
Many methods have previously been applied to
3D scene recovery. As highlighted by (Favaro et al.,
2003), a large proportion of these methods follow a
similar pattern of execution. First, point-to-point correspondences among different images are established.
These image correspondences are then used to in-
fer three-dimensional geometry. These feature cor-
respondences can be computed in one of two ways, either by searching the 2D image plane or by incorporating epipolar geometry.
The first set of methods does not take the 3D nature of the problem into account. These methods typically operate in two steps. First, image features are detected. Methods presented in the literature use Harris corners (Li et al., 2006), SIFT features (Zhang et al., 2010) and SURF features (Bay et al., 2008).
More recently, to compensate for viewpoint changes in captured image information, (Chekhlov and Mayol-Cuevas, 2008) artificially enhanced the feature set for a single image point considered, computing spatial gradient descriptors for multiple affine-transformed versions of the image area surrounding a feature point. Feature correspondences are then computed by feature matching in subsequent frames.
It is, however, possible to incorporate 3D informa-
tion into these feature correspondence computations.
One of the most straightforward ways of integrating
3D information uses stereo cameras. Under schemes
such as these, as can be seen in the work of (Zhang et al., 2009; Fabbri and Kimia, 2010; Li et al., 2010) and (Grinberg et al., 2010) (to name a few), epipolar scanlines across left and right-hand images are searched for matching feature correspondences. It is possible to integrate these concepts into monocular camera configurations, such as in the method introduced by (Klein and Murray, 2007), known as Parallel Tracking and Mapping (PTAM), in which features are initialised with their 3D positions by searching along epipolar lines defined by depth between key frames of the image sequence. (Davison, 2003; Davison et al., 2007) presented a similar idea of feature initialisation in monoSLAM.
When recovering 3D information from image sequences processed in reverse chronological order, new scene elements to process appear at the image edges. This provides an interesting property: image areas recovered in subsequent image frames exhibit similar properties to those processed previously, as highlighted in Figure 1. It may therefore be possible to exploit this information, using relationships
between 2D image features and recovered 3D scenes to reduce the size of the search spaces traversed (for example, along epipolar lines) when computing feature correspondences. Such a concept has not been proposed by previous methods and forms the basis of the method presented in this work (named Temporal Search space Reduction, or TSR).
Figure 1: When processing image sequences in reverse or-
der, new scene elements entering at image edges exhibit
similar 2D image and 3D scene properties.
The structure of the remainder of this paper is as follows. Section 2 presents a brief overview of the 3D scene recovery method used as a platform for experimental comparison of the TSR extension. The TSR concept and three implementations are discussed in Section 3. Experimental results on real data are provided in Section 4, as is a discussion of the problems faced by methods implementing the TSR concept. Finally, Section 5 concludes the work in this paper.
2 THREE-DIMENSIONAL
SEQUENCE RECOVERY
To demonstrate the TSR concept, the 3D scene recov-
ery method described by (Warsop and Singh, 2010) is
used. This method has been chosen because it can be
simply adapted to recover dense 3D scene informa-
tion in the form of planes, in which correspondences
are searched for along categorized epipolar lines. This
method recovers the 3D corner points of a plane re-
lated to image quadrilaterals by searching for the 3D
corner values which provide the lowest reprojection
error in a subsequent frame. This is summarized as:

$$P_{3D} = \min_{Q_{3D}} \{ SAD(S_1, SQR(R(Q_{3D}, I_2), I_2)) \} \quad (1)$$
where $SAD(x, y)$ computes the sum of absolute differences in the RGB channels of images $x$ and $y$, $R(Q, I)$ reprojects the 3D coordinates of $Q$ into subsequent image $I$, $SQR(q, I)$ converts a quadrilateral image area $q$ into a square area using image data $I$, $S_1 = SQR(Q_1, I_1)$ where $Q_1$ and $I_1$ are the original quadrilateral and image under consideration (respectively), and $Q_{3D}$ are the 3D coordinates of the quadrilateral corner points searched through.
Figure 2: The three base types of plane used for searching in the 3D scene recovery method: (a) flat, (b) side, (c) vertical.
The adaptation made to this method takes the form of only using three types of planes (shown in Figure 2) when searching for the best matching plane, defined by height, width and depth respectively.
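To make the search in Equation 1 concrete, the following Python sketch shows the exhaustive minimisation over candidate plane corner sets. The helper callables (`reproject`, `sqr`, `sad`) and the way candidate planes are enumerated are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def recover_plane(S1, I2, candidate_quads, reproject, sqr, sad):
    """Sketch of Equation 1: return the candidate 3D quadrilateral whose
    reprojection into the subsequent image I2, resampled to a square patch,
    gives the lowest SAD against the reference patch S1."""
    best_quad, best_cost = None, np.inf
    for Q3D in candidate_quads:       # e.g. flat/side/vertical planes swept over height/depth/width
        q2d = reproject(Q3D, I2)      # project the 3D corner points into image I2
        patch = sqr(q2d, I2)          # warp the quadrilateral image area to a square patch
        cost = sad(S1, patch)         # sum of absolute differences over the RGB channels
        if cost < best_cost:
            best_quad, best_cost = Q3D, cost
    return best_quad, best_cost
```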
3 UNSUPERVISED LEARNING
FOR TEMPORAL SEARCH
SPACE REDUCTION
Figure 3 demonstrates how the concept proposed by
TSR (shaded boxes) integrates with a typical 3D
scene recovery method (unshaded boxes). The following describes each shaded box; a skeletal sketch of the resulting per-feature loop is given after the list:
1. Compute 2D Image Features - since the image
area surrounding a feature is to be used to indi-
cate the 3D search space to traverse, these features
must be computed.
2. (2D,3D) Relationship Model - storing the rela-
tionship between 2D image features and corre-
sponding recovered 3D information.
3. Compute 3D Search Space - for any new feature
considered for recovery, the range of 3D values
to consider should be selected based upon the
computed 2D values. If similar features have
been processed before, narrow ranges around ex-
pected values should be searched. Otherwise,
large ranges should be selected. These search
ranges are defined by a category type (flat, ver-
tical or side) and value (height, depth or width).
4. Update the Model - once a new feature has been recovered, the model storing the 2D and 3D relationships must be updated to include this new information.
Figure 3: 3D sequence recovery method (unshaded boxes) enhancement, incorporating temporal information stored as a relationship between 2D image information of the area surrounding a feature and corresponding recovered 3D information. The grey shaded boxes represent the additions proposed by TSR.
5. Model Degradation - it is necessary to degrade
the model on each iteration, preventing it from
consuming too much memory.
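As a rough sketch of how these five steps slot into the host recovery method, the per-feature loop could look as follows. The callables and the model interface are hypothetical placeholders supplied by the surrounding system, not an API defined in this paper.

```python
def tsr_process_frame(image, scene_3d, model, detect_new_features,
                      compute_2d_features, recover_3d, camera_distance):
    """One TSR-augmented iteration (steps 1-5 above). All callables and the
    model object are assumed to be provided by the host recovery method."""
    for feature in detect_new_features(image, scene_3d):    # only new, unprocessed image areas
        f2d = compute_2d_features(image, feature)            # step 1: 2D image features
        ranges = model.compute_search_ranges(f2d)            # steps 2-3: reduced 3D search ranges
        recovered = recover_3d(feature, image, ranges)       # traverse only the selected ranges
        model.update(f2d, recovered)                         # step 4: store the new (2D,3D) pair
        scene_3d.append(recovered)
    model.degrade(camera_distance)                           # step 5: drop stale model entries
    return scene_3d
```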
The following subsections discuss three different methods for implementing these proposed TSR extensions. Each method uses the same image features: for each image quadrilateral area considered, separate red, green and blue histograms are computed, from which the mean, standard deviation, skewness, kurtosis and energy are each computed, resulting in 15 features.
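A possible implementation of these 15 features is sketched below; the histogram bin count and the choice of computing the moments over normalised histogram bins (rather than raw pixel values) are assumptions.

```python
import numpy as np

def histogram_features(patch_rgb, bins=32):
    """Compute 15 features for one quadrilateral area: mean, standard deviation,
    skewness, kurtosis and energy of the red, green and blue histograms."""
    features = []
    for c in range(3):                                          # R, G, B channels
        hist, _ = np.histogram(patch_rgb[..., c], bins=bins, range=(0, 256))
        p = hist / max(hist.sum(), 1)                           # normalised histogram
        levels = np.arange(bins, dtype=float)
        mean = float((levels * p).sum())
        std = float(np.sqrt((((levels - mean) ** 2) * p).sum()))
        skew = float((((levels - mean) ** 3) * p).sum() / (std ** 3 + 1e-12))
        kurt = float((((levels - mean) ** 4) * p).sum() / (std ** 4 + 1e-12))
        energy = float((p ** 2).sum())
        features.extend([mean, std, skew, kurt, energy])
    return np.array(features)                                   # 15-dimensional feature vector
```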
3.1 Nearest Neighbour (NN)
The first implementation stores the (2D,3D) rela-
tionship as a list of tuples (M) of the form: <
2D image features, category, value >. For any newly
processed quadrilateral, the categories to process are determined by computing the following value for each category $c \in \{flat, side, vertical\}$:

$$p_c = \frac{1}{\sqrt{2\pi\sigma_{N_c}^2}}\, e^{-\frac{(f - \mu_{N_c})^2}{2\sigma_{N_c}^2}} \quad (2)$$
where $f$ are the 2D image features of the currently considered quadrilateral, $N_c$ represents the $K$ Euclidean nearest neighbours to $f$ in $M$ of category $c$, and $\mu_{N_c}$ and $\sigma_{N_c}$ are the mean and standard deviation of $N_c$ respectively. The resultant set of probabilities is normalized, and any above a threshold indicate that the corresponding categories should be processed. The range of values to process for any chosen category is defined by a minimum and maximum value computed using:
$$min_c = \mu_v - (D\sigma_v \times (1 - p_c)) \quad (3)$$

$$max_c = \mu_v + (D\sigma_v \times (1 - p_c)) \quad (4)$$
where $\mu_v$ and $\sigma_v$ are the mean and standard deviation of the 3D values associated with $N_c$, and $D$ is a scalar value. Scene recovery results are used to create new tuples to update $M$ with. To implement model degradation, an extra distance field is used in the tuples of $M$. When a tuple is added to $M$, this field is initialised to zero and accumulates the distance travelled by the camera since initialisation. A threshold of this distance field can then be used to remove old tuples.
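The following sketch shows one way the NN query (Equations 2-4) could be realised. The tuple layout, the values of K and D, the probability threshold, and the collapsing of the 15-dimensional feature vector into the univariate Gaussian of Equation 2 are all assumptions.

```python
import numpy as np

def nn_search_ranges(f, M, K=10, D=2.0, p_threshold=0.5):
    """Query the NN relationship model. M is a list of tuples
    (features, category, value, distance_since_added). Returns a dict mapping
    each selected category to the (min, max) value range of Equations 3-4."""
    f = np.asarray(f, dtype=float)
    probs, stats = {}, {}
    for c in ("flat", "side", "vertical"):
        entries = [(np.asarray(feat, float), val) for feat, cat, val, _ in M if cat == c]
        if len(entries) < K:
            continue
        feats = np.stack([e[0] for e in entries])
        vals = np.array([e[1] for e in entries], dtype=float)
        nearest = np.argsort(np.linalg.norm(feats - f, axis=1))[:K]   # K Euclidean nearest neighbours
        n_feats, n_vals = feats[nearest], vals[nearest]
        sigma_n = n_feats.std() + 1e-9                                # spread of neighbour features (Equation 2)
        dist = np.linalg.norm(f - n_feats.mean(axis=0))               # deviation of f from the neighbour mean
        probs[c] = np.exp(-dist ** 2 / (2 * sigma_n ** 2)) / np.sqrt(2 * np.pi * sigma_n ** 2)
        stats[c] = (n_vals.mean(), n_vals.std())                      # 3D value statistics for Equations 3-4
    total = sum(probs.values()) or 1.0
    ranges = {}
    for c, p in probs.items():
        p = p / total                                                 # normalise the category probabilities
        if p >= p_threshold:
            mu_v, sigma_v = stats[c]
            half = D * sigma_v * (1.0 - p)
            ranges[c] = (mu_v - half, mu_v + half)                    # Equations 3 and 4
    return ranges
```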
3.2 Nearest Neighbour with Error
Correction (NNEC)
The second implementation proceeds as the previous NN method, but with the addition that, after recovery has been performed, the SAD value associated with the best set of quadrilateral corners ($BQ_{3D}$) is computed:

$$min_{SAD} = SAD(S_1, SQR(R(BQ_{3D}, I_2), I_2)) \quad (5)$$

where everything has the same meaning as in Equation 1. If $min_{SAD}$ is greater than a pre-determined threshold, the value ranges selected for processing by the nearest neighbour metric are deemed to have been inappropriate and recovery is performed again, using all value ranges. The subsequent result is added to the tuple list as before.
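A sketch of the NNEC fall-back is given below. Here `recover_plane` stands for any search over candidate quadrilaterals returning the best corners and their SAD (for instance, the Section 2 sketch with its helpers bound); the SAD threshold value is an illustrative assumption.

```python
def nnec_recover(S1, I2, reduced_quads, all_quads, recover_plane, sad_threshold=5000.0):
    """Sketch of the NNEC error correction (Equation 5): recover within the
    NN-reduced ranges first; if the best match is still a poor fit, redo the
    search over all value ranges."""
    best_quad, best_sad = recover_plane(S1, I2, reduced_quads)
    if best_quad is None or best_sad > sad_threshold:
        # selected ranges deemed inappropriate: reprocess using all value ranges
        best_quad, best_sad = recover_plane(S1, I2, all_quads)
    return best_quad, best_sad
```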
3.3 Feature and Value Gaussians (FVG)
The relationship model of this implementation builds
a set of Gaussian distributions for the 2D image fea-
tures encountered. Similar 2D image features are rep-
resented by a single multi-dimensional Gaussian dis-
tribution. Each of these feature distributions is asso-
ciated with value range distributions, each represent-
ing similar 3D values that have been recovered for the
corresponding 2D image features. Each value distri-
bution also has an associated category.
For any new feature recovered, the probability that the corresponding image features ($f$) belong to each of the feature distributions is computed:
$$p_{FD} = \frac{1}{\sqrt{2\pi\sigma_{FD}^2}}\, e^{-\frac{(f - \mu_{FD})^2}{2\sigma_{FD}^2}} \quad (6)$$
where $p_{FD}$ is computed for each of the feature distributions, $FD$ is the current feature distribution under consideration, and $\mu_{FD}$ and $\sigma_{FD}$ are the mean and standard deviation of $FD$ respectively. For each $p_{FD}$ greater than a pre-determined threshold, the associated value distributions are each considered in turn and used to determine a value range to process, using the minimum and maximum computed in a similar manner to Equations 3 and 4. If $VR$ represents the set of all values to process for a new feature, the best fitting plane for the considered 3D scene recovery method is computed using:
$$P_{3D} = \min_{Q_{3D} \in VR} \{ SAD(S_1, SQR(R(Q_{3D}, I_2), I_2)) \} \quad (7)$$
If the SAD value associated with $P_{3D}$ is greater than a threshold, reprocessing proceeds as for the NNEC method. Under this scheme, there are three possible ways in which the model can be updated. These are demonstrated in Figure 4, where green represents an update and red means a new distribution is added. Model degradation is performed by storing a distance since last update with each distribution, where distance is in terms of camera movement. If this distance exceeds a threshold, the corresponding Gaussian is removed.
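The FVG model might be sketched as below. The incremental update rules, the fixed initial standard deviations, the probability threshold and the scale D are assumptions; model degradation (removing distributions whose distance since last update exceeds a threshold) is omitted for brevity.

```python
import numpy as np

class FVGModel:
    """Sketch of the FVG relationship model: feature Gaussians, each linked to
    category-tagged value Gaussians over previously recovered 3D values."""

    def __init__(self, p_threshold=0.5, D=2.0):
        self.p_threshold, self.D = p_threshold, D
        self.dists = []   # each: {"mu", "sigma", "n", "values": [{"cat", "mu", "sigma"}]}

    def _prob(self, f, dist):
        # Equation 6, with the feature vector collapsed to its distance from the mean
        d = np.linalg.norm(np.asarray(f, dtype=float) - dist["mu"])
        s = dist["sigma"] + 1e-9
        return np.exp(-d ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

    def search_ranges(self, f):
        """Value ranges from all value distributions of matching feature
        distributions (minimum/maximum as in Equations 3 and 4)."""
        ranges = []
        for dist in self.dists:
            p = self._prob(f, dist)
            if p < self.p_threshold:
                continue
            for v in dist["values"]:
                half = self.D * v["sigma"] * (1.0 - min(p, 1.0))
                ranges.append((v["cat"], v["mu"] - half, v["mu"] + half))
        return ranges

    def update(self, f, cat, value, reprocessed):
        """Three update cases of Figure 4: no matching feature distribution,
        a match that required value reprocessing, or a full match."""
        f = np.asarray(f, dtype=float)
        matches = [d for d in self.dists if self._prob(f, d) >= self.p_threshold]
        if not matches:            # (a) add new feature and value distributions
            self.dists.append({"mu": f.copy(), "sigma": 1.0, "n": 1,
                               "values": [{"cat": cat, "mu": float(value), "sigma": 1.0}]})
        elif reprocessed:          # (b) feature matched but value was wrong: add a value distribution
            matches[0]["values"].append({"cat": cat, "mu": float(value), "sigma": 1.0})
        else:                      # (c) feature and value matched: update the existing Gaussians
            d = matches[0]
            d["n"] += 1
            d["mu"] += (f - d["mu"]) / d["n"]
```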
4 EXPERIMENTAL RESULTS
The data used for experimentation consists of high-definition (i.e. 1920 × 1080 pixels) image frames, captured from a forward-facing camera mounted on a train. In total, 5 sequences totalling 520 image frames were used. Each image frame was ground truthed by hand, matching approximately 850 features between image pairs.
Figure 4: Graphical representation of the three different types of update in FVG: (a) no feature matching; (b) feature matching but value reprocessing; (c) feature matching and no value reprocessing.
With regards to TSR methods presented in section
3, Figure 5(a) presents the average number of com-
binations checked per quadrilateral recovered in each
frame and Figure 5(b) the accuracy of each method,
where Exhaustive refers to the unaltered method de-
scribed in section 2. The results show that each of the
methods implementing the TSR extensions provides a
decrease in the number of combinations checked per
recovered quadrilateral whilst maintaining similar ac-
curacy. However, this reduction sometimes comes at
a cost. For example, the NN method produces less
accurate scene recovery results. This is because the
NN relationship model can converge. Consider a synthetic image sequence comprising only a textured wall and floor plane, such as in Figure 6. The sequence was created such that in the first 10 images the wall has a fixed x-coordinate of 400; then, in the 19th image, the wall was created with an x-coordinate of 0.
In the first 18 images of the sequence, the whole
image was processed and used to update the NN
model. In the 19th image only a square of the wall
Figure 5: Comparison of the number of checks made and difference with the ground truthed scene per method, averaged over all 5 sequences: (a) average number of quadrilaterals checked per sequence frame; (b) average difference between recovered sequence frames and ground truth.
Figure 6: Synthetic image sequence example comprising a floor and wall plane.
Figure 7: Number of combinations processed by the NNEC method for the oscillating wall sequence.
plane was recovered. As expected, the side plane cat-
egory was chosen. However, because all previous side
planes processed are of width 400, the value range
chosen to process is 390.50 to 421.04. Clearly this is
incorrect. The error occurs because for a large part of
the sequence one set of 2D image features maps to one
specific 3D value. Hence, the model converges for
these 2D image features. When the 3D value changes
for this specific set of features the model cannot rep-
resent this, resulting in possible error.
NNEC resolves this convergence issue. However, because of the nearest neighbour selection scheme,
this method can produce oscillatory behaviour. For
example, consider a similar synthetic image se-
quence, except in this one the x-coordinate for the
wall is 400 pixels for 20 frames, then 0 for 20 frames
and 400 pixels for a further 20 frames. When the wall
plane value changes for the first time, error correc-
tion is invoked and slowly more correct members are
added to the pool of nearest neighbours, but when the
wall plane value reverts to the original value, the same process repeats, as highlighted in Figure 7.
The FVG method avoids these problems because
the multiple associated value distributions can repre-
sent different values the features have been mapped to
in the sequence so far.
5 CONCLUSIONS
We have presented a general update to 3D scene recovery methods which takes advantage of temporal information to increase efficiency. Three different implementations were provided and applied to an existing 3D scene recovery method, of which the simple nearest neighbour methods are affected by the problems of convergence and oscillatory behaviour. The Gaussian model presented copes well with both of these problems, reducing the search space traversal by an order of magnitude while maintaining accuracy of recovered scenes. Now that we have demonstrated the advantages and pitfalls of these methods, we wish to further investigate the benefits of the TSR concept, integrating it with other methods and applying it to more challenging data.
REFERENCES
Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008).
Surf: Speeded up robust features. Computer Vision
and Image Understanding (CVIU), 110(3):346–359.
Chekhlov, D. and Mayol-Cuevas, W. (2008). Appearance
based indexing for relocalisation in real-time visual
slam. In 19th British Machine Vision Conference,
pages 363–372.
Davison, A. J. (2003). Real-time simultaneous localization and mapping with a single camera. In Proc. International Conference on Computer Vision, pages 1403–1411.
Davison, A. J., Reid, I. D., Molton, N. D., and Stasse, O.
(2007). Monoslam: Real-time single camera slam. In
IEEE Transactions on Pattern Analysis and Machine
Intelligence, volume 29, pages 1–15.
Fabbri, R. and Kimia, B. (2010). 3d curve sketch: Flexible curve-based stereo reconstruction and calibration. In 2010 IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1545.
Favaro, P., Jin, H., and Soatto, S. (2003). A semi-direct ap-
proach to structure from motion. In The Visual Com-
puter, volume 19, pages 377–384.
Grinberg, M., Ohr, F., Willersinn, D., and Beyerer, J. (2010). Feature-based probabilistic data association and tracking. In The 7th International Workshop on Intelligent Transportation (WIT 2010), pages 29–34.
Klein, G. and Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In Proceedings of the Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07).
Li, J., Li, E., Chen, Y., Xu, L., and Zhang, Y. (2010). Bundled depth-map merging for multi-view stereo. In 2010 IEEE Conference on Computer Vision and Pattern Recognition, pages 2769–2776.
Li, P., Farin, D., Gunnewiek, R. K., and de With, P. H. N. (2006). On creating depth maps from monoscopic video using structure from motion. In 27th Symposium on Information Theory, pages 508–515.
Warsop, T. E. and Singh, S. (2010). Robust three-
dimensional scene recovery from monocular image
pairs. In 9th IEEE International Conference on Cy-
bernetic Intelligent Systems 2010 (CIS 2010), pages
112–117.
Zhang, G., Dong, Z., Jia, J., Wong, T.-T., and Bao, H.
(2010). Efficient non-consecutive feature tracking for
structure-from-motion. In ECCV 2010, pages 422–
435.
Zhang, G., Jia, J., Wong, T.-T., and Bao, H. (2009). Consistent depth maps recovery from a video sequence. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 31, pages 974–988.