curve for each point detector when averaged across
all datasets:
Dataset / Error Type   E_rot (θ)   E_dist   E_proj (%)
Harris                 140         1.63     0.81
SIFT                   149         1.70     0.85
Saliency               169         1.72     0.88
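This excerpt does not restate how E_rot is computed. Assuming it is the geodesic angular distance between the estimated and ground-truth camera rotations (one of the standard metrics compared by Huynh, 2009) — an illustrative assumption, not the paper's stated definition — a minimal sketch is:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices.

    Illustrative assumption: the paper does not define E_rot in this
    excerpt, so this implements the common geodesic distance.
    """
    # Relative rotation between estimate and ground truth.
    R_rel = R_est @ R_gt.T
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # guard against numerical drift
    return np.degrees(np.arccos(cos_theta))
```

For example, a 90° rotation about the z-axis measured against the identity gives an error of 90°.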
Whilst our method worked well on the Stanford dragon, it was outperformed by the Harris corner detector and SIFT; this may be due to the large number of small corners on the model, to which those methods are better suited. In particular, Kadir-Brady saliency may not work as well for small features, since their entropy may not be peaked in scale space (the entropy is high at the smallest scale and then decreases). Many of the results could potentially have been localised to within 1° had a better method been used in the refinement stage. In particular, numerous edges were extracted from images of the temple, producing a lot of noise for the refinement. This could be addressed by, for example, using mutual information for refinement, as in (Mastin et al., 2009).
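The scale-space argument above can be sketched in code. Assuming a simple histogram-based local entropy (the function names, window shape, and bin count below are illustrative, not taken from the paper), Kadir-Brady selects scales at which the entropy peaks; a small feature whose entropy is highest at the smallest scale and decays monotonically has no interior peak, and hence no salient scale:

```python
import numpy as np

def local_entropy(image, center, radius, bins=16):
    """Shannon entropy of the grey-level histogram inside a disc of given radius."""
    yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    hist, _ = np.histogram(image[mask], bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking logs
    return -np.sum(p * np.log2(p))

def has_interior_entropy_peak(image, center, scales):
    """True if entropy peaks at some interior scale (a salient scale exists);
    False for the monotonic decay described for small features."""
    H = [local_entropy(image, center, r) for r in scales]
    return any(H[i] > H[i - 1] and H[i] > H[i + 1] for i in range(1, len(H) - 1))
```

Sampling the entropy at scales below, at, and above a feature's size makes the presence or absence of such a peak easy to check on synthetic data.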
5 CONCLUSIONS
In this paper we have presented a generalisation of Kadir-Brady saliency to an arbitrary number of dimensions and provided a novel curvature-based extension to 3D. Further, we have used this as a filter for the 2D/3D registration problem and shown saliency to be a superior filter for point-based registration, demonstrating its consistency across different modalities. This is due to its principled information-theoretic approach; however, it is only a proxy for true saliency, and improvements can be made here. Future work may include line- or curve-based saliency, estimating the focal length of the camera as well as its extrinsics, and the extension of the principle of saliency-based filtering to other multi-modal problems.
ACKNOWLEDGEMENTS
This research was carried out with the financial support of the EPSRC and the EU ICT FP7 project IMPART (grant agreement No. 316564).
REFERENCES
Arvo, J. (1992). Fast random rotation matrices. In Graphics Gems III, pages 117–120.
Chung, A. J., Deligianni, F., Hu, X.-P., and Yang, G.-Z.
(2004). Visual feature extraction via eye tracking for
saliency driven 2D/3D registration. In Proc. of the
2004 symposium on Eye tracking research & applica-
tions, ETRA ’04, pages 49–54.
David, P., DeMenthon, D., Duraiswami, R., and Samet, H. (2002). SoftPOSIT: Simultaneous pose and correspondence determination. In Proc. of the 7th European Conference on Computer Vision-Part III, ECCV '02, pages 698–714.
Egnal, G. (2000). Mutual information as a stereo correspon-
dence measure. Technical report, Univ. of Pennsylva-
nia.
Enqvist, O., Josephson, K., and Kahl, F. (2009). Optimal
correspondences from pairwise constraints. In Inter-
national Conference on Computer Vision.
Fitzgibbon, A. W. (2001). Robust registration of 2D and
3D point sets. In British Machine Vision Conference,
pages 662–670.
Gendrin, C., Furtado, H., Weber, C., Bloch, C., Figl, M., Pawiro, S. A., Bergmann, H., Stock, M., Fichtinger, G., Georg, D., and Birkfellner, W. (2011). Monitoring tumor motion by real-time 2D/3D registration during radiotherapy. Radiotherapy and Oncology.
Granger, S. and Pennec, X. (2002). Multi-scale EM-ICP: A fast and robust approach for surface registration. In European Conference on Computer Vision (ECCV 2002), volume 2353 of LNCS, pages 418–432.
Guillemaut, J.-Y. and Hilton, A. (2011). Joint multi-layer
segmentation and reconstruction for free-viewpoint
video applications. Int. J. Comput. Vision, 93(1):73–
100.
Huynh, D. Q. (2009). Metrics for 3D rotations: Comparison and analysis. J. Math. Imaging Vis., 35(2):155–164.
Itti, L. (2000). A saliency-based search mechanism for overt
and covert shifts of visual attention. Vision Research,
40(10-12):1489–1506.
Judd, T., Durand, F., and Torralba, A. (2012). A benchmark
of computational models of saliency to predict human
fixations. Technical report, MIT.
Kadir, T. and Brady, M. (2001). Saliency, scale and image
description. Int. J. Comput. Vision, 45(2):83–105.
Kotsas, P. and Dodd, T. (2011). A review of methods for 2D/3D registration. World Academy of Science, Engineering and Technology.
Lee, C. H., Varshney, A., and Jacobs, D. W. (2005). Mesh
saliency. ACM Trans. Graph., 24(3):659–666.
Mastin, A., Kepner, J., and Fisher III, J. W. (2009). Auto-
matic registration of lidar and optical images of urban
scenes. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on.
Moreno-Noguer, F., Lepetit, V., and Fua, P. (2008). Pose
priors for simultaneously solving alignment and cor-
respondence. In Proc. of the 10th European Confer-
ence on Computer Vision: Part II, ECCV ’08, pages
405–418.
Seitz, S. M., Curless, B., Diebel, J., Scharstein, D., and Szeliski, R. (2006). A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006).

VISAPP 2014 - International Conference on Computer Vision Theory and Applications