Depth-Scale Method in 3D Registration of RGB-D Sensor Outputs

İsmail Bozkurt and Kemal Egemen Özden

Faculty of Engineering, Bahçeşehir University, İstanbul, Turkey

Keywords: 3D Registration, Kinect, Local Image Features and Matching.

Abstract: Automatic registration of 3D scans with RGB data is studied in this paper. In contrast to the bulk of research in

the field, which exploits 3D geometric consistency, local RGB image feature matches are used to solve for the

unknown 3D rigid transformation. The key novelty in this work is the introduction of a simple new measure

that we call the “Depthscale measure”; it represents the size of the local image features in the 3D world,

thanks to the availability of depth data from the sensor. We show through experimental results that, depending on the

operating characteristics of the target application, this measure can be useful and efficient in eliminating outliers.

System-level details are also given to help researchers who want to build a similar

system.

1 INTRODUCTION

Registering 3D scans of the same rigid environment

from different vantage points is an old computer

vision problem which is still an active research field.

Various applications of such techniques include 3D

modeling, 3D model retrieval and robot navigation.

The research field has recently gained new momentum

thanks to the availability of cheap RGB-D sensors (e.g.

Microsoft Kinect). In this paper, we study a simple

3D registration system using data coming from

Kinect-like sensors.

Much of the active research in the field is

dedicated to registering and associating 3D scans

based on their global/partial 3D geometric

consistency. As a complementary approach, a

system using RGB local features has its own

advantages: (i) Environments which lack

sufficient 3D geometric variation but have color variation

can easily be registered. (ii) Partial matching of the

3D scans is naturally handled. (iii) Detection and

matching of local image features is a well-understood

problem, and many open-source/free

feature detection libraries exist, which makes

building such a system very easy. In a nutshell, our

system detects local image features in 3D scans,

matches them and registers the 3D models into a

cumulative model by using 3-point 3D similarity

transformation estimation and RANSAC sampling.

The key novelty in our approach is the use of a

new measure that we call the “Depthscale”

measure, which is easy to compute and gives extra

information about the matchability of the image

features. It is simply the product of the

detected scale of an image feature and the associated

depth. Assuming the internal camera parameters of

the sensor do not change between scans, this

measure represents (up to a constant factor) the size of the local

image feature in 3D space. Being a single integer or

floating-point value, it can be used efficiently to reject

false matches, as an alternative (or in addition) to

computing a similarity measure between two full feature

descriptor vectors (e.g. 128-byte vectors in SIFT).

The operation characteristics of the target

application impose different constraints on the

system. For example, a robotic navigation system

with limited computational power and real-time

mapping and localization requirements will prefer a

small number of matches with a high inlier ratio, to

minimize the number of RANSAC iterations. In contrast,

3D modelling applications running on workstations,

which need good accuracy, may prefer a

high number of matches, even if the inlier ratio is

low, in order to get as many inliers as possible. We

will show the usefulness of the Depthscale measure for

selecting the proper operating mode.

During experimental evaluation, we analyzed the

effect of the Depthscale measure with respect to Lowe’s

second closest neighbor method. We also examined

results which combine the two methods. As a novel


Bozkurt I. and Özden E.

Depth-Scale Method in 3D Registration of RGB-D Sensor Outputs.

DOI: 10.5220/0004673204700475

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 470-475

ISBN: 978-989-758-003-1

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

approach, we cast the problem of outlier elimination

of a match-set as a binary classification problem.

Hence we could show the results as ROC (Receiver

Operating Characteristics) curves.

Even though the system is simple to construct,

considering the popularity of the Kinect platform,

implementation details to be presented will still be

useful for future developers of a similar system.

The paper will proceed as follows. First we will

give a brief mention of the related literature. Second,

we will summarize main concepts and methods we

deployed. Third, we will give the system details and

various ways of exploiting our Depthscale method.

Afterwards we will share practical observations

made while building such a system, especially those relevant to

the Kinect environment. Following the experiments

section, the paper is concluded.

2 LITERATURE

Matching and registering 3D models is an old

computer vision problem. A common approach

begins with creating a rough alignment, typically

using PCA or manually, then applies a variation of the

Iterative Closest Point (ICP) (Besl and McKay,

1992) algorithm. A recent overall pipeline was

introduced by Microsoft for Kinect systems

(Shahram et al., 2011). There are known problems

with such approaches. First of all, partial overlap

makes PCA-based alignment problematic, whereas

ICP requires a good initial rough alignment. Also, the

standard form of ICP is not immune to errors in the

geometry, though robust extensions to ICP exist

(Fitzgibbon, 2001). To overcome this problem,

researchers have applied various 3D depth-based local features

inspired by their RGB-based counterparts (Bronstein et

al., 2010). However, all those approaches suffer in

the case of degenerate surface geometry. For example, in

a planar scene, all of the above approaches will fail.

A known method to stabilize degenerate

geometric configurations is to introduce RGB

information during registration, e.g. (Craciun et al.,

2010). We build our system on local intensity

features, which introduce robustness against a lack

of geometric variation and overlap while speeding

up the registration. Such local features, initially

popularized by (Lowe, 2004), have been widely

studied (Tuytelaars and Mikolajczyk, 2010).

Work by (Wu et al., 2008) uses depth information to

estimate 3D local image features which can be used

for 3D registration; however, it requires

rendering the 3D model from different directions. Our

Depthscale method can be used in conjunction with

any available feature detector as long as it

gives an invariant support area for the feature.

3 SYSTEM

The system follows the standard pipeline which is

typically used in 2D image matching and mosaicing.

The following procedure is repeated until

all the 3D scans are registered to a global model.

Step 1: Detect local features in the RGB images

of two 3D scans.

Step 2: For each feature in the first image, find its k nearest

neighbours (knn) in the second image.

Step 3: Use Depthscale and/or Lowe’s second

nearest neighbour technique to reduce the number of false

matches.

Step 4: Use RANSAC and 3-point 3D

registration algorithm to robustly estimate 3D rigid

transformation.

Step 5: Apply the estimated transformation to the

second scan and merge it with the previous overall

3D model collected so far.

The system is bootstrapped with two 3D scans

and new 3D scans are added incrementally to the

current reconstruction. Currently it is assumed that

the 3D scans to be registered are ordered in a way

that consecutive shots overlap. For local features, the

SURF (Bay et al., 2008) detector and descriptor

package is used, and for knn search the FLANN (Muja and

Lowe, 2009) library as bundled with the (opencv, 2013) library.

The other sub-components are described below,

leaving the Depthscale method to the last since it

is the main novelty of the work.

3.1 RANSAC

RANSAC (Fischler and Bolles, 1981) is a classical

robust estimation technique which eliminates

outliers and keeps geometrically consistent data in

an over-constrained setting. By repeatedly sampling the minimal

number of data elements needed to describe the target

parametric model, it reaches a stable solution which

gives the highest number of inliers. The classical

analysis states that the required number of iterations

to guarantee a good solution with a certain

probability depends on the inlier ratio of the data set.

However, since the data itself is noisy, for good

estimation we would like to have as many data

points as possible. Hence we prefer to keep both the

inlier ratio and the number of inliers high in a system.
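
The dependence of the iteration count on the inlier ratio can be made concrete with the classical RANSAC bound. The following is a minimal sketch, not code from the described system; the function name `ransac_iterations` and the default confidence of 0.99 are assumptions chosen for illustration, while the sample size of 3 matches the 3-point registration used here.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=3, confidence=0.99):
    """Number of random minimal samples needed so that, with probability
    `confidence`, at least one sample contains only inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    # Probability that one minimal sample is outlier-free.
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


if __name__ == "__main__":
    for ratio in (0.2, 0.5, 0.8):
        print(ratio, ransac_iterations(ratio))
```

Under this bound, raising the inlier ratio of the match-set before RANSAC sharply reduces the number of iterations, which is why pre-filtering matches matters for time-constrained applications.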

Depth-ScaleMethodin3DRegistrationofRGB-DSensorOutputs

471

3.2 3D Rigid Transformation Estimation

The main parametric model we want to fit to our

data is a 3D rigid transform, which can minimally be

estimated from 3 point correspondences in 3D space. Since we know

the 3D location (with respect to the local sensor frame) of the local

features that are detected in the images, we can

directly feed them into our transformation estimation

method. We used the SVD-based method (Eggert et al., 1997)

to estimate this rigid motion. It is a flexible method

that can handle both the minimal 3-point

configuration required by RANSAC and the general n-point

case.
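
As an illustration of the SVD-based estimation, the sketch below implements the standard least-squares solution based on the SVD of the cross-covariance matrix, which is one of the algorithms compared in (Eggert et al., 1997). It is not the authors' code: the function name `estimate_rigid_transform` and the use of NumPy are assumptions, and the same routine is used for both the 3-point and the n-point case.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~ R @ src + t.

    src, dst: arrays of shape (n, 3) with n >= 3 non-collinear corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


if __name__ == "__main__":
    angle = np.deg2rad(30.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle), np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.1, -0.2, 0.3])
    points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    R_est, t_est = estimate_rigid_transform(points, points @ R_true.T + t_true)
    print(np.round(R_est, 3), np.round(t_est, 3))
```

Inside a RANSAC loop, this function would be called on 3 randomly drawn matches to produce a hypothesis, and once more on all inliers of the best hypothesis to refine the result.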

3.3 Lowe’s Second Nearest Neighbour Method

As described by (Lowe, 2004), due to repeating

patterns and various geometric deformations,

different features may look alike; hence the

closest match is not always the right match. In order

to decrease the number of such outliers, after

computing the knn (k=2) neighbours, he proposed to

compare the descriptor distance of the closest neighbour to

that of the second closest. It is expected that the second closest

neighbour is much farther away for correct matches. This

comparison is implemented as a threshold

between 0 and 1 on the ratio of the two distances. Thresholds which are slightly less

than 1 mean that the system almost always takes the

closest feature as a match, practically making the

measure ineffective. On the other hand, thresholds

towards 0 result in an overly strict measure which

finds almost no matches. From now on we will call this

method Lowe’s measure for short.
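
Lowe's measure can be sketched in a few lines. The code below is an illustrative example rather than the implementation used in the system; the function name `lowe_ratio_filter` and the input layout (a list of query indices with their two nearest neighbours and descriptor distances) are assumptions. The default threshold of 0.8 follows the value suggested in (Lowe, 2004).

```python
def lowe_ratio_filter(knn_matches, threshold=0.8):
    """Keep a match only if its nearest neighbour is clearly closer than the second.

    knn_matches: list of (query_index, neighbours), where neighbours is a list of
    (train_index, descriptor_distance) pairs sorted by increasing distance (k = 2).
    Returns a list of accepted (query_index, train_index) pairs.
    """
    kept = []
    for query_index, neighbours in knn_matches:
        if len(neighbours) < 2:
            continue  # the ratio is undefined without a second neighbour
        (best_index, best_dist), (_, second_dist) = neighbours[0], neighbours[1]
        if best_dist < threshold * second_dist:
            kept.append((query_index, best_index))
    return kept


if __name__ == "__main__":
    knn = [
        (0, [(5, 0.20), (7, 0.90)]),  # distinctive match
        (1, [(3, 0.50), (4, 0.55)]),  # ambiguous match
    ]
    print(lowe_ratio_filter(knn, threshold=0.8))
```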

3.4 Depthscale Method

This is the main contribution of the paper. Similar to

Lowe’s measure, we propose a new method, which is

simple to compute, to decrease the false matches in a match-set.

Assuming the 3D scans are taken

with the same sensor, we can ignore the effects of

the RGB camera’s internal calibration matrix and state

that the following value:

$$DS = \mathrm{Depth} \times \mathrm{Scale} \qquad (1)$$

represents the scale of the feature’s support area in

the 3D world, where the feature’s depth comes from

the 3D sensor and the scale is the size of the support

area of the feature in the intensity image. The

equation results directly from inverting the

perspective projection. Features that belong to the

same 3D point can be expected to have the same DS.

This measure is applicable to any feature

detector which returns an invariant support area. In

contrast to long feature descriptors (e.g. SIFT is a

128-byte vector), Depthscale can be represented with

a single integer, which makes it more efficient to

compare. Also, since the scale information it

encapsulates is not related to the intensity-based feature

descriptor, it gives extra information that can be

exploited during matching. However, due to its

simplicity, it is not as descriptive as intensity

features.
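
The step from perspective projection to the Depthscale value can be spelled out as follows. This is a sketch under a pinhole camera model with a roughly fronto-parallel support area; the symbols $f$ (focal length in pixels), $Z$ (depth), $s$ (feature scale in pixels) and $S$ (metric size of the support area) are notation introduced here for illustration and do not appear in the original text.

```latex
s = \frac{f\,S}{Z}
\quad\Longrightarrow\quad
S = \frac{Z\,s}{f}
\quad\Longrightarrow\quad
DS = \mathrm{Depth} \times \mathrm{Scale} = Z\,s = f\,S
```

Since $f$ is constant for a given sensor, $DS$ is proportional to $S$, so two observations of the same surface patch should yield approximately the same $DS$ regardless of the viewing distance.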

Figure 1: Typical NDSD distribution for a test set.

4 EXPLOITING DEPTHSCALE INFORMATION

Depthscale information can be deployed in different

ways for the purpose of more accurate feature

matching. In the first

subsection below we describe our basic approach to

comparing DS values given a feature match. In the

second subsection we describe how

to combine it with intensity-based descriptors.

4.1 Comparing Depthscale Values

Noise needs to be taken into account in any robust

vision system. Looking at Eq. 1, any error in the number

of pixels of the image scale will be multiplied by

the depth of the feature. Hence we must take into

account the magnitude of the Depthscale when

comparing differences. Since large depths and

large scales give larger errors due to the

multiplication, we propose normalizing the DS

difference by the average DS of the two compared values

in a feature match, which gives stable results. The

formulation below describes the normalized Depthscale

distance (NDSD) between the two features of a single

match:

$$NDSD = \frac{\lvert DS_1 - DS_2 \rvert}{(DS_1 + DS_2)/2} \qquad (2)$$

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

472

Figure 1 shows the NDSD distribution of the inliers

taken from an experiment (without taking the

absolute value). Different test sets result in similar

histograms. As can be seen, an NDSD of 0.15-0.2 seems

to be a good threshold for eliminating outliers.
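
The Depthscale comparison can be sketched as a small filter. The code is illustrative only: the function names, the representation of a feature as a (depth, scale) pair and the handling of missing depth (a non-positive mean DS is treated as an infinite distance) are assumptions. The default threshold of 0.2 is taken from the range suggested above.

```python
def depthscale(depth, scale):
    """Depthscale (DS) of a feature: sensor depth times detected image scale."""
    return depth * scale


def ndsd(ds_a, ds_b):
    """Normalized Depthscale distance: absolute DS difference over the mean DS."""
    mean_ds = 0.5 * (ds_a + ds_b)
    if mean_ds <= 0.0:
        return float("inf")  # assumption: missing or invalid depth never matches
    return abs(ds_a - ds_b) / mean_ds


def filter_matches_by_ndsd(matches, threshold=0.2):
    """Keep matches whose two features have similar Depthscale values.

    matches: iterable of ((depth_a, scale_a), (depth_b, scale_b)) pairs.
    """
    kept = []
    for feat_a, feat_b in matches:
        if ndsd(depthscale(*feat_a), depthscale(*feat_b)) <= threshold:
            kept.append((feat_a, feat_b))
    return kept


if __name__ == "__main__":
    matches = [
        ((1.0, 20.0), (2.0, 10.5)),  # same patch seen from twice the distance
        ((1.0, 20.0), (2.0, 20.0)),  # inconsistent physical size
    ]
    print(filter_matches_by_ndsd(matches))
```

Because the test needs only one multiplication per feature and one division per match, it is much cheaper than comparing full descriptor vectors.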

4.2 Combining Depthscale and Lowe’s Measure

In its simplest form, the Depthscale method can be a

simple and efficient way to increase the quality of the

match-set. However, it would be desirable to

combine Lowe’s measure and the Depthscale method to

increase matching performance. Here we will focus

on the simplest way of fusion, leaving

more advanced approaches to future work. Since there

are already tools to find the knn set for intensity

features, we introduce the Depthscale measure into

this pipeline. After finding the knn neighbourhood in

intensity feature space, we apply Depthscale

elimination to the closest neighbours with a certain

threshold. After that, Lowe’s measure is applied. In a

sense, we filter the match-set first with the Depthscale

method, then with Lowe’s measure, resulting in an

“AND” operation.
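
The sequential filtering can be sketched as follows. This is a hypothetical illustration, not the authors' code: the dictionary keys (`ds_query`, `ds_best`, `dist_best`, `dist_second`), the function names and the default thresholds are all assumptions made for the example.

```python
def ndsd(ds_a, ds_b):
    """Normalized Depthscale distance between two DS values."""
    mean_ds = 0.5 * (ds_a + ds_b)
    if mean_ds <= 0.0:
        return float("inf")  # assumption: invalid depth never matches
    return abs(ds_a - ds_b) / mean_ds


def combined_filter(candidates, ds_threshold=0.2, ratio_threshold=0.8):
    """Apply the Depthscale test, then Lowe's ratio test (logical AND).

    Each candidate is a dict with the DS of the query feature and of its closest
    neighbour, plus the descriptor distances to the two nearest neighbours.
    """
    kept = []
    for cand in candidates:
        # Stage 1: cheap Depthscale check on the closest neighbour.
        if ndsd(cand["ds_query"], cand["ds_best"]) > ds_threshold:
            continue
        # Stage 2: Lowe's ratio check on the surviving candidates.
        if cand["dist_best"] >= ratio_threshold * cand["dist_second"]:
            continue
        kept.append(cand)
    return kept


if __name__ == "__main__":
    candidates = [
        {"ds_query": 20.0, "ds_best": 21.0, "dist_best": 0.2, "dist_second": 0.9},
        {"ds_query": 20.0, "ds_best": 40.0, "dist_best": 0.2, "dist_second": 0.9},
        {"ds_query": 20.0, "ds_best": 21.0, "dist_best": 0.5, "dist_second": 0.55},
    ]
    print(len(combined_filter(candidates)))
```

Running the cheaper Depthscale test first means that Lowe's ratio only has to be evaluated for candidates that are already consistent in physical size.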

5 PRACTICAL OBSERVATIONS

Finding the 3D locations of local feature points relative

to the sensor coordinate system is fundamental to our

approach. However, Kinect’s depth sensor and RGB

sensor are physically separated, which requires an

alignment between them. The Microsoft Kinect SDK

provides a function that aligns the depth frame with

the RGB frame. However, as we observed empirically,

that function does not provide a robust alignment

(especially for depth values of less than 1

meter). Therefore we decided to map all the x, y

and depth points of the depth frame through pre-calculated

calibration matrices. The transformation

requires applying the inverse internal calibration matrix

of the depth camera, the 3D rigid transformation

between the two sensor frames, and the internal calibration

matrix of the RGB sensor.
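
The three-stage mapping described above can be sketched with NumPy as follows. The function name and all calibration values in the usage example are placeholders invented for illustration; real values must come from a calibration of the specific device.

```python
import numpy as np


def depth_pixel_to_rgb_pixel(u, v, depth, K_depth, R, t, K_rgb):
    """Map a depth-image pixel (u, v) with metric depth into the RGB image.

    K_depth, K_rgb: 3x3 internal calibration matrices of the two cameras.
    R, t: rotation (3x3) and translation (3,) from the depth frame to the RGB frame.
    Returns (u_rgb, v_rgb, depth_in_rgb_frame).
    """
    # 1) Back-project the pixel to a 3D point in the depth camera frame.
    point_depth = depth * (np.linalg.inv(K_depth) @ np.array([u, v, 1.0]))
    # 2) Move the point into the RGB camera frame.
    point_rgb = R @ point_depth + t
    # 3) Project with the RGB camera's internal calibration matrix.
    uvw = K_rgb @ point_rgb
    return uvw[0] / uvw[2], uvw[1] / uvw[2], point_rgb[2]


if __name__ == "__main__":
    # Placeholder calibration values, for illustration only.
    K = np.array([[580.0, 0.0, 320.0],
                  [0.0, 580.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)
    t = np.array([0.025, 0.0, 0.0])  # hypothetical 2.5 cm offset between sensors
    print(depth_pixel_to_rgb_pixel(320.0, 240.0, 1.0, K, R, t, K))
```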

Kinect’s depth sensor’s noise character is worth

mentioning. It is basically a structured light

technique where the projector and the sensor are

separated by a fixed baseline. Hence one factor

affecting the noise is the distance of the target: the farther the

observed location is from the sensor, the noisier the depth

measurements. Also, Kinect

applies an interpolation technique for the points that

lie within the dark regions of the structured light

image. Combined with the above, different geometric

and reflectance characteristics of the surface may result

in very spiky errors. A RANSAC threshold of 2 cm

seems to suffice to deal with such outliers.

6 EXPERIMENTS

As a novel approach, we cast the problem of

removing outliers from a match-set as a “binary

classification” problem, borrowed from the machine

learning field. Indeed, what Lowe’s measure does is,

given a candidate match, check the second

nearest neighbour to mark it as an inlier or an outlier. To

analyze the methods in this way, ground truth information

is needed. Typically this is produced manually in

machine learning problems. However, since we can

have hundreds of matches for a pair of scans, we

approximated this procedure by running all the

candidate matches through a RANSAC loop with an excessive number of iterations

to detect inliers and outliers. This procedure might

introduce one or two false positives or false negatives

into the resulting ground truth, but such amounts would

have minimal effect.

We show the effects of different thresholds for

Lowe’s measure, the pure Depthscale method and the

combined approach as Receiver Operating

Characteristic (ROC) curves. The ROC curve is a classical

mechanism to show the performance (true positive rate

vs. false positive rate) of a binary classifier for different

threshold parameters. In our case, inliers and outliers

are labeled as positive and negative, respectively.

However, we must note that there is no training

happening here. The thresholds determine the

classifier directly.
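
A minimal sketch of how such ROC points can be computed from a thresholded measure is given below. It assumes that each candidate match carries a score (for example its NDSD value or its Lowe ratio) and a ground-truth inlier label, and that a match is accepted when its score does not exceed the threshold; the function name `roc_points` and this input layout are assumptions for illustration.

```python
def roc_points(scores, is_inlier, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold.

    scores: per-match measure, where smaller means more likely correct.
    is_inlier: ground-truth labels (True = inlier = positive class).
    A match is classified as positive when score <= threshold.
    """
    n_pos = sum(1 for label in is_inlier if label)
    n_neg = len(is_inlier) - n_pos
    points = []
    for thr in thresholds:
        tp = sum(1 for s, label in zip(scores, is_inlier) if s <= thr and label)
        fp = sum(1 for s, label in zip(scores, is_inlier) if s <= thr and not label)
        tpr = tp / n_pos if n_pos else 0.0
        fpr = fp / n_neg if n_neg else 0.0
        points.append((fpr, tpr))
    return points


if __name__ == "__main__":
    scores = [0.05, 0.10, 0.30, 0.50]
    labels = [True, True, False, True]
    print(roc_points(scores, labels, thresholds=[0.2, 0.6]))
```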

We tried to collect experimental data in environments

with different characteristics. Fig. 2 shows images

from 4 different test sets. Each test set has 5 scans (but

only 2 are used in the graphs), and they are

taken with a Kinect device from different angles and

distances. The top row specifically aims to represent

planar scenes, whereas the third row contains a strongly

non-planar human body.

The ROC curves in Fig. 3-6 show that the pure

Depthscale measure gives a better true positive ratio

at high false positive ratios. This means that if we

would like to easily eliminate many outliers while

keeping almost all inliers, pure Depthscale would

give better results. For example, a workstation-based

modelling application would choose that operating

mode due to its requirement of many inliers for better

accuracy. However, for lower false positive ratios

Lowe’s approach gives higher true positive ratios. This is

Depth-ScaleMethodin3DRegistrationofRGB-DSensorOutputs

473

Figure 2: Examined RGB frames from 4 different data sets (panels: Posters on a wall, Boxes Stack, Sitting Person, Living Room).

more suitable for applications which cannot tolerate

many outliers in the data, in order not to waste time on

many RANSAC iterations. A robotics localization

routine may opt for this operating mode. The

combined approach converges to the better individual

approach at the different ends of the ROC curve,

sometimes even beating both. However, at the left end

of the curve, it gives sporadic results for certain

thresholds and occasionally converges to the worse approach.

A deeper look shows that the inferior

results are caused by unrealistically tight thresholds

for Depthscale, such as 0.05. As a note, the ROC figures

are created by sampling various thresholds (for

Lowe’s measure between 0.65 and 1.0 and for

Depthscale between 0.05 and 0.6). Fig. 7 shows different

views of the registration results for the “Sitting Person”

experiment using all 5 frames. Note that no ICP-like

refinement or surface reconstruction techniques are

utilized here.

7 CONCLUSIONS

Figure 3: ROC curve of “Posters on a wall” experiment.

Figure 4: ROC curve of “Box Stack” experiment.

Figure 5: ROC curve of “Sitting Person” experiment.

We introduced a simple local image feature based

3D registration system for Kinect-like sensors. It is

designed to be built on available open source

systems. We also introduced a new measure called

Depthscale measure to increase the matching

performance by exploiting the fact that depth

measurements are available for the detected features.

Experiments are presented to show the usefulness of

this new measure. Eliminating outliers from a match

set is cast as a classification problem and hence

analysis is done through familiar ROC curves.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

474

Figure 6: ROC curve of “Living Room” experiment.

Figure 7: The rendering of overall registered 3D model for

Sitting Person experiment from different angles.

REFERENCES

Besl, P. J., McKay, N. D., 1992. A Method for

Registration of 3-D Shapes. In IEEE PAMI, 14(2),

pp. 239-256.

Shahram et al., 2011. KinectFusion: Real-time 3D

Reconstruction and Interaction Using a Moving Depth

Camera. In SIGGRAPH.

Fitzgibbon, A. W., 2001. Robust Registration of 2D and

3D Point Sets. In BMVC.

Bronstein et al., 2010. SHREC 2010: Robust Feature

Detection and Description Benchmark. In

Eurographics Workshop on 3D Object Retrieval.

Craciun, D., Paparoditis, N., Schmitt, F., 2010. Multi-view

Scans Alignment for 3D spherical Mosaicing in Large-

scale Unstructured Environments. In CVIU 114 pp.

1248–1263.

Tuytelaars, T., Mikolajczyk, K., 2010. Local Invariant

Feature Detectors: A Survey. In Foundations and

Trends® in Computer Graphics and Vision 3(3) pp.

177-280.

Lowe, D. G., 2004. Distinctive Image Features from

Scale-invariant Keypoints. In IJCV, 60(2),

pp. 91-110.

Wu, C., Clipp, B., Li, X., Frahm, J-M, Pollefeys, M.,

2008. 3D Model Matching with Viewpoint Invariant

Patches (VIP). In CVPR.

Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008.

Speeded-Up Robust Features (SURF). In CVIU 110,

pp. 346–359.

Muja, M., Lowe, D. G., 2009. Fast Approximate Nearest

Neighbors With Automatic Algorithm Configuration.

In VISAPP.

Opencv, 2013. http://opencv.willowgarage.com.

Fischler, M. A., Bolles, R. C., 1981. Random Sample

Consensus: A Paradigm for Model Fitting with

Applications to Image Analysis and Automated

Cartography. In Commun. ACM, 24(6), pp. 381–395.

Eggert, D. W., Lorusso, A., Fisher, R. B., 1997.

Estimating 3-D rigid body transformations: a

comparison of four major algorithms. In Machine

Vision and Applications 9: 272–290.

Depth-ScaleMethodin3DRegistrationofRGB-DSensorOutputs

475