EVALUATION OF STEREO MATCHING COSTS ON CLOSE RANGE, AERIAL AND SATELLITE IMAGES

Ke Zhu, Pablo d’Angelo and Matthias Butenuth

Remote Sensing Technology, Technische Universität München, Arcisstr. 21, München, Germany

The Remote Sensing Technology Institute, German Aerospace Center, Oberpfaffenhofen, Germany

Keywords:

Dense stereo matching, Cost function, Performance, Observation constraint.

Abstract:

In recent years, most dense stereo matching methods have been evaluated on the Middlebury stereo vision benchmark datasets. Most recent stereo algorithms were designed to perform well on these close range stereo datasets with relatively small baselines and good radiometric behaviour. In this paper, different matching costs for the Semi-Global Matching algorithm are evaluated and compared using the common Middlebury datasets as well as aerial and satellite datasets with ground truth. The experimental results show that the performance of dense stereo methods on datasets with larger baselines and stronger radiometric changes relies on even more robust matching costs. In addition, a novel matching cost based on Mutual Information and Census is introduced, which shows the most robust performance on close range, aerial and satellite data.

1 INTRODUCTION

The performance of dense stereo matching methods depends on all components, including preprocessing, matching costs, aggregation, disparity optimization and postprocessing steps. Most work on dense stereo uses well known cost functions such as absolute differences or Birchfield-Tomasi (Birchfield and Tomasi, 1998), as these perform well on the Middlebury datasets (Scharstein and Szeliski, 2002; Scharstein and Szeliski, 2011). The most intuitive cost assumes the consistency between intensities of two corresponding pixels. Using different matching costs, like Absolute Differences (AD), Mutual Information (MI) (Viola and Wells, 1997; Chrastek and Jan, 1997) or Census (Zabih and Woodfill, 1994), with the same stereo matching method can generate very different results (Hirschmüller and Scharstein, 2009; Neilso and Yang, 2008).

Dense stereo algorithms are typically evaluated with a small baseline configuration and artificial, often ambient, light sources. Radiometric changes due to vignetting, gamma changes etc. were often simulated by modifying these small baseline images (Hirschmüller and Scharstein, 2009; Neilso and Yang, 2008). These simulations do not capture all effects, such as non-Lambertian reflectance. In the evaluation of stereo matching costs using the Middlebury data (Hirschmüller and Scharstein, 2009), Census shows the best and most robust overall performance, and Mutual Information performs very well with global methods. On radiometrically distorted Middlebury datasets and datasets with varying illumination, Census and Mutual Information clearly outperform AD. However, we are not aware of any matching cost performance evaluation for images with larger baselines or for remotely sensed images.

In this study, the Semi-Global Matching (SGM) method (Hirschmüller, 2008) is selected as the stereo algorithm for evaluating different matching costs because of its robustness, speed and accuracy. Four matching costs are evaluated: a parametric matching cost (AD), a non-parametric matching cost (Census), a matching cost based on Mutual Information (MI) and, in addition, a new combined matching cost MI-Census (MIC). In contrast to previous studies (Hirschmüller and Scharstein, 2009; Neilso and Yang, 2008), we do not use synthetically modified datasets for performance evaluation, but use the standard Middlebury datasets as examples of close range datasets, and aerial and satellite images as examples of datasets with large baselines and stronger radiometric differences.

We focus on a fundamental question in our work: given a currently well-performing stereo method, how important is the matching cost for stereo methods on real data? We found that the performance of matching costs on the Middlebury dataset cannot be extrapolated to images with larger baselines and stronger radiometric changes, and that the performance strongly depends on the matching cost function used. A novel matching cost, a linear combination of MI and Census (MIC), shows the most robust performance with increasing baseline length on real data.

Zhu K., d’Angelo P. and Butenuth M.
EVALUATION OF STEREO MATCHING COSTS ON CLOSE RANGE, AERIAL AND SATELLITE IMAGES.
DOI: 10.5220/0003764203790385
In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods (PRARSHIA-2012), pages 379-385
ISBN: 978-989-8425-98-0
2012 SCITEPRESS (Science and Technology Publications, Lda.)

The remainder of this paper is organized as follows: Section 2 describes the evaluated matching costs and the basics of SGM. The experiments on Middlebury datasets, airborne optical image sequences and satellite datasets are evaluated and compared in Section 3. Conclusions and future work are presented in the last section.

2 MATCHING COSTS AND SEMI-GLOBAL MATCHING

2.1 Matching Costs

Generally, the matching costs in this work are defined on intensity instead of color. Color channels are averaged if they are available. Three typical matching costs are selected: a parametric matching cost (AD), a non-parametric cost (Census) and a cost based on MI. In addition, we linearly combine MI and Census with different weights to build a new matching cost (MIC).

The simplest cost function is AD, which assumes constant intensities for corresponding pixels and, thus, Lambertian reflectance and good radiometric calibration. In Equation 1, $I_l(p)$ and $I_r(p - d)$ denote the intensity of pixel $p$ in the left image and the intensity of its matched pixel at disparity $d$ in the right image, respectively:

$$C_{AD}(p,d) = |I_l(p) - I_r(p - d)| \qquad (1)$$
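As a minimal sketch of the absolute difference cost, assuming rectified images stored as lists of intensity rows so that the match of left pixel `x` lies at `x - d` on the same row, the cost for one pixel could be computed as follows. The function name `ad_cost` and the bounds handling are illustrative assumptions, not part of the original work.

```python
def ad_cost(left_row, right_row, x, d):
    """Absolute difference between left pixel x and right pixel x - d.

    left_row, right_row: intensity values of the same image row
    (assumes rectified images, so correspondences share a row).
    """
    if not 0 <= x - d < len(right_row):
        raise IndexError("matched pixel lies outside the right image")
    return abs(left_row[x] - right_row[x - d])


left = [10, 20, 30, 40]
right = [20, 30, 40, 50]          # left row shifted by one pixel
print(ad_cost(left, right, 2, 1))  # compares left[2] with right[1]
```

Because the cost compares raw intensities, any gain or offset difference between the two images directly increases it, which is the sensitivity discussed in the text.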

In contrast to AD, Census is a non-parametric cost. It is invariant to monotonic gray value changes and thus can tolerate a large class of global and local radiometric changes. It encodes the local image structure within a transform window and defines a bit string where each bit describes the relative ordering between the center pixel and one of its local neighbors. A bit is set if a pixel inside the window has a lower intensity than the center pixel. The distance between two bit strings is computed using the Hamming distance. In our work, a 9 × 7 window is used, which supports matching costs in the range of 0 to 63. $\xi$ denotes the Census transform within a window $W$, and $D_H$ computes the Hamming distance:

$$C_{Census}(p,d) = D_H(\xi_l(p), \xi_r(p - d)) \qquad (2)$$

For easier combination with other costs, we rescale the matching costs into a range from 0 to 1023.
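The Census cost can be illustrated with a short Python sketch. It assumes images stored as nested lists, a window that is 9 pixels wide and 7 pixels high (the orientation is an assumption), and neighbours outside the image treated as "not darker"; the names `census_transform`, `hamming` and `census_cost` are hypothetical.

```python
def census_transform(img, y, x, half_h=3, half_w=4):
    """Return the Census bit list for the window centred on (y, x).

    A bit is 1 if the neighbour is darker than the centre pixel.
    Neighbours outside the image yield 0 (an assumed border policy).
    """
    centre = img[y][x]
    bits = []
    for dy in range(-half_h, half_h + 1):
        for dx in range(-half_w, half_w + 1):
            if dy == 0 and dx == 0:
                continue  # the centre is not compared with itself
            yy, xx = y + dy, x + dx
            inside = 0 <= yy < len(img) and 0 <= xx < len(img[0])
            bits.append(1 if inside and img[yy][xx] < centre else 0)
    return bits


def hamming(a, b):
    """Number of positions in which two equally long bit lists differ."""
    return sum(u != v for u, v in zip(a, b))


def census_cost(left, right, y, x, d, half_h=3, half_w=4):
    """Hamming distance between the Census strings at x (left) and x - d (right)."""
    return hamming(census_transform(left, y, x, half_h, half_w),
                   census_transform(right, y, x - d, half_h, half_w))


a = [[5, 5, 5], [5, 4, 5], [5, 5, 5]]
brighter = [[2 * v + 7 for v in row] for row in a]   # monotonic gray value change
print(census_cost(a, brighter, 1, 1, 0, half_h=1, half_w=1))
```

Since only the ordering of intensities enters the bit string, the monotonic change applied to `brighter` leaves the cost unchanged.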

MI combines the individual entropies $H_l$, $H_r$ and the joint entropy $H_{l,r}$ of a stereo pair. This enables the registration of images with complex radiometric relationships (Viola and Wells, 1997). In this paper, we use Hierarchical MI (HMI) for efficient iterative learning (Hirschmüller, 2008).

$$MI(p, d) = H(p) + H(p - d) - H(p, p - d) \qquad (3)$$

The mutual information cost is also rescaled to a range from 0 to 1023.
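To illustrate how the entropies combine, the following Python sketch computes the mutual information of two lists of corresponding intensities from their histograms. It is a simplified global measure under stated assumptions, not the hierarchical, pixelwise HMI cost used in the paper; the names `entropy` and `mutual_information` are illustrative.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (natural log) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mutual_information(left_vals, right_vals):
    """MI = H_l + H_r - H_{l,r} for equally long lists of corresponding intensities."""
    n = len(left_vals)
    h_l = entropy(Counter(left_vals).values(), n)
    h_r = entropy(Counter(right_vals).values(), n)
    h_lr = entropy(Counter(zip(left_vals, right_vals)).values(), n)
    return h_l + h_r - h_lr


# A consistent but non-linear intensity mapping still gives high MI ...
print(mutual_information([0, 0, 1, 1], [5, 5, 9, 9]))
# ... whereas unrelated intensities give an MI of zero.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call shows why MI tolerates complex radiometric relationships: any one-to-one mapping between the intensities of the two images keeps the joint entropy low.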

Due to the fixed local support, the disparity images generated using Census show slightly blurred object boundaries (Brockers, 2009). Hence, the new matching cost in this paper merges MI and Census. This combination uses the advantages of both costs, the intuitive consistency and the local structure, which are linearly summed with different weights:

$$C_{MIC}(p,d) = w \times C_{MI}(p,d) + (1 - w) \times C_{Census}(p,d) \qquad (4)$$
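The weighted combination can be sketched in a few lines of Python. The linear form of the rescaling to the range 0 to 1023 is an assumption (the text only states the target range), and the helper names `rescale` and `mic_cost` are hypothetical.

```python
def rescale(cost, max_cost, target_max=1023):
    """Linearly map a cost from [0, max_cost] to [0, target_max] (assumed mapping)."""
    return cost * target_max / max_cost


def mic_cost(c_mi, c_census, w):
    """Weighted sum of an MI cost and a Census cost, both on the same scale.

    w is the weight of the MI cost; the Census cost receives 1 - w.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * c_mi + (1.0 - w) * c_census


census = rescale(21, 63)              # Hamming distance 21 mapped to 0..1023
print(mic_cost(100.0, census, 0.9))   # MI weight 0.9, as reported for Middlebury
```

Bringing both costs to a common range first keeps the weight interpretable as a relative contribution of the two terms.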

2.2 Semi-Global Matching

The Semi-Global Matching (SGM) method approximates a global, 2D smoothness constraint by combining many 1D constraints from different aggregation directions for pixelwise matching. The global energy for the disparity image is defined as $E(D)$:

$$E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1 \, T\big[|D_p - D_q| = 1\big] + \sum_{q \in N_p} P_2 \, T\big[|D_p - D_q| > 1\big] \Big) \qquad (5)$$

Here, $T[\cdot]$ is 1 if its argument is true and 0 otherwise.

The first term sums the costs of all pixels in the image with their particular disparities $D_p$. The next two terms penalize discontinuities with the penalty factors $P_1$ and $P_2$, which apply to small and large disparity differences, respectively, between the pixel $p$ and the pixels $q$ in its neighbourhood $N_p$. The minimization is approximated by aggregating path-wise costs into a cost volume $S(p,d)$:

$$S(p,d) = \sum_{r} L_r(p,d) \qquad (6)$$

$L_r(p,d)$ in Equation 6 represents the cost of pixel $p$ with disparity $d$ along one direction $r$. It is defined as follows:

$$L_r(p,d) = C(p,d) + \min\big( L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \big) - \min_i L_r(p - r, i) \qquad (7)$$
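As an illustration of Equations 6 and 7, the following Python sketch aggregates costs along a single scanline in two directions and sums the resulting path costs. It is a simplified sketch under stated assumptions (one image row, constant penalties, costs given as `costs[x][d]`), not the authors' implementation; `aggregate_path` and `sum_paths` are hypothetical names.

```python
def aggregate_path(costs, p1, p2):
    """Path costs L_r along one scanline, traversed in index order.

    costs[x][d] is the matching cost C(p, d) of pixel x at disparity d.
    """
    n_disp = len(costs[0])
    path = [list(costs[0])]            # first pixel has no predecessor
    for x in range(1, len(costs)):
        prev = path[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            candidates = [prev[d], prev_min + p2]      # same disparity, large jump
            if d > 0:
                candidates.append(prev[d - 1] + p1)    # disparity change of -1
            if d + 1 < n_disp:
                candidates.append(prev[d + 1] + p1)    # disparity change of +1
            # subtracting prev_min keeps the path cost bounded
            row.append(costs[x][d] + min(candidates) - prev_min)
        path.append(row)
    return path


def sum_paths(paths):
    """Cost volume S(p, d): element-wise sum of several path cost arrays."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*paths)]


costs = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]
forward = aggregate_path(costs, p1=1, p2=3)
backward = aggregate_path(costs[::-1], p1=1, p2=3)[::-1]
volume = sum_paths([forward, backward])
print(volume)
```

A full implementation would repeat the same recursion along further directions (for example vertical and diagonal paths) over the whole image before summing.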

This regularisation term favours planar and sloped surfaces, but still allows larger height jumps in the direction of cost aggregation. The disparity at each pixel is selected as the index of the minimum cost in the cost cube. In addition, $P_2$ is adapted to the local intensity gradient:

$$P_2 = \frac{P_2'}{1 + |I(p) - I(p - 1)| / W} \qquad (8)$$

Here, $P_2'$ is the unadapted penalty and $W$ is a parameter that controls the reduction of the penalty.

ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods
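The disparity selection and the gradient-adaptive penalty can be sketched as follows. The sketch assumes that the large-jump penalty is divided by `1 + |I(p) - I(p - 1)| / W`, so that it shrinks at intensity edges; this reading of Equation 8 and the names `adaptive_p2` and `select_disparity` are assumptions made for illustration.

```python
def adaptive_p2(p2_base, i_p, i_prev, w):
    """Reduce the large-jump penalty where the local intensity gradient is strong.

    p2_base: unadapted penalty, i_p / i_prev: intensities of the current and
    the previous pixel along the path, w: parameter controlling the reduction.
    """
    return p2_base / (1.0 + abs(i_p - i_prev) / w)


def select_disparity(costs_at_pixel):
    """Winner-takes-all: index of the minimum aggregated cost at one pixel."""
    return min(range(len(costs_at_pixel)), key=costs_at_pixel.__getitem__)


print(adaptive_p2(1450, 100, 100, 45))   # homogeneous area: penalty unchanged
print(adaptive_p2(1450, 145, 100, 45))   # intensity edge: penalty reduced
print(select_disparity([13, 2, 13]))
```

Lowering the penalty at intensity edges makes disparity discontinuities cheaper where object boundaries are likely.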

3 EVALUATION

In this section, we test the four matching costs with SGM on Middlebury images without radiometric changes, on an aerial image sequence with increasing baseline length and on satellite images representing different typical object classes. In addition, we tuned the smoothness parameters of SGM for all four costs in order to get the best performance. This tuning allows us to concentrate on the performance of the matching costs rather than the stereo method. During the evaluation, all parameters are kept constant for the Middlebury images. For the airborne image sequence, a 1.7 meter resolution LIDAR 3D point cloud is used as the ground truth. A 3D point cloud acquired by the Institut Cartogràfic de Catalunya (ICC) with airborne laser scanning is used as reference data for the satellite data. The density of the point cloud is approximately 0.5 points per square meter. The data is part of the ISPRS matching benchmark (Reinartz et al., 2010).

3.1 Results on the Middlebury Stereo Benchmark Datasets

We first evaluate the matching costs on the Middlebury data. After parameter tuning, our experimental results on the Middlebury data, shown in Figure 1, are similar to those of the original implementations (Hirschmüller, 2008; Scharstein and Szeliski, 2011). Our post-processing steps are not tuned to the Middlebury datasets, and thus our results are slightly worse than the official SGM entry. The best parameter combinations are $P_1 = 16$ and $P_2 = 29$ for AD, $P_1 = 432$ and $P_2 = 480$ for Census, and $P_1 = 750$ and $P_2 = 1450$ with a $P_2$ adaptive factor $W = 45$ for MI. Figure 1 shows the computed disparity images and the ground truth. The disparity images generated with AD have visually clean edges.

The specific error analysis is illustrated at the bottom of Figure 1. The performance of MI is generally the best for all four datasets, with an average percentage of bad pixels of $e = 9.65$ at an error threshold of $e_t = 1$. Unlike in the previous study (Hirschmüller and Scharstein, 2009), the results using Census ($e = 12.67$) are worse than those using AD ($e = 11.05$), possibly due to our basic post-processing. We observed that the boundaries of AD are cleaner compared to Census, because of the fixed window size of Census. MI produces results similar to, but more accurate than, those of AD. On the Middlebury datasets, MIC performs similarly to MI, with $e = 9.76$. The best parameters for MIC are $P_1 = 750$, $P_2 = 1450$ and $W = 50$. The cost of MI has a weight of 0.9. In our study on the Middlebury data, this combination is better than AD but worse than MI. However, it outperforms the other matching costs on aerial and satellite images, as shown in the following subsections.

Figure 1: Results on the Middlebury Datasets for SGM with varying matching costs. Top: Depth images obtained after parameter tuning. Bottom: Bad pixel percentages for non-occluded pixels, all pixels and pixels near discontinuities.
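The percentage of bad pixels used in this evaluation can be sketched in Python as the share of evaluated pixels whose disparity error exceeds the error threshold. The flat-list data layout, the optional mask (for example to restrict the evaluation to non-occluded pixels) and the name `bad_pixel_percentage` are assumptions made for illustration.

```python
def bad_pixel_percentage(disp, truth, threshold=1.0, mask=None):
    """Percentage of evaluated pixels with |disp - truth| > threshold.

    disp, truth: flat lists of estimated and ground truth disparities.
    mask: optional flat list; pixels with a falsy entry are skipped.
    """
    total = 0
    bad = 0
    for i, (d, t) in enumerate(zip(disp, truth)):
        if mask is not None and not mask[i]:
            continue
        total += 1
        if abs(d - t) > threshold:
            bad += 1
    return 100.0 * bad / total if total else 0.0


disp = [10, 12, 15, 20]
truth = [10, 10, 15, 18]
print(bad_pixel_percentage(disp, truth))                     # all pixels
print(bad_pixel_percentage(disp, truth, mask=[1, 0, 1, 1]))  # masked subset
```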

3.2 Results on Aerial Image Sequence

A continuously recorded airborne optical image sequence is used to follow the impact of the matching costs on performance under changing baselines and stereo angles. The images are provided by the 3K camera system, consisting of three Canon EOS 1D Mark II cameras with 50 mm lenses (Kurz et al., 2007). Only the nadir views are used in this evaluation. The flight altitude is approximately 1500 meters above ground. The distance between consecutive recorded observations is about 35 meters. The largest baseline we present in this work is about 250 meters.

Figure 5 clearly visualizes the impact of larger baselines on the matching performance: although all four matching costs perform well on images with small baselines, the errors of AD rise quickly with increasing baseline and stereo angle, possibly due to non-Lambertian reflectance. Hierarchical MI (HMI) performs slightly better than AD for small baselines, and the disparities on the street are smoother and show less noise, cf. Figure 3. For small baselines, Census produces a noisy disparity map, but it shows the most robust behaviour for larger baselines, as visible on the church roof in Figure 5.

In addition, we observed that the results using MI have sharper edges at discontinuities. In contrast, the fixed local support of Census causes slightly blurred edges. Figure 2 compares details of the disparity images obtained with different matching costs. The surface smoothness of the results using MI is generally better compared to AD and Census. A visual analysis is illustrated in Figure 3.

Because of the advantages of MI for sharp edges and smoothness on surfaces, and the robustness of Census, we combine MI with Census to generate the matching cost MIC. It performs similarly to MI at object discontinuities, but keeps the robustness of Census for larger baselines.

3.3 Results on Satellite Data

We evaluated the matching costs on a Worldview-1 stereo image pair with a ground sampling distance of 50 cm and a relatively large stereo angle of 35°. The data is part of the ISPRS matching benchmark (Reinartz et al., 2010). A small cutout of the stereo data and the reconstruction results for an urban area (Terrassa, Barcelona, Spain) are shown in Fig. 4. The full dataset covers mountainous, agricultural, forest, industrial and residential areas. The figure indicates that these images cannot be matched successfully using MI, while Census and MIC perform reasonably well on this challenging dataset. The large black background in the MI image was incorrectly filled using this data. Table 1 shows the results of evaluating the city area shown in Fig. 4 and two other test areas (hilly forest and industrial area) against the LIDAR reference data. It is clearly visible that MIC performs slightly better than Census and that MI produces the largest errors. Experiments with various values for $P_1$, $P_2$ and $W$ indicated that performance depends mostly on the cost function and not on the exact parametrisation of the stereo algorithm.

Figure 2: Comparison of details of disparity images using different matching costs: MI and MIC show better object boundaries and less noise. Different colour codes denote different disparities of corresponding pixels.

Figure 3: Comparison of smoothness for AD, MI, Census and MIC results on a flat street. The diagram shows the height values along the profile indicated by the yellow line.

Table 1: Evaluation of matching results in three test areas against ground truth LIDAR data. NMAD is the normalized median absolute deviation and BP (Bad Pixels) is the percentage of pixels with an absolute height error > 2 m.

Cost     P_1   P_2    W     w     NMAD   BP(%)
MIC      700   1400   200   0.3   0.72   15.8
Census   600   1300   200   -     0.74   16.8
MI       700   1400   200   -     1.10   25.8
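The two error measures of Table 1 can be sketched in Python. The sketch assumes the common definition of NMAD as 1.4826 times the median absolute deviation from the median, which the paper does not spell out, and uses hypothetical function names and toy height errors rather than data from the evaluation.

```python
import statistics


def nmad(errors):
    """Normalized median absolute deviation of a list of height errors.

    Assumes the common definition 1.4826 * median(|e - median(e)|).
    """
    med = statistics.median(errors)
    return 1.4826 * statistics.median([abs(e - med) for e in errors])


def bad_height_percentage(errors, limit=2.0):
    """Percentage of pixels whose absolute height error exceeds limit (in metres)."""
    return 100.0 * sum(abs(e) > limit for e in errors) / len(errors)


errors = [-1.0, 0.0, 1.0, 2.0, 100.0]   # toy height errors with one gross outlier
print(nmad(errors))
print(bad_height_percentage(errors))
```

Because it is based on medians, NMAD is barely affected by the single gross outlier in the toy data, whereas the bad pixel percentage counts it explicitly.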

4 CONCLUSIONS AND FUTURE WORK

In this work, three typical matching costs (AD, MI and Census) and a novel matching cost (MIC) are evaluated using SGM on the Middlebury stereo benchmark as well as aerial and satellite images with ground truth. In summary, we found that the performance of dense stereo methods on datasets with larger baselines and stronger radiometric changes strongly relies on robust matching costs. The experimental results on different datasets show that AD works well for scenes with a small baseline and good radiometric calibration. MI performs slightly better than Census in the case of low radiometric changes and stereo pairs with small baselines, but fails in areas with local radiometric changes. MI keeps sharp edges at discontinuities and results in less noisy disparity maps when compared to Census. Census performs well for larger baselines, but results in slightly blurred object boundaries. For large baseline satellite stereo imagery, Census performs significantly better than MI. The weighted sum of MI and Census (MIC) unifies the advantages of MI and Census and outperforms AD, MI and Census on remote sensing datasets.

Figure 4: Small cutout of the Worldview-1 stereo pair. First row: stereo pair and LIDAR reference data. Second row: Results after stereo matching with different cost functions, orthographic reprojection and discontinuity preserving interpolation.

Figure 5: Disparity maps for stereo pairs with increasing baseline. The images 1 to 7 are matched with the centre image C. The results for AD, MI, Census and MIC are shown in columns 2-5. The black areas indicate failures of the left-right check. The ground truth DSM (Digital Surface Model) from laser scanning is shown next to the centre image.

Topics for future work include developing a methodology for the evaluation of remotely sensed images against LIDAR ground truth. The main challenges for this task are changes due to multi-temporal data acquisition and differences in the resolution and behaviour of the sensors. The influence of different stereo algorithms on the performance of the matching cost functions will be evaluated in further work.

ACKNOWLEDGEMENTS

Special thanks are given to the data providers for the provision of the stereo data, namely Digital Globe for the Worldview-1 data and ICC Catalunya for the reference data.

REFERENCES

Birchfield, S. and Tomasi, C. (1998). A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406.

Brockers, R. (2009). Cooperative stereo matching with color-based adaptive local support. Computer Analysis of Images and Patterns.

Chrastek, R. and Jan, J. (1997). Mutual information as a matching criterion for stereo pairs of images. Analysis of Biomedical Signals and Images, 14:101–103.

Hirschmüller, H. (2008). Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341.

Hirschmüller, H. and Scharstein, D. (2009). Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1582–1599.

Kurz, F., Müller, R., Stephani, M., Reinartz, P., and Schroeder, M. (2007). Calibration of a wide-angle digital camera system for near real time scenarios. ISPRS Workshop High Resolution Earth Imaging for Geospatial Information.

Neilso, D. and Yang, Y. (2008). Evaluation of constructable match cost measures for stereo correspondence using cluster ranking. IEEE Conference on Computer Vision and Pattern Recognition.

Reinartz, P., d’Angelo, P., Krauß, T., Poli, D., Jacobsen, K., and Buyuksalih, G. (2010). Benchmarking and quality analysis of DEM generated from high and very high resolution optical stereo satellite data. ISPRS Symposium Commission I.

Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42.

Scharstein, D. and Szeliski, R. (2011). Middlebury stereo vision research page. http://vision.middlebury.edu/stereo/.

Viola, P. and Wells, W. M. (1997). Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137–154.

Zabih, R. and Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence. In Proc. European Conference on Computer Vision.
