3D MODELING OF STREET BUILDINGS FROM PANORAMIC
VIDEO SEQUENCES AND GOOGLE MAP IMAGE
Fay Huang¹, Yi-Ju Wu¹, Jing-Siang Hsu¹ and Augustine Tsai²
¹CSIE, National Ilan University, 26047 Yi-Lan, Taiwan
²Emerging Smart Technology Institute, Institute for Information Industry, 105 Taipei, Taiwan
Keywords:
Panoramic Images, Image-based Modeling.
Abstract:
A semi-automatic image-based framework for modeling street buildings is proposed in this paper. Two types of image sources are used: one is a sequence of ground-level spherical panoramic images captured by a panoramic video recorder, and the other is an aerial image of the desired area obtained from Google Map. The advantages of our approach are, first, that the camera trajectory recovery is more accurate and stable than with multiview planar images because spherical panoramic images are used. Second, since each face texture of a building is extracted from a single panoramic image, there is no need to deal with the color blending problem that arises where textures overlap.
1 INTRODUCTION
Three-dimensional (3D) models of street scenes are needed in many applications such as virtual fly-throughs and drive-throughs, augmented reality, urban planning, and documentation. Devising a fully or semi-automatic method for fast building model reconstruction has lately become an active research topic in fields such as computer graphics and computer vision. Due to the recent explosion of digital photography, various image-based modeling approaches have drawn a great deal of attention from researchers.

A detailed building reconstruction is not needed in some applications such as virtual touring or route guidance. One specific example is the GPS-based car navigation system, which mainly uses a simplified aerial street map combined with speech to guide drivers to their destinations. Some advanced navigation systems also support simple, untextured 3D models to illustrate situations where multiple roads overlap vertically. However, in many practical situations, drivers may still find it difficult to relate the aerial 2D map or textureless road models to their visual impression of the environment. Thus, supplying realistic street views along the route can be very useful, and this can be achieved with a set of simple texture-mapped 3D building models.
Aerial images and ground-level images are two
major types of image sources used by many exist-
ing image-based urban 3D modeling approaches.

Figure 1: Examples of the reconstructed building models, which are textured with building elevation images automatically extracted from the panoramic images to provide a realistic impression.

City models can be constructed solely from aerial images
if the heights of the buildings can be obtained from
airborne laser scanners or calculated from stereo im-
age views (Gruen, 1997; Haala and Brenner, 1998;
Maas, 2001; Vestri and Devernay, 2001). How-
ever, the resultant city model usually lacks a realis-
tic impression at ground level since the aerial image
can only provide very limited texture information for
buildings’ side views. Hence, there have been a num-
ber of approaches to automated texture mapping of
3D models using the available ground-level images
(Fruh and Zakhor, 2004; Hu et al., 2006; Liu et al.,
2006; Stamos and Allen, 2002). The automated pose recovery algorithm for multiview images is considered time-consuming, and the textures generated from different views usually exhibit visible seams due to
lighting condition and image resolution variations.

Figure 2: Left: the Point Grey Ladybug3 panoramic camera. Right: an example of a captured spherical panoramic image.

Wang et al. (Wang et al., 2007) proposed to use cylindrical panoramic images for texture mapping purposes. In their method, a rough location of each panoramic image was assumed to be provided, and the registration between the panoramic image and the aerial image was done through a voting process. Since only sparse panoramic images were used, some buildings might lack texture, or visible seams appear where textures overlap.
Panoramic images have become widely accessible due to the rapid development of hardware and camera technologies, and they have many advantages for 3D reconstruction tasks owing to their wide field of view. We aim to develop a framework which takes a set of dense spherical panoramic images and an orthogonal aerial image of that area, and outputs the texture maps and height information of the selected buildings through a fully automatic process. The reconstructed building models are textured with building elevation images extracted from the panoramic images to provide a realistic impression, as shown in Fig. 1. The proposed approach would be useful for applications that require large-scale, simple yet realistic 3D street/city models.
2 RECONSTRUCTION
FRAMEWORK
The framework takes two types of image sources as input. One is a dense set of ground-level panoramic images, and the other is a nearly orthogonal aerial image of the area. The 360-degree panoramic images were captured by a Point Grey Ladybug3 mounted on top of a car, which captures five "regular" planar images looking horizontally outwards and one looking upward, at a frame rate of 15 frames per second. The six images are then stitched together using the multi-perspective plane sweep approach of (Kang et al., 2004). This produces a set of spherical panoramic images of resolution 2048×1024. The camera and an example of a captured panoramic image are shown in Fig. 2. The aerial image of the desired area can be obtained from Google Map by stitching together different image portions of that area, each shown at the largest zoom level.

Figure 3: Framework of the proposed approach. The pipeline stages are: feature detection and matching, external camera parameter determination, and camera trajectory recovery from the panoramic images; aerial and panoramic image registration; building rooftop edge detection and matching; building height estimation and texture extraction; building texture rectification; and 3D modeling and texture mapping.
Some reasonable assumptions about the input data of our approach are described as follows. The projection of each source panoramic image can be modeled by a spherical projection with respect to a point (i.e., a projection center) representing the location of the camera in 3D space where the image was captured. An imaging coordinate system is defined for each image, with its origin at the corresponding projection center. Moreover, the panoramic images were acquired along a smooth and continuous path. The GPS position information associated with some panoramic image acquisition locations is given, as the car was equipped with a GPS device; this information is used for initialization and bundle adjustment purposes. Furthermore, the buildings' footprints on the orthogonal aerial image are either given (i.e., obtained by pre-processing the aerial map with the same method as proposed in (Wang et al., 2007)) or manually specified. The outlines of the building footprints are used to construct the 3D models of the buildings and also help identify the buildings' front-view regions in the panoramic images.
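The assumed spherical projection can be written down compactly: a viewing direction maps to pixel coordinates through its azimuth and elevation angles. The following is a minimal sketch of such a mapping for a 2048×1024 equirectangular panorama; the axis convention and the function name are our own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def spherical_project(point_cam, width=2048, height=1024):
    """Project a 3D point given in a panorama's camera coordinate frame onto an
    equirectangular panoramic image. Assumed convention: x right, y down,
    z forward; u grows with azimuth and v grows downward with elevation."""
    x, y, z = point_cam
    azimuth = np.arctan2(x, z)                   # -pi .. pi around the vertical axis
    elevation = np.arctan2(y, np.hypot(x, z))    # -pi/2 .. pi/2, positive below the horizon
    u = (azimuth / (2.0 * np.pi) + 0.5) * width
    v = (elevation / np.pi + 0.5) * height
    return u, v

# Example: a scene point 10 m in front of the camera and 3 m above it
print(spherical_project((0.0, -3.0, 10.0)))      # centre column, above the horizon row
```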
The framework of our approach is depicted in Fig. 3. First, a feature detection algorithm is applied to each of the source panoramic images, and then feature point matching is performed between each pair of successive images. The matching results enable us to recover the essential matrix describing the spatial
relationship between two imaging coordinate systems. The relative orientation, represented by a rotation matrix, and the relative position, represented by a unit vector, of two successive panoramic images can be derived from the essential matrix. The camera trajectory can be recovered through point cloud reconstruction of the scene and bundle adjustment based on the available GPS information. The camera path can be refined by registering it to the aerial image. After each panoramic image's position with respect to the aerial image's coordinate system is established, the building rooftop edges in the panoramic images can be identified through a matching process supported by the building footprint outlines on the aerial image. The height of a building can be calculated from the detected building rooftop edge and the location of the panoramic image. Finally, the building texture can be extracted from the panoramic image; it must go through warping and rectification processes before it can be used for texture mapping. The remaining sections describe the techniques used in the framework, followed by some 3D modeling results and conclusions.
3 CAMERA TRAJECTORY
RECOVERY
The first half of the framework deals with the camera trajectory recovery task, which can be considered a preparation step so that the registration of the two types of input images can take place. Currently, registering the recovered camera trajectory to the aerial image is done manually by specifying the two end-point locations of the path on the map. Consequently, the aerial-image coordinates associated with each source panoramic image can be derived.
In order to recover the relative image capturing
positions and orientations of the source panoramas,
we first estimated the spatial relationship between
each adjacent pair of panoramic images, and then in-
tegrated those pairwise results. The spatial relation-
ship in terms of a rotation matrix and a translation
vector, referred to as the external camera parameters,
can be derived from the essential matrix describing
the epipolar constraint between the image correspon-
dences in two panoramas.
Figure 4: An intermediate result of the reconstructed scene point cloud for determining the external camera parameters. Blue dots represent the calculated scene points and red circles indicate the camera positions.

The image point correspondences can be established by Scale-Invariant Feature Transform (SIFT) detection plus SIFT-based matching. A single threshold D_SIFT was used to determine whether a match was acceptable in the SIFT-based matching algorithm. The smaller the value, the more image correspondences were identified, and the higher the possibility that the result would include false matches. The eight-point algorithm was employed to estimate the essential matrix. A two-pass approach was proposed to obtain the final essential matrix. First, an initial essential matrix was derived from a smaller set of corresponding image points, namely the matching result associated with a relatively large threshold value D_SIFT. Those sparse corresponding points were believed to be more accurate but less descriptive. Next, a smaller threshold value was assigned to obtain a larger set of point matches. The initial essential matrix was then used as a constraint to filter out the incorrect matches; in other words, the matching outliers were filtered by the epipolar constraint. The remaining point matches were then used to compute the final essential matrix.
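A compact sketch of this two-pass scheme is given below, using OpenCV's SIFT implementation. Because the images are spherical panoramas, pixel coordinates are first converted to unit bearing vectors and a plain linear eight-point fit is used; the matching criterion used for D_SIFT, the threshold values, and all function names are our assumptions for illustration rather than the authors' implementation.

```python
import numpy as np
import cv2

def pixel_to_bearing(u, v, width=2048, height=1024):
    """Convert an equirectangular pixel coordinate into a unit viewing direction
    (the inverse of the spherical projection sketched earlier)."""
    azimuth = (u / width - 0.5) * 2.0 * np.pi
    elevation = (v / height - 0.5) * np.pi
    return np.array([np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation),
                     np.cos(elevation) * np.cos(azimuth)])

def match_sift(img1, img2, d_sift):
    """SIFT detection plus matching governed by a single threshold. A match is kept
    when the second-best descriptor distance exceeds d_sift times the best one, so
    a larger d_sift yields fewer but more reliable matches (one possible reading of
    D_SIFT; the exact criterion is not spelled out in the text)."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[1].distance > d_sift * p[0].distance]
    pts1 = np.array([k1[m.queryIdx].pt for m in good], dtype=float)
    pts2 = np.array([k2[m.trainIdx].pt for m in good], dtype=float)
    return pts1, pts2

def eight_point_essential(b1, b2):
    """Linear eight-point estimate of E from unit bearing vectors (one per row),
    solving q_i^T E p_i = 0 in a least-squares sense."""
    A = np.stack([np.kron(q, p) for p, q in zip(b1, b2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt       # enforce two equal singular values, one zero

def two_pass_essential(img1, img2, strict=1.6, loose=1.2, eps=1e-3):
    """Two-pass scheme: a sparse, reliable match set fixes an initial E, whose
    epipolar residual then filters a denser match set before the final fit."""
    to_bearings = lambda pts: np.array([pixel_to_bearing(u, v) for u, v in pts])
    p1, p2 = match_sift(img1, img2, strict)                    # pass 1: few, accurate matches
    E0 = eight_point_essential(to_bearings(p1), to_bearings(p2))
    p1, p2 = match_sift(img1, img2, loose)                     # pass 2: denser, noisier matches
    b1, b2 = to_bearings(p1), to_bearings(p2)
    keep = np.abs(np.einsum('ni,ij,nj->n', b2, E0, b1)) < eps  # epipolar outlier filter
    return eight_point_essential(b1[keep], b2[keep])
```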
The derived essential matrix was used to solve for
the external camera parameters R and T, which stand
for the rotation matrix and the translation vector, re-
spectively. Pairwise external camera parameters were
first determined and then integrated one by one to
obtain the global camera motion and thus the cam-
era’s moving trajectory. During the integration pro-
cess, the scene points from the already processed panoramic images were reconstructed with respect to the 3D world coordinate system and used as references to estimate the next camera location. An example illustrating an intermediate result of the reconstructed scene point cloud is given in Fig. 4. The major drawback of such a method is that the camera parameter estimation error propagates through the integration process. One way to correct such drift is a path-closing strategy, which, however, does not always work well. Moreover, identifying two or more panoramic images captured at the same street intersection but at different locations and times is a very
difficult problem. In order to deliver an efficient and relatively more accurate solution to this problem, we have chosen to incorporate GPS information. Since the accuracy of the GPS system varies from 1 to 5 meters, it is sensible to correct the trajectory drift every 50 meters based on the GPS readings. The final camera moving trajectory was determined by a series of bundle adjustments on the recovered camera locations.
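As a rough illustration of how the pairwise results can be integrated and anchored to GPS, the sketch below decomposes E, chains the relative poses with unit step length, and then fits a similarity transform to sparse GPS fixes. The paper's actual cheirality test, 50-meter drift correction, and bundle adjustment are more involved; the conventions and names here are our own assumptions.

```python
import numpy as np
import cv2

def relative_pose_candidates(E):
    """E factors into a rotation and a unit-length translation. OpenCV returns two
    rotation candidates and t (up to sign); the valid (R, t) among the four
    combinations is the one for which triangulated points lie in front of both
    cameras (cheirality test, omitted in this sketch)."""
    R1, R2, t = cv2.decomposeEssentialMat(E)
    return R1, R2, t

def chain_trajectory(pairwise_rt):
    """Integrate pairwise relative poses (R_rel, t_rel) into a global trajectory,
    assuming t_rel is the direction of the next camera expressed in the current
    camera's frame. The per-step scale is unknown from E alone, so every step is
    given unit length and the whole path is rescaled against GPS afterwards."""
    R, t = np.eye(3), np.zeros(3)
    positions = [t.copy()]
    for R_rel, t_rel in pairwise_rt:
        t = t + R @ np.ravel(t_rel)
        R = R @ R_rel
        positions.append(t.copy())
    return np.array(positions)

def align_to_gps(positions, gps_positions):
    """Scaled rigid (Umeyama) alignment of the estimated path to GPS fixes taken at
    the same frames; a crude stand-in for the bundle adjustment and periodic drift
    correction described in the text."""
    mp, mg = positions.mean(axis=0), gps_positions.mean(axis=0)
    p, g = positions - mp, gps_positions - mg
    U, S, Vt = np.linalg.svd(g.T @ p)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / (p ** 2).sum()
    return mg + scale * p @ R.T
```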
4 GENERATION OF BUILDING
TEXTURE MAPS
The second half of the framework deals with the generation of building texture maps from the source panoramic images. A building can be identified through the processes of image edge detection and line matching, supported by the provided building footprint boundaries on the aerial image. The use of texture maps not only enhances the visualization of the 3D models but also provides the height information of the buildings, which is needed in the building modeling process.

For each building shown in the aerial image, we are mainly interested in extracting the front elevation portion of the building from the panoramic image. Since a dense set of panoramic images was acquired, the same building appears in a number of successive panoramic images. Thus, it is essential to look for the source panoramic image that contains the largest projection region of the desired building.
This panoramic image, denoted by P_i, is obtained as follows:

i = \arg\min_{j \in \{1, 2, \dots, N\}} \mathrm{Dist}\left( (x_j, y_j),\ \left( \frac{x_s + x_e}{2},\ \frac{y_s + y_e}{2} \right) \right),
where the function Dist returns the distance between two coordinate locations on the plane (i.e., in the aerial image space), (x_j, y_j) represents the location of the j-th panoramic image, (x_s, y_s) and (x_e, y_e) represent the starting and ending points of the given building footprint edge, respectively, and N denotes the total number of candidate panoramic images.
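In other words, the chosen panorama is simply the one captured closest to the midpoint of the footprint edge. A minimal sketch of this selection, with hypothetical variable names, is given below.

```python
import numpy as np

def select_panorama(pano_xy, footprint_start, footprint_end):
    """Return the index of the panorama whose capture location (in aerial-image
    coordinates) is closest to the midpoint of a building footprint edge."""
    midpoint = (np.asarray(footprint_start, float) + np.asarray(footprint_end, float)) / 2.0
    distances = np.linalg.norm(np.asarray(pano_xy, float) - midpoint, axis=1)
    return int(np.argmin(distances))

# Example with three candidate panorama locations and one footprint edge
print(select_panorama([(10, 5), (20, 8), (32, 7)], (18, 30), (26, 30)))   # -> 1
```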
The building footprint line segment on the aerial image was projected onto the panoramic image. Let u_s and u_e denote the projections of the starting and ending points of the footprint edge, respectively. In order to identify the image portion that contains the front elevation view of the desired building, we first reduced the search space by defining a horizontal range [u_l, u_r], where u_l < u_r and u_l, u_r ∈ {1, 2, ..., W}.
We have

u_l = \min(u_s, u_e) - |u_s - u_e| / 4,
u_r = \max(u_s, u_e) + |u_s - u_e| / 4,
where W is the width of the panoramic image (in pixels), and H denotes the image height. We can further reduce the search space by defining a vertical range [v_t, v_b], where v_t < v_b and v_t, v_b ∈ {1, 2, ..., H}. These two boundary values can be obtained from the elevation of the camera location, denoted h, the shortest distance between the camera and the building, denoted d, and the maximum possible height of the building, denoted b. We have
v_b = H \times \left( \frac{\arctan(h/d)}{\pi} + \frac{1}{2} \right),
v_t = H \times \left( \frac{1}{2} - \frac{\arctan((b - h)/d)}{\pi} \right).
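A small sketch combining the horizontal and vertical bounds above into a single search window follows; the values are clipped to the image dimensions, and the example numbers are purely illustrative.

```python
import numpy as np

def search_region(u_s, u_e, h, d, b_max, width=2048, height=1024):
    """Bounds of the search window I_R following the formulas above.
    Horizontally: the projected footprint edge padded by a quarter of its
    length on each side. Vertically: from the building base (camera height h
    above the ground, horizontal distance d) up to the maximum plausible
    building height b_max. Results are clipped to the image dimensions."""
    pad = abs(u_s - u_e) / 4.0
    u_l = int(max(0, min(u_s, u_e) - pad))
    u_r = int(min(width - 1, max(u_s, u_e) + pad))
    v_b = int(min(height - 1, height * (np.arctan(h / d) / np.pi + 0.5)))
    v_t = int(max(0, height * (0.5 - np.arctan((b_max - h) / d) / np.pi)))
    return u_l, u_r, v_t, v_b

# Example: camera 2.5 m above the ground, building 8 m away, at most 20 m tall
print(search_region(900, 1200, 2.5, 8.0, 20.0))   # -> (825, 1275, 139, 610)
```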
The image region bounded by the top-left corner (u_l, v_t) and the bottom-right corner (u_r, v_b) is denoted by I_R. We applied Canny edge detection to region I_R and back-projected the resultant binary image onto a planar surface, denoted I_P. Then, the Hough transform was employed to detect straight lines. We imposed the constraints that the length of a straight line must be greater than half of the width of region I_R and that the angle between the straight line and the horizontal axis must be less than 45°. The set of detected straight lines potentially contains the desired building rooftop edge. Let S denote the number of resultant straight lines.
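A sketch of this edge and line detection step with OpenCV is shown below. The text uses the standard Hough transform; the probabilistic variant is used here only because it directly returns line segments whose length and angle can be tested, and all threshold values are illustrative.

```python
import numpy as np
import cv2

def detect_rooftop_candidates(region_gray):
    """Find candidate rooftop edges in the back-projected region I_P:
    Canny edges followed by probabilistic Hough lines, keeping only lines
    longer than half the region width and within 45 degrees of horizontal."""
    edges = cv2.Canny(region_gray, 50, 150)                  # thresholds are illustrative
    min_len = region_gray.shape[1] // 2
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                            minLineLength=min_len, maxLineGap=10)
    candidates = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = np.degrees(np.arctan2(abs(y2 - y1), abs(x2 - x1)))
            if angle < 45:                                   # nearly horizontal only
                candidates.append((x1, y1, x2, y2))
    return candidates
```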
The building footprint boundary on the aerial image, namely the one facing the camera, was projected onto the panoramic image for various b values within a reasonable building height range B, and at the same time back-projected onto I_P. A similarity test was then performed to count the number of overlapping pixels between each of the back-projected building footprint boundaries and the set of detected straight lines. The desired building rooftop edge, denoted l_m, is estimated as follows:
(m, r) = \arg\max_{n \in \{1, 2, \dots, S\},\ b \in B} \Big( \mathrm{Similarity}(l_n, f_b) + \mathrm{Length}(l_n) - \mathrm{Row}(b) \Big),
where l_n denotes the detected straight line with index n, f_b denotes the back-projected building footprint boundary with height value equal to b, and the function Row returns the average image row of f_b in I_P. The obtained value of r indicates the height of the building.
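The scoring step can be sketched as follows, taking the three terms of the formula literally (their relative weighting is not specified in the text). The helper project_footprint and the pixel-set representation of the detected lines are hypothetical, introduced only for illustration.

```python
import numpy as np

def pick_rooftop_edge(candidate_lines, project_footprint, heights):
    """Score every (detected line, hypothesised building height) pair:
    pixel overlap between the back-projected footprint edge and the line
    (the similarity term), plus the line length, minus the average image row
    of the projection, so that higher (top-most) hypotheses are favoured.
    candidate_lines: list of sets of (row, col) pixels, one per detected line.
    project_footprint(b): hypothetical callback returning the (row, col) pixels
    of the footprint edge back-projected at building height b."""
    best, best_score = None, -np.inf
    for b in heights:
        footprint_pixels = set(project_footprint(b))
        avg_row = np.mean([r for r, _ in footprint_pixels])
        for n, line_pixels in enumerate(candidate_lines):
            score = (len(footprint_pixels & set(line_pixels))
                     + len(line_pixels) - avg_row)
            if score > best_score:
                best, best_score = (n, b), score
    return best   # (index m of the rooftop line, estimated building height r)
```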
The building elevation view image was generated by first back-projecting the color panoramic image region I_R onto a planar surface, denoted I_Q. The resolution of the color image I_Q is identical to the resolution
of the binary image I_P.

Figure 5: The aerial view of the building of interest to be reconstructed. The red line segment indicates the building footprint boundary, and the red dot on the street indicates the camera location where Fig. 2 (right) was captured.

In general, the resultant texture image I_Q is not rectangular, whereas rectangular image textures of building elevation views are preferred for the texture mapping task. Therefore, image I_Q needed to be rectified by the perspective transformation provided in OpenCV before it was used for texture mapping.
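A minimal sketch of this rectification step with OpenCV is given below, assuming the four corners of the building face in I_Q are known (e.g., from the detected rooftop edge and the projected footprint); the output size is arbitrary here.

```python
import numpy as np
import cv2

def rectify_texture(texture_img, corners, out_w=512, out_h=768):
    """Warp the (generally non-rectangular) building face in I_Q into a rectangular
    texture map. `corners` are the four corners of the face in I_Q, ordered
    top-left, top-right, bottom-right, bottom-left; the output size is arbitrary."""
    src = np.float32(corners)
    dst = np.float32([[0, 0], [out_w - 1, 0],
                      [out_w - 1, out_h - 1], [0, out_h - 1]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(texture_img, H, (out_w, out_h))
```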
5 EXPERIMENTAL RESULTS
The program was written mainly in MATLAB and partially in C++. The experiments were performed on the Windows XP (Service Pack 3) operating system running on an Intel(R) Core(TM) i7 920 CPU at 2.67 GHz with 3 GB of RAM. A Point Grey Ladybug3 digital video camera was mounted on top of a car and used to capture the input panoramic images. The car was moving at an average speed of 45 kilometers per hour on the street, and the camera captured 15 panoramic images per second. In this way, adjacent panoramic images were captured at locations approximately one meter apart. The car was also equipped with a GPS receiver.
The input panoramic image resolution was 2048 (width) × 1024 (height) pixels. An example of a captured spherical image is shown in Fig. 2 (right), which has been transformed into a planar rectangular image. The aerial image of the area was obtained from Google Map as shown in Fig. 5, where the red line segment indicates the provided building footprint boundary and the red dot on the street indicates the corresponding camera location where Fig. 2 (right) was captured.
We recorded thousands of panoramic images in this way on different streets; however, no sufficiently accurate ground-truth data is available for evaluating the image experiments. As described in the previous sections, we aim to reconstruct a rough 3D street model, for which accuracy was not our major concern. Some reconstruction examples based on our approach are shown in Fig. 1.

Figure 6: Building texture generation example. (Top) The captured panoramic image; the red rectangle encloses the region that contains the desired building texture. (Left) The Canny edge detection result of the building region after back-projection; the image has been faded to emphasize the detected straight lines. (Middle) The generated building texture. (Right) The result after performing the perspective transformation.
A building texture generation example is given in Fig. 6. The original panoramic image is shown at the top. The red rectangle indicates the region I_R that contains the desired building texture, which is back-projected onto a planar surface I_P to perform straight line detection. Figure 6 (left) illustrates the Canny edge detection result of I_P in binary image format.
The Hough line detection algorithm was then applied to obtain a set of nearly horizontally oriented straight lines. The binary image has been faded to emphasize the detected straight lines (thin blue lines). Thick red lines indicate potential building rooftops. Finally, the top-most thick black line has been identified as the building rooftop edge. Part of the top region of the back-projected color panoramic image was cropped according to this identified building rooftop edge, and the result is shown in Fig. 6 (middle). The perspective transformation was then performed to obtain the rectangular building texture illustrated on the right of Fig. 6. Another building texture generation example is shown in Fig. 7, and the corresponding reconstruction result is shown in Fig. 1 (left).
Figure 7: Another texture rectification example. (a) The generated building texture, and (b) the result after performing the perspective transformation.
Figure 8: Camera trajectory recovery result of synthetic ex-
periment.
To evaluate the performance of the proposed camera trajectory recovery approach, we also conducted synthetic experiments. A 12 × 20 unit (the unit is the one used in the software) virtual street model was built in Maya, and all the buildings were texture-mapped with real images. A virtual camera was implemented to capture panoramic images in the virtual world. For the experiment illustrated in Fig. 8, 50 panoramic images were generated at the locations indicated by green dots. The estimated camera path is represented by a set of blue circles. The average drift of the estimated camera path from the actual path is 0.324 units.
6 CONCLUSIONS
A 3D street modeling framework was proposed which takes two types of images as input sources, namely a set of spherical panoramic images captured along a continuous path and an orthogonal aerial image of the area. The relative orientations and positions of all the panoramic images can be recovered through a fully automatic process with the help of sparse GPS data. The footprints of the major buildings to be reconstructed in the aerial image are assumed to be given. The developed program is able to automatically estimate the height information and generate a rectangular front-view texture image of each building for large-scale 3D city modeling.
ACKNOWLEDGEMENTS
This project was financially sponsored by the National Science Council (grant no. NSC 100-2221-E-197-028). Some preliminary experiments were financially sponsored by the MOEA (Ministry of Economic Affairs) project no. 98-ec-17-A-02-01-00809. The authors would like to express special thanks to Akihiko Torii for useful discussions and to Jui-Yang Tsai for image acquisition.
REFERENCES
Fruh, C. and Zakhor, A. (2004). An automated method for large-scale, ground-based city model acquisition. International Journal of Computer Vision, 60:5–24.
Gruen, A. (1997). Automation in building reconstruction.
In Proc. Photogrammetric Week97, pages 175–186,
Wichmann Verlag, Heidelberg.
Haala, N. and Brenner, C. (1998). Fast production of virtual reality city models. In Proc. IAPRS.
Hu, J., You, S., and Neumann, U. (2006). Automatic pose
recovery for high-quality textures generation. In Proc.
ICPR, pages 561–565, Hong Kong, China.
Kang, S.-B., Szeliski, R., and Uyttendaele, M. (2004). Seamless stitching using multi-perspective plane sweep. Microsoft Research technical report MSR-TR-2001-48.
Liu, L., Stamos, I., Yu, G., Wolberg, G., and Zokai, S.
(2006). Multiview geometry for texture mapping 2d
images onto 3d range data. In Proc. CVPR, pages
2293–2300, Washington, DC, USA.
Maas, H.-G. (2001). The suitability of airborne laser scan-
ner data for automatic 3d object reconstruction. In
Third International Workshop on Automatic Extrac-
tion of Man-Made Objects from Aerial and Space Im-
ages, pages 10–15, Monte Verita, Ascona, Switzer-
land.
Stamos, I. and Allen, P. K. (2002). Geometry and texture
recovery of scenes of large scale. Comput. Vis. Image
Underst., 88:94–118.
Vestri, C. and Devernay, F. (2001). Using robust methods
for automatic extraction of buildings. In Proc. CVPR,
pages 133–138, Kauai, HI, USA.
Wang, L., You, S., and Neumann, U. (2007). Semiautomatic registration between ground-level panoramas and an orthorectified aerial image for building modeling. In Proc. VRML'07 Workshop at ICCV'07, pages 8–15, Rio de Janeiro, Brazil.