Generation of 3D Building Models from City Area Maps

Roman Martel, Chaoqun Dong, Kan Chen, Henry Johan and Marius Erdt

Fraunhofer Singapore, Singapore

Nanyang Technological University, Fraunhofer IDM@NTU, Singapore

Keywords:

Computer Vision, 3D Building Models, City Area Maps, Text Recognition, Deep Learning.

Abstract:

In this paper, we propose a pipeline that converts buildings described in city area maps to 3D models in the CityGML LOD1 standard. The input documents are scanned city area maps provided by a city authority. The maps were recorded and stored over a long time period, which poses several challenges for the pipeline, such as different typewriter font styles, handwriting by different persons, varying layouts, low contrast, damage and scanning artifacts. The novel and distinguishing aspect of our approach is its ability to deal with these challenges. In the pipeline, we first identify and analyse text boxes within the city area maps to extract information such as the height and location of the buildings they describe. Second, we extract the building shapes based on these locations from an online city map API. Finally, using the extracted building shapes and heights, we generate 3D models of the buildings.

1 INTRODUCTION

A 3D model of a city is a valuable tool for city authorities to plan new construction projects. In the past, city planning was done using hand drawings and typewriters. Therefore, for many older city areas no digital data is available that can be used by 3D modeling and rendering tools. Creating 3D models of buildings in these areas is usually done manually, which is time-consuming and laborious. In this paper, we propose a pipeline to generate 3D building models automatically based on archived city area maps provided by city authorities.

Our input city area maps contain streets, smaller overview maps and buildings with detailed shapes. Some of these buildings are marked. For the marked buildings, several text boxes distributed around the map provide additional information. The input in Figure 1 shows an example city area map. These maps pose several challenges for automatic document analysis. As they are often maintained over a long time period, they exhibit different typewriter font styles and many different kinds of handwriting. Furthermore, the general layout of the documents varies as well. For example, the text boxes with additional building information can appear at arbitrary locations. In addition, long storage in archives may lead to low contrast and damage. Scanning the city area maps to make them available for digital processing can introduce additional artifacts.

The novelty of our approach lies in its robustness to these challenges. The output of our pipeline is a list of 3D models for all the marked buildings in the city area maps. We use the CityGML standard (Gröger et al., 2012) with level of detail (LOD) 1 to store our models. LOD1 is used for simple building models without details, which are nevertheless sufficient to get an overview of a city area. The output in Figure 1 shows an example of CityGML LOD1 models. Whenever we use the term 3D model, we mean a model according to this standard.

This paper is structured as follows. Section 2 summarizes related work. Section 3 gives an overview of the proposed pipeline. Subsequently, Sections 4, 5 and 6 describe the different parts of the pipeline in detail. Finally, Section 7 summarizes our results and outlines future work.

2 RELATED WORK

Previous work related to the analysis of city area maps focused on the analysis of floor plans of single buildings (Ahmed et al., 2011) or the automatic detection of rooms in floor plans (Ahmed et al., 2012). In the context of 3D building model generation, Sugihara et al. (2015) proposed a pipeline to automatically generate models based on polygons that were manually marked on a 2D digital map beforehand. Biljecki et al. (2017) also investigated the generation of building models with a low level of detail. Their approach uses building footprints, additional information about the building and statistical data to deduce the height and shape of the buildings. A variety of approaches have been developed for the reconstruction of very detailed building models based on photo and laser scanning data. An overview can be found in (Fritsch and Klein, 2018). Furthermore, the CityGML standard (Gröger et al., 2012), used to store the resulting 3D models of our pipeline, is itself the focus of ongoing research (Biljecki et al., 2016).

Martel, R., Dong, C., Chen, K., Johan, H. and Erdt, M.

Generation of 3D Building Models from City Area Maps.

DOI: 10.5220/0007554905690576

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 569-576

ISBN: 978-989-758-354-4

Figure 1: Overview of the proposed pipeline. The input is a city area map with text boxes describing several marked buildings. First, the content of the text boxes is analysed and information such as location and height is extracted for all marked buildings. Afterwards, using the locations, the shapes of the buildings are extracted from an online city map API. Finally, 3D models are generated based on the shape and height information.

The text analysis methods used in Section 4 are related to previous work in text detection and recognition. Comprehensive surveys of this area can be found in (Ye and Doermann, 2015) and (Zhu et al., 2016). The most recent approaches that can deal with challenging scenarios are based on deep neural networks (Jaderberg et al., 2016), (Bartz et al., 2017a), (Zhou et al., 2017), (Liu et al., 2018). They exhibit impressive results in terms of accuracy and robustness to variations of image quality, illumination, text fonts, rotations and distortions. However, they also depend heavily on large image data sets with appropriate labeling. The previous approaches most relevant to ours rely on manually designed features like extremal regions (Neumann and Matas, 2010), (Neumann and Matas, 2012) or stroke width measures (Epshtein et al., 2010). They first seek character candidates using the features and then rely on post-processing and classifiers to filter out the true characters. Due to the unique character of the input city area maps and the resulting lack of appropriately labeled data, we chose an approach similar to (Neumann and Matas, 2010) for the text analysis part of our pipeline. We also extract maximally stable extremal regions (MSER) and use post-processing and a classifier to detect true characters. An additional argument for this approach is the availability of an open-source implementation of the MSER extraction (Bradski, 2000).

3 OVERVIEW OF THE PROPOSED PIPELINE

We propose a pipeline that converts buildings in a city area map to 3D models in the CityGML LOD1 standard. An overview of this pipeline is shown in Figure 1. The input consists of scanned versions of archived city area maps provided by city authorities. Among other things, these maps contain several marked buildings which are described in more detail in several text boxes. We assume that these text boxes contain information about the location and height of the marked buildings. The provided city area maps vary widely in quality, style and layout. We therefore invested considerable effort in making the proposed pipeline robust to this varied input. Our approach can be divided into three parts. For all marked buildings in the city area maps, we

1. use text analysis to extract the location and height from the text boxes describing the building (Section 4).

2. search the location with an online city map API and extract the shape from the resulting city area image (Section 5).

3. generate a 3D model using the extracted shape and height (Section 6).

4 BUILDING INFORMATION EXTRACTION

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

(a) Result of finding text character candidate regions using MSER (Matas et al., 2002). Each detected region is marked with a box.

(b) Text vs non-text classification result. The classifier correctly finds most of the text boxes while excluding non-text elements.

(c) Result of extracting the regions classified as text characters.

Figure 2: Example for the text extraction for a segmented text box of a city area map.

This section describes the extraction of the location and height information of buildings in the city area maps. The input consists of the previously described city area maps. Several text boxes, distributed at arbitrary locations in the maps, provide additional information for the marked buildings. However, these text boxes contain many non-text elements as well. This makes it impossible to process them with Optical Character Recognition (OCR) software directly. The non-text elements can be of arbitrary nature. Typical examples are drawings, diagrams, grid cells, separation lines, noise and artifacts due to the scanning process.

Our approach therefore contains several steps:

1. Segmenting the text boxes in the input maps (Section 4.1)

2. Extracting candidates for characters in the text boxes (Section 4.2)

3. Classifying these candidates into text or non-text (Section 4.3)

4. Recognizing the content of the characters classified as text (Section 4.4)

4.1 City Area Map Segmentation

Figure 3: Example for a city area map segmentation result.

The segments mainly align with the text box borders. The

segmentation of the map area is irrelevant because the target

information is contained in the text boxes only.

We segment the full city area map into smaller parts containing the text boxes and fragments of the map. Our method uses simple image statistics. We extract long horizontal and vertical lines from the map using a rank filter (Soille, 2002). This results in an image containing mainly the separating lines of the text boxes. The city area maps have been binarized during the scanning process. Therefore, to find the location of the lines separating the boxes, we simply count the black pixels per row and per column. We smooth the resulting pixel count curves to make our approach robust to noise and small rotations in the documents. Finally, we create a histogram of all pixel count maxima. The maxima whose count values fall in the highest p percentile are considered to be separating lines, with p being an empirically determined hyperparameter of our approach. The result of this method applied to a typical city area map is shown in Figure 3.
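The counting, smoothing and percentile selection described above can be illustrated with a minimal sketch. It assumes the rank-filtered map is given as a nested list with 1 for black pixels, uses a moving average as the smoothing step, and interprets "highest p percentile" as keeping the top p percent of the maxima. The function names and these choices are illustrative assumptions, not the implementation used in the paper.

```python
from typing import List


def projection_profile(binary: List[List[int]], axis: int) -> List[int]:
    """Count black pixels (value 1) per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in binary]
    return [sum(col) for col in zip(*binary)]


def smooth(values: List[int], radius: int = 1) -> List[float]:
    """Moving average; the window is clipped at the borders (assumed smoothing)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def local_maxima(values: List[float]) -> List[int]:
    """Indices greater than the left neighbour and not smaller than the right one."""
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if v > left and v >= right:
            peaks.append(i)
    return peaks


def separator_positions(binary: List[List[int]], axis: int,
                        p: float = 50.0, radius: int = 1) -> List[int]:
    """Keep the maxima whose smoothed count lies in the highest p percent."""
    profile = smooth(projection_profile(binary, axis), radius)
    peaks = local_maxima(profile)
    if not peaks:
        return []
    ranked = sorted(peaks, key=lambda i: profile[i], reverse=True)
    keep = max(1, round(len(ranked) * p / 100.0))
    return sorted(ranked[:keep])
```

Calling `separator_positions` once with `axis=0` and once with `axis=1` yields candidate row and column positions at which the map could be cut into segments.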

4.2 Character Candidate Extraction

After the image segmentation, we continue processing each segmented text box separately. To obtain possible character candidates, we first apply a median filter to denoise the text box. Afterwards, we extract the candidates using the maximally stable extremal regions (MSER) algorithm (Matas et al., 2002). The output of this method is a set of regions surrounding possible characters. Figure 2a shows an example where each detected region is marked with a box. These regions are then extracted, resulting in an image containing only the parts within the regions.
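As an illustration of the denoising step only, the following sketch applies a square median filter to an image stored as a nested list. The window radius and the border handling (clipping the window) are assumptions, since the paper does not state the filter size; the MSER extraction itself is not reproduced here, as the paper relies on an existing open-source implementation for it.

```python
from statistics import median_low
from typing import List


def median_filter(image: List[List[int]], radius: int = 1) -> List[List[int]]:
    """Replace every pixel by the median of its (2*radius+1)^2 neighbourhood.

    At the image border the window is clipped; median_low keeps the result
    an actual pixel value even when the clipped window has an even size.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = median_low(window)
    return out
```

An isolated speck of one pixel disappears under this filter, while larger strokes survive, which is the property that makes it useful before searching for character candidates.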

4.3 Text vs. Non-text Classiﬁcation

Identifying which character candidates are real alphanumeric characters is a typical binary classification problem. To solve it, we experimented with several machine learning models, ranging from Logistic Regression and Support Vector Machines to Convolutional Neural Networks (CNN). To train these models, however, we first needed to create a data set. We therefore took a subset of all character candidates and manually labeled them as alphanumeric or not. The result was a data set of 12602 images, of which 54% represented real alphanumeric characters. Afterwards, we trained the different models on this data set. So far, we have achieved the best classification results on a hold-out test set with a simple VGG-like CNN (Simonyan and Zisserman, 2014). An example of applying this classifier to a text box containing the extracted character candidate regions can be seen in Figure 2b. The boxes mark characters identified as alphanumeric. Extracting only the regions of the characters within these boxes yields Figure 2c.

Figure 4: Image retrieved from the new OneMap Static Map API by the center coordinates of Block No.113 (NewOneMap, 2018).

Figure 5: Thresholding result with a contiguous noisy object indicated by a red rectangle.

Figure 6: Extracted shape of the target building.

Figure 7: Result of the 3D model generation for the shape shown in Figure 6.

4.4 Text Recognition

The resulting text boxes, which contain only the plain text, can finally be used as input for Optical Character Recognition (OCR) software. There are several OCR tools designed for this task (Smith, 2007), (Breuel, 2008). Furthermore, there are deep learning approaches which could also be utilized (Shi et al., 2015), (Bartz et al., 2017b). As we deal with texts containing several different font styles and handwriting, we need to fine-tune a model for an OCR software or deep learning architecture to our data. This is still work in progress and the only missing part of the pipeline.

The result of these tools is the text content of the text boxes in the city area maps. This content can then be processed by standard text search engines. The location and height information in the provided city area maps is indicated consistently across the whole data set by the same keywords. One can therefore search the text content for these keywords to find the location and height of each described building. This serves as input for the following parts of the pipeline.
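A keyword search of this kind could look like the following sketch. The keywords `BLOCK NO.` and `HEIGHT`, the unit `M` and the function name are hypothetical placeholders chosen for illustration; the actual keywords used in the city area maps are not stated in this paper.

```python
import re
from typing import Dict, Optional

# Hypothetical keywords; the real maps use their own consistent labels.
LOCATION_PATTERN = re.compile(r"BLOCK\s*NO\.?\s*(\d+[A-Z]?)", re.IGNORECASE)
HEIGHT_PATTERN = re.compile(r"HEIGHT\s*[:=]?\s*(\d+(?:\.\d+)?)\s*M\b", re.IGNORECASE)


def parse_building_info(text: str) -> Dict[str, Optional[object]]:
    """Search recognized text for the location and height keywords."""
    location = LOCATION_PATTERN.search(text)
    height = HEIGHT_PATTERN.search(text)
    return {
        "block": location.group(1) if location else None,
        "height_m": float(height.group(1)) if height else None,
    }
```

Missing values are returned as `None`, so that a later stage can skip buildings for which the OCR output did not contain both pieces of information.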

5 BUILDING SHAPE AND SIZE EXTRACTION

5.1 Shape Extraction

To generate the 3D model of a target building whose location and height have been obtained from text extraction and recognition as described above, the next step is to obtain the building shape and size. However, processing a city area map such as the one shown in Figure 3 may lead to an inaccurate building shape and size. This is because some of the old maps were drawn by hand and afterwards digitized by scanning. Hence, inaccuracies due to the limited precision of human drawing, the thickness of the pen and other undesired noise are inevitable. Furthermore, dealing with such complicated noisy images generally means longer computation time.

For these reasons, instead of using the original archived city area map, our method extracts a more accurate building shape and size in a faster way by leveraging an online API, namely the new OneMap API. Specifically, we used the new OneMap Search API and Static Map API (NewOneMap, 2018). The Search API returns the coordinates of a building based on its address. The Static Map API returns an image of a map section based on predefined parameters such as image center coordinates, zoom level and image size. Figure 4 shows an example image retrieved by the API based on the center coordinates of a certain building (Block No.113, which is the target building in this case).

To obtain as much detail as possible, we use the maximum zoom level and the largest image size provided by the new OneMap Static Map API. Only a grayscale version of this image is processed later. Working with such an image leads to a much more accurate building shape and scale as well as a much shorter processing time compared to working on a noisy and possibly damaged archived plan.

Figure 8: Image of a large neighborhood retrieved from the new OneMap Static Map API (NewOneMap, 2018).

In the retrieved images of the new OneMap Static Map API, all buildings of interest, which are the public housing buildings, have the same grayscale value. Hence, one simple thresholding operation is able to segment all of them successfully. We can then obtain the boundary shape of the target building based on the contour characteristics of all the objects left after thresholding. First, we remove all objects whose contour area is small. The main purpose of this step is to get rid of contiguous objects around the target building. According to the original city area map, these objects are surrounding constructions which are not part of the target building. Figure 5 marks such an object among the segmented building candidates with a red box. Additionally, small building parts cropped along the image boundary and other noise sharing the same grayscale value (the small dots below Block No.113 in the green boxes in Figure 5) are eliminated as well. Second, among all the remaining objects, we pick the one whose contour centroid is closest to the image center as the target building. As mentioned before, the center coordinates of the target building are returned by the new OneMap Search API and then passed to the new OneMap Static Map API. The Static Map API uses these coordinates as the center of the generated image. Therefore, the target building centroid and the image center overlap in most cases in which the building has a regular shape. Finally, the shape of the target building is extracted based on the detected contour. The result is shown in Figure 6.
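The selection logic (threshold, discard small objects, pick the object closest to the image center) can be sketched as follows. This is a simplified stand-in under stated assumptions: it works on a nested list of grayscale values, uses 4-connected pixel components with pixel counts and pixel centroids in place of contour areas and contour centroids, and the parameters `building_value` and `min_area` are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Pixel = Tuple[int, int]


def connected_components(mask: List[List[bool]]) -> List[List[Pixel]]:
    """4-connected components of the True pixels, found by breadth-first search."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def select_target_building(gray: List[List[int]], building_value: int,
                           min_area: int) -> Optional[List[Pixel]]:
    """Threshold, drop small objects, return the object nearest the image center."""
    height, width = len(gray), len(gray[0])
    mask = [[value == building_value for value in row] for row in gray]
    center_y, center_x = (height - 1) / 2.0, (width - 1) / 2.0
    best, best_dist = None, float("inf")
    for pixels in connected_components(mask):
        if len(pixels) < min_area:  # small objects and noise are discarded
            continue
        cy = sum(p[0] for p in pixels) / len(pixels)
        cx = sum(p[1] for p in pixels) / len(pixels)
        dist = (cy - center_y) ** 2 + (cx - center_x) ** 2
        if dist < best_dist:
            best, best_dist = pixels, dist
    return best
```

The function returns `None` when no object passes the area filter, which gives the caller a clear signal that the shape extraction failed for this building.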

Figure 9: Extracted shapes of the target buildings in a large

neighborhood.

5.2 Size Computation

Since a fixed zoom level is used in the new OneMap Static Map API for generating the images, we can compute the actual size of the building based on the scale corresponding to this zoom level. Later, by combining the computed actual size of the building footprint with the building height extracted in the text analysis part, we are able to generate 3D building models with correct proportions. The advantage of this method is that the scale is the same for all buildings as long as the zoom level is unchanged, as in our case. This is extremely useful for creating 3D models of a large neighborhood. In contrast, for the archived city maps, the scale used for drawing may change from map to map. When creating 3D models for all buildings in a large area or even a whole city, several maps with different scales may be used. As a consequence, extra effort would be needed to align the scales between different maps if we extracted the building shapes and sizes directly from the archived city area maps.
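As a minimal sketch of this conversion, the following code scales a pixel contour to metres and computes the footprint area with the shoelace formula. The parameter `meters_per_pixel` stands for the scale of the fixed zoom level; its value is not given in this paper, so the number used in any call is an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_meters(contour_px: List[Point], meters_per_pixel: float) -> List[Point]:
    """Scale pixel coordinates by the (assumed known) scale of the zoom level."""
    return [(x * meters_per_pixel, y * meters_per_pixel) for x, y in contour_px]


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0
```

Because the same `meters_per_pixel` applies to every image retrieved at the fixed zoom level, footprints from different images can be compared and combined without any further alignment.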

6 3D MODEL GENERATION

Based on the shape and size information from Section 5, we first refine the extracted shape of a building footprint using image processing operations (dilation, erosion and contour retrieval). We then extrude it to 3D using the height information from Section 4. In this way, we can generate 3D models for all buildings described in the input city area map. An example of a resulting 3D model is shown in Figure 7.

Figure 10: A view of 3D models of all target buildings from the neighborhood in Figure 8.

Figure 11: Another view of 3D models of all target buildings from the neighborhood in Figure 8.

Figure 8 shows a large neighborhood retrieved from the new OneMap Static Map API and Figure 9 shows the shapes of all the target buildings in this neighborhood extracted by our method. As described above, with the size and height information we can generate 3D models of these buildings with correct proportions. Figure 10 and Figure 11 show the 3D building models in such a large neighborhood from two different views.
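The extrusion step can be illustrated with a short sketch that turns a footprint polygon and a height into the vertices and faces of a prism, which is the kind of block model LOD1 describes. It assumes the footprint is a simple polygon given in counter-clockwise order; the function name is illustrative, and serialising the result to CityGML is omitted.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def extrude_footprint(footprint: List[Point2D],
                      height: float) -> Tuple[List[Point3D], List[List[int]]]:
    """Extrude a counter-clockwise footprint into a closed prism.

    Returns the vertex list (ground ring followed by roof ring) and the
    faces as lists of vertex indices: ground, roof, then one wall per edge.
    """
    n = len(footprint)
    vertices = ([(x, y, 0.0) for x, y in footprint]
                + [(x, y, height) for x, y in footprint])
    ground = list(range(n - 1, -1, -1))  # reversed so the face points downwards
    roof = list(range(n, 2 * n))
    walls = [[i, (i + 1) % n, n + (i + 1) % n, n + i] for i in range(n)]
    return vertices, [ground, roof] + walls
```

For a footprint with n corners this yields 2n vertices and n + 2 faces, so a rectangular building becomes a box with 8 vertices and 6 faces.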

7 CONCLUSIONS AND FUTURE WORK

We have proposed a pipeline to automatically generate 3D building models based on archived city area maps. The implementation of the pipeline is not yet complete, as the text recognition part described in Section 4.4 is still in development. We have, however, shown results for the automatic extraction of text from the city area maps such that it can serve as input for typical OCR tools. In doing so, we showed how to deal with the challenges of the input maps. We demonstrated how to segment text boxes from non-text areas and how to separate text characters from other elements like noise or drawings. Furthermore, assuming that the location and height information has been found, we showed how to extract the shape and size of buildings using the new OneMap APIs. Finally, we presented results for generating 3D building models using the height, shape and size information.

The novelty and challenge of our approach is that it leverages generally available data sources which are, however, not primarily designed for the generation of 3D building models. The building information is obtained from scanned analog maps and the shape is deduced from an online city map which contains much content that is irrelevant for the task. In contrast, the approach in (Sugihara et al., 2015) relies on additional polygons added to a digital map. Building information such as the number of stories is also provided digitally for each polygon. In (Biljecki et al., 2017), plans containing solely building footprints and a digital database with additional information about each building are used. Therefore, in contrast to our approach, neither method requires text detection and recognition, and both are able to deliver more detailed results. The building models in (Sugihara et al., 2015) have a higher level of detail because of the detailed annotations in the polygons, and in (Biljecki et al., 2017) the height estimation is more precise due to the combination of several data sources.

A limitation of the building information extraction is that some characters like 'i', 't', 'j', 'l' and '1' are difficult to differentiate from the noise in the maps. As a consequence, the current classifier tends to exclude these characters, which are therefore often missing in the extracted text. We plan to fix this issue by adding more examples of these cases to our training data set. Besides that, to complete the implementation of the pipeline, we are still working on fine-tuning an OCR model to recognize the extracted text. To do this, we need to create a second data set containing the extracted words and sentences and manually label them with the correct text content.

Figure 12: Image retrieved with annotation on top of target building (Block No.243) boundary (NewOneMap, 2018).

Figure 13: Extraction of distorted target building shape.

For the shape extraction part, one of the current major problems is that some annotations may lie on top of the building boundary. In that case, the building shape retrieved by the shape extraction is distorted. In Figure 12, the block number overlaps with the building boundary of Block No.243. Figure 13 shows the resulting failure case of the shape extraction. We plan to add post-processing to refine the extracted shape in such situations. Also, even though the maximum zoom level of the new OneMap Static Map API is used for generating the image to be processed, its resolution is still somewhat too low, so some artifacts may be introduced along the building boundary. We intend to add image super-resolution or vectorization steps to overcome this problem.

ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under the Virtual Singapore Programme.

Furthermore, we gratefully thank the Housing & Development Board (HDB) for providing us with the archived neighborhood maps.

We also want to express our appreciation to new OneMap for providing us with their APIs to retrieve their map images.

REFERENCES

Ahmed, S., Liwicki, M., Weber, M., and Dengel, A. (2011). Improved automatic analysis of architectural floor plans. In 2011 International Conference on Document Analysis and Recognition, pages 864–869.

Ahmed, S., Liwicki, M., Weber, M., and Dengel, A. (2012). Automatic room detection and room labeling from architectural floor plans. In 2012 10th IAPR International Workshop on Document Analysis Systems, pages 339–343.

Bartz, C., Yang, H., and Meinel, C. (2017a). See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404.

Bartz, C., Yang, H., and Meinel, C. (2017b). STN-OCR: A single neural network for text detection and text recognition. CoRR, abs/1707.08831.

Biljecki, F., Ledoux, H., and Stoter, J. (2016). An improved LOD specification for 3D building models. Computers, Environment and Urban Systems, 59:25–37.

Biljecki, F., Ledoux, H., and Stoter, J. (2017). Generating 3D city models without elevation data. Computers, Environment and Urban Systems, 64:1–18.

Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools.

Breuel, T. M. (2008). The OCRopus open source OCR system.

Epshtein, B., Ofek, E., and Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE.

Fritsch, D. and Klein, M. (2018). 3D preservation of buildings – reconstructing the past. Multimedia Tools and Applications, 77(7):9153–9170.

Gröger, G., Kolbe, T., Nagel, C., and Häfele, K. (2012). OGC city geography markup language (CityGML) encoding standard, version 2.0, OGC doc no. 12-019. Open Geospatial Consortium.

Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20.

Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., and Yan, J. (2018). FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5676–5685.

Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 36.1–36.10. BMVA Press. doi:10.5244/C.16.36.

Neumann, L. and Matas, J. (2010). A method for text localization and recognition in real-world images. In Asian Conference on Computer Vision, pages 770–783. Springer.

Neumann, L. and Matas, J. (2012). Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3538–3545. IEEE.

NewOneMap (2018). New OneMap Search API, https://docs.onemap.sg/#search. New OneMap Static Map API, https://docs.onemap.sg/#static-map. Contains information from new OneMap accessed on 13th November 2018 from new OneMap Static Map API, which is made available under the terms of the Singapore Open Data Licence version 1.0: https://www.onemap.sg/legal/opendatalicence.html.

Shi, B., Bai, X., and Yao, C. (2015). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, abs/1507.05717.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Smith, R. (2007). An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02, ICDAR '07, pages 629–633, Washington, DC, USA. IEEE Computer Society.

Soille, P. (2002). On morphological operators based on rank filters. Pattern Recognition, 35(2):527–535.

Sugihara, K., Murase, T., and Zhou, X. (2015). Automatic generation of 3D building models from building polygons on digital maps. In 2015 International Conference on 3D Imaging (IC3D), pages 1–9.

Ye, Q. and Doermann, D. (2015). Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1480–1500.

Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., and Liang, J. (2017). EAST: an efficient and accurate scene text detector. CoRR, abs/1704.03155.

Zhu, Y., Yao, C., and Bai, X. (2016). Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1):19–36.
