Polygonizing Roof Segments from High-Resolution Aerial Images Using

Yolov8-Based Edge Detection

Qipeng Mei^{1,a}, Dimitri Bulatov^{2,b} and Dorota Iwaszczuk^{1,c}

^1 Remote Sensing and Image Analysis, Department of Civil and Environmental Engineering Sciences, Technical University of Darmstadt, Franziska-Braun-Str. 7, 64287 Darmstadt, Germany

^2 Fraunhofer IOSB Ettlingen, Gutleuthausstrasse 1, 76275 Ettlingen, Germany

Keywords:

Vectorization, Face, Extraction, Urban, Building, Object, Detection, Structure.

Abstract:

This study presents a novel approach for roof detail extraction and vectorization using remote sensing images.

Unlike previous geometric-primitive-based methods that rely on the detection of corners, our method focuses

on edge detection as the primary mechanism for roof reconstruction, while utilizing geometric relationships to

deﬁne corners and faces. We adapt the YOLOv8 OBB model, originally designed for rotated object detection,

to extract roof edges effectively. Our method demonstrates robustness against noise and occlusion, leading

to precise vectorized representations of building roofs. Experiments conducted on the SGA and Melville

datasets highlight the method’s effectiveness. At the raster level, our model outperforms the state-of-the-

art foundation segmentation model (SAM), achieving a mIoU between 0.85 and 1 for most samples and an

ovIoU close to 0.97. At the vector level, evaluation using the Hausdorff distance, PolyS metric, and our

raster-vector-metric demonstrates signiﬁcant improvements after polygonization, with a close approximation

to the reference data. The method successfully handles diverse roof structures and refines edge gaps, even on

complex roof structures from a new dataset that was excluded from training. Our findings underscore the potential of this

approach to address challenges in automatic roof structure vectorization, supporting various applications such

as urban terrain reconstruction.

1 MOTIVATION

From traditional Chinese pavilions to Mexican pyramids, and from the skylines of US megacities to the residential areas of Australian cities, building roofs are fascinating products of human creativity. Humans are motivated to be creative because a roof is the culmination of what some people call “home”, of what for others is the workplace that gives them self-realization, or of a public place that must be presentable when viewed from afar. By contemplating a building roof, we can easily discern its constituent elements, detect the important corners and edges defining its structure, and infer the relations between these components. This ability to perceive structures from images is a fundamental aspect of human vision. However, retrieving very complex topological structures still presents a significant challenge for computer vision algorithms (Lin et al., 2024). This challenge hampers the automatic creation of digital representations of large building databases from aerial images, posing an obstacle to the numerous applications that require detailed knowledge of building roof geometry. Some applications require 3D information, and the required output is then, consistently, the 3D geometry of roofs. Examples are environmental science, planning of mobile communication networks, real estate marketing, and virtual tourism, but also quick response applications (Bulatov et al., 2014). Alternatively, there are purely 2D applications, which only require images: roof panels for solar modules (House et al., 2018), damage grading after natural disasters (Lucks et al., 2019), and, to a certain degree, urban planning.

a https://orcid.org/0009-0006-0506-4478

b https://orcid.org/0000-0002-0560-2591

c https://orcid.org/0000-0002-5969-8533

These applications are particularly interesting for us because high-resolution aerial images are easily available nowadays, whereas manual digitization of building roof segments is costly. Accordingly, scholars have made huge progress in this field in the last few years. In what follows, we review the literature to identify the most promising tools as well as certain shortcomings of related work, which allowed us to propose a simple but efficient method to process buildings with complex roof graphs.

Mei, Q., Bulatov, D. and Iwaszczuk, D.

Polygonizing Roof Segments from High-Resolution Aerial Images Using Yolov8-Based Edge Detection.

DOI: 10.5220/0013130400003912

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2025) - Volume 3: VISAPP, pages 39-49

ISBN: 978-989-758-728-3; ISSN: 2184-4321

1.1 Related Work

The concept of combining color region attributes and grouping them to retrieve building roof details from images was introduced by Henricsson (1998) more than a quarter-century ago. However, irrelevant objects or disturbances, such as solar panels, chimneys, and shadows, whose spectral characteristics differ from those of the main roof, negatively affect the computation of the region attributes. At least until the recent appearance of advanced foundation models like SAM (Kirillov et al., 2023), this was the reason why Henricsson’s brilliant research did not prevail. Instead, in the absence of 3D data, scholars mostly concentrated on building outlining, a task successfully performed using conventional (Zhang, 1999; Turker and Koc-San, 2015) and deep-learning-based approaches (Wei et al., 2019; Zorzi et al., 2022). Unlike the vector representation of building outlines, which primarily deals with external contours, roof vectorization requires consideration of more complex topological structures, which presents a significant challenge. To address this issue, the prevailing approach in current research has moved from color-based approaches towards retrieving geometric primitives and reconstructing the roof through their topological relationships. Traditional geometric primitive detection methods, such as Harris corner detection (Harris et al., 1988) and Canny edge detection (Canny, 1986), are simple and robust, but the complexity of surface scenes and the impact of noise can result in the detection of geometric primitives associated with non-roof features as well as in the omission or misidentification of roof-related primitives.

To grasp structures beyond local image gradient information, researchers have turned to deep learning approaches. Nauata and Furukawa (2020) were probably the first to use Convolutional Neural Networks (CNNs) to detect geometric primitives in aerial images. By combining these detections with integer programming, they inferred the relationships between the primitives and were thus able to assemble them into a cohesive planar graph. In a similar vein, Zhang et al. (2020) employed the Convolutional Message Passing Network (Conv-MPN) architecture to reconstruct roof structures. Basically, there are two networks, one for junction detection and one for establishing adjacency relationships. The method of Hensel et al. (2021) relies on the PPGNet deep neural network (Zhang et al., 2019), which is end-to-end trainable and comprises modules for junction detection, line segment assignment, and adjacency matrix inference. The authors assess the weights of the different losses and evaluate their results edge-wise. After the so-called DSM refinement step, which is essentially supposed to detect buildings and suppress vegetation, Wang et al. (2021) use a similar style transfer technique to detect rasterized roof corners and edges. These primitives are further refined, slightly regularized, and employed to detect roof faces using a graph search algorithm. For each building, an undirected graph is built from the obtained edges. A graph cycle, which is a roof polygon or a union of such polygons, can be detected using a depth-first search. After all the cycles have been detected, large cycles that cover small cycles are removed to avoid overlapping faces. This may work well for simple buildings; however, choosing a minimum cycle basis is an exponential problem, and geometrical and graph-theoretical considerations need to be combined to solve this problem efficiently for complex buildings.

Another remote-sensing-inspired algorithm, described in Alidoost et al. (2020), applies a Y-shaped CNN to a single aerial image. The two outputs, which give the network its name, are style-transferred DSMs on the one hand and the union of eave, ridge, and hip lines on the other hand. Individual roof areas are retrieved in a post-processing step. This method is a further development of the knowledge-based workflow (Alidoost et al., 2019). Moreover, Partovi et al. (2019) proposed a comprehensive workflow consisting of building detection, decomposition of the roof into rectangles in 2D, ML-based assignment of a roof type to every rectangle, and reconstruction of the roof, for which a set of parameters has to be determined for every rectangle. All possible models are instantiated by varying the parameters within predefined ranges and validated by the PolyS metric (Avbelj et al., 2014). The 3D information is dispensable for the assignment of roof types, which makes this work very interesting for us. However, the dataset considered in that work shows typical neo-classical buildings, which extend in rectangular rows along the boulevards and are designed in a uniform style, whereas we are more interested in residential buildings, since they have very complex roof models and are hardly decomposable into rectangles.

From the point of view of entire wire-frame generation, Zhao et al. (2022) enhanced the detection units’ ability to perceive line segment primitives by incorporating HT-IHT (Hough Transform-Inverse Hough Transform), which enabled the extraction of line segment and intersection point proposals. Subsequently, they used Graph Neural Networks (GNNs) to learn the relationships between line segments and intersections, thereby achieving the vectorization of roof structures. Inspired by the approach of Huang et al. (2018), Esmaeily and Rezaeian (2023) developed an end-to-end representation for intersection points in images. This representation not only captures the spatial coordinates of the intersection points but also encodes the direction of the line segments that form these intersections. Assisted by line segment detection masks and plane masks, this approach enables the vectorized representation of roof structures. Wire-frame extraction methods are efficient regarding computational resources, but unusual angles in challenging building structures may sometimes affect their performance negatively.

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications

The work of Ren et al. (2021), although mainly focusing on interactive roof annotation and roof graph optimization, also proposed a two-step procedure based on a Transformer and a GNN. The Transformer is supposed to retrieve a roof outline while the GNN is trained to predict the face adjacency. The approach even offers a tool to generate 3D models from 2D graphs using the so-called planarity metric, even though, by default, all inner roof vertices must have the same elevation. We exploit this contribution by using their first dataset to validate our method.

Finally, there are two very recent and very successful methods worth mentioning. To detect geometric primitives and their relations, Lin et al. (2024) present a special CNN, denoted as Switch, that emphasizes inter-channel rather than intra-channel (i.e., textural) patterns. The Transformer-based architecture called Roof-Former has been proposed by Zhao et al. (2024). It consists of three steps: Feature Pyramid Networks providing relevant features that encode edges and vertices; image feature fusion with enhanced segmentation refinement, in which relevant feature sets seem to compete against each other; and structural reasoning. Qualitative results look impressive for roofs with misleading texture, blur, and similar disturbances. The quantitative results are, however, given at the heat-map level only.

1.2 Lessons Taken and Own Contributions

In the existing work, the first clear message is that roof junction points are the most important geometric primitives for achieving vectorization. In other words, missing or erroneous intersection point detections can severely degrade the results of roof structure vectorization. However, during the acquisition process, intersection points are susceptible to occlusion and noise, which is an objective reality that even the most advanced detection units cannot avoid. Therefore, our approach focuses on another important geometric primitive, namely edges. Edge detection is more stable, and edges play a role in connecting various elements in geometric structures. In our approach, we use a rotated object detection model (YOLOv8 OBB) to infer the vector representation of edges.

The second important conclusion is that it is still state of the art to use some conventional post-processing because a deep learning method cannot cover all possible roof structures and perceive all the roof details that architects are able to imagine. The additional advantage of such a post-processing step is that rich findings from 3D-based roof detail analysis workflows, where the edges are intersection lines of RANSAC-retrieved planes, are available and can be adopted (Verma et al., 2006; Sohn et al., 2012; Pohl and Bulatov, 2015; Meidow et al., 2016; Jung et al., 2017, and many others). To the line endpoints, we apply the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm (Sander et al., 1998), which helps to gradually recover the intersection points and plane information, thereby achieving the vectorization of roof structures. Bulatov et al. (2017) utilized DBSCAN to simplify the vector structure of a road network and to recognize junctions or dead ends. Inspired by them, we use it to cluster the endpoints of edges based on their spatial proximity in order to identify potential junction points.

We apply our method to two challenging datasets and present both raster- and vector-based evaluation metrics. In particular, the PolyS metric (Avbelj et al., 2014), widely used in remote sensing, has been applied for the first time, to our knowledge, to 2D building roof polygons reconstructed using only images and no 3D data. In summary,

• we applied the YOLOv8 OBB method to the detection of roof edges in high-resolution nadir aerial images;

• we developed a procedure for roof polygonization and face retrieval that relies on topology and geometry, allowing complex buildings to be processed;

• we evaluated our results on two datasets, one of which was not used for training but to track the model’s ability to generalize. Hereby, we used both raster- and vector-based metrics, including the PolyS metric and our vector-raster quality metric;

• as a competing approach, we used (naive) SAM to evaluate its capability to extract roof faces.

2 METHODOLOGY

YOLOv8, the latest generation of the You Only Look

Once (YOLO) model developed by Jocher et al.

(2023), extends its capabilities across a broad spec-

trum of computer vision tasks, including object de-

tection, instance segmentation, pose estimation, and

image classiﬁcation. With the release of version 8.1.0

on January 10, 2024, a pivotal feature was intro-

duced: Oriented Bounding Box (OBB) models. Un-

like traditional object detection models, where bound-

ing boxes are aligned with the image axes, OBB mod-

els incorporate an additional angle parameter, allow-

ing for more precise localization of objects, particu-

larly those with irregular orientations. This advance-

ment holds signiﬁcant potential for applications in re-

mote sensing, where accurately detecting elongated

objects is critical.

Given their elongated and homogeneous nature,

roof edges are ideal candidates for OBB detection.

Additionally, the vectorized output simpliﬁes the con-

version of detected roof edges into structured vector

formats, facilitating the generation of complete vector

representations through subsequent polygonization.

Specifically, our model is built on the pre-trained YOLOv8 OBB model provided by Ultralytics, using its default hyperparameters and the model weights “yolov8l-obb.pt”. To adapt our training data to OBB, we generated an approximate oriented bounding box for each edge. The trained model achieved a precision of 0.99 and a recall of 0.96 in edge detection. The mAP50 is 0.98 and the mAP50-95 is 0.77.
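The conversion of a labelled edge into an oriented box can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `edge_to_obb` and the box thickness `width` are hypothetical, since the text only states that an approximate box was generated for each edge. The sketch returns the four corner points of a thin rectangle centred on the edge.

```python
import math

def edge_to_obb(p, q, width=6.0):
    """Return the four corners of a thin rectangle centred on edge p-q.

    p, q  : (x, y) endpoints of a roof edge in pixel coordinates
    width : assumed box thickness in pixels (hypothetical value)
    """
    (x1, y1), (x2, y2) = p, q
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        raise ValueError("degenerate edge: endpoints coincide")
    # Normal vector of the edge, scaled to half the box thickness.
    nx = -(y2 - y1) / length * width / 2.0
    ny = (x2 - x1) / length * width / 2.0
    return [
        (x1 + nx, y1 + ny),
        (x2 + nx, y2 + ny),
        (x2 - nx, y2 - ny),
        (x1 - nx, y1 - ny),
    ]

# A horizontal edge of length 40 with an assumed thickness of 4 pixels.
corners = edge_to_obb((10.0, 20.0), (50.0, 20.0), width=4.0)
```

A fixed thickness keeps all boxes elongated along the edge direction; whether the thickness should depend on image resolution is left open here.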

2.1 Polygonization

2.1.1 Edge Complementation

From the YOLOv8 output, we obtain attributes describing the length and direction of edges and can infer the coordinates of their endpoints. However, it is worth noting that these edges often have gaps of varying sizes. Therefore, we develop a polygonization procedure based on geometric rules to complement the edges, aiming to generate complete and closed polygons (see Figure 1).

Figure 1: The process of edge complementation.
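Inferring edge endpoints from a detected box can be illustrated with a small sketch. It assumes that a detection is available as centre, width, height, and rotation angle in radians, and that the edge runs along the longer side of the box; the function name `obb_to_edge` and this parameterization are assumptions made for illustration.

```python
import math

def obb_to_edge(cx, cy, w, h, angle):
    """Collapse an oriented box to the segment along its longer side.

    angle is the rotation of the w-axis in radians (assumed convention).
    """
    if h > w:
        # Make w the long side; the long axis is then rotated by 90 degrees.
        w, h = h, w
        angle += math.pi / 2.0
    dx = math.cos(angle) * w / 2.0
    dy = math.sin(angle) * w / 2.0
    return (cx - dx, cy - dy), (cx + dx, cy + dy)

# Box centred at (30, 20), 40 pixels long, 4 pixels thick, not rotated.
p, q = obb_to_edge(30.0, 20.0, 40.0, 4.0, 0.0)
```

The two returned points are the endpoints that enter the clustering step of the polygonization.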

First, we apply DBSCAN to the endpoints of the edges. It is a clustering algorithm for grouping densely distributed points, which is commonly used for downsampling point clouds and suppressing noise (Sander et al., 1998). More than a quarter of a century old, this method is still widely applied because of its flexibility regarding the number of clusters, its robustness against noise, and its high processing speed. In our situation, where the number of clusters is uncertain and the intersections are generated through mutual constraints between different edges, this method is particularly suitable. We differentiate between two types of clusters. Junction clusters contain two or more endpoints, and we consider these points to form a junction of two or more edges. In contrast, gap clusters contain only one endpoint. This means that the YOLOv8 detection box does not fully cover the edge; hence, we need to extend it appropriately to make a closed polygon possible. Junction clusters and gap clusters are supposed to represent two types of problems encountered when dealing with images: noise and occlusions, respectively.
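The distinction between junction and gap clusters can be sketched in a few lines. Because gap clusters are singletons, the sketch assumes that DBSCAN is run with a minimum cluster size of one; in that setting every point is a core point and DBSCAN reduces to grouping points that are chained by distances of at most `eps`, which is what the pure-Python stand-in below does. The radius `eps` and all function names are assumptions; in practice a library implementation of DBSCAN could be substituted.

```python
import math

def cluster_endpoints(points, eps):
    """Group points chained by distances <= eps (DBSCAN with min_samples=1)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def split_clusters(clusters):
    """Separate junction clusters (>= 2 endpoints) from gap clusters (1)."""
    junctions = [c for c in clusters if len(c) >= 2]
    gaps = [c for c in clusters if len(c) == 1]
    return junctions, gaps

# Three endpoints meet near the origin; two endpoints are isolated.
endpoints = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (40.0, 0.0), (80.0, 3.0)]
junctions, gaps = split_clusters(cluster_endpoints(endpoints, eps=3.0))
```

Here the first three endpoints form one junction cluster, while the remaining two become gap clusters whose edges need to be extended.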

For a junction cluster, we replace its points by the center of all points in the cluster and store this point as a suggested junction. Considering that a junction may also be formed by two incompletely covered edges, we compute the intersection of two incompletely covered edges and store it as an additional suggested junction. The completion of gap clusters is based on both types of suggested junctions. More specifically, for an endpoint x within a gap cluster, we search for the nearest suggested junction on the line represented by the corresponding edge. This edge is then modified by fusing the endpoint x with the suggested junction.
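The three geometric operations described above (cluster centre, intersection of two edges, and fusing a gap endpoint with a suggested junction) can be sketched as follows. The function names and the tolerance `tol`, which decides when a junction counts as lying on the line of an edge, are assumptions for illustration; the text does not specify them.

```python
import math

def centroid(points):
    """Centre of all endpoints in a junction cluster."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def line_intersection(a1, a2, b1, b2):
    """Intersection of the infinite lines a1-a2 and b1-b2, or None if parallel."""
    dax, day = a2[0] - a1[0], a2[1] - a1[1]
    dbx, dby = b2[0] - b1[0], b2[1] - b1[1]
    det = dax * dby - day * dbx
    if abs(det) < 1e-12:
        return None
    t = ((b1[0] - a1[0]) * dby - (b1[1] - a1[1]) * dbx) / det
    return (a1[0] + t * dax, a1[1] + t * day)

def snap_gap_endpoint(x, other_end, junctions, tol=2.0):
    """Move gap endpoint x to the nearest junction lying on its edge's line.

    tol is an assumed tolerance (pixels) for 'lying on the line'.
    Returns x unchanged if no suggested junction qualifies.
    """
    dx, dy = x[0] - other_end[0], x[1] - other_end[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return x
    best, best_dist = x, float("inf")
    for j in junctions:
        # Perpendicular distance from junction j to the supporting line.
        off = abs((j[0] - x[0]) * dy - (j[1] - x[1]) * dx) / norm
        d = math.hypot(j[0] - x[0], j[1] - x[1])
        if off <= tol and d < best_dist:
            best, best_dist = j, d
    return best

# Suggested junction from a junction cluster.
junction = centroid([(9.0, 0.5), (10.5, -0.5), (10.5, 0.0)])
# Additional suggested junction from two incompletely covered edges.
extra = line_intersection((0.0, 0.0), (8.0, 0.0), (10.0, 10.0), (10.0, 3.0))
# The gap endpoint (8, 0) of the horizontal edge is fused with it.
snapped = snap_gap_endpoint((8.0, 0.0), (0.0, 0.0), [extra, (3.0, 7.0)])
```

In this toy case both edges stop short of the corner at (10, 0); the intersection supplies the missing junction and the horizontal edge is extended to it.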

2.1.2 Roof Face Vectorization

After reﬁning the edges, determining which points

and line segment elements constitute each roof face

is the ﬁnal step in roof vectorization. To match the

vertices with faces, we operate in the raster domain.

Specifically, we first convert the edges into raster form, as shown in Step A of Figure 2. The standard Bresenham algorithm, dating back to 1965, can be applied here. It produces a binary image in which the pixels occupied by edges are assigned the value 1 and the unoccupied pixels are assigned the value 0.
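For reference, the rasterization step can be sketched with the integer form of Bresenham's line algorithm. The helper `rasterize_edges` and the list-of-lists image representation are choices made for this illustration; endpoints are assumed to be already rounded to integer pixel coordinates.

```python
def bresenham(x0, y0, x1, y1):
    """Integer pixels on the segment from (x0, y0) to (x1, y1), inclusive."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pixels

def rasterize_edges(edges, height, width):
    """Binary image: 1 on pixels covered by any edge, 0 elsewhere."""
    image = [[0] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in edges:
        for x, y in bresenham(x0, y0, x1, y1):
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = 1
    return image

# Two edges sharing the endpoint (3, 0) on a 4 x 4 grid.
image = rasterize_edges([((0, 0), (3, 0)), ((3, 0), (3, 3))], height=4, width=4)
```

Pixels falling outside the image are silently skipped in this sketch.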

Figure 2: The process of roof face vectorization. Step A:

rasterization of edges; Step B: Connected component form-

ing; Step C: Vertices collection.

Then, through connectivity checks (1-

connectivity, which considers the four-neighborhood

of a pixel), we partition the pixels in the raster

space (see Step B of Figure 2). For a building, the

partitioning result is N + 1 components, representing

the various roof faces (N) and the background (1).
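The partition into N + 1 components can be illustrated with a breadth-first flood fill over the four-neighborhood. This is a generic sketch rather than the authors' implementation; the function name and the toy roof are assumptions. Using the four-neighborhood means that a diagonally stepping rasterized edge still separates the regions on its two sides.

```python
from collections import deque

def label_components(image):
    """Label 4-connected regions of 0-pixels; edge pixels (1) keep label 0."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if image[sy][sx] == 1 or labels[sy][sx]:
                continue
            count += 1
            labels[sy][sx] = count
            queue = deque([(sx, sy)])
            while queue:
                x, y = queue.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and image[ny][nx] == 0 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((nx, ny))
    return labels, count

# Toy roof: a square outline with a ridge that splits it into two faces.
image = [[0] * 7 for _ in range(7)]
for k in range(1, 6):
    image[1][k] = image[5][k] = 1   # top and bottom outline
    image[k][1] = image[k][5] = 1   # left and right outline
    image[k][3] = 1                 # ridge
labels, count = label_components(image)
```

For this toy roof with N = 2 faces, the labeling yields three components: the background and one region per face.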

Next, we label the edges from 1 to M. By col-

lecting the incident edges for each component and

knowing the incidence relations between edges and

endpoints, we can index the vertices of the face and

represent it in vector form (see Step C of Figure 2).

3 RESULTS

3.1 Datasets

We used a building-image paired dataset created by

Ren et al. (2021). The dataset, denoted here as SGA,

contains more than 3K samples of residential build-

ings from presumably different countries and the cor-

responding labels in raster and vector form. Each

sample is an aerial image patch with the ground sam-

ple distance of around 5cm containing one roof of a

single residential building and the label stores the 2D

vertex positions of the roof vertices and the vertex-based topology of each face on the roof. We transformed each label into rotated detection boxes for the roof edges to adapt it to the training of the YOLOv8 OBB model. The SGA dataset is then divided into training, validation, and test sets in a ratio of 6:2:2. On the test set, we experiment with polygonization and evaluate the results.

The second dataset was recorded by the Spookfish (now Aerometrex) corporation in Melville, a residential district of the city of Perth, Western Australia. The dataset has already been used in the context of geo-typical representation of building roofs for heat simulation, see Ilehag et al. (2018). We represent it in the same patch structure as the first dataset. The roofs exhibit very complex shapes and are difficult to annotate, partly due to the moderate resolution of slightly below 0.1 m, that is, coarser than the SGA dataset. This also resulted in problems during annotation, because some roof segments are not sufficiently distinguishable by the human eye. In order to show the robustness of the proposed method, we decided to train our model using only the SGA dataset. For the Melville dataset, we merely annotated 50 buildings to validate our model qualitatively.

3.2 Evaluation Strategy

3.2.1 Performance Metrics

For a single building, it is important to provide an evaluation on both the raster and the vector level. The most common function for raster-based assessment is the mean intersection over union (mIoU). First, we calculate the IoU for each face of the roof. The set of pixels belonging to face $i$ in the reference ($R$) is denoted by $a_i \in R$, while that belonging to the face $j$ in the prediction ($P$) that has the largest overlap with $a_i$ is $b_j \in P$. We obtain the IoU of a single face $i$ and the mIoU of the whole building according to

$$\mathrm{IoU}_i = \frac{|a_i \cap b_j|}{|a_i \cup b_j|} \quad \text{and} \quad \mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i, \tag{1}$$

respectively, whereby $N$ is the number of faces in a roof. Note that this measure is not symmetric: we can have many false roof segment hypotheses outside of P without causing any harm to mIoU in (1). This is why we additionally considered the overall IoU (ovIoU), computed at roof level from $P$ and $R$, the total sets of pixels belonging to the roof in the prediction and the reference, respectively:

$$\mathrm{ovIoU} = \frac{|P \cap R|}{|P \cup R|}. \tag{2}$$

Both measures have the advantage that the numbers of segments in the reference and the prediction need not coincide.
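A small sketch may help to see how the two raster measures behave. Faces are represented as Python sets of pixel coordinates, which is a choice made for this illustration; the function names are hypothetical. The example contains one spurious predicted face, which leaves mIoU untouched but lowers ovIoU, illustrating the asymmetry noted above.

```python
def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_iou(ref_faces, pred_faces):
    """Eq. (1): match each reference face to the prediction with largest overlap."""
    scores = []
    for a in ref_faces:
        b = max(pred_faces, key=lambda f: len(a & f), default=set())
        scores.append(iou(a, b))
    return sum(scores) / len(scores)

def overall_iou(ref_faces, pred_faces):
    """Eq. (2): IoU of the unions of all reference and all predicted pixels."""
    R = set().union(*ref_faces)
    P = set().union(*pred_faces)
    return iou(P, R)

ref = [{(0, 0), (1, 0)}, {(2, 0), (3, 0)}]
pred = [{(0, 0), (1, 0)}, {(2, 0)}, {(9, 9)}]   # last face is a false hypothesis
miou = mean_iou(ref, pred)      # (1.0 + 0.5) / 2
oviou = overall_iou(ref, pred)  # 3 shared pixels out of 5 in the union
```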

On the vector level, for every vertex $x \in \partial P$ of the prediction polygon $\partial P$ (we omit $\partial$ in what follows), one must compute the closest point $y \in R$ and, in the next step, vice versa. The corresponding distance

$$d_{P \to R}(x) = \min_{y \in R} \mathrm{dist}(x, y) \tag{3}$$

must be aggregated in some way over all $x \in P$. The symmetric Hausdorff distance presupposes taking the maximum

$$d_H = \max\left\{ \max_{x \in P} d_{P \to R}(x),\; \max_{y \in R} d_{R \to P}(y) \right\}, \tag{4}$$

while the PolyS metric of Avbelj et al. (2014) presupposes computation of the RMSE values

$$\sqrt{\frac{1}{|P|}\sum_{x \in P} d_{P \to R}^{2}(x)} \quad \text{and} \quad \sqrt{\frac{1}{|R|}\sum_{y \in R} d_{R \to P}^{2}(y)}. \tag{5}$$

Both metrics have often been applied to assess the quality of reconstruction. Instead of averaging the errors in $d_P$, we take the maximum of both values. Since we wish the value 1 to correspond to a good reconstruction and 0 to a bad one, we normalize these metrics: $q_{\cdot} = 1 - d_{\cdot}/d_{\max}$, thus making them uniform with (1) and (2). Hereby, $q$ stands for quality, $\cdot$ means either the Hausdorff (H) or the PolyS (P) metric, and $d_{\max}$ is a scaling parameter, the diagonal of the bounding box of both polygons. Finally, the vector-raster metric presupposes preserving vector and matrix properties of reference data. It is given by

$$q_{VM} = \mathrm{mIoU} \cdot q_{\cdot} \tag{6}$$

and is likewise scaled between 0 (bad on the raster or vector level) and 1 (good on both the vector and the raster level).
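The vector-level measures can be sketched for two simple polygons. The sketch follows the description above under several assumptions: polygons are closed vertex lists, the closest point is searched on the boundary segments of the other polygon, the PolyS-type distance is read as the maximum of the two directed RMSE values, and all function names are hypothetical. Which normalized vector quality enters Eq. (6) is passed in as an argument.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    ax, ay = b[0] - a[0], b[1] - a[1]
    denom = ax * ax + ay * ay
    t = 0.0
    if denom > 0:
        t = max(0.0, min(1.0, ((p[0] - a[0]) * ax + (p[1] - a[1]) * ay) / denom))
    return math.hypot(p[0] - (a[0] + t * ax), p[1] - (a[1] + t * ay))

def directed_distances(src, dst):
    """Eq. (3): for each vertex of src, distance to the boundary of dst."""
    n = len(dst)
    return [min(point_segment_distance(x, dst[i], dst[(i + 1) % n])
                for i in range(n)) for x in src]

def hausdorff(P, R):
    """Eq. (4): symmetric Hausdorff distance over polygon vertices."""
    return max(max(directed_distances(P, R)), max(directed_distances(R, P)))

def polys_max_rmse(P, R):
    """Maximum of the two directed RMSE values of Eq. (5)."""
    def rmse(values):
        return math.sqrt(sum(v * v for v in values) / len(values))
    return max(rmse(directed_distances(P, R)), rmse(directed_distances(R, P)))

def quality(d, P, R):
    """Normalization q = 1 - d / d_max with the joint bounding-box diagonal."""
    xs = [v[0] for v in P + R]
    ys = [v[1] for v in P + R]
    d_max = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    return 1.0 - d / d_max

def vector_raster_quality(miou, q_vec):
    """Eq. (6): product of mIoU and a normalized vector-level quality."""
    return miou * q_vec

R = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]   # reference face
P = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 4.0)]   # one vertex is off by 1
d_h = hausdorff(P, R)
d_p = polys_max_rmse(P, R)
q_h = quality(d_h, P, R)
q_p = quality(d_p, P, R)
```

In this example a single displaced vertex dominates the Hausdorff distance, while the RMSE-based value is damped by the three exact vertices, so the Hausdorff-based quality is the lower of the two.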

The single error metrics are given building-wise. Since our datasets consist of many buildings, we wish to assess the reconstruction accuracy by the average values of mIoU, ovIoU, $q_P$, and $q_{VM}$ as well as the median of $q_H$, because the Hausdorff metric already takes gross errors and outliers into account. Additionally, we will show the boxplots of all metrics.

3.2.2 SAM as Competing Approach

While our approach heavily relies on roof edges, we

wonder to what extent modern foundational models

are suitable to retrieve roof faces directly. To this end,

we use the prediction of the Segment Anything Model

(SAM) as the baseline of our polygonization at raster

level. SAM is a foundation model for image segmen-

tation released by Meta (Kirillov et al., 2023). It con-

sists of three main modules: 1) Image encoder: com-

posed of MAE pre-trained Vision Transformer (ViT),

which maps the image to be segmented into an im-

age feature space; 2) Prompt encoder: responsible for

mapping the input prompt to a prompt feature space;

3) Mask decoder: integrates the embedding output by

the Image encoder and Prompt encoder, then decodes

the ﬁnal segmentation mask from this embedding’s

feature.

Trained on a massive dataset (SA-1B, compris-

ing 1B masks and 11M images), SAM has developed

powerful generalization capabilities, enabling trans-

fer to new image distributions and tasks through zero-

shot learning. Users can interact with SAM using

prompts (e.g., points, boxes, masks) to obtain relevant

segmentation masks. In our research, we use SAM to obtain masks for each face of the building roofs, which we then compare with our method’s results. Since the prompts of SAM can be a set of foreground/background points, we designed a prompt generation strategy. First, we sample each face f of the roof based on the ground truth, which is plotted from the roof face vector data in the dataset, to obtain a set of prompt points.

Figure 3: Example of the prompt generation strategy. Foreground/background prompts are denoted by yellow and gray points, respectively, while the blue mask denotes the prediction of SAM.

Then, we divide the prompt set into two sub-

sets: The subset belonging to f is used as fore-

ground prompt points and the remaining as back-

ground prompt points, as shown in Figure 3. Using

these prompt points, SAM is able to segment each in-

dividual face and delineate the boundary. As a result,

we obtain the prediction results of the foundational

model under strongly directional prompts.
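The prompt generation strategy can be sketched without calling SAM itself. The sketch assumes that each ground-truth face is available as a set of pixel coordinates and that a fixed number `k` of points is drawn at random per face; both the sampling scheme and the function names are assumptions, since the text does not state how the points are sampled. Labels 1 and 0 mark foreground and background points, respectively.

```python
import random

def sample_face_points(face_pixels, k, seed=0):
    """Draw up to k prompt points from every ground-truth face.

    face_pixels : dict mapping a face id to a set of (x, y) pixels
    k, seed     : assumed sampling parameters (hypothetical values)
    """
    rng = random.Random(seed)
    return {fid: rng.sample(sorted(px), min(k, len(px)))
            for fid, px in face_pixels.items()}

def prompts_for_face(samples, target):
    """Points on the target face become foreground (1), all others background (0)."""
    points, labels = [], []
    for fid, pts in samples.items():
        for p in pts:
            points.append(p)
            labels.append(1 if fid == target else 0)
    return points, labels

faces = {"A": {(0, 0), (1, 0), (2, 0)}, "B": {(5, 5), (6, 5)}}
samples = sample_face_points(faces, k=2)
points, labels = prompts_for_face(samples, target="A")
```

Repeating `prompts_for_face` for every face id yields one prompt set, and hence one predicted mask, per roof face.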

3.3 Findings

3.3.1 Quantitative Evaluation

Figure 4a shows the quantitative evaluation at the raster level. We compare our method with the prediction of SAM, with outliers removed for a clearer view of the majority. First of all, it is evident that our method outperforms SAM in terms of mIoU.

The distribution of our method is heavily skewed towards higher mIoU: for most roofs, it falls within the range of 0.85 to 1, with a mean value of 0.91, indicating excellent segmentation performance. In contrast, SAM’s distribution is more spread out, falling within the range of 0.6 to 1 with a mean value of 0.8.

In terms of ovIoU, the difference between SAM

and our method is relatively small. However, our

method maintains ovIoU within a narrower range of

0.95 to 0.99, while SAM achieves a range of 0.85 to

0.98. This indicates that our approach provides more

consistent coverage of buildings.

Figure 4b presents the quantitative evaluation at the vector level, with outliers removed for a clearer view of the majority. Across the three metrics, the outputs of YOLOv8 closely approximate the ground truth, with most samples having a $q_H$ between 0.97 and 0.99, a $q_P$ between 0.98 and 0.99, and a $q_{VM}$ between 0.97 and 0.99. After polygonization, the sample distribution becomes more concentrated, indicating a further improvement over the raw YOLOv8 output. This improvement is also reflected in the considerably higher median lines across all metrics compared to YOLOv8. Consequently, our method demonstrates better consistency.

(a) Raster level. (b) Vector level.

Figure 4: Quantitative evaluation of the SGA dataset.

The Hausdorff-based quality $q_H$, which is particularly sensitive to outliers, is of special interest to us. It reveals that YOLOv8’s lower whiskers extend further, indicating the presence of some gross errors during evaluation. However, it is important to note that these outliers are effectively suppressed by our polygonization process, as can be seen from the increase in the median value from 0.983 to 0.991, demonstrating the robustness of our method. The PolyS metric $q_P$ reveals that YOLOv8 effectively identifies most roof edges but struggles with completeness, resulting in relatively low values of $q_P$. Our polygonization method, however, refines these roof edges, significantly enhancing geometric similarity to the reference object. Finally, the $q_{VM}$ metric, which combines raster-level and vector-level evaluations, shows minimal divergence from $q_H$ and $q_P$ of our method. This indicates excellent performance in both geometric and raster accuracy for most samples, yielding a reliable and robust vector representation of the roof structure.

Overall, at the vector level, YOLOv8’s output is

satisfactory, and our polygonization further enhances

the results. While effectively representing the roofs

in the vector form, our method demonstrates strong

robustness and consistency.

3.3.2 Qualitative Evaluation

Figure 5 presents a qualitative analysis of eight rep-

resentative samples from the SGA dataset. For each

sample, we display its aerial image, the detection re-

sults from YOLOv8 OBB (with detected bounding

boxes shown in yellow), the polygonization results

(ﬁnal representations of roof structures with vertices

and edges shown in white), the SAM prediction re-

sults, and the ground truth (individual faces repre-

sented as colored areas). The colors of the SAM out-

puts and the ground truth correspond directly to the

colors of each face in the IoU distribution. Addition-

ally, the mIoU and ovIoU values are provided.

Sample (a) demonstrates a simple case. The accu-

rate prediction of YOLOv8 enables our polygoniza-

tion to complete the roof vectorization in the ﬁrst step.

In the SAM results, each face is also well predicted.

Through further observation of the IoU for each face,

the polygonization results are satisfactory. Sample (b)

is a complex roof case. Despite the complexity of the

roof structure, most of the edges are well recognized.

The missing edge predictions for face 18 lead to its low IoU. However, for most faces, the results are satisfactory, and the ovIoU is also at a high level.

Samples (c) to (h) have varying degrees of edge

gaps. The gap in Sample (c) is located on the longest

edge of the roof, at the lower left of the building. The

gaps in Sample (d) are present on the upper and

lower outer contour edges. Our polygonization is able

to effectively handle these situations. The edge im-

plementation approach matches the gap endpoints to

appropriate potential junctions, thus achieving a com-

plete roof topology. As a result, each face obtains ex-

cellent IoU.
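
The idea of matching gap endpoints to potential junctions can be sketched as follows. This is a simplified illustration under stated assumptions, not our full procedure: candidate junctions are given in advance, each endpoint is snapped to the nearest junction within a hypothetical tolerance `max_dist`, and the function names are chosen for this example only.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def nearest_junction(p: Point, junctions: List[Point],
                     max_dist: float) -> Optional[Point]:
    """Return the junction closest to p, or None if none lies within max_dist."""
    best, best_d = None, max_dist
    for j in junctions:
        d = math.dist(p, j)
        if d <= best_d:
            best, best_d = j, d
    return best


def close_gaps(segments: List[Segment], junctions: List[Point],
               max_dist: float) -> List[Segment]:
    """Snap both endpoints of every segment to a nearby candidate junction."""
    closed = []
    for a, b in segments:
        a_snap = nearest_junction(a, junctions, max_dist)
        b_snap = nearest_junction(b, junctions, max_dist)
        a_new = a if a_snap is None else a_snap
        b_new = b if b_snap is None else b_snap
        # Keep the original segment if snapping would collapse it to a point.
        closed.append((a, b) if a_new == b_new else (a_new, b_new))
    return closed


junctions = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)]
detected = [((0.2, 0.1), (6.0, 0.0)),      # ends 4 px short of the corner
            ((10.0, 0.3), (10.0, 4.8))]    # small offsets at both ends
closed = close_gaps(detected, junctions, max_dist=5.0)
print(closed)
```

After snapping, the two segments share the junction (10, 0), so the toy roof outline becomes topologically connected.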

Sample (e) demonstrates the effectiveness of ad-

ditional junctions in our edge completing approach.

The presence of additional junctions allows its con-

tour edges to be well completed. From Samples (c) to

(e), we can see that gaps mainly exist in longer edges

of building roofs, and are more common in outer con-

tour lines. Possible reasons include: compared to

shorter edges, long edges have fewer samples; hence,

YOLOv8 may not learn them sufﬁciently; addition-

ally, the prediction tends to cover areas with higher

conﬁdence, which may sacriﬁce some areas that could

be edges. Nevertheless, our polygonization compen-

sates well for this weakness, using geometric prior

knowledge to connect these edges with appropriate

junctions.

Sample (f) demonstrates an advantage of our

method: on buildings covered by trees, edges as

geometric primitives have better spatial continuity, thus

overcoming the effects of visual occlusion. The lower

left corner edge in this example is partially

obscured by trees, which may present a challenge for

algorithms that use corners as geometric primitives.

The SAM results also show that prompt-based

segmentation cannot cover this face well. However, from

the vector prediction of YOLOv8, this face is still well

reconstructed with the assistance of our polygoniza-

tion.


Figure 5: Qualitative evaluation samples (SGA). Panels: (a) simple case, (b) complex case, (e) multiple edge gaps, (f) tree cover, (g) detection error, (h) edge loss.

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications

Sample (g) illustrates a weakness of our method:

if the roof is composed of two different materials,

the change in texture may create a false

edge. Our method might then incorrectly divide this

face, causing prediction errors. Surprisingly, SAM

was able to achieve a better result. Finally, Sample

(h) represents a situation our method cannot handle.

Here, one face is completely occluded by trees,

resulting in up to three edges not being recognized, thus

causing the adjacent face to be missed. This

also poses a challenge to SAM, whose prompts are

situated on the trees and which therefore returns a mask of the trees. For such

cases, we can only seek help from other data sources

to achieve roof reconstruction.

Overall, our method can effectively handle vari-

ous roof scenarios with clear geometric relationships,

and also demonstrates good robustness for partially

obscured edges.

Figure 6: Qualitative evaluation samples (Melville). From

left to right: input patch with YOLOv8 lines, polygonization

output, SAM output, ground truth delineation.

Turning our attention to the Melville dataset,

we present in Figure 6 a comparison between the

roof structure vector representations generated by our

method and the manual annotation. Lines represent

detected roof edges, while points indicate the corners

of these edges. The top row, for example, represents

a very complex building, all edges of which were de-

tected by YOLOv8, followed by one spurious fusion

caused by our polygonization method. SAM, how-

ever, could not detect some narrow segments. The

bottom row represents a building quite typical for the

SGA dataset. It is not very complex, but one solar

panel on the roof makes the SAM algorithm strug-

gle. Our method reconstructs this building completely

and correctly. Although the Melville dataset was not

involved in our model training, our method’s output

shows a generally consistent pattern with manual an-

notations. The main shortcomings lie in the accuracy

of some roof details: our method tends to ignore

or simplify narrower and irregular faces. Such sim-

pliﬁcation may have severe repercussions in the case

of 3D reconstruction of roofs; nevertheless, these re-

sults demonstrate our method’s potential in handling

complex roof vectorization and signiﬁcantly reducing

labor costs.

4 CONCLUSION

We have developed a novel method for roof detail

extraction and vectorization based on remote sensing

images. In particular, our approach leverages edges as

the primary elements for reconstruction, facilitating

the determination of roof corners and faces through

geometric relationships. The strength of this approach

lies in its robustness in edge detection, enabling the

reconstruction of accurate structures even when the

roof is partially occluded and minimizing the impact

of noise. This overcomes the limitations of corner-

based vector structure reconstruction.

Speciﬁcally, roof edge extraction is achieved us-

ing the YOLOv8 OBB model. We have innovatively

adapted this model, typically employed for rotated

object detection, to the task of roof edge detection. Its

vectorized output is particularly well-suited for edge

extraction, simplifying the process of generating vec-

tor representations of roofs.
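
As an illustration of why a rotated box maps naturally to an edge, the sketch below reduces an oriented bounding box to the line segment along its longer axis. It assumes the box is parameterized by center, width, height, and rotation angle in radians; this parameterization and the function name `obb_to_segment` are assumptions for the example, and the conversion used in our pipeline may differ in detail.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def obb_to_segment(cx: float, cy: float, w: float, h: float,
                   angle: float) -> Tuple[Point, Point]:
    """Return the centerline of a rotated box along its longer side.

    Assumption: `angle` (radians) is the direction of the side of length `w`.
    """
    if w >= h:
        half, theta = w / 2.0, angle
    else:
        # The longer side is the one of length h, perpendicular to the w side.
        half, theta = h / 2.0, angle + math.pi / 2.0
    dx, dy = half * math.cos(theta), half * math.sin(theta)
    return (cx - dx, cy - dy), (cx + dx, cy + dy)


# A thin horizontal box and a thin vertical box around the same center.
horizontal = obb_to_segment(5.0, 5.0, 10.0, 1.0, 0.0)
vertical = obb_to_segment(5.0, 5.0, 1.0, 10.0, 0.0)
print(horizontal, vertical)
```

The two endpoints of the returned segment then serve as edge endpoints, so no separate raster-to-vector step is required for the detected edges.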

We conducted experiments on two datasets: the

SGA dataset created by Ren et al. (2021), which, to the

extent we could observe, is free of topological and

geometric errors, and the Melville dataset, which we annotated ourselves.

The former was also used to train our YOLOv8 OBB

model. We evaluated our method at both the raster and

vector levels. At the raster level, we used SAM

as the baseline, which is widely regarded as the

state-of-the-art foundational model for segmentation. Our


model not only performed excellently at the raster

and vector levels, but also produced stable results

with narrow quantile ranges, even for the

outlier-sensitive Hausdorff metric. In comparison,

SAM’s performance fluctuated more. The main

reason for this is that the edges are real, detectable ob-

servations while the roof segments SAM relies on can

suffer from occlusions by trees, color changes, roof

objects, and so on.
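
The outlier sensitivity of the Hausdorff metric mentioned above can be seen in a small worked example. The sketch below compares the symmetric Hausdorff distance with a mean nearest-neighbor distance; for brevity it operates on polygon vertices only (an assumption of this example, since a full evaluation would consider the whole boundary), and the function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: List[Point], b: List[Point]) -> float:
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: List[Point], b: List[Point]) -> float:
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def mean_nearest(a: List[Point], b: List[Point]) -> float:
    """Average distance from a point of a to its nearest point of b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


ref = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
good = [(0.0, 1.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]   # one vertex off by 1
bad = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 16.0)]    # one vertex off by 6

print(hausdorff(ref, good), hausdorff(ref, bad), mean_nearest(bad, ref))
```

A single misplaced vertex determines the Hausdorff distance entirely (6 in the second case), whereas the mean nearest-neighbor distance for the same polygon is only 1.5. Narrow quantile ranges on this metric therefore indicate that such outliers are rare.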

We also conducted qualitative evaluations on both

the SGA and Melville datasets, showcasing representative

results. Overall, our method can effectively

handle different roof structures and close edge

gaps in the YOLOv8 output. Even on the Melville dataset, which

was not involved in training and has a lower

resolution, the results were satisfactory. Although

the roofs in Melville present more complex structures,

most roof segments could still be well recognized.

This demonstrates the potential of our method in handling

complex roof vectorization.

In conclusion, our ﬁndings highlight the potential

of our method to effectively handle diverse roof struc-

tures, even in challenging scenarios with complex ge-

ometries. Moving forward, we plan to explore addi-

tional datasets and integrate our technique into urban

terrain reconstruction workﬂows. On the one hand, it

will help to explore further radiometric (detecting im-

portant installations on roofs: photovoltaic panels, so-

lar collectors, etc.) and geometric (non-planar roof el-

ements, such as domes, towers) aspects. On the other

hand, we aim to conduct a more comprehensive com-

parative analysis with other competing methods, fur-

ther establishing the robustness and versatility of our

approach. Finally, in future work, we plan to incorpo-

rate 3D data to search for more precise intersections

near our predicted ones to improve the vectorization.

ACKNOWLEDGEMENTS

The authors thank the China Scholarship Council

(CSC) for supporting this research, Grant/Award

Number: 202308080109. We also thank the review-

ers for their insightful comments.

REFERENCES

Alidoost, F., Areﬁ, H., and Hahn, M. (2020). Y-shaped

convolutional neural network for 3D roof elements

extraction to reconstruct building models from a sin-

gle aerial image. ISPRS Annals of the Photogramme-

try, Remote Sensing and Spatial Information Sciences,

2:321–328.

Alidoost, F., Areﬁ, H., and Tombari, F. (2019). 2D image-

to-3D model: Knowledge-based 3D building recon-

struction (3DBR) using single aerial images and con-

volutional neural networks (CNNs). Remote Sensing,

11(19):2219.

Avbelj, J., Müller, R., and Bamler, R. (2014). A metric for

polygon comparison and building extraction evalua-

tion. IEEE Geoscience and Remote Sensing Letters,

12(1):170–174.

Bulatov, D., Häufel, G., Meidow, J., Pohl, M., Solbrig, P.,

and Wernerus, P. (2014). Context-based automatic

reconstruction and texturing of 3D urban terrain for

quick-response tasks. ISPRS Journal of Photogram-

metry and Remote Sensing, 93:157–170.

Bulatov, D., Wenzel, S., Häufel, G., and Meidow, J. (2017).

Chain-wise generalization of road networks using

model selection. ISPRS Annals of the Photogramme-

try, Remote Sensing and Spatial Information Sciences,

4:59–66.

Canny, J. (1986). A computational approach to edge de-

tection. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 8(6):679–698.

Esmaeily, Z. and Rezaeian, M. (2023). Building roof wire-

frame extraction from aerial images using a three-

stream deep neural network. Journal of Electronic

Imaging, 32(1):013001–013001.

Harris, C., Stephens, M., et al. (1988). A combined corner

and edge detector. In Alvey Vision Conference, vol-

ume 15, pages 147–152. Citeseer.

Henricsson, O. (1998). The role of color attributes and sim-

ilarity grouping in 3-d building reconstruction. Com-

puter Vision and Image Understanding, 72(2):163–

184.

Hensel, S., Goebbels, S., and Kada, M. (2021). Building

roof vectorization with PPGNET. The International

Archives of the Photogrammetry, Remote Sensing and

Spatial Information Sciences, 46:85–90.

House, D., Lech, M., and Stolar, M. (2018). Using deep

learning to identify potential roof spaces for solar panels.

In International Conference on Signal Processing

and Communication Systems (ICSPCS), pages 1–6. IEEE.

Huang, K., Wang, Y., Zhou, Z., Ding, T., Gao, S., and Ma,

Y. (2018). Learning to parse wireframes in images of

man-made environments. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recogni-

tion, pages 626–635.

Ilehag, R., Bulatov, D., Helmholz, P., and Belton, D.

(2018). Classiﬁcation and representation of com-

monly used rooﬁng material using multisensorial

aerial data. The International Archives of the Pho-

togrammetry, Remote Sensing and Spatial Informa-

tion Sciences, 42:217–224.

Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics

yolov8.

Jung, J., Jwa, Y., and Sohn, G. (2017). Implicit regular-

ization for reconstructing 3D building rooftop models

using airborne lidar data. Sensors, 17(3):621.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C.,

Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C.,


Lo, W.-Y., et al. (2023). Segment anything. In Pro-

ceedings of the IEEE/CVF International Conference

on Computer Vision, pages 4015–4026.

Lin, M., Jing, W., Li, C., and Jung, A. (2024). Optimized

vectorizing of building structures with switch: High-

efﬁciency convolutional channel-switch hybridization

strategy. IEEE Geoscience and Remote Sensing Let-

ters.

Lucks, L., Bulatov, D., Thönnessen, U., and Böge, M.

(2019). Superpixel-wise assessment of building dam-

age from aerial images. In International Conference

on Computer Vision Theory and Applications (VIS-

APP), pages 211–220.

Meidow, J., Hammer, H., Pohl, M., and Bulatov, D. (2016).

Enhancement of generic building models by recogni-

tion and enforcement of geometric constraints. ISPRS

Annals of the Photogrammetry, Remote Sensing and

Spatial Information Sciences, 3:333–338.

Nauata, N. and Furukawa, Y. (2020). Vectorizing world

buildings: Planar graph reconstruction by primi-

tive detection and relationship inference. In Euro-

pean Conference, Glasgow, UK, August 23–28, 2020,

Proceedings, Part VIII 16 (ECCV), pages 711–726.

Springer.

Partovi, T., Fraundorfer, F., Bahmanyar, R., Huang, H., and

Reinartz, P. (2019). Automatic 3-d building model re-

construction from very high resolution stereo satellite

imagery. Remote Sensing, 11(14):1660.

Pohl, M. and Bulatov, D. (2015). Gap-closing 3D build-

ing reconstruction by aligning boundaries of roof seg-

ments and detecting uncovered details. ISPRS Annals

of the Photogrammetry, Remote Sensing and Spatial

Information Sciences, 2:173–180.

Ren, J., Zhang, B., Wu, B., Huang, J., Fan, L., Ovsjanikov,

M., and Wonka, P. (2021). Intuitive and efﬁcient

roof modeling for reconstruction and synthesis. arXiv

preprint arXiv:2109.07683.

Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (1998).

Density-based clustering in spatial databases: The al-

gorithm gdbscan and its applications. Data Mining

and Knowledge Discovery, 2:169–194.

Sohn, G., Jwa, Y., Kim, H. B., and Jung, J. (2012). An

implicit regularization for 3D building rooftop model-

ing using airborne LIDAR data. ISPRS Annals of the

Photogrammetry, Remote Sensing and Spatial Infor-

mation Sciences, 2 (3):305–310.

Turker, M. and Koc-San, D. (2015). Building extraction

from high-resolution optical spaceborne images us-

ing the integration of support vector machine (SVM)

classiﬁcation, Hough transformation and perceptual

grouping. International Journal of Applied Earth Ob-

servation and Geoinformation, 34:58–69.

Verma, V., Kumar, R., and Hsu, S. (2006). 3D building de-

tection and modeling from aerial lidar data. In 2006

IEEE Computer Society Conference on Computer Vi-

sion and Pattern Recognition (CVPR’06), volume 2,

pages 2213–2220. IEEE.

Wang, Y., Zorzi, S., and Bittner, K. (2021). Machine-

learned 3D building vectorization from satellite im-

agery. In Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, pages

1072–1081.

Wei, S., Ji, S., and Lu, M. (2019). Toward automatic

building footprint delineation from aerial images us-

ing CNN and regularization. IEEE Transactions on

Geoscience and Remote Sensing, 58(3):2178–2189.

Zhang, F., Nauata, N., and Furukawa, Y. (2020). Conv-

mpn: Convolutional message passing neural network

for structured outdoor architecture reconstruction. In

Proceedings of the IEEE/CVF Conference on Com-

puter Vision and Pattern Recognition, pages 2798–

2807.

Zhang, Y. (1999). Optimisation of building detection in

satellite images by combining multispectral classiﬁ-

cation and texture ﬁltering. ISPRS Journal of Pho-

togrammetry and Remote Sensing, 54(1):50–60.

Zhang, Z., Li, Z., Bi, N., Zheng, J., Wang, J., Huang, K.,

Luo, W., Xu, Y., and Gao, S. (2019). Ppgnet: Learning

point-pair graph for line segment detection. In Pro-

ceedings of the IEEE/CVF Conference on Computer

Vision and Pattern Recognition, pages 7105–7114.

Zhao, W., Persello, C., Lv, X., Stein, A., and Vergauwen,

M. (2024). Vectorizing planar roof structure from

very high resolution remote sensing images using

transformers. International Journal of Digital Earth,

17(1):1–15.

Zhao, W., Persello, C., and Stein, A. (2022). Extracting

planar roof structures from very high resolution im-

ages using graph neural networks. ISPRS Journal of

Photogrammetry and Remote Sensing, 187:34–45.

Zorzi, S., Bazrafkan, S., Habenschuss, S., and Fraundorfer,

F. (2022). Polyworld: Polygonal building extraction

with graph neural networks in satellite images. In Pro-

ceedings of the IEEE/CVF Conference on Computer

Vision and Pattern Recognition, pages 1848–1857.
