Automatic Drywall Analysis for Progress Tracking and Quality Control
in Construction

Mariusz Trzeciakiewicz¹, Aleixo Cambeiro Barreiro¹, Niklas Gard¹,², Anna Hilsmann¹ and Peter Eisert¹,²

¹Fraunhofer HHI, Berlin, Germany
²Humboldt University of Berlin, Berlin, Germany
{mariusz.trzeciakiewicz, aleixo.cambeiro, niklas.gard, anna.hilsmann, peter.eisert}@hhi.fraunhofer.de
Keywords:
Automated Drywall Analysis, Construction Progress Tracking, Quality Control, Building Digitalization,
Deep Learning, Data Augmentation.
Abstract:
Digitalization in the construction industry has become essential, enabling centralized, easy access to all rele-
vant information of a building. Automated systems can facilitate the timely and resource-efficient documen-
tation of changes, which is crucial for key processes such as progress tracking and quality control. This paper
presents a method for image-based automated drywall analysis enabling construction progress and quality
assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance
segmentation model to detect and classify various drywall elements with an analysis module to cluster indi-
vidual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This
system extracts valuable information from images, enabling more accurate progress tracking and quality as-
sessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis,
improving instance segmentation accuracy through architecture modifications and targeted data augmentation,
and a novel algorithm to extract important information from the segmentation results. Our modified model,
enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, of-
fering more detailed and precise information than existing approaches. Combined with the proposed drywall
analysis steps, it enables the reliable automation of construction progress and quality assessment.
1 INTRODUCTION
Digitalization has become essential on modern con-
struction sites, centralizing all the relevant build-
ing data and enabling dynamic, distributed access
to vital project information. Recent efforts have
aimed to enhance digitalization in construction, such
as model reconstruction from scanned 2D floor-
plans (Kalervo et al., 2019; Lv et al., 2021; Cam-
beiro Barreiro et al., 2023), text analysis of construc-
tion plans (Schönfelder and König, 2022), and auto-
mated digitalization of fire safety information (Bayer
and Aziz, 2022).
Additionally, AI-based applications using digi-
tal models are emerging, demonstrating the grow-
ing demand for digitalization in modernizing the in-
dustry. Examples include leveraging digital models
for autonomous robot localization (Zhao and Cheah,
2023), increasing worksite automation, and localizing
building interior images for automating inspection re-
sults (Gard et al., 2024).
In practice, digital frameworks support key pro-
cesses throughout a construction project’s life cycle,
such as real-time progress tracking and quality as-
surance, by maintaining up-to-date status information
for building elements. This enables effective evalua-
tion and ensures compliance with planning and qual-
ity norms over time. However, keeping information
up to date is a costly task, both in terms of time and re-
sources. This makes the automation of this field very
valuable for the construction industry.
In terms of automation, image-based analysis is
particularly useful, due to its potential to extract com-
plex information from a scene without human in-
tervention (Ekanayake et al., 2021). With recent
advancements in deep learning techniques, such as
the spread of accurate instance segmentation models,
the ability to automatically analyze construction site
scenes has improved significantly.
In this paper, we focus our attention on the use
of computer vision techniques for drywall analysis,
with the goal of enabling automatic progress track-
ing and quality control. (Wei et al., 2022) classifies
walls into broad categories such as “under construc-
tion” and “finished”. However, it fails to detect in-
dividual wall material types, making it impossible to
estimate intermediate progress stages. On the other
hand, (Ekanayake et al., 2024) distinguishes between
material types but does not segment individual ele-
ments. Another method, (Pal et al., 2023), identifies
different phases of drywall construction and detects
individual elements. However, the range of detected
element types remains limited, and the accuracy for
certain classes is relatively low.
To address the limitations of existing approaches,
this paper introduces a tailored instance segmenta-
tion model for detecting and classifying individual
drywall elements. To further improve model accu-
racy and ensure a good class coverage, particularly
in scenarios with limited data, we introduce a tar-
geted data augmentation strategy. Additionally, algo-
rithms are introduced for drywall analysis, enabling
the separation of multiple drywall segments in an im-
age and their transformation into orthographic projec-
tions. These algorithms form the basis for a detailed
analysis of the construction process, facilitating pre-
cise, segment-specific progress tracking and compli-
ance checks with construction norms. Specifically, the
contributions of this paper are:
• We propose a specialized instance segmentation model for detecting and classifying individual drywall elements.
• We enhance the model’s accuracy through a targeted data augmentation method.
• We develop a set of wall analysis algorithms, enabling new opportunities for automation of progress estimation and quality assurance.
2 RELATED WORK
Automated wall analysis has received some attention
in the literature (Wei et al., 2022; Ekanayake et al.,
2024; Chauhan and Seppänen, 2023). In (Wei et al.,
2022), the authors employ a deep learning model
for classification. However, they classify walls into
general categories such as “no construction”, “un-
der construction”, and “finished”, without detecting
the types of elements used during the building pro-
cess. In contrast, (Ekanayake et al., 2024) presents
a method that classifies different types of elements.
However, the authors do not distinguish between indi-
vidual objects. Instead, they detect regions in the im-
age where a certain element has been installed. While
this improves wall progress analysis, separating in-
dividual elements would provide more detailed in-
formation and enable analysis of their relationships,
such as checking whether metal frames are correctly
spaced and parallel. Both (Shamsollahi et al., 2021)
and (Ying and Lee, 2019) implement deep learning
models for detecting and classifying individual ele-
ments on construction sites. However, (Shamsollahi
et al., 2021) focuses on duct detection, while (Ying
and Lee, 2019) classifies walls, lifts, and doors.
(Pal et al., 2023) highlights that most vision-based
progress monitoring methods rely on a binary clas-
sification of progress as “finished” or “not finished”,
which overlooks the gradual and multi-stage nature of
real-world construction. For instance, in drywall con-
struction, it is possible to detect the metal skeleton’s
completion, estimate the insulation’s percentage, and
verify the alignment of metal frames.
A more detailed approach is proposed in
(Chauhan and Seppänen, 2023), defining distinct
stages like installing metal frames, drywall panels,
and insulation. While this method improves stage-
specific progress estimation, limited data led to lower
accuracy in detecting metal frames and excluded in-
sulation from the pipeline, hindering comprehensive
performance.
Compared to (Wei et al., 2022), which focuses
on general wall classifications, we detect and classify
key drywall elements for progress and quality esti-
mation, including “drywall panels”, “metal frames”,
“wood panels”, and “insulation materials”. In con-
trast to (Ekanayake et al., 2024), our method separates
each individual drywall element, enabling a detailed
analysis of their relationships. We also introduce
a data augmentation strategy to address dataset limi-
tations. In comparison with (Chauhan and Seppänen,
2023), our approach effectively detects all key types
of drywall elements. Furthermore, unlike other ap-
proaches, we cluster and correct perspective distor-
tions in drywall segments, enabling detailed progress
estimation and quality assurance in drywall construc-
tion.
3 METHODOLOGY OVERVIEW
Figure 1: Instance segmentation and drywall analysis pipeline.

Figure 1 illustrates the proposed pipeline, which con-
sists of two main components: instance segmenta-
tion and drywall analysis. The instance segmentation
component, described in detail in Section 5, uses a
modified Mask R-CNN model (Section 5.1) to detect
individual drywall components, enhanced by task-
specific data augmentation techniques (Section 5.2).
In the drywall analysis component, detailed in Sec-
tion 6, we process the output of the instance segmen-
tation model. First, we simplify and optimize the
detected geometries based on prior knowledge (Sec-
tion 6.1). Then, using these optimized geometries
and the orientation of their edges, we separate dry-
wall segments when multiple are present in the im-
age (Section 6.2). Finally, based on the refined ge-
ometries, we estimate perspective distortions caused
by varying drywall angles relative to the camera and
correct these distortions to obtain orthographic pro-
jections (see Section 6.3).
4 DATASET
We use a dataset of 176 drywall images, divided
into two subsets: 140 images for training and 36 for
testing. Each image, with resolutions of 600×800,
800×450, or 800×600 pixels, is manually annotated
with class labels, segmentation masks, and bounding
boxes for each element of four essential classes: wood
panel, insulation, drywall panel, and metal frame.
These classes represent the main elements involved
in the construction of a drywall, and are crucial for
accurate progress and quality estimation.
These images capture drywalls at various stages of
construction, some of them containing multiple wall
segments. The majority depict walls in an intermedi-
ate phase, with interior layers exposed, often showing
elements from each class. A few images show only
the metal frame skeleton, while others display com-
pleted, plastered walls ready for painting.
The main challenges of the dataset, which moti-
vated the development of the proposed data augmen-
tation technique (Section 5.2), are its limited size and
the variability of materials within the same class.
In addition to this dataset, we obtained a sepa-
rate video recording documenting the construction of
a drywall structure for a new room within a build-
ing. Selected frames from this recording were used
to assess the generalization capabilities of our dry-
wall analysis pipeline on scenes outside the training
dataset. This approach allows us to evaluate the prac-
tical viability of our method for application in real-
world construction projects.
5 COMPONENT DETECTION
We propose a modified deep learning architecture for
detecting drywall elements in camera images, incor-
porating task-specific enhancements to improve per-
formance. Additionally, we introduce a drywall-
specific data augmentation technique that increases
dataset variability, addressing challenges posed by
limited training data and enhancing model accuracy.
Accurate drywall analysis depends on extracting
high-quality, detailed information from images, en-
abling precise results and enriching digital models
with valuable metadata for tracking and quality con-
trol. To achieve this, we focus on detecting and clas-
sifying individual drywall elements with high spatial
accuracy. Instance segmentation, chosen for its pixel-
wise masks, provides the precision needed to meet the
stringent requirements of automated drywall analysis,
surpassing simpler bounding-box detection methods.
5.1 Model Architecture
Figure 2: Proposed modification of the Mask R-CNN ar-
chitecture. We replace the backbone used in the original
paper with the ConvNeXt V2 model, introduce additional
ratios for computing anchor boxes, and add a block of de-
convolutional layers at the end of the instance mask branch
to increase the output mask resolution.
For instance segmentation, we employ a modified ver-
sion of Mask R-CNN (He et al., 2017), an architec-
ture whose variants with different backbones achieve
state-of-the-art results (Dalva et al., 2023) on widely
recognized benchmark datasets such as COCO (Lin
et al., 2014). This choice was motivated by the
model’s strong performance and adaptability across
various segmentation tasks.
The modified architecture is illustrated in Fig-
ure 2. The first modification involves replacing the
original backbone model. After evaluating several op-
tions, ConvNeXt V2 (Woo et al., 2023) yielded the
best results on our dataset, leading us to select it as
the backbone for our model.
The second proposed modification accounts for
the unique characteristics of the elements in the
dataset and the way Mask R-CNN calculates region
proposals. To detect objects in images, Mask R-CNN
contains a block of convolutional layers called the Re-
gion Proposal Network (RPN) that identifies potential
object locations by generating a set of anchor boxes,
distributed over the entire image, based on a list of
predefined ratios. Each of these ratios is responsible
for detecting objects of different shapes. The RPN is
then trained to predict whether each anchor box cor-
responds to an object or background, and to adjust the
anchor box coordinates to match the object’s size.
In our dataset, most objects share similar, typi-
cally rectangular shapes, aligning well with the an-
chor box shapes in the original model. However,
metal frames are usually long and narrow, making
them less suited to the original anchor box configu-
rations and harder for the model to detect accurately.
To address this, we introduce additional ratios during
the anchor box generation step, allowing the model
to better capture the distinctive shape of metal frames.
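As a concrete illustration, the sketch below shows how such extra aspect ratios could be wired into the region proposal network of a torchvision Mask R-CNN. This is a minimal sketch, not the authors' exact configuration: the paper's model uses a ConvNeXt V2 backbone rather than the ResNet-50 shown here, and the specific ratio values (0.1 and 10.0 for very wide and very narrow boxes) are illustrative assumptions.

```python
# Sketch: extending the RPN anchor aspect ratios so that long, narrow
# metal frames are covered. Backbone and ratio values are assumptions;
# the paper's model uses ConvNeXt V2 instead of the ResNet-50 shown here.
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator

sizes = ((32,), (64,), (128,), (256,), (512,))       # one size per FPN level
ratios = ((0.1, 0.5, 1.0, 2.0, 10.0),) * len(sizes)  # extra narrow/tall ratios

model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None,              # RPN head shape changes, so no COCO head weights
    weights_backbone="DEFAULT",
    num_classes=5,             # four drywall classes + background
    rpn_anchor_generator=AnchorGenerator(sizes=sizes, aspect_ratios=ratios),
)
```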
Figure 3: Example of synthetic images. The left im-
age shows augmentation with randomized placement of
cropped elements, while the right image demonstrates struc-
tured placement.
The final proposed modification to the model ad-
dresses the size of the predicted instance masks. The
original instance segmentation branch of Mask R-
CNN produces low-resolution binary masks with a
fixed size of 28×28 pixels. These masks are then up-
scaled to match the dimensions of the detected object
in the original image. This incurs a loss of spatial
resolution that often introduces artifacts, especially
for larger elements, causing distortions in object out-
lines. To reduce these artifacts, we add a set of de-
convolution layers at the end of the instance segmen-
tation branch, increasing the resolution of the masks
to 56×56 pixels before the upscaling step. This mod-
ification results in smoother and more accurate object
borders.
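A minimal sketch of this modification follows, assuming a torchvision-style mask branch with 256 feature channels; the channel widths and module placement are assumptions, not the authors' exact code.

```python
# Raise the mask branch output from 28x28 to 56x56 by stacking a second
# transposed convolution before the final 1x1 per-class mask predictor.
import torch.nn as nn

class UpsampledMaskPredictor(nn.Sequential):
    def __init__(self, in_channels, dim_reduced, num_classes):
        super().__init__(
            nn.ConvTranspose2d(in_channels, dim_reduced, 2, stride=2),  # 14 -> 28
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(dim_reduced, dim_reduced, 2, stride=2),  # 28 -> 56
            nn.ReLU(inplace=True),
            nn.Conv2d(dim_reduced, num_classes, 1),  # per-class binary masks
        )

# e.g.: model.roi_heads.mask_predictor = UpsampledMaskPredictor(256, 256, 5)
```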
5.2 Data Augmentation
The number of images in our dataset is limited, with
176 samples to train and validate our model. For com-
parison, in (He et al., 2017), the authors trained their
Mask R-CNN model on the COCO dataset, which
contains 330,000 images. This scarcity of training
data poses a challenge in achieving a model with good
generalization capabilities.
A common approach to address training data
scarcity, particularly when acquiring additional sam-
ples is unfeasible or costly, is to employ data augmen-
tation techniques. In addition to basic methods, such
as random rotation, flipping, or scaling of training im-
ages, we here adapt a more sophisticated technique
introduced in (Cambeiro Barreiro et al., 2022). To
generate new training samples, labeled samples are
selected from the training set that constitute “good
examples”, i.e. with unambiguous labels and cover-
ing sufficient appearance variations. These selected
examples are then used to generate synthetic images.
To this end, randomized transformations such as rota-
tion, translation and scaling are applied to the exam-
ples, which are then pasted onto backgrounds consist-
ing of a collage of random COCO images, increasing
the variety of a scarce dataset.
Drywall elements are, however, usually arranged
in a specific, structured layout, providing additional
contextual information that can be leveraged to im-
prove training. Insulation is usually placed between
parallel, evenly spaced metal frames, with wall and
wood panels attached to them. Randomly position-
ing the objects in a synthetic image disrupts this spe-
cific structure of a drywall, reducing realism. For this
reason, we modify the aforementioned data augmen-
tation method to generate synthetic images that more
accurately reflect the organized configuration of dry-
wall construction.
Following the approach in (Cambeiro Barreiro
et al., 2022), we first select suitable elements using
the same criteria as the original method. Instead of
randomly pasting the selected objects onto new im-
ages, we insert them in a more structured way. We
start by sampling long metal frames, positioning them
vertically and parallel at random intervals. The space
between adjacent frames is divided into rectangular
slots, filled with randomly selected, scaled examples
from various classes. Occasionally, spaces are left
empty to increase dataset variety. Finally, the columns
are rotated randomly around the image center. Instead
of COCO, we use a dataset that contains images of
empty rooms, called room-interior (Research, 2023).
This allows the model to distinguish drywall elements
from common background objects found in indoor en-
vironments, such as painted walls, ceilings and floors.
An example of a synthetic image generated using this
method is shown in Figure 3.
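The sketch below outlines the structured placement step under simplifying assumptions: all names and parameters are illustrative, and the random rotation around the image center as well as the bookkeeping of instance masks and labels are omitted for brevity.

```python
# Sketch of structured synthetic image composition: vertical metal-frame
# crops are pasted at random horizontal intervals, and the slots between
# adjacent frames are filled with crops from the other classes.
import random

def compose_drywall(background, frame_crops, slot_crops,
                    min_gap=80, max_gap=160, empty_prob=0.2):
    """background and crops: HxWx3 numpy arrays; labels omitted for brevity."""
    img = background.copy()
    h, w = img.shape[:2]
    frames = []                                   # (left, right) x-extents
    x = random.randint(0, min_gap)
    while x < w - min_gap:
        frame = random.choice(frame_crops)        # tall, narrow crop
        fh = min(frame.shape[0], h)
        fw = min(frame.shape[1], w - x)
        img[:fh, x:x + fw] = frame[:fh, :fw]      # paste vertical frame
        frames.append((x, x + fw))
        x += fw + random.randint(min_gap, max_gap)
    for (_, slot_l), (slot_r, _) in zip(frames[:-1], frames[1:]):
        if random.random() < empty_prob:          # leave some slots empty
            continue
        crop = random.choice(slot_crops)          # insulation / panel crop
        ch = min(crop.shape[0], h)
        cw = min(crop.shape[1], slot_r - slot_l)
        y = random.randint(0, h - ch)
        img[y:y + ch, slot_l:slot_l + cw] = crop[:ch, :cw]
    return img
```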
6 DRYWALL ANALYSIS
Once the construction elements are localized in an
image, we can analyze them to extract information
relevant to the digital model, enabling tasks such as
progress evaluation or quality control based on prede-
fined standards.
Some images include multiple wall segments,
which presents a unique challenge. Ideally, the infor-
mation for each individual segment should be stored
separately to increase the level of detail in the digi-
tal model, since they may be at different construction
stages. Therefore, we must first identify the elements
belonging to each segment. Additionally, each seg-
ment is captured from a different angle, resulting in
varying perspectives. By isolating segments and cor-
recting their perspective distortions, we generate an
orthographic projection of each, enabling more com-
plex analysis. This allows us to detect structural is-
sues that would be difficult to assess in perspective-
distorted images, facilitating automated quality con-
trol by verifying, for example, that metal frames are
parallel and installed at the correct intervals.
It is trivial to differentiate wall segments that are
physically separated from one another. Walls forming
a corner can be distinguished by their perspective dis-
tortions. In an upright camera perspective view of a
wall segment, all associated horizontal lines converge
at a common vanishing point, which varies for each
segment if they are at different angles. The horizon-
tal sides of the detected elements provide an estima-
tion of the horizontal lines in the wall. Therefore, we
refine these edges and use them to separate the wall
segments.
6.1 Geometry Optimization
As mentioned, we utilize the horizontal sides of de-
tected elements to estimate vanishing points in order
to separate drywall segments. However, these de-
tected elements often exhibit irregularities along their
borders. Therefore, we have implemented an algo-
rithm that simplifies and optimizes these shapes.
The algorithm takes the polygons representing the
outlines of the detected elements and finds 4 corner
candidates for each by iteratively adding contiguous
vertices to a set until they cannot accurately be fit
by a straight line. It then uses RANSAC (Fischler
and Bolles, 1981) to fit lines through the vertices be-
tween the candidates, minimizing the impact of out-
liers. Thus, we obtain accurate 4-sided polygons rep-
resenting the detected elements.
To further increase accuracy, we leverage prior
knowledge of drywall structure. Specifically, neigh-
boring elements share sides and are often arranged in
rows or columns with edges aligned along the same
horizontal or vertical line. We group these aligned
edges and, for each group, we collect their endpoints
and apply RANSAC to fit a line through them. The
original edges are then replaced with the newly cal-
culated lines. Finally, we calculate the intersection
points of updated sides with the other sides to form
the refined polygons.
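Continuing the sketch above, the group refinement step can be expressed as pooling the endpoints of each group of aligned edges and refitting a single line per group (the criterion for deciding which edges form a group is an assumption here):

```python
# Replace each group of roughly collinear edges by one RANSAC line fitted
# through all of their endpoints (reuses ransac_line from the sketch above).
import numpy as np

def refine_edge_groups(edge_groups):
    refined = []
    for group in edge_groups:            # each group: list of (p0, p1) edges
        endpoints = np.array([pt for edge in group for pt in edge], dtype=float)
        refined.append(ransac_line(endpoints))
    return refined
```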
6.2 Clustering of Wall Segments
In this section, we describe our method for estimating
vanishing points for individual drywall segments us-
ing the horizontal edges of optimized polygons. By
identifying these points, we can assign each element
to its corresponding drywall segment, thereby cluster-
ing distinct drywall segments within a single image.
Depending on their angles with respect to the cam-
era position, wall segments will have different vanish-
ing points where the horizontal lines converge. Iden-
tifying these points enables us to determine the edges
that align with them and correctly associate poly-
gons with their respective wall segments. To achieve
this, we divide the elements in the image into vertical
columns. Within each column, we collect the edges
that lie fully or partially within it and calculate their
pairwise intersections, which allows us to determine
a vanishing point for each column sequentially.
If neighboring columns share the same vanish-
ing point, they belong to the same segment. Con-
versely, if two adjacent columns have different van-
ishing points, the boundary between wall segments
lies between them. In cases where two segments con-
verge within a single column, the intersection points
of the edges within it are scattered across a larger area
instead of being concentrated in one location. There-
fore, if we detect such a case, we divide the column
into smaller parts and repeat the vanishing point cal-
culation for each.
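An illustrative sketch of the per-column vanishing point estimate follows (the scatter threshold is an assumed value): pairwise intersections of the horizontal edge lines inside a column are aggregated robustly, and a widely scattered set of intersections signals that two wall segments meet within the column and it should be subdivided.

```python
# Estimate one vanishing point per column from pairwise intersections of
# the horizontal edges inside it; report whether they actually converge.
from itertools import combinations
import numpy as np

def column_vanishing_point(edges, scatter_tol=50.0):
    """edges: horizontal edge lines (a, b, c) with a*x + b*y + c = 0."""
    pts = []
    for l1, l2 in combinations(edges, 2):
        A = np.array([l1[:2], l2[:2]], dtype=float)
        if abs(np.linalg.det(A)) < 1e-9:          # near-parallel pair, skip
            continue
        pts.append(np.linalg.solve(A, -np.array([l1[2], l2[2]], dtype=float)))
    if not pts:
        return None, False
    pts = np.asarray(pts)
    vp = np.median(pts, axis=0)                   # robust central estimate
    spread = np.median(np.linalg.norm(pts - vp, axis=1))
    return vp, spread < scatter_tol               # False -> subdivide column
```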
6.3 Perspective Correction
As the last step of the proposed drywall analysis, we
obtain orthographic projections of each drywall seg-
ment, opening up new possibilities for automating
progress monitoring and quality control. To achieve
this, we first estimate the perspective of each individ-
ual wall segment, then correct it to remove perspec-
tive distortions.
Most drywall elements and the walls themselves
are rectangular and approximately axis-aligned within
the same plane. As a result, the perspective distortion
affecting the entire wall segment, as well as the trans-
formation necessary to correct it, is approximately the
same for the individual elements that are part of it.
Therefore, a simple approach to obtaining the seg-
ment’s orthographic projection is to obtain that of one
of its constituting elements, which can be done by cal-
culating the homography that maps it to a rectangular,
axis-aligned box. This method is, however, quite sen-
sitive to small inaccuracies and limits imposed by fi-
nite resolution, especially so when the chosen element
does not cover a big portion of the wall’s surface.
In order to tackle this, it is possible to combine the
transformations calculated for each of the elements
belonging to the wall segment. To this end, we find
the wall’s corners within the orthographic projection
calculated for each element and project them back to
the original image. Then, we use RANSAC to calcu-
late the average position for each corner minimizing
the effect of outliers. Finding the corners of the wall
in an orthographic projection is a straightforward pro-
cess: the minimum and maximum values along the X
and Y axis of its constituting elements correspond to
the limits of its bounding-box, which is axis-aligned.
Once the four corners of the wall are calculated for
the original image, a homography transformation can
be calculated to obtain its orthographic projection.
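As a sketch of this final warping step (the corner ordering and output resolution are assumptions), the four averaged wall corners can be mapped to an axis-aligned rectangle with a single homography, e.g. using OpenCV:

```python
# Map the four averaged wall corners to an axis-aligned rectangle to
# obtain the orthographic projection of one wall segment.
import cv2
import numpy as np

def rectify_segment(image, corners, out_w=1000, out_h=600):
    """corners: 4x2 array ordered TL, TR, BR, BL in the source image."""
    src = np.asarray(corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]],
                   dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, H, (out_w, out_h))
```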
7 RESULTS
In this section, we present the results of our method,
structured into two subsections: one in which we dis-
cuss the performance of the instance segmentation
module, and one in which we discuss that of the wall
analysis module. In the former, the effectiveness of
the modifications made to the Mask R-CNN model
and the impact of data augmentation on its accuracy
will be reviewed. In the latter, we demonstrate the
outcomes of each post-processing step applied during
the wall analysis.
7.1 Instance Segmentation
In this subsection, a comparison of performance be-
tween different backbone choices for the Mask R-
CNN model will be briefly discussed. Based on this,
the one offering the best results (ConvNeXt V2) is chosen
as a baseline, against which we evaluate the impact of
each proposed modification to the network architec-
ture, as well as the data augmentation methods. This
evaluation is conducted at both the quantitative and
qualitative levels.
7.1.1 Quantitative Evaluation
For quantitative evaluation, we use mean average pre-
cision (mAP), a common metric for instance segmen-
tation models. Specifically, we apply mAP@0.5:0.95,
which computes mAP across intersection over union
(IoU) thresholds from 0.5 to 0.95 in increments of
0.05 and then averages the results. The IoU threshold
is used to decide whether a detection is a true positive
(TP) or a false positive (FP). If the IoU between the
ground truth and the detected object is smaller than the
threshold, the detection is counted as an FP, otherwise
as a TP. We provide
mAP for both bounding boxes (bbox mAP) and in-
stance masks (mask mAP).
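This corresponds to the standard definition, with $C = 4$ classes in our case and $\mathrm{AP}_c(t)$ the area under the precision-recall curve of class $c$ at IoU threshold $t$:

\[
\mathrm{mAP}_{0.5:0.95} \;=\; \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c(t),
\qquad \mathcal{T} = \{0.50, 0.55, \ldots, 0.95\}.
\]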
Table 1 provides an overview of the accuracy
of Mask R-CNN with different backbones on our
dataset. All backbones were pretrained on the COCO
dataset. Among the various backbone architectures,
ConvNeXt V2 clearly yields the best results, which is
why we chose it as a baseline for our method.
Table 2 shows how the proposed model modifi-
cations and data augmentation techniques influenced
accuracy compared to the baseline. All models were
Table 1: Accuracy results for Mask R-CNN with different
backbones on our dataset.
Backbone bbox mAP mask mAP
FPN 0.47 0.45
Swin 0.40 0.39
ResNest 0.40 0.40
ConvNeXt V2 0.55 0.52
Table 2: Evaluation results of Mask R-CNN models with
various modifications and data augmentation strategies.
Here, B represents the baseline model; EB extends the base-
line by adding additional anchor ratios and increasing out-
put mask sizes through deconvolution; CB incorporates data
augmentation by blending objects onto COCO dataset back-
grounds; RB employs data augmentation with room-interior
backgrounds; RP uses data augmentation with randomized
placement, and SP data augmentation with structured place-
ment.
Model bbox mAP mask mAP
B 0.55 0.52
EB 0.57 0.54
EB + CB + RP 0.62 0.58
EB + CB + SP 0.63 0.59
EB + RB + SP 0.64 0.60
trained for 60 epochs with early stopping and Con-
vNeXt V2 backbone. The initial learning rate was set
to 0.02 and decreased by a factor of 10 at the 50th and
55th epochs. We used a weight decay of 0.001 and a
momentum of 0.9.
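The stated schedule corresponds to a standard step-decay setup; a sketch follows (the optimizer type and the training-loop helper are assumptions, only the hyperparameter values are taken from the text):

```python
# Assumed SGD setup reproducing the stated hyperparameters: lr 0.02 divided
# by 10 at epochs 50 and 55, weight decay 0.001, momentum 0.9, 60 epochs.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 55], gamma=0.1)

for epoch in range(60):
    train_one_epoch(model, optimizer, train_loader)  # hypothetical helper
    scheduler.step()
```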
Adding aspect ratios for anchor boxes to improve
narrow frame detection, as well as increasing the in-
stance mask output size, led to a slight improvement
in the accuracy of both bounding box and instance
mask detections. However, the most significant im-
provement was achieved by incorporating syntheti-
cally created images into the training as data augmen-
tation. The base synthetic generation method, as de-
scribed in (Cambeiro Barreiro et al., 2022), improved
bbox mAP by approximately 5% and mask mAP by
around 4% on top of the improvements achieved by
the modified baseline with only basic augmentation
techniques. Further accuracy gains were obtained by
using synthetic backgrounds that more closely resem-
ble the scenes from the original images, and replac-
ing the random placement of cropped elements with
a structured arrangement that imitates actual drywall
patterns. These enhancements, combining realistic
synthetic backgrounds with a more natural arrange-
ment of elements, enabled the model to produce more
accurate instance masks and achieve increased recall.
Figure 4: Visual comparison of detected metal frames (first
row, in red) and wood panels (second row, in blue) by the
baseline model and the modified architecture.
Figure 5: Instance segmentation results using different data
augmentation techniques. Column (a) shows the original
image, (b) the baseline output, (c) the baseline with data
augmentation using COCO backgrounds and random place-
ment, and (d) the baseline with data augmentation using
room-interior backgrounds and structured placement. De-
tected wall panels are shown in pink, wood panels in blue,
insulation in orange, and metal frames in red.
7.1.2 Qualitative Evaluation
The impact of the proposed architecture modifications
is reflected in the quantitative results. However, this
impact is even more evident when we compare the
instance masks produced by the baseline model and
its improved version. A visual comparison is shown
in Figure 4. The addition of extra aspect ratios to
generate anchor boxes more suitable for narrow metal
frames was essential in improving the recall for this
class. On the other hand, we can see that for the base-
line model with masks at 28×28 pixels, visible arti-
facts appear along the borders of detected elements.
Increasing the instance mask output size to 56×56
pixels through deconvolution resolved this issue, as
the detected polygons no longer exhibit the “stair-
case” effect seen in the figure. This is particularly rel-
evant for a clear element edge determination, which
is vital to the accuracy of our wall analysis algorithm.
Figure 5 illustrates the influence of data augmen-
tation on instance detection using the baseline model.
Overall, the baseline model delivers strong detection
results, accurately identifying most elements within
the drywall with only a few misclassifications. More-
over, it can even distinguish between adjacent wall
panels with minimal, barely visible borders.
The model trained with data augmentation, nev-
ertheless, detects fewer false positives and shows
fewer mismatched class labels compared to the model
trained without augmentation. However, the augmen-
tation method that creates new images by randomly
placing elements onto COCO backgrounds disrupts
the characteristic structure of drywalls. As a result,
some elements that clearly do not belong to any wall
segment (i.e. those outside the segment borders) may
be falsely identified as drywall elements. This hap-
pens because the model learns that some elements can
appear randomly distributed across the entire image.
Our modified augmentation approach addresses this
issue by teaching the model that drywall elements are
typically surrounded by other components following
a certain structure, e.g., insulation is always placed
between metal frames.
7.2 Drywall Analysis
For the evaluation of the proposed drywall analysis
process, we focus on the qualitative results that allow
us to estimate the viability of this method for practi-
cal applications, such as progress tracking and quality
control. To this end, we include visualizations of the
results of the most important steps and provide a dis-
cussion of the findings.
As shown in Figure 6, our polygon refinement
technique significantly enhances the accuracy and
consistency of the outlines of detected elements.
The horizontal edges of these polygons are precisely
aligned with each other and with the intersecting hor-
izontal lines that converge at the corresponding van-
ishing point. This alignment greatly improves the es-
timation of vanishing points in the subsequent steps.
Figure 6: Results of the group refinement of simplified
polygons. Borders of the detected wall panels are shown
in white, and optimized polygons in green.
Figure 7: Visualization of our drywall segment clustering
and perspective correction results. Column (a) displays
the original images with two bounding boxes in blue and
green indicating the different wall segments. Column (b)
shows the undistorted view of the left segments, while
column (c) shows that of the right segments.
With accurately estimated polygons, we can suc-
cessfully apply our drywall segment clustering and
perspective correction algorithms. Figure 7 shows ex-
amples from our dataset with two visible wall seg-
ments. In both cases, we successfully identified the
borders between segments, estimated perspective dis-
tortions, and corrected them. The resulting projec-
tions show elements that are parallel to each other,
creating new opportunities for analyzing individual
drywall elements and their relationships, which is cru-
cial for progress and quality analysis.
8 DISCUSSION
Our model demonstrates robust instance segmenta-
tion performance on our dataset. Additionally, the
Figure 8: Visualization of the complete instance segmentation and wall analysis pipeline on new data. In this example: (a)
shows the original image, (b) displays the instance segmentation output, (c) illustrates the bounding boxes of the two wall
segments, and (d) and (e) present the segments with corrected perspective distortions.
proposed wall analysis method successfully isolates
individual drywall segments and corrects their per-
spective distortions. However, the limited number of
testing images in the dataset poses challenges in fully
assessing the system’s generalization capabilities. To
address this, we captured the construction process of
a drywall structure to create a new room by installing
on-site cameras at fixed locations. We then applied
the entire pipeline to these new images representing
different stages of construction captured from differ-
ent points of view. Figure 8 offers a visualization of
the results for one of these images.
We observe that our model not only successfully
detects and classifies most drywall elements but also
effectively uses this information to separate the two
wall segments present in the image and obtain their
orthographic projections. This demonstrates that our
proposed instance segmentation model and analysis
algorithm generalize well to data the model has never
encountered before, and it would be viable to use our
system in real construction projects to automate data
collection.
Compared to (Wei et al., 2022; Ekanayake et al.,
2024), our segmentation model not only detects more
types of drywall elements but also distinguishes indi-
vidual elements and estimates their precise location in
the image. Combined with our drywall analysis mod-
ule, it enables a more comprehensive progress esti-
mation than simply classifying walls as “finished” or
“not finished”. Compared to (Chauhan and Seppänen,
2023), our method successfully detects metal frames
and includes insulation detection, offering more de-
tailed information on the drywall construction pro-
cess.
Figure 8 illustrates the results of our pipeline ap-
plied to a frame of our recorded drywall construc-
tion process. Utilizing a calibrated camera with
known positioning, the extracted information is di-
rectly mapped to a corresponding wall in a digital
model, automating status updates. Furthermore, the
pipeline effectively detects individual elements within
each wall segment with high accuracy. This data, such
as the surface area covered by insulation, can then be
compared against digital model plans to evaluate con-
struction progress and ensure compliance with project
specifications.
To demonstrate a practical application of our wall
analysis method for quality assessment, we analyzed
an image showing only metal frames, as seen in Fig-
ure 9. We generated an orthographic view of a spe-
cific wall segment from the image. Using this view,
we estimated the orientation of the metal frames. This
analysis revealed that one of the frames exhibited a
tilt exceeding 1° (highlighted in red). Based on this
information, a warning can be issued, enabling correc-
tive action to be taken before further progress is made
in the wall construction process.
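A sketch of such a tilt check follows (the fitting method is an assumption; only the 1° tolerance comes from the text): estimate each frame's principal axis from its instance mask in the orthographic view and flag frames that deviate from vertical by more than the tolerance.

```python
# Estimate a metal frame's tilt from its binary mask in the rectified
# (orthographic) view via the principal axis of its pixel coordinates.
import numpy as np

def frame_tilt_deg(mask):
    """mask: binary HxW array covering one metal frame."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    axis = vt[0]                                     # dominant direction
    angle = np.degrees(np.arctan2(axis[0], axis[1])) # 0 deg = vertical
    return abs((angle + 90) % 180 - 90)              # fold into [0, 90]

# issue a warning if frame_tilt_deg(mask) > 1.0
```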
Figure 9: An example of a quality assurance use case em-
ploying the proposed method. The left image shows the
segment under analysis highlighted in blue. The right im-
age illustrates the detected vertical metal frames in green,
along with their angles (in degrees).
9 CONCLUSIONS
In this paper, we present an automated pipeline for
drywall analysis using on-site cameras, combining
a deep learning instance segmentation module with
an analysis module for extracting valuable informa-
tion to create rich digital models for progress track-
ing and quality control. We enhanced a standard neu-
ral network architecture and developed targeted data
augmentation to improve segmentation performance
with limited training data. Additionally, we intro-
duced a novel algorithm to extract useful information
from the segmentation results, to support automatic
progress tracking and quality control, as shown in real
construction-site applications. Our method provides
more detailed information than existing state-of-the-
art approaches, and we demonstrate generalization to
scenes outside the training dataset, showcasing its po-
tential to boost productivity, planning accuracy, and
regulatory compliance in the construction industry.
ACKNOWLEDGEMENTS
This work has partly been funded by the German
Federal Ministry for Digital and Transport (project
EConoM under grant number 19OI22009C). We
thank HOCHTIEF ViCon GmbH for providing train-
ing images of construction sites.
REFERENCES
Bayer, H. and Aziz, A. (2022). Object detection of fire
safety equipment in images and videos using yolov5
neural network. In Proceedings of 33. Forum Bauin-
formatik.
Cambeiro Barreiro, A., Seibold, C., Hilsmann, A., and Eis-
ert, P. (2022). Automated damage inspection of power
transmission towers from uav images. In Proceed-
ings of the International Joint Conference on Com-
puter Vision, Imaging and Computer Graphics Theory
and Applications, volume 5.
Cambeiro Barreiro, A., Trzeciakiewicz, M., Hilsmann, A.,
and Eisert, P. (2023). Automatic reconstruction of se-
mantic 3d models from 2d floor plans. In 2023 18th
International Conference on Machine Vision and Ap-
plications (MVA), pages 1–5. IEEE.
Chauhan, I. and Seppänen, O. (2023). Automatic indoor
construction progress monitoring: challenges and so-
lutions. In EC3 Conference 2023, volume 4, pages 0–
0. European Council on Computing in Construction.
Dalva, Y., Pehlivan, H., Altindis, S. F., and Dundar, A.
(2023). Benchmarking the robustness of instance seg-
mentation models. IEEE Transactions on Neural Net-
works and Learning Systems.
Ekanayake, B., Fini, A. A. F., Wong, J. K. W., and Smith,
P. (2024). A deep learning-based approach to facili-
tate the as-built state recognition of indoor construc-
tion works. Construction Innovation, 24.
Ekanayake, B., Wong, J. K.-W., Fini, A. A. F., and Smith,
P. (2021). Computer vision-based interior construc-
tion progress monitoring: A literature review and fu-
ture research directions. Automation in construction,
127:103705.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: a paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Communications of the ACM, 24(6):381–395.
Gard, N., Hilsmann, A., and Eisert, P. (2024). Spvloc: Se-
mantic panoramic viewport matching for 6d camera
localization in unseen environments. In Proc. Euro-
pean Conf. on Computer Vision (ECCV).
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In 2017 IEEE International Conference
on Computer Vision (ICCV), pages 2980–2988.
Kalervo, A., Ylioinas, J., Häikiö, M., Karhu, A., and Kan-
nala, J. (2019). Cubicasa5k: A dataset and an im-
proved multi-task model for floorplan image analy-
sis. In Image Analysis: 21st Scandinavian Confer-
ence, SCIA 2019, Norrköping, Sweden, June 11–13,
2019, Proceedings 21, pages 28–40. Springer.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-
manan, D., Dollár, P., and Zitnick, C. L. (2014). Mi-
crosoft COCO: Common objects in context. In Fleet,
D., Pajdla, T., Schiele, B., and Tuytelaars, T., edi-
tors, Computer Vision – ECCV 2014, pages 740–755.
Springer International Publishing.
Lv, X., Zhao, S., Yu, X., and Zhao, B. (2021). Residential
floor plan recognition and reconstruction. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 16717–16726.
Pal, A., Lin, J. J., Hsieh, S.-H., and Golparvar-Fard,
M. (2023). Automated vision-based construction
progress monitoring in built environment through dig-
ital twin. Developments in the Built Environment,
page 100247.
Research (2023). room-interior dataset. https://universe.roboflow.com/research-twzom/room-interior. Visited on 2024-11-06.
Schönfelder, P. and König, M. (2022). Deep learning-based
text detection on architectural floor plan images. In
IOP Conference Series: Earth and Environmental Sci-
ence, volume 1101.
Shamsollahi, D., Moselhi, O., and Khorasani, K. (2021). A
timely object recognition method for construction us-
ing the mask r-cnn architecture. In Proceedings of the
International Symposium on Automation and Robotics
in Construction, volume 2021-November.
Wei, W., Lu, Y., Zhong, T., Li, P., and Liu, B. (2022). In-
tegrated vision-based automated progress monitoring
of indoor construction using mask region-based con-
volutional neural networks and bim. Automation in
Construction, 140.
Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon,
I. S., and Xie, S. (2023). Convnext v2: Co-designing
and scaling convnets with masked autoencoders. In
Proc. IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Ying, H. Q. and Lee, S. (2019). A mask r-cnn based ap-
proach to automatically construct as-is ifc bim objects
from digital images. In Proceedings of the 36th Inter-
national Symposium on Automation and Robotics in
Construction, ISARC 2019.
Zhao, X. and Cheah, C. C. (2023). Bim-based indoor mo-
bile robot initialization for construction automation
using object detection. Automation in Construction,
146:104647.