GenGUI: A Dataset for Automatic Generation of Web User Interfaces
Using ChatGPT
Mădălina Dicu¹ᵃ, Enol García González²ᵇ, Camelia Chira¹ᶜ and José R. Villar²ᵈ
¹Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Str. Mihail Kogălniceanu nr. 1, Cluj-Napoca 400084, Romania
²Department of Computer Science, University of Oviedo, C. Jesús Arias de Velasco, s/n, Oviedo 33005, Spain
{madalina.dicu, camelia.chira}@ubbcluj.ro, {garciaenol, villarjose}@uniovi.es
ᵃ https://orcid.org/0009-0001-3877-527X
ᵇ https://orcid.org/0000-0001-7125-9421
ᶜ https://orcid.org/0000-0002-1949-1298
ᵈ https://orcid.org/0000-0001-6024-9527
Keywords:
User Interface Recognition, Dataset, Computer Vision, ChatGPT, Object Detection.
Abstract:
The identification of elements in user interfaces is a problem of considerable current interest due to the significant interaction between users and machines: digital technologies are increasingly used to carry out almost any daily task. Computer vision can help to accurately identify the elements that make up a graphical interface in different applications, such as accessibility, testing, or automatic code generation. This paper focuses on a problem that affects almost any Deep Learning and computer vision task: the generation and annotation of datasets. Few contributions in the literature provide datasets to train vision models for this problem, and most of the available datasets focus on images of mobile applications, all in English. In this paper, we propose GenGUI, a new dataset of desktop applications with varied contents, including multiple languages. Furthermore, different versions of YOLO models are trained on GenGUI to assess its quality, with reasonably good results.
1 INTRODUCTION
The great technological advances of the last years
have made us increasingly dependent on digital de-
vices such as computers or smartphones to carry out
multiple daily tasks efficiently. The increased use of
digital tools has given rise to a new problem: detecting elements in graphical
interfaces. Automating the detection of elements in
the graphical interfaces of daily applications is es-
sential for developing and using digital tools. Some
examples where this problem is relevant are the test-
ing of user interfaces (Bielik et al., 2018; Qian et al.,
2020; White et al., 2019; Yeh et al., 2009), the analysis and improvement of accessibility (Zhang et al., 2021; Miñón et al., 2013; Xiao et al., 2024), the auto-
generation of code for interfaces from images (Chen
et al., 2018; Moran et al., 2020; Nguyen and Csallner,
2015), and the search for content within user inter-
faces (Deka et al., 2017; Reiss, 2014).
The problem of element detection in user interfaces poses a situation in which, starting from an image of a user interface, it is necessary to detail the elements that compose it, including the position, size, and type of each element present in the user interface.
It is, therefore, a problem in the field of computer vi-
sion. To develop a good model that works well in
detecting user interfaces, it is essential to have a large
and varied dataset to train the models. However, the
datasets currently available in the literature are insuf-
ficient, as they contain only a limited number of im-
ages. For example, see datasets (Bunian et al., 2021)
and (Dicu et al., 2024a). Moreover, these present lim-
itations in the variety of elements and languages.
The current work presents a novel way to automatically generate user interfaces using ChatGPT (OpenAI, 2024). The goal is to build a large dataset, named GenGUI, with many user interfaces and a wide variety of elements and languages, in order to develop a good computer vision model capable of successfully recognizing elements in user interfaces. Using ChatGPT, we have artificially generated a dataset composed of 250 websites, from which more than 20,000 elements have been annotated. To
conclude the paper, different versions of the YOLO
model have been evaluated to identify these elements,
obtaining good results with YOLOv9.
The paper is organized as follows: Section 2 re-
views existing datasets and their characteristics. Sec-
tion 3 introduces our approach for automatically gen-
erating web user interfaces using ChatGPT. Section
4 describes the labeling process and dataset features.
Section 5 presents the experimental analysis of com-
puter vision models trained on the dataset. Finally,
Section 6 summarizes the findings and outlines future
work.
2 EXISTING DATASETS
A search of the literature shows that few published works present datasets to address the problem of element detection in user interfaces. The contributions discussed below are the most relevant and the ones most frequently adopted by other authors.
One of the most widely used datasets is RICO
(Deka et al., 2017), which contains over 72,000
screenshots of Android mobile applications. It iden-
tifies a broad range of interface elements, making it a
foundational resource for UI element detection. How-
ever, RICO has a major drawback: all the screenshots
are configured in English, which limits the dataset’s
generalizability to multilingual contexts. Building on
RICO, the Enrico dataset (Leiva et al., 2020) refines
the labeling of 10,000 images to improve annotation
quality. Despite this enhancement, Enrico inherits
RICO’s limitations in terms of scope and diversity.
To expand beyond RICO’s focus, the VINS
dataset (Bunian et al., 2021) includes images from
Android and iOS applications. While this improves
device variety, VINS is significantly smaller, contain-
ing only 4,543 images and covering fewer element
types. Like its predecessors, VINS also includes only
English-language interfaces, limiting its use in mul-
tilingual environments where UI layouts often vary
with language.
A different approach is introduced by the UICVD
dataset (Dicu et al., 2024b), which shifts focus from
mobile applications to websites. This dataset consists
of images from 121 websites, marking a move to-
wards web-based interfaces. While UICVD provides
valuable insights into website UI elements, it shares
the same limitations as RICO, Enrico, and VINS—all
the content is exclusively in English. Table 1 summa-
rizes the most relevant characteristics studied in these
datasets.
While these datasets are valuable, they exhibit
several limitations that reduce their applicability in
broader contexts. First, they primarily focus on
mobile applications, neglecting desktop applications,
which are typically more complex and feature ad-
vanced UI elements such as multi-window interac-
tions and toolbars. Second, the exclusive use of
English in these datasets limits their generalizability
to multilingual environments. In multilingual con-
texts, interface layouts and element structures can
vary significantly based on the language, further lim-
iting the effectiveness of models trained on English-
only datasets.
To address these gaps, our work proposes the cre-
ation of a new dataset that focuses on desktop appli-
cations and incorporates diverse languages. This ex-
pansion is crucial for improving the robustness of UI
element detection models, ensuring they can perform
effectively in more varied, real-world scenarios. By
covering these previously neglected areas, our dataset
will offer a more comprehensive resource for training
and evaluating models in user interface recognition.
3 AUTOMATIC GENERATION OF
WEB USER INTERFACES
The primary goal of this work is the automatic gener-
ation of user interfaces. The dataset generation pro-
cess consists of two main phases. In the first phase,
ChatGPT is used to automatically generate the source
code for multiple user interfaces. In the second phase,
each generated interface is opened, and a screenshot
is captured for further use.
The core of the first phase is the generation of the source code of the user interfaces. For this, ChatGPT with the GPT-4o model (OpenAI, 2024) was used to generate the code for the GenGUI dataset. The decision to generate code with ChatGPT (specifically, the GPT-4o model) instead of rely-
ing on pre-existing web templates was driven by sev-
eral factors. One of the primary reasons is that while
many publicly available templates may appear visu-
ally similar, the underlying code structures can vary
significantly. This inconsistency in code structure can
create issues when trying to build a unified dataset for
UI element detection. After analyzing multiple web
templates, we observed variations in the way HTML,
CSS, and JavaScript were implemented, which could
complicate efforts to create a cohesive dataset suit-
able for training machine learning models. By using
ChatGPT to generate the code, we ensured that the
structure remained consistent across all generated in-
terfaces, adhering to our exact requirements in terms
of layout, functionality, and visual diversity. Further-
more, generating the code ourselves eliminated con-
cerns over licensing issues and provided the flexibility
to tailor designs to our specific needs, including mul-
tilingual support and custom UI components.
Table 1: Summary of the characteristics of the similar datasets studied.
Dataset Number of images Number of classes Language Device Reference
RICO 72000 23 English Android (Deka et al., 2017)
UICVD 121 16 English PC / Websites (Dicu et al., 2024b)
VINS 4543 12 English Android+iOS (Bunian et al., 2021)
Enrico 10000 23 English Android (Leiva et al., 2020)
The conversation with this automatic generation
model always started with the prompt: “Use Boot-
strap to create a web page for a fictitious company.
The company intends to use it A. Make sure the web-
site includes a navigation bar B. You can include
some of the following elements on the site: Icons, Text
Fields, Buttons, Images, Table, Containers, Section
Title, Search bar, Checkboxes, Dropdown, Radio but-
tons, Text areas, Forms, Graphs. Support as many in-
terface elements as possible with icons from the Font
Awesome library. Make the page content in C. Re-
place placeholder images with real images. In the
output include only the HTML code, you don’t need
to explain anything”.
The prompt to initiate the conversation has a series
of gaps (A, B, and C), which will be filled with differ-
ent content to generate varied user interfaces. Table 2
shows the different values that have been used to fill
in the prompt gaps. Once a chat conversation with the
generation model is started, the statements “Generate
me another site with the same characteristics, but a
different look and feel”, “Generate another site that
looks different” and “Generate another different web-
site” are used to request the source code of more web
sites, until ten web sites with similar characteristics,
but different look and feel, are obtained.
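As an illustration of this generation loop, the sketch below shows how it could be scripted with the OpenAI Python client. The prompts follow the text above; the client calls, the ten-page loop per conversation, and the example (A, B, C) values are assumptions about one possible implementation, since the interfaces were generated through ChatGPT conversations rather than a published script.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BASE_PROMPT = (
    "Use Bootstrap to create a web page for a fictitious company. "
    "The company intends to use it {A}. Make sure the website includes a "
    "navigation bar {B}. You can include some of the following elements on the "
    "site: Icons, Text Fields, Buttons, Images, Table, Containers, Section "
    "Title, Search bar, Checkboxes, Dropdown, Radio buttons, Text areas, "
    "Forms, Graphs. Support as many interface elements as possible with icons "
    "from the Font Awesome library. Make the page content in {C}. Replace "
    "placeholder images with real images. In the output include only the HTML "
    "code, you don't need to explain anything."
)
# One of the three follow-up requests described in the text.
FOLLOW_UP = ("Generate me another site with the same characteristics, "
             "but a different look and feel")


def generate_sites(a: str, b: str, c: str, n_sites: int = 10) -> list[str]:
    """Return the HTML of n_sites pages for one (A, B, C) combination,
    generated within a single conversation."""
    messages = [{"role": "user", "content": BASE_PROMPT.format(A=a, B=b, C=c)}]
    pages = []
    for _ in range(n_sites):
        response = client.chat.completions.create(model="gpt-4o",
                                                  messages=messages)
        html = response.choices[0].message.content
        pages.append(html)
        # Keep the conversation going so each request sees the previous pages.
        messages.append({"role": "assistant", "content": html})
        messages.append({"role": "user", "content": FOLLOW_UP})
    return pages


# Example: an intranet-style site with a lateral navigation bar, in Romanian.
sites = generate_sites(
    a="as an intranet for employees to carry out tasks such as consulting "
      "payroll and requesting days off",
    b="Lateral",
    c="Romanian",
)
```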
In terms of content structure, GPT-4o occasionally
favored content-heavy pages, which might not always
be desirable for certain applications. To overcome
this, we refined the prompts further to generate web
pages with varied densities of content, ranging from
simple, clean layouts to more complex, information-
rich interfaces. This iterative process of adjusting the
prompts and reviewing the generated code allowed us
to arrive at a balanced approach for diversity in UI
designs.
Once the source code of the user interfaces was
generated, a manual inspection was performed to
check that there were no interfaces with errors or that
were very similar. In the generated websites, few in-
terfaces presented this problem, but a small number
of interfaces were eliminated to avoid contaminating
the dataset. In addition, during this manual inspec-
tion, the images and graphics included in the websites
were replaced with images and graphics generated by
Bing Image Creator and MS Excel, as some websites
had been generated with a gray placeholder marking the spot where an image should be included. Bing Im-
age Creator was used for the more decorative images
with prompts such as “Generate me an image of a
real person”, “Generate me an image of an office” or
“Generate me an image of a marketplace”. MS Excel
was used for more business-oriented sites. In Excel,
a matrix of random numbers was generated and many
different types of graphs were drawn with those num-
bers.
Once the source code of the websites was avail-
able and the manual inspection had been done to elim-
inate errors and replace images, we moved on to the
second phase of the generation of the user interfaces
for the GenGUI dataset, which consists of opening the
web interfaces and taking a screenshot of them, since
the dataset will be composed of images, not code.
As the user interfaces were developed as web inter-
faces, Firefox (Mozilla Foundation, 2024) and Sele-
nium (SeleniumHQ, 2024) were used for this part.
Selenium is a web testing framework that allows the
control of the Firefox web browser to be automated.
This phase consisted of opening a website in Fire-
fox, extracting a screenshot of the complete site us-
ing Selenium, and moving on to do the same with the
following website until the complete dataset was processed.
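The screenshot phase can be summarized in a few lines of code. The sketch below assumes Selenium 4 with geckodriver available on the PATH and locally stored HTML files; folder and file names are illustrative. The full-page capture relies on a Firefox-specific Selenium helper, which is one reason Firefox is convenient for this step.

```python
# Minimal sketch of the screenshot phase (assumed file layout and names).
import glob
import os

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")          # no visible browser window needed
driver = webdriver.Firefox(options=options)
driver.set_window_size(1920, 1080)          # GenGUI fixes the width at 1920 px

os.makedirs("screenshots", exist_ok=True)
for path in sorted(glob.glob("generated_sites/*.html")):
    name = os.path.splitext(os.path.basename(path))[0]
    driver.get("file://" + os.path.abspath(path))
    # Firefox-specific helper that captures the whole page, not just the viewport.
    driver.get_full_page_screenshot_as_file(f"screenshots/{name}.png")

driver.quit()
```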
4 DATASET CREATION
As described in the previous section, the process
of generating user interfaces was automated using a
GPT-4o model and the Bootstrap framework. Screen-
shots of these interfaces were then used to create
a dataset aimed at training computer vision models.
The dataset creation involved several stages: auto-
matic interface generation, the elimination of inappro-
priate or redundant images, and finally, manual anno-
tation of the individual elements. In this section, we
detail the annotation process, the criteria used for la-
beling, and the final structure of the dataset.
4.1 Element Annotation
Table 2: Different options to fill in the gaps in the prompt used to start a conversation with ChatGPT. The different options are separated by the / character.
A: Promote the company and publicize its services / Promote a product / Manage the company internally / as an intranet for employees to carry out tasks such as consulting payroll and requesting days off.
B: Horizontal / Lateral
C: English / Spanish / Romanian / German / French / Italian / Dutch / Swedish / Norwegian / Portuguese
To ensure a high-quality dataset, every image from the generated interfaces was manually annotated. Although we had access to the HTML code of these in-
terfaces, we opted for manual annotation because the
HTML structure does not always accurately reflect
how elements are visually displayed on the screen.
Specifically, the visual layout can be influenced by
CSS styles and JavaScript, and dynamic or hidden elements can complicate the automatic annotation process.
We chose manual annotation to guarantee accu-
racy, particularly because a study by Dicu et al. (2024a) demonstrated that manual annotations
yield better results for training visual detection mod-
els, especially when dealing with complex elements
like icons. This approach also provides a solid foun-
dation for potential automation in the future. Addi-
tionally, manual annotation allowed us to correct er-
rors and establish clear criteria for labeling. Each vi-
sual element was identified and labeled according to
a well-defined classification, covering both visual and
functional aspects.
Figure 1 presents an example from the dataset,
showing both the raw interface image and its anno-
tated version, where each visual element is correctly
labeled.
4.2 Annotation Criteria
To establish a coherent and unified annotation process
for the GenGUI dataset, we developed a set of spe-
cific criteria. These criteria focused on the following
aspects:
Accuracy in Label Positioning. We aimed to
place labels as close as possible to the correspond-
ing visual elements, ensuring that they were delin-
eated and did not interfere with other elements in
the interface.
Granularity. We sought to distinguish elements
not only based on their visual characteristics but
also their functionality.
Functional Context. Some elements may serve
multiple roles. For example, each button was
labeled accordingly, but internal elements, such
as text or icons within the button, were anno-
tated separately. This allows vision models to dis-
tinguish between the different visual components
within a single functional element.
We used LabelStudio (Tkachenko et al., 2022), an
open-source data annotation platform, which allowed
us to perform manual labeling in line with the estab-
lished criteria.
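Since the experiments in Section 5 train YOLO models, the Label Studio export eventually has to be converted into YOLO-style annotation files. The sketch below shows one way this conversion could look, assuming the standard Label Studio JSON export for rectangle labels, where box coordinates are stored as percentages of the image size; the class list and file names are illustrative and restricted to the 13 main classes.

```python
import json
from pathlib import Path

# The 13 main classes of GenGUI (subclass labels would need their own mapping).
CLASSES = ["Text", "Icon", "Container", "MenuItem", "Button", "InputField",
           "TableColumn", "Row", "Menu", "WorkingArea", "Image", "Table",
           "Footer"]


def export_to_yolo(export_json: str, out_dir: str) -> None:
    """Write one YOLO .txt file per annotated image."""
    tasks = json.loads(Path(export_json).read_text())
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for task in tasks:
        image_name = Path(task["data"]["image"]).stem
        lines = []
        for region in task["annotations"][0]["result"]:
            value = region["value"]
            label = value["rectanglelabels"][0]
            if label not in CLASSES:   # ignore labels outside the main classes
                continue
            # Label Studio stores the top-left corner and size in percent (0-100);
            # YOLO expects the box centre and size normalised to [0, 1].
            x_c = (value["x"] + value["width"] / 2) / 100
            y_c = (value["y"] + value["height"] / 2) / 100
            w, h = value["width"] / 100, value["height"] / 100
            lines.append(f"{CLASSES.index(label)} "
                         f"{x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
        Path(out_dir, f"{image_name}.txt").write_text("\n".join(lines))


export_to_yolo("gengui_export.json", "labels")
```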
4.3 Dataset Structure
The final dataset consists of 250 PNG images with
variable resolutions. The width of all images is fixed
at 1920 pixels, while the height ranges from 533 to
3285 pixels. This variation in height enabled us to
better simulate real-world scenarios, where user inter-
faces can have different dimensions depending on the
content. We eliminated certain images that either con-
tained errors during generation or featured elements
that were difficult to annotate, ensuring a high-quality
dataset. Although this process resulted in an unequal
number of images per language, we considered this
necessary to achieve relevant experimental results.
After the elimination process, the dataset contains
a total of 250 images and 20,484 annotations, dis-
tributed across 13 main classes and 29 subclasses, as
detailed in Table 3.
We chose these classes and subclasses to reflect
the diversity and complexity of graphical components
found in user interfaces. The dataset includes essen-
tial elements such as images, text, buttons, and icons,
as well as more complex components like input fields,
menus, and tables. This classification allows for a
greater level of granularity in detecting and classify-
ing elements, ensuring that the dataset can be used for
a wide range of computer vision applications.
The decision to divide elements into classes and
subclasses was driven by the need to cover as many
scenarios as possible in modern graphical interfaces.
For instance, text appears in various forms and
functions—from titles to text buttons or menu sec-
tions—which is why we introduced several subclasses
to capture these variations. Similarly, we differenti-
ated icons from other visual elements to offer more
precision in the annotation process.
5 EXPERIMENTS AND RESULTS
Figure 1: Example from the dataset: (a) Image 29, showing the raw user interface; (b) the same interface with manually annotated elements.
To validate the quality and utility of the created dataset, we conducted a series of experiments using
two of the most advanced object detection models,
YOLOv8 (Jocher et al., 2023) and YOLOv9 (Wang
et al., 2024). The purpose of these experiments was
to observe how these models perform on the 13 main
classes in our dataset, providing a concrete evaluation
of their effectiveness in detecting graphical user inter-
face elements. Additionally, these experiments will
help identify potential limitations and areas for im-
provement in future expansions of the dataset.
5.1 You Only Look Once (YOLO)
Models
The YOLO (You Only Look Once) architecture (Red-
mon et al., 2016) has transformed object detection by
combining speed and accuracy in a single-stage pro-
cess. By dividing the input image into a grid and pre-
dicting bounding boxes and class probabilities simul-
taneously, YOLO enables real-time detection, making
it a widely used approach for a variety of applications.
YOLOv8 (Jocher et al., 2023) improves on earlier
versions by enhancing accuracy with the C2f mod-
ule, particularly for detecting small or overlapping ob-
jects. Its anchor-free design and decoupled head op-
timize detection, classification, and regression tasks,
ensuring precision without compromising speed.
YOLOv9 (Wang et al., 2024) extends these ad-
vancements by incorporating attention mechanisms
and Feature Pyramid Networks (FPN) for better de-
tection across various object sizes. It also uses a hy-
brid training strategy, combining supervised and un-
supervised learning, which improves performance on
datasets with limited labeled data.
We selected these models for their balance of
speed and accuracy, as well as their effectiveness
in detecting small and overlapping objects—key in
graphical interfaces with buttons, icons, and text
fields. Testing on our dataset evaluates performance
and highlights areas for improvement, particularly in
class balance and diversity, providing a strong foun-
dation for future work.
5.2 Experimental Setup and Evaluation
Metrics
To objectively compare the performances of YOLOv8
and YOLOv9, we used the same parameters across
both experiments, employing the default versions of
these models without major modifications, as the
goal was to assess their general performance on our
dataset. Both models were trained for 100 epochs
with a batch size of 2, adapted to the relatively small
size of our dataset. We used SGD (Stochastic Gradi-
ent Descent) as the optimizer, with a learning rate of
0.01 and an image size of 1024x1024 pixels. The im-
age augmentations applied were the default ones pro-
vided by the YOLO framework, ensuring consistency
in performance evaluation.
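For reference, the sketch below reproduces this setup with the Ultralytics Python API, which bundles both YOLOv8 and YOLOv9 weights. The specific weight files, the dataset YAML name, and the use of the Ultralytics port of YOLOv9 are assumptions; the hyperparameters mirror the values listed above.

```python
# Sketch of the training runs with the Ultralytics API (weight files and the
# dataset YAML name are assumptions; hyperparameters follow the text).
from ultralytics import YOLO

# gengui.yaml (illustrative) would list the image folders and the 13 main classes:
#   path: datasets/gengui
#   train: images/train
#   val: images/val
#   test: images/test
#   names: [Button, Container, Footer, Icon, Image, InputField, Menu,
#           MenuItem, Row, Table, TableColumn, Text, WorkingArea]

for weights in ("yolov8s.pt", "yolov9c.pt"):
    model = YOLO(weights)
    model.train(
        data="gengui.yaml",
        epochs=100,
        batch=2,
        imgsz=1024,
        optimizer="SGD",
        lr0=0.01,
    )
    # Evaluate on the held-out test split with the confidence threshold used
    # in the paper; Ultralytics reports per-class AP and mAP@0.5 among others.
    metrics = model.val(data="gengui.yaml", split="test", conf=0.25)
    print(weights, metrics.box.map50)
```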
The experiments were conducted on the Google
Colaboratory (Bisong and Bisong, 2019) platform,
utilizing a T4 GPU, which enabled efficient process-
ing of the dataset and training of the models over mul-
tiple epochs.
Table 3: Distribution of annotations across classes and subclasses in the GenGUI dataset. Each entry lists a main class with its total number of annotations, followed by its subclasses and their counts.
Text (8927): text 5231, sectionTitle 1227, textForNavigationBar 889, textForButton 792, textForSidebar 335, title 209, statusLabel 125, textForDropdown 119
Icon (4566): icon 4331, dropdownIcon 121, checkBox 71, radioButton 43
Container (1370): container 1370
MenuItem (1227): navigationItem 893, sideMenuItem 334
Button (1036): button 1036
InputField (771): inputField 582, dropdown 120, datePicker 69
TableColumn (754): tableColumn 754
Row (718): row 550, tableHeader 168
Menu (309): navigationBar 247, sideBar 62
WorkingArea (250): workingArea 250
Image (228): image 122, graph 106
Table (168): table 168
Footer (160): footer 160
In terms of performance evaluation, we used several key metrics. IoU (Intersection over Union), with a fixed value of 0.5, was employed to measure the overlap between the predicted and actual bounding boxes. Additionally, we calculated AP (Average Precision) for each class based on the precision-recall curve and mAP (mean Average Precision) to provide an overall performance measure across all classes. The models were evaluated using a single confidence threshold of 0.25, ensuring consistent filtering of predictions.
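For clarity, the IoU criterion used here can be written down directly; the minimal sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates.

```python
# Minimal IoU sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as correct only if iou(pred, ground_truth) >= 0.5.
assert iou((0, 0, 10, 10), (0, 0, 10, 5)) == 0.5
```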
5.3 Results
The primary goal of the experiment is to assess the
performance of YOLOv8 and YOLOv9 by training
them on the 13 main classes of the dataset. We fo-
cused on the main classes to gain a general under-
standing of how the dataset performs in object detec-
tion and to simplify the analysis of the results.
The data was randomly split into three subsets:
80% for training, 10% for validation, and 10% for
testing. This split ensures that the models have suf-
ficient data for learning while allowing for a repre-
sentative evaluation. Table 4 presents the distribution
of images and annotations across the dataset.
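The split itself is straightforward to reproduce. The sketch below assumes the screenshots and YOLO label files described earlier and a fixed random seed; the folder names, the seed, and the layout matching the dataset YAML sketched in Section 5.2 are illustrative choices.

```python
import random
import shutil
from pathlib import Path

random.seed(42)
images = sorted(Path("images").glob("*.png"))   # the 250 GenGUI screenshots
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)     # 200 / 25 / 25 for 250 images
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}

for split, files in splits.items():
    img_dir = Path("datasets/gengui/images") / split
    lbl_dir = Path("datasets/gengui/labels") / split
    img_dir.mkdir(parents=True, exist_ok=True)
    lbl_dir.mkdir(parents=True, exist_ok=True)
    for img in files:
        # Copy the screenshot and its YOLO annotation file side by side.
        shutil.copy(img, img_dir / img.name)
        shutil.copy(Path("labels") / f"{img.stem}.txt", lbl_dir / f"{img.stem}.txt")
```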
Following the experiments on our dataset, we ob-
tained the performances shown in Tables 5 and 6 for
the YOLOv8 and YOLOv9 models.
Table 4: Distribution of images and annotations across training, validation, and test sets.
                        Total    Train    Validation    Test
Number of Images          250      200            25      25
Number of Annotations   20484    16480          2082    1922
Table 5: Results of YOLOv8 and YOLOv9 models trained on the dataset (Part 1), showing the mean Average Precision (mAP) and Average Precision (AP) per class. The models were evaluated using a confidence threshold of 0.25 and an Intersection over Union (IoU) of 0.5.
Model    mAP      Button    Container    Footer    Icon     Image    InputField
YOLOv8   30.44    65.71     26.84        0.00      74.68    70.13    3.27
YOLOv9   57.78    93.97     57.60        0.00      95.66    79.31    90.00
Table 6: Results of YOLOv8 and YOLOv9 models trained on the dataset (Part 2), showing the mean Average Precision (mAP) and Average Precision (AP) per class. The models were evaluated using a confidence threshold of 0.25 and an Intersection over Union (IoU) of 0.5.
Model    Menu     MenuItem    Row     Table    TableColumn    Text     WorkingArea
YOLOv8   3.03     36.12       0.00    0.00     31.15          84.64    0.21
YOLOv9   9.92     86.87       0.00    63.36    77.05          96.51    0.83
The experimental results reveal a significant performance gap between YOLOv9 and YOLOv8, with YOLOv9 achieving a mAP of 57.78%, nearly double YOLOv8's 30.44%. This demonstrates YOLOv9's ability to handle the variability and complexity of our dataset more effectively.
In general, YOLOv9 performed much better across all classes, especially in classes that require high detection accuracy, such as Text, Icon, and Button. The strong performance in these classes can be explained by their high representation in the dataset, with a large number of annotations providing more learning opportunities for the models, leading to better generalization. For example, in the Text class, YOLOv9 achieved an impressively high AP, demonstrating the model's ability to handle different types of text in graphical interfaces, such as titles or button labels.
In contrast, both models struggled with underrepresented classes like Footer, Row, and WorkingArea, where AP scores were low or nonexistent. The limited number of annotations and the contextual variability of these elements likely contributed to the
weak performance, as the models lacked sufficient
data to learn their distinctive features. A particu-
larly notable example is the InputField class, where
YOLOv9 achieved strong results while YOLOv8 un-
derperformed, highlighting YOLOv9’s ability to han-
dle complex visual contexts more effectively.
These findings highlight the importance of ad-
dressing annotation imbalance and improving class
diversity to enhance detection accuracy and model
generalization. Underrepresented classes such as
Footer, Row, and WorkingArea require additional at-
tention, as their limited presence impacts the models’
ability to learn their distinct features effectively.
The current dataset represents a solid foundation
for detecting elements in desktop graphical interfaces.
However, its class imbalance reflects the natural dis-
tribution of these elements in real-world applications.
To build a more comprehensive and balanced re-
source, it will be necessary to expand the dataset with
a broader range of images and annotations, partic-
ularly for rarer elements. This effort will not only
improve detection performance for underrepresented
classes but also strengthen the models’ ability to gen-
eralize across diverse scenarios and graphical appli-
cations.
6 CONCLUSIONS AND FUTURE
WORK
This paper introduces GenGUI, a new dataset for de-
tecting elements in graphical user interfaces. Accu-
rate detection of these elements is essential for au-
tomating the visualization and processing of user in-
terfaces. Generated using the GPT-4o model and
Bootstrap framework, GenGUI includes a variety of
visual elements, such as text, buttons, icons, and input
fields. Unlike existing datasets, which focus primar-
ily on mobile interfaces and English-language con-
tent, GenGUI includes diverse desktop interfaces in
multiple languages, addressing a major gap in the lit-
erature.
Experiments with YOLOv8 and YOLOv9 demon-
strate the dataset’s effectiveness in identifying UI el-
ements, though challenges remain for underrepre-
sented classes like WorkingArea and Footer. Expand-
ing the dataset and increasing annotations for these
classes will help address these issues.
In the future, we plan to expand the dataset by
adding more diverse and complex interfaces, along
with a wider variety of graphical elements, to bet-
ter capture real-world scenarios. We also aim to de-
velop an automated labeling method based on the cur-
rent annotations, which will serve as a reliable ground
truth and help reduce the need for manual work. By
improving the dataset and streamlining the annotation
process, we hope to create a more valuable and practi-
cal resource for researchers and developers, contribut-
ing to the advancement of graphical interface element
detection and understanding.
The dataset is available at the following link:
https://github.com/MadaDicu/GENGUI
ACKNOWLEDGEMENTS
The authors, Enol García González and José R. Villar, acknowledge support from the Spanish Ministry
lar, acknowledge support from the Spanish Ministry
of Economics (PID2020-112726RB-I00), the Spanish
Research Agency (PID2023-146257OB-I00), Prin-
cipado de Asturias (SV-PA-21-AYUD/2021/50994),
the Council of Gijón, and Fundación Universidad de Oviedo (FUO-23-008, FUO-22-450).
REFERENCES
Bielik, P., Fischer, M., and Vechev, M. (2018). Robust relational layout synthesis from examples for Android. Proceedings of the ACM on Programming Languages, 2(OOPSLA).
Bisong, E. (2019). Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, pages 59–64. Apress.
Bunian, S., Li, K., Jemmali, C., Harteveld, C., Fu, Y., and El-Nasr, M. S. (2021). VINS: Visual search for mobile user interface design. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI '21, New York, NY, USA. Association for Computing Machinery.
Chen, C., Su, T., Meng, G., Xing, Z., and Liu, Y. (2018).
From ui design image to gui skeleton: A neural ma-
chine translator to bootstrap mobile gui implementa-
tion. In 2018 IEEE/ACM 40th International Confer-
ence on Software Engineering (ICSE), pages 665–676.
Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan,
D., Li, Y., Nichols, J., and Kumar, R. (2017). Rico:
A mobile app dataset for building data-driven design
applications. UIST ’17, page 845–854, New York,
NY, USA. Association for Computing Machinery.
Dicu, M., González, E. G., Chira, C., and Villar, J. R.
(2024a). The impact of data annotations on the per-
formance of object detection models in icon detection
for gui images. In International Conference on Hy-
brid Artificial Intelligence Systems, pages 251–262.
Springer.
Dicu, M., Sterca, A., Chira, C., and Orghidan, R. (2024b).
Uicvd: A computer vision ui dataset for training rpa
agents. In ENASE, pages 414–421.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). YOLO by
Ultralytics. https://github.com/ultralytics/ultralytics.
Accessed: June 20, 2024.
Leiva, L. A., Hota, A., and Oulasvirta, A. (2020). Enrico: A
high-quality dataset for topic modeling of mobile UI
designs. In Proc. MobileHCI Adjunct.
Miñón, R., Moreno, L., and Abascal, J. (2013). A graph-
ical tool to create user interface models for ubiqui-
tous interaction satisfying accessibility requirements.
Univers. Access Inf. Soc., 12(4):427–439.
Moran, K., Bernal-Cárdenas, C., Curcio, M., Bonett, R.,
and Poshyvanyk, D. (2020). Machine learning-based
prototyping of graphical user interfaces for mobile
apps. IEEE Transactions on Software Engineering,
46(2):196–221.
Mozilla Foundation (2024). Mozilla Firefox. Web browser,
Version 118.
Nguyen, T. A. and Csallner, C. (2015). Reverse engineer-
ing mobile application user interfaces with remaui (t).
In 2015 30th IEEE/ACM International Conference on
Automated Software Engineering (ASE), pages 248–
259.
OpenAI (2024). ChatGPT (October 2024 version). https://
openai.com/chatgpt. Large language model.
Qian, J., Shang, Z., Yan, S., Wang, Y., and Chen, L. (2020).
Roscript: a visual script driven truly non-intrusive
robotic testing system for touch screen applications.
In Proceedings of the ACM/IEEE 42nd International
Conference on Software Engineering, ICSE ’20, page
297–308, New York, NY, USA. Association for Com-
puting Machinery.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Reiss, S. P. (2014). Seeking the user interface. In Proceed-
ings of the 29th ACM/IEEE International Conference
on Automated Software Engineering, ASE ’14, page
103–114, New York, NY, USA. Association for Com-
puting Machinery.
SeleniumHQ (2024). Selenium WebDriver. Testing frame-
work for web applications.
Tkachenko, M., Malyuk, M., Holmanyuk, A., and Liu-
bimov, N. (2020-2022). Label Studio: Data label-
ing software. Open source software available from
https://github.com/heartexlabs/label-studio.
Wang, C.-Y., Yeh, I.-H., and Liao, H.-Y. M. (2024).
Yolov9: Learning what you want to learn using
programmable gradient information. arXiv preprint
arXiv:2402.13616.
White, T. D., Fraser, G., and Brown, G. J. (2019). Im-
proving random gui testing with image-based wid-
get detection. In Proceedings of the 28th ACM SIG-
SOFT International Symposium on Software Testing
and Analysis, ISSTA 2019, page 307–317, New York,
NY, USA. Association for Computing Machinery.
Xiao, S., Chen, Y., Song, Y., Chen, L., Sun, L., Zhen, Y.,
Chang, Y., and Zhou, T. (2024). UI semantic com-
ponent group detection: Grouping UI elements with
similar semantics in mobile graphical user interface.
Displays, 83(102679):102679.
Yeh, T., Chang, T.-H., and Miller, R. C. (2009). Sikuli:
using gui screenshots for search and automation. In
Proceedings of the 22nd Annual ACM Symposium on
User Interface Software and Technology, UIST ’09,
page 183–192, New York, NY, USA. Association for
Computing Machinery.
Zhang, X., de Greef, L., Swearngin, A., White, S., Murray,
K., Yu, L., Shan, Q., Nichols, J., Wu, J., Fleizach, C.,
Everitt, A., and Bigham, J. P. (2021). Screen recogni-
tion: Creating accessibility metadata for mobile appli-
cations from pixels. In Proceedings of the 2021 CHI
Conference on Human Factors in Computing Systems,
CHI ’21, New York, NY, USA. Association for Com-
puting Machinery.