UICVD: A Computer Vision UI Dataset for Training RPA Agents
Madalina Dicu (1), Adrian Sterca (1), Camelia Chira (1) and Radu Orghidan (2)
(1) Faculty of Mathematics and Computer Science, Babeș-Bolyai University, 1 M. Kogalniceanu, Cluj-Napoca, Romania
(2) Endava Romania, 51 A. V. Voevod, Cluj-Napoca, Romania
Keywords:
UICVD Dataset, Robotic Process Automation, User Interface Recognition, User Interface Understanding,
Computer Vision.
Abstract:
This paper introduces the UICVD Dataset, a novel resource fostering advancements in Robotic Process Automation (RPA) and Computer Vision. The paper focuses on recognizing UI (User Interface) components of a web application, a task less widely studied in computer vision than recognizing real-world objects in images. The dataset derives from extensive screen captures within an enterprise application, offering a rare, in-depth look at real-world automation and interface scenarios. For RPA, the UICVD Dataset helps train the machine learning model of an RPA agent to recognize the UI components of the web application targeted by the automation process. In Computer Vision, it serves as a valuable tool for identifying and understanding user interface elements, ranging from basic icons to intricate structural details. Designed to support a wide spectrum of research and development initiatives, the UICVD Dataset is positioned as a critical asset for advances in automation and user interface recognition. Its extensive, detailed content and ease of access make it a promising resource for enhancing existing applications and inspiring innovations in RPA and Computer Vision.
1 INTRODUCTION
Robotic Process Automation (RPA) is an innovative
technology in the dynamic field of computer science
that is changing how companies optimize and man-
age their processes. This section of the article intro-
duces RPA as an engine for improving accuracy, op-
erational efficiency, and resource management in the
digital world. It further highlights the importance of
UI object identification in RPA, describing in detail
how this complex interface interaction process opens
up the path to more efficient automation and platform
connectivity. Central to this discussion is the explo-
ration of a dataset that aids in the advancement of both
RPA strategies and object detection mechanisms.
1.1 Robotic Process Automation
Robotic Process Automation (RPA) is revolutioniz-
ing business process management by automating rou-
tine tasks usually performed by humans (Institute for
Robotic Process Automation, 2015), (Hofmann et al., 2020). RPA platforms such as UiPath (https://www.uipath.com/), Automation Anywhere (https://www.automationanywhere.com/), Blue Prism (https://www.blueprism.com/), and Microsoft Power Automate (https://powerautomate.microsoft.com/en-us/) enable the automation of business processes
by interacting with software applications' User Interface (UI) components like buttons, text inputs, drop-down lists, and tables. They operate by allowing RPA developers to easily identify UI components of the target application and write code snippets that perform actions on these components, thus forming automated business processes that can be executed repeatedly with different inputs. These RPA platforms advertise themselves as requiring low or zero code, with most of the programming done graphically.
RPA refers to tools that operate on the User Interface aiming to perform automation tasks using an "outside-in" approach. The underlying information systems are kept unchanged, in contrast to traditional workflow technology, which improves them using an "inside-out" approach (van der Aalst et al., 2018).
RPA technology improves work efficiency and ac-
curacy by reducing human errors and executing tasks
more robustly. These software robots mimic human
actions like mouse clicks and data entry, interacting
with various software to ensure increased productiv-
ity and reduced costs. By being available 24/7, they
offer a continuous operational advantage over human
users. Advanced RPA tools integrate AI to enhance
capabilities beyond mere UI interaction, contributing
to standardization and business process enhancement
(Chakraborti et al., 2020; Rajawat et al., 2021).
In the realm of process automation, commercial
RPA tools also offer process mining (e.g., https://www.uipath.com/product/process-mining) and process discovery (e.g., https://www.uipath.com/product/task-capture). However, the optimization of these pro-
cesses often requires human expertise for analysis
and decision-making. Academically, there is substantial interest in optimizing Business Process Management (BPM) by automatically extracting process steps and converting them into software robot sequences. Research focuses on understanding the anatomy of tasks from natural-language process descriptions that detail the executed routines. The automatic identification of the type of performed activities (manual, human interaction, or automated) from text documents using supervised machine learning techniques was investigated in (Leopold et al., 2018). To identify the relationships between the various activities of a process, the authors of (Han et al., 2020) used long short-term memory (LSTM) recurrent neural networks to learn from process description documents (PDDs). Ito et al. (2020) propose a new grammar for complex workflows, chaining machine-executable meaning representations for semantic parsing.
The current development of RPA tools makes use
of AI advances in routine identification and automa-
tion. Still, human expertise is required to analyze how the routines are executed on the application's UI. Another research area addresses the ac-
tual automation of routines by examining the actions
performed by human users when executing their tasks
using software applications.
Tools like Robidium (Leno et al., 2020) and
SmartRPA (Agostinelli et al., 2020) represent a new
generation of automation tools. Robidium focuses on
discovering automatable tasks from UI logs and gen-
erating execution scripts, operating differently from
traditional record-and-play RPA tools. This is a Soft-
ware as a Service (SaaS) tool that implements the
robotic process mining pipeline proposed in (Leno
et al., 2021).
Robidium uses UI log files that consist of data and
events that are not related to a specific task identified
beforehand. Its architecture comprises a preprocessing step on UI logs that enables the extraction and discovery of automatable routines, which are then compiled into a UiPath script.
SmartRPA (Agostinelli et al., 2020), on the other
hand, records UI actions into a log file for rou-
tine identification and enables the generation of exe-
cutable RPA scripts, offering customization for man-
ual user inputs. The tool uses its action logger to record UI actions on the operating system, Microsoft Office applications, or web browsers (e.g., Google Chrome, Mozilla Firefox) into a log file, used as input
for routine identification. The tool allows the gener-
ation of a high-level flowchart diagram that can be
studied by expert users for potential diagnosis opera-
tions and to generate executable RPA scripts based on
the most frequent routine variant. Some input fields of
the selected routine variant can be personalized before
executing the related RPA scripts, supporting those
steps that require manual user inputs.
Furthermore, tools like Ringer (Barman et al.,
2016), a Chrome extension web replayer, and Rousillon (Chasins et al., 2018), which uses Ringer for
developing web automation scripts, show the poten-
tial for automating user actions in web environments.
These technologies record and replay user actions, fa-
cilitating complex web automation and demonstrating
the versatility and potential of RPA in enhancing busi-
ness processes. The continual advancement of RPA
tools, combined with AI and machine learning, signi-
fies a promising future for business process automa-
tion, offering unprecedented levels of efficiency and
accuracy.
1.2 Automatic Recognition of UI
Elements
RPA platforms like UiPath or PowerAutomate recognize UI controls in two ways, depending on whether the target business application is a Windows desktop application (these RPA platforms usually operate only on Microsoft Windows operating systems) or a web application. If the target application is a Windows desktop one, the RPA platform directly uses the Windows APIs, which return a hierarchical model of the UI form (very similar to the DOM, i.e., the Document Object Model, of HTML documents), to identify the components of the UI (i.e., buttons, labels, text input fields, etc.). If, on the other hand, the target application is a web application, the RPA platform uses a browser plug-in (extension) to access the DOM structure of the web pages of the target application. This browser plug-in can extract from
the DOM structure the tags corresponding to simple UI controls like <input type="button">, <button>, and <input type="checkbox">. More complex UI controls like icons, grids, tables, and lists are harder to handle, since they can be realized with a large and diverse set of HTML tags. One option is to recognize these high-level UI controls using computer vision techniques. In this paper, we deal only with the second type of target application, that is, web applications, not Windows desktop applications.
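To make the distinction concrete, the following is a minimal sketch (in Python with BeautifulSoup, purely for illustration; the agents discussed later operate on the live DOM from a JavaScript browser extension) of how simple controls map to a handful of well-known tags, while a custom grid built from generic <div> tags offers nothing to match on.

```python
# Minimal illustration (not the paper's browser extension, which is written in
# JavaScript): simple UI controls can be pulled out of the DOM by tag and
# attribute, while complex controls (grids, icon bars, tables) have no single
# tag to match, which motivates a computer-vision approach.
from bs4 import BeautifulSoup

HTML = """
<form>
  <input type="text" name="q">
  <input type="checkbox" name="agree">
  <button>Save</button>
  <div class="grid"><div class="row">...</div></div>
</form>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Simple controls map directly to well-known tags/attributes.
simple_controls = (
    soup.find_all("button")
    + soup.find_all("input", attrs={"type": "button"})
    + soup.find_all("input", attrs={"type": "checkbox"})
    + soup.find_all("input", attrs={"type": "text"})
)
for tag in simple_controls:
    print(tag.name, dict(tag.attrs))

# A custom grid built from <div> tags is invisible to this kind of matching;
# this is where a model trained on UICVD-style screenshots can help.
```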
Object detection is a fundamental task in computer vision that aims to locate and identify objects in images or videos by outlining their position and boundaries and by categorizing them. Techniques in
this domain are divided into two categories: one-stage
methods, including YOLO (Redmon et al., 2016),
SSD (Liu et al., 2016), and RetinaNet (Lin et al.,
2017), which concentrate on rapid inference, and two-
stage methods, such as Faster R-CNN (Ren et al.,
2015), Mask R-CNN (He et al., 2017), and Cascade
R-CNN (Cai and Vasconcelos, 2018), known for their
focus on enhancing detection precision (Zou et al.,
2023).
The rest of the paper is structured as follows. In
Section 2 we present research work related to ours.
Section 3 presents the details of the UI dataset pro-
posed by this paper. We outline its structure, the num-
ber of samples, and also how the dataset was con-
structed. Next, Section 4 presents possible usage
scenarios of the UICVD dataset. Section 5 presents
the architecture of a prototype RPA agent that uses the
UICVD dataset and the paper ends with conclusions
in Section 6.
2 RELATED WORK
In recent years, the detection of components in User
Interface (UI) design has garnered increasing inter-
est from researchers, leading to rapid advancements
and notable contributions in the field. A significant
dataset in this domain is the Rico (Deka et al., 2017)
dataset, which encompasses over 72,000 unique UI
screens from more than 9,700 Android applications.
It is a comprehensive combination of visual, textual,
structural, and interactive design data. The develop-
ment of the Rico dataset involved a blend of crowd-
sourcing and automation, ensuring a wide range of UI
designs.
Another well-known dataset is VINS (Bunian
et al., 2021), designed to enhance the process of lo-
cating mobile UI design examples. VINS stands out
for its comprehensive collection of UI designs, total-
ing 4,800 images, including abstract wireframes and
high-fidelity screens. It features detailed annotations
with bounding boxes covering 11 distinct UI com-
ponents, such as background images, menus, pop-up
windows, input fields, and various buttons. A sig-
nificant portion of these images is derived from the
Rico dataset. Additionally, there is the Enrico dataset
(Leiva et al., 2020), an extension of Rico that refines
10,000 UIs from Rico into 1,460 UIs across 20 design
subjects, utilizing deep learning for classification.
It is noteworthy that these datasets primarily fo-
cus on mobile interfaces, leaving a gap in datasets
addressing desktop applications. Regarding datasets
with an emphasis on desktop interfaces, we identi-
fied the Website Screenshots dataset (Dwyer, 2022).
This dataset, which is freely accessible and hosted
on the Roboflow website, consists of 2,412 screen-
shot images sourced from over 1,000 different web-
sites. The organization of this dataset is methodical:
it includes 1,688 images in the training folder, 482
in the validation folder, and 242 in the testing folder.
Each image is in JPG format with a uniform resolu-
tion of 1024 x 768 pixels, which is a common desk-
top screen resolution. Accompanying these images in
each folder is a corresponding CSV file. These files
contain detailed annotations for various objects within
the images, providing a rich source of labeled data for
analysis and model training. The dataset is annotated
across eight distinct classes: button, text, image, link,
title, field, iframe, and label. This variety of classes
enables a comprehensive analysis of common web el-
ements.
The predominance of mobile-focused datasets like
Rico, VINS, and Enrico reveals a significant gap: the
lack of datasets specifically tailored for desktop or
enterprise web applications. This gap is crucial, as
enterprise applications often have more complex in-
terfaces and functionalities than mobile applications.
The distinct features of enterprise applications - such
as sophisticated navigation patterns, dense informa-
tion displays, and complex interaction models - ne-
cessitate a specialized dataset for more effective UI
design and analysis in this area. Introducing a new
dataset focused on screenshots from enterprise appli-
cations would not only fill this void, but also allow for
the development of more nuanced and effective tools
for UI analysis and design in the broader context of
desktop and enterprise environments. This would be
a significant step forward in the field of UI compo-
nent detection, addressing the unique challenges and
requirements of enterprise application interfaces.
3 STRUCTURE OF THE DATASET
In the field of computer vision and robotic process
automation (RPA), identifying User Interface (UI) el-
ements is a crucial step in automating business pro-
cesses. Recognizing a gap in existing datasets, which
predominantly focus on mobile interfaces and are un-
suitable for the automation of desktop-based, busi-
ness web applications, we have embarked on cre-
ating a new dataset specifically tailored for desktop
web applications. This initiative aims to facilitate the
automation of business applications by providing a
dataset dedicated to desktop images, addressing a sig-
nificant gap in the current literature.
3.1 Specifications of the New Dataset
Our dataset comprises PNG images, primarily in two
resolutions: 1920x940 and 1902x922. These specific
resolutions stem from the native display settings of
the application we used to create the dataset. The ma-
jority of images are at the resolution of 1920x940,
aligning with the most commonly used screen reso-
lution in our desktop environment. This diversity in
resolution ensures that our dataset is representative of
real-world scenarios, where variations in screen sizes
and resolutions are common.
3.2 Data Source
The dataset was created based on screenshots from
an enterprise web application based on Microsoft Dy-
namics CRM and used within a multi-national soft-
ware company. We captured a diverse range of
screenshots from this application. Given the nature
of enterprise applications, where structural variation
between screens is minimal, many of our images may
appear similar. However, this similarity reflects the
real challenges in automating such applications. In
total, we gathered 121 distinct screenshots.
3.3 Data Processing
Our approach to data processing was minimalist, fo-
cusing mainly on anonymization to maintain confi-
dentiality. No other significant processing techniques
were applied to preserve the original structure of the
application as much as possible. This approach en-
sures that our dataset accurately represents the typical
environment of enterprise applications.
3.4 Data Annotation
The annotation and labeling of our dataset were car-
ried out manually, using LabelStudio (Tkachenko
et al., 2022), a versatile tool for annotating various
types of data. Given the relatively small size of our
dataset, manual annotation was deemed the most ap-
propriate approach. Figure 1a displays a sample image from the dataset, while Figure 1b presents the same image in annotated form.
Figure 1: Images from the UICVD dataset: (a) Image 24; (b) the same image with annotations.
Through extensive internal discussions and con-
sultations with the company, we established 16 dis-
tinct classes to reflect both the current and future
needs of the application. The dataset was then pro-
cessed to include only significant elements for identi-
fying UI components. Annotations are structured in a
CSV file with the following information: image id,
class, xmin, ymin, xmax, and ymax. In total, our
dataset comprises 15,123 annotations. Details of the
classes and the number of annotations per class are
presented in Table 1.
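As a small illustration of this annotation format, the sketch below loads such a CSV with pandas and reproduces the per-class counts of Table 1; the file name and exact column header spellings (image_id, class, xmin, ymin, xmax, ymax) are assumptions, since the paper specifies the fields but not the header strings.

```python
# Minimal sketch of reading the UICVD annotation CSV described above.
# The file name and column spellings are assumptions.
import pandas as pd

annotations = pd.read_csv("uicvd_annotations.csv")  # hypothetical file name

# One row per bounding box: image id, class label, and box corners.
print(annotations.columns.tolist())

# Per-class counts, which should roughly reproduce Table 1.
print(annotations["class"].value_counts())

# Example: all boxes annotated on a single screenshot.
boxes_img24 = annotations[annotations["image_id"] == 24][
    ["class", "xmin", "ymin", "xmax", "ymax"]
]
print(boxes_img24.head())
```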
4 USING THE DATASET
The UI dataset can be used in at least the following
two contexts: a) training RPA agents in recognizing
UI controls and understanding user interfaces and b)
training machine learning models for recognizing UI
components outside of the RPA context.
Table 1: Distribution of classes and annotations in the
UICVD dataset.
Class Nr. of annotations
Icon 5704
TextLabel 4298
MenuItem 1138
Row 813
Button 757
SubmenuItem 685
NavigationItem 424
DropdownItem 379
InputField 149
SectionTitle 145
TitleBar 121
Menu 121
WorkingArea 121
VerticalMenu 110
NavigationMenu 100
TableHeader 58
4.1 Training RPA Agents in
Recognizing UI Controls and
Understanding User Interfaces
For an RPA agent that automates a business web ap-
plication like a CRM (i.e., Customer Relationship
Management), ERP (i.e., Enterprise Resource Planning),
or project planning application, it is very important
to recognize various UI controls from the user inter-
face of the application. UI controls like buttons, input
text fields, drop-down lists, text areas, date controls,
grids, tables, etc. are the entry points of the human
user actions on the UI of the application. The human
user operates on these UI controls by triggering vari-
ous events like mouse clicks, text inputs, and mouse
over, so an RPA agent must also operate on these UI
controls to control the application.
For web applications, many such simple UI controls like buttons and input text fields can be easily recognized by an RPA agent that comes in the form of a browser extension. But some web applications customize these simple UI controls so much that they cannot be easily identified from the DOM structure (i.e., some web applications create custom clickable buttons or checkbox/radio buttons and do not rely on the classical HTML controls <input type="button">, <button>, <input type="checkbox">). Such controls are very hard to recognize by parsing the DOM of the HTML document and are better identified by a computer vision approach. Also, complex UI controls like lists, tables, and grids are not easily recognized directly from the DOM structure of the HTML document, as these controls are usually realized by a variety of HTML tags together with a lot of CSS code. These complex UI controls are often better recognized through a computer vision approach.
Our UI dataset is meant to be useful in training
machine learning-based RPA agents for recognizing
UI controls in the user interface of a business web
application.
4.2 Training Machine Learning Models
for Recognizing UI Components
Outside of RPA Context
Outside the RPA domain, there is still the need to rec-
ognize icons and UI controls in a web application.
Some possible scenarios where this would be useful
are the following:
- screen reading for visually impaired people,
- advanced web scraping that does not rely only on XPaths in the DOM structure,
- automatic testing of a web application.
Our UI dataset could be used in training machine
learning models that help visually impaired people
use the target web application. It could also help de-
scribe the UI of the target web application for older people not familiar with digital technology.
Another possible usage scenario of our UI dataset
is for advanced web scraping. Usually, web scrap-
ing programs navigate to a specific URL, select a specific UI element from the HTML document based on a preset XPath in the DOM structure, and return the textual content of that element.
More complicated web scraping scenarios can be sup-
ported if the scraping agent automatically recognizes
the UI interface elements present in the HTML docu-
ment loaded in the browser.
A very important and time-consuming phase of
web application development is testing. Machine learning models previously
trained on datasets of UI elements can automatically
recognize UI components and could be used to per-
form automatic testing of the target web application.
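A hedged illustration of this idea: given JSON-style detections produced by a model trained on UICVD (as in Section 5), a test can simply assert that the UI components expected on a given screen were detected. The detection format and the check_screen helper below are hypothetical.

```python
# Illustrative sketch (not from the paper): using detector output to assert
# that expected UI components are present on a page screenshot. The detection
# dicts stand in for the output of a model trained on UICVD.
def check_screen(detections, required_classes, min_score=0.7):
    """detections: [{"class": str, "score": float, "box": [x1, y1, x2, y2]}, ...]"""
    found = {d["class"] for d in detections if d["score"] >= min_score}
    missing = set(required_classes) - found
    return missing  # empty set means the screen passes the check


# Example: a CRM "edit contact" screen is expected to expose these components.
detections = [
    {"class": "Button", "score": 0.94, "box": [1700, 60, 1790, 95]},
    {"class": "InputField", "score": 0.88, "box": [300, 200, 900, 230]},
    {"class": "Menu", "score": 0.91, "box": [0, 0, 1920, 40]},
]
missing = check_screen(detections, ["Button", "InputField", "Menu", "TableHeader"])
print("Missing UI components:", missing or "none")
```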
5 PROTOTYPE RPA SYSTEM
IMPLEMENTATION USING
THE UI DATASET
The UICVD dataset is very useful for an RPA agent
that automates a target web application. While such
an RPA agent can easily recognize simple UI ele-
ments like clickable buttons and text input fields in
a user interface through DOM analysis (if the RPA
agent comes in the form of a browser extension), it will have a hard time recognizing clickable icons or complex, higher-level UI elements like grids and tables by DOM analysis alone. In this direction, a
machine learning model can help the RPA agent rec-
ognize UI elements by computer vision means. We
present in Fig. 2 the architecture of such an RPA agent
that can use the UICVD dataset to apply computer
vision techniques for recognizing and understanding
user interfaces of web applications.
The RPA agent architecture outlined in Fig. 2
comes mainly in the form of a browser extension
because the RPA agent targets/automates web appli-
cations (i.e., it needs to have access to the DOM
of the HTML documents of the web application).
The RPA agent also includes a back-end machine
learning model that resides on a remote REST web
server. The browser extension communicates with
the back-end REST web server through XHR re-
quests. The browser extension is always implemented
in JavaScript and is made of the following types of code resources: background scripts, content scripts, and a popup script.
The popup script is the UI of the extension and interacts directly with the human user. It is an HTML+CSS+JavaScript bundle that presents a user interface which assists the user in starting/ending the recording of an automated process, extracts the semantics of the UI of the target web application (i.e., extracts and highlights its components: buttons, text labels, text input fields, grids, tables, icons, etc.), and allows the automated execution of a previously recorded process.
The background scripts are the JavaScript code of the browser extension that is loaded at all times across browser tabs, and the content scripts are the JavaScript code that gets injected in the context of the document loaded in a browser tab. All these three components of
the extension communicate with each other through
message channels. The RPA agent has two main func-
tionalities: a) discovering or recording a process that
can be later automatically executed and b) executing
automatically a pre-recorded process (i.e., automated
flow). Both functionalities require code in both the
background scripts and the content scripts.
First, a human user records a process with the as-
sistance of the RPA agent (i.e., browser extension)
and later the human user can automatically execute
the pre-recorded process using the RPA agent; of
course, the execution can be parameterized. This ap-
proach is also called programming-by-demonstration
(Li et al., 2017). A recorded automated process is rep-
resented by a sequence of browsing events (usually
mouse clicks required to navigate through the vari-
ous menus of the target web application to reach the
web page that supports a specific functionality of the
target web application like adding a new user or up-
dating a resource etc.) and also business functionality
events like 'inserting text in a text field', 'selecting a value from a drop-down list', 'clicking a save button', etc. Of course, all business functionality events support parameterization (e.g., when executing the automated process, the user can specify what text value should be introduced in a specific text input field).
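One possible way to represent such a recorded process as data is sketched below; the step fields and target naming are illustrative assumptions, not the prototype's actual serialization format.

```python
# Illustrative sketch of a recorded process as a sequence of browsing and
# business-functionality events with parameterizable values. Field names and
# target naming are assumptions, not the agent's actual serialization format.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str                      # e.g., "click", "fill_text", "select"
    target: str                      # how the UI element is identified (selector or detected class)
    value: Optional[str] = None      # literal value to use, if any
    parameter: Optional[str] = None  # name of a runtime parameter, if parameterized


recorded_process = [
    Step(action="click", target="NavigationItem:Contacts"),
    Step(action="click", target="Button:New"),
    Step(action="fill_text", target="InputField:Name", parameter="contact_name"),
    Step(action="click", target="Button:Save"),
]

# At execution time, runtime parameters are substituted into the recorded steps.
params = {"contact_name": "Jane Doe"}
for step in recorded_process:
    value = params.get(step.parameter) if step.parameter else step.value
    print(step.action, step.target, value)
```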
The Executor and Process discovery/recording
components from the background scripts are respon-
sible for coordinating the automatic execution and,
respectively, the recording of a process. They are
also responsible for injecting the content scripts in
the document loaded in the browser. The components
from the content scripts perform the direct work on
the target web application. The Executor from the
content scripts triggers click events on buttons and
fills input fields in the browser so that the automated
process is executed. The Executor component from
background scripts takes each step from an automated
process representation (i.e., automated flow), instructs
the Executor component from the content scripts to
perform the respective step (e.g., a click event or a text-filling event), makes sure that the content scripts are loaded in the context of the document currently (re)loaded in the browser, and then moves to the next execution step in the automated process.
The UI understanding component in the content
scripts is the most important component of the RPA
agent and is responsible for understanding the UI of
the target web application. It needs to detect sim-
ple controls like buttons, text input fields, and click-
able icons, but also complex controls like grids, so that it can make sense of the current operation/functionality facilitated by the UI of the web application.
The RPA agent uses a dual approach for UI under-
standing: analyzing the DOM and performing a com-
puter vision analysis on a picture of the document
loaded in the browser.
To detect UI elements, the CV analyzer takes snapshot images of the document currently loaded in the browser and sends these images to the REST web server through XHR requests. The REST web server runs a machine learning model on the snapshot image to detect UI elements like buttons, icons, input text fields, text areas, checkboxes and radio buttons, drop-down lists, tables, and grids, and returns the detected UI elements as JSON objects to the CV analyzer of
the browser extension. The DOM analyzer detects simple input controls (e.g., buttons, text input fields, drop-down lists, links) directly from the DOM structure.

Figure 2: RPA agent architecture. The browser extension (background scripts with Executor and Process discovery/recording components, content scripts with an Executor and a UI Understanding component split into a DOM analyzer and a CV analyzer, and a popup script) communicates via XHR requests with a back-end REST HTTP server hosting the computer vision ML model.
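A minimal sketch of what the back-end REST web server could look like is given below, assuming Flask; the /detect route, the request field names, and the run_detector() placeholder are assumptions, not the prototype's actual API.

```python
# Minimal sketch of the back-end REST web server described above: it accepts a
# screenshot and returns detected UI elements as JSON. Route, field names, and
# the run_detector() helper are assumptions, not the prototype's actual API.
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)


def run_detector(image):
    """Placeholder for the model trained on UICVD; returns a list of detections."""
    return [{"class": "Button", "score": 0.93, "box": [1700, 60, 1790, 95]}]


@app.route("/detect", methods=["POST"])
def detect():
    # The CV analyzer in the browser extension would POST the snapshot image here.
    image = Image.open(request.files["image"].stream).convert("RGB")
    detections = run_detector(image)
    return jsonify({"detections": detections})


if __name__ == "__main__":
    app.run(port=8000)
```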
The machine learning model is developed in
Python and is designed for object detection tasks, pri-
marily utilizing the Faster R-CNN (Ren et al., 2015)
framework, though it is compatible with other Convo-
lutional Neural Network (CNN) architectures such as
YOLOv8 (Jocher et al., 2023) for varied applications.
The machine learning model was previously trained
on the UI dataset.
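A minimal sketch of such a model, assuming torchvision's Faster R-CNN implementation (the paper names the framework but does not publish this exact code): the pre-trained detector's classification head is replaced so it predicts the 16 UICVD classes plus background.

```python
# Minimal sketch, assuming torchvision's Faster R-CNN: replace the pre-trained
# detector's head to predict the 16 UICVD classes plus background.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 16 + 1  # 16 UICVD classes + background

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Training: torchvision detection models take a list of images and a list of
# target dicts ({"boxes", "labels"}) and return a dict of losses.
model.train()
images = [torch.rand(3, 940, 1920)]
targets = [{"boxes": torch.tensor([[10.0, 10.0, 120.0, 40.0]]),
            "labels": torch.tensor([5])}]
losses = model(images, targets)
print({k: float(v) for k, v in losses.items()})

# Inference: the model returns boxes, labels, and scores per image, which the
# REST server can serialize to JSON for the browser extension.
model.eval()
with torch.no_grad():
    predictions = model(images)
print(predictions[0].keys())
```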
The source code of the prototype RPA
agent presented in this section is available for
download from the following Git repository:
https://github.com/KiralyCraft/WAPlugin/.
6 CONCLUSIONS
This paper highlights the UICVD Dataset as a valu-
able resource in the fields of Robotic Process Au-
tomation (RPA) and Computer Vision alike, demon-
strating its wide applicability in both areas. It is note-
worthy that, in the specialized literature, we have not
been able to identify a similar dataset originating from
an enterprise application. The process of data collec-
tion and the structure of the dataset have been elab-
orated, emphasizing two main applications, yet the
possibilities for use are far more varied and intercon-
nected.
The UICVD Dataset was created through the cap-
ture of screenshots from an enterprise application,
providing a range of scenarios for both RPA and Com-
puter Vision tasks. In the context of RPA, the UICVD
Dataset supports building advanced RPA agents that
recognize and understand the UI of enterprise web ap-
plications. In the field of Computer Vision, the dataset
is distinguished by its ability to identify user interface
(UI) elements - ranging from simple icons and but-
tons to more sophisticated structural components of
the application, such as menus and data rows. Ac-
cess to the dataset is provided via the URL below:
https://github.com/MadaDicu/UICVD.
The paper also showcases a prototype implemen-
tation for a possible RPA agent that uses the UICVD
dataset for UI understanding. We present its software
architecture and detail its main components.
ACKNOWLEDGEMENTS
This research was funded by the ERDF through the
Competitiveness Operational Programme 2014-2020,
for the project ”Integrated System for Business Pro-
cess Automation using Artificial Intelligence” (No.
POC/163/1/3/121075), executed by ENDAVA Roma-
nia and Babeș-Bolyai University Cluj-Napoca.
REFERENCES
Agostinelli, S., Lupia, M., Marrella, A., and Mecella, M.
(2020). Automated Generation of Executable RPA
Scripts from User Interface Logs, pages 116–131.
Barman, S., Chasins, S., Bodik, R., and Gulwani, S. (2016).
Ringer: web automation by demonstration. In Pro-
ceedings of the 2016 ACM SIGPLAN international
conference on object-oriented programming, systems,
languages, and applications, pages 748–764.
Bunian, S., Li, K., Jemmali, C., Harteveld, C., Fu, Y., and
Seif El-Nasr, M. S. (2021). Vins: Visual search for
mobile user interface design. In Proceedings of the
2021 CHI Conference on Human Factors in Comput-
ing Systems, pages 1–14.
Cai, Z. and Vasconcelos, N. (2018). Cascade r-cnn: Delving
into high quality object detection. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 6154–6162.
Chakraborti, T., Isahagian, V., Khalaf, R., Khazaeni,
Y., Muthusamy, V., Rizk, Y., and Unuvar, M.
(2020). From robotic process automation to intelli-
gent process automation: Emerging trends. CoRR,
abs/2007.13257.
Chasins, S. E., Mueller, M., and Bodik, R. (2018). Rousil-
lon: Scraping distributed hierarchical web data. In
Proceedings of the 31st Annual ACM Symposium on
User Interface Software and Technology, pages 963–
975.
Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan,
D., Li, Y., Nichols, J., and Kumar, R. (2017). Rico:
A mobile app dataset for building data-driven design
applications. In Proceedings of the 30th annual ACM
symposium on user interface software and technology,
pages 845–854.
Dwyer, B. (2022). Website screenshots dataset.
https://universe.roboflow.com/roboflow-gw7yv/
website-screenshots.
Han, X., Hu, L., Dang, Y., Agarwal, S., Mei, L., Li, S., and
Zhou, X. (2020). Automatic business process struc-
ture discovery using ordered neurons LSTM: A pre-
liminary study. CoRR, abs/2001.01243.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In Proceedings of the IEEE international
conference on computer vision, pages 2961–2969.
Hofmann, P., Samp, C., and Urbach, N. (2020). Robotic
process automation. Electronic Markets, 30(1):99–
106.
Institute for Robotic Process Automation (2015). Intro-
duction to robotic process automation. a primer.
https://irpaai.com/wp-content/uploads/2015/05/
Robotic-Process-Automation-June2015.pdf.
Ito, N., Suzuki, Y., and Aizawa, A. (2020). From natural
language instructions to complex processes: Issues in
chaining trigger action rules. CoRR, abs/2001.02462.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). YOLO by
Ultralytics. https://github.com/ultralytics/ultralytics.
Accessed: December 1, 2023.
Leiva, L. A., Hota, A., and Oulasvirta, A. (2020). Enrico:
A dataset for topic modeling of mobile ui designs. In
22nd International Conference on Human-Computer
Interaction with Mobile Devices and Services, pages
1–4.
Leno, V., Deviatykh, S., Polyvyanyy, A., Rosa, M. L., Du-
mas, M., and Maggi, F. M. (2020). Robidium: Auto-
mated synthesis of robotic process automation scripts
from UI logs. In Proceedings of the Best Disserta-
tion Award, Doctoral Consortium, and Demonstra-
tion & Resources Track at BPM 2020 co-located with
the 18th International Conference on Business Pro-
cess Management (BPM 2020), Sevilla, Spain, Sept.
13-18, 2020, volume 2673, pages 102–106. CEUR-
WS.org.
Leno, V., Polyvyanyy, A., Dumas, M., Rosa, M. L.,
and Maggi, F. M. (2021). Robotic process min-
ing: Vision and challenges. Business & Information
Systems Engineering: The International Journal of
WIRTSCHAFTSINFORMATIK, 63(3):301–314.
Leopold, H., van der Aa, H., and Reijers, H. (2018). Identifying Candidate Tasks for Robotic Process Automation in Textual Process Descriptions, pages 67–81.
Li, T. J.-J., Azaria, A., and Myers, B. A. (2017). Sug-
ilite: Creating multimodal smartphone automation by
demonstration. In Proceedings of the 2017 CHI Con-
ference on Human Factors in Computing Systems,
CHI ’17, page 6038–6049, New York, NY, USA. As-
sociation for Computing Machinery.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. In
Proceedings of the IEEE international conference on
computer vision, pages 2980–2988.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Nether-
lands, October 11–14, 2016, Proceedings, Part I 14,
pages 21–37. Springer.
Rajawat, A. S., Rawat, R., Barhanpurkar, K., Shaw, R. N.,
and Ghosh, A. (2021). Chapter one - robotic process
automation with increasing productivity and improv-
ing product quality using artificial intelligence and
machine learning. pages 1–13.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Tkachenko, M., Malyuk, M., Holmanyuk, A., and Liu-
bimov, N. (2020-2022). Label Studio: Data label-
ing software. Open source software available from
https://github.com/heartexlabs/label-studio.
van der Aalst, W. M. P., Bichler, M., and Heinzl, A. (2018).
Robotic process automation. Business and Informa-
tion Systems Engineering, 60:269–272.
Zou, Z., Chen, K., Shi, Z., Guo, Y., and Ye, J. (2023). Ob-
ject detection in 20 years: A survey. Proceedings of
the IEEE.