version is necessary. However, if the model generation is as simple as clicking on an element on the website, we can let the user update the model. In particular, if the model is transferable, such an updated model can be shared among all users of a software, and some users might not even notice that the website changed.
A third use case that benefits from the proposed selector is supervised machine learning. To train models, large data sets need to be prepared and labeled or categorized. Since the data sets are often large, the combined work of many people is required. In this case, it is an advantage if these people do not need to be programming experts capable of creating regular expressions or XPath definitions. This enables the utilization of crowd-working platforms such as Amazon's Mechanical Turk. Our approach allows the identification of complex selection patterns based on a user's mouse clicks, which makes it particularly accessible to untrained personnel.
The general idea of our approach is to weight different paths in the website's DOM tree depending on the user's inputs. The inputs are evaluated to obtain a model that can be used to assess whether a given path is likely to be a selected one. The novelty of our approach is the combination of path scoring with discrete periodicity analysis to identify specific patterns selected by a user, for instance, every second or third row of a table, or two consecutive rows followed by one row that is not selected. Such properties can often be easily expressed in scripting-based selectors such as XPath, but are difficult or elaborate to obtain from click events alone.
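To make this concrete, consider how such periodic patterns look in XPath. The following sketch (our own illustration using Python with lxml; neither the markup nor the expressions are taken from this paper) selects the "every second row" and "two of three rows" patterns via position() arithmetic:

    from lxml import html

    # Our own toy document: a table with four rows.
    doc = html.fromstring(
        "<table>"
        "<tr><td>row 1</td></tr><tr><td>row 2</td></tr>"
        "<tr><td>row 3</td></tr><tr><td>row 4</td></tr>"
        "</table>")

    # Every second row (positions 2, 4, ...).
    every_second = doc.xpath("//tr[position() mod 2 = 0]")

    # Two consecutive rows followed by one unselected row
    # (positions 1, 2, 4, 5, ...).
    two_of_three = doc.xpath("//tr[position() mod 3 != 0]")

    print([r.text_content() for r in every_second])  # ['row 2', 'row 4']
    print([r.text_content() for r in two_of_three])  # ['row 1', 'row 2', 'row 4']

Recovering the same selections purely from click events requires recognizing the period in the clicked positions, which is what the periodicity analysis in Section 3.1 addresses.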
The paper is structured as follows. In Section 2, we review related work that discusses web scraping and technical approaches to extracting specific information. In Section 3, we introduce the path scoring approach, which is based on analyzing CSS classes and HTML tag names mapped to states of the DOM tree. We introduce periodicity analysis in Section 3.1 and a bisimulation relation to assess the similarity of subtrees in Section 3.2, which further improves the selection results of our approach. We evaluate the model sizes and the time complexity of our approach in Section 4 and also discuss experimental results. The results of this paper and potential future work are discussed in Section 5.
2 RELATED WORK
Glez-Peña et al. differentiate three classes of web scraping approaches: “(i) libraries for general purpose programming languages, (ii) frameworks and (iii) desktop-based environments” (Glez-Peña et al., 2014). The first category supports the utilization of programming languages such as Perl or Python to extract information from websites. Glez-Peña et al. also count command-line tools such as curl or wget to this category, which softens the clear separation of the three categories. The second category comprises frameworks such as Scrapy or jARVEST and demarcates itself from the other categories by containing approaches that are a “more integrative solution” (Glez-Peña et al., 2014). The final category enables laypersons to perform web scraping.
This categorization reflects the two fundamental paradigms: on the one hand, approaches that in practice can only be used by trained personnel, since they require programming, and on the other hand, approaches that support users with elaborate interfaces. However, the categorization appears partially inconsistent in the sense that there are overlaps between the categories (some frameworks, for instance Scrapy, can also be used as a library), and we further argue that tools such as curl or wget are independent programs. It is possible to call these programs from within program code, but this stretches the concept of program code libraries.
A different family of approaches to extracting information from web pages is based on semantic information: DOM elements are enriched with ontological information. The World Wide Web Consortium (W3C, https://www.w3.org/) proposes a language called the Web Ontology Language (OWL), which provides different expressions to define relationships and affiliations between content. Lamy (Lamy, 2017) presents how this additional information can be used to automatically classify data. Fernández et al. present a semantic web scraping approach based on resource description framework (RDF) models, which are used to map DOM elements to “semantic web resources” (Fernández Villamor et al., 2011). One problem of these approaches is that semantic annotations are often not available for websites.
Mahto et al. discuss in their paper the ethics of web scraping and provide web scraping examples written in Python with BeautifulSoup (Mahto and Singh, 2016), which Glez-Peña et al. categorize as a library (Glez-Peña et al., 2014).
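To illustrate the library category, the following is a minimal BeautifulSoup sketch (our own example, not taken from the cited papers); note that the selection logic still has to be written by hand as a CSS selector, in contrast to the click-based selection proposed here:

    from bs4 import BeautifulSoup

    # Our own toy document with two classes of rows.
    html_doc = ("<table>"
                "<tr class='odd'><td>a</td></tr>"
                "<tr class='even'><td>b</td></tr>"
                "</table>")
    soup = BeautifulSoup(html_doc, "html.parser")

    # A hand-written CSS selector, typical for library-based scraping.
    for row in soup.select("tr.even"):
        print(row.get_text())  # prints: b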
Krosnick and Oney (Krosnick and Oney, 2021) discuss the potential use of scripting-based scraping techniques for web automation and end-to-end user interface testing. They conduct two studies to assess the typical challenges that users face when they implement web automation scripts. In the first study, all users needed to develop selection scripts in a simple editor,