to as web harvesting, web crawling, and web mining.
While web harvesting can be used as a synonym for
web scraping, the definitions of web crawling and web
mining do not allow for an interchangeable use (Gatter-
bauer 2009, 3472; Bharanipriya and Kamakshi Prasad
2011, 211; Najork 2009, 3462).
The corresponding class of programs is referred to as web scrapers (Najork, 2009). They imitate the interaction of a human with a server and perform the same tasks a human would perform to get the information of interest, but in a shorter amount of time. In detail, they access the web page and search the underlying HTML code using regular expressions. When the information required is found, they extract it and copy it
to a pre-defined output file (Glez-Peña et al., 2013, 789 f.). The obtained information usually has to be cleaned
and checked after extraction. A web scraper can be a
library, a framework or a desktop-based environment
(Glez-Peña et al., 2013, 789 f.). The last option enables
the use of web scraping without prior knowledge of a
programming language (Glez-Peña et al., 2013, 790 f.).
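As a minimal illustration of the procedure described above, the following sketch fetches a page, searches its HTML with a regular expression, and copies the matches to a pre-defined output file. The URL, the pattern, and the file name are hypothetical placeholders; Python is used here purely for illustration.

```python
# Minimal web scraper sketch: fetch a page, extract information with a
# regular expression, and copy it to a pre-defined output file.
# The URL and the pattern are hypothetical placeholders.
import re
import urllib.request

URL = "https://www.example.com/obituaries"
# Hypothetical markup: names wrapped in <span class="name"> tags.
PATTERN = re.compile(r'<span class="name">(.*?)</span>')

with urllib.request.urlopen(URL) as response:
    html = response.read().decode("utf-8", errors="replace")

# Extract all matches and apply a simple cleaning step (trim whitespace).
records = [match.strip() for match in PATTERN.findall(html)]

# Copy the extracted information to the output file, one record per line.
with open("output.txt", "w", encoding="utf-8") as outfile:
    outfile.write("\n".join(records))
```

The .strip() call stands in for the cleaning step mentioned above; real extractions typically require more thorough cleaning and plausibility checks.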
Web scraping has numerous advantages and disadvantages. Advantages are time efficiency, the possibility of regular data collection at short intervals, reduced costs, the avoidance of response burden on survey respondents, the absence of survey response effects such as social desirability bias, and an enhancement of the quality of statistics and of the amount of information (Landers et al. 2016, 2; Hoekstra et al. 2010, 6 f., 15).
The disadvantages are the programming skills needed to write a web scraper, changes to the underlying HTML code of websites, the lack of automatic plausibility checks when extracting data, and ethical and legal concerns (Hoekstra et al., 2010, 7 f.). Furthermore, there is a chance that the web scraper overloads the servers or increases the costs for the owner of the website by requiring a larger bandwidth (Koster 1993b; Thelwall and Stuart 2006, 1776). Given the increasing bandwidth and capacity of web servers, this objection seems negligible. The remaining problems are mostly ethical.
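Nevertheless, a scraper can keep the server load low by pausing between consecutive requests; a minimal sketch (the URLs and the one-second delay are illustrative choices):

```python
# Throttled fetching: pause between requests so the scraper does not
# overload the server. The URLs and the delay are illustrative.
import time
import urllib.request

urls = [
    "https://www.example.com/page1",  # hypothetical pages to scrape
    "https://www.example.com/page2",
]

pages = []
for url in urls:
    with urllib.request.urlopen(url) as response:
        pages.append(response.read())
    time.sleep(1.0)  # wait one second before the next request
```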
2.1 Ethical and Legal Concerns
When information is extracted from websites using a web scraper, it is obtained without the knowledge and consent of the individuals involved, such as the person whose information is retrieved, the provider of the website, and, in the case of death notices, the relatives of the deceased person (van Wel and Royakkers, 2004, 129). Although a website's robots.txt file, which regulates access by robots, can be used to stop web scrapers from accessing the information presented on the website (Koster 1993a; Thelwall and Stuart 2006, 1775), it should be noted that a robots.txt can be bypassed, so complying with it is voluntary. In addition, the ethical and legal rules that apply to web scraping have to be followed.
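Since honoring robots.txt is voluntary, a scraper that respects it has to check it explicitly before accessing a page; a minimal sketch using Python's standard library (the site URL and the user-agent name are hypothetical):

```python
# Check robots.txt before scraping; the site URL and the user-agent
# name are hypothetical placeholders.
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

page = "https://www.example.com/obituaries"
if robots.can_fetch("ResearchScraper", page):
    print("robots.txt permits access to", page)
else:
    print("robots.txt disallows access to", page)
```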
Regarding the legal concerns, national as well as European laws have to be considered (in the case of Europe). For Germany, the copyright law and the European directive 96/9/EC are of interest. Both, as well as various court decisions over the years, show that web scraping is legal for scientific purposes if the extracted information is not made publicly available and is not commercialized. Further, the information has to be freely accessible, without the need for registration. Even if the terms and conditions of a website prohibit the use of a web scraper, they only apply if they have been accepted, for instance upon registration.
Since personal information is extracted, data pro-
tection laws have to be considered as well. In Europe,
a deceased person is, by definition, not covered by
the data protection law. Nevertheless, postmortem per-
sonal rights concerning the dignity of the deceased
person still apply. However, for scientific purposes, the
extraction of personal information from data sets is permitted, as long as these data do not include information about living persons (Löwer, 2010, 33). In the case of death notices, this refers to the names (and addresses) of the relatives mentioned.
2.2 Online Newspaper Death Notices
Since the 18th century, deaths have been announced publicly. The first death notice in Germany was published in 1753, but the regular publication of death notices was not established until the 19th century (Grümer and Helmrich, 1994, 69). The structure of a death notice is nowadays more or less standardized, as can be seen in figure 1.
Traditionally, death notices are published in printed newspapers. Increasingly, in many countries, they are also published in the online obituary sections of newspapers.
Online death notices have been used as a data source
for research purposes in the US. For example, Boak
et al. (2007) used death notices from the Pittsburgh
Post-Gazette to monitor mortality. When comparing their extracted death notices to administrative records, they were able to find death notices for 73.5% of the registered deceased persons, while 31.4% could be found in registers of surrounding cities and 8.96% could not be detected (Boak et al., 2007, 534). Similar results were reported by Soowamber et al. (2016, 167), who used death notices to detect panel mortality. Although they did not report the proportion of detected deaths, they considered death notices a valid and reliable data source (Soowamber et al., 2016, 167).