4 DISCUSSION AND CONCLUSION
The paper reports our initial efforts in building a large
dataset of cyber-security incidents by merging a col-
lection of four publicly available datasets of different
size and provenance, overcoming the lack of publicly
available datasets of substantial size observed in pre-
vious research (Romanosky, 2016).
By analysing the resulting dataset with standard
statistical techniques, our work confirms the gener-
ally observed rapidity with which the phenomenon
of cyber-attacks is evolving. While incidents caused
by malicious outsiders rose from 16% to 50% in
a time-span of just five years, other leading causes
of data breaches such as malicious insiders and un-
intended disclosures lost most of their importance in
the same period. There may be multiple causes un-
derlying this trend. On the one hand, the decreasing
relevance of unintended disclosures and malicious in-
siders may be the result of the adoption of better se-
curity procedures and awareness programs by compa-
nies and organisations. On the other hand, remote at-
tacks are more and more widespread because of the
explosion of personal and sensitive data available on-
line resulting from the digitalisation of many aspects
of our lives. These factors seem to confirm the idea
that organisations and companies should take a holis-
tic approach and tune their cyber-security postures ac-
cording to a variety of sources about threats and coun-
termeasures including cyber-intelligence information
about current threats provided by, e.g., national or
international Computer Emergency Response Teams
(CERTs). It is thus not surprising that the forecasts
about the size of 2015 and 2016 data breaches
contained in (Edwards et al., 2016) have been only
partly borne out.
Concerning the limitations of our approach, two
issues must be considered. The first is related to the
coverage of data and is shared with previous work
(e.g., (Romanosky, 2016)). Since the four datasets
used to build ours are based on public notifications to
authorities, it is unclear whether the data are repre-
sentative of the overall phenomenon of cyber-attacks
or not. We draw this consideration from the compari-
son of two figures. In our dataset, the share of private
US companies and organisations involved in security
breaches is minuscule, namely
0.02% (or less) per year. An official report based on
a representative UK sample highlights that 67% of
medium-large firms have suffered from cyber-attacks
in 2016 (Klahr et al., 2017). The corresponding num-
ber for Italy in the same period, based on another na-
tional representative survey, is 43% (Biancotti, 2017).
We are currently gathering additional sources of
information to understand to what extent our analyses
reflect actual trends in the overall population of US
firms and organisations. The second issue to be
considered is the remarkable amount of effort
required to make the merged dataset coherent and
uniform. The result is apparently worth the effort: a
database derived from publicly available information
that is comparable in size to that used in (Romanosky,
2016), which is privately owned and contains around
15,000 descriptions of data breaches. However, we
acknowledge that the relevance of the results depends
on the quality of the generated dataset, which in turn
depends on the quality of the method used to join
the source datasets: it must be able to eliminate
redundancies and consistently map the source
categorisations into one which is general enough to
accommodate those used in the initial datasets and, at
the same time, not so coarse as to lose precision and
significance in the analysis phase. To tackle this issue,
our future efforts will be devoted to reaching a high level
of automation of the various steps of the methodol-
ogy by developing a toolkit for automatically collect-
ing, tidying, mapping, and merging datasets of cyber-
security incidents. The main benefit of developing
such a toolkit is flexibility along two dimensions.
First, it will be possible to experiment with different
taxonomies for the types of attacks and economic sec-
tors to better identify which option minimises the loss
of precision and coherence when merging different
datasets. Ultimately, this would reduce the level of
arbitrariness that our data manipulations add to that
already imposed by the publishers of the original
datasets. The
second dimension is a tighter integration with the data
analysis phase: depending on the results of the latter,
we can decide to investigate some features of the com-
ponent datasets and use the results to fine-tune some
aspects of the collection, selection, mapping, and
redundancy elimination steps. The flexibility deriving
from a high degree of automation of the methodology
will also simplify the inclusion of new datasets,
increase the size of the merged dataset, and possibly
enable the application of a wider range of data
analysis techniques.
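As a purely illustrative sketch of the tidying, mapping, and redundancy elimination steps discussed above, the following Python fragment merges records from two sources into a common categorisation. It is not the envisaged toolkit: the source names (`source_a`, `source_b`), the category labels, the target taxonomy, and the duplicate key (normalised organisation name, date, and mapped category) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical mapping from (source, source-specific label) to a common
# taxonomy. Both the labels and the target categories are assumptions.
CATEGORY_MAP = {
    ("source_a", "external attack"): "malicious outsider",
    ("source_a", "insider"): "malicious insider",
    ("source_b", "hacking or malware"): "malicious outsider",
    ("source_b", "accidental exposure"): "unintended disclosure",
}


@dataclass(frozen=True)
class Incident:
    organisation: str  # normalised organisation name
    day: date          # date of the incident
    category: str      # category in the common taxonomy
    source: str        # source that first reported the incident


def tidy_name(name: str) -> str:
    """Lower-case a name, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def map_category(source: str, label: str) -> str:
    """Map a source-specific label to the common taxonomy ('other' if unknown)."""
    return CATEGORY_MAP.get((source, label), "other")


def merge_incidents(records):
    """Merge (source, organisation, day, label) tuples, dropping duplicates.

    Two records count as the same incident when their normalised
    organisation name, date, and mapped category coincide; the record
    seen first is the one that is kept.
    """
    merged = {}
    for source, organisation, day, label in records:
        name = tidy_name(organisation)
        category = map_category(source, label)
        key = (name, day, category)
        if key not in merged:
            merged[key] = Incident(name, day, category, source)
    return list(merged.values())


if __name__ == "__main__":
    sample = [
        ("source_a", "Acme, Inc.", date(2015, 3, 1), "external attack"),
        ("source_b", "ACME  Inc", date(2015, 3, 1), "hacking or malware"),
        ("source_b", "Example Corp", date(2016, 7, 9), "accidental exposure"),
    ]
    for incident in merge_incidents(sample):
        print(incident)
```

Keeping the taxonomy in a single table such as `CATEGORY_MAP` is one way to make it easy to swap in alternative categorisations and compare how many records end up in the coarse `other` class under each of them.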
The present work has revealed some preliminary
results and interesting potentialities, but it has also
highlighted issues and limitations. This raises an im-
portant observation. As stated in Section 1, several
surveys and statistical reports are available online,
mostly from private companies. Since the issues we
reported depend only partially on our approach, it
can be argued that the reports available online suffer
from the same limitations and issues. This calls for a
deeper scientific exploration of the available data, to