ture of the aggregated loss data, it calls for suitable
data analytics that can be used for processing statistical
data reporting (Xie and Lawniczak, 2018; Xie, 2019).
As a multivariate statistical approach, the conven-
tional Principal Component Analysis (PCA) is often
used to reduce the dimension of multivariate data or
to reconstruct the multidimensional data matrix using
only the selected principal components (PCs). However,
the PCA approach does not consider the functionality,
i.e., the functional relationship, between the multivariate
data and other variables (Bakshi, 1998). Applying PCA
may therefore become problematic when the
multivariate data are interconnected. From
the data visualization perspective, it could be mis-
leading if the frequency value at a given size-of-loss
interval is visualized without incorporating the
size-of-loss in the plot. In statistical data reporting, the
incurred losses are grouped into intervals. The
size-of-loss intervals are not of equal width: as the
incurred loss increases, the width of the intervals
increases dramatically. To overcome the potential mistakes
that can be caused by the visualization of loss data,
PCA is used to extract key information from the data
matrix so that the main pattern functionality between
the relative frequency and the size-of-loss can be
visualized properly. By doing so, we significantly
improve data explainability. In this work, PCA is used
for both low-rank approximation and feature
extraction, taking into account the functionality between
relative frequency values and the size-of-loss.
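The two uses of PCA named above, feature extraction and low-rank approximation, can be illustrated with a minimal sketch based on the singular value decomposition. The function name `pca_low_rank`, the choice of k = 2 and the synthetic 40 × 24 matrix are assumptions made for illustration only; they are not the data or the exact procedure of this paper.

```python
import numpy as np

def pca_low_rank(X, k):
    """Illustrative PCA via SVD: returns scores, loadings, explained
    variance ratios and the rank-k reconstruction of X."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # centre each column (one per interval)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]              # extracted features, shape (n, k)
    loadings = Vt[:k]                      # principal directions, shape (k, p)
    explained = (s**2 / np.sum(s**2))[:k]  # share of variance per retained PC
    X_hat = scores @ loadings + mean       # low-rank approximation of X
    return scores, loadings, explained, X_hat

# Synthetic stand-in for a 40 x 24 matrix of relative frequencies (NOT real data).
rng = np.random.default_rng(0)
pattern = np.exp(-np.linspace(0.0, 3.0, 24))
X = np.outer(rng.uniform(0.8, 1.2, 40), pattern) + rng.normal(0.0, 0.01, (40, 24))
X = np.abs(X)
X /= X.sum(axis=1, keepdims=True)          # each row sums to one

scores, loadings, explained, X_hat = pca_low_rank(X, k=2)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print("explained variance ratios:", np.round(explained, 4))
print("relative reconstruction error:", rel_err)
```

Here the rows of `loadings` play the role of extracted patterns across the size-of-loss intervals, while `X_hat` is the reconstruction from the selected PCs only.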
Our contribution to this research area is the use of
PCA in a novel way to extract key features of auto
insurance loss data and thereby improve data visualization
for a better decision-making process. To the best of our
knowledge, the proposed method is the first
in the literature to address the data explainability
problem of statistical data reporting in the insurance sector.
The proposed method helps to improve data
explainability and provides a better understanding of the
overall pattern of the size-of-loss relative frequency
at the industry level. In addition, feature extraction by PCA
facilitates the understanding of the variability of loss count
data, both its overall and its local behaviour, and of
the natural functionality between the frequency values
and the size-of-loss. The analysis conducted in this
work illustrates how a suitable multivariate
statistical approach to dimension reduction of
statistical data in auto insurance can achieve higher data
interpretability. This paper is organized as follows.
In Section 2, the data and its collection are briefly in-
troduced. In Section 3, the proposed methods, includ-
ing feature extraction and low-rank approximation via
PCA, are discussed. In Section 4, analysis of auto in-
surance size-of-loss data and the summary of the main
results are presented. Finally, we conclude our find-
ings and provide further remarks in Section 5.
2 DATA
In this work, we focus on the study of the size-of-loss
relative frequency of auto insurance using datasets
from the Insurance Bureau of Canada (IBC), which
is a Canadian organization responsible for insurance
data collection and statistical data reporting
in the area of property and casualty insurance.
During the data collection process, insurance
companies report the loss information, including
the number of claims, number of exposures, loss
amounts, as well as other key information such as
territories of loss, coverages, driving records associ-
ated with loss, and accident years. These statistical
data are reported regularly (i.e., weekly, biweekly or
monthly). At the end of each half-year, the total claim
amounts and claim counts reported by all insurance
companies are aggregated by territories, coverages,
accident years, etc. The statistical data reporting is
then used for insurance rate regulation to ensure the
premiums charged by insurance companies are fair
and exact. The dataset used in this work consists of
summarized claim counts by different sizes of loss,
which are represented by a set of non-overlapping in-
tervals. The claim counts are aggregated by major
coverages, i.e. Bodily Injuries (BI) and Accident Ben-
efits (AB). Also, the data were summarized by differ-
ent accident years, by different report years and by
different territories, i.e. Urban (U) and Rural (R).
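As a rough illustration of how individual losses are summarized into claim counts over non-overlapping size-of-loss intervals of unequal width, consider the following sketch. The interval edges and the simulated losses are hypothetical and are not the IBC intervals or data.

```python
import numpy as np

# Hypothetical interval edges in dollars; widths grow with the size of loss.
edges = np.array([0, 1_000, 2_000, 5_000, 10_000, 25_000, 50_000,
                  100_000, 1_000_000])
widths = np.diff(edges)

# Synthetic claim amounts (assumption: lognormal, for illustration only).
rng = np.random.default_rng(1)
losses = rng.lognormal(mean=8.5, sigma=1.2, size=5_000)
losses = np.minimum(losses, edges[-1])   # cap so every claim falls in an interval

# Claim counts per interval and the corresponding relative frequencies.
counts, _ = np.histogram(losses, bins=edges)
rel_freq = counts / counts.sum()

for i in range(len(counts)):
    lo, hi = int(edges[i]), int(edges[i + 1])
    print(f"{lo:>9,} to {hi:>9,}  width={int(widths[i]):>9,}  "
          f"count={int(counts[i]):>5}  rel_freq={rel_freq[i]:.4f}")
```

Because the widths differ by orders of magnitude, a plot of `rel_freq` against the interval index alone hides how much of the loss range each interval covers.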
To carry out the study, we organize data by cov-
erages (AB and BI) and by territories (U and R). We
consider the data from different reporting years and
accident years as repeated observations. There are
two reporting years, 2013 and 2014. For
each reporting year, there is a rolling set of the most
recent five years of data, corresponding to five accident
years. Therefore, for each combination of coverage and
territory, we have ten observations in total. Also, since we
have both Accident Benefits and Bodily Injuries as the coverage
types and Urban and Rural as the territories, we consider the
following four combinations: Accident Benefits
and Urban (ABU), Accident Benefits and Rural
(ABR), Bodily Injuries and Urban (BIU), and Bodily
Injuries and Rural (BIR). These data are then arranged
into a data matrix of dimension 40 × 24, where
40 is the total number of observations and 24 is the
number of size-of-loss intervals.
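The layout of the 40 × 24 data matrix can be sketched as follows. The row ordering, the accident-year index (0 to 4 within each reporting year), the random counts and the row-wise normalization to relative frequencies are all assumptions made for illustration; only the dimensions and the labels ABU, ABR, BIU and BIR come from the text.

```python
import numpy as np

combos = ["ABU", "ABR", "BIU", "BIR"]   # coverage and territory combinations
report_years = [2013, 2014]
n_accident_years = 5                    # rolling five accident years per report year
n_intervals = 24                        # size-of-loss intervals

# One row label per (combination, reporting year, accident-year index).
rows = [(c, ry, ay)
        for c in combos
        for ry in report_years
        for ay in range(n_accident_years)]

# Placeholder claim counts (NOT real data), one row per observation.
rng = np.random.default_rng(2)
counts = rng.integers(1, 500, size=(len(rows), n_intervals))

# Assumed normalization: relative frequency within each observation (row).
X = counts / counts.sum(axis=1, keepdims=True)

print(X.shape)                          # (40, 24)
print(rows[0], rows[-1])
```

Each combination contributes 2 × 5 = 10 rows, so the four combinations together give the 40 observations.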
DATA 2020 - 9th International Conference on Data Science, Technology and Applications