(Brown et al., 2004) plays a crucial role in auto insur-
ance pricing. In Canada, DR has 7 levels correspond-
ing to how many years a driver has not been involved
in a car accident. For example, a DR of zero means
that the driver has zero accident-free years, which
may imply either a recent car accident or a new driver
with no driving history. Because of this ambiguity,
drivers with a low driving record but no accident
history may be double-penalized in their premium,
since other risk factors, such as a young driver class
or few years of holding a driver's license, already
indicate a similar level of risk. This calls for
statistical methods that can reveal potential
interactions among risk factors more accurately. From
a statistical modelling point of view, the modelling
or analysis of loss data may need to be conditioned on
the level of another risk factor. For example, the DR
pattern may depend on the level of Class (i.e., Type
of Use) or on the territory level.
To better understand the relationship between
insurance loss and the risk factors considered, we
first examine the functional pattern of DR using
generalized additive models (GAM) (Hastie, 2017; Wood,
2006). The GAM extends the generalized linear model
(GLM): the transformed mean of the response remains
additive in the predictors, but each term may be a
smooth function that can uncover a non-linear
relationship between an independent variable and the
response. GAM has recently been applied to auto
insurance pricing, particularly for modelling
telematics data (Huang and Meng, 2019; Boucher et al.,
2017; Meng et al., 2022). The GAM constructed in this
paper is then extended by adding the Class factor of
the driver and the Territory factor. The Class factor
has 14 categorical levels and the Territory factor has
2 levels, rural or urban. Class and Territory enter the
GAM as factors, producing a separate smooth term for
each level of Class and Territory. Within this GAM
modelling framework, we combine two separate models,
one with loss cost and one with premium as the response
variable, into a single model. This combination is
possible because loss cost and premium are assumed to
share the same set of predictors. Using loss cost and
premium jointly as a model response is novel in
actuarial data analysis, and it allows a better overall
estimate of risk factor relativities.
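As a rough sketch of the model structure described above, and not
necessarily the authors' exact specification, the combined GAM could be
written as follows. All notation is assumed here for illustration: $Y_i$ is
the response (a loss cost or a premium), $g$ is a link function, $I_i$ is an
indicator of whether observation $i$ is a loss cost, $c(i)$ and $t(i)$ are
the Class and Territory levels of observation $i$, and $f_c$ and $h_t$ are
level-specific smooth functions of DR.

```latex
g\bigl(\mathbb{E}[Y_i]\bigr)
  = \beta_0 + \beta_1 I_i
  + f_{c(i)}(\mathrm{DR}_i)
  + h_{t(i)}(\mathrm{DR}_i)
```

Under this reading, a GLM would correspond to the special case in which each
smooth term is replaced by a fixed coefficient per DR level.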
Furthermore, in this work, we propose GAM as an
alternative approach to estimating DR relativities,
which are often derived from GLM in current actuarial
practice (Ohlsson and Johansson, 2010). This method
can help to disentangle correlated risk factors and to
avoid the double penalty that arises in auto insurance
pricing when multiplicative pricing algorithms are
used. The functional patterns obtained from GAM lead to
a better understanding of DR characteristics and of how
they are affected by other major risk factors. The
proposed method maintains model interpretability while
retaining some of the flexibility of machine learning
approaches (Burka et al., 2021; Denuit et al., 2021),
since it provides an estimate of the non-linear
functional patterns of DR.
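The double penalty under a multiplicative pricing algorithm can be
illustrated with a minimal Python sketch. The base rate, the relativity
values, and the names `DR_RELATIVITY`, `YEARS_LICENSED_RELATIVITY`, and
`multiplicative_premium` are all hypothetical and chosen only for
illustration; they are not taken from the data or from any rating plan.

```python
# Hypothetical base rate and relativities, for illustration only.
BASE_RATE = 1000.0

# Driving record (DR): fewer accident-free years means a higher factor.
DR_RELATIVITY = {0: 1.50, 3: 1.20, 6: 1.00}

# Years licensed: a factor that overlaps with DR for new drivers.
YEARS_LICENSED_RELATIVITY = {"under_3": 1.40, "3_plus": 1.00}


def multiplicative_premium(base_rate, relativities):
    """Multiply the base rate by every relativity in the list."""
    premium = base_rate
    for factor in relativities:
        premium *= factor
    return premium


# New driver with no accidents: DR is 0 only because there is no history yet.
new_driver = multiplicative_premium(
    BASE_RATE, [DR_RELATIVITY[0], YEARS_LICENSED_RELATIVITY["under_3"]]
)

# Experienced driver with a recent accident: DR is 0, long licence history.
recent_accident = multiplicative_premium(
    BASE_RATE, [DR_RELATIVITY[0], YEARS_LICENSED_RELATIVITY["3_plus"]]
)

print(f"new driver, no accident:             {new_driver:.2f}")
print(f"experienced driver, recent accident: {recent_accident:.2f}")
```

With these assumed values, the accident-free new driver is charged more than
the experienced driver with a recent accident, because inexperience is
loaded twice: once through DR and once through years licensed.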
2 MATERIALS AND METHODS
2.1 Data
The data used in this paper comes from the Insur-
ance Bureau of Canada (IBC). The data sets con-
sist of aggregated loss costs, premiums, and expo-
sures used to calculate risk relativity for each driving
record level, class and other major risk factors. Loss
cost is defined as total losses (claim amount and ex-
penses used for settling claims) divided by the total
number of exposures. The premiums are the average
earned premiums. To systematically analyze loss cost
and premium, we define a dummy variable to indi-
cate whether the value for a particular combination of
driving record and class is the average loss cost (pure
premium) or average premium (rate). The value 1 in-
dicates that it is loss cost, and 0 means it is premium.
The response variable is denoted by LOSSPREM; each
observation is either a loss cost or a premium, as
flagged by the dummy variable. The data are also
separated by territory, with 1 indicating urban and 0
indicating rural. We focus on three major coverages:
Accident Benefit (AB), Collision (COL), and Third Party
Liability (TPL). Each coverage has three years of data,
from 2009 to 2011, and we also include a summarized
data set that combines all three years. Exposures are
used as weights for the DR and Class observations to
produce accurate confidence intervals.
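The data layout described above can be sketched in Python as follows. This
is an assumed illustration, not the authors' code: the records are invented,
and the field names `IS_LOSS`, `URBAN`, `CLASS`, and `WEIGHT`, as well as the
function `stack_loss_and_premium`, are hypothetical. Only `LOSSPREM` is a
name used in the paper.

```python
# Hypothetical aggregated cells; the numbers are invented, not IBC data.
# Each record is one combination of DR, Class, and Territory.
cells = [
    {"DR": 0, "CLASS": "01", "URBAN": 1,
     "losses": 90000.0, "exposure": 100.0, "avg_premium": 1000.0},
    {"DR": 6, "CLASS": "01", "URBAN": 1,
     "losses": 120000.0, "exposure": 300.0, "avg_premium": 500.0},
]


def stack_loss_and_premium(cells):
    """Turn each cell into two rows that share the same predictors.

    One row carries the loss cost (IS_LOSS = 1) and the other carries the
    average premium (IS_LOSS = 0); both are stored in LOSSPREM.
    """
    rows = []
    for cell in cells:
        shared = {key: cell[key] for key in ("DR", "CLASS", "URBAN")}
        shared["WEIGHT"] = cell["exposure"]  # exposure used as the weight
        # Loss cost: total losses divided by total exposures.
        loss_cost = cell["losses"] / cell["exposure"]
        rows.append({**shared, "IS_LOSS": 1, "LOSSPREM": loss_cost})
        rows.append({**shared, "IS_LOSS": 0, "LOSSPREM": cell["avg_premium"]})
    return rows


stacked = stack_loss_and_premium(cells)
for row in stacked:
    print(row)
```

Stacking the two responses this way is what allows a single model to be
fitted with the dummy variable as an additional predictor, under the
assumption that loss cost and premium share the same set of risk factors.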
2.2 Extending GLM to GAM
As mentioned earlier, the traditional approach to
estimating the risk relativity of each level of a given
risk factor is either empirical, based on the relative
level of loss costs, or model-based, using a statistical
model such as a generalized linear model with a set of
risk factors as independent variables and the loss cost
as the response. However, the empirical measures of the
relative loss cost level for each combination of fac-
DATA 2023 - 12th International Conference on Data Science, Technology and Applications