to be applied before clustering. We assume that X_i has been normalized. Specifically, in our case (i.e., when d = 3), µ_i = (µ_i1, µ_i2, µ_i3)^⊤ corresponds to the mean value of the ith center of the designed territory, and X_j = (x_j1, x_j2, x_j3)^⊤ is the vector consisting of the standardized loss cost x_j1, latitude x_j2 and longitude x_j3 of the jth FSA, and w_d is the weight applied to the dth dimension of the data variable. In this work, without loss of generality, we take w_2 = w_3 = 1 and allow w_1 to take different values. The idea is to use w_1 as a relativity measure between loss cost and geographical location. When w_1 = 1, the loss cost is deemed as important as the geographical information; when w_1 takes a value greater (less) than 1, the loss cost is more (less) important than the geographical information in the clustering.
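A minimal sketch of this weighted clustering step, assuming scikit-learn. The column order (loss cost, latitude, longitude), the synthetic data, and the value w1 = 2 are illustrative choices, not the authors' implementation; the key idea is that scaling each standardized column by sqrt(w_d) makes plain Euclidean K-means equivalent to K-means under the weighted distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # columns: loss cost, latitude, longitude

Xs = StandardScaler().fit_transform(X)  # standardize each dimension

w1 = 2.0                                # relativity weight on loss cost
w = np.array([w1, 1.0, 1.0])            # w2 = w3 = 1
# sum_d w_d (x_d - mu_d)^2 = sum_d (sqrt(w_d) x_d - sqrt(w_d) mu_d)^2,
# so scaling columns by sqrt(w_d) reproduces the weighted distance.
Xw = Xs * np.sqrt(w)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Xw)
print(labels.shape)                     # one territory label per FSA
```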
One can also use K-medoids clustering instead of K-means. The major difference between these two approaches lies in how the center of each cluster is estimated. K-means clustering determines each cluster's center as the arithmetic mean of each data characteristic, while K-medoids clustering uses an actual data point in a given cluster as the center. For our clustering problem, the choice of clustering method makes no essential difference, as we aim for grouping only. Similarly, hierarchical clustering, which seeks to build a hierarchy of clusters, can also be considered.
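To make the contrast concrete, the sketch below computes a medoid by hand for one group of points and runs a hierarchical (Ward) clustering via SciPy. The data are synthetic and the four-cluster cut is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))

# Medoid of a cluster: the actual data point minimizing the total distance
# to all other points, as opposed to the arithmetic mean used by K-means.
D = cdist(X, X)
medoid_idx = int(D.sum(axis=1).argmin())
medoid = X[medoid_idx]                 # a real observation, not an average

# Hierarchical clustering: build the full Ward linkage tree,
# then cut it into 4 groups.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")
print(medoid_idx, np.unique(labels))
```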
2.2 Spatially Constrained Clustering
K-means or K-medoids clustering does not necessarily lead to results that satisfy the cluster contiguity requirement. In this case, spatially constrained clustering is needed, as all clusters are required to be spatially contiguous. We start from an initial clustering and assume that each cluster from it contains only a few non-contiguous points, so we only need to re-allocate these points after the initial clustering. To re-allocate these non-contiguous points, we first identify them and then re-allocate each to the closest (minimal-distance) point within a contiguous cluster. To implement this allocation of non-contiguous points, we propose an approach based on Delaunay triangulation (Recchia, 2010; Renka, 1996). In mathematics, a Delaunay triangulation for a set P of points in a plane is a triangulation, denoted by DT(P), such that no point in P is inside the circumcircle of any triangle in DT(P). If a cluster P is in DT(P) and DT(P) forms a convex hull (Preparata and Hong, 1977), the clustering satisfies the contiguity constraint. To construct a DT, we propose the following procedure:
1. We first perform K-means clustering as an initial clustering.
2. Based on the obtained clustering results from the
previous step, we find all points that are entirely
surrounded by points from other clusters.
3. For each point that has no neighbors in the same cluster, we find its neighboring point at minimal distance. We call the cluster of that neighbor the new cluster.

4. The points that have no same-cluster neighbors are then re-allocated to their new clusters.
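The four steps above can be sketched as follows, assuming SciPy for the Delaunay triangulation and scikit-learn for the initial K-means. The synthetic coordinates, the synthetic loss cost, and the choice of four clusters are illustrative assumptions, not the authors' data.

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
coords = rng.uniform(size=(100, 2))            # (latitude, longitude)
loss = rng.normal(size=(100, 1))               # synthetic standardized loss cost

# Step 1: initial K-means on (loss cost, latitude, longitude).
labels = KMeans(n_clusters=4, n_init=10,
                random_state=0).fit_predict(np.hstack([loss, coords]))

# Neighbor sets from the Delaunay triangulation of the geographic plane.
tri = Delaunay(coords)
indptr, indices = tri.vertex_neighbor_vertices
neighbors = [set(indices[indptr[i]:indptr[i + 1]]) for i in range(len(coords))]

changed = True
while changed:                                 # iterate until no isolated point
    changed = False
    for i, nbrs in enumerate(neighbors):
        # Step 2: point i is entirely surrounded by other clusters.
        if all(labels[j] != labels[i] for j in nbrs):
            # Steps 3-4: re-allocate i to the cluster of its nearest neighbor.
            nearest = min(nbrs, key=lambda j: np.linalg.norm(coords[i] - coords[j]))
            labels[i] = labels[nearest]
            changed = True
```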
It is possible that the re-allocated points may still be isolated; thus, this entire routine should be iterated until no such isolated point exists. Note that this implementation is purely based on the algorithm we develop, and the boundary created for each cluster often does not correspond to the geographical boundary of each basic rating unit. However, based on these results, one should be able to refine the clusters further to ensure that the boundary of each cluster is determined by the boundaries of FSAs.
2.3 Choice of the Number of Clusters
In data clustering, the number of clusters needs to be
determined first. In this work, the number of clusters represents the number of territories. Finding the optimal number of clusters becomes especially challenging in high-dimensional scenarios, where visualization of the data is difficult. To be statistically sound, several methods, including the average silhouette (Rousseeuw, 1987) and the gap statistic (R. Tibshirani and Hastie, 2001), have been proposed for estimating the number of clusters. The silhouette width of an observation i is defined as
s(i) = (b(i) − a(i)) / max{a(i), b(i)},    (3)
where a(i) is the average distance between i and all other observations in the same cluster, and b(i) is the minimum average distance from i to the observations in any other cluster. Observations with large s(i) (close to 1) are well clustered, observations with small s(i) (around 0) tend to lie between two clusters, and observations with negative s(i) are probably placed in the wrong cluster.
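The silhouette width in Eq. (3) is implemented directly by scikit-learn's silhouette_samples; the sketch below checks it on two well-separated synthetic groups, where every s(i) should be close to 1. The data and the two-cluster setting are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(3)
# Two well-separated synthetic groups in the plane.
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)   # s(i) = (b(i) - a(i)) / max{a(i), b(i)}
avg = silhouette_score(X, labels)   # overall average silhouette width
print(round(avg, 2))
```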
Varying the total number of clusters from 1 to a maximum total number of clusters K_max, the observed data can be clustered using any algorithm, including K-means. Next, the average silhouette can be used to estimate the number of clusters. For a given number of clusters K, the overall average silhouette width for
ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods