Insurance Risk Segmentation

Mar 2026 · 7 min read · MSc AI, Kristiania University College — Smart Analysis and Decision Making

Python scikit-learn K-Means PCA Hierarchical Clustering pandas matplotlib SciPy

Insurance pricing is fundamentally a risk estimation problem. A policyholder who makes many claims costs the insurer money; one who never claims is profitable. The challenge is that you cannot know in advance which type of customer you are dealing with — you can only look at observable characteristics and infer risk.

This project used the freMTPL2freq dataset — a standard actuarial benchmark containing 678,013 French motor third-party liability insurance policies — to ask: do policyholders naturally group into distinct risk profiles based on their characteristics, and do those groupings predict claim behaviour?

The key constraint was that this had to be done without using claim counts as an input feature. Segmentation had to emerge from policyholder attributes alone, with claims used only for post-hoc validation.

The dataset has 12 columns per policy, including the policy ID, claim count and exposure. I selected five numerical features for clustering: vehicle power (VehPower), vehicle age (VehAge), driver age (DrivAge), bonus-malus coefficient (BonusMalus, the French no-claims rating in which values below 100 are a bonus and values above 100 a malus), and population density (Density). Density was log-transformed to compress its heavy right skew, then all five features were standardised with StandardScaler.
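A minimal sketch of this preprocessing step, assuming the policies are already in a pandas DataFrame with the freMTPL2freq column names. The small `policies` frame is made-up illustration data, the `preprocess` helper is a hypothetical name, and the choice of the natural log is an assumption (the text above only says "log-transformed").

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ["VehPower", "VehAge", "DrivAge", "BonusMalus", "Density"]

# Made-up rows for illustration only; the real data has 678,013 policies.
policies = pd.DataFrame({
    "VehPower":   [5, 6, 7, 4, 9],
    "VehAge":     [0, 2, 10, 5, 1],
    "DrivAge":    [55, 31, 46, 23, 68],
    "BonusMalus": [50, 85, 50, 100, 50],
    "Density":    [1217, 27000, 54, 3317, 12],
})

def preprocess(df: pd.DataFrame) -> np.ndarray:
    """Log-transform Density, then z-score all five clustering features."""
    X = df[FEATURES].astype(float).copy()
    # Assumption: natural log, which requires Density >= 1.
    X["Density"] = np.log(X["Density"])
    return StandardScaler().fit_transform(X)

X_scaled = preprocess(policies)
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means relies on.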

I evaluated three approaches, four configurations in total: K-Means on the standardised 5D feature space, K-Means on PCA-reduced representations (3 and 4 principal components), and hierarchical clustering with Ward linkage on a 2,000-point subsample. The subsample was necessary because hierarchical clustering needs O(n²) memory for the pairwise distances, which is infeasible on the full 678k rows.
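To make the memory constraint concrete, the sketch below estimates the size of the pairwise distance array and then runs Ward linkage on a 2,000-point subsample. It is an illustration under assumptions: the feature matrix is random stand-in data, the helper `condensed_distance_gb` is hypothetical, and the random subsampling scheme and seed are not taken from the project.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def condensed_distance_gb(n: int, bytes_per_value: int = 8) -> float:
    """Size in GB of the n*(n-1)/2 pairwise distances stored as float64."""
    return n * (n - 1) // 2 * bytes_per_value / 1e9

full_gb = condensed_distance_gb(678_013)   # about 1.8 TB for the full dataset
sample_gb = condensed_distance_gb(2_000)   # about 16 MB for the subsample

# Random stand-in for the standardised feature matrix (not the real policies).
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(20_000, 5))

# Assumption: a uniform random subsample drawn without replacement.
idx = rng.choice(len(X_scaled), size=2_000, replace=False)

# Build the Ward tree on the subsample and cut it into at most 3 clusters.
Z = linkage(X_scaled[idx], method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
```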

The number of clusters was selected through two independent methods — the elbow method on within-cluster sum of squares, and silhouette analysis — both of which pointed to K=3 as optimal.
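The two selection methods can be sketched as a single loop over candidate values of K. This is an illustration on synthetic blobs generated with `make_blobs`, not the insurance data, and the candidate range 2 to 6 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with three well-separated groups (not the insurance data).
X, _ = make_blobs(
    n_samples=600,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X, km.labels_)

# Silhouette picks the K with the highest mean score; the elbow is read off `inertias`.
best_k = max(silhouettes, key=silhouettes.get)
```

The silhouette computation is quadratic in the number of points, so on a dataset of this size the `sample_size` argument of `silhouette_score` is one way to keep it tractable. Whether the project sampled for this step is not stated above.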

PCA analysis: Five principal components were computed. The variance decomposition showed no sharp elbow: PC1 explained 30.6%, PC2 22.1%, PC3 19.9%, PC4 17.3% and PC5 10.2%, so the feature space has no dominant axis of variation. PCA-3 (the first three components, retaining about 72.6% of the variance) provided the best trade-off between dimensionality reduction and information retention for downstream clustering.
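The retained variance for each PCA configuration follows directly from the reported percentages. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
import numpy as np

# Explained-variance percentages as reported above (rounded, so they sum to 100.1).
explained = np.array([30.6, 22.1, 19.9, 17.3, 10.2])
cumulative = np.cumsum(explained)

for i, (pct, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {pct:5.1f}%  cumulative {cum:5.1f}%")

retained_pca3 = cumulative[2]   # variance kept by the first three components
retained_pca4 = cumulative[3]   # variance kept by the first four components
```

PCA-3 keeps roughly 72.6% of the variance and PCA-4 roughly 89.9%, so the step from three to four components adds the 17.3% carried by PC4.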

Clustering comparison: Four configurations were benchmarked using two internal validity indices (a higher silhouette is better, a lower Davies-Bouldin is better):

  • K-Means on original 5D space: Silhouette 0.197, Davies-Bouldin 1.688
  • K-Means on PCA-3: Silhouette 0.260 (best), Davies-Bouldin 1.318 (best)
  • K-Means on PCA-4: Silhouette 0.212, Davies-Bouldin 1.580
  • Hierarchical / Ward: Silhouette 0.219, Davies-Bouldin 1.551

K-Means on PCA-3 was the clear winner on both indices. Reducing dimensionality before clustering discarded the two lowest-variance components and improved cluster separation.
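A benchmark of this kind might be set up as follows. This is a sketch on random stand-in data, so the scores it produces are meaningless; the helper `score_kmeans` is hypothetical, and scoring each configuration in the space it was clustered in is an assumption about how the reported numbers were computed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Random stand-in for the standardised 5D feature matrix.
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(1_500, 5))

def score_kmeans(X: np.ndarray, k: int = 3, seed: int = 0) -> dict:
    """Fit K-Means and return both internal validity indices."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

configs = {
    "5D": X_scaled,
    "PCA-3": PCA(n_components=3, random_state=0).fit_transform(X_scaled),
    "PCA-4": PCA(n_components=4, random_state=0).fit_transform(X_scaled),
}
results = {name: score_kmeans(X) for name, X in configs.items()}
```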

Validation: Kruskal-Wallis tests confirmed that claim frequency differences across the three clusters were highly statistically significant (p ≪ 0.001), meaning the segments weren't artefacts of the algorithm — they tracked real differences in policyholder behaviour.
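The validation step can be sketched with `scipy.stats.kruskal`. The claim data below is simulated: Poisson draws whose means are set to the three reported cluster frequencies, which is an assumption made purely so the example runs without the real policies.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Simulated per-policy claim values, one array per cluster.
# Assumption: Poisson draws with the cluster frequencies reported in this write-up.
rates = {0: 0.190, 1: 0.351, 2: 0.289}
groups = [rng.poisson(lam=rate, size=5_000) for rate in rates.values()]

# Kruskal-Wallis is rank-based, so it does not assume normally distributed claims.
stat, p_value = kruskal(*groups)
significant = p_value < 0.001
```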

Policies: 678,013 · Best silhouette: 0.260 · Optimal K: 3

The three clusters told a coherent actuarial story:

Cluster 0 — Rural Mature Low-Risk (46.9%, 318,130 policies): Average driver age 50.8, bonus-malus 52.3, low population density (2,132). Claim frequency 0.190 — the lowest of the three groups. Older, experienced drivers in lower-density areas pose the least risk.

Cluster 1 — Young Urban High-Risk (22.6%, 153,563 policies): Average driver age 31.6, bonus-malus 84.2, urban density (2,451). Claim frequency 0.351 — nearly double Cluster 0. Young drivers in cities with elevated bonus-malus scores are the highest-risk segment.

Cluster 2 — Urban Mature Moderate-Risk (30.4%, 206,320 policies): Average driver age 47.6, bonus-malus 53.1, lower density (778). Claim frequency 0.289 — intermediate. Mature drivers but in a different geographic and risk profile from Cluster 0.

These three profiles map cleanly onto risk tiers an actuary would recognise. The segmentation emerged purely from unsupervised learning — no claim data was used as input.
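Profiles like these can be produced with a single pandas `groupby` once each policy carries a cluster label. The rows below are made up for illustration, the `cluster` column name is hypothetical, and computing claim frequency as total claims divided by total exposure is an assumption (the definition used for the figures above is not stated).

```python
import pandas as pd

# Made-up labelled policies for illustration only.
df = pd.DataFrame({
    "cluster":    [0, 0, 1, 1, 2, 2],
    "DrivAge":    [52, 49, 30, 33, 47, 48],
    "BonusMalus": [50, 54, 80, 90, 52, 54],
    "Density":    [40, 60, 3000, 5000, 800, 700],
    "ClaimNb":    [0, 0, 1, 0, 0, 1],
    "Exposure":   [1.0, 0.5, 1.0, 0.5, 1.0, 1.0],
})

profile = df.groupby("cluster").agg(
    policies=("DrivAge", "size"),
    mean_driver_age=("DrivAge", "mean"),
    mean_bonus_malus=("BonusMalus", "mean"),
    mean_density=("Density", "mean"),
    claims=("ClaimNb", "sum"),
    exposure=("Exposure", "sum"),
)
profile["share_pct"] = 100 * profile["policies"] / profile["policies"].sum()
# Assumption: exposure-weighted frequency (claims per policy-year).
profile["claim_frequency"] = profile["claims"] / profile["exposure"]
```

Claim columns appear only in this summary step, after the labels are fixed, which keeps them out of the clustering inputs.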

The most practically useful insight was that PCA before K-Means is not just a trick to speed up computation. Here it materially improved cluster quality: dropping the two lowest-variance components lifted the silhouette from 0.197 to 0.260 and lowered Davies-Bouldin from 1.688 to 1.318. The gap between PCA-3 and the 5D clustering was larger than I expected.

I also learned to be more careful about what "interpretable" means in unsupervised settings. Internal indices (silhouette, Davies-Bouldin) tell you how tight and well-separated clusters are, but they say nothing about whether those clusters are meaningful. The Kruskal-Wallis validation step was essential to confirm that the algorithm had found something real, not just a tidy mathematical partition.

The hierarchical clustering experiment was humbling. Even on a 2,000-point subsample, it produced validity scores comparable to K-Means on the full dataset (silhouette 0.219, Davies-Bouldin 1.551), suggesting that when the cluster structure is stable you often don't need all the data to find it. But its O(n²) memory scaling makes it impractical for production insurance datasets, where K-Means scales linearly in the number of policies per iteration and wins on deployment feasibility.