Claims Prediction with BQML

Did you know you could develop a machine learning model using SQL? Google’s cloud data warehouse, BigQuery, supports machine learning through SQL extensions like CREATE MODEL, by analogy with the SQL DDL statement CREATE TABLE.

If you’re like me, you’re probably thinking, “why on Earth would I ever use SQL for machine learning?” Google’s argument is that a lot of data people are handy with SQL, not so much with Python, and the data is already sitting in a SQL-based warehouse.

BigQuery ML features all the popular model types, from classifiers to matrix factorization, plus an automated model picker called AutoML. There’s also the advantage of cloud ML in general, which is that you don’t have to build a special rig (I built two) for GPU support.

In this article, I am going to work a simple insurance problem using BQML. My plan is to provide an overview that will engage both the Python people and the SQL people, so that both camps will get better results from their data warehouse. The workflow has three steps:

Ingest data via Google Cloud Storage
Transform and model in BigQuery
Access the results from a Vertex AI notebook

By the way, I have placed much of the code in a public repo. I love grabbing up code samples from Analytics Vidhya and Towards Data Science, so this is my way of giving back.

Case Study: French Motor Third-Party Liability Claims

We’re going to use the French car insurance data from Wüthrich, et al., 2020. They focus on minimizing the loss function (regression loss, not insurance loss) and show that decision trees outperform linear models because they capture interaction among the variables.

There are a few ways to handle this problem. While Wüthrich treats it as a straightforward regression problem, Lorentzen, et al. use a composition of two linear models, one for claim frequency and a second for claim severity. As we shall see, this approach follows the structure of the data.

Lorentzen, et al. focus on the Gini index as a measure of fitness. This is supported by Frees, and also by the Allstate challenge, although it does reduce the problem to a ranking exercise. We are going to follow the example of Dal Pozzolo, and train a classifier to deal with the imbalance issue.

Ingesting BigQuery Data via Google Cloud Storage

First, create a bucket in GCS and upload the two CSV files, which are mirrored in various places. Next, in BigQuery, create a dataset with two tables, Frequency and Severity. Finally, execute this BQ LOAD script from the Cloud Shell, once per table (the Frequency load is shown):

bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
french-cars:french_mtpl.Frequency \
gs://french_mtpl2/freMTPL2freq.csv

The last two lines are syntax for the table and the GCS bucket/file, respectively. Autodetect works fine for the data types, although I’d rather have NUMERIC for Exposure. I have included JSON schemas in the repo.

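It’s the most natural thing in the world to specify data types in JSON, storing this schema in the bucket with the data, but BQ LOAD won’t read a schema from the bucket! It expects the schema as a local JSON file (or an inline column list), supplied in place of --autodetect, so copy the file to the Cloud Shell first. Alternatively, create and load the table manually in the browser console.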

Wüthrich specifies a number of clip levels, and Lorentzen implements them in Python. I used SQL. This is where we feel good about working in a data warehouse. We have to JOIN the Severity data and GROUP BY multiple claims per policy, and SQL is the right tool for the job.

BEGIN
SET @@dataset_id = 'french_mtpl'; 
 
DROP TABLE IF EXISTS Combined;
CREATE TABLE Combined AS
SELECT F.IDpol, ClaimNb, Exposure, Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas, Density, Region, ClaimAmount
FROM
    Frequency AS F
LEFT JOIN (
  SELECT
    IDpol,
    SUM(ClaimAmount) AS ClaimAmount
  FROM
    Severity
  GROUP BY
    IDpol) AS S
ON
  F.IDpol = S.IDpol
ORDER BY
  Idpol;
 
UPDATE Combined
SET ClaimNb = 0
WHERE (ClaimAmount IS NULL AND ClaimNb >=1 );
 
UPDATE Combined
SET ClaimAmount = 0
WHERE (ClaimAmount IS NULL);
 
-- Clip the claim count at 4 (the clip level used by Lorentzen)
UPDATE Combined
SET ClaimNb = 4
WHERE ClaimNb > 4;
 
UPDATE Combined
SET Exposure = 1
WHERE Exposure > 1;
 
UPDATE Combined
SET ClaimAmount = 200000
WHERE ClaimAmount > 200000;
 
ALTER TABLE Combined
ADD COLUMN Premium NUMERIC;
 
UPDATE Combined
SET Premium = ClaimAmount / Exposure
WHERE TRUE;
 
END
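
For the Python people, the same cleanup rules can be sketched for a single policy row. This is a minimal illustration, assuming the clip levels from Lorentzen's Python example (ClaimNb capped at 4, Exposure at 1, ClaimAmount at 200,000). The function name `clean_policy` and the sample rows are hypothetical, and a policy with no Severity rows is represented by a ClaimAmount of None.

```python
def clean_policy(claim_nb, exposure, claim_amount):
    """Apply the cleanup rules to one policy row (hypothetical helper).

    claim_amount is None when the policy has no matching Severity rows.
    """
    if claim_amount is None:
        # No recorded amount: treat the policy as having no claim.
        claim_nb = 0
        claim_amount = 0.0
    claim_nb = min(claim_nb, 4)                  # clip claim count
    exposure = min(exposure, 1.0)                # clip exposure at one year
    claim_amount = min(claim_amount, 200_000.0)  # clip large claims
    premium = claim_amount / exposure            # claim amount per unit of exposure
    return claim_nb, exposure, claim_amount, premium


# Made-up rows, for illustration only
print(clean_policy(5, 1.2, 250_000.0))
print(clean_policy(1, 0.5, None))
```

The sketch assumes Exposure is always positive, as the SQL does when it divides by it.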

Training a Machine Learning Model with BigQuery

Like most insurance data, the French MTPL dataset is ridiculously imbalanced. Of 678,000 policies, fewer than 4% (25,000) have claims. This means that you can be fooled into thinking your model is 96% accurate, when it’s just predicting “no claim” every time.

We are going to deal with the imbalance by:

Looking at a “balanced accuracy” metric
Using a probability threshold
Using class weights
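
To make the accuracy trap concrete, here is a small sketch that scores a "model" predicting "no claim" for every policy, using the approximate counts quoted above. The helper name `accuracy_metrics` is my own, and balanced accuracy is taken to be the average of the true-positive and true-negative rates.

```python
def accuracy_metrics(tp, fn, fp, tn):
    """Return (accuracy, balanced accuracy) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    true_positive_rate = tp / (tp + fn)  # share of actual claims caught
    true_negative_rate = tn / (fp + tn)  # share of non-claims caught
    balanced = (true_positive_rate + true_negative_rate) / 2
    return accuracy, balanced


# Always predict "no claim": every claim is a false negative.
policies, claims = 678_000, 25_000
acc, bal = accuracy_metrics(tp=0, fn=claims, fp=0, tn=policies - claims)
print(f"accuracy = {acc:.3f}, balanced accuracy = {bal:.3f}")
# prints: accuracy = 0.963, balanced accuracy = 0.500
```

Plain accuracy rewards the do-nothing model, while balanced accuracy scores it no better than a coin flip.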

Normally, with binary classification, the model will produce probabilities P and (1-P) for positive and negative. In Scikit, predict_proba gives the probabilities, while predict gives only the class labels – assuming a 0.50 threshold.

Since the Allstate challenge, Dal Pozzolo and others have dealt with imbalance by using a threshold other than 0.50 – “raising the bar,” so to speak, for negative cases. Seeking the right threshold can be a pain, but BigQuery supplies a handy slider.

Sliding the threshold moves your false-positive rate up and down the ROC curve, automatically updating the accuracy metrics. Unfortunately, one of these is not balanced accuracy. You’ll have to work that out on your own. Aim for a model with a good, concave ROC curve, giving you room to optimize.
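
Working out balanced accuracy on your own could look something like the sketch below, which tries a few candidate thresholds and keeps the one with the best score. The function names, the candidate list, and the ten scored policies are all made up for illustration. In practice the labels and probabilities would come from the model's predictions.

```python
def balanced_accuracy(labels, probs, threshold):
    """Balanced accuracy when 'claim' is predicted for prob >= threshold."""
    tp = fn = fp = tn = 0
    for is_claim, prob in zip(labels, probs):
        predicted_claim = prob >= threshold
        if is_claim and predicted_claim:
            tp += 1
        elif is_claim:
            fn += 1
        elif predicted_claim:
            fp += 1
        else:
            tn += 1
    # Assumes both classes are present, so neither denominator is zero.
    return (tp / (tp + fn) + tn / (fp + tn)) / 2


def best_threshold(labels, probs, candidates):
    """Return the candidate threshold with the highest balanced accuracy."""
    return max(candidates, key=lambda t: balanced_accuracy(labels, probs, t))


# Made-up scores for ten policies, two of which had claims (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
probs = [0.02, 0.03, 0.05, 0.04, 0.10, 0.06, 0.08, 0.12, 0.30, 0.40]
candidates = [0.05, 0.11, 0.25, 0.50]
print(best_threshold(labels, probs, candidates))
```

On this toy data the default 0.50 threshold catches no claims at all, so a much lower threshold wins.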

A standard way to deal with imbalanced data is to oversample the minority class. In Scikit, we might use random oversampling, or maybe synthetic minority oversampling. BQML doesn’t support oversampling, but we can get a similar effect using class weights. Here’s the script:

CREATE OR REPLACE MODEL `french-cars.french_mtpl.classifier1`
    TRANSFORM (
        ML.QUANTILE_BUCKETIZE(VehAge, 10) OVER() AS VehAge,
        ML.QUANTILE_BUCKETIZE(DrivAge, 10) OVER() AS DrivAge,
        CAST (VehPower AS string) AS VehPower,
        ML.STANDARD_SCALER(Log(Density)) OVER() AS Density,
        Exposure,
        Area,
        BonusMalus,
        VehBrand,
        VehGas,
        Region,
        ClaimClass
    )
OPTIONS (
    INPUT_LABEL_COLS = ['ClaimClass'], 
    MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
    NUM_PARALLEL_TREE = 200,
    MAX_TREE_DEPTH = 4,
    TREE_METHOD = 'HIST',
    MAX_ITERATIONS = 20,
    DATA_SPLIT_METHOD = 'Random',
    DATA_SPLIT_EVAL_FRACTION = 0.10,
    CLASS_WEIGHTS = [STRUCT('NoClaim', 0.05), ('Claim', 0.95)]
    )  
AS SELECT
  Area,
  VehPower,
  VehAge,
  DrivAge,
  BonusMalus,
  VehBrand,
  VehGas,
  Density,
  Exposure,
  Region, 
  ClaimClass
FROM `french-cars.french_mtpl.Frequency`
WHERE Split = 'TRAIN'

I do some bucketizing, and CAST Vehicle Power to string, just to make the decision tree behave better. Wüthrich showed that it only takes a few levels to capture the interaction effects. This particular classifier achieves 0.63 balanced accuracy. Navigate to the model’s “Evaluation” tab to see the metrics.

The OPTIONS are pretty standard. This is XGBoost behind the scenes. Like me, you may have used the XGB library in Python with its native API or the Scikit API. Note how the class weights STRUCT offsets the higher frequency of the “no claim” case.
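
A minimal sketch of one common heuristic, inverse class frequency, shows why weights near 0.05 and 0.95 are in the right range for this data. The function name and the approximate class counts (25,000 claims out of 678,000 policies) are illustrative assumptions, not part of the script above.

```python
def inverse_frequency_weights(counts):
    """Weight each class by the inverse of its frequency, normalized to sum to 1."""
    total = sum(counts.values())
    raw = {label: total / n for label, n in counts.items()}
    norm = sum(raw.values())
    return {label: weight / norm for label, weight in raw.items()}


# Approximate class counts for the French MTPL data
weights = inverse_frequency_weights({"NoClaim": 653_000, "Claim": 25_000})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With only two classes, each normalized weight works out to the other class's share of the data, so the rare class ends up with nearly all of the weight.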

I can’t decide if I prefer to split the test set into a separate table, or just segregate it using WHERE on the Split column. Code for both is in the repo. BQML definitely prefers the Split column.
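
If you go the Split column route, one way to populate it is to hash the policy ID, so that a given policy always lands in the same split. The sketch below is a hypothetical illustration of that idea in Python and is not necessarily how the repo does it. The function name, the MD5 hash, and the 10% test share are all assumptions.

```python
import hashlib


def assign_split(policy_id, test_percent=10):
    """Deterministically label a policy 'TEST' or 'TRAIN' from a hash of its ID."""
    digest = hashlib.md5(str(policy_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random bucket in 0..99
    return "TEST" if bucket < test_percent else "TRAIN"


# Made-up policy IDs, for illustration only
splits = [assign_split(policy_id) for policy_id in range(1, 10_001)]
print(splits.count("TEST") / len(splits))  # close to 0.10
```

Because the label depends only on the ID, rerunning the assignment never moves a policy from one split to the other.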

There are two ways to invoke AutoML. One is to choose AutoML as the model type in the SQL script, and the other is to go through the Vertex AI browser console. In the latter case, you will want a Split column. Running AutoML on tabular data costs $22 per server-hour, as of this writing. The cost of regular BQML and data storage is insignificant. Oddly, AutoML is cheaper for image data.

Don’t forget to include the label column in the SELECT list! This always trips me up, because I am accustomed to thinking of the label as “special.” However, this is still SQL and everything must be in the SELECT list.

Making Predictions with BigQuery ML

Now, we are ready to make predictions with our new model. Here’s the code:

SELECT
    IDpol,
    predicted_ClaimClass_probs,
FROM 
    ML.PREDICT (
    MODEL `french-cars.french_mtpl.classifier1`,
    (
    SELECT
      IDpol,
      BonusMalus,
      Area,
      VehPower,
      VehAge,
      DrivAge,
      Exposure,
      VehBrand,
      VehGas,
      Density,
      Region
    FROM
      `french-cars.french_mtpl.Frequency`
    WHERE Split = 'TEST'))

The model is treated like a FROM table, with its source data in a subquery. Note that we trained on Split = ‘TRAIN’ and now we are using TEST. For each policy, the model returns a repeated field holding one label and probability per class, which the console displays as multiple rows per policy.

This is a little awkward to work with. Since we only want the claims probability, we must UNNEST it from its data structure and select prob where label is “Claim.” Support for nested and repeated data, i.e., denormalization, is typical of data warehouse systems like BigQuery.

SELECT IDpol, probs.prob 
FROM pred, 
UNNEST (predicted_ClaimClass_probs) AS probs
WHERE probs.label = "Claim"
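
For the Python people, UNNEST behaves much like a nested loop over each row's list of (label, prob) pairs. The sketch below mimics the query on two made-up prediction rows. The sample values and the function name `claim_probability` are hypothetical.

```python
# Two made-up prediction rows shaped like the ML.PREDICT output
rows = [
    {"IDpol": 1, "predicted_ClaimClass_probs": [
        {"label": "Claim", "prob": 0.62}, {"label": "NoClaim", "prob": 0.38}]},
    {"IDpol": 2, "predicted_ClaimClass_probs": [
        {"label": "Claim", "prob": 0.07}, {"label": "NoClaim", "prob": 0.93}]},
]


def claim_probability(rows, label="Claim"):
    """Flatten the repeated field and keep only the entries for one label."""
    return [
        (row["IDpol"], entry["prob"])
        for row in rows                                # FROM pred
        for entry in row["predicted_ClaimClass_probs"]  # UNNEST(...) AS probs
        if entry["label"] == label                     # WHERE probs.label = "Claim"
    ]


print(claim_probability(rows))
```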

Now that we know how to use the model, we can store the results in a new table, JOIN or UPDATE an existing table, etc. All we need for the ranking exercise is the probs and the actual Claim Amount.

Working with BigQuery Tables in Vertex AI

Finally, we have a task that requires Python. We want to measure, using a Gini index, how well our model ranks claims risk. For this, we navigate to Vertex AI, and open a Jupyter notebook. This is the same as any other notebook, like Google Colab, except that it integrates with BigQuery.

from google.cloud import bigquery
client = bigquery.Client(location="US")
sql = """SELECT * FROM `french_mtpl.Combined_Results` """ 
df = client.query(sql).to_dataframe()

The Client class allows you to run SQL against BigQuery and write the results to a Pandas dataframe. The notebook is already associated with your GCP project, so you only have to specify the dataset. There is also a Jupyter magic cell command, %%bigquery.

Honestly, I think the hardest thing about Google Cloud Platform is just learning your way around the console. Like, where is the “New Notebook” button? Vertex used to be called “AI Platform,” and notebooks are under “Workbench.”

I coded my own Gini routine for the Allstate challenge, but the one from Lorentzen is better, so that is the one I use here. Also, if you’re familiar with that contest, Allstate made us plot it upside down. Corrado Gini would be displeased.
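
For readers who want to experiment without the notebook, here is a minimal pure-Python sketch of a Gini index based on the Lorenz curve. It is a simplified stand-in and not Lorentzen's routine: it ignores exposure weighting, and the function names and toy data are my own.

```python
def lorenz_curve(actual, predicted):
    """Cumulative share of claims, with policies sorted from lowest to highest predicted risk."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    total = float(sum(actual))
    n = len(actual)
    xs, ys = [0.0], [0.0]
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += actual[i]
        xs.append(rank / n)          # share of policies seen so far
        ys.append(running / total)   # share of claim amount seen so far
    return xs, ys


def gini(actual, predicted):
    """Gini index: 1 minus twice the area under the Lorenz curve (trapezoid rule)."""
    xs, ys = lorenz_curve(actual, predicted)
    area = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2
               for k in range(1, len(xs)))
    return 1 - 2 * area


# Toy data: four policies, one claim, ranked correctly by the model
actual = [0, 0, 0, 100]
predicted = [0.1, 0.2, 0.3, 0.9]
print(gini(actual, predicted))
```

A score near zero means the model's ordering is no better than random. The maximum attainable value depends on how concentrated the claims are, which is why some authors divide by the Gini of a perfect ranking.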

The actual claims, correctly sorted, are shown by the dotted line on the chart – a lot of zero, and then 2,500 claims. Claims, as sorted by the model, are shown by the blue line. The model does a respectable 0.30 Gini and 0.62 balanced accuracy.

Confusion Table:
       Pred_1 Pred_0 Total Pct. Correct
True_1   1731    771  2502     0.691847
True_0  29299  36420 65719     0.554178
Accuracy: 0.5592
Balanced Accuracy: 0.6230

Now that we have a good classifier, the next step would be to combine it with a severity model. The classifier can predict which policies will have claims – or the probability of such – and the regressor can predict the amount. Since this is already a long article, I am going to leave the second model as an exercise.

We have seen how to make a simple machine learning model using BigQuery ML, starting from a CSV file in Google Cloud Storage, and proceeding through SQL and Python, to a notebook in Vertex AI. We also discussed AutoML, and there’s a bunch of sample code in the repo.

Author: Mark Virag

Management consultant specializing in software solutions for the auto finance industry.