Did you know you can develop a machine learning model using SQL? Google's cloud data warehouse, BigQuery, includes SQL support for machine learning through extensions like CREATE MODEL, by analogy with the SQL DDL statement CREATE TABLE.
If you’re like me, you’re probably thinking, “why on Earth would I ever use SQL for machine learning?” Google’s argument is that a lot of data people are handy with SQL, not so much with Python, and the data is already sitting in a SQL-based warehouse.
BigQuery ML features all the popular model types, from classifiers to matrix factorization, including an automated model picker called AutoML. There's also the general advantage of cloud ML: you don't have to build a special rig (I built two) for GPU support.
In this article, I am going to work through a simple insurance problem using BigQuery ML (BQML). My plan is to provide an overview that will engage both the Python people and the SQL people, so that both camps can get better results from their data warehouse. The workflow is:
- Ingest data via Google Cloud Storage
- Transform and model the data in BigQuery
- Access the results from a Vertex AI notebook
By the way, I have placed much of the code in a public repo. I love grabbing up code samples from Analytics Vidhya and Towards Data Science, so this is my way of giving back.
Case Study: French Motor Third-Party Liability Claims
We're going to use the French car insurance data from Wüthrich et al. (2020). They focus on minimizing the loss function (regression loss, not insurance loss) and show that decision trees outperform linear models because they capture interactions among the variables.
There are a few ways to handle this problem. While Wüthrich treats it as a straightforward regression problem, Lorentzen et al. use a composition of two linear models: one for claim frequency and a second for claim severity. As we shall see, this approach follows the structure of the data.
Lorentzen et al. focus on the Gini index as a measure of fitness. This is supported by Frees, and also by the Allstate challenge, although it does reduce the problem to a ranking exercise. We are going to follow the example of Dal Pozzolo and train a classifier to deal with the imbalance issue.
Ingesting the Data via Google Cloud Storage
First, create a bucket in GCS and upload the two CSV files, which are mirrored in various places online. Next, in BigQuery, create a dataset with two tables, Frequency and Severity. Finally, execute this bq load command from the Cloud Shell:
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  french-cars:french_mtpl.Frequency \
  gs://french_mtpl2/freMTPL2freq.csv
The last two lines are the syntax for the destination table and the GCS bucket/file, respectively. Autodetect works fine for the data types, although I'd rather have NUMERIC for Exposure. I have included JSON schemas in the repo.
It's the most natural thing in the world to specify the data types in JSON and store the schema in the bucket alongside the data, but bq load won't use it! To apply the schema file, you have to create and load the table manually in the browser console.
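If you'd rather not click through the console, another option is to pass the schema programmatically with the BigQuery Python client. This is only a sketch: the project, dataset, table, and bucket names mirror the bq load command above, and just a few of the Frequency columns are shown.

from google.cloud import bigquery

client = bigquery.Client(project="french-cars")

# Explicit schema (abbreviated; the full field list is in the repo's JSON schema)
schema = [
    bigquery.SchemaField("IDpol", "INTEGER"),
    bigquery.SchemaField("ClaimNb", "INTEGER"),
    bigquery.SchemaField("Exposure", "NUMERIC"),  # NUMERIC rather than autodetected FLOAT
    # ... remaining Frequency columns ...
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=schema,
)

load_job = client.load_table_from_uri(
    "gs://french_mtpl2/freMTPL2freq.csv",
    "french-cars.french_mtpl.Frequency",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish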
Wüthrich specifies a number of clip levels, and Lorentzen implements them in Python. I used SQL. This is where we feel good about working in a data warehouse: we have to JOIN the Severity data and GROUP BY to aggregate multiple claims per policy, and SQL is the right tool for the job.
BEGIN
  SET @@dataset_id = 'french_mtpl';

  DROP TABLE IF EXISTS Combined;
  CREATE TABLE Combined AS
  SELECT
    F.IDpol, ClaimNb, Exposure, Area, VehPower, VehAge, DrivAge,
    BonusMalus, VehBrand, VehGas, Density, Region, ClaimAmount
  FROM Frequency AS F
  LEFT JOIN (
    SELECT IDpol, SUM(ClaimAmount) AS ClaimAmount
    FROM Severity
    GROUP BY IDpol) AS S
  ON F.IDpol = S.IDpol
  ORDER BY IDpol;

  UPDATE Combined SET ClaimNb = 0 WHERE (ClaimAmount IS NULL AND ClaimNb >= 1);
  UPDATE Combined SET ClaimAmount = 0 WHERE (ClaimAmount IS NULL);
  UPDATE Combined SET ClaimNb = 1 WHERE ClaimNb > 4;
  UPDATE Combined SET Exposure = 1 WHERE Exposure > 1;
  UPDATE Combined SET ClaimAmount = 200000 WHERE ClaimAmount > 200000;

  ALTER TABLE Combined ADD COLUMN Premium NUMERIC;
  UPDATE Combined SET Premium = ClaimAmount / Exposure WHERE TRUE;
END
Training a Machine Learning Model with BigQuery ML
Like most insurance data, the French MTPL dataset is ridiculously imbalanced. Of 678,000 policies, fewer than 4% (25,000) have claims. This means that you can be fooled into thinking your model is 96% accurate, when it’s just predicting “no claim” every time.
We are going to deal with the imbalance by:
- Looking at a “balanced accuracy” metric
- Using a probability threshold
- Using class weights
Normally, with binary classification, the model produces probabilities P and (1 - P) for the positive and negative classes. In scikit-learn, predict_proba gives the probabilities, while predict gives only the class labels, assuming a 0.50 threshold.
Since the Allstate challenge, Dal Pozzolo and others have dealt with imbalance by using a threshold other than 0.50, raising the bar, so to speak, for negative cases. Finding the right threshold can be a pain, but BigQuery supplies a handy slider.
Sliding the threshold moves your false-positive rate up and down the ROC curve, automatically updating the accuracy metrics. Unfortunately, balanced accuracy is not among them; you'll have to work that out on your own. Aim for a model with a good, concave ROC curve, which gives you room to optimize.
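To make the thresholding idea concrete, here is a minimal scikit-learn sketch on synthetic data. The dataset, the model, and the threshold value are all illustrative placeholders, not the article's BQML model.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Tiny synthetic stand-in for an imbalanced claims dataset (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.04).astype(int)  # ~4% positives, like the MTPL data

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]           # probability of the "Claim" class

# predict() would apply a 0.50 cutoff; with rare positives we lower the bar for "Claim"
threshold = 0.04                             # illustrative value; tune on validation data
y_pred = (probs >= threshold).astype(int)

print(balanced_accuracy_score(y, y_pred))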
A standard way to deal with imbalanced data is to oversample the minority class. In Python, we might use random oversampling or SMOTE (synthetic minority oversampling) from the imbalanced-learn library. BQML doesn't support oversampling, but we can get much the same effect using class weights. Here's the script:
CREATE OR REPLACE MODEL `french-cars.french_mtpl.classifier1`
TRANSFORM (
  ML.QUANTILE_BUCKETIZE(VehAge, 10) OVER() AS VehAge,
  ML.QUANTILE_BUCKETIZE(DrivAge, 10) OVER() AS DrivAge,
  CAST(VehPower AS STRING) AS VehPower,
  ML.STANDARD_SCALER(LOG(Density)) OVER() AS Density,
  Exposure, Area, BonusMalus, VehBrand, VehGas, Region, ClaimClass
)
OPTIONS (
  INPUT_LABEL_COLS = ['ClaimClass'],
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  NUM_PARALLEL_TREE = 200,
  MAX_TREE_DEPTH = 4,
  TREE_METHOD = 'HIST',
  MAX_ITERATIONS = 20,
  DATA_SPLIT_METHOD = 'RANDOM',
  DATA_SPLIT_EVAL_FRACTION = 0.10,
  CLASS_WEIGHTS = [STRUCT('NoClaim', 0.05), ('Claim', 0.95)]
) AS
SELECT
  Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand,
  VehGas, Density, Exposure, Region, ClaimClass
FROM `french-cars.french_mtpl.Frequency`
WHERE Split = 'TRAIN'
I do some bucketizing and CAST VehPower to a string, just to make the decision tree behave better. Wüthrich showed that it only takes a few levels to capture the interaction effects. This particular classifier achieves 0.63 balanced accuracy. Navigate to the model's "Evaluation" tab to see the metrics.
The OPTIONS are pretty standard. This is XGBoost behind the scenes. Like me, you may have used the XGBoost library in Python with its native API or the scikit-learn API. Note how the CLASS_WEIGHTS struct offsets the higher frequency of the "no claim" case.
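For readers who know the Python side, here is a rough sketch of what those OPTIONS correspond to in the XGBoost scikit-learn API. The data is synthetic and the mapping is approximate; it is not a reproduction of what BigQuery runs internally.

import numpy as np
import xgboost as xgb

# Synthetic stand-in data; in the article, the features come from the Frequency table
rng = np.random.default_rng(0)
X_train = rng.normal(size=(2_000, 10))
y_train = (rng.random(2_000) < 0.04).astype(int)  # rare "Claim" class

clf = xgb.XGBClassifier(
    n_estimators=20,               # MAX_ITERATIONS (boosting rounds)
    num_parallel_tree=200,         # NUM_PARALLEL_TREE
    max_depth=4,                   # MAX_TREE_DEPTH
    tree_method="hist",            # TREE_METHOD
    scale_pos_weight=0.95 / 0.05,  # roughly 19:1, mirroring the CLASS_WEIGHTS ratio
)
clf.fit(X_train, y_train)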
I can’t decide if I prefer to split the test set into a separate table, or just segregate it using WHERE on the Split column. Code for both is in the repo. BQML definitely prefers the Split column.
There are two ways to invoke AutoML. One is to choose AutoML as the model type in the SQL script, and the other is to go through the Vertex AI browser console. In the latter case, you will want a Split column. Running AutoML on tabular data costs $22 per server-hour, as of this writing. The cost of regular BQML and data storage is insignificant. Oddly, AutoML is cheaper for image data.
Don't forget to include the label column in the SELECT list! This always trips me up, because I am accustomed to thinking of the label as "special." However, this is still SQL, and everything must be in the SELECT list.
Making Predictions with BigQuery ML
Now, we are ready to make predictions with our new model. Here’s the code:
SELECT IDpol, predicted_ClaimClass_probs
FROM ML.PREDICT (
  MODEL `french-cars.french_mtpl.classifier1`,
  (
    SELECT
      IDpol, BonusMalus, Area, VehPower, VehAge, DrivAge,
      Exposure, VehBrand, VehGas, Density, Region
    FROM `french-cars.french_mtpl.Frequency`
    WHERE Split = 'TEST'))
The model is treated like a table in the FROM clause, with its source data in a subquery. Note that we trained on Split = 'TRAIN' and now we are predicting on 'TEST'. The model returns multiple rows for each policy, giving the probability for each class.
This is a little awkward to work with. Since we only want the claim probability, we must UNNEST it from its data structure and select prob where the label is "Claim." Support for nested and repeated data, i.e., denormalization, is typical of data warehouse systems like BigQuery. In the query below, pred stands for the prediction results above, saved as a table or wrapped in a WITH clause.
SELECT IDpol, probs.prob
FROM pred, UNNEST (predicted_ClaimClass_probs) AS probs
WHERE probs.label = "Claim"
Now that we know how to use the model, we can store the results in a new table, JOIN or UPDATE an existing table, etc. All we need for the ranking exercise is the probs and the actual Claim Amount.
Working with BigQuery Tables in Vertex AI
Finally, we have a task that requires Python. We want to measure, using a Gini index, how well our model ranks claims risk. For this, we navigate to Vertex AI and open a Jupyter notebook. This is the same as any other notebook, like Google Colab, except that it integrates with BigQuery.
from google.cloud import bigquery

client = bigquery.Client(location="US")

sql = """
    SELECT *
    FROM `french_mtpl.Combined_Results`
"""
df = client.query(sql).to_dataframe()
The Client class allows you to run SQL against BigQuery and write the results to a pandas dataframe. The notebook is already associated with your GCP project, so you only have to specify the dataset. There is also a Jupyter cell magic, %%bigquery.
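The magic version of the same query looks roughly like this; the two comments mark separate notebook cells, and the dataframe name df is arbitrary.

# Cell 1: load the BigQuery cell magic
%load_ext google.cloud.bigquery

# Cell 2: run the query and store the result in a dataframe named df
%%bigquery df
SELECT * FROM `french_mtpl.Combined_Results`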
Honestly, I think the hardest thing about Google Cloud Platform is just learning your way around the console. Like, where is the “New Notebook” button? Vertex used to be called “AI Platform,” and notebooks are under “Workbench.”
I coded my own Gini routine for the Allstate challenge, but the one from Lorentzen et al. is better; a sketch along those lines is below. Also, if you're familiar with that contest, Allstate made us plot it upside down. Corrado Gini would be displeased.
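Here is a minimal sketch of an ordering-based (normalized) Gini in that spirit. It assumes the results dataframe df from the query above, with the model's claim probability in a column called prob and the actual losses in ClaimAmount; both column names are assumptions about how you saved the results.

import numpy as np

def normalized_gini(y_true, y_score):
    """Gini of the model's ranking, scaled by the Gini of a perfect ranking."""
    def gini(actual, pred):
        order = np.argsort(-np.asarray(pred))           # riskiest first, per the ranking
        actual = np.asarray(actual, dtype=float)[order]
        n = len(actual)
        cum_share = np.cumsum(actual) / actual.sum()    # cumulative share of losses
        return cum_share.sum() / n - (n + 1) / (2 * n)  # area between curve and diagonal
    return gini(y_true, y_score) / gini(y_true, y_true)

print(normalized_gini(df["ClaimAmount"], df["prob"]))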
The actual claims, correctly sorted, are shown by the dotted line on the chart: a long run of zeros, and then the 2,500 claims. Claims as sorted by the model are shown by the blue line. The model does a respectable 0.30 Gini and 0.62 balanced accuracy.
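A sketch of that chart with matplotlib, again assuming df holds the claim counts in ClaimNb and the model probabilities in prob:

import numpy as np
import matplotlib.pyplot as plt

def cumulative_claims(actual, ranking):
    """Cumulative claim count, with policies ordered by the given ranking (lowest risk first)."""
    order = np.argsort(np.asarray(ranking))
    return np.cumsum(np.asarray(actual)[order])

x = np.arange(len(df)) / len(df)  # fraction of policies
plt.plot(x, cumulative_claims(df["ClaimNb"], df["ClaimNb"]), "k:", label="Perfectly sorted")
plt.plot(x, cumulative_claims(df["ClaimNb"], df["prob"]), "b-", label="Sorted by model")
plt.xlabel("Share of policies, ordered by predicted risk")
plt.ylabel("Cumulative claims")
plt.legend()
plt.show()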
Confusion Table:

          Pred_1   Pred_0    Total   Pct. Correct
True_1      1731      771     2502       0.691847
True_0     29299    36420    65719       0.554178

Accuracy:          0.5592
Balanced Accuracy: 0.6230
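As a quick sanity check, the reported metrics follow directly from the confusion table:

# Values from the confusion table above
tp, fn = 1731, 771     # True_1 row: predicted 1, predicted 0
fp, tn = 29299, 36420  # True_0 row: predicted 1, predicted 0

accuracy = (tp + tn) / (tp + fn + fp + tn)                 # 0.5592
balanced_accuracy = (tp / (tp + fn) + tn / (fp + tn)) / 2  # 0.6230
print(round(accuracy, 4), round(balanced_accuracy, 4))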
Now that we have a good classifier, the next step would be to combine it with a severity model. The classifier can predict which policies will have claims – or the probability of such – and the regressor can predict the amount. Since this is already a long article, I am going to leave the second model as an exercise.
We have seen how to build a simple machine learning model using BigQuery ML, starting from a CSV file in Google Cloud Storage and proceeding through SQL and Python to a notebook in Vertex AI. We also discussed AutoML, and there's plenty of sample code in the repo.