Claims Prediction with BQML

Did you know you could develop a machine learning model using SQL?  Google’s cloud data warehouse, BigQuery, includes SQL support for machine learning with extensions like CREATE MODEL – by analogy with the SQL DDL statement CREATE TABLE.

If you’re like me, you’re probably thinking, “why on Earth would I ever use SQL for machine learning?”  Google’s argument is that a lot of data people are handy with SQL, not so much with Python, and the data is already sitting in a SQL-based warehouse.

BigQuery ML features all the popular model types, from classifiers to matrix factorization, including an automated model picker called AutoML.  There’s also the advantage of cloud ML in general, which is that you don’t have to build a special rig (I built two) for GPU support.

In this article, I am going to work through a simple insurance problem using BQML.  My plan is to provide an overview that will engage both the Python people and the SQL people, so that both camps will get better results from their data warehouse.

  1. Ingest data via Google Cloud Storage
  2. Transform and model the data in BigQuery
  3. Access the results from a Vertex AI notebook

By the way, I have placed much of the code in a public repo.  I love grabbing up code samples from Analytics Vidhya and Towards Data Science, so this is my way of giving back.

Case Study: French Motor Third-Party Liability Claims

We’re going to use the French car insurance data from Wüthrich et al. (2020).  They focus on minimizing the loss function (regression loss, not insurance loss) and show that decision trees outperform linear models because they capture interactions among the variables.

There are a few ways to handle this problem.  While Wüthrich treats it as a straightforward regression problem, Lorentzen et al. use a composition of two linear models, one for claim frequency and a second for claim severity.  As we shall see, this approach follows the structure of the data.

Lorentzen et al. focus on the Gini index as a measure of fitness.  This is supported by Frees, and also by the Allstate challenge, although it does reduce the problem to a ranking exercise.  We are going to follow the example of Dal Pozzolo and train a classifier to deal with the imbalance issue.

Ingesting the Data via Google Cloud Storage

First, create a bucket in GCS and upload the two CSV files.  They’re mirrored in various places, like here.  Next, in BigQuery, create a dataset with two tables, Frequency and Severity.  Finally, execute this bq load script from the Cloud Shell:

bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
french-cars:french_mtpl.Frequency \
gs://french_mtpl2/freMTPL2freq.csv

The last two lines give the destination table and the GCS bucket/file, respectively; run the same command again for the Severity table and freMTPL2sev.csv.  Autodetect works fine for the data types, although I’d rather have NUMERIC for Exposure.  I have included JSON schemas in the repo.

It’s the most natural thing in the world to specify data types in JSON, storing this schema in the bucket with the data, but bq load won’t use it!  To utilize the schema file, you must create and load the table manually in the browser console.
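
For reference, a BigQuery schema file is just a JSON array of field definitions.  Here is a two-field sketch of the shape, not the repo’s exact file:

[
  {"name": "IDpol", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "Exposure", "type": "NUMERIC", "mode": "NULLABLE"}
]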

Wüthrich specifies a number of clip levels, and Lorentzen implements them in Python.  I used SQL.  This is where we feel good about working in a data warehouse.  We have to JOIN the Severity data and GROUP BY multiple claims per policy, and SQL is the right tool for the job.

BEGIN
SET @@dataset_id = 'french_mtpl'; 
 
DROP TABLE IF EXISTS Combined;
CREATE TABLE Combined AS
SELECT F.IDpol, ClaimNb, Exposure, Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas, Density, Region, ClaimAmount
FROM
    Frequency AS F
LEFT JOIN (
  SELECT
    IDpol,
    SUM(ClaimAmount) AS ClaimAmount
  FROM
    Severity
  GROUP BY
    IDpol) AS S
ON
  F.IDpol = S.IDpol
ORDER BY
  IDpol;
 
UPDATE Combined
SET ClaimNb = 0
WHERE (ClaimAmount IS NULL AND ClaimNb >=1 );
 
UPDATE Combined
SET ClaimAmount = 0
WHERE (ClaimAmount IS NULL);
 
UPDATE Combined
SET ClaimNb = 1
WHERE ClaimNb > 4;
 
UPDATE Combined
SET Exposure = 1
WHERE Exposure > 1;
 
UPDATE Combined
SET ClaimAmount = 200000
WHERE ClaimAmount > 200000;
 
ALTER TABLE Combined
ADD COLUMN Premium NUMERIC;
 
UPDATE Combined
SET Premium = ClaimAmount / Exposure
WHERE TRUE;
 
END

Training a Machine Learning Model with BigQuery

Like most insurance data, the French MTPL dataset is ridiculously imbalanced.  Of 678,000 policies, fewer than 4% (25,000) have claims.  This means a model can look 96% accurate while simply predicting “no claim” every time.

We are going to deal with the imbalance by:

  • Looking at a “balanced accuracy” metric
  • Using a probability threshold
  • Using class weights

Normally, with binary classification, the model will produce probabilities P and (1-P) for positive and negative.  In Scikit, predict_proba gives the probabilities, while predict gives only the class labels – assuming a 0.50 threshold.

Since the Allstate challenge, Dal Pozzolo and others have dealt with imbalance by using a threshold other than 0.50 – “raising the bar,” so to speak, for negative cases.  Seeking the right threshold can be a pain, but BigQuery supplies a handy slider.
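
To make the threshold idea concrete, here is a minimal scikit-learn sketch on synthetic data.  The toy dataset and the 0.25 cutoff are my own illustrations, not part of the BigQuery workflow:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy data with a 96/4 class split, standing in for the claims problem
X, y = make_classification(n_samples=10_000, weights=[0.96], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier().fit(X_train, y_train)

proba_claim = clf.predict_proba(X_test)[:, 1]    # P(claim) for each policy
default_pred = clf.predict(X_test)               # implicit 0.50 threshold
custom_pred = (proba_claim >= 0.25).astype(int)  # lower bar for 'claim', higher bar for 'no claim'

The point is simply that predict bakes in 0.50, while predict_proba lets you choose the cutoff; BigQuery’s slider plays the same role.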

Sliding the threshold moves your false-positive rate up and down the ROC curve, automatically updating the accuracy metrics.  Unfortunately, balanced accuracy is not among them; you’ll have to work that one out on your own.  Aim for a model with a good, concave ROC curve, giving you room to optimize.

The best way to deal with imbalanced data is to oversample the minority class.  In Python, we might use random oversampling, or maybe synthetic minority oversampling (SMOTE), from the imbalanced-learn package.  BQML doesn’t support oversampling, but we can get the same effect using class weights.  Here’s the script:

CREATE OR REPLACE MODEL `french-cars.french_mtpl.classifier1`
    TRANSFORM (
        ML.QUANTILE_BUCKETIZE(VehAge, 10) OVER() AS VehAge,
        ML.QUANTILE_BUCKETIZE(DrivAge, 10) OVER() AS DrivAge,
        CAST (VehPower AS string) AS VehPower,
        ML.STANDARD_SCALER(Log(Density)) OVER() AS Density,
        Exposure,
        Area,
        BonusMalus,
        VehBrand,
        VehGas,
        Region,
        ClaimClass
    )
OPTIONS (
    INPUT_LABEL_COLS = ['ClaimClass'], 
    MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
    NUM_PARALLEL_TREE = 200,
    MAX_TREE_DEPTH = 4,
    TREE_METHOD = 'HIST',
    MAX_ITERATIONS = 20,
    DATA_SPLIT_METHOD = 'Random',
    DATA_SPLIT_EVAL_FRACTION = 0.10,
    CLASS_WEIGHTS = [STRUCT('NoClaim', 0.05), ('Claim', 0.95)]
    )  
AS SELECT
  Area,
  VehPower,
  VehAge,
  DrivAge,
  BonusMalus,
  VehBrand,
  VehGas,
  Density,
  Exposure,
  Region, 
  ClaimClass
FROM `french-cars.french_mtpl.Frequency`
WHERE Split = 'TRAIN'

I do some bucketizing, and CAST VehPower to STRING, just to make the boosted trees behave better.  Wüthrich showed that it only takes a few levels to capture the interaction effects.  This particular classifier achieves 0.63 balanced accuracy.  Navigate to the model’s “Evaluation” tab to see the metrics.

The OPTIONS are pretty standard.  This is XGBoost behind the scenes.  Like me, you may have used the XGB library in Python with its native API or the Scikit API.  Note how the class weights STRUCT offsets the higher frequency of the “no claim” case.

I can’t decide if I prefer to split the test set into a separate table, or just segregate it using WHERE on the Split column.  Code for both is in the repo.  BQML definitely prefers the Split column.

There are two ways to invoke AutoML.  One is to choose AutoML as the model type in the SQL script (e.g., MODEL_TYPE = 'AUTOML_CLASSIFIER'), and the other is to go through the Vertex AI browser console.  In the latter case, you will want a Split column.  Running AutoML on tabular data costs $22 per server-hour, as of this writing.  The cost of regular BQML and data storage is insignificant.  Oddly, AutoML is cheaper for image data.

Don’t forget to include the label column in the SELECT list!  This always trips me up, because I am accustomed to thinking of it as “special” because it’s the label.  However, this is still SQL and everything must be in the SELECT list.

Making Predictions with BigQuery ML

Now, we are ready to make predictions with our new model.  Here’s the code:

SELECT
    IDpol,
    predicted_ClaimClass_probs
FROM 
    ML.PREDICT (
    MODEL `french-cars.french_mtpl.classifier1`,
    (
    SELECT
      IDpol,
      BonusMalus,
      Area,
      VehPower,
      VehAge,
      DrivAge,
      Exposure,
      VehBrand,
      VehGas,
      Density,
      Region
    FROM
      `french-cars.french_mtpl.Frequency`
    WHERE Split = 'TEST'))

The model is treated like a FROM table, with its source data in a subquery.  Note that we trained on Split = ‘TRAIN’ and now we are predicting on TEST.  For each policy, the model returns a predicted_ClaimClass_probs array holding a label and a probability for each class.

This is a little awkward to work with.  Since we only want the claims probability, we must UNNEST the array and select prob where label is “Claim.”  Support for nested and repeated data, i.e., denormalization, is typical of data warehouse systems like BigQuery.

-- "pred" is the ML.PREDICT result above, saved to a table or wrapped in a WITH clause
SELECT IDpol, probs.prob
FROM pred, 
UNNEST (predicted_ClaimClass_probs) AS probs
WHERE probs.label = "Claim"

Now that we know how to use the model, we can store the results in a new table, JOIN or UPDATE an existing table, etc.  All we need for the ranking exercise is the probs and the actual Claim Amount.

Working with BigQuery Tables in Vertex AI

Finally, we have a task that requires Python.  We want to measure, using a Gini index, how well our model ranks claims risk.  For this, we navigate to Vertex AI and open a Jupyter notebook.  This is the same as any other notebook, like Google Colab, except that it integrates with BigQuery.

from google.cloud import bigquery
client = bigquery.Client(location="US")
sql = """SELECT * FROM `french_mtpl.Combined_Results` """ 
df = client.query(sql).to_dataframe()

The Client class lets you run SQL against BigQuery and write the results to a pandas DataFrame.  The notebook is already associated with your GCP project, so you only have to specify the dataset.  There is also a Jupyter cell magic, %%bigquery.
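
If you prefer the magic, here is a minimal sketch as two separate notebook cells.  The extension ships with the google-cloud-bigquery package, and the results land in whatever dataframe name follows the magic:

%load_ext google.cloud.bigquery

and then, in its own cell:

%%bigquery df
SELECT * FROM `french_mtpl.Combined_Results`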

Honestly, I think the hardest thing about Google Cloud Platform is just learning your way around the console.  Like, where is the “New Notebook” button?  Vertex used to be called “AI Platform,” and notebooks are under “Workbench.”

I coded my own Gini routine for the Allstate challenge, but the one from Lorentzen is better, so here it is.  Also, if you’re familiar with that contest, Allstate made us plot it upside down.  Corrado Gini would be displeased.
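
What follows is a sketch along those lines, not the verbatim Lorentzen code.  I assume a dataframe holding the actual ClaimAmount and the model’s claim probability; the column names are mine:

import numpy as np
from sklearn.metrics import auc

def lorenz_curve(y_true, y_score):
    # Cumulative share of claim amounts, with policies sorted by predicted risk (ascending)
    order = np.argsort(y_score)
    cum_claims = np.cumsum(np.asarray(y_true)[order])
    cum_claims = cum_claims / cum_claims[-1]
    cum_policies = np.linspace(0, 1, len(cum_claims))
    return cum_policies, cum_claims

def gini(y_true, y_score):
    # Twice the area between the diagonal and the Lorenz curve
    x, y = lorenz_curve(y_true, y_score)
    return 1 - 2 * auc(x, y)

# e.g. gini(df["ClaimAmount"], df["prob"]) on the Combined_Results dataframe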

The actual claims, correctly sorted, are shown by the dotted line on the chart – a lot of zeros, and then 2,500 claims.  Claims, as sorted by the model, are shown by the blue line.  The model does a respectable 0.30 Gini and 0.62 balanced accuracy.

Confusion Table:
       Pred_1 Pred_0 Total Pct. Correct
True_1   1731    771  2502     0.691847
True_0  29299  36420 65719     0.554178
Accuracy: 0.5592
Balanced Accuracy: 0.6230
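
Balanced accuracy is just the average of the two per-class recall figures in the right-hand column; a quick check on the counts above:

recall_claim = 1731 / 2502        # claims correctly flagged
recall_noclaim = 36420 / 65719    # non-claims correctly passed
balanced_accuracy = (recall_claim + recall_noclaim) / 2   # about 0.623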

Now that we have a good classifier, the next step would be to combine it with a severity model.  The classifier can predict which policies will have claims – or the probability of such – and the regressor can predict the amount.  Since this is already a long article, I am going to leave the second model as an exercise.

We have seen how to make a simple machine learning model using BigQuery ML, starting from a CSV file in Google Cloud Storage, and proceeding through SQL and Python to a notebook in Vertex AI.  We also discussed AutoML, and there’s a bunch of sample code in the repo.

Provider Support for Digital Retail

A couple of press releases caught my attention last week.  The first one was APCO Acquires Strategic Diversified.  So, what’s new about that?  F&I providers have been acquiring agencies for years, in parallel with consolidation among car dealers.  What caught my attention was this, from CEO Rob Volatile: “the additional resources APCO will provide, particularly in digital retailing, will help our dealers thrive in the changing times ahead.”

“The additional resources APCO will provide, particularly in digital retailing, will help our dealers thrive in the changing times ahead.” 

By now, everyone understands that small dealer groups don’t have the resources to compete effectively in digital retail.  This includes even the mighty Larry Miller group.  As CEO Steve Starks said of the Asbury sale, “we had grown the business about as large as we could without having an over-the-top digital retail strategy.”

Product providers, agents, and finance sources must have value-enhancing digital skills – which brings me to the second press release, Assurant Unveils Omnichannel Sales Optimization Suite.  This, again, is not new.  All providers have some kind of digital outreach program.  I served on the digital retail team at Safe-Guard.  In addition to my main job of growing the API business, we provided research, content, and coaching to our clients – not unlike Assurant’s offering.

What caught my attention was the high-profile announcement including, as McKinsey recommends, senior leadership for digital transformation.  This same SVP, Martin Jenns, is quoted here in an Automotive News roundup.

How F&I Providers Can Support Digital Retail

So, what can a product provider do to support “omnichannel sales optimization?”  I asked some of my pals in digital retail.  Definitely the training and API capabilities, plus digital content.  Cited as leaders were Assurant and JM&A.

“Providers should pay specific attention to how their products will present on a digital retail platform.”

The best advice came from AutoFi’s Matt Orlando, who told me, “providers should pay specific attention to how their products will present on a digital retail platform.”  That is, instead of the (completely different) experience on a portal or a menu.

For example, a service contract may have more than one hundred combinations of coverage, term, deductible, and other options.  That doesn’t work for an online consumer.  Providers should apply some analytics, Matt said, and transmit only the most-likely rates.

Short, snappy videos are the preferred digital content.  Digital retail vendors will generally set these up on request although, in my experience, it’s better if the provider can transmit the latest content via API and in various formats.  See my REST Primer for F&I.

  • White papers – Do some research, find success stories, and write informative long-form articles. Also, promotional content like newsletters, roadmaps, infographics, and this eBook.
  • Coaching – For your non-reading clients, be prepared with live-delivery content – and people.
  • API capabilities – Invest in advanced API capabilities like real-time analytics and digital content. Push the envelope of what digital retail can present.
  • Digital content – Produce digital content as video, image, and rich text (not HTML) for each level of your product hierarchy.
  • Resource center – Make all of this available to your clients using a purpose-built microsite – and people.

I remember back when the old-timers used to say that protection products “are not bought, they’re sold.”  Well, digital F&I results now exceed those in-store, with some platforms reliably above 2.0 product index.

In the midst of an inventory shortage, dealers must sell more product on fewer vehicles – and product providers must be part of the solution.

Edtech Unicorns and JIT Training

Udemy went public last week, and PitchBook just published a note on the category, so I thought I’d write about my positive experiences with Coursera.  Online learning is segmented by subject, level, and quality of instruction.  See the research note for a complete rundown.

The edtech boom has not waned now that most schools and universities are again meeting in person. 

Coursera is oriented toward college credit and professional certification.  My instructor for neural nets, Coursera co-founder Andrew Ng, is a professor at Stanford.  They offer online degree programs in conjunction with major universities.  For example, you can earn a Master’s in Data Science through CU Boulder.

I was intrigued by that, but … I have a specific business problem to solve, and I already have grad-level coursework in statistics.  It doesn’t make sense for me to sit through STAT 561 again.  For me, the “all you can eat” plan is a better value at $50 per month.

What I need, today, is to move this code off my laptop and into the cloud.  For that, I can take the cloud deployment class.  If I run into problems with data wrangling, there’s a class for that, too.  This reminds me of that scene in The Matrix, where Trinity learns to fly a helicopter.

People can gain the skills they need, as and when they need them – not as fast as Trinity, but fast enough to keep up with evolving needs on the job.  I think this is the future of education, and 37 million students agree with me.

Network Effects in Dealer Systems

Last month, I wrote that the recent acquisitions of several Digital Retail vendors were driven by the need to accumulate dealer data for predictive analytics.  Today, I’d like to discuss another of Professor Rogers’ five themes, “network effects,” and how it applies to F&I software.

We’ll consider a hypothetical company that supplies admin software for F&I products, and also sells one or more dealer systems.  Having two distinct, but related, customer groups will allow us to explore “cross-side” network effects.

If the value of being in the network increases with the size of the network, as it often does, then there is a positive network effect.  Social networks are the model case.  The more people who are on Facebook, the more valuable Facebook is to its users (and its advertisers).

This is the textbook definition of “network effects,” but it’s only one part of what Iansiti and Lakhani call Strategic Network Analysis.  Below is a handy outline.  This article will walk through the outline using our hypothetical company – and some real ones from my experience.

Network Strategy Checklist

  1. Network effects (good) – Value grows as the square of the node count … maybe.
  2. Learning effects (good) – There is valuable data to be gleaned from the network.
  3. Clustering (bad) – You can be picked apart, one cluster at a time.
  4. Synergies (good) – Your business includes another network that talks to this one.
  5. Multihoming (bad) – Easy for customers to use multiple networks.
  6. Disintermediation (bad) – Easy for customers to go around your network.
  7. Bridging (good) – Opportunity to connect your network to others.

By the end of this article, you will understand how networking relates to the data concept from the earlier article, and how to apply it to your own software.

Speaking of vocabulary, let’s agree that “network” simply means all of the customers connected to your software, even if they aren’t connected to each other.  It will be our job to invent positive network effects for the company.

The early thinking about networks dealt with actual communication networks, where adding the nth telephone made possible n-1 new connections.  This gave rise to Metcalfe’s Law, which says that the value of a network increases with the square of its size (n phones support n(n-1)/2 possible links, which grows as n²).

Working Your Network

If you are supporting a “peer-to-peer” activity among your dealers, like Bruce Thompson’s auction platform, CarOffer, then you have Metcalfe’s Law working for you.  By the way, Bruce’s company was among those in the aforementioned wave of acquisitions.

If you are supporting a dealer-to-dealer activity, like Bruce Thompson’s auction platform, then you have Metcalfe’s Law working for you. 

Research has shown that naturally occurring networks, like Facebook, do not exhibit Metcalfe-style connectivity.  They exhibit clustering, and have far fewer than O(n²) links.  Clustering is bad – point #3, above – because it makes your network vulnerable to poaching.

Even if you don’t have network effects, per se, you can still organize learning effects using your dealers’ data.  Let’s say you have a reporting system that shows how well each dealer did on PVR last month.  Add some analytics, and you can show a dealer that, although he has improved by 10%, he is still in the bottom quintile among medium-sized Ford dealers.

That’s descriptive analytics.  To make it prescriptive, let’s say our hypothetical company also operates a menu system.  Now, we can use historical data to predict which F&I product is most likely to be sold on the next deal.  The same technique can be applied to Digital Retail, desking, choosing a vehicle, etc.

Note that data from our reporting system is now driving analytics for our menu system – and it is pooled across dealers.  Any data we can accumulate is fair game.  This is why I recently advised one of my clients to “start hoarding now” for a prospective AI project.

Cross-Side Network Effects

So far, we’ve covered points 1-3 for our hypothetical company’s dealer network.  I’ll leave their provider network as an exercise for the reader, and move on to point #4.  This is where your business serves two groups, and its value to group A increases with the size of group B.

I like to say “cross-side” because that clearly describes the structure.  Iansiti and Lakhani say “synergy.”  Another popular term is “marketplace,” as in Amazon Marketplace, which I don’t like as much because of its end-consumer connotation.

It’s hard to bootstrap a network, and it’s twice as hard to bootstrap a marketplace. 

Is there an opportunity for cross-side effects between dealers and F&I providers?  Obviously – this is the business model I devised for Provider Exchange Network ten years ago.  Back then, it was voodoo magic, but a challenger today would face serious problems.

It’s hard to bootstrap a network, and it’s twice as hard to bootstrap a marketplace.  In the early days at PEN, we had exactly one (1) dealer system, which did not attract a lot of providers.  This, in turn, did not attract a lot of dealer systems.  Kudos to Ron Greer for breaking the deadlock.

Worse, while PEN is a “pure play” marketplace, our hypothetical software company sells its own menu system.  This will deter competing menu systems from coming onboard.  I’ll take up another of Professor Rogers’ themes, “working with your competitors,” in a later post.

Finally, network effects are a “winner takes all” proposition.  Once everybody is on Facebook, it’s hard to enroll them into another network.  That’s not to say it can’t be done.  Brian Reed’s F&I Express successfully created a dealer-to-provider marketplace that parallels PEN.

This brings us to point #5, “multihoming.”  Most F&I product providers are willing to be on multiple networks.  When I was doing this job for Safe-Guard, we ran most of our traffic through PEN, but also F&I Express and Stone Eagle, plus a few standalone menu systems.

The cost of multihoming is felt more by the dealer systems, which are often small and struggle to develop multiple connections.  On the other hand, Maxim and Vision insisted on connecting to us directly.  This is point #6, “disintermediation.”

New Kinds of Traffic

Fortunately for our hypothetical company, Digital Retail is driving the need for new kinds of traffic between providers and dealer systems.  This means new transaction types or, technically, new JSON payloads.  Transmitting digital media is one I’ve encountered a few times.  Custom (AI-based) pricing is another.

Digital Retail is driving the need for new kinds of traffic between providers and dealer systems. 

Controlling software at both ends of the pipeline would allow our hypothetical company to lead the market with the new transaction types.  Key skills are the ability to manage a network and develop a compelling interface (yes, an API can be “compelling”).

As before, note that the same concepts apply for a dealer-to-lender network, like Route One.  There is even a provider-to-lender network right here in Dallas.  Two, if you count Express Recoveries.

So, now you have examples of Strategic Network Analysis from real-world F&I software.  This is one of the “formal methods” the Virag Consulting website refers to for placing your software in its strategic context.

If you’ve read this far, you are probably a practitioner yourself, and I hope this contributes to your success.  It should also advance the ongoing discussion of data and analytics in dealer systems.