What is Accuracy?

Suppose you have tested positive for a rare and fatal disease, and your doctor tells you the test is 90% accurate.  Is it time to put your affairs in order?  Fortunately, no.  “Accuracy” means different things to different people, and it’s surprisingly easy to misinterpret.

What the 90% means to your doctor is that if ten people have the disease, then the test will detect nine of them.  This is the test’s “sensitivity.”  Sensitivity is important because you want to detect as many cases as possible, for early treatment.

On the other hand, like Paul Samuelson’s joke about the stock market having predicted nine of the last five recessions, sensitivity doesn’t tell you anything about the rate of false positives.  

If you’re into machine learning, you probably noticed that sensitivity is the same as “recall.”  Data scientists use several different measures of accuracy.  For starters, we have precision, recall, naïve accuracy, and F1 score.

There are many good posts on how to measure accuracy (here’s one) but few that place it in the Bayesian context of medical testing.  My plan for this article is to briefly review the standard accuracy metrics, introduce some notation, and then connect them to the inference calculations.

Accuracy Metrics for Machine Learning

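First, here is the standard “confusion matrix” for binary classification.  It shows how test results fall into four categories: True Positives, True Negatives, False Positives, and False Negatives.  Total actual positives and negatives are P and N, while total predicted positives and negatives are P̂ and N̂.

                  Predicted positive   Predicted negative   Total
Actual positive          TP                   FN              P
Actual negative          FP                   TN              N
Total                    P̂                    N̂             P + N

Sensitivity = TP / (TP + FN) = TP / P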

These are not only definitions; they are counts, and their ratios express probabilities, such as sensitivity (TP divided by P).  This notation will come in handy later.  The standard definition of accuracy is simply the number of cases which were labeled correctly – true positives and true negatives – divided by the total population: (TP + TN) / (P + N).

Unfortunately, this simple formula breaks down when the data is imbalanced.  I care about this because I work with insurance data, which is notoriously imbalanced.  The same goes for rare diseases, like HIV infection – which afflicts roughly 0.4% of people in the U.S.  Doctors use a metric called “specificity,” the share of actual negatives that the test correctly labels negative: TN / (TN + FP).

The FP term in the denominator penalizes the model for false positives.  You can think of specificity as “recall for negatives.”  Doctors want a test with high sensitivity for screening, and then a more specific test for confirmation.  A good explainer from a medical perspective is here.

In a machine learning context, you want to optimize something called “balanced accuracy.”  This is the average of sensitivity and specificity.  For more on imbalanced data and machine learning, see my earlier post.
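To make the imbalance problem concrete, here is a minimal Python sketch of the metrics described so far.  The function name confusion_metrics and the counts are made up for illustration, and the sketch assumes there is at least one actual positive and one actual negative.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return the standard metrics for one binary confusion matrix."""
    p = tp + fn  # total actual positives
    n = fp + tn  # total actual negatives
    sensitivity = tp / p
    specificity = tn / n
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical imbalanced data: 40 positives in 1,000 cases, scored by a
# "model" that labels every single case negative.
always_negative = confusion_metrics(tp=0, fn=40, fp=0, tn=960)
for name, value in always_negative.items():
    print(f"{name}: {value:.2f}")
```

The do-nothing model scores high on plain accuracy while balanced accuracy stays at one half, which is why the balanced metric is the safer one to watch on imbalanced data.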

Bayes’ Theorem and Medical Testing

Bayes’ Theorem is a slick way to express a conditional probability in terms of its converse.  It allows us to convert “is this true given the evidence?” into “what would be the evidence if this were true?”

This kind of reasoning is obviously important for interpreting medical test results, and most people are bad at it.  I’m one of them.  I can never apply Bayesian reasoning without first making the diagram:

In this diagram, A is the set of people who have the disease and B is the set of people who have tested positive.  U is the universe of people that we’ve tested.  We have to make this stipulation because, in real life, you can’t test everyone.

We might assume that the base rate of disease in the wide world is A/U, but we only know about the people we’ve tested.  They may be self-selecting to take the test because they have risk factors, and this would lead us to overestimate the base rate.

Even within our tidy, tested universe, we can only estimate A by means of our imperfect test.  This is where some probability math comes in handy.  The true positives, people who tested positive and in fact have the disease, are the intersection of sets A and B.  Here they are, using conditional probability:

P(A ∩ B) = P(B|A) × P(A)

That is, the probability of testing positive if you’re sick, P(B|A), times the base probability of being sick, P(A).  Again, though, P(A) can be found only through inference – and medical surveillance.  Take a moment and think about how you would obtain these statistics in real life.

Mostly, you are going to watch the people who tested positive, set B, to see which ones develop symptoms.  The Bayesian framework gives you four variables to play with – five, counting the intersection set itself – so you can solve for P(A) in terms of the other ones:

P(A ∩ B) = P(A|B) × P(B)

That is, the probability of being sick if you’ve tested positive, P(A|B), times the probability of testing positive, P(B).  We know P(B) because we know how many people we’ve tested, U, and how many were positive.  Now that we’re in a position to solve for P(A) let’s bring back the other notation.

Accuracy Metrics and Bayes’ Theorem

Machine learning people use the accuracy metrics from the first section, above, while statistics people use the probability calculations from this second section.  I think it’s useful, especially given imbalanced medical (or insurance) data, to combine the two.

Now, we can rewrite the two conditional probability calculations, above, in terms of accuracy.  Set A = P, set B = P̂, and the various metrics describe how they overlap.

TP = sensitivity × P

And:

TP = precision × P̂

Giving our sick group as:

P = precision × P̂ / sensitivity

Finally, since you’re still worried about your positive test result … let’s assume the disease has a base rate of 1% – more than twice as prevalent as HIV.  Recall that we never said what the test’s specificity was.  Since the test has good sensitivity, 90%, let’s say that specificity is weak, only 50%.

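Out of every 1,000 people tested, 10 have the disease and 990 do not.  The test catches nine of the ten, and it also flags half of the 990 healthy people, or 495.  You are among 504 patients who tested positive.  Of these, only nine actually have the disease.  Your probability of being one of the nine is P(A|B).  This is the test’s precision, which works out to 1.8%.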
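The same arithmetic can be checked in a few lines of Python.  This is a sketch of the Bayes calculation using the base rate, sensitivity, and specificity assumed above; the function name is hypothetical.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """P(A|B): probability of disease given a positive test."""
    true_pos = sensitivity * base_rate                # P(B|A) * P(A)
    false_pos = (1 - specificity) * (1 - base_rate)   # P(B|not A) * P(not A)
    return true_pos / (true_pos + false_pos)


# Values assumed in the text: 1% base rate, 90% sensitivity, 50% specificity.
ppv = positive_predictive_value(base_rate=0.01, sensitivity=0.90, specificity=0.50)
print(f"Precision: {ppv:.1%}")
```

Raising specificity in the call is a quick way to see how strongly the false positives drive the result.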

Claims Prediction with BQML

Did you know you could develop a machine learning model using SQL?  Google’s cloud data warehouse, Big Query, includes SQL support for machine learning with extensions like CREATE MODEL – by analogy with SQL DDL statement CREATE TABLE.

If you’re like me, you’re probably thinking, “why on Earth would I ever use SQL for machine learning?”  Google’s argument is that a lot of data people are handy with SQL, not so much with Python, and the data is already sitting in a SQL-based warehouse.

Big Query ML features all the popular model types from classifiers to matrix factorization, including an automated model picker called Auto ML.  There’s also the advantage of cloud ML in general, which is that you don’t have to build a special rig (I built two) for GPU support.

In this article, I am going to work a simple insurance problem using BQML.  My plan is to provide an overview that will engage both the Python people and the SQL people, so that both camps will get better results from their data warehouse.

  1. Ingest data via Google Cloud Storage
  2. Transformation and modeling in Big Query
  3. Access the results from a Vertex AI notebook

By the way, I have placed much of the code in a public repo.  I love grabbing up code samples from Analytics Vidhya and Towards Data Science, so this is my way of giving back.

Case Study: French Motor Third-Party Liability Claims

We’re going to use the French car insurance data from Wüthrich, et al., 2020.  They focus on minimizing the loss function (regression loss, not insurance loss) and show that decision trees outperform linear models because they capture interaction among the variables.

There are a few ways to handle this problem.  While Wüthrich treats it as a straightforward regression problem, Lorentzen, et al. use a composition of two linear models, one for claim frequency and a second for claim severity.  As we shall see, this approach follows the structure of the data.

Lorentzen, et al. focus on the Gini index as a measure of fitness.  This is supported by Frees, and also by the Allstate challenge, although it does reduce the problem to a ranking exercise.  We are going to follow the example of Dal Pozzolo, and train a classifier to deal with the imbalance issue.

Ingesting Query Data via Google Cloud Storage

First, create a bucket in GCS and upload the two CSV files.  They’re mirrored in various places, like here.  Next, in Big Query, create a dataset with two tables, Frequency and Severity.  Finally, execute this BQ LOAD script from the Cloud Shell:

bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
french-cars:french_mtpl.Frequency \
gs://french_mtpl2/freMTPL2freq.csv

The last two lines are syntax for the table and the GCS bucket/file, respectively.  Autodetect works fine for the data types, although I’d rather have NUMERIC for Exposure.  I have included JSON schemas in the repo.

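It’s the most natural thing in the world to specify data types in JSON, storing this schema in the bucket with the data, but BQ LOAD won’t read it from there!  It accepts a schema file only from the local file system, passed as the final argument.  To utilize the schema file, either download it and pass the local path, or create and load the table manually in the browser console.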

Wüthrich specifies a number of clip levels, and Lorentzen implements them in Python.  I used SQL.  This is where we feel good about working in a data warehouse.  We have to JOIN the Severity data and GROUP BY multiple claims per policy, and SQL is the right tool for the job.

BEGIN
SET @@dataset_id = 'french_mtpl'; 
 
DROP TABLE IF EXISTS Combined;
CREATE TABLE Combined AS
SELECT F.IDpol, ClaimNb, Exposure, Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas, Density, Region, ClaimAmount
FROM
    Frequency AS F
LEFT JOIN (
  SELECT
    IDpol,
    SUM(ClaimAmount) AS ClaimAmount
  FROM
    Severity
  GROUP BY
    IDpol) AS S
ON
  F.IDpol = S.IDpol
ORDER BY
  Idpol;
 
UPDATE Combined
SET ClaimNb = 0
WHERE (ClaimAmount IS NULL AND ClaimNb >=1 );
 
UPDATE Combined
SET ClaimAmount = 0
WHERE (ClaimAmount IS NULL);
 
-- Clip the claim count at 4, the clip level used by Lorentzen
UPDATE Combined
SET ClaimNb = 4
WHERE ClaimNb > 4;
 
UPDATE Combined
SET Exposure = 1
WHERE Exposure > 1;
 
UPDATE Combined
SET ClaimAmount = 200000
WHERE ClaimAmount > 200000;
 
ALTER TABLE Combined
ADD COLUMN Premium NUMERIC;
 
UPDATE Combined
SET Premium = ClaimAmount / Exposure
WHERE TRUE;
 
END

Training a Machine Learning Model with Big Query

Like most insurance data, the French MTPL dataset is ridiculously imbalanced.  Of 678,000 policies, fewer than 4% (25,000) have claims.  This means that you can be fooled into thinking your model is 96% accurate, when it’s just predicting “no claim” every time.

We are going to deal with the imbalance by:

  • Looking at a “balanced accuracy” metric
  • Using a probability threshold
  • Using class weights

Normally, with binary classification, the model will produce probabilities P and (1-P) for positive and negative.  In Scikit, predict_proba gives the probabilities, while predict gives only the class labels – assuming a 0.50 threshold.

Since the Allstate challenge, Dal Pozzolo and others have dealt with imbalance by using a threshold other than 0.50 – “raising the bar,” so to speak, for negative cases.  Seeking the right threshold can be a pain, but Big Query supplies a handy slider.

Sliding the threshold moves your false-positive rate up and down the ROC curve, automatically updating the accuracy metrics.  Unfortunately, one of these is not balanced accuracy.  You’ll have to work that out on your own.  Aim for a model with a good, concave ROC curve, giving you room to optimize.
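Working out balanced accuracy on your own is not much code.  Here is a minimal Python sketch that scans candidate thresholds and keeps the one with the best balanced accuracy.  The function names, labels, and scores are all made up for illustration; in practice the probabilities would come from the model.

```python
def balanced_accuracy_at(threshold, labels, probs):
    """Balanced accuracy when cases with prob >= threshold are called positive."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return (tp / positives + tn / negatives) / 2


def best_threshold(labels, probs, steps=99):
    """Scan thresholds 0.01 .. 0.99 and return (threshold, balanced accuracy)."""
    candidates = [i / (steps + 1) for i in range(1, steps + 1)]
    scored = ((t, balanced_accuracy_at(t, labels, probs)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Made-up scores for ten policies, two of which had claims (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.60]

threshold, score = best_threshold(labels, probs)
print(f"best threshold {threshold:.2f}, balanced accuracy {score:.4f}")
print(f"at 0.50: {balanced_accuracy_at(0.5, labels, probs):.4f}")
```

On this toy data the default 0.50 cutoff misses one of the two claims, and a lower threshold does better, which is the effect the slider lets you explore.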

A common way to deal with imbalanced data is to oversample the minority class.  In the Scikit ecosystem, we might use random oversampling, or maybe synthetic minority oversampling (SMOTE), both from the imbalanced-learn package.  BQML doesn’t support oversampling, but we can get a similar effect using class weights.  Here’s the script:

CREATE OR REPLACE MODEL `french-cars.french_mtpl.classifier1`
    TRANSFORM (
        ML.QUANTILE_BUCKETIZE(VehAge, 10) OVER() AS VehAge,
        ML.QUANTILE_BUCKETIZE(DrivAge, 10) OVER() AS DrivAge,
        CAST (VehPower AS string) AS VehPower,
        ML.STANDARD_SCALER(Log(Density)) OVER() AS Density,
        Exposure,
        Area,
        BonusMalus,
        VehBrand,
        VehGas,
        Region,
        ClaimClass
    )
OPTIONS (
    INPUT_LABEL_COLS = ['ClaimClass'], 
    MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
    NUM_PARALLEL_TREE = 200,
    MAX_TREE_DEPTH = 4,
    TREE_METHOD = 'HIST',
    MAX_ITERATIONS = 20,
    DATA_SPLIT_METHOD = 'Random',
    DATA_SPLIT_EVAL_FRACTION = 0.10,
    CLASS_WEIGHTS = [STRUCT('NoClaim', 0.05), ('Claim', 0.95)]
    )  
AS SELECT
  Area,
  VehPower,
  VehAge,
  DrivAge,
  BonusMalus,
  VehBrand,
  VehGas,
  Density,
  Exposure,
  Region, 
  ClaimClass
FROM `french-cars.french_mtpl.Frequency`
WHERE Split = 'TRAIN'

I do some bucketizing, and CAST Vehicle Power to string, just to make the decision tree behave better.  Wüthrich showed that it only takes a few levels to capture the interaction effects.  This particular classifier achieves 0.63 balanced accuracy.  Navigate to the model’s “Evaluation” tab to see the metrics.

The OPTIONS are pretty standard.  This is XGBoost behind the scenes.  Like me, you may have used the XGB library in Python with its native API or the Scikit API.  Note how the class weights STRUCT offsets the higher frequency of the “no claim” case.
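One common heuristic for choosing class weights is to make them inversely proportional to class frequency.  The Python sketch below applies that heuristic to the approximate counts quoted earlier.  It is an illustration only, with a hypothetical function name, and not necessarily how the 0.05 and 0.95 in the script were chosen, although it lands close to them.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by 1/count, then normalize so the weights sum to 1."""
    inverse = {label: 1.0 / count for label, count in class_counts.items()}
    total = sum(inverse.values())
    return {label: value / total for label, value in inverse.items()}


# Approximate counts from the text: 678,000 policies, about 25,000 with claims.
weights = inverse_frequency_weights({"NoClaim": 653_000, "Claim": 25_000})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```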

I can’t decide if I prefer to split the test set into a separate table, or just segregate it using WHERE on the Split column.  Code for both is in the repo.  BQML definitely prefers the Split column.

There are two ways to invoke Auto ML.  One is to choose Auto ML as the model type in the SQL script, and the other is to go through the Vertex AI browser console.  In the latter case, you will want a Split column.  Running Auto ML on tabular data costs $22 per server-hour, as of this writing.  The cost of regular BQML and data storage is insignificant.  Oddly, Auto ML is cheaper for image data.

Don’t forget to include the label column in the SELECT list!  This always trips me up, because I am accustomed to thinking of it as “special” because it’s the label.  However, this is still SQL and everything must be in the SELECT list.

Making Predictions with Big Query ML

Now, we are ready to make predictions with our new model.  Here’s the code:

SELECT
    IDpol,
    predicted_ClaimClass_probs,
FROM 
    ML.PREDICT (
    MODEL `french-cars.french_mtpl.classifier1`,
    (
    SELECT
      IDpol,
      BonusMalus,
      Area,
      VehPower,
      VehAge,
      DrivAge,
      Exposure,
      VehBrand,
      VehGas,
      Density,
      Region
    FROM
      `french-cars.french_mtpl.Frequency`
    WHERE Split = 'TEST'))

The model is treated like a FROM table, with its source data in a subquery.  Note that we trained on Split = ‘TRAIN’ and now we are using TEST.  For each policy, the model returns a repeated field, which the console displays as multiple rows, giving the probability for each class:

This is a little awkward to work with.  Since we only want the claims probability, we must UNNEST it from its data structure and select prob where label is “Claim.” Support for nested and repeated data, i.e., denormalization, is typical of data warehouse systems like Big Query.

-- Assumes pred holds the output of the ML.PREDICT query above
SELECT IDpol, probs.prob 
FROM pred, 
UNNEST (predicted_ClaimClass_probs) AS probs
WHERE probs.label = "Claim"
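For the Python people, the same selection can be sketched on the client side.  This assumes the repeated field arrives in Python as a list of label/prob dictionaries per row; the helper name and the sample rows are hypothetical.

```python
def claim_probability(row, positive_label="Claim"):
    """Pick the positive-class probability out of one prediction row."""
    for entry in row["predicted_ClaimClass_probs"]:
        if entry["label"] == positive_label:
            return entry["prob"]
    return None  # label not present in this row


# Hypothetical rows, shaped like the prediction output described above.
rows = [
    {"IDpol": 1, "predicted_ClaimClass_probs": [
        {"label": "Claim", "prob": 0.62},
        {"label": "NoClaim", "prob": 0.38},
    ]},
    {"IDpol": 2, "predicted_ClaimClass_probs": [
        {"label": "Claim", "prob": 0.17},
        {"label": "NoClaim", "prob": 0.83},
    ]},
]

flat = [(row["IDpol"], claim_probability(row)) for row in rows]
print(flat)
```

Doing the UNNEST in SQL is still the better choice for large tables, since it keeps the work in the warehouse.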

Now that we know how to use the model, we can store the results in a new table, JOIN or UPDATE an existing table, etc.  All we need for the ranking exercise is the probs and the actual Claim Amount.

Working with Big Query Tables in Vertex AI

Finally, we have a task that requires Python.  We want to measure, using a Gini index, how well our model ranks claims risk.  For this, we navigate to Vertex AI, and open a Jupyter notebook.  This is the same as any other notebook, like Google Colab, except that it integrates with Big Query.

from google.cloud import bigquery
client = bigquery.Client(location="US")
sql = """SELECT * FROM `french_mtpl.Combined_Results` """ 
df = client.query(sql).to_dataframe()

The Client class allows you to run SQL against Big Query and write the results to a Pandas dataframe.  The notebook is already associated with your GCP project, so you only have to specify the dataset.  There is also a Jupyter magic cell command, %%bigquery.

Honestly, I think the hardest thing about Google Cloud Platform is just learning your way around the console.  Like, where is the “New Notebook” button?  Vertex used to be called “AI Platform,” and notebooks are under “Workbench.”

I coded my own Gini routine for the Allstate challenge, but the one from Lorentzen is better, so here it is.  Also, if you’re familiar with that contest, Allstate made us plot it upside down.  Corrado Gini would be displeased.
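For readers who want something to run, here is a minimal sketch of a Gini calculation based on the Lorenz curve.  It is written for this article as an illustration and is not the Lorentzen routine: it ignores exposure weighting, and the function names are my own assumptions.

```python
def lorenz_curve(actual, predicted):
    """Cumulative share of claims, with policies sorted from lowest to highest
    predicted risk. Returns a list starting at 0.0 and ending at 1.0."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    total = float(sum(actual))
    shares, running = [0.0], 0.0
    for i in order:
        running += actual[i]
        shares.append(running / total)
    return shares


def gini_index(actual, predicted):
    """One minus twice the area under the Lorenz curve (trapezoid rule)."""
    shares = lorenz_curve(actual, predicted)
    n = len(shares) - 1
    area = sum((shares[k] + shares[k + 1]) / 2 for k in range(n)) / n
    return 1 - 2 * area


def normalized_gini(actual, predicted):
    """Model Gini divided by the Gini of a perfect ranking."""
    return gini_index(actual, predicted) / gini_index(actual, actual)


# Toy data: four policies, one claim, and a model that ranks it highest.
actual = [0, 0, 0, 10]
predicted = [0.1, 0.2, 0.3, 0.9]
print(gini_index(actual, predicted), normalized_gini(actual, predicted))
```

A model that ranks the risky policies last pushes the curve far below the diagonal and the Gini up, while a ranking in the wrong order produces a negative value.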

The actual claims, correctly sorted, are shown by the dotted line on the chart – a lot of zeros, and then 2,500 claims.  Claims, as sorted by the model, are shown by the blue line.  The model does a respectable 0.30 Gini and 0.62 balanced accuracy.

Confusion Table:
       Pred_1 Pred_0 Total Pct. Correct
True_1   1731    771  2502     0.691847
True_0  29299  36420 65719     0.554178
Accuracy: 0.5592
Balanced Accuracy: 0.6230
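As a sanity check, the summary figures can be recomputed from the four cells of the table.  This is plain arithmetic in Python, not the code that produced the table.

```python
# Cell counts copied from the confusion table above.
tp, fn = 1731, 771       # True_1 row: Pred_1, Pred_0
fp, tn = 29299, 36420    # True_0 row: Pred_1, Pred_0

sensitivity = tp / (tp + fn)            # Pct. Correct for True_1
specificity = tn / (fp + tn)            # Pct. Correct for True_0
accuracy = (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced Accuracy: {balanced_accuracy:.4f}")
```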

Now that we have a good classifier, the next step would be to combine it with a severity model.  The classifier can predict which policies will have claims – or the probability of such – and the regressor can predict the amount.  Since this is already a long article, I am going to leave the second model as an exercise.

We have seen how to make a simple machine learning model using Big Query ML, starting from a CSV file in Google Cloud Storage, and proceeding through SQL and Python, to a notebook in Vertex AI.  We also discussed Auto ML, and there’s a bunch of sample code in the repo.

Edtech Unicorns and JIT Training

Udemy went IPO last week, and PitchBook just published a note on the category, so I thought to write about my positive experiences with Coursera.  Online learning is segmented by subject, level, and quality of instruction.  See the research note for a complete rundown.

The edtech boom has not waned now that most schools and universities are again meeting in person. 

Coursera is oriented toward college credit and professional certification.  My instructor for neural nets, Coursera co-founder Andrew Ng, is a professor at Stanford.  They offer online degree programs in conjunction with major universities.  For example, you can earn a Master’s in Data Science through CU Boulder.

I was intrigued by that, but … I have a specific business problem to solve, and I already have grad-level coursework in statistics.  It doesn’t make sense for me to sit through STAT 561 again.  For me, the “all you can eat” plan is a better value at $50 per month.

What I need, today, is to move this code off my laptop and into the cloud.  For that, I can take the cloud deployment class.  If I run into problems with data wrangling, there’s a class for that, too.  This reminds me of that scene in The Matrix, where Trinity learns to fly a helicopter.

People can gain the skills they need, as and when they need them – not as fast as Trinity, but fast enough to keep up with evolving needs on the job.  I think this is the future of education, and 37 million students agree with me.

What is “Real” AI?

Clients ask me this all the time.  They want to know if a proposed new system has the real stuff, or if it’s snake oil.  It’s a tough question, because the answer is complicated.  Even if I dictate some challenge questions, their discussion with the sales rep is likely to be inconclusive.

The bottom line is that we want to use historical data to make predictions.  Here are some things we might want to predict:

  • Is this customer going to buy a car today? (Yes/No)
  • Which protection product is he going to buy? (Choice)
  • What will be my loss ratio? (Number)

In Predictive Selling for F&I, I discussed some ways to predict product sales.  The classic example is to look at LTV and predict whether the customer will want GAP.  High LTV, more likely.  Low LTV, less likely.  With historical data and a little math, you can write a formula to determine the GAP-sale probability.
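One plausible shape for such a formula is a logistic curve.  The Python sketch below is illustrative only: the function name is invented, and the intercept and slope are made-up numbers, not coefficients fitted to any real sales data.

```python
import math


def gap_sale_probability(ltv, intercept=-6.0, slope=6.0):
    """Logistic curve for the chance of a GAP sale as a function of LTV.
    LTV is expressed as a ratio, so 1.2 means 120%.
    The default coefficients are hypothetical placeholders."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * ltv)))


for ltv in (0.8, 1.0, 1.2):
    print(f"LTV {ltv:.0%}: {gap_sale_probability(ltv):.2f}")
```

With real historical data, the two coefficients would be estimated by logistic regression rather than typed in by hand.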

What is predictive analytics?

If you’re using statistics and one variable, that’s not AI, but it is a handy predictive model just the same.  What if you’re using a bunch of variables, as with linear regression?  Regression is powerful, but it is still an analytical method.

The technical meaning of analytical is that you can solve the problem directly using math, instead of another approach like iteration or heuristics.  Back when I was designing “payment rollback” for MenuVantage, I proved it was possible to algebraically reverse our payment formulas – possible, but not practical.  It made more sense to run the calculations forward, and use iteration to solve the problem.
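To illustrate the "run it forward and iterate" idea, here is a minimal Python sketch that uses bisection to find the price that produces a target payment.  The payment formula, the tax rate, and the tiered fee are hypothetical stand-ins, not the MenuVantage formulas; the only requirement is that the payment rises with the price.

```python
def monthly_payment(price, annual_rate=0.06, months=60, tax_rate=0.07):
    """Forward calculation: price -> monthly payment (illustrative only)."""
    fee = 199.0 if price < 20_000 else 399.0    # hypothetical tiered fee
    amount_financed = price * (1 + tax_rate) + fee
    r = annual_rate / 12
    return amount_financed * r / (1 - (1 + r) ** -months)


def roll_back(target_payment, low=0.0, high=200_000.0, tol=0.005):
    """Bisection: narrow [low, high] until it brackets the price whose
    payment matches the target to within half a cent."""
    while high - low > tol:
        mid = (low + high) / 2
        if monthly_payment(mid) > target_payment:
            high = mid   # payment too high, so the price must be lower
        else:
            low = mid    # payment at or below target, so search higher
    return (low + high) / 2


target = monthly_payment(30_000)
price = roll_back(target)
print(f"payment {target:.2f} rolls back to price {price:.2f}")
```

The appeal of this approach is that the solver never needs to know what is inside the forward calculation, so new fees or taxes do not require new algebra.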

You can do simple linear regression on a calculator.  In fact, they made us do this in business school.  If you don’t believe me – HP prints the formulas on the back of their HP-12 calculator.  So, while you can make a damned good predictive model using linear regression, it’s still not AI.  It’s predictive analytics.
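For reference, the calculator-style formulas fit in a few lines of Python.  This is a sketch of ordinary least squares for one variable, with a hypothetical function name and toy data.

```python
def simple_linear_regression(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
slope, intercept = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

There is no loop that "learns" anything here: two sums and a division give the answer directly, which is what makes the method analytical.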

By the way, “analytics” is a singular noun, like “physics.”  No one ever says “physics are fun.”  Take that, spellcheck!

What is machine learning?

The distinctive feature of AI is that the system generates a predictive model that is not reachable through analysis.  It will trundle through your historical data using iteration to determine, say, the factor weights in a neural network, or the split values in a decision tree.

“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.” (commonly attributed to Arthur Samuel)

The model improves with exposure to more data (and tuning) hence Machine Learning.  This is very powerful, and will serve for a working definition of “real” AI.

AI is an umbrella term that includes Machine Learning but also algorithms, like expert systems, that don’t learn from experience.  Analytics includes statistical methods that may make good predictions, but these also do not learn.  There is nothing wrong with these techniques.

Here are some challenge questions:

  • What does your model predict?
  • What variables does it use?
  • What is the predictive model?
  • How accurate is it?

A funny thing I learned reading forums like KDnuggets is that kids today learn neural nets first, and then they learn about linear regression as the special case that can be solved analytically.

What is a neural network?

Yes, the theory is based on how neurons behave in the brain.  Image recognition, in particular, owes a lot to the ventral pathway of the visual cortex.  Researchers take this very seriously, and continue to draw inspiration from the brain.  So, this is great if your client happens to be a neuroscientist.

My client is more likely to be a technology leader, so I will explain neural nets by analogy with linear regression.  Linear regression takes a bunch of “X” variables and establishes a linear relationship among them, to predict the value of a single dependent “Y” variable.  Schematically, that looks like this:

Now suppose that instead of one linear equation, you use regression to predict eight intermediate “Z” variables, and then feed those into another linear model that predicts the original “Y.” Every link in the network has a factor weight, just as in linear regression.

Apart from some finer points (like nonlinear activation functions) you can think of a neural net as a stack of interlaced regression models.
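The stacked-regression picture can be sketched in plain Python.  Everything here is illustrative: the function names are invented, the weights are random rather than trained, and ReLU stands in for whichever activation function a real network would use.

```python
import random


def linear(inputs, weights, bias):
    """One regression equation: weighted sum of the inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def relu(z):
    """A common nonlinear activation: negative values become zero."""
    return max(0.0, z)


def tiny_network(x, hidden_layer, output_layer):
    """Predict the intermediate Z variables, then feed them to a final model."""
    z = [relu(linear(x, weights, bias)) for weights, bias in hidden_layer]
    out_weights, out_bias = output_layer
    return linear(z, out_weights, out_bias)


# Three X variables, eight Z variables, one Y. Weights are untrained.
random.seed(0)
hidden_layer = [([random.uniform(-1, 1) for _ in range(3)], 0.0) for _ in range(8)]
output_layer = ([random.uniform(-1, 1) for _ in range(8)], 0.0)
y_hat = tiny_network([0.5, -1.2, 2.0], hidden_layer, output_layer)
print(y_hat)
```

Without the activation function, the two layers would collapse into a single linear equation, which is why that finer point matters.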

You may recall that linear regression works by using partial derivatives to find the minimum of an error function parametrized by the regression coefficients.  Well, that’s exactly what the neural network training process does!
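The one difference is that a neural network usually follows those partial derivatives downhill in small steps, instead of solving for the minimum in one shot.  Here is a minimal Python sketch of that iterative process applied to a one-variable regression.  The function name, learning rate, and step count are arbitrary choices for this toy data.

```python
def fit_by_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize mean squared error for y = slope * x + intercept by
    repeatedly stepping against the partial derivatives."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
slope, intercept = fit_by_gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(round(slope, 4), round(intercept, 4))
```

The same loop, with more weights and a chain of derivatives passed back through the layers, is essentially what neural network training does.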

What is deep learning?

This brings us to one final buzzword, Deep Learning.  In general, the more layers in the stack, the more complex the relationships the neural net can learn.  Architectures with skip connections reduce the danger of overdoing it, because the model can learn to bypass redundant layers.  The popular image recognition model, ResNet152, has – you guessed it – 152 layers.

So, it’s deep.  It also sounds cool, as if the model is learning “deeply” which, technically, I suppose it is.  This is not relevant for our purposes, so ignore it unless it affects accuracy.