analytics – Virag Consulting

Applied AI for Auto Finance

Maybe it’s my “availability bias,” but AI seemed to be the theme of last week’s Auto Finance Summit. There was one dedicated session while others, like residual risk and subprime credit, had AI in the background.

The exhibit hall featured the usual AI-based businesses: Upstart, Zest, SAS, et al. In today’s post, I’ll summarize what I learned from the conference.

One panelist framed AI as “the same thing we’ve done with credit scores for the last thirty years.” While technically incorrect – no one would describe a scorecard as AI – this framing has merit. As with credit scoring, any AI model must be monitored for “drift,” and continually retrained.

Welcome to ML Ops

This brings us to MLOps or, as Experian calls it, Model Ops – and it’s not easy. Experian reports that 55% of models never make it into production. Their survey, which I have been excerpting on my Twitter feed, is filled with stats like this.

MLOps is like Dev Ops, only you have to version the data as well, and the code is guaranteed to rust.

Here is how I described MLOps to an engineering manager: Think of the work your team does to control code and manage a pipeline. MLOps is like that, only you have to version the data as well, and the code is guaranteed to rust.

There are good no-code AI tools, like Google AutoML, which I wrote about here, SageMaker, and SAS Viya. As an old-school Python coder, I was gratified to see that these tools are in the minority.

Not Just for Credit Scoring Anymore

The “same as credit scores” framing is instructive in another way, too. This is where lenders first learned the power of predictive analytics, and began to build a capability that now includes targeted marketing, fraud detection, behavioral scoring, and more.

While generative AI is an exciting and rapidly advancing technology, the other applications of AI … continue to account for the majority of the overall potential value – McKinsey

There was a “gotcha” moment in one session, where the panelists had to admit they’re not doing anything with Generative AI. But, why would they? There are at least a dozen low-hanging use cases for AI classifiers and regressors, as I mentioned in What Is Real AI?

The AI/ML Development Journey

The most practical advice came from CUDL President Brian Hamilton. Brian reminded the audience not to overlook Robotic Process Automation. This was the only reference to RPA that I heard, and it got me thinking of the broader AI context. A typical journey might look something like this:

Data Wrangling
Predictive Analytics
Process Automation
Data Engineering
Machine Learning
Model Ops
Generative AI

Gen AI might appear in an automation role, as chatbot or search engine, but – certainly for financial services, which is swimming in metrics – the early use cases will be predictive.

While McKinsey estimates the impact of Gen AI around $3 trillion per year, this is on top of $11 trillion for non-gen AI and analytics. The McKinsey study is here, and the Microsoft maturity framework is here.

Conferences like this are a great way to see what other people are doing with AI but, in the end, you must decide how best to deploy it toward your own business needs. It’s a vehicle, not a destination.

Know Your Time Series

I recently saw this chart (below) from PNAS. It’s one of those popular psych studies that asks “does having more money make people happier?” They discovered that higher annual income does indeed make people happier, but in a logarithmic relationship. This reinforces something I wrote in Sensitivity Testing Model Assumptions, namely: know your time series.

In today’s post, I’ll explain about log scales using an example from stock trading, and then circle back to the happiness data. I’ll wind up with an example from my own experience, contrasting exponential versus polynomial functions. You may also want to check out my earlier posts on seasonal adjustment, and Bayesian probability.

“When interpreting these results, it bears repeating that well-being rose approximately linearly with log(income), not raw income”

Reported “life satisfaction” and “well-being” increase linearly with log income, not straight income. That is, the next notch up in happiness requires an order of magnitude more money. Previous studies had found a plateau around $75,000, with little or no increase in happiness after that.

So, this is a new finding – or is it? Take another look at that income scale. If that were a straight scale, the chart would show diminishing returns from additional income. To enjoy steadily increasing happiness, you have to earn exponentially increasing income.

Log Scaling for Stock Charts

Here is a chart of Carvana during 2018, when the stock was rising rapidly. Comparing the tiny candles in March with the longer ones in June and beyond, you might conclude that the stock had become more volatile. The average daily trading range increased from one dollar to three over the period.

But a dollar when the stock is at $20 is not the same as a dollar when the stock is at $60. To have the candles represent percentage change, you must set the price scale to logarithmic. See, in the chart below, how the price intervals get closer together as they proceed up the scale.

Whenever a stock chart covers a wide price range, you’re better off using a logarithmic scale. You may recall from school that adding logs is the same as multiplying the numbers. So, a linear scale shows additive change, and a log scale shows relative change.

log ab = log a + log b

Take another look at the first Carvana chart above. Stock traders call that “going parabolic.” Parabolic growth, also known as “quadratic,” is another rapid growth trend, easily confused with exponential growth. I did a quick regression analysis, and both models fit the Carvana data pretty well.

Pro tip: Never use “exponential” to describe something that’s not a time series. Some people seem to think it just means “big,” as in “last month was exponential!”

The point to “know your time series” is to understand the mechanisms underlying your data. Exponential growth comes from compounding, like if you increase sales by ten percent, and then you increase the new, higher, base by another ten percent – and you keep doing that.

Steadily Increasing Happiness

I’ll provide an example of quadratic growth later, but first let’s finish up the PNAS chart. I think of this as a time series because I picture someone earning steadily more income over their career (the data is actually different people at different income levels).

When I say “steadily more income,” I mean exponentially. Note that each tick mark on the PNAS income scale doubles the value. This is a log scale, like the Carvana price scale, above.

Many real-world metrics are based on log scales, like decibels and the Richter scale. An earthquake of magnitude 6.0 on the Richter scale is ten times as powerful as a 5.0.

The chart below shows what this blessed career looks like. If you start making $15,000 at age 18 and double your salary every six years or so, then you will experience steadily increasing “well-being.” My red line is the same red line as in the PNAS chart.

I think showing the data as someone’s career is a good way to tell the story. Income and well-being are shown together, with straight scales, and mediated by the hypothetical age. On the other hand, the correlation has disappeared. To show that, we must apply a log scale to the income series:

This is why the authors make clear that, to enjoy steadily increasing (linear) happiness, you must earn steadily increasing (exponential) income. To put it another way, if you only earn increasing (linear) income, then you will have only increasing (log) happiness.

Many real-world metrics are based on log scales, like decibels and the Richter scale. An earthquake of magnitude 6.0 on the Richter scale is ten times as powerful as a 5.0.

Polynomial versus Exponential Functions

Once upon a time, way back when databases had size constraints, I observed that parabolic growth in BMW Financial lease transactions would pose a danger to the database. I ran a regression analysis, calculated when the database would fail, and sent a memo to my boss.

I also worked out a mitigation strategy, but let’s stick with, “the database will blow up on April 21,” for dramatic effect. Instantly my office filled up with expensive auditors and consultants.

“No, it’s not a malfunction.”

“No, it’s not growing exponentially.”

Lease transactions were growing quadratically, which is why I chose this example. If new leases are steady at 1,200 per month, that’s a flat line. Total rows in the lease table will thus increase by 1,200 per month. That’s a sloping line.

Now, if each lease generates roughly two transactions per month, then total transactions will be a parabola. Readers with a little calculus will recognize this as integrating, twice, from the constant rate of new leases to the second-order rate of transactions.

This is the essence of “know your time series.” The regression analysis showed quadratic growth, unequivocally, and it was also supported by how we expected the data to behave.

The auditors milled around for a while, charged us about a million bucks, and decided I was panicking over nothing because the database wasn’t really going to blow up until April 29. Not to mention my mitigation strategy.

Big O Notation

In the example above, I showed that a linear function of a linear function is a parabola, also known as a “quadratic” or second-order polynomial. Compose another linear function on top of that, and it’s a third-order polynomial.

Excel has a handy feature, shown below, for fitting polynomial functions of different order. This is a little dangerous, because you can easily find a fit without thinking about it. It’s not enough to get a good R-Square (fit) value. You must understand why the data behaves as it does.

This “order” thing is sufficiently important to data analysts that they have a notation for it, called “Big O.” The quadratic example we just worked through would be O(n²) or “order n-squared.” I notice that Excel will also fit a Power Law, or “Pareto” series. That will have to wait for a later post.

Data Lakes Explained

Last month, I wrote an explainer on AI and it was well-received, so here is one on data lakes. If you already know the concepts, you may still find this framing helpful in client discussions. Our audience this time is the CFO, or maybe the CMO, and our motivation is that their analytical needs are not well-served by the transactional database.

Transactional Processing with a Relational Database

The data that runs your business – most of it, anyway – is probably stored in a relational database like Microsoft’s venerable SQL Server. Without going into details about the “relational” structure, the key is that this database is optimized for the daily operations of the business.

New policies are booked, premiums collected, and claims paid. These are transactions that add, change, or delete records. There are also “read only” operations, like producing invoices, but the database is designed primarily for transaction processing.

A well-designed transactional database will resist anomalies

A well-designed transactional database will resist anomalies, like a line item with no invoice, or two sales of the same item. The database designer will have used a technique called normalization, breaking the data up into smallish tables with relationships that enforce integrity.

Think of how your chart of accounts is organized. Everything you need to account for is broken down to the lowest relevant level, and then rolled up for reporting. Every journal entry hits two accounts, debit and credit, so that they’re kept in balance. Your meticulously normalized database is kind of like that.

When a customer places an order, a row is added to the Order table. You don’t need to open the Customer table unless there’s a change to the customer. Built around these normalized tables is the machinery of indexes, clusters, and triggers, which support speed and integrity.

Pro Tip: Take time to confirm that the transactional database is stable and supporting the business satisfactorily. You don’t want to start building pipelines and then discover there’s a problem with your data source.

Analytical Processing with a Data Warehouse

Transaction processing involves adding and changing data, with carefully limited scope. Analytical processing, by contrast, is mostly reading data – not changing it – and holistic in scope. To support this, the data must be copied into a separate database and denormalized.

Let’s say you want to know whether Dent protection sells better as a standalone product, or as part of a bundle – corrected for the number of dealers who don’t offer the bundle, and segmented by the vehicle’s make and price range.

You could run this query against the transactional database, but it would be difficult. The query is complicated enough without having to piece together data from multiple tables. The normalization which served so well for transaction processing is now an obstacle.

Confession: I am a normalization bigot. I bought C.J. Date’s textbook, read the original papers in the ACM journal, and even coded Bernstein’s algorithm. To me, organized data is normalized data, and de-normalizing is like leaving your clothes on the floor.

So, this is a good guide to denormalization. Everything we learned not to do in relational databases – wide tables, nested data, repeating groups – is useful here.

Analytical data is stored in cubes, stars, snowflakes, hearts, and clovers

Analytical work requires not only a new database design, but a new database system. Out goes SQL Server and now we have Big Query, Redshift, and Snowflake. You may hear this buzzword, OLAP, which means “online analytical processing.” This concept was invented for marketing purposes, to describe the new category of software.

Analytical data is stored in cubes, stars, snowflakes, hearts, and clovers (see sidebar). Just kidding about the hearts and clovers. Also, while your transactional database may be running SQL Server “on premise,” the analytical database will almost certainly be on a cloud service from Amazon, Microsoft, or Google.

To be honest, not everyone needs an OLAP database. As CIO for BMW Financial Services, I did not recommend one because our analytical workload was small, at the time, and could be served adequately without a lot of new gear and expensive consultants. Since then, I have gone over to the side of the consultants.

Sidebar: What’s an OLAP Cube?

In the early days of analytical processing, software vendors thought it would be a good idea to use a multidimensional data structure called a hypercube. Think of a typical spreadsheet, with rows representing an income statement and one column for each month. That’s two dimensions. Now, add a stack of spreadsheets, one for each region. That makes three dimensions, like a cube. I put myself through grad school working at Comshare, one of the first OLAP software vendors. It supported seven dimensions. That’s a hypercube. Nowadays, there are better data structures, and this leads to some confusion. Older analysts may assume that if they’re doing OLAP, then they must be using a cube. They may use the term “OLAP cube” to mean any analytical database, even though cubes have largely been replaced by newer structures.

Pooling Data in a Data Lake

You can think of the data lake as a way station between the transactional database and the data warehouse. We want to collect all the data into a common repository before loading it into the data warehouse.

Why not simply extract, transform, and load data straight from the transactional database? Well, we could, but it would be brittle. Any change on either side would require an update to the pipeline. The data lake decouples the OLTP and OLAP data stores.

The data lake serves the very important function of storing all the data, in whatever format, whether or not it’s amenable to organization. The term’s originator, James Dixon, wanted to suggest a large volume of data with no preconceived organization.

The key thing is to collect all the data in one place, and think about organization later. This calls for an “object data store,” like Google Cloud Storage. GCP and AWS both use “buckets.” You get the idea – this is where you leave your clothes on the floor.

Most of your data will indeed be structured data coming from the transactional database, and on its way into the OLAP database – but not all of it. Here are some real-life examples I have encountered:

- Logs of API traffic. Details of who is using our ecommerce API, including copies of the payload for each request and response.

- Text snippets. A file of the several paragraphs that make our standard Texas contract different from the one in Wisconsin, so that we can produce new contracts automatically. Same goes for product copy on the web site.

- Telephone metadata. A list of timestamps, durations, phone numbers, and extensions for all calls in the call center, both inbound and outbound.

These examples are better served by special-purpose databases like Hadoop, Bigtable, and Mongo. It’s best to take stock of all the data your analysts might need, broadly speaking, and start collecting it before you go too far with designing the OLAP database.

What is Accuracy?

Suppose you have tested positive for a rare and fatal disease, and your doctor tells you the test is 90% accurate. Is it time to put your affairs in order? Fortunately, no. “Accuracy” means different things to different people, and it’s surprisingly easy to misinterpret.

What the 90% means to your doctor is that if ten people have the disease, then the test will detect nine of them. This is the test’s “sensitivity.” Sensitivity is important because you want to detect as many cases as possible, for early treatment.

On the other hand, like Paul Samuelson’s joke about the stock market having predicted nine of the last five recessions, sensitivity doesn’t tell you anything about the rate of false positives.

If you’re into machine learning, you probably noticed that sensitivity is the same as “recall.” Data scientists use several different measures of accuracy. For starters, we have precision, recall, naïve accuracy, and F1 score.

There are many good posts on how to measure accuracy (here’s one) but few that place it in the Bayesian context of medical testing. My plan for this article is to briefly review the standard accuracy metrics, introduce some notation, and then connect them to the inference calculations.

Accuracy Metrics for Machine Learning

First, here is the standard “confusion matrix” for binary classification. It shows how test results fall into four categories: True Positives, True Negatives, False Positives, and False Negatives. Total actual positives and negatives are P and N, while total predicted are P̂ and N̂.

These are not only definitions, they’re numbers that express probabilities like the sensitivity formula, above. This notation will come in handy later. The standard definition of accuracy is simply the number of cases which were labeled correctly – true positives and true negatives – divided by the total population.

Unfortunately, this simple formula breaks down when the data is imbalanced. I care about this because I work with insurance data, which is notoriously imbalanced. The same goes for rare diseases, like HIV infection – which afflicts roughly 0.4% of people in the U.S. Doctors use a metric called “specificity.”

The FP term in the denominator penalizes the model for false positives. You can think of specificity as “recall for negatives.” Doctors want a test with high sensitivity for screening, and then a more specific test for confirmation. A good explainer from a medical perspective is here.

In a machine learning context, you want to optimize something called “balanced accuracy.” This is the average of sensitivity and specificity. For more on imbalanced data and machine learning, see my earlier post.

Bayes Theorem and Medical Testing

Bayes’ Theorem is a slick way to express a conditional probability in terms of its converse. It allows us to convert “is this true given the evidence?” into “what would be the evidence if this were true?”

This kind of reasoning is obviously important for interpreting medical test results, and most people are bad at it. I’m one of them. I can never apply Bayesian reasoning without first making the diagram:

In this diagram, A is the set of people who have the disease and B is the set of people who have tested positive. U is the universe of people that we’ve tested. We have to make this stipulation because, in real life, you can’t test everyone.

We might assume that the base rate of disease in the wide world is A/U, but we only know about the people we’ve tested. They may be self-selecting to take the test because they have risk factors, and this would lead us to overestimate the base rate.

Even within our tidy, tested universe, we can only estimate A by means of our imperfect test. This is where some probability math comes in handy. The true positives, people who tested positive and in fact have the disease, are the intersection of sets A and B. Here they are, using conditional probability:

That is, the probability of testing positive if you’re sick, P(B|A), times the base probability of being sick, P(A). Again, though, P(A) can be found only through inference – and medical surveillance. Take a moment and think about how you would obtain these statistics in real life.

Mostly, you are going to watch the people who tested positive, set B, to see which ones develop symptoms. The Bayesian framework gives you four variables to play with – five, counting the intersection set itself – so you can solve for P(A) in terms of the other ones:

That is, the probability of being sick if you’ve tested positive, P(A|B), times the probability of testing positive, P(B). We know P(B) because we know how many people we’ve tested, U, and how many were positive. Now that we’re in a position to solve for P(A) let’s bring back the other notation.

Accuracy Metrics and Bayes Theorem

Machine learning people use the accuracy metrics from the first section, above, while statistics people use the probability calculations from this second section. I think it’s useful, especially given imbalanced medical (or insurance) data, to combine the two.

Now, we can rewrite the two conditional probability calculations, above, in terms of accuracy. Set A = P, set B = P̂ , and the various metrics describe how they overlap.

And:

Giving our sick group as:

Finally, since you’re still worried about your positive test result … let’s assume the disease has a base rate of 1% – twice as virulent as HIV. Recall that we never said what the test’s specificity was. Since the test has good sensitivity, 90%, let’s say that specificity is weak, only 50%.

You are among 504 patients who tested positive. Of these, only nine actually have the disease. Your probability of being one of the nine is P(A|B). This is the test’s precision, which works out to 1.8%.