Data Lakes Explained

Last month, I wrote an explainer on AI and it was well-received, so here is one on data lakes.  If you already know the concepts, you may still find this framing helpful in client discussions.  Our audience this time is the CFO, or maybe the CMO, and our motivation is that their analytical needs are not well-served by the transactional database.

Transactional Processing with a Relational Database

The data that runs your business – most of it, anyway – is probably stored in a relational database like Microsoft’s venerable SQL Server.  Without going into details about the “relational” structure, the key is that this database is optimized for the daily operations of the business.

New policies are booked, premiums collected, and claims paid.  These are transactions that add, change, or delete records.  There are also “read only” operations, like producing invoices, but the database is designed primarily for transaction processing.

A well-designed transactional database will resist anomalies, like a line item with no invoice, or two sales of the same item.  The database designer will have used a technique called normalization, breaking the data up into smallish tables with relationships that enforce integrity.

Think of how your chart of accounts is organized.  Everything you need to account for is broken down to the lowest relevant level, and then rolled up for reporting.  Every journal entry hits two accounts, debit and credit, so that they’re kept in balance.  Your meticulously normalized database is kind of like that.

When a customer places an order, a row is added to the Order table.  You don’t need to open the Customer table unless there’s a change to the customer.  Built around these normalized tables is the machinery of indexes, clusters, and triggers, which support speed and integrity.
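
To make this concrete, here is a minimal sketch in Python using the built-in sqlite3 module.  The table and column names are invented for illustration; the point is that the foreign key lets the database itself refuse an order for a customer who doesn’t exist:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked

    # Two normalized tables: customer details live in exactly one place
    conn.execute("""CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL)""")
    conn.execute("""CREATE TABLE "Order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
        order_date  TEXT NOT NULL)""")

    conn.execute("INSERT INTO Customer VALUES (1, 'Acme Motors')")
    # Adding an order touches only the Order table...
    conn.execute("""INSERT INTO "Order" VALUES (100, 1, '2019-06-01')""")
    # ...and an order for a nonexistent customer is rejected outright
    try:
        conn.execute("""INSERT INTO "Order" VALUES (101, 99, '2019-06-02')""")
    except sqlite3.IntegrityError as err:
        print("Rejected:", err)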

Pro Tip: Take time to confirm that the transactional database is stable and supporting the business satisfactorily.  You don’t want to start building pipelines and then discover there’s a problem with your data source.

Analytical Processing with a Data Warehouse

Transaction processing involves adding and changing data, with carefully limited scope.  Analytical processing, by contrast, is mostly reading data – not changing it – and holistic in scope.  To support this, the data must be copied into a separate database and denormalized.

Let’s say you want to know whether Dent protection sells better as a standalone product, or as part of a bundle – corrected for the number of dealers who don’t offer the bundle, and segmented by the vehicle’s make and price range.

You could run this query against the transactional database, but it would be difficult.  The query is complicated enough without having to piece together data from multiple tables.  The normalization which served so well for transaction processing is now an obstacle.

Confession: I am a normalization bigot.  I bought C.J. Date’s textbook, read the original papers in the ACM journal, and even coded Bernstein’s algorithm.  To me, organized data is normalized data, and de-normalizing is like leaving your clothes on the floor.

So, here is a good guide to denormalization: everything we learned not to do in relational databases – wide tables, nested data, repeating groups – is useful here.
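
As a small illustration of what that means in practice (the tables and numbers are invented), here is order data flattened into one wide table with pandas – customer attributes repeated on every row, exactly the redundancy normalization forbids:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "state": ["TX", "WI"]})
    orders = pd.DataFrame({"order_id": [100, 101, 102],
                           "customer_id": [1, 1, 2],
                           "product": ["Dent", "Bundle", "Dent"],
                           "amount": [495, 1295, 495]})

    # One row per order, with the customer columns copied onto each row.
    # Redundant, yes -- but a single flat table is what analytical queries want.
    wide = orders.merge(customers, on="customer_id")
    print(wide.groupby(["state", "product"])["amount"].sum())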

Analytical work requires not only a new database design, but a new database system.  Out goes SQL Server, and in come BigQuery, Redshift, and Snowflake.  You may hear the buzzword OLAP, which stands for “online analytical processing.”  The term was coined for marketing purposes, to describe the new category of software.

Analytical data is stored in cubes, stars, snowflakes, hearts, and clovers (see sidebar).  Just kidding about the hearts and clovers.  Also, while your transactional database may be running SQL Server “on premise,” the analytical database will almost certainly be on a cloud service from Amazon, Microsoft, or Google.

To be honest, not everyone needs an OLAP database.  As CIO for BMW Financial Services, I did not recommend one because our analytical workload was small, at the time, and could be served adequately without a lot of new gear and expensive consultants.  Since then, I have gone over to the side of the consultants.

Sidebar: What’s an OLAP Cube?

In the early days of analytical processing, software vendors thought it would be a good idea to use a multidimensional data structure called a hypercube. Think of a typical spreadsheet, with rows representing an income statement and one column for each month. That’s two dimensions. Now, add a stack of spreadsheets, one for each region. That makes three dimensions, like a cube. I put myself through grad school working at Comshare, one of the first OLAP software vendors. It supported seven dimensions. That’s a hypercube. Nowadays, there are better data structures, and this leads to some confusion. Older analysts may assume that if they’re doing OLAP, then they must be using a cube. They may use the term “OLAP cube” to mean any analytical database, even though cubes have largely been replaced by newer structures.
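
For intuition, here is the cube idea in Python with NumPy – three hypothetical dimensions (accounts, months, regions) and made-up numbers, with roll-ups as sums over axes:

    import numpy as np

    accounts, months, regions = 5, 12, 3          # invented sizes
    cube = np.random.rand(accounts, months, regions) * 1000

    by_month = cube.sum(axis=(0, 2))     # income statement rolled up across regions
    by_region = cube.sum(axis=(0, 1))    # full-year total for each region
    print(by_month.shape, by_region.shape)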

Pooling Data in a Data Lake

You can think of the data lake as a way station between the transactional database and the data warehouse.  We want to collect all the data into a common repository before loading it into the data warehouse.

Why not simply extract, transform, and load data straight from the transactional database?  Well, we could, but it would be brittle.  Any change on either side would require an update to the pipeline.  The data lake decouples the OLTP and OLAP data stores.

The data lake serves the very important function of storing all the data, in whatever format, whether or not it’s amenable to organization.  The term’s originator, James Dixon, wanted to suggest a large volume of data with no preconceived organization.

The key thing is to collect all the data in one place, and think about organization later.  This calls for an “object data store,” like Google Cloud Storage.  GCP and AWS both use “buckets.”  You get the idea – this is where you leave your clothes on the floor.
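
A minimal sketch of what “leaving your clothes on the floor” looks like with the Google Cloud Storage client library – the bucket and file names are placeholders, and I’m assuming credentials are already configured:

    from google.cloud import storage   # pip install google-cloud-storage

    client = storage.Client()                    # picks up your default GCP credentials
    bucket = client.bucket("my-data-lake")       # placeholder bucket name

    # Raw files go in as-is -- API logs, contract text, call metadata --
    # with no schema and no transformation.  Organization comes later.
    blob = bucket.blob("raw/api-logs/2019-06-01.json")
    blob.upload_from_filename("api-logs-2019-06-01.json")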

Most of your data will indeed be structured data coming from the transactional database, and on its way into the OLAP database – but not all of it.  Here are some real-life examples I have encountered:

    • Logs of API traffic. Details of who is using our ecommerce API, including copies of the payload for each request and response.
    • Text snippets. A file of the several paragraphs that make our standard Texas contract different from the one in Wisconsin, so that we can produce new contracts automatically.  Same goes for product copy on the web site.
    • Telephone metadata. A list of timestamps, durations, phone numbers, and extensions for all calls in the call center, both inbound and outbound.

These examples are better served by special-purpose databases like Hadoop, Bigtable, and Mongo.  It’s best to take stock of all the data your analysts might need, broadly speaking, and start collecting it before you go too far with designing the OLAP database.

Speculation on a Fractal Programming Language

We flew east out of Panama City, and I looked down on the faceted green hills of the Cordillera de San Blas, perhaps for the last time.  In the sky were statistically similar puffs of white cumulus cloud.  Naturally, I was thinking of fractals.

I had spent the past few days coding technical analysis indicators in Python, which reminded me of coding same in C#.  This, in turn, reminded me that although the TA community talks a lot about geometric repetition, we have yet to invent a single fractal indicator, much less a trading strategy.

I write my trading strategies in C# on the MultiCharts platform.  Its procedures for time series data look a lot like the vector-oriented syntax of Python.  Here is how to do Bollinger bands in each:

  • StandardDeviationCustom(length, devs)
  • df[price].rolling(length).std() * devs
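
The one-liner above gives only the deviation term.  For completeness, here is a fuller pandas sketch of the classic bands – a moving average plus or minus a multiple of the rolling standard deviation – where the “close” column name is my assumption:

    import pandas as pd

    def bollinger(df, length=20, devs=2.0, price="close"):
        # Middle band is a simple moving average; upper and lower bands
        # sit devs standard deviations above and below it.
        mid = df[price].rolling(length).mean()
        sd = df[price].rolling(length).std()
        return pd.DataFrame({"middle": mid,
                             "upper": mid + devs * sd,
                             "lower": mid - devs * sd})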

I have to admit not having much intuition about vector operations.  Matrices and summations just look like for loops to me – clearly an obstacle to the proper appreciation of Python.  I have worked with SAS and SYSTAT, though, so Python at the command prompt seems natural.

What I noticed with the Python exercise is that the classic TA indicators were designed with an iterative mindset, reflecting the programming languages of the day – Sapir’s theory, again – and so I imagine that the reason we have no fractal indicators is that our language can’t express them.

Here are some basic things I would expect from a fractal-oriented programming language:

  • Create a dataset from a generator function
  • Derive fractal metrics, like the Hausdorff dimension
  • Compare two datasets for statistical similarity
  • Compare a dataset to subsets of itself
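
I can at least gesture at the second item with ordinary Python.  Below is a rough box-counting estimate of fractal dimension (a computable stand-in for the Hausdorff dimension); the scales and the random-walk example are arbitrary choices of mine:

    import numpy as np

    def box_counting_dimension(series, scales=(2, 4, 8, 16, 32, 64)):
        # Normalize the series onto the unit square, then count how many
        # grid boxes its graph touches at each scale.  The slope of
        # log(count) versus log(scale) estimates the fractal dimension.
        x = np.linspace(0, 1, len(series))
        y = (series - series.min()) / (series.max() - series.min() + 1e-12)
        counts = []
        for n in scales:
            boxes = set(zip((x * n).astype(int), (y * n).astype(int)))
            counts.append(len(boxes))
        slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
        return slope

    walk = np.cumsum(np.random.randn(10_000))   # a random walk lands near 1.5
    print(box_counting_dimension(walk))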

Admittedly, I have only a cursory notion of how this would work.  That’s why this piece has “speculation” in the title.  Meanwhile, I will continue plugging away in C# and Python.

Predictive Selling in F&I

We have all seen how Amazon uses predictive selling, and now this approach is finding its way into our industry.  In this article I compare and contrast different implementations, and discuss how the technique may be better suited to online than to the F&I suite.

If you read Tom Clancy, you might like Lee Child.  If you bought a circular saw, you might need safety goggles.  To draw these inferences, Amazon scans for products that frequently occur together in the order histories of its customers.  You can imagine that, given their volume of business, Amazon can fine-tune the results by timeframe, department, price, and so on.
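
The mechanics are simple enough to sketch in a few lines of Python – the order histories here are invented, and a real system would also weight by timeframe, department, and so on:

    from itertools import combinations
    from collections import Counter

    orders = [                                   # one basket of products per customer
        {"circular saw", "safety goggles", "work gloves"},
        {"circular saw", "safety goggles"},
        {"Tom Clancy", "Lee Child"},
        {"Tom Clancy", "Lee Child", "safety goggles"},
    ]

    # Count how often each pair of products appears in the same history
    pairs = Counter()
    for basket in orders:
        pairs.update(combinations(sorted(basket), 2))

    # The most frequent pairs drive the "you might also like" suggestions
    for pair, count in pairs.most_common(3):
        print(pair, count)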

The effectiveness of predictive selling depends on two things: the strength of your algorithms, and the depth of your database.  Automotive Mastermind claims to use “thousands of data points,” mined from the DMS, social media, and credit bureaus.  An online auto retailer or platform site (see my taxonomy here) will also have data about which web pages the customer viewed.  Your typical F&I menu is lucky if it can read data from the DMS.

The face of predictive selling in F&I is the automated interview.  We all know the standard questions:

  • How long do you plan on keeping the car?
  • How far do you drive to work?
  • Do you park the car in a garage?
  • Do you drive on a gravel road?
  • Do you transport children or pets?

A system that emulates the behavior of an expert interviewer is called, appropriately, an “expert system.”  I alluded to expert systems for F&I here, in 2015, having proposed one for a client around the same time.  This is where we can begin to make some distinctions.

Rather than a set of canned questions, a proper expert system includes a “rules editor” wherein the administrator can add new questions, and an “inference engine” that collates the results.  Of course, the best questions are those you can answer from deal data, and not have to impose on the customer.
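
To show the distinction in miniature, here is a bare-bones sketch of a rule set and an inference engine in Python.  It is not any vendor’s design – the rules, fields, and products are invented:

    # Each rule inspects the deal (including interview answers already on file)
    # and nominates products.  An administrator's "rules editor" would maintain
    # this list; the inference engine just collates whatever fires.
    rules = [
        {"name": "long keeper",
         "test": lambda deal: deal.get("years_keeping", 0) > 5,
         "recommend": ["service contract"]},
        {"name": "gravel road",
         "test": lambda deal: deal.get("gravel_road", False),
         "recommend": ["windshield protection"]},
    ]

    def infer(deal):
        recommendations = []
        for rule in rules:
            if rule["test"](deal):
                recommendations.extend(rule["recommend"])
        return recommendations

    print(infer({"years_keeping": 7, "gravel_road": True}))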

A data scientist may mine the data for buying patterns, an approach known as “analytics,” or she may have a system to mine the data automatically, an approach known as “machine learning.”  You know you have good analytics when the system turns up an original and unanticipated buying pattern.  Maybe, for example, customers are more or less likely to buy appearance protection based on the color of their vehicle.

At the most basic level, predictive selling is about statistical inference.  Let’s say your data mining tells you that, of customers planning to keep the car more than five years, 75% have bought a service contract.  You may infer that the next such customer is 75% likely to follow suit, which makes the service contract a better pitch than some other product with a 60% track record.  One statistic per product hardly rises to the level of “analytics,” but it’s better than nothing.
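
That one-statistic-per-product inference is easy to picture in code.  The deal history below is invented; the point is just the conditional attach rate:

    import pandas as pd

    deals = pd.DataFrame({
        "keep_over_5yrs":   [True, True, True, True, False, False, False, True],
        "service_contract": [True, True, True, False, False, True, False, True],
    })

    # Attach rate among customers who plan to keep the car more than five years
    long_keepers = deals[deals["keep_over_5yrs"]]
    print(long_keepers["service_contract"].mean())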

Another thing to look at is the size of the database.  If our 75% rule for service contract is based on hundreds of deals, it’s probably pretty accurate.  If it’s based on thousands of deals, that’s better.  Our humble data scientist won’t see many used, leased, beige minivans unless she has “big data.”  Here is where a dealer group that can pool data across many stores, or an online selling site, has an advantage.

If you are implementing such a system, you not only have a challenge getting enough data, you also have to worry about contaminating the data you’ve got.  You see, pace Werner Heisenberg, using the data also changes the data.  According to research from IHS, customers don’t arrive in F&I already familiar with the products – what they buy depends largely on what they’re shown.

Consider our service contract example.  Your statistics tell you to present it only for customers keeping their vehicle more than five years.  That now becomes a self-fulfilling prophecy.  Going forward, your database will fill up with service contract customers who meet that criterion because you never show it to anyone else.

You can never know when a customer is going to buy some random product.  This is why F&I trainers tell you to “present every product to every customer, every time.”  There is a technical fix, which is to segregate your sample data (also known as “training data” for machine learning) from your result data.  The system must flag deals where prediction was used to restrict the presentation, and never use these deals for statistics.
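
In practice the fix can be as simple as a flag on the deal record.  A sketch, with invented fields:

    import pandas as pd

    deals = pd.DataFrame({
        "deal_id":               [1, 2, 3, 4],
        "prediction_restricted": [False, True, False, True],
        "service_contract":      [True, False, True, False],
    })

    # Only deals where every product was presented go into the training pool;
    # the restricted deals would bake the system's own choices back into its statistics.
    training = deals[~deals["prediction_restricted"]]
    print(training["service_contract"].mean())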

Doesn’t that mean you’ll run out of raw data?  It might, if you don’t have a rich supply.  One way to maintain fresh training data is periodically to abandon prediction, show all products, let the F&I manager do his job, and then put that deal into the pool of training data.

In one implementation, customers complete a thinly disguised “survey” while they’re waiting on F&I, and the software uses their responses to discern which products to offer and which ones the customer is most likely to buy.

Regulatory compliance is another reason F&I trainers tell you to present every product every time.  Try telling the CFPB that “my statistics told me not to present GAP on this deal.”  There’s not a technical fix for that.

One motivation for the interview approach, versus a four-column menu, is that it’s better suited to form factors like mobile and chat.  This is a strong inducement for the online selling sites.  In the F&I suite, however, the arguments are not as strong.  Trainers are uniformly against the idea that you can simply hand over the iPad and let it do the job for you.

No, I have not gone over to the Luddites.  This article offers advice to people developing (or evaluating) predictive selling systems, and most of the advantages accrue to the online people.  The “home court advantage” in the F&I suite is that you can do a four-column menu, and there is a professional there to present it.

New Consolidation Stats from NADA

I chose consolidation for the first of my megatrends series, because it’s the least controversial.  Everyone seems to know it’s happening, and the records and rankings in Automotive News are dominated by big groups.

Ten years down the road, we don’t want to be the 13-point dealership group feeling that pain from the larger groups the way the smaller ones are now

This year, for the first time, NADA Data takes a look at consolidation.  Probably the best single number to look at is the ratio of rooftops to dealers, which represents the average number of stores in a dealer group.  This has grown from 1.8 to 2.2 over the last nine years – not exactly a revolution.  I was a little surprised to see such small numbers, but this is an artifact of how NADA presents the data.

NADA, logically enough, presents the number of dealers owning a group of a given size.  I would have preferred to see the number of stores, not owners, in each category.  This is a better reflection of the market coverage.  To show the distinction, I plotted the total count of both rooftops and owners.  You can see that, while the number of rooftops is recovering since 2010, the number of dealers is not.

Next, I recast the data in terms of rooftops.  The share of rooftops belonging to groups of ten or more has almost doubled over the period, from 12.2% to 21.3%.
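
The recasting itself is just arithmetic – multiply the number of owners in each size category by the size of their group.  The owner counts below are invented, purely to show the calculation, not NADA’s figures:

    # Owners by (representative) group size -- invented numbers
    owners_by_group_size = {1: 4000, 2: 900, 5: 300, 10: 80, 25: 30}

    rooftops_by_group_size = {size: size * owners
                              for size, owners in owners_by_group_size.items()}

    total = sum(rooftops_by_group_size.values())
    big_groups = sum(r for size, r in rooftops_by_group_size.items() if size >= 10)
    print(f"{big_groups / total:.1%} of rooftops belong to groups of ten or more")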

Below, I have plotted the number of rooftops in three tiers, by size of the dealer group to which they belong.  The 2 to 10 tier has been remarkably stable, numbering roughly 8,200.  The single points have been in steady decline, losing 2,500 over the period.

Dealers know that single points are vulnerable to market shocks and competitive pressure, if for no other reason than being tied to a single make.  On present trends, we can expect them to vanish entirely within ten or fifteen years.