Data Lakes Explained

Last month, I wrote an explainer on AI and it was well-received, so here is one on data lakes.  If you already know the concepts, you may still find this framing helpful in client discussions.  Our audience this time is the CFO, or maybe the CMO, and our motivation is that their analytical needs are not well-served by the transactional database.

Transactional Processing with a Relational Database

The data that runs your business – most of it, anyway – is probably stored in a relational database like Microsoft’s venerable SQL Server.  Without going into details about the “relational” structure, the key is that this database is optimized for the daily operations of the business.

New policies are booked, premiums collected, and claims paid.  These are transactions that add, change, or delete records.  There are also “read only” operations, like producing invoices, but the database is designed primarily for transaction processing.

A well-designed transactional database will resist anomalies

A well-designed transactional database will resist anomalies, like a line item with no invoice, or two sales of the same item.  The database designer will have used a technique called normalization, breaking the data up into smallish tables with relationships that enforce integrity.

Think of how your chart of accounts is organized.  Everything you need to account for is broken down to the lowest relevant level, and then rolled up for reporting.  Every journal entry hits two accounts, debit and credit, so that they’re kept in balance.  Your meticulously normalized database is kind of like that.

When a customer places an order, a row is added to the Order table.  You don’t need to open the Customer table unless there’s a change to the customer.  Built around these normalized tables is the machinery of indexes, clusters, and triggers, which support speed and integrity.
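The idea is easy to see in miniature. Here is a hedged sketch using Python's built-in SQLite, with hypothetical Customer and Order tables (not anyone's production schema): the foreign-key relationship rejects an order for a customer who doesn't exist, which is exactly the kind of anomaly resistance described above.

```python
import sqlite3

# In-memory database; SQLite enforces foreign keys only when asked.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

con.execute("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")
con.execute("""
    CREATE TABLE "Order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
        total       REAL NOT NULL
    )""")

con.execute("INSERT INTO Customer VALUES (1, 'Acme Motors')")
con.execute('INSERT INTO "Order" VALUES (100, 1, 249.99)')   # valid order

# An order pointing at a nonexistent customer is rejected outright.
try:
    con.execute('INSERT INTO "Order" VALUES (101, 42, 99.00)')
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Notice that adding the valid order never touched the Customer table, just as the text says.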

Pro Tip: Take time to confirm that the transactional database is stable and supporting the business satisfactorily.  You don’t want to start building pipelines and then discover there’s a problem with your data source.

Analytical Processing with a Data Warehouse

Transaction processing involves adding and changing data, with carefully limited scope.  Analytical processing, by contrast, is mostly reading data – not changing it – and holistic in scope.  To support this, the data must be copied into a separate database and denormalized.

Let’s say you want to know whether Dent protection sells better as a standalone product, or as part of a bundle – corrected for the number of dealers who don’t offer the bundle, and segmented by the vehicle’s make and price range.

You could run this query against the transactional database, but it would be difficult.  The query is complicated enough without having to piece together data from multiple tables.  The normalization which served so well for transaction processing is now an obstacle.

Confession: I am a normalization bigot.  I bought C.J. Date’s textbook, read the original papers in the ACM journal, and even coded Bernstein’s algorithm.  To me, organized data is normalized data, and de-normalizing is like leaving your clothes on the floor.

So, this is a good guide to denormalization.  Everything we learned not to do in relational databases – wide tables, nested data, repeating groups – is useful here.
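To make the contrast concrete, here is a toy sketch (invented table and field names) of the same sale in both shapes: normalized rows that must be joined, versus one wide, denormalized row with everything an analyst needs in one place.

```python
# Normalized source rows, as they might come from the transactional database.
customers = {1: {"name": "Acme Motors", "state": "TX"}}
products = {"DENT": {"name": "Dent Protection", "bundle": False}}
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": "DENT", "amount": 249.99},
]

# Denormalized: one wide row per sale.  Customer and product attributes are
# copied into every row -- redundant, but ready for analytical queries.
wide = [
    {
        "order_id": o["order_id"],
        "amount": o["amount"],
        "customer_name": customers[o["customer_id"]]["name"],
        "customer_state": customers[o["customer_id"]]["state"],
        "product_name": products[o["product_id"]]["name"],
        "sold_in_bundle": products[o["product_id"]]["bundle"],
    }
    for o in orders
]
print(wide[0])
```

Segmenting standalone versus bundled sales is now a single pass over one table, with no joins.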

Analytical data is stored in cubes, stars, snowflakes, hearts, and clovers

Analytical work requires not only a new database design, but a new database system.  Out goes SQL Server, and in come BigQuery, Redshift, and Snowflake.  You may hear the buzzword OLAP, which means “online analytical processing.”  The term was coined for marketing purposes, to describe the new category of software.

Analytical data is stored in cubes, stars, snowflakes, hearts, and clovers (see sidebar).  Just kidding about the hearts and clovers.  Also, while your transactional database may be running SQL Server “on premise,” the analytical database will almost certainly be on a cloud service from Amazon, Microsoft, or Google.

To be honest, not everyone needs an OLAP database.  As CIO for BMW Financial Services, I did not recommend one because our analytical workload was small, at the time, and could be served adequately without a lot of new gear and expensive consultants.  Since then, I have gone over to the side of the consultants.

Sidebar: What’s an OLAP Cube?

In the early days of analytical processing, software vendors thought it would be a good idea to use a multidimensional data structure called a hypercube. Think of a typical spreadsheet, with rows representing an income statement and one column for each month. That’s two dimensions. Now, add a stack of spreadsheets, one for each region. That makes three dimensions, like a cube. I put myself through grad school working at Comshare, one of the first OLAP software vendors. It supported seven dimensions. That’s a hypercube. Nowadays, there are better data structures, and this leads to some confusion. Older analysts may assume that if they’re doing OLAP, then they must be using a cube. They may use the term “OLAP cube” to mean any analytical database, even though cubes have largely been replaced by newer structures.
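The spreadsheet-of-spreadsheets picture can be sketched in a few lines of Python. This is only an illustration of the data structure, not how any OLAP engine is actually implemented: a dictionary keyed by (account, month, region) tuples, with slicing by holding dimensions fixed.

```python
# A toy three-dimensional cube: (account, month, region) -> amount.
cube = {
    ("Revenue", "Jan", "East"): 1200.0,
    ("Revenue", "Jan", "West"):  900.0,
    ("Revenue", "Feb", "East"): 1400.0,
    ("Expense", "Jan", "East"):  700.0,
}

def slice_sum(account=None, month=None, region=None):
    """Sum over the cube, holding any supplied dimensions fixed."""
    return sum(
        v for (a, m, r), v in cube.items()
        if account in (None, a) and month in (None, m) and region in (None, r)
    )

print(slice_sum(account="Revenue"))           # total revenue: 3500.0
print(slice_sum(month="Jan", region="East"))  # one slice: 1900.0
```

Add four more key positions to the tuple and you have Comshare's seven-dimension hypercube.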

Pooling Data in a Data Lake

You can think of the data lake as a way station between the transactional database and the data warehouse.  We want to collect all the data into a common repository before loading it into the data warehouse.

Why not simply extract, transform, and load data straight from the transactional database?  Well, we could, but it would be brittle.  Any change on either side would require an update to the pipeline.  The data lake decouples the OLTP and OLAP data stores.

The data lake serves the very important function of storing all the data, in whatever format, whether or not it’s amenable to organization.  The term’s originator, James Dixon, wanted to suggest a large volume of data with no preconceived organization.

The key thing is to collect all the data in one place, and think about organization later.  This calls for an “object data store,” like Google Cloud Storage.  GCP and AWS both use “buckets.”  You get the idea – this is where you leave your clothes on the floor.

Most of your data will indeed be structured data coming from the transactional database, and on its way into the OLAP database – but not all of it.  Here are some real-life examples I have encountered:

    • Logs of API traffic. Details of who is using our ecommerce API, including copies of the payload for each request and response.
    • Text snippets. A file of the several paragraphs that make our standard Texas contract different from the one in Wisconsin, so that we can produce new contracts automatically.  Same goes for product copy on the web site.
    • Telephone metadata. A list of timestamps, durations, phone numbers, and extensions for all calls in the call center, both inbound and outbound.

These examples are better served by special-purpose databases like Hadoop, Bigtable, and Mongo.  It’s best to take stock of all the data your analysts might need, broadly speaking, and start collecting it before you go too far with designing the OLAP database.
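To illustrate the "clothes on the floor" approach, here is a local stand-in for a cloud bucket (a temp directory plays the bucket; the prefixes and payloads are invented): heterogeneous raw objects land under one prefix, stored as-is, with no schema imposed.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a cloud bucket: one landing area, raw objects, no schema.
lake = Path(tempfile.mkdtemp()) / "data-lake" / "raw"

objects = {
    "api-logs/2024-06-01.jsonl": json.dumps({"endpoint": "/quote", "status": 200}),
    "contracts/tx-variant.txt": "The following paragraphs apply in Texas only...",
    "call-metadata/calls.csv": "timestamp,duration,extension\n2024-06-01T09:00,180,412",
}

for key, payload in objects.items():
    path = lake / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)  # stored verbatim; organization comes later

print(sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()))
```

The pipeline into the OLAP database then reads from this common landing zone instead of poking at the transactional database directly.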

The Power of Experience

I have been rereading Gary Klein’s landmark book on decision-making, Sources of Power.  Klein’s genius was something other sciences take for granted: field work.  Klein and his team spent years studying how experts make high-stakes decisions in real life.  This is truly “what they don’t teach you in business school.”

The short version is that formal methods for decision making are rarely used in real-life conditions.  Indeed, the people studied by Klein were not even conscious of making decisions.  They just knew what to do.  When a surgeon must make a snap decision, with someone’s life on the line, there’s no time for a weighted-factor analysis.

Most research on decision-making bleaches out the importance of prior experience

Klein points out that most psychology research, in an effort to produce controlled conditions, bleaches out the importance of prior experience.  If you do all your research in a laboratory, then you will only learn how people make decisions in a laboratory – not in combat, say, or a forest fire.

Like that of his better-known colleagues Kahneman and Tversky, much of Klein’s research was funded by military organizations.  They would like their gunners and squadron leaders not to make fatal blunders under fire.  His subjects also included doctors, firefighters, and nuclear power plant operators.

The power of experience seems obvious enough, but Klein figured out exactly how it works, in a framework called the Recognition-Primed Decision Model.  This consists of using imagination plus experience to generate possible courses of action, and then conducting mental simulations to predict the likely results.

Sources of Power

Various “sources of power” follow from the model:

  • Expert Intuition
  • Mental Simulation
  • Finding Leverage Points
  • Detecting Anomalies
  • Reasoning by Analogy
  • Anticipating Intentions

What we think of as intuition is really expert recognition.  One firefighter recounted a narrow escape because he’d had a “premonition” the building he was working in was about to collapse.  This might have been a warning from God – or it might have been the million subtle cues he was unconsciously observing.

This may seem like a different realm from business, where we have ample time to make decision trees, compute expected values, perform cost-benefit analyses, and – there’s always time for one more Big Four consulting study.  This is an illusion, however.  Whether they know it or not, managers are under constant pressure to make decisions and take action faster than their competitors.

A good plan, executed right now, beats a perfect plan executed next week.

My mentor at AutoNation, Kevin Westfall, had a plaque in his office with this quote from General George S. Patton, “a good plan, executed right now, is far better than a perfect plan executed next week.”  Kevin and I had both arrived from our previous employer with some impatience over their decision protocols.

Recognition-Primed Decision Making

In an area that could easily devolve into pop psychology, I was impressed by Klein’s scientific rigor.  Every study is cross-checked, blind, double-blind, sanitized, etc.  Every result is turned into a training program, and then the trainees are tested.  In one project, his team redesigned the user interface for a computerized weapons system, making its operators 20% more effective.

Since experience is so powerful, Klein takes up the question of how best to gain it.  That is, what are the key lessons from the old-timers in various domains?  In the infantry, this might mean knowing how fast your squad can move over terrain, what their best range is for engagement, and being able to gauge those distances by eye.

The cornerstone of the book is the RPD framework, and then Klein spends a chapter on each “source of power,” plus his research methods and training programs.  If that sounds like too much psychology for you, skip the text and just read the case studies.  They’re amazing.

GPS Trackers and OBD Ports

While I was working at Safe-Guard, in 2018, we adopted and co-branded a GPS tracker from ZAZ.  Shortly thereafter, we learned that our crosstown rival, EasyCare, was backing another such product, called SAVY.

Of the two, SAVY had more consumer-friendly features in their mobile app, which I feel is decisive.  This point of strategic positioning is the focus of today’s post.

Neither hookup did well, demonstrating that providers of “paper” F&I products are ill-equipped to deploy hardware.  I took the installer training, just for grins, with Hector Delgado.  So, at least I have a useful skill to fall back on.

I also consulted briefly for LoJack, in 2012, helping them sort out issues around preloading – issues they solved, ultimately, by selling the brand to Spireon.  Old-timers will recall LoJack used to work on radio.  It’s GPS now, like all the others.  “New LoJack by Spireon” is, in fact, old Spireon plus the stronger brand name.  The field today consists of:

    • Ikon
    • LoJack
    • Recover
    • SAVY
    • ZAZ

The model for all of these is that the dealer installs the tracking devices and uses them for lot management, and then sells them through to customers as theft protection.  They’re often sold as a nonnegotiable “preload,” which makes sense from the dealer’s perspective because it would cost another $50 of technician time to remove the device if the customer doesn’t want it.  You can see how the consumer appeal of the mobile app figures into our story.

If the device is drawing power from an OBD port, it can report the vehicle’s battery condition along with its location.  There’s a lot more you can do with OBD data, but manufacturers can be prickly about connecting to those other pins.  The typical device consists of the GPS chip, a cell modem, and an accelerometer.  You may have noticed that your iPhone also includes these parts, but not the OBD plug.

Speaking of those other pins, subprime lenders and BHPH dealers can wire the device to do starter interrupt – the OBD-powered devices, that is.  The Recover device I saw at NADA is battery powered.  The argument for a battery-powered device is that it’s easier to install.  The opposing arguments are battery life, especially if you are selling it through, and the advanced capabilities available only through the OBD port.

Connected Car Features

This brings me to the consumer features:

    • Service reminders
    • Teen driving
    • Driver performance
    • OBD health scan
    • Dealer inventory
    • Service scheduling
    • Credit application
    • Trip history
    • Recall notification
    • Digital glovebox

The astute reader will note that many of these features also aid the dealer in customer retention.  On the other hand, dealer-friendly features don’t mean a thing if the customer doesn’t use the app.  So, preloading can work against you if F&I fails to upsell the device properly.

Also, as mentioned above, your iPhone can support most of these functions on its own.  I run Life360, which adds “insurance referral” to the driver performance feature.  The advantage to the dealer-installed device is that it’s physically attached to the vehicle.  By the way, you can buy a home OBD scanner for $30 at Walmart.

The dealer-installed GPS tracker is an amalgam of all these capabilities.  The key to success is exploiting them creatively and packaging them in ways that appeal to the consumer.

What is Accuracy?

Suppose you have tested positive for a rare and fatal disease, and your doctor tells you the test is 90% accurate.  Is it time to put your affairs in order?  Fortunately, no.  “Accuracy” means different things to different people, and it’s surprisingly easy to misinterpret.

What the 90% means to your doctor is that if ten people have the disease, then the test will detect nine of them.  This is the test’s “sensitivity”: the number of true positives divided by the number of actual positives, 9/10 = 90%.  Sensitivity is important because you want to detect as many cases as possible, for early treatment.

On the other hand, like Paul Samuelson’s joke about the stock market having predicted nine of the last five recessions, sensitivity doesn’t tell you anything about the rate of false positives.  

If you’re into machine learning, you probably noticed that sensitivity is the same as “recall.”  Data scientists use several different measures of accuracy.  For starters, we have precision, recall, naïve accuracy, and F1 score.

There are many good posts on how to measure accuracy (here’s one) but few that place it in the Bayesian context of medical testing.  My plan for this article is to briefly review the standard accuracy metrics, introduce some notation, and then connect them to the inference calculations.

Accuracy Metrics for Machine Learning

First, here is the standard “confusion matrix” for binary classification.  It shows how test results fall into four categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).  Total actual positives and negatives are P and N, while total predicted are P̂ and N̂.

                Actual +    Actual −
Predicted +        TP          FP       (total P̂)
Predicted −        FN          TN       (total N̂)
Total               P           N

These are not only definitions, they’re numbers that express probabilities, like the sensitivity formula: sensitivity = TP / P.  This notation will come in handy later.  The standard definition of accuracy is simply the number of cases which were labeled correctly – true positives and true negatives – divided by the total population: accuracy = (TP + TN) / (P + N).

Unfortunately, this simple formula breaks down when the data is imbalanced.  I care about this because I work with insurance data, which is notoriously imbalanced.  The same goes for rare diseases, like HIV infection – which afflicts roughly 0.4% of people in the U.S.  Doctors use a metric called “specificity”: specificity = TN / (TN + FP).

The FP term in the denominator penalizes the model for false positives.  You can think of specificity as “recall for negatives.”  Doctors want a test with high sensitivity for screening, and then a more specific test for confirmation.  A good explainer from a medical perspective is here.

In a machine learning context, you want to optimize something called “balanced accuracy.”  This is the average of sensitivity and specificity.  For more on imbalanced data and machine learning, see my earlier post.
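All of these metrics fall out of the four confusion-matrix cells. Here is a minimal sketch (invented numbers, chosen to be imbalanced) that computes every metric discussed so far:

```python
def metrics(tp, fp, fn, tn):
    """Standard accuracy metrics from the four confusion-matrix cells."""
    p, n = tp + fn, tn + fp                 # actual positives and negatives
    return {
        "sensitivity": tp / p,              # a.k.a. recall
        "specificity": tn / n,              # "recall for negatives"
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / (p + n), # naive accuracy
        "balanced":    (tp / p + tn / n) / 2,
    }

# Imbalanced data: 100 actual positives out of 1,000 cases.
m = metrics(tp=90, fp=30, fn=10, tn=870)
print(m)
```

Note how naive accuracy (0.96) flatters the model here; balanced accuracy is the fairer summary on imbalanced data.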

Bayes Theorem and Medical Testing

Bayes’ Theorem is a slick way to express a conditional probability in terms of its converse.  It allows us to convert “is this true given the evidence?” into “what would be the evidence if this were true?”

This kind of reasoning is obviously important for interpreting medical test results, and most people are bad at it.  I’m one of them.  I can never apply Bayesian reasoning without first sketching a Venn diagram of the sets involved.

In this diagram, A is the set of people who have the disease and B is the set of people who have tested positive.  U is the universe of people that we’ve tested.  We have to make this stipulation because, in real life, you can’t test everyone.

We might assume that the base rate of disease in the wide world is A/U, but we only know about the people we’ve tested.  They may be self-selecting to take the test because they have risk factors, and this would lead us to overestimate the base rate.

Even within our tidy, tested universe, we can only estimate A by means of our imperfect test.  This is where some probability math comes in handy.  The true positives, people who tested positive and in fact have the disease, are the intersection of sets A and B.  Here they are, using conditional probability: P(A ∩ B) = P(B|A) × P(A).

That is, the probability of testing positive if you’re sick, P(B|A), times the base probability of being sick, P(A).  Again, though, P(A) can be found only through inference – and medical surveillance.  Take a moment and think about how you would obtain these statistics in real life.

Mostly, you are going to watch the people who tested positive, set B, to see which ones develop symptoms.  The Bayesian framework gives you four variables to play with – five, counting the intersection set itself – so you can solve for P(A) in terms of the other ones: P(A ∩ B) = P(A|B) × P(B), and therefore P(A) = P(A|B) × P(B) / P(B|A).

That is, the probability of being sick if you’ve tested positive, P(A|B), times the probability of testing positive, P(B).  We know P(B) because we know how many people we’ve tested, U, and how many were positive.  Now that we’re in a position to solve for P(A) let’s bring back the other notation.

Accuracy Metrics and Bayes Theorem

Machine learning people use the accuracy metrics from the first section, above, while statistics people use the probability calculations from this second section.  I think it’s useful, especially given imbalanced medical (or insurance) data, to combine the two.

Now, we can rewrite the two conditional probability calculations, above, in terms of accuracy.  Set A = P, set B = P̂, and the various metrics describe how they overlap: P(B|A) = TP / P, which is sensitivity, and P(A|B) = TP / P̂, which is precision.

Giving our sick group as: P = precision × P̂ / sensitivity.

Finally, since you’re still worried about your positive test result … let’s assume the disease has a base rate of 1% – two and a half times as prevalent as HIV.  Recall that we never said what the test’s specificity was.  Since the test has good sensitivity, 90%, let’s say that specificity is weak, only 50%.

Out of 1,000 people tested, you are among the 504 who tested positive: 9 true positives plus 495 false positives.  Your probability of being one of the nine true positives is P(A|B).  This is the test’s precision, which works out to 9 / 504 ≈ 1.8%.
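The arithmetic behind that reassuring number, using the figures given above (1,000 tested, 1% base rate, 90% sensitivity, 50% specificity):

```python
# Worked example: counts implied by the stated rates.
tested = 1_000
sick = round(0.01 * tested)                 # 1% base rate -> 10 people
tp = round(0.90 * sick)                     # sensitivity 90% -> 9 detected
fp = round((1 - 0.50) * (tested - sick))    # specificity 50% -> 495 false positives
positives = tp + fp                         # 504 people get bad news

precision = tp / positives                  # P(A|B): sick, given a positive test
print(f"{positives} positives, precision = {precision:.1%}")
```

A 90% "accurate" test, applied to a rare disease, leaves a positive patient with under a 2% chance of actually being sick. That is the whole point of putting the accuracy metrics into their Bayesian context.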