Suppose you have tested positive for a rare and fatal disease, and your doctor tells you the test is 90% accurate. Is it time to put your affairs in order? Fortunately, no. “Accuracy” means different things to different people, and it’s surprisingly easy to misinterpret.

What the 90% means to your doctor is that if ten people have the disease, then the test will detect nine of them. This is the test’s “sensitivity.” Sensitivity is important because you want to detect as many cases as possible, for early treatment.
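In symbols, writing *TP* for true positives and *P* for all actual positives:

$$\text{sensitivity} = \frac{TP}{P}$$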

On the other hand, like Paul Samuelson’s joke about the stock market having predicted nine of the last five recessions, sensitivity doesn’t tell you anything about the rate of *false positives*.

If you’re into machine learning, you probably noticed that sensitivity is the same as “recall.” Data scientists use several different measures of accuracy. For starters, we have precision, recall, naïve accuracy, and F1 score.

There are many good posts on how to measure accuracy (here’s one) but few that place it in the Bayesian context of medical testing. My plan for this article is to briefly review the standard accuracy metrics, introduce some notation, and then connect them to the inference calculations.

## Accuracy Metrics for Machine Learning

First, here is the standard “confusion matrix” for binary classification. It shows how test results fall into four categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Total actual positives and negatives are *P* and *N*, while total predicted are *P̂* and *N̂*.

|                     | Predicted Positive | Predicted Negative | Total |
|---------------------|--------------------|--------------------|-------|
| **Actual Positive** | TP                 | FN                 | *P*   |
| **Actual Negative** | FP                 | TN                 | *N*   |
| **Total**           | *P̂*               | *N̂*               |       |

These are not only definitions; they’re counts that combine to express probabilities, like the *sensitivity* formula above. This notation will come in handy later. The standard definition of accuracy is simply the number of cases that were labeled correctly – true positives and true negatives – divided by the total population.
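That is:

$$\text{accuracy} = \frac{TP + TN}{P + N}$$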

Unfortunately, this simple formula breaks down when the data is imbalanced. I care about this because I work with insurance data, which is notoriously imbalanced. The same goes for rare diseases, like HIV infection – which afflicts roughly 0.4% of people in the U.S. To deal with the imbalance, doctors use a metric called “specificity.”
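Specificity is the fraction of actual negatives that the test correctly clears:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{TN}{N}$$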

The FP term in the denominator penalizes the model for false positives. You can think of specificity as “recall for negatives.” Doctors want a test with high sensitivity for screening, and then a more *specific* test for confirmation. A good explainer from a medical perspective is here.

In a machine learning context, you want to optimize something called “balanced accuracy.” This is the average of sensitivity and specificity. For more on imbalanced data and machine learning, see my earlier post.
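As a sketch of how these metrics fit together – the `binary_metrics` helper below is my own illustration, not a library function – here is how they come out of the four confusion-matrix counts, and why naive accuracy misleads on imbalanced data:

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy metrics from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn                     # actual positives / negatives
    sensitivity = tp / p                        # recall: TP / P
    specificity = tn / n                        # "recall for negatives": TN / N
    accuracy = (tp + tn) / (p + n)              # naive accuracy
    balanced = (sensitivity + specificity) / 2  # balanced accuracy
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "balanced": balanced}

# A classifier that labels everyone negative, on data that is 1% positive:
print(binary_metrics(tp=0, fn=10, fp=0, tn=990))
# naive accuracy is 0.99, but balanced accuracy is only 0.5
```

Naive accuracy rewards the do-nothing classifier on imbalanced data; balanced accuracy exposes it as no better than a coin flip.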

## Bayes’ Theorem and Medical Testing

Bayes’ Theorem is a slick way to express a conditional probability in terms of its converse. It allows us to convert “is this true given the evidence?” into “what would be the evidence if this were true?”
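In its usual form:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$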

This kind of reasoning is obviously important for interpreting medical test results, and most people are bad at it. I’m one of them. I can never apply Bayesian reasoning without first drawing a Venn diagram:

In this diagram, *A* is the set of people who have the disease and *B* is the set of people who have tested positive. *U* is the *universe* of people that we’ve tested. We have to make this stipulation because, in real life, you can’t test everyone.

We might assume that the base rate of disease in the wide world is *A/U*, but we only know about the people we’ve tested. They may be self-selecting to take the test because they have risk factors, and this would lead us to overestimate the base rate.

Even within our tidy, tested universe, we can only estimate *A* by means of our imperfect test. This is where some probability math comes in handy. The *true positives*, people who tested positive and in fact have the disease, are the intersection of sets *A* and *B*. Here they are, using conditional probability:
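$$P(A \cap B) = P(B \mid A)\,P(A)$$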

That is, the probability of testing positive if you’re sick, *P(B|A),* times the base probability of being sick, *P(A)*. Again, though, *P(A)* can be found only through inference – and medical surveillance. Take a moment and think about how you would obtain these statistics in real life.

Mostly, you are going to watch the people who tested positive, set *B*, to see which ones develop symptoms. The Bayesian framework gives you four variables to play with – five, counting the intersection set itself – so you can solve for *P(A)* in terms of the other ones:
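$$P(A) = \frac{P(A \mid B)\,P(B)}{P(B \mid A)}$$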

That is, the probability of being sick if you’ve tested positive, *P(A|B)*, times the probability of testing positive, *P(B)*. We know *P(B)* because we know how many people we’ve tested, *U*, and how many were positive. Now that we’re in a position to solve for *P(A)*, let’s bring back the other notation.

## Accuracy Metrics and Bayes’ Theorem

Machine learning people use the accuracy metrics from the first section, above, while statistics people use the probability calculations from this second section. I think it’s useful, especially given imbalanced medical (or insurance) data, to combine the two.

Now, we can rewrite the two conditional probability calculations, above, in terms of accuracy. Set *A = P*, set *B = P̂*, and the various metrics describe how they overlap.
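$$P(B \mid A) = \frac{TP}{P} = \text{sensitivity}$$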

And:
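$$P(A \mid B) = \frac{TP}{\hat{P}} = \text{precision}$$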

Giving our sick group as:
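$$P(A) = \frac{P(A \mid B)\,P(B)}{P(B \mid A)} = \frac{\text{precision} \cdot \hat{P}/U}{\text{sensitivity}}$$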

Finally, since you’re still worried about your positive test result … let’s assume the disease has a base rate of 1% – more than twice as prevalent as HIV. Recall that we never said what the test’s *specificity* was. Since the test has good sensitivity, 90%, let’s say that specificity is weak, only 50%.

Suppose 1,000 people take the test. Ten of them actually have the disease, and the test catches nine (90% sensitivity). Of the 990 healthy people, half test positive anyway (50% specificity), so you are among 9 + 495 = 504 patients who tested positive. Of these, only nine actually have the disease. Your probability of being one of the nine is *P(A|B)*. This is the test’s *precision*, which works out to 9/504 ≈ 1.8%.
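The arithmetic is easy to check with a few lines of Python. The 1,000-person population is an assumed round number, chosen so the counts come out whole:

```python
# Worked example: 1,000 people tested, 1% base rate,
# 90% sensitivity, 50% specificity (assumed numbers).
U = 1000
sick = U * 0.01               # A: 10 people actually have the disease
tp = sick * 0.90              # sensitivity catches 9 of them
fp = (U - sick) * (1 - 0.50)  # half of the 990 healthy test positive: 495
positives = tp + fp           # B: 504 positive tests in total
precision = tp / positives    # P(A|B): chance a positive result means disease
print(f"positives = {positives:.0f}, precision = {precision:.1%}")
# → positives = 504, precision = 1.8%
```

Despite the test’s 90% sensitivity, a positive result still leaves you with under a 2% chance of actually being sick – which is why a high-specificity confirmatory test comes next.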