I’ll admit, when I started reading Ben Goldacre’s “Bad Science” I did so with the intent of poking fun at the pitfalls and pseudoscience of homeopaths and nutritionists. And whilst this has been enjoyable, the aspects of the book that I have found particularly engaging are the subtler statistical nuances: how the pharmaceutical industry skews data, and how a general misunderstanding of statistics within society can have calamitous outcomes.
It’s the latter that I’d like to think about today, and specifically an example of how effective a diagnostic test is given the rarity of the thing it’s diagnosing. Take a disease, for example. Goldacre discusses AIDS, but any will do. The quality of a diagnostic test is generally scored in terms of its sensitivity and specificity. A test’s sensitivity is its ability to correctly give a positive outcome given that the subject has the disease, a ‘true positive’. Specificity is a test’s ability to correctly give a negative outcome given that the subject does not have the disease, a ‘true negative’. These measures basically tell you how good a test is, but if the thing you’re testing for is rare, they can give a misleading picture of how far a positive result can be trusted. To demonstrate this, I’ve chosen a real-world example.
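A minimal sketch of the two definitions, using made-up counts rather than figures from any real test (the function and argument names are purely illustrative):

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of subjects WITH the condition that the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of subjects WITHOUT the condition that the test clears as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical trial: 100 people with the disease, 900 without it.
print(f"sensitivity: {sensitivity(true_pos=99, false_neg=1):.0%}")
print(f"specificity: {specificity(true_neg=891, false_pos=9):.0%}")
```

Note that neither function knows how common the disease is, which is exactly why these two scores alone cannot tell you how much to trust a positive result.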
Example: Benefit Fraud
Benefit fraud is a textbook example of a problem whose scale is perceived to be greater than it is. In 2013, Ipsos MORI surveyed the public and found that the perceived level of benefit fraud was 34 times the value suggested by government figures. Call me cynical, but public perception is an important driver of government action, and so means testing has become more stringent. Herein lies our ‘diagnostic test’.
UK government statistics estimate that 1.9% of benefit expenditure is overpaid through fraud and error. Now, imagine that means testing has a specificity and sensitivity of 99%: you catch 99% of the fraudsters you test, and 99% of genuine claimants pass. That sounds like a pretty good test. It’s only 1 off 100%, which, despite what sports coaches may claim, is as good as it gets.
Yet, remember that only 1.9% of benefit money is claimed fraudulently. That means for every £1000 tested, £19 is fraudulent, of which our 99% sensitive test will reclaim £18.81. “BENEFIT FRAUD DOWN 99%” the headlines read. However, few tests are perfect and neither is ours. It is 99% specific, which means that 1% of the time it gives a ‘false positive’, flagging a genuine claim as fraudulent. This means that in the same sample of £1000, roughly £10 (strictly, 1% of the £981 that is genuine, or £9.81) appears as a false positive and is stripped from a genuine claimant.
Mathematically therefore, despite our test being 99% accurate, only £18.81 of the £28.81 flagged as fraudulent is in fact fraudulent; the rest is false positive. This means that a positive result from our test is really only around 65% accurate, given that fraud only occurs 1.9% of the time (statisticians call this figure the positive predictive value). Suddenly our 99% test doesn’t seem so great, and is in fact stripping away roughly £1 from genuine claimants for every £2 saved.
I’m possibly being generous by imagining that a bureaucratic assessment could be 99% accurate, so let’s imagine that it’s 90% accurate instead. We reclaim £17.10 of the £19 claimed fraudulently, but now we have a false positive rate of 10%, which is roughly £100 for every £1000 tested. Our 90% test is only around 15% accurate when we examine the money flagged as fraudulent.
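The arithmetic for the 99% and 90% scenarios can be checked with a short sketch. It assumes, as the example does, that the test acts on pounds in proportion to its sensitivity and specificity; the function name and the £1000 default are illustrative choices. It applies the false positive rate to the £981 of genuine money rather than the full £1000, so the false positives come out slightly below the rounded £10 and £100 used in the text, but the headline percentages are the same.

```python
def flagged_breakdown(prevalence, sensitivity, specificity, total=1000.0):
    """Split the money a test flags into true and false positives.

    Returns (true_positives, false_positives, share_of_flagged_that_is_fraud).
    """
    fraudulent = total * prevalence
    genuine = total - fraudulent
    true_pos = fraudulent * sensitivity        # fraud correctly flagged
    false_pos = genuine * (1.0 - specificity)  # genuine money wrongly flagged
    return true_pos, false_pos, true_pos / (true_pos + false_pos)


for accuracy in (0.99, 0.90):
    tp, fp, share = flagged_breakdown(0.019, accuracy, accuracy)
    print(f"{accuracy:.0%} test: £{tp:.2f} fraud flagged, "
          f"£{fp:.2f} wrongly flagged, {share:.1%} of flagged money is fraud")
```

The last figure on each line is the one that matters to a claimant who has been flagged: about 66% for the 99% test and about 15% for the 90% test.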
What are the real figures? Again according to government figures, 0.6 percentage points of the 1.9% lost is recovered, suggesting recovery methods have ~32% sensitivity. Meanwhile 1% is ‘underpaid due to fraud or error’, suggesting a 99% specificity. So, for every £1,000 given out as benefits, £19 is overpaid, of which £6 is recovered and £13 stays lost, while £10 is underpaid. Measures are therefore about 38% successful (£6 recovered out of the £16 they affect), and almost twice as much is underpaid as is recovered from genuine fraud.
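A quick sketch of where those percentages come from, using only the figures quoted above per £1,000 of spending. Treating underpayment as the test’s ‘false positives’ is this post’s own analogy, not something the government statistics claim, and the variable names are illustrative.

```python
# Figures per £1,000 of benefit spending, as quoted in the text.
overpaid = 19.0   # 1.9% overpaid through fraud and error
recovered = 6.0   # 0.6 percentage points of that is recovered
underpaid = 10.0  # 1% underpaid due to fraud or error

still_lost = overpaid - recovered               # money never clawed back
implied_sensitivity = recovered / overpaid      # share of overpayment recovered
success = recovered / (recovered + underpaid)   # recovered vs. all money affected
underpaid_per_pound = underpaid / recovered     # underpayment per £1 recovered

print(f"still lost: £{still_lost:.0f}")
print(f"implied sensitivity: {implied_sensitivity:.1%}")
print(f"'success' rate: {success:.1%}")
print(f"underpaid per £1 recovered: £{underpaid_per_pound:.2f}")
```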
These mathematical symptoms are unavoidable, but can be relieved when the thing being tested for is more common. This is because your false positive rate always stays the same, whereas the proportion of positive results that are true positives increases, and therefore the test performs better.
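To see the effect of rarity, here is a small sketch that holds a hypothetical 99% sensitive, 99% specific test fixed and varies only how common the problem is. The prevalence values other than 1.9% are arbitrary illustrations.

```python
def share_of_positives_that_are_true(prevalence, sensitivity=0.99, specificity=0.99):
    """Fraction of positive results that are true positives at a given prevalence."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Same test every time; only the rarity of the problem changes.
for prevalence in (0.001, 0.019, 0.10, 0.50):
    share = share_of_positives_that_are_true(prevalence)
    print(f"prevalence {prevalence:>5.1%}: {share:.1%} of positives are genuine")
```

At a prevalence of one in a thousand, fewer than one in ten positives is real; at 50%, the figure finally matches the advertised 99%.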
What this example shows us is that false positives carry a cost that needs to be weighed against the likelihood of achieving a true positive, i.e. how common is the problem you’re testing for? This is critical whenever we attempt to take preventative action against an issue. As Goldacre points out, this is the reason that preventative legal action cannot be undertaken to stop people with psychiatric conditions committing murder, in some kind of bizarre ‘Minority Report’ way.
I feel like this is an important train of thought for considering how we should approach a lot of modern issues: is the cost of false positives something we’re willing to accept in order to ‘catch’ the true positives, given the scale of the problem? As a result, we need to decide whether responding to events after they’ve happened may be preferable to trying to prevent them. It also emphasises the need to educate ourselves on the true scale of a problem in order to critically evaluate how we should deal with it. Lastly, it’s just a nice example of how a big statement like ‘99% ACCURATE’ can become wholly unimpressive under closer inspection.
“Fraud and error in the benefit system: financial year 2015/16 estimates.” – GOV.UK. Retrieved March 19, 2017 (https://www.gov.uk/government/statistics/fraud-and-error-in-the-benefit-system-financial-year-201516-estimates).
“Perceptions are not reality.” – Ipsos MORI. Retrieved March 19, 2017 (https://www.ipsos-mori.com/researchpublications/researcharchive/3188/Perceptions-are-not-reality-the-top-10-we-get-wrong.aspx).