heuristics on statistics

posted 2023.02.25

you may have noticed that statistics are all over the place these days. just the other day i went to the grocery store and the cashier told me, “did you know apples account for 33% of our fruit sales?” and then i got a haircut and the hairdresser told me that people are five times more likely to wear gray when they ask for their bangs to be trimmed. finally, exhausted by this onslaught of numbers, i watched an NBA game to relax and was told that Devin Booker is shooting 4% better in the second half of the season than in the first! it’s inescapable.

sorry for lying. 100% of the anecdotes in the above paragraph are not true. but anyway these are the kinds of statements you might encounter when people make claims, and it is probably important to know how much you should believe them.

i’m not, like, a world-class expert in knowing when to believe statistics. but i have read a lot of writing from people who do well-respected statistics work, i’ve done statistical analyses myself, and if it means anything, i got a 5 in AP Stats.¹ what i’m saying is, even if i’m just some guy, you’re probably even more of just some guy.²

example 1: the apple anecdote from above

“did you know apples account for 33% of our fruit sales?” if a cashier tells you this you can probably just believe it. i doubt they ever would, but unless it’s a total lie it is hard for that statement to be misleading. grocery stores sell a lot of fruit, so the sample size is probably big.³ and counting how many fruits are sold and how many of those are apples is pretty straightforward. you might want to clarify if this statement was made about sales from the last week, the last year, or the last 40 years, though, because those would probably have different percentages.

example 2: the hairdresser anecdote from above

“people are five times more likely to wear gray when they ask for their bangs to be trimmed.” if you are not careful you might immediately start thinking, “yeah, gray is a pretty boring color, and boring people might be more likely to trim their bangs.” and on the basis of this statistic you have created an entire causal explanation for why this might be true. but instead this claim should probably set off Bogus Alert bells.

i doubt the hairdresser has been making shirt color tally marks during their job, so let’s assume this is actually an article published in the social science journal of your choice. and maybe the hairdresser just happens to be the erudite sort and read it in that journal.

if you hear this statistic offhand, then you don’t know how the study was actually conducted. so imagine you go read the study and this is what the study says:

we, the researchers, went to a couple local barbershops and tracked 1000 customers. 600 of them asked for their bangs to be trimmed, of which 300 (50%) were wearing a gray shirt. 400 did not ask for their bangs to be trimmed, of which 40 (10%) were wearing a gray shirt. if they were wearing a coat over their shirt we just counted the coat color, and if they were wearing a dress or something instead of a shirt we counted the color of that instead.

this sounds like pretty strong evidence. the sample size is pretty big here, and they define pretty clearly and objectively what “wearing gray” means.

but when you hear this sort of claim, and it sounds sort of surprising, that is almost never what the paper looks like. it’ll probably look something like this:

we, the researchers, went to a barbershop and tracked 90 customers. we wrote down the color of their shirt and the haircut they wanted. 10 asked for their bangs to be trimmed, of which 5 (50%) were wearing gray. 80 did not, of which 8 (10%) were wearing gray.

this isn’t no evidence. but there are a few things that should be weighing against it:

if not explained, you don’t know how they decided whether someone was “wearing gray” (or even “asked for their bangs to be trimmed”).
if they wrote down everyone’s shirt color and haircut choice, there is a good chance they were checking like 8 different shirt colors and 5 different haircuts to see if any combination turned up anything interesting. and with 8 * 5 = 40 different combinations, it is pretty likely that one of them would seem interesting, even if it was totally random chance.
most importantly: this claim makes no sense!!! am i really to believe that people who get their bangs trimmed love gray so much that they are 5 times more likely to wear it? and if this is the case, why has no one in our society ever noticed this?
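the “40 combinations” worry can be made concrete with a few lines of arithmetic. this is only a sketch: it assumes each comparison is independent and gets flagged as “interesting” 5% of the time by pure chance. the 5% threshold is a common convention assumed here, not something the imagined paper states, and the function name is made up.

```python
def chance_of_fluke(num_comparisons, false_alarm_rate=0.05):
    """Probability that at least one of several independent comparisons
    looks interesting purely by chance."""
    return 1 - (1 - false_alarm_rate) ** num_comparisons

shirt_colors = 8
haircuts = 5
combinations = shirt_colors * haircuts  # 40, as in the text

# The 5% false alarm rate is an assumed convention, not from the study.
print(f"{combinations} comparisons: "
      f"{chance_of_fluke(combinations):.0%} chance of at least one fluke")
```

under those assumptions the chance of at least one fluke among 40 comparisons comes out to about 87%, so finding one striking combination is close to the expected outcome.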

thankfully no such study has actually been printed. but the lesson here is: if you see a surprising claim, especially if it claims an unexpectedly large effect size, the methodology and sample size had better be pretty good.
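to put rough numbers on why sample size matters for the two imagined studies above, here is a minimal sketch that computes the “times more likely” ratio and an approximate 95% interval around it. the function name is hypothetical, and the interval uses a standard log-scale normal approximation, which gets shaky with counts as small as 5, so treat the output as a ballpark.

```python
import math

def risk_ratio_interval(hits_a, total_a, hits_b, total_b, z=1.96):
    """Return (ratio, low, high) for how much more common an outcome is in
    group A than in group B, using a log-scale normal approximation."""
    ratio = (hits_a / total_a) / (hits_b / total_b)
    # Standard error of log(ratio); small counts make this large.
    se_log = math.sqrt(1 / hits_a - 1 / total_a + 1 / hits_b - 1 / total_b)
    low = ratio * math.exp(-z * se_log)
    high = ratio * math.exp(z * se_log)
    return ratio, low, high

# Counts taken from the two imagined studies in the text.
big_study = risk_ratio_interval(300, 600, 40, 400)
small_study = risk_ratio_interval(5, 10, 8, 80)

print("1000 customers: ratio %.1f, interval %.1f to %.1f" % big_study)
print("  90 customers: ratio %.1f, interval %.1f to %.1f" % small_study)
```

both versions report the same 5x ratio, but the interval for the 90-customer version is several times wider than the one for the 1000-customer version. and this only covers sampling noise; it says nothing about the 40-combinations problem or fuzzy definitions.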

interlude: you can only explain 100% of the variance

in statistics we often use a measurement called “variance”, which measures, believe it or not, how much a quantity varies. if you want to know about it in detail you should probably read the wikipedia article about it, but the basic idea is this:

interlude example 1: height and hats

consider a quantity like height. say the average adult is 5 foot 6. and let’s say that you take every adult and compare their height to that average, and take the square of the difference. a 5 foot 3 person is 3 inches off, and 3 squared is 9. and a 6 foot person is 6 inches off, and 6 squared is 36. so you average all those squares across all the adults, and finally you have the variance of adult height. higher variance means people differ more from the average. if everyone had the same height then the variance would be 0; if everyone had wildly different heights the variance would be very high.
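here is the recipe above as a small sketch. the four heights are made up for illustration (63 inches is 5 foot 3, 66 is 5 foot 6, 72 is 6 foot), and the function name is hypothetical.

```python
def variance(values):
    """Average squared distance from the mean (the "population" variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Made-up heights in inches, chosen so the average is exactly 66 (5 foot 6).
heights = [63, 66, 72, 63]
everyone_the_same = [66, 66, 66]

print(variance(heights))            # squares are 9, 0, 36, 9, averaged
print(variance(everyone_the_same))  # no spread at all
```

the squared differences here are 9, 0, 36, and 9, which average to 13.5. the all-identical list gives 0, matching the “everyone had the same height” case.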

now imagine everyone is wearing hats, and you measure the variance of people’s height including their hats. the variance of people’s heights (in inches) might be something like 20. and because some people favor tall top hats, while others prefer thin baseball caps, the variance of how much height their hats add (in inches) might be 5. then assuming there is no correlation between height and hat preference, the variance of people’s height including their hats is 20 + 5 = 25. you can just add it. it’s that simple.⁴
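the “you can just add it” rule can be checked directly. this sketch uses made-up heights and hat heights and pairs every height with every hat, which is one simple way to guarantee zero correlation between the two. all names and numbers here are assumptions for illustration.

```python
from itertools import product
from statistics import pvariance

heights = [60, 64, 66, 68, 72]  # made-up heights in inches
hats = [0, 1, 2, 7]             # made-up hat heights in inches

# Pair every height with every hat so hat choice is independent of height.
totals = [height + hat for height, hat in product(heights, hats)]

var_heights = pvariance(heights)
var_hats = pvariance(hats)
var_totals = pvariance(totals)

print("heights:", var_heights)
print("hats:", var_hats)
print("heights plus hats:", var_totals)
print("sum of the two:", var_heights + var_hats)
```

the last two printed numbers should agree. if the pairing were rigged so tall people got tall hats, they would stop agreeing, which is the caveat in the footnote.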

interlude example 2: sports

so let’s say you observe a quantity, and the variance is 20. like imagine that when two teams in the National Sports League play each other, the variance of the home team’s score is 20. so maybe they usually score about 80 points, but it could be 86 or it could be 77, or something else around there, depending on the game.

then let’s break it down by how much variance the home team’s offense contributes, how much the away team’s defense contributes, and how much is just luck. maybe in this league, there is a big difference between good offenses and bad offenses, so teams’ offenses have a variance of 10 points.⁵ and maybe defense is relatively less distinct between teams, having a variance of only 4 points.⁶ that adds up to 10 + 4 = 14, which is 6 less than 20, so you might say the remaining 6 points of variance come from luck, or how the players were feeling that day, or whatever. the 20 points of variance have to be explained by something, even if some of it is luck.

now imagine your friend tells you that home teams in the National Sports League score 10 points more on even-numbered days than they do on odd-numbered days. and it so happens that about half of the games are on even-numbered days, and half are on odd-numbered days. you immediately know this is total bogus!

how? because this means that the even-odd split accounts for 25 points of variance.⁷ and we already know that the variance of home team scores in the National Sports League is only 20. how is the even-odd split accounting for more variance than actually exists? it can’t! so if your friend tells you that statistic, you can tell your friend to shove it.

but your friend says they misspoke: they actually meant to say that home teams score 6 more points on even days than on odd days. i think you should still be suspicious! that’s 9 points of variance still, which only leaves 20 - 9 = 11 points of variance to account for how good the home team’s offense is, how good the away team’s defense is, luck, whether the players ate a balanced breakfast that day, and so on. and somehow the even-odd split is accounting for 45% of the total? seems fishy.
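the even-odd arithmetic generalizes a little. as a sketch: if a fraction p of games sit in the higher-scoring group and the two groups differ by d points on average, the split contributes p * (1 - p) * d² points of variance, and with a half-and-half split that is just (d / 2)². the function name is made up for illustration.

```python
def split_variance(point_gap, share_in_high_group=0.5):
    """Variance contributed by a two-way split whose group averages differ
    by point_gap, when share_in_high_group of games are in the higher group."""
    p = share_in_high_group
    return p * (1 - p) * point_gap ** 2

total_variance = 20  # variance of home team scores, from the text

for gap in (10, 6):
    explained = split_variance(gap)
    share = explained / total_variance
    print(f"{gap}-point gap: {explained:g} points of variance, "
          f"{share:.0%} of the total")
```

the 10-point claim needs 25 points of variance, which is 125% of what exists. the 6-point claim needs 9 points, or 45% of the total.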

it’s not that this literally couldn’t be true. maybe the league has a long-documented history of scoring more on even days than odd days, over the course of so many games that it’s irrefutable. but if your friend is using a small sample, or is just estimating that number based on their impression, it’s very implausible. you can’t have one thing explain 40% of the variance, and another thing explain 30% of the variance, and a third thing explain 45%. it has to add up to 100%!

the point here isn’t that you need to be doing variance math in your head for every statistic, but to keep in mind that, if you see that Quantity X supposedly has a huge effect on Quantity Y, then all the other quantities can only explain as much as Quantity X doesn’t explain. and if we already thought that Quantity Z had a huge effect, then we might need Quantity X and Quantity Z to duke it out in the ring. because their importance cannot add up to more than 100%.

the interlude is over

thank you for your cooperation.

example 3: broccoli and cancer

every few weeks a study is published that says something like, “eating broccoli twice a week lowers your risk of cancer by 40%.” these are usually bogus. (broccoli might, in fact, lower your cancer risk, but that tends to mean that the study got lucky.) why? there are many reasons:

as we have learned, the variance can only add up to 100%! saying broccoli lowers your risk of cancer by 40% doesn’t quite mean that it accounts for 40% of the variance in cancer incidence, but given that broccoli consumption is not uncommon, reducing cancer risk by 40% does mean that it’s a pretty important factor. like, way more important than you should expect without a convincing medical explanation for it.⁸
nutrition studies in particular are extra fraught, because it is really hard to track people’s diets accurately. usually they have people write down everything they eat in a food journal, but it takes a long time to get an accurate picture of someone’s diet, and people don’t always give a perfectly accurate account of their actual diet. because of how hard it is to get people to accurately log all their meals, these studies tend to try to get their money’s worth by testing a ton of foods at once. and when you’re doing this many tests, you either raise your standards of evidence so much that everything seems to have no effect, or you lower them enough that many of your effects are purely due to chance. and then you publish the study and news agencies pick the most sensational (probably wrong) conclusion and make it the headline.⁹
if you see something like “twice a week” as a qualifier, that could be a sign that the researchers chose an arbitrary cut off for broccoli intake that maximizes how interesting their results seem.
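the “testing a ton of foods at once” trade-off can be sketched with a tiny simulation. everything here is hypothetical: 100 foods, none of which has any real effect, with each test’s p-value assumed to land uniformly between 0 and 1 (which is how a well-behaved test acts when nothing is going on). the stricter bar shown is the Bonferroni correction, one common way of raising the standard of evidence.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_foods = 100         # hypothetical number of foods tested

# Simulate a world where no food affects cancer risk at all.
p_values = [rng.random() for _ in range(num_foods)]

usual_threshold = 0.05
strict_threshold = usual_threshold / num_foods  # Bonferroni correction

lenient_hits = sum(p < usual_threshold for p in p_values)
strict_hits = sum(p < strict_threshold for p in p_values)

print("flukes expected at the usual threshold:", usual_threshold * num_foods)
print("flukes in this simulated study:", lenient_hits)
print("flukes after raising the bar:", strict_hits)
```

at the usual threshold you expect about 5 foods to look like they matter even though none do. the strict threshold removes most flukes, but a real, modest effect would have a hard time clearing it too.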

example 4: the polls

if a reputable pollster says Candidate A is leading Candidate B by 5 points in their senate race, you can probably believe them. this isn’t as simple as the earlier fruit example, because pollsters will often do some statistics to try to make their results more accurate, instead of just counting what percent of people said each candidate’s name. for example, if 50% of the voters in a district are women but only 40% of the people polled were women, they might try to balance it out by weighting the women’s votes higher.

but anyway, big-name pollsters tend to be pretty good at their job, so you can probably trust that their numbers aren’t too far off. they can’t interview every single voter to get an exact percentage, so inevitably they won’t be exactly right, but they’ll probably be close.
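the weighting idea from the women-voters example can be sketched in a few lines. the 40% and 50% figures come from the text; the support counts, group labels, and function name are made up, and real pollsters weight on many more things than this, so treat it as the simplest possible version.

```python
def weighted_support(sample_sizes, supporters, population_shares):
    """Estimate overall support by weighting each group's support rate by
    that group's share of the actual electorate."""
    estimate = 0.0
    for group, population_share in population_shares.items():
        support_rate = supporters[group] / sample_sizes[group]
        estimate += population_share * support_rate
    return estimate

# 40% of the people polled are women, but 50% of the district's voters are.
sample_sizes = {"women": 400, "men": 600}
supporters_of_a = {"women": 240, "men": 270}  # made-up counts
population_shares = {"women": 0.5, "men": 0.5}

raw = sum(supporters_of_a.values()) / sum(sample_sizes.values())
weighted = weighted_support(sample_sizes, supporters_of_a, population_shares)
print(f"raw: {raw:.1%}, weighted: {weighted:.1%}")
```

with these made-up counts the raw tally gives Candidate A 51.0% and the weighted estimate gives 52.5%, because women (who favor A more in this example) were underrepresented in the sample.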

example 5: the basketball example from the first paragraph

“Devin Booker is shooting 4% better in the second half of the season than in the first.” if you hear a sports announcer say this, it is probably factually true. but they might follow up with something about how Booker has improved his game and will continue to shoot at this better rate. this is usually wrong or exaggerated. a 4% difference is the difference between going 175-for-350 in the first half and 189-for-350 in the second half. this is not nothing, but it’s well within the range of what can happen due to chance. and given that there are hundreds of NBA players, and several cutoff dates you could choose for first part vs. second part of the season, it’s not hard at all to find someone who’s improved by 4%, or even more than that.

and, as discussed before, you must consider the plausibility that Devin Booker has found the secret sauce that lets him improve his shooting by 4%. 4% is not trivial! NBA players spend years and years in the gym in search of even smaller improvements. maybe he improved by some smaller amount, but 4% is a lot.¹⁰
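here is a rough check of the “well within the range of what can happen due to chance” claim, using the 175-for-350 and 189-for-350 numbers from above. this is a sketch under simplifying assumptions: every shot is treated as an independent coin flip with the same underlying chance of going in, and the normal approximation is used. the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def chance_of_gap(made_1, shots_1, made_2, shots_2):
    """Approximate two-sided probability of seeing a gap at least this big
    between two shooting percentages if the true percentage never changed."""
    rate_1 = made_1 / shots_1
    rate_2 = made_2 / shots_2
    pooled = (made_1 + made_2) / (shots_1 + shots_2)
    # Standard error of the difference, treating shots as independent flips.
    std_err = sqrt(pooled * (1 - pooled) * (1 / shots_1 + 1 / shots_2))
    z = abs(rate_2 - rate_1) / std_err
    return z, 2 * (1 - NormalDist().cdf(z))

z_score, probability = chance_of_gap(175, 350, 189, 350)
print(f"gap is {z_score:.2f} standard errors; "
      f"chance of a gap this big from luck alone: {probability:.0%}")
```

under these assumptions the gap is only about one standard error, and a gap that size or bigger shows up from luck alone a bit under 30% of the time. and that is before accounting for hundreds of players and several possible cutoff dates.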

example 6: what if you’re some kind of web guy

one thing a lot of companies try to do is make more money. and since people use websites a lot, these companies try to design their website so it makes them a lot of money. and sometimes that means they hire a web guy to tell them which version of their site will make the most money. so the web guy might test two versions of the site, where one of them has a big button that says “Subscribe!” and the other one has a big button that says “Buy Now!”. and then after a month we see that the “Buy Now!” button gets 11% more people to click on it, which means the company is making 11% more money.

“well hold on a minute!” a smart web guy might think. this test is an imperfect proxy for whether the company actually makes more money! and anyway, 11% seems like a big effect, though not a completely unfathomable one. so the web guy might go through their data and code to check a few things:

were both the pages otherwise identical? did the page with the “Buy Now!” button also have some extra code that made the page take longer to load? or was the kerning fine on the “Subscribe!” button but messed up on the “Buy Now!” button?
was the version of the page people got correlated with some other information about them? was one version getting shown disproportionately to people from the UK? to IP addresses in a certain range (which might, say, contain mostly bot traffic)? was one shown disproportionately often during a certain time frame, which might line up with the company running a big promotion?
if people refreshed the site, or left it and revisited it, did they get the same version of the button each time? does it matter?
were all the page visits logged correctly? did some of the “Subscribe!” ones slip through the cracks, or get logged twice?
when people clicked “Buy Now!”, did they give the company money at the same rate that people who clicked “Subscribe!” did? did they cancel their subscriptions at a higher rate?
was the sample size large enough to ensure that the 11% figure isn’t just random chance?
was the Spanish version of the “Buy Now!” button translated correctly? was it included in the results?
was there some sort of math error in the analysis?
or some sort of coding error? what if you ran a script to remove duplicated log entries, but it accidentally removed some of the correct results too?

it is hard to ensure all of these things are done correctly, but that’s why this web guy gets paid the big bucks.
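one item on that checklist, whether a returning visitor keeps seeing the same button, is often handled by deriving the version from a stable visitor id instead of flipping a coin on every page load. a minimal sketch, assuming a string visitor id is available (the function name, experiment label, and id format are all hypothetical):

```python
import hashlib

VARIANTS = ("Subscribe!", "Buy Now!")

def assign_variant(visitor_id, experiment="button-text"):
    """Pick a button version that stays the same every time the same
    visitor id shows up, by hashing the id instead of using randomness."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(VARIANTS)
    return VARIANTS[bucket]

# The same visitor gets the same answer on every call.
first_visit = assign_variant("visitor-123")
second_visit = assign_variant("visitor-123")
print(first_visit, second_visit, first_visit == second_visit)
```

including the experiment label in the hash means a different test would shuffle visitors differently, so the same people are not always grouped together. this does not address the other checklist items, like logging or load time.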

general principles

this section feels a bit condescending to write, but i do need a summary because this post is too long to not have a summary. so here is the summary of what i think when i see a statistic.

the first thing is to consider how believable the claim would be, without considering the statistic at all. if it is not very believable then you should approach the claim with more skepticism.
if a statistic is just an objective measurement, then it is probably true, though sampling error might cause it to be a little off. if you are supposed to infer a cause and effect from it, the cause and effect very much might not be true.
if a cause and effect are implied, consider: if you heard the opposite effect is true, could you come up with a just-as-plausible explanation for that effect?
if someone collects a bunch of statistics but only reports one, consider why they only reported one. (it might not be nefarious, but sometimes it is.)
consider whether the size of the effect is plausible. if it is big then you must consider the cause as among the very most important causes! if it is really big, then you must consider the cause as the #1 factor. could it plausibly be as important as claimed?
just because the effect stated is implausibly large doesn’t mean there is no effect. you don’t really know the size of the effect. it could be no effect though. or, like, an effect size of 0.0000001.

and remember to say, “oh gee! a statistic!”

¹ not to mention a 3.8 or above in all my statistics-related classes in college.
² if you’re not, then frankly you should have had the wherewithal to skip this article knowing i have nothing to teach you.
³ you might find out it’s more like 31% or 32%, but that’s so close it’s probably not a big deal.
⁴ if there is a correlation between height and hat size, then you cannot “just add it”. if tall people like tall hats, your variance will be even higher than 25; if tall people like short hats, it will be lower than 25.
⁵ it might be confusing what this means, but imagine this: the best offense scores 85 points in home games on average, the worst offense about 75 points in home games on average, and the other teams are roughly evenly spread out within that range.
⁶ similarly, you might expect the best defense to give up 77 points in away games on average, the worst defense to give up about 83 points in away games on average, and the other teams to be roughly evenly spread out in that range.
⁷ if even days are 10 points more than odd days, and the split is half-and-half, then the even days are 5 points above average, and odd days are 5 points below average. so every game is 5 points away from the average, and 5 squared is 25.
⁸ you might get an explanation like, “it is high in antioxidants.” this is not a convincing explanation, because it fails to explain why every other food with antioxidants doesn’t have a similarly large effect, or why we aren’t all taking antioxidant pills every day to prevent cancer.
⁹ this can sort of be overcome by doing a “meta-analysis” and looking at the results of many different studies. meta-analyses still aren’t perfect, but they are better than single studies.
¹⁰ it’s not unfathomable that someone could improve their shooting by 4% or more (like, for example, Tyrese Maxey), especially if their shooting wasn’t amazing to begin with. but 4% is a lot. this might not be clear if you are not familiar with basketball, so, um… consider what might not be clear when you see a statistic about something else you’re not familiar with.