By James Kwak
To make a vast generalization, we live in a society where quantitative data are becoming more and more important. Some of this is because of the vast increase in the availability of data, which is itself largely due to computers. Some is because of the vast increase in the capacity to process data, which is also largely due to computers. Think about Hans Rosling’s TED Talks, or the rise of sabermetrics (the “Moneyball” phenomenon) not only in baseball but in many other sports, or the importance of standardized testing scores in K-12 education, or Karl Rove’s usage of data mining to identify likely supporters, or the FiveThirtyEight revolution in electoral forecasting, or the quantification of the financial markets, or zillions of other examples. I believe one of my professors has written a book about this phenomenon.
But this comes with a problem. We do not currently collect and scrub data that are good enough to support this recent fascination with numbers, and on top of that our brains are not wired to understand data. And when a lot is riding on bad data that is poorly understood, people will distort the data or find other ways to game the system to their advantage.
Readers of this blog will all be familiar with the phenomenon of rating subprime mortgage-backed securities and their structured offspring using data exclusively from a period of rising house prices — because those were the only data that were available. But the same issue crops up in many different stories covering different aspects of society.
CompStat, an approach to policing that focuses on tracking detailed crime metrics, was widely credited with helping New York and other cities reduce crime in the 1990s. Last year, This American Life ran a story, based on a police officer’s secret recordings, detailing how in at least one precinct officers were pressured to boost their numbers through dubious arrests and citations. The story also found another precinct where serious crimes were reported as less serious crimes in order to make the precinct’s numbers look better than they really were.
In a recent New York Times story, David Segal describes how law schools massage their metrics to score higher in the US News and World Report rankings. Segal focuses on the tricks that some schools seem to use to boost the number of graduates employed nine months after graduation; for example, some schools apparently hire their own graduates to temporary positions that happen to span the date on which employment rates are measured. The rankings are based on statistics that are defined by the American Bar Association but are self-reported by the schools and not audited by anyone.
The big, well-known example of how the importance of data breeds data manipulation is standardized testing. In the early days of the standardized testing boom, the key statistic was the percentage of students at or above grade level, defined as the fiftieth percentile on some standardized test. (For those wondering if this is circular, the scaled score required to be at the fiftieth percentile is set before the test based on the attributes of the questions included in the test; it is not set after the test based on students’ actual performance.) So one obvious tactic would be to focus on students in roughly the thirtieth to sixtieth percentiles while ignoring the others. Another, more problematic tactic would be to classify as many low-performing students as possible into special education so that they would not be in the denominator. (Then there is blatant cheating, like giving your students more time to take the test or simply correcting their answers afterward, which is possible because few if any school districts have the capacity or the motivation to oversee the tests rigorously; Freakonomics has a chapter on this.) Even leaving aside data manipulation issues, there is also the basic problem that test difficulty varies from year to year. The test in year N + 1 is calibrated to be the same difficulty as the test in year N, but this is all based on statistics, and there is this thing called random variation to deal with.
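To see why the denominator tactic works, here is a rough sketch of the arithmetic. The scores, the cutoff, and the function name `pct_at_or_above` are all invented for illustration; nothing here comes from a real test or district.

```python
def pct_at_or_above(scores, cutoff):
    """Percentage of tested students scoring at or above the cutoff."""
    if not scores:
        raise ValueError("no tested students")
    passing = sum(1 for s in scores if s >= cutoff)
    return 100.0 * passing / len(scores)


# Hypothetical scaled scores for ten students (made-up numbers).
scores = [31, 38, 42, 47, 49, 52, 55, 61, 68, 74]
CUTOFF = 50  # hypothetical "grade level" scaled score, fixed before the test

# Everyone counted: 5 of 10 students are at or above the cutoff.
honest = pct_at_or_above(scores, CUTOFF)

# Assumed manipulation: the three lowest scorers are reclassified and
# dropped from the tested pool, so 5 of the remaining 7 pass.
excluded = 3
gamed = pct_at_or_above(sorted(scores)[excluded:], CUTOFF)

print(f"all students counted: {honest:.1f}%")
print(f"lowest {excluded} excluded:   {gamed:.1f}%")
```

No student learned anything more in the second calculation; the headline number rose only because the pool of counted students shrank.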
And I recently read Natalie Obiko Pearson’s story in Bloomberg on the problems with greenhouse gas emissions data. Most of the numbers we read are self-reported by countries and the companies in those countries, and even if they are honest (a big if) they are “bottom up” estimates — based on how much fossil fuel is being consumed. But when scientists actually measure changes in greenhouse gases in the atmosphere, they get different results than predicted by the bottom-up estimates. And in all the examples cited in Bloomberg, actual atmospheric measurements are higher than bottom-up estimates. This could be because the article didn’t mention atmospheric measurements that were lower than predicted by official data. But it could also be because both the companies burning the fossil fuels and the countries aggregating the data have the same incentive to underreport: companies because it means they don’t have to buy as many carbon permits and countries because it means they can claim to be under their Kyoto Protocol targets.
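The difference between the two kinds of numbers can be sketched in a few lines. This is only an illustration under loud assumptions: the fuel quantities, the emission factors, the "measured" figure, and the function name `bottom_up_estimate` are all invented, and real inventories are far more involved.

```python
# Hypothetical self-reported fuel use (arbitrary units) and emission factors
# (tonnes of gas emitted per unit of fuel). All values are invented.
fuel_use = {"coal": 100.0, "oil": 50.0, "gas": 80.0}
emission_factor = {"coal": 2.0, "oil": 3.0, "gas": 1.5}


def bottom_up_estimate(fuel_use, emission_factor):
    """Sum of (fuel consumed x emission factor) over all fuel types."""
    return sum(amount * emission_factor[fuel] for fuel, amount in fuel_use.items())


reported = bottom_up_estimate(fuel_use, emission_factor)

# Hypothetical "top-down" figure inferred from atmospheric measurements.
measured = 540.0

# Positive gap: the atmosphere contains more than the paperwork predicts.
gap_pct = 100.0 * (measured - reported) / reported

print(f"bottom-up estimate: {reported:.0f} tonnes")
print(f"gap versus atmospheric measurement: {gap_pct:.1f}%")
```

The bottom-up figure is only as good as the reported fuel numbers that go into it, which is exactly where the incentive to underreport comes in.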
Greenhouse gases are a good example of how we think data will help save us — if we can track how much carbon dioxide each company is producing, we can make it pay for that carbon — but we may just not have good enough data. In general, I think the current trend toward using more and more data is a good thing. I mean, what’s the alternative: gut intuition? But this only increases the importance of having good data to begin with. And when some parties benefit from bad data, this can be a big challenge with no easy solution.