What crime statistics, standardized tests, and scientific researchers have in common

by Alan Cohen

I had thought about calling this post “Cohen’s law for predicting distortions in incentivized systems.” Tongue-in-cheek of course – it’s approximate, and thus not really a law. And I don’t like the self-aggrandizing habit of naming a law after oneself. And it would have been a dry title, and you probably wouldn’t be reading this. Nonetheless, this post is about the single most important thing that everyone designing public policy should understand. It is about the principle that makes most public policy fail (or work less well than intended).

Most public policy is designed to achieve certain goals – lowering crime, improving education, advancing scientific knowledge, improving health care, etc. And most of the time, these goals are pursued by trying to get the right people to do the right things: police to arrest criminals, teachers to teach well, researchers to perform well, doctors to treat patients well, etc. In order to encourage this, most policy incorporates some form of incentives: tax structure, salary scales, rewards for good performance, and so forth. Police departments are judged by their crime statistics, and in turn find ways to pressure their officers to deliver these stats. In US education policy, No Child Left Behind was supposed to implement standards to encourage schools and teachers to perform better. Researchers who are productive are more likely to get funded for their next research grant. And so forth.

What all of these examples have in common is that the incentives are delivered not for actual good performance, but based on metrics (proxies) of performance. And in each case, this encourages the people involved to work hard to perform well on the metrics, even at the expense of performing worse on the underlying goals.

For crime, the familiar example to many readers will be the TV series The Wire, which clearly demonstrates how the need for police to perform well on statistics often impedes them from doing actual good police work: let’s classify that death as accidental because it will be hard to solve as a homicide. Let’s arrest lots of low-level drug dealers on the corners rather than their bosses because our arrest stats go up. Etc. And this has been well documented in non-fictional police departments as well. There is an underlying goal: reduce crime. But the ways we measure this – number of cases solved, number of arrests made, etc. – are crude, and are subject to manipulation by individual officers and departments. In the end, the use of these metrics probably becomes a bigger hindrance to crime fighting than a help.

For standardized tests and education policy, the recent scandal of organized cheating throughout the Atlanta school district is symptomatic of – and predictable from – a very similar situation. School districts are under enormous financial pressure to perform well. School districts have some control over test outcomes – not just through cheating, but also through teaching to the test rather than teaching to improve true educational attainment. So voilà: an education system that fails to deliver improved education.

In terms of scientific research, scientists applying for funding are evaluated based on past research productivity, including number of publications, measures of the prestige of the journals in which these publications appeared, dollars of grants obtained, and so forth. As above, when the academic post of a researcher is contingent upon attaining large amounts of funding, the researchers tailor their projects and strategies to play the game – i.e., inflate their stats – rather than just to conduct good research. For example, perhaps I have a study that should be published as one article, but I see a way to break it into two in order to augment my number of publications. This is normal, and most researchers do this to some extent. At the extreme, a small (but still too large) number of researchers fake data, ignore contrary findings, and otherwise play fast and loose with the truth in order to win at the game. All of which has an enormous cost to society in terms of wasted money, false scientific leads, and slower research progress.

This shared principle can be summed up with a pseudo-equation (an equation which illustrates a principle, but is not meant to work with actual values or calculations):

D = E(m) × M × (I − I*)

D is the overall distortion we can expect, that is, how far away the result of our policy will be from what we want. It is the product of three quantities. First, E(m) is the error of the metric(s). The better the metrics are at measuring the underlying goal, the less distortion, and vice versa. How well do crime stats measure the true state of crime in a city? How well do standardized tests measure true learning? How well do scientific productivity metrics measure true innovation and contribution to a field? The less well these things are measured, the greater the consequences of using them as proxies for performance. When individuals try to achieve on the metrics rather than on the underlying goal, they will be farther from the goal if the metrics are imprecise. But the metrics don’t have to be too bad for there to be major distortions: scientific productivity metrics are not awful, but are still imprecise enough to introduce substantial problems into the system.

Second, M is the manipulability of the metric. If the players (police departments, police officers, school districts, scientific researchers, etc) have no control over the metric, there is no problem. For example, crime statistics may not in and of themselves be terribly imprecise as a measure of overall crime rates. The problem is that they are highly manipulable (and, once manipulated, become very poor proxies). It is very easy to adjust police department policy to improve the appearance of the stats without changing the underlying reality.

Third is the difference between I and I*, where I is the incentives for the players to follow the metrics, as designed in the system, and I* is the incentives of the players to do a truly good job regardless of the other incentives. Many police officers want to fight crime effectively and won’t let the stats deter them too much from following their conscience. Ditto for teachers and researchers. This idealism is I*. But the stronger the pressure (I) becomes to meet the metrics (losing one’s job, for example) the less the players can follow their consciences (I*). As long as I* is greater than or equal to I, players will naturally do the right thing, and there is no distortion – so in the pseudo-equation, the difference (I − I*) should be read as floored at zero rather than going negative.

This equation may not be designed for actual calculations, but it yields a number of quick insights. To start with, all three of these criteria are necessary for a distortion. If any one is zero (absent), there is no distortion. The metrics must be imprecise and manipulable by the players, and the players must be under pressure to perform based on the metrics.
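To make the structure of the pseudo-equation concrete, here is a minimal Python sketch. The function name, the 0-to-1 scales, and the numbers plugged in are all my illustrative assumptions; as noted above, the equation is not meant for real calculations, only to show how the three factors multiply.

```python
def distortion(metric_error, manipulability, metric_pressure, intrinsic_motivation):
    """Toy version of D = E(m) x M x (I - I*).

    All inputs are assumed to lie on an arbitrary 0-to-1 scale.
    The incentive gap (I - I*) is floored at zero, since the argument
    says there is no distortion when idealism I* meets or exceeds
    the pressure I to perform on the metrics."""
    incentive_gap = max(0.0, metric_pressure - intrinsic_motivation)
    return metric_error * manipulability * incentive_gap

# If any one factor is absent, there is no distortion:
assert distortion(0.0, 0.8, 0.9, 0.1) == 0.0  # perfect metric
assert distortion(0.5, 0.0, 0.9, 0.1) == 0.0  # unmanipulable metric
assert distortion(0.5, 0.8, 0.3, 0.5) == 0.0  # idealism outweighs pressure

# All three present: distortion grows with each factor.
print(distortion(0.5, 0.8, 0.9, 0.1))
```

The flooring of the incentive gap encodes the point made above: strengthening incentives only distorts behavior once the pressure to hit the metrics exceeds people’s intrinsic motivation to do the job well.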

This insight leads to suggestions for how to design policies to avoid distortions. I* is critical – it is the tendency of most people to want to do their jobs well (especially professionals such as police officers, teachers, researchers, and doctors) even in the absence of strong financial or career incentives. Not all individuals will try to excel, and some weak level of I may help the system along, but it is critical that I not become so strong that it overwhelms I*. For example, researchers who are not funded will lose their jobs, and funding is scarce enough that this is a real fear. This fear makes I very large for scientific research. However, if we were able to guarantee a stable career to most researchers (based on maintaining just a minimum level of productivity), most of these researchers would continue to produce good research anyway, because of I*: they are researchers because they love what they do. If I were weaker, the imprecision and manipulability of the metrics wouldn’t matter much.

In addition to making sure that incentives are not too strong, we can try to reduce both E(m) and M. E(m) can be reduced by finding better metrics, and more of them. For example, in evaluating teacher performance, student test scores are probably not great on their own. However, integrating test scores, student and parent evaluations, and peer evaluations should be much more reliable than test scores alone. (If I’m not mistaken, research has shown that peer evaluations of teachers are remarkably reliable and consistent when ranks, rather than absolute scores, are used.) And unlike student test scores, evaluations also have the advantage of being essentially unmanipulable – that is, low M.

Great harm has come from trying to create incentive systems that in the end become highly distorted. The principles outlined here should be relatively straightforward to apply to avoid these problems (even if some systems can never be perfect). One of the big challenges and applications for this approach is the incentives for doctors to treat patients well and cheaply. In this case, I* is very large, but can nonetheless be overcome at least partially by financial incentives to provide more care (as is often the case in the US). Some metrics of doctor performance have been successfully implemented to improve patient outcomes, particularly in the UK, but there is substantial debate about whether these are truly improvements and whether these incentives stop doctors from using their clinical judgement appropriately on a case-by-case basis. The complexity of medical incentives would seem to me to indicate caution, and a greater reliance on I*…

Thoughts? Experiences? As always, share in the comments!