About a week ago, I was procrastinating on Twitter and came across a flurry of excitement about a new “benchmarking” study ostensibly showing that cash had more impact than an employment training program in Rwanda. This study came on the heels of another benchmarking study purporting to show that cash was better than a nutrition program, which was also in Rwanda.
I took a quick look at the two papers, noted that the employment program had failed to increase employment and the nutrition program had failed to improve nutrition, and tweeted in response that “cash has now outperformed two crap programs that didn’t work. So...don’t do crap programs that don’t work.” OK, that was pretty harsh, and I figured it was now incumbent upon me to do some homework, so I sat down and tried to read the two papers. I say “tried” because I only have 22 years of formal education, so a lot of it went right past me. However, I got enough out of the process to say this comfortably:
Cash benchmarking is a solution in search of a problem. And cash didn’t really perform better. And one-time unconditional cash transfers probably shouldn’t be used as a benchmark anyway.
I should say right up front that these are really well-done studies. They were done in good faith, and they surface important issues. I totally trust the numbers, and they provide valuable fodder for reflection and discussion. I happen to disagree with their conclusions.
So.
The fundamental problem in the social sector is not the lack of benchmarks, it’s the pervasive lack of evidence for impact across a broad range of programs and interventions. To even contemplate comparing with cash, you have to have a reasonable estimate of cost-effectiveness, and to get that you of course have to determine both impact and cost. The problem is that this doesn’t happen anywhere near often enough. Even if you’re a fan of benchmarking, you have to have something to benchmark against.
As a funder, I need to know what problem a program is trying to solve, what impact it had, and how much it cost to get that impact. It’s my job to decide whether I like that impact and cost or not, because value is a function of what someone is willing to pay. The problem, again, is that too often I can’t get that information. What I need is this:
- What the program is trying to accomplish, in simple, clear terms: “Get youth employed,” “reduce malnutrition in at-risk kids.”
- The basic metrics that will capture the degree to which that happened. Just a couple of the right things is far better than a confusing array of unranked measures.
- Good quality numbers that demonstrate a change (a big-enough sample size, good survey methods, right interval, all that stuff.)
- A persuasive counterfactual that reveals the true impact.
- Cost estimates that allow a credible calculation of the cost per unit impact.
As Dean Karlan talks about in his Goldilocks book, there are a lot of different ways to get to good estimates of impact, but the problem is that way too few programs even get close. In any case, once I have a credible measure of the cost of impact - say, the cost per additional youth employed - I don’t want to compare it to cash, I want to compare it to other employment programs!
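To make the arithmetic concrete, here’s a minimal sketch in Python with entirely made-up numbers (none of these figures come from the Rwanda studies or any real program): the impact is the treatment-control difference, and the number a funder actually needs is the total cost divided by that difference.

```python
# A minimal sketch of "cost per unit impact" with hypothetical numbers.
# Impact is the difference between treatment and control (the counterfactual),
# not the raw outcome in the treatment group.

total_program_cost = 500_000       # hypothetical: total program spend, USD
participants = 1_500               # hypothetical: youth in the treatment group

employment_rate_treatment = 0.42   # hypothetical: share employed at endline (treatment)
employment_rate_control = 0.38     # hypothetical: share employed at endline (control)

# Additional youth employed because of the program
additional_employed = (employment_rate_treatment - employment_rate_control) * participants

cost_per_participant = total_program_cost / participants
cost_per_additional_job = total_program_cost / additional_employed

print(f"Cost per participant:      ${cost_per_participant:,.0f}")     # ~$333
print(f"Additional youth employed: {additional_employed:,.0f}")       # ~60
print(f"Cost per additional job:   ${cost_per_additional_job:,.0f}")  # ~$8,333
```

The only point of the sketch is that the per-participant cost and the cost per unit of impact can be wildly different numbers, and it’s the second one that is comparable across employment programs.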
We don’t need benchmarking if we simply and consistently measure enough to judge these programs in terms of what the implementers said they were setting out to do. The title of the IPA report on the first study is “Benchmarking a Nutrition Program Against Cash Transfers in Rwanda.” The report is admirably clear on what problem the program set out to address:
“Rwanda has seen improvements in child nutrition in recent years, but...37 percent of children are anemic and 38 percent of children under 5 are stunted. Malnutrition rates are much higher in rural areas than in urban areas.”
This is a program to improve child nutrition and it should be judged as such. So did it? Nope. It had no effect on child growth, diet diversity, anemia, or consumption. In terms of the problem it set out to address, it was an utter failure and the real lesson should be “don’t do it again”. We don’t need to judge it against anything but what it explicitly set out to accomplish and in that sense it just plain failed.
As to benchmarking, when the researchers took the amount of money that the program cost - $142 - and just gave it to people, that had no effect on child nutrition either. Both failed! The program didn’t work, the cash didn’t work — what did we learn? There’s no useful “benchmarking” here. (The study did have an additional treatment group that got $500, which did result in some nutrition - and mortality - impact, but I’m not sure what comparing a $142 intervention with a $500 transfer is supposed to prove, and even then, there is plenty of reason to doubt the effect will last - see below.)
The situation with the employment study is similar. IPA’s report on the study is titled “Benchmarking Cash to an Employment Program in Rwanda.” The problem they set out to address was that “in Rwanda, about 35% of the youth population is neither employed, nor in training, nor in school.” The program intervention included training on 1) employment soft skills, 2) microenterprise start-up, and 3) microenterprise biz dev.
So how’d it do? Well, the program had no effect on employment, nor did it increase income. In terms of the problem it set out to address, it failed. Completely. We don’t need to benchmark it against anything but its own aspirations and against those, it was a bust. (These two programs really make me worry about the rest of USAID’s portfolio. Perhaps the best outcome of these papers is that they create momentum toward a broad review of the impact of USAID-funded projects and programs.)
Curiously, some have characterized the program as having “worked reasonably well,” because those who completed the program worked 3 more hours a week, had more assets, and saved more. OK, but assets and savings are derived from income, and it seems weird to call it success when people work a lot more (17%) with no increase in income. Maybe there’s something to build on here for a next iteration, but it’s behavior, not impact. The nutrition study was characterized as a partial success because the parents of no-less-malnourished kids had more savings. Yes, if you hurl a kitchen sink’s worth of unranked metrics into a study, you’re bound to find something positive.
Despite the failure of the employment training program, the researchers went ahead and compared it to a cash transfer roughly equivalent to the program’s cost (the cost was $330/participant and the cash transfer was $317). So what happened when they handed youth the equivalent of 160% of their annual income? Well, their income went up (yes, you gave them a bunch of money), consumption went up (yes, they spent the money), assets rose (yes, some of that money was spent on assets). What didn’t happen was a job, and if the cash had any effect on earned income, I couldn’t find it in the paper.
Don’t get me wrong, I’d rather get a bunch of cash than sit in a training program that didn’t get me a job. I’d consume, I’d buy some assets, and for a while, at least, I’d be way better off. But ultimately I’d much rather have a program that actually worked — I’d rather have a stable income (and I think that it’s bogus to refer to a short-term increase in income - i.e. the cash you just handed me - as “impact”). The autonomy and choice inherent in cash is the same whether given or earned, and earned income suggests the capacity to generate more of it over time.
Moreover, there’s no real evidence that a one-time unconditional cash transfer (UCT) with no accompanying intervention (which describes the transfers in these two studies - I’ll call them “isolated UCTs”) creates lasting impact equal to the cost of the transfer. It’s not even close. In the employment paper, the authors say the evidence of durable impact from cash transfers is “mixed,” but of the five references provided, three are multiple transfers over a lengthy period (and two of those are conditional), and the other two aren’t cash - they’re cows and food stamps, respectively. They’re mostly irrelevant in this context.
The paper does reference the excellent Blattman et al 9-year study of cash transfers to youth groups in Uganda that did show a 38% increase in income (off a base of $400) at four years out (yay!). However, unlike the isolated UCTs in the benchmarking studies - giving random people a grant out of the blue - 1) the youth in that study had gone through group formation, rudimentary business planning, and a selection process based on a review of that planning, and 2) the impact had dissipated 9 years out (super-impressive that they studied it that long).
Ephemeral impact is often worse than nothing, and the goal of development programs should be impact that is sustained over time. At the very least, if we’re going to use a benchmark, it ought to be something that creates durable impact. And the only study I can find that investigates the long-term impact of the same kind of cash transfer done in the benchmarking studies is the three-year follow-up of GiveDirectly’s isolated UCTs in western Kenya. (Laura and I wrote a piece in SSIR after the short-term study arguing that what really mattered is whether the UCT had significant long-term impact.) They found that, on average, families were not much better off than control subjects, except for having more assets (and that’s $400 worth of assets in the wake of a $700 cash transfer: it doesn’t even break even in terms of durable impact.) The broad short-term “impact” didn’t last, and a family’s trajectory wasn’t altered. (There was a pretty vigorous debate over this interpretation. I go with Berk Ozler’s take, but even if you’re more impressed than I am, it’s still pretty weird to base a whole benchmarking movement on one disputed study.) I wish I were more surprised that this disappointing long-term study created about 0.01% of the hoopla that surrounded the original short-term study.
So that’s it. These two benchmarking studies used a “benchmark” that has zero evidence of durable impact commensurate with the original investment. I don’t think we should be using isolated one-time unconditional cash transfers to benchmark anything. Perhaps we shouldn’t even do them until we get more evidence that they accomplish something lasting. It’s nice to give a man an isolated UCT - or a fish, for that matter - but if you’re a funder on the hunt for lasting change, cash benchmarking is not the tool you need.
Joaquin - wow, I love this, and it is nice to learn from someone who has grappled with USAID - and evaluations - much more than I have. This comment thread is still a suboptimal way to discuss, but it’s better than Twitter, so:
Your thoughts on how USAID might go about this in your paragraphs 2 and 3 are music to my ears.
Paragraph 4: I didn’t realize that some areas of USAID programming - health, as you mention - are much better than others. I do think that USAID is doomed in terms of real change until it begins to integrate the funding of good interventions and good organizations committed to taking them to scale (and that, of course, includes evaluations). USAID could do so much more to ensure that really good stuff scales to achieve its potential.
Your next point about cash - that it represents a one-time investment that mimics the one-time investment in a project - makes sense, except for the durable impact piece. I simply don’t think we should do anything in development - as opposed to humanitarian aid - unless we can make at least a theoretical case for lasting impact. Cash is a good comparator if you’re only looking at short-term impact, but I still don’t think there’s a persuasive case for lasting impact. The hope in a decent employment program is that you’ve armed some participants with the skills to keep getting employed, and if so, cash can’t match that.
I don’t think we disagree at all on the need to measure multiple things - it’s just that in the end, some matter way more than others. Ultimately, we - or at least One Acre Fund, if they want to get better at what they do - need to know all the things that you list. However, their mission is to “get farmers out of extreme poverty,” so the make-or-break metric is profit. Nothing else matters unless farmer income increases - not yield, not anything. We go to great lengths to determine the real mission of an intervention - the eight-word mission statement. We’d argue that the point of employment - job or business - is income, so that would be the make-or-break metric. I don’t think there is any justification for calling something a partial success if impact in terms of the key metric = zero.
I don’t think that cash does harm, and I think it is always better than nothing, especially if getting that nothing involves a person’s investment of time and effort into an ineffectual program. I’m curious, though: what if the programs had shown a little bit of impact? What then? In the case of nutrition, the cash had zero effect on nutrition, but way more overall short-term benefit. Does cash have to answer to the same mission - and would this mean that the program “won”? Or do we somehow try to compare apples and oranges, which seems like a mess? The only plausible benchmark here is to compare the nutrition program to another one that did succeed. If you “don’t have good evidence about existing programs,” then your all-out priority should be to get some. In almost any sector, there is somebody somewhere doing a great job.
Finally, I so agree with your overall message and approach, and I’d add one thing - you mention five-year projects, and I gather that while there are midline measures, there’s not much ongoing iteration. I tell our fellows that you should never go into an RCT unless you already know the answer from your own well-designed systems. It really is inexcusable to take five years to find out that you didn’t even have short-term benefits. While I’ve seen a lot of comments from the evaluator community that organizations that self-measure are always wrong, we’ve seen that well-designed systems can get a valid sense of impact. Granted, it’s often a bit less than what is determined by an outside evaluator, but it’s rarely that meaningful a difference.
Probably more than you asked for, but fun to dive into - thanks.