
Cash benchmarking: A solution in search of a problem

The problem is a pervasive lack of impact evidence. Benchmarking is beside the point.

by Kevin Starr

Sep 29, 2020


©Cartier Philanthropy/Andrea Borgarello


About a week ago, I was procrastinating on Twitter and came across a flurry of excitement about a new “benchmarking” study ostensibly showing that cash had more impact than an employment training program in Rwanda. This study came on the heels of another benchmarking study purporting to show that cash was better than a nutrition program, which was also in Rwanda.

I took a quick look at the two papers, noted that the employment program had failed to increase employment and the nutrition program had failed to improve nutrition, and tweeted in response that “cash has now outperformed two crap programs that didn’t work. So...don’t do crap programs that don’t work.” OK, that was pretty harsh, and I figured it was now incumbent upon me to do some homework, so I sat down and tried to read the two papers. I say “tried” because I only have 22 years of formal education, so a lot of it went right past me. However, I got enough out of the process to say this comfortably:

Cash benchmarking is a solution in search of a problem. And cash didn’t really perform better. And one-time unconditional cash transfers probably shouldn’t be used as a benchmark anyway.

I should say right up front that these are really well-done studies. They were done in good faith, and they surface important issues. I totally trust the numbers, and they provide valuable fodder for reflection and discussion. I happen to disagree with their conclusions.

So.

The fundamental problem in the social sector is not the lack of benchmarks, it’s the pervasive lack of evidence for impact across a broad range of programs and interventions. To even contemplate comparing with cash, you have to have a reasonable estimate of cost-effectiveness, and to get that you of course have to determine both impact and cost. The problem is that this doesn’t happen remotely close to enough. Even if you’re a fan of benchmarking, you have to have something to benchmark against.

As a funder, I need to know what problem a program is trying to solve, what impact it had, and how much it cost to get that impact. It’s my job to decide whether I like that impact and cost or not, because value is a function of what someone is willing to pay. The problem, again, is that too often I can’t get that information. What I need is this:

  • What the program is trying to accomplish, in simple, clear terms: “Get youth employed,” “reduce malnutrition in at-risk kids.”
  • The basic metrics that will capture the degree to which that happened. Just a couple of the right things is far better than a confusing array of unranked measures.
  • Good quality numbers that demonstrate a change (a big-enough sample size, good survey methods, right interval, all that stuff).
  • A persuasive counterfactual that reveals the true impact.
  • Cost estimates that allow a credible calculation of the cost per unit impact.

As Dean Karlan talks about in his Goldilocks book, there are a lot of different ways to get to good estimates of impact, but the problem is that way too few programs even get close. In any case, once I have a credible measure of the cost of impact - say, the cost per additional youth employed - I don’t want to compare it to cash, I want to compare it to other employment programs!
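(To make that arithmetic concrete, here’s a minimal sketch of the calculation I’m describing - the cost, sample size, and employment rates below are made-up numbers for illustration, not figures from either study.)

```python
# Minimal sketch of a cost-per-unit-impact calculation.
# All numbers are hypothetical, purely for illustration.

program_cost_per_participant = 330   # total delivery cost per participant (USD)
participants = 1000

employed_treatment = 0.45            # share employed at endline, program group
employed_control = 0.40              # share employed at endline, comparison group

# The counterfactual is what the comparison group tells us would have happened anyway.
additional_jobs = participants * (employed_treatment - employed_control)
total_cost = program_cost_per_participant * participants

cost_per_additional_job = total_cost / additional_jobs
print(f"Cost per additional youth employed: ${cost_per_additional_job:,.0f}")
# -> $6,600 in this made-up example. That is the number to compare
#    against other employment programs, not against a cash transfer.
```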

We don’t need benchmarking if we simply and consistently measure enough to judge these programs in terms of what the implementers said they were setting out to do. The title of the IPA report on the first study is “Benchmarking a Nutrition Program Against Cash Transfers in Rwanda.” The report is admirably clear on what problem the program set out to address:

“Rwanda has seen improvements in child nutrition in recent years, but...37 percent of children are anemic and 38 percent of children under 5 are stunted. Malnutrition rates are much higher in rural areas than in urban areas.”

This is a program to improve child nutrition and it should be judged as such. So did it? Nope. It had no effect on child growth, diet diversity, anemia, or consumption. In terms of the problem it set out to address, it was an utter failure and the real lesson should be “don’t do it again”. We don’t need to judge it against anything but what it explicitly set out to accomplish and in that sense it just plain failed.

As to benchmarking, when the researchers took the amount of money that the program cost - $142 - and just gave it to people, that had no effect on child nutrition either. Both failed! The program didn’t work, the cash didn’t work — what did we learn? There’s no useful “benchmarking” here. (The study did have an additional treatment group that got $500, which did result in some nutrition - and mortality - impact, but I’m not sure what comparing a $142 intervention with a $500 transfer is supposed to prove, and even then, there is plenty of reason to doubt the effect will last - see below.)

The situation with the employment study is similar. IPA’s report on the study is titled “Benchmarking Cash to an Employment Program in Rwanda.” The problem they set out to address was that “in Rwanda, about 35% of the youth population is neither employed, nor in training, nor in school.” The program intervention included training on 1) employment soft skills, 2) microenterprise start-up, and 3) microenterprise biz dev.

So how’d it do? Well, the program had no effect on employment, nor did it increase income. In terms of the problem it set out to address, it failed. Completely. We don’t need to benchmark it against anything but its own aspirations and against those, it was a bust. (These two programs really make me worry about the rest of USAID’s portfolio. Perhaps the best outcome of these papers is that they create momentum toward a broad review of the impact of USAID-funded projects and programs.)

Curiously, some have characterized the program as having “worked reasonably well,” because those who completed the program worked 3 more hours a week, had more assets, and saved more. OK, but assets and savings are derived from income, and it seems weird to call it success when people work a lot more (17%) with no increase in income. Maybe there’s something to build on here for a next iteration, but it’s behavior, not impact. The nutrition study was characterized as a partial success because the parents of no-less-malnourished kids had more savings. Yes, if you hurl a kitchen sink’s worth of unranked metrics into a study, you’re bound to find something positive.

Despite the failure of the employment training program, the researchers went ahead and compared it to a cash transfer roughly equivalent to the program’s cost (the cost was $330/participant and the cash transfer was $317). So what happened when they handed youth the equivalent of 160% of their annual income? Well, their income went up (yes, you gave them a bunch of money), consumption went up (yes, they spent the money), assets rose (yes, some of that money was spent on assets). What didn’t happen was a job, and if the cash had any effect on earned income, I couldn’t find it in the paper.

Don’t get me wrong, I’d rather get a bunch of cash than sit in a training program that didn’t get me a job. I’d consume, I’d buy some assets, and for a while, at least, I’d be way better off. But ultimately I’d much rather have a program that actually worked — I’d rather have a stable income (and I think that it’s bogus to refer to a short-term increase in income - i.e. the cash you just handed me - as “impact”). The autonomy and choice inherent in cash is the same whether given or earned, and earned income suggests the capacity to generate more of it over time.

Moreover, there’s no real evidence that a one-time unconditional cash transfer (UCT) with no accompanying intervention (which describes the transfers in these two studies - I’ll call them “isolated UCTs”) creates lasting impact equal to the cost of the transfer. It’s not even close. In the employment paper, the authors say the evidence of durable impact from cash transfers is “mixed,” but of the five references provided, three are multiple transfers over a lengthy period (and two of those are conditional), and the other two aren’t cash - they’re cows and food stamps, respectively. They’re mostly irrelevant in this context.

The paper does reference the excellent Blattman et al 9-year study of cash transfers to youth groups in Uganda that did show a 38% increase in income (off a base of $400) at four years out (yay!). However, unlike the isolated UCTs in the benchmarking studies - giving random people a grant out of the blue - 1) the youth in this study had gone through group formation, rudimentary business planning, and a selection process based on a review of that planning, and 2) the impact had dissipated 9 years out (super-impressive that they studied it that long).

Ephemeral impact is often worse than nothing, and the goal of development programs should be impact that is sustained over time. At the very least, if we’re going to use a benchmark, it ought to be something that creates durable impact. And the only study I can find that investigates the long-term impact of the same kind of cash transfer done in the benchmarking studies is the three-year follow-up of GiveDirectly’s isolated UCTs in western Kenya. (Laura and I wrote a piece in SSIR after the short-term study arguing that what really mattered is whether the UCT had significant long-term impact.) They found that, on average, families were not much better off than control subjects, except for having more assets (and that’s $400 worth of assets in the wake of a $700 cash transfer: it doesn’t even break even in terms of durable impact). The broad short-term “impact” didn’t last; a family’s trajectory wasn’t altered. (There was a pretty vigorous debate over this interpretation. I go with Berk Ozler’s take, but even if you’re more impressed than I am, it’s still pretty weird to base a whole benchmarking movement on one disputed study.) I wish I were more surprised that this disappointing long-term study created about 0.01% of the hoopla that surrounded the original short-term study.

So that’s it. These two benchmarking studies used a “benchmark” that has zero evidence of durable impact commensurate with the original investment. I don’t think we should be using isolated one-time unconditional cash transfers to benchmark anything. Perhaps we shouldn’t even do them until we get more evidence that they accomplish something lasting. It’s nice to give a man an isolated UCT - or a fish, for that matter - but if you’re a funder on the hunt for lasting change, cash benchmarking is not the tool you need.

Comments

Kevin Starr

Joaquin - wow, I love this, and it is nice to learn from someone who has grappled with USAID - and evaluations - much more than I have. This comments string is still a suboptimal way to discuss, but it’s better than twitter, so:

Your thoughts on how USAID might go about this in your paragraphs 2 and 3 are music to my ears.

Paragraph 4: I didn’t realize that some areas of USAID programming - health, as you mention - are much better than others. I do think that USAID is doomed in terms of real change until it begins to integrate the funding of good interventions and good organizations committed to taking them to scale (and that, of course, includes evaluations). USAID could do so much more to ensure that really good stuff scales to achieve its potential.

Your next point about cash - that it represents a one-time investment that mimics the one-time investment in a project - makes sense, but for the durable impact piece. I simply don’t think we should do anything in development - as opposed to humanitarian aid - unless we can make at least a theoretical case for lasting impact. Cash is a good comparator if you’re only looking at short-term impact, but I still don’t think there’s a persuasive case for lasting impact. The hope in a decent employment program is that you’ve armed some with the skills to continue getting employed, and if so, cash can’t match it.

I don’t think we disagree at all on the need to measure multiple things - it’s just that in the end, some matter way more than others. Ultimately, we - or at least One Acre Fund, if they want to get better at what they do - need to know all the things that you list. However, their mission is to “Get farmers out of extreme poverty,” so the make-or-break metric is profit. Nothing else matters unless farmer income increases - not yield, not anything. We go to great lengths to determine the real mission of an intervention - the eight-word mission statement. We’d argue that the point of employment - job or business - is income, so that would be the make-or-break metric. I don’t think there is any justification for calling something a partial success if impact in terms of the key metric = zero.

I don’t think that cash does harm, and I think it is always better than nothing, especially if getting that nothing involves a person’s investment of time and effort into an ineffectual program. I’m curious, though: what if the programs had shown a little bit of impact? What then? In the case of nutrition, the cash had zero effect on nutrition, but way more overall short-term benefit. Does cash have to answer to the same mission - would this mean that the program “won”? Or do we somehow try to compare apples and oranges, which seems like a mess? The only plausible benchmark here is to compare the nutrition program to another one that did succeed. If “you don’t have good evidence about existing programs,” then your all-out priority should be to get some. In almost any sector, there is somebody somewhere doing a great job.

Finally, I so agree with your overall message and approach, and I’d add one thing - you mention five-year projects, and I gather that while there are midline measures, there’s not much ongoing iteration. I tell our fellows that you should never go into an RCT unless you already know the answer from your own well-designed systems. It really is inexcusable to take five years to find out that you didn’t even have short-term benefits. While I’ve seen a lot of comments from the evaluator community that organizations that self-measure are always wrong, we’ve seen that well-designed systems can get a valid sense of impact. Granted, it’s often a bit less than what is determined by an outside evaluator, but it’s rarely that meaningful a difference.

Probably more than you asked for, but fun to dive into - thanks.

Joaquin Carbonell (replying to Kevin Starr)

This is great Kevin! I really appreciate your deep engagement and thoughtful responses. I have already commanded a lot of your time (and I know you're a busy person!) but suffice it to say we basically agree about how you'd judge the success of an intervention on its own merits. So I'll just share a couple more reflections about the institutional context at USAID in response to your comments.

I couldn't agree more about the need for piloting, regular testing, and iteration before evaluation. If I could criticize myself in my time at DIV, I would say I pushed some of our grantees towards RCTs before they'd really figured out their operational model. Another problem at USAID, however, is that programs are trying to spend money fast and reach as many 'beneficiaries' as possible before they've really figured out the most effective/efficient approach to delivering their intervention.

But the main problem, and the goal of this work, is not to promote cash transfers but to push for more rigorous evaluation within the agency. The reason why relates to your reservation about cash as a comparator to typical programs. You make the very sensible point that USAID shouldn't do things that don't at least theoretically improve long-term outcomes for people in developing countries. The problem at USAID is that most program officers mistake the THEORETICAL case for impact for ACTUAL impact. This leads to a pervasive attitude of "why evaluate something we know theoretically works?". I know it sounds crazy, but I can't tell you how many times I was told exactly that by people in USAID missions around the world when we were trying to set benchmarking studies up (I pitched about 20-25 missions).

Even once the results of the nutrition study came in, there was a kind of magical thinking on the part of many USAID staff that somehow the benefits that failed to materialize in the short term would appear in the long term. But if your income, health, dietary diversity, or child anthropometrics don't improve within 10 months of the completion of a program that will not provide any services to you in the future, I struggle to see how those outcomes will improve in the long term, especially given what we know about the importance of the first 1000 days of a child's life (the time period targeted by the intervention). Indeed, if rigorous evidence of long-term impacts is rare in international development, programs that have no short-term impact but somehow achieve long-term impacts would be even rarer!

So we really wanted to fight the 'received wisdom' at the agency that USAID programs are, as a general proposition, effective. There is little to no evidence to suggest they are because USAID does not rigorously evaluate its own programs. And the evidence that USAID does generate is often suspect. Akazi Kanoze, the predecessor program to Huguka Dukore (the employment program from the second study), was "rigorously evaluated". EDC did an internal RCT suggesting that the program increased employment relative to non-program-recipients - an RCT run by the organization itself. Then the "performance evaluation" that was essentially contemporaneous with the cash benchmarking study suggested that the program was improving employment outcomes.

But the independent and more rigorous evaluation of the cash-benchmarking trial showed it to be ineffective in improving employment or income! USAID does LOTS of 'internal' and 'performance' evaluation, but almost NO independent evaluation. These studies hopefully show why this kind of evaluation is necessary, cash-counterfactual or not!

I share all of this to paint a picture of just how much the culture and existing practice of generating/using evidence at the agency is stacked against a) rigorous and independent evaluation, and b) using evidence to CHANGE programming.

Thanks again for engaging on this. I hope these reflections are helpful!

Joaquin Carbonell

Terrific piece Kevin! Compiling some of my twitter comments into a hopefully more coherent form here. I'd love to hear your thoughts in response to some of the points I raise, which are really more about the institutional context of USAID, but overall I think you present some extremely valid criticisms of cash benchmarking.

Starting with a big area of agreement - I think you outline what I WISH the agency would do when it comes to cost-effectiveness analysis:
1. In each programming domain, evaluate against the best-in-class programs - ones that increase impact the most in terms of our outcomes of interest, relative to their costs
2. Focus funding on best-in-class interventions, but continue testing new approaches against them
If you think of USAID's programming domains (health, agriculture, education, economic growth etc.) as being somewhat fixed by congressional appropriations, there is still a BIG decision about what to do with funds within each domain. The political/organizational calculus of determining what to fund is certainly complex, but I think we agree that in a perfect world, you'd ask "what is the best known way to improve employment and earnings for youth?" And then have that represent a large fraction of your workforce domain programming. Then you'd test new/alternative approaches against THAT.

IF the Agency did this, then there would be no need for cash benchmarking. But the reality of evaluation at USAID is that the Agency generates very little evidence about its own programs (about 13 impact evaluations per year on $20 billion of programming), most of the impact evaluations are of poor quality, and these evaluations mostly indicate that the USAID programs have no impact on our main outcomes of interest. You're absolutely RIGHT to be concerned about programming quality writ large at USAID. I would however note that USAID is not a monolith. Global Health does a ton of excellent, evidence-based programming. But there's a lot of riff-raff in nutrition, agriculture and workforce programming - the domains of the first two benchmarking studies.

So is cash-benchmarking the silver-bullet solution to USAID's evidence woes? Of course not, but speaking as one of the cogs in the USAID machine that helped set up these two studies, I outline 3 categories of explanations for why I think the approach is useful: Practical, Theoretical, and Political

Practical motivation:
If you're running two programs head-to-head in a randomized controlled trial, the coordination challenges between the programs are immense. Cash is more operationally practical to run alongside a program with vast management and operational infrastructure, like the programs from the first two studies (and it was still really hard).
If most programs have NO or UNKNOWN impact, it actually makes sense to compare them to something we know has SOME impact - even if that impact is not large or sustainable, as in the case of cash - until you've got a better evidence base to do as you suggest (best-in-class program vs. program X). Cash is one of the MOST evaluated interventions and has been evaluated in a variety of domains. (https://www.odi.org/publications/10505-cash-transfers-what-does-evidence-say-rigorous-review-impacts-and-role-design-and-implementation).

Theoretical motivation:
A first point about cash as a theoretical comparison intervention - as you point out, there are relatively few long-term studies of large, lump-sum, stand-alone cash transfers. So maybe we have less confidence in that type of cash intervention. But it actually mimics the type of investment that USAID makes with many (certainly not all or even most) of its programs - we spin up a massive organizational infrastructure, we use it to deliver a one-time intervention that we hope will sustainably improve people's lives, then we move on to a different set of people or a different program. So comparing a one-time intervention to cash makes some sense (again, not for all programs).

A second point on which we may disagree - it may make sense to compare programs on more than one outcome. I agree that a list of 30 unprioritized outcomes is not useful at all, but consider the employment program: why do we try to improve employment? To improve income. Taking an example from Mulago's early portfolio and about which you wrote a great SSIR piece: why does One Acre Fund try to improve yields for smallholder farmers? To improve income NET of costs. So 1AF measures more than just yields: they measure input costs+financing, time-cost of training, yields, farmer income, and ultimately farmer profitability. So we would want connected indicators of well-being to improve lest we, in the case of the employment program, push everyone into crappy jobs and call it a day.

So even if cash is a low bar in many cases, it at least provides short-term benefit across a broad range of indicators of well-being. Recall that most rigorously evaluated USAID programs produce no such short-term improvement, let alone SUSTAINABLE improvements which are basically never measured. I echo Michael's questions - what is the appropriate benchmark? It's a tough question, particularly if you don't have good evidence about existing programs.

Political Motivation:
The politics of evaluation at USAID are such that studies are mostly ignored, particularly when they do not show programs in a favorable light. We knew that the "buzz" around cash (for better or worse) would draw attention to these studies and make it hard to ignore. So the (my) hope was that these studies would draw attention to hard questions that the agency has been unwilling to confront:
- Do our programs improve people's lives?
- At what cost are people's lives improved?
- Can we do better than what we're currently doing?

The point was certainly not for cash to "win". The point was to say: if you can't do better than just giving people money, think about WHY that is and DO something about it. Fund proven solutions, test promising models, but above all else, learn & IMPROVE, then PROVE you're improving. The business of running $20m programs that don't do shit, then doing another 5 years of the same program with a few new bells & whistles... that's gotta change. It is hard to describe such a large organization in broad-brush strokes, but this is a sadly common occurrence at the agency even if reasonable people disagree on the extent of its prevalence.

There's nothing wrong with failure; this is HARD work. But failing to learn is unacceptable, and that's what the agency's been doing in so many cases. So maybe cash benchmarking is not the right approach to improving programming quality in the long term. But it (hopefully) increased the temperature around the status quo of evidence and cost-effectiveness of USAID programs, which are in desperate need of improvement.

Thanks for bringing attention to these studies and raising important questions about the cash benchmarking approach. My hope in sharing these observations is to think about not just how cash benchmarking can be improved if it will continue (or why it should be scrapped), but about how the agency USES EVIDENCE IN GENERAL to inform decision-making.

Kevin Starr

Hi Michael,

What I’d propose is a paradigm shift. From where I sit, USAID is stuck in a project mindset. They fund spot projects all over the place, continually hopping around, never really scaling anything toward its full potential. To get to scale, an idea must go through R&D, early replication, rigorous proof of impact, and disciplined scale-up. That takes an organization dedicated to doing so. USAID should fund those organizations, including the rigorous evaluations.

USAID shouldn’t be designing and dictating anything. Like the rest of us, they should be making a good attempt to find and fund stuff that works. There is a whole world of innovation out there for them to take advantage of.

And there is no cross-sector benchmark to be had. It’s been the Holy Grail for a long time now, and this is the latest attempt. You have to measure an organization in terms of the problem they’ve set out to address - their mission - and judge value in terms of evidence of impact and its cost. And that value emerges from the experience of funding clusters of similar things: CHWs, smallholder farming interventions, childhood nutrition, etc. What we’ve found is that in doing so, you begin to see what a bargain looks like. And even then you have to take things in context. Addressing malnutrition in rural Afghanistan is going to have a different cost than doing so in Kampala. Impact-obsessed funders have to continually use their judgement to move toward continually better emerging solutions.

Michael Eddy

Hi Kevin,

Really enjoyed reading this! Thanks for taking the time to dig into the weeds here--this stuff is important! As noted on twitter, I'm just trying to understand what you would propose USAID do differently.

I could see several different options:
1) Continue evaluating programs as it already has, with a bar being "does the evaluation find a statistically significant impact"?
2) Use another bar/benchmark that better reflects the opportunity cost of USAID's money.
3) Some other option
