AI Engineer
December 11, 2025

Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study) – Yegor Denisov-Blanch, Stanford

Companies are pouring millions into AI for software engineering, but is it paying off? Stanford's Yegor Denisov-Blanch dives into two years of research across 120,000 developers, revealing that while AI promises massive productivity gains, the reality is often more complex, nuanced, and sometimes even negative. This isn't about whether AI works, but how you make it work.

Identify the "One Big Thing":

  • The "One Big Thing" is that while AI can significantly boost software engineering productivity, its actual ROI is often misunderstood, mismeasured, or even negative due to poor implementation, lack of codebase hygiene, and reliance on superficial metrics. True gains require a sophisticated approach to measurement and active management of AI's impact on code quality and engineer behavior.

Extract Themes:

The AI Productivity Paradox: Potential vs. Reality:

  • Quote 1: "The discrepancy between the top performers and the bottom ones is increasing. There's a widening gap... if you're a leader in a company, you definitely need to know in which cohort you are right now so that you can course correct and without measuring the impact of AI on your engineers, you're not going to be able to do this."
  • Quote 2: "AI usage quality matters more than AI usage volume."

Codebase Hygiene as the AI Multiplier:

  • Quote 1: "Invest in codebase hygiene to unlock these AI productivity gains."
  • Quote 2: "Clean code amplifies AI gains. Secondly is that you need to manage your codebase entropy... because if you just use AI unchecked, this is going to accelerate this entropy."

Beyond Vanity Metrics: Measuring True AI ROI:

  • Quote 1: "Ideally we would be measuring this based on business outcomes... The problem is that there's too much noise between the treatment... and the result... Therefore, although that would be ideal, unfortunately, I think we need to find alternative paths and the most logical one is to simply look at the engineering outcomes because there is a clear signal."
  • Quote 2: "PRs went up by 14%. But this is inconclusive because more PRs doesn't mean better. We saw that code quality decreased by 9% which is problematic. We saw that effective output didn't increase meaningfully. And then we saw that rework increased by a lot. And so then the question here is what is the ROI of this AI adoption? It might be negative."

Synthesize Insights:

Theme 1: The AI Productivity Paradox: Potential vs. Reality

  • Widening Gap: Early AI adopters are seeing compounding gains, while others fall further behind, creating a "rich get richer" effect. This isn't about access, but how AI is used.
  • Quality Over Quantity: Simply spending more tokens or using AI more frequently doesn't correlate with higher productivity. In fact, there's a "death valley" effect around 10 million tokens/engineer/month where productivity dips. (Analogy: Just like buying more expensive gym equipment doesn't guarantee fitness; it's about how effectively you use it.) A usage-bucket sketch of this dip appears after this list.
  • Engineering Environment: The quality of the engineering environment (code cleanliness, modularity, documentation, tests) has a strong correlation (R-squared 0.40) with AI productivity gains.
  • Engineer Discretion: Engineers need to know when and when not to use AI. Blind application leads to rejected outputs, heavy rewriting, and ultimately, a loss of trust in AI tools, collapsing potential gains.
  • AI Engineering Practices Benchmark: Stanford developed a framework (Level 0: no AI to Level 4: agentic orchestration) to quantify how teams are using AI, revealing that even with equal access, adoption and usage patterns vary wildly across business units.
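
A minimal sketch of the usage-bucket analysis referenced above, assuming hypothetical per-engineer records and illustrative cut points (the 5M/15M thresholds and the data are not from the study):

    from statistics import median

    # Hypothetical per-engineer monthly records: (tokens_used, productivity_gain_pct).
    # In practice these would come from AI-tool billing exports plus the output metric.
    records = [
        (1_000_000, 8.0), (3_000_000, 12.0), (6_000_000, 15.0),
        (9_000_000, 4.0), (11_000_000, 2.0), (14_000_000, 9.0),
        (20_000_000, 11.0), (2_000_000, 10.0), (10_500_000, 1.0),
    ]

    # Bucket engineers by monthly token usage and compare median gains per bucket.
    buckets = {"<5M": [], "5M-15M": [], ">15M": []}
    for tokens, gain in records:
        if tokens < 5_000_000:
            buckets["<5M"].append(gain)
        elif tokens <= 15_000_000:
            buckets["5M-15M"].append(gain)
        else:
            buckets[">15M"].append(gain)

    for label, gains in buckets.items():
        print(f"{label:>7}: median productivity gain {median(gains):.1f}% (n={len(gains)})")

Bucketing by usage band rather than averaging across everyone is what makes a mid-range dip visible at all.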

Theme 2: Codebase Hygiene as the AI Multiplier

  • Cleanliness is King: A clean, well-structured codebase (high modularity, good tests, documentation) acts as an amplifier for AI tools, making them more effective.
  • Entropy Management: Unchecked AI usage can accelerate "codebase entropy" (tech debt), degrading code cleanliness and negating AI benefits. (Analogy: AI is like a powerful, fast-acting cleaner. If you spray it everywhere without wiping, you just spread the mess faster.)
  • Human-AI Feedback Loop: Humans must actively push back against entropy by maintaining code quality to sustain AI's benefits. It's a continuous effort, not a one-time fix.
  • Task Suitability: AI is better suited for certain tasks (green zone) than others (red zone). Understanding this helps engineers apply AI strategically.

Theme 3: Beyond Vanity Metrics: Measuring True AI ROI

  • Focus on Engineering Outcomes: While business outcomes are the ultimate goal, too much noise (sales, macro, product strategy) makes direct AI-to-business ROI measurement difficult. Engineering outcomes offer a clearer signal.
  • Beware of Goodhart's Law: Relying on easily gamed metrics like Lines of Code (LOC) or Pull Request (PR) counts can lead to perverse incentives. (Analogy: Measuring a chef's productivity by how many ingredients they order, rather than the quality of the meals they produce.)
  • Proposed Framework: Use a primary metric (engineering output, measured by an ML model replicating expert evaluation) and guardrail metrics (rework, quality, tech debt, people/DevOps) to ensure healthy development. Guardrail metrics should be maintained, not maximized. A small sketch of this harness appears after this list.
  • Case Study Shock: A company saw a 14% increase in PRs after AI adoption, but deeper analysis revealed a 9% decrease in code quality, a 2.5x increase in rework, and no meaningful change in effective output. Their perceived ROI was positive, but the actual ROI was likely negative.
  • Retroactive Measurement: It's possible to measure AI impact retroactively using Git history, so companies don't need to set up new experiments and wait.
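
One way to read the primary-plus-guardrails idea is as a small evaluation harness: maximize one number, and only check that the others stay within healthy bands. The sketch below is illustrative; the metric names, bands, and values are assumptions, not thresholds from the talk.

    from dataclasses import dataclass

    @dataclass
    class GuardrailBand:
        name: str
        low: float   # lowest acceptable value
        high: float  # highest acceptable value

    # Primary metric: engineering output (from the expert-replicating model).
    # Guardrails: kept within a band, never maximized. Values below are assumed.
    GUARDRAILS = [
        GuardrailBand("rework_ratio", low=0.00, high=0.15),        # share of recent code rewritten
        GuardrailBand("code_quality_index", low=0.70, high=1.00),  # model-scored quality
        GuardrailBand("tech_debt_growth", low=-0.05, high=0.05),   # quarter-over-quarter drift
    ]

    def assess(primary_output_delta: float, guardrail_values: dict) -> str:
        """Flag an AI rollout as healthy only if output rises and no guardrail breaks."""
        breaches = [
            g.name for g in GUARDRAILS
            if not (g.low <= guardrail_values[g.name] <= g.high)
        ]
        if breaches:
            return f"investigate: guardrails breached -> {breaches}"
        return "healthy gain" if primary_output_delta > 0 else "no measurable gain"

    # Illustrative reading mirroring the case study: output flat, rework and quality slip.
    print(assess(0.01, {"rework_ratio": 0.25, "code_quality_index": 0.64, "tech_debt_growth": 0.08}))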

Filter for Action:

  • For Investors:
    • Warning: Be wary of companies touting AI productivity gains based solely on superficial metrics like PR counts or token usage. Dig deeper into their measurement methodologies.
    • Opportunity: Invest in companies that prioritize codebase hygiene, have sophisticated AI adoption strategies (beyond just providing tools), and use robust engineering outcome metrics. Companies offering tools for measuring AI's impact on code quality and engineer behavior are also interesting.
  • For Builders:
    • Action: Prioritize codebase cleanliness before or concurrently with AI adoption. Treat tech debt as a blocker for AI ROI.
    • Action: Develop internal guidelines and training for engineers on when and how to effectively use AI, fostering trust and preventing misuse.
    • Action: Implement sophisticated measurement frameworks that go beyond simple usage metrics. Focus on engineering output (quality, effective work, rework) rather than just volume. Explore tools like Stanford's open-source AI engineering practices benchmark.
    • Warning: Don't blindly chase AI usage. Focus on quality of usage and the environment in which AI is deployed.
    • Opportunity: Build tools that help companies measure AI's impact on code quality, identify codebase entropy, and guide engineers on optimal AI usage patterns.

New Podcast Alert: Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study) – Yegor Denisov-Blanch, Stanford

Podcast Link: https://www.youtube.com/watch?v=JvosMkuNxF8

Stanford's groundbreaking study of 120,000 developers shatters the myth of guaranteed AI productivity gains in software engineering, revealing that most companies mismeasure ROI and risk negative returns.

The ROI Measurement Conundrum & Stanford's Novel Approach

  • Companies invest millions in AI software engineering tools, yet lack robust methods to quantify their impact. Stanford's research directly addresses this ROI gap, moving beyond anecdotal evidence.
  • The study employs a time-series and cross-sectional analysis across companies, leveraging historical Git data to track changes over time and compare across organizations.
  • A proprietary machine learning model replicates a panel of 10-15 human experts, objectively evaluating code commits on implementation time, maintainability, and complexity.
  • This model, trained on millions of expert evaluations, allows for scalable, objective assessment of code quality and developer output, providing a reliable baseline; a toy sketch of the approach appears below.
  • The research aims to identify true productivity drivers, benchmark AI practices, and propose a concrete ROI measurement framework for the industry.

"We took the labels of these panels across millions of evaluations and then trained a model to replicate this panel of experts, meaning that we can deploy this at scale." – Yegor Denisov-Blanch

Unpacking AI Productivity Drivers: The Widening Gap

  • Initial findings from 46 AI-using teams, matched with 46 non-AI teams, reveal a median 10% productivity gain, but a stark divergence between top and bottom performers. This disparity highlights critical factors beyond mere tool adoption.
  • The gap between high-performing and struggling AI adopters is widening, creating a "rich get richer" effect where early success compounds, leaving laggards further behind.
  • AI usage (token spend) shows a loose correlation (0.20) with productivity, even exhibiting a "death valley" effect around 10 million tokens per engineer per month, suggesting quality of use outweighs sheer quantity.
  • Codebase cleanliness (a composite index of tests, types, documentation, modularity, and code quality) correlates strongly (0.40 R-squared) with AI productivity gains, suggesting it is a critical prerequisite; a sketch of such an index appears below.
  • Unchecked AI usage accelerates codebase entropy (the natural tendency for a software system's structure and quality to degrade over time without active management), degrading cleanliness and negating benefits; human intervention is crucial to maintain hygiene and maximize AI's utility.

"If you're a leader in a company, you definitely need to know in which cohort you are right now so that you can course correct." – Yegor Denisov-Blanch

Benchmarking AI Engineering Practices & Adoption Patterns

  • Beyond mere usage, how engineers integrate AI dictates its effectiveness. Stanford introduces an "AI Engineering Practices Benchmark" to quantify these patterns and identify best practices.
  • This evolving, open-source tool scans codebases for "AI fingerprints"—traces of how teams utilize AI, quantified by the percentage of active engineering work using specific patterns.
  • The benchmark defines levels of AI integration: Level 0 (no AI), Level 1 (personal, unshared prompts), Level 2 (team-shared prompts/rules), Level 3 (AI autonomously completes specific tasks), and Level 4 (agentic orchestration, where AI systems autonomously manage and coordinate multiple tasks or agents to achieve a complex goal). A toy classification sketch of these levels appears below.
  • A case study revealed that two business units within the same company, despite identical AI access and tools, showed vastly different adoption rates and usage patterns.
  • This disparity underscores that access to AI does not guarantee uniform or effective integration, emphasizing the need for leaders to understand how their engineers are using AI.

"Access to AI and even AI usage doesn't mean or doesn't guarantee that that AI is going to be used in the same way across a company." – Yegor Denisov-Blanch

The Definitive Framework for AI ROI in Software Engineering

  • Measuring AI ROI through direct business outcomes proves too noisy due to confounding variables (sales execution, macro environment, product strategy). The focus must shift to measurable engineering outcomes.
  • Stanford proposes a framework with a Primary Metric and Guardrail Metrics to accurately assess AI's impact on engineering.
  • The Primary Metric is "engineering output," derived from the expert-replicated ML model, not traditional proxies like lines of code (LOC), pull request (PR) counts, or DORA metrics (DevOps Research and Assessment metrics).
  • Guardrail Metrics, which should be maintained at healthy levels but not maximized, include: rework and refactoring, quality/tech debt/risk, and people/DevOps.
  • Goodhart's Law (an observed statistical regularity will collapse once pressure is placed upon it for control purposes) cautions against weaponizing metrics; a balanced set and strong company culture are vital to prevent perverse incentives.
  • Companies can measure AI impact retroactively using Git history, eliminating the need for lengthy prospective experiments and enabling immediate insights into past AI adoption; a minimal sketch of this retroactive split appears below.

"Metrics don't need to be flawless to be useful." – Yegor Denisov-Blanch

Case Study: The Peril of Misleading Metrics

  • A real-world case study demonstrates how relying on superficial metrics can lead to a false sense of AI productivity and potentially negative ROI.
  • A large enterprise team saw a 14% increase in pull requests (PRs) after adopting AI, which, if measured in isolation, would suggest significant productivity gains.
  • However, deeper analysis using Stanford's methodology revealed a 9% decrease in code quality and a 2.5x increase in rework (changing recently written code).
  • Crucially, "effective output" (a proxy for true productivity) did not meaningfully increase, indicating that the higher PR count was not translating into valuable work.
  • This data suggests a potentially negative ROI, despite the apparent PR increase, highlighting the danger of misinterpreting simple metrics; a back-of-the-envelope reading of these numbers appears below.

"Had this company not measured this more thoroughly and simply measured PR counts, they would have thought, hey, we're doing great." – Yegor Denisov-Blanch

Investor & Researcher Alpha

  • Capital Reallocation: Investors should shift capital away from "AI tools for tools' sake" towards solutions that deeply integrate with existing codebases and provide granular, actionable ROI metrics. Prioritize investments in codebase hygiene tools and platforms enabling Level 3/4 AI integration.
  • Emerging Bottleneck: The new bottleneck for AI adoption in software engineering is not solely GPU compute, but critically, codebase quality and the ability to accurately measure AI's impact on engineering outcomes. Companies failing to address technical debt will see AI amplify existing problems.
  • Obsolete Research Directions: Research relying purely on quantitative metrics like Lines of Code (LOC) or Pull Request (PR) counts to assess AI productivity is now obsolete. Future research must incorporate qualitative assessments of code quality, rework, maintainability, and the nuanced "how" of AI integration to yield meaningful insights.

Strategic Conclusion

AI in software engineering offers immense potential, but only with rigorous, outcome-based measurement and an unwavering focus on codebase health. Superficial metrics obscure true impact, risking negative ROI and misdirected investments. The industry must adopt sophisticated, data-driven frameworks to unlock AI's real value and avoid costly missteps, ensuring AI truly augments human engineering.
