AI Coding Agents: What the Data Actually Shows
GitHub Copilot writes 46% of code at Microsoft. Google claims 2.5x productivity gains. Commonwealth Bank won't go back. Here's what's real, what's hype, and what actually works.
Over the past year, I've been building Tech Upkeep and several client projects with Claude Code and Cursor in the loop almost every day. The loudest takes about "AI agents" tend to argue about whether they will replace developers. The more useful question for most teams is narrower: where do these tools help right now, and where do they still waste time?
This piece started as a response to Veith Röthlingshöfer's article "why agents DO NOT write most of our code - a reality check". His team ran a week-long experiment trying to ship a feature almost entirely with agents and concluded that the current tools do not move the needle much. I think that conclusion mostly reflects how they used the tools - as relatively unconstrained agents without strong hooks, rather than as part of a workflow that bakes in tests, linting, and code review.
We now have better data than hot takes. GitHub's controlled experiment with its Office of the Chief Economist found developers completed a set of coding tasks 55% faster with Copilot (1h 11m vs. 2h 41m), with a statistically significant effect size (P = .0017). Vladimir Siedykh's comparison of Claude Code, Copilot, and Cursor walks through this research in more detail.
An independent 2025 review of AI coding assistants by AI for Code comes to a similar conclusion from a different angle. They rank Cursor and GitHub Copilot as joint leaders at 95/100, with Claude Code at 90/100 and JetBrains AI Assistant just behind. The scoring weights code generation quality, developer experience, advanced features, reliability, and value, and highlights Cursor's strength in multi-file refactors and Copilot's strength in IDE integration in their "Best AI Coding Assistants 2025" roundup.
Anthropic looked at hundreds of thousands of coding interactions with Claude and found that 79% of conversations involved automation tasks where the AI actually executed work (editing files, running commands, making commits), not just suggesting snippets. That finding comes from their "Impact of AI on software development" analysis, and it matches what I see day to day: the biggest gains come when you let the tool handle repetitive changes and checks, not when you ask it to invent a whole system from scratch.
What the Data Actually Tells Us
If you zoom out across GitHub's research, case studies and independent analyses, a few themes show up repeatedly:
GitHub Copilot in Numbers
- Speed: controlled experiments show ~55% faster task completion with Copilot compared to a control group, as reported in Vladimir Siedykh's summary of the GitHub / Microsoft study.
- Adoption: Copilot now has tens of millions of users and is used across a large majority of Fortune 100 engineering teams, according to multiple industry summaries of GitHub's own data. The exact figure changes as GitHub publishes new Octoverse and Copilot updates, but the direction of travel is clear.
- Risk: at least one independent analysis of Copilot output found that a significant share of generated Python samples contained potential security issues. That is a strong argument for keeping human review, static analysis, and security scanning in the loop rather than treating AI output as trusted by default.
Cursor and Claude Code in Numbers
- Cursor latency: autocomplete suggestions around 320 ms, compared to roughly 890 ms reported for Copilot in one benchmark, which matters if you care about staying in flow. These figures come from a Copilot vs Cursor comparison by Quick AI Directory.
- Prediction quality: Cursor claims about 25% accuracy when predicting your next edits, which lines up with Vladimir Siedykh's analysis of how well it anticipates repeated patterns in web stacks, as covered in his comparison of Claude Code, Copilot, and Cursor.
- Claude Code limits: Anthropic has already had to introduce weekly limits on certain Claude Code features because a small fraction of users were running them effectively nonstop. That change was described in Tom's Guide's coverage of Claude Code limits, and it is a good proxy for how heavily some teams lean on these tools for automation.
Put simply: in the aggregate, Copilot and similar tools make most developers faster and less frustrated, but they also happily generate flawed or insecure code if you let them. The numbers are good enough that ignoring the tools is becoming a strategic decision, not just a personal preference.
You can see the same pattern in individual company reports. An AI coding assistant rollout at JPMorgan's software group, for example, was credited with roughly 10% to 20% efficiency gains for engineers in one public write-up from Reuters on JPMorgan's coding assistant rollout. The exact numbers will differ on your team, but it is getting harder to argue that there is no signal underneath the hype.
Where AI Coding Tools Actually Help
When you look at how teams use Claude Code, Copilot and Cursor in practice (and not just in marketing demos), the wins are surprisingly consistent.
1. Test Generation That Matches Your Style
The most reliable use case I see is test generation. If you hand-write 2–3 representative tests that show your patterns and expectations, tools like Copilot or Claude Code are very good at filling in the next 10–20. Roughly 70–80% of what they generate is acceptable with light edits. You still make the decisions about coverage and edge cases; the AI just saves you from typing the obvious variations.
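To make that concrete, here is roughly what "show the tool your pattern" looks like in a Vitest/Jest style. Everything here is a hypothetical stand-in for your own module and conventions; the point is that the two hand-written tests carry the intent, and the prompt only asks for mechanical variations.

```ts
import { describe, expect, it } from "vitest";

// Hypothetical stand-in for the module under test.
function formatCurrency(amount: number, currency: "USD" | "EUR"): string {
  if (amount < 0) throw new Error("amount must be non-negative");
  return new Intl.NumberFormat("en-US", { style: "currency", currency }).format(amount);
}

describe("formatCurrency", () => {
  // Two hand-written tests that establish the pattern: naming, structure, and the
  // edge cases we actually care about.
  it("formats whole amounts with grouping and two decimal places", () => {
    expect(formatCurrency(1200, "USD")).toBe("$1,200.00");
  });

  it("rejects negative amounts with a descriptive error", () => {
    expect(() => formatCurrency(-5, "USD")).toThrow("amount must be non-negative");
  });

  // From here, a prompt like "add tests for EUR, zero, and rounding, following the same
  // style" tends to produce the mechanical variations with only light edits needed.
});
```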
Not everyone is enthusiastic about letting AI write tests. In his essay "Why you shouldn't use AI to write your tests", Swizec Teller argues that AI-generated tests can backfire: they tend to be shallow checks that do not encode real intent or edge cases, and they remove the useful "head fake" of developers thinking carefully about behaviour before coding. He suggests using AI instead to fuzz inputs, translate acceptance criteria into higher-level test scaffolds, or write code that satisfies tests you wrote yourself.
I think these views are compatible. In practice, the setup that works is to keep humans responsible for deciding what needs to be tested and why, and to use AI only to expand obvious variations or mechanical cases once that intent is clear.
2. Performance and Correctness Reviews
Having a second pair of eyes on pull requests used to mean finding another engineer with time. Now you can point Claude or Cursor at a diff and ask for specific checks: N+1 queries, unsafe concurrency, missing error handling, confusing public APIs. They will miss things, but they also reliably catch issues that slip past tired humans. I've had Claude flag obvious N+1 queries in review that would have turned into slow endpoints in production.
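As an illustration of what these review passes catch, here is a sketch of the N+1 shape and the batched fix. The data layer is an in-memory stand-in I made up for the example; in a real service each helper call would be a separate database round trip.

```ts
// In-memory stand-ins for a real data layer; in production each helper call below
// would be a separate round trip to the database.
const users = new Map([
  ["u1", { id: "u1", name: "Ada" }],
  ["u2", { id: "u2", name: "Lin" }],
]);
const posts = [
  { id: "p1", authorId: "u1" },
  { id: "p2", authorId: "u2" },
  { id: "p3", authorId: "u1" },
];

const findPosts = async () => posts;
const findUser = async (id: string) => users.get(id);
const findUsersByIds = async (ids: string[]) => ids.map((id) => users.get(id)!);

// N+1 shape: one query for the posts, then one query per post for its author.
async function listPostsSlow() {
  const all = await findPosts();
  return Promise.all(all.map(async (p) => ({ ...p, author: await findUser(p.authorId) })));
}

// The fix a review pass usually suggests: batch the author lookup into a single query.
async function listPostsBatched() {
  const all = await findPosts();
  const authors = await findUsersByIds([...new Set(all.map((p) => p.authorId))]);
  const byId = new Map(authors.map((a) => [a.id, a]));
  return all.map((p) => ({ ...p, author: byId.get(p.authorId) }));
}
```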
3. Eliminating "Mechanical" Work
Boilerplate is where AI shines. Updating DTOs and mappers after a schema change, wiring up dependency injection, adding logging in a consistent format, or porting the same pattern across 20 files are tedious, error-prone tasks for humans. Claude Code and Cursor, in particular, are very good at "apply this pattern everywhere in this folder" changes once you show them one good example.
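A minimal sketch of what "one good example" means here, assuming a made-up schema change that maps snake_case rows into a camelCase DTO. You write one of these by hand as the template, then ask the tool to repeat the pattern across the rest of the folder.

```ts
// Hypothetical example: a schema change renames `display_name`/`created_at` style columns
// into a camelCase DTO. Update one DTO and mapper by hand as the template, then ask the
// tool to apply the same change to the other DTOs and mappers in the folder.
interface UserRow {
  id: string;
  display_name: string;
  created_at: string; // ISO timestamp from the database
}

interface UserDto {
  id: string;
  displayName: string;
  createdAt: Date;
}

function toUserDto(row: UserRow): UserDto {
  return {
    id: row.id,
    displayName: row.display_name,
    createdAt: new Date(row.created_at),
  };
}
```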
Even strong advocates of AI-assisted programming point out that the benefit depends heavily on how much boilerplate you write, and Karpathy's Nanochat project is a good example. He has talked about using models as an "AI pair programmer" for autocomplete and for rewriting repetitive pieces, but Nanochat itself was "basically entirely hand-written": attempts to use Claude or Codex agents were net unhelpful because the repository was too far off the models' training distribution, as he explained in Futurism's write-up on Nanochat and vibe coding. That is not an argument against AI assistance; it is a reminder that when your system is unique and low on boilerplate, most of the value comes from small accelerations, and you still need a lot of hand-written code and careful design.
4. React and Frontend Cleanups
React codebases accumulate re-render problems, prop drilling, and awkward state management. Tools like Copilot and Cursor are weirdly good at spotting avoidable re-renders, missing useMemo/useCallback usage, and awkward dependency arrays once you prompt them with the right question. They are not a replacement for understanding React's rendering model, but they act like a linter-plus-advisor hybrid that can suggest concrete refactors. If you blindly accept every suggestion, you can still end up with components full of useEffect calls and scattered local state that need a human to simplify and pull back into a sane architecture.
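The shape of the refactor these tools tend to suggest looks roughly like the sketch below: a memoized row plus useMemo/useCallback so that unrelated state changes do not re-render every row. The component and data are hypothetical.

```tsx
import { memo, useCallback, useMemo, useState } from "react";

// Hypothetical list component; the pattern is the point, not the UI.
const Row = memo(function Row(props: { label: string; onSelect: (label: string) => void }) {
  return <li onClick={() => props.onSelect(props.label)}>{props.label}</li>;
});

export function SearchableList({ items }: { items: string[] }) {
  const [query, setQuery] = useState("");
  const [selected, setSelected] = useState<string | null>(null);

  // Without useMemo, the filter reruns on every render (including ones caused only by
  // `selected` changing); without useCallback, every render passes a new onSelect and
  // defeats the memo() around Row.
  const visible = useMemo(
    () => items.filter((item) => item.toLowerCase().includes(query.toLowerCase())),
    [items, query]
  );
  const onSelect = useCallback((label: string) => setSelected(label), []);

  return (
    <div>
      <input value={query} onChange={(event) => setQuery(event.target.value)} />
      <p>Selected: {selected ?? "none"}</p>
      <ul>
        {visible.map((label) => (
          <Row key={label} label={label} onSelect={onSelect} />
        ))}
      </ul>
    </div>
  );
}
```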
5. Documentation and Developer Notes
AI-generated documentation is rarely publish-ready, but it is excellent as a first draft. Asking Claude or Copilot to write a docstring, ADR outline, or README section from existing code and comments gets you from a blank page to "good enough to edit" in seconds. That small drop in activation energy is the difference between "we should document this" and actually checking in docs.
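For a sense of what "good enough to edit" means, here is the kind of first-draft TSDoc an assistant produces from existing code. The function and its behaviour are hypothetical, and a human still needs to check that the prose matches what the code actually does before it ships.

```ts
/**
 * Fetches `url`, retrying network errors and 5xx responses with exponential backoff.
 *
 * @param url - Absolute URL to fetch.
 * @param retries - Maximum retries after the first attempt (default 3).
 * @returns The first response with status < 500, or the last response once retries
 *   are exhausted; rejects if the final attempt fails with a network error.
 */
async function retryFetch(url: string, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status < 500 || attempt >= retries) return res; // only retry server errors
    } catch (err) {
      if (attempt >= retries) throw err; // network error and retries exhausted
    }
    await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 250)); // 250ms, 500ms, 1s...
  }
}
```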
Where These Tools Still Struggle
The failure modes are also consistent. If you know them upfront, you avoid most of the pain.
Large, Unconstrained Feature Requests
Asking an AI assistant to "build a full user management system" across a real codebase almost always disappoints. The generated code tends to ignore your existing error handling conventions, persistence patterns, telemetry, and domain language. You burn time trying to retrofit the output into your architecture instead of guiding the design yourself. In practice, you get better results by treating AI as an assistant for small, composable steps and keeping system design in human hands.
Codebase-Wide Reasoning and Hallucinated Context
Even tools with strong codebase indexing will sometimes reference functions that were renamed, branches that no longer exist, or APIs that live in a different service. The larger your repository, the more likely it is that a model stitches together patterns that look plausible but don't match reality. You still need basic guardrails: grep before you blindly trust a suggestion, and keep tests close to behaviour.
Overconfidence and Security
The models rarely say "I'm not sure." That's a problem when empirical studies of tools like Copilot have found that a large share of generated code has security issues. One early paper, "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," reported vulnerabilities in roughly 40% of the Copilot-generated solutions evaluated in its benchmark tasks. You cannot safely adopt these tools without solid linting, code review, and automated security scanning. The right framing is simple: assume AI-written code is untrusted until it passes the same checks you'd apply to a junior developer's work.
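To be concrete about what "untrusted" means, the sketch below shows the injectable query shape that assistants still produce (SQL injection is one of the issue classes studied in papers like "Asleep at the Keyboard?") and the parameterized version your gates should force. The query client is a stand-in I made up for the example.

```ts
// Minimal stand-in for a SQL client; the pattern is what matters, not the driver.
const db = {
  query: async (sql: string, params: unknown[] = []): Promise<unknown[]> => {
    console.log(sql, params); // a real client would send this to the database
    return [];
  },
};

// Typical assistant output that works in a demo but is injectable (CWE-89):
// `email` is interpolated straight into the SQL string.
async function findUserUnsafe(email: string) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// What your gates should force instead: parameterized queries, the same standard
// you would hold any other contributor to.
async function findUserSafe(email: string) {
  return db.query("SELECT * FROM users WHERE email = $1", [email]);
}
```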
Closing Some Gaps With Agent Hooks
This is where I think the Octomind piece underestimates what is possible today. If you run an agent once, point it at a feature, and hope for the best, you get exactly the sort of thousand-line, half-working pull requests they describe in Octomind's "why agents DO NOT write most of our code". The picture looks different if you wire agents into a set of predictable hooks and quality gates.
A concrete example is wshobson's "agents" toolkit for Claude Code. It provides a catalog of agent commands and plugins that let you connect Claude to tests, linters, formatters, CI tasks, and external systems, with a configuration file that describes what "good" looks like for your project. Instead of asking an agent to refactor a whole feature in one go, you invoke smaller skills like "update this file and run the tests," "apply this migration pattern across these folders," or "generate tests for this module and run the suite." That does not remove the need for review, but it addresses several of the specific problems the Octomind team ran into.
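To make the hook idea concrete without tying it to any particular toolkit, here is a deliberately simple, hypothetical quality-gate script: after every agent edit, run the same typecheck, lint, and test commands a human change would have to pass, and stop (or hand the failure back to the agent) as soon as one fails. This is an illustration of the pattern, not wshobson's or Claude Code's actual configuration format, and the npx commands assume a typical TypeScript project.

```ts
import { execSync } from "node:child_process";

// The same gates a human PR would have to pass; swap in whatever your CI already runs.
const gates = [
  { name: "typecheck", cmd: "npx tsc --noEmit" },
  { name: "lint", cmd: "npx eslint . --max-warnings 0" },
  { name: "tests", cmd: "npx vitest run" },
];

for (const gate of gates) {
  try {
    execSync(gate.cmd, { stdio: "inherit" });
    console.log(`ok: ${gate.name}`);
  } catch {
    console.error(`failed: ${gate.name} -- reject the change or hand the output back to the agent`);
    process.exit(1);
  }
}
```

The exact commands matter less than the fact that the agent's work never merges without passing them, which is precisely the discipline a one-shot "build the feature" run skips.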
How Teams Are Actually Using Claude, Copilot, and Cursor
Real workflows look less like "press a button and ship an app" and more like mixing and matching tools for different stages of work.
The Two-Window Setup
A pattern I keep hearing: one monitor runs VS Code with Claude Code in the terminal, the other runs Cursor or plain VS Code with Copilot. Claude handles larger, multi-file changes and automation: "update all the newsletter sources to use this new type" or "add logging around these code paths and run the tests." Cursor or Copilot then handle smaller edits, completions, and refactors while you read and tweak the changes.
A Refactor Pattern I Keep Hearing About
Several teams describe a similar pattern when they talk about using Cursor or Claude Code for larger refactors. A typical example is a payment or billing integration that used to take weeks of careful, manual edits across dozens of files. With agent-style workflows, engineers still plan the design and define the patterns, then let the tool apply the same refactor everywhere while they focus on verifying edge cases and adjusting business logic. The AI does not "understand" the domain, but it is good at consistent pattern application once the team provides a clear template.
A Pragmatic Personal Workflow
On this project, my own loop looks roughly like this:
- Iterate with agents: start in Claude Code, iterating on a plan and letting it propose concrete changes or drafts until I am happy with the direction.
- Review diffs, not vibes: read through the actual file diffs and commands the agent wants to run, accept only what makes sense, and discard or edit anything that does not match how I would structure the code.
- Implementation: once the direction is clear, write the core pieces myself, using AI mainly for autocomplete and obvious boilerplate, and keep tests, linters, and review in the loop.
- Cleanup: after behaviour is stable, use AI for mechanical refactors, dead code removal, and first-draft documentation while I focus on naming, structure, and edge cases.
A Simple Rollout Plan for a Team
If you're responsible for an engineering team, you don't need a grand "AI strategy" to start seeing value. You do need a deliberate rollout.
Step 1: Pick One Narrow Use Case
Test generation or small refactors in a non-critical service are usually the safest entry points. Measure something concrete: time to write tests, number of incidents, or time from PR opened to merged. Keep expectations modest for the first month.
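If you want a starting point for the measurement side, something as small as the sketch below is usually enough. The PullRequest fields are assumptions about whatever export your Git host gives you; the point is to pick one number and track it before and after the rollout.

```ts
// Hypothetical sketch of "measure something concrete": median hours from PR opened to
// merged, computed from whatever export your Git host provides (field names are assumptions).
interface PullRequest {
  openedAt: string; // ISO timestamp
  mergedAt: string | null; // null if still open or closed without merging
}

function medianHoursToMerge(prs: PullRequest[]): number | null {
  const hours = prs
    .filter((pr): pr is PullRequest & { mergedAt: string } => pr.mergedAt !== null)
    .map((pr) => (Date.parse(pr.mergedAt) - Date.parse(pr.openedAt)) / 36e5)
    .sort((a, b) => a - b);
  if (hours.length === 0) return null;
  const mid = Math.floor(hours.length / 2);
  return hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
}
```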
Step 2: Make Review Rules Explicit
Decide upfront how you'll treat AI suggestions in code review. Some teams tag AI-assisted PRs, others require tests for any AI-written logic, and many forbid committing AI-generated code without human edits in security-sensitive areas. Whatever you choose, write it down and apply it consistently.
Step 3: Let Developers Choose Tools, Standardize the Guardrails
Developers will have preferences: some like Cursor, others prefer Copilot-plus-chat, others live in the terminal with Claude Code. That's fine. Standardize on the guardrails (testing, security scanning, review expectations), not the specific assistant. The market is moving fast; your constraints should survive tool churn.
Step 4: Revisit After 3–6 Months with Real Metrics
After a few months, compare cycle times, incident rates, and subjective satisfaction before and after adoption. GitHub's research and various Copilot statistics suggest meaningful gains are possible, but the distribution is wide. Some teams see large improvements; others see little change because they never moved past "autocomplete, but fancier."
You also do not need to copy the most aggressive rollouts. Meta, for example, has started telling employees that performance appraisals will factor in how effectively they use internal AI tools, as reported by the Financial Express. For most engineering teams, a clearer goal is to make AI assistance available, set sensible guardrails, and then measure whether it actually helps, rather than forcing usage for its own sake.
The Practical Takeaway
AI coding tools are already moving a measurable share of real-world development work. GitHub's own numbers and independent analyses are clear about that. They behave less like magic "agents" and more like a new layer in the tooling stack: fast pattern matchers and automation engines that sit beside your editor and CI.
Teams that invest a little time in learning where these tools are reliable (tests, boilerplate, small refactors, documentation) and where they are not (architecture, security-sensitive logic) are already shipping faster with fewer frustrating chores. Teams that ignore them will still be able to ship software, but over the next few years the gap in velocity and morale is likely to become hard to miss.
Get the data that matters
I share real metrics and case studies from engineering teams twice a week. No vendor hype, just what's actually working in production.
Written by Benjamin Loh, curator of Tech Upkeep