We Taught AI to Write Code But We Forgot to Teach It to Think.

Almost half of commercial code is now AI-generated. Code churn has doubled and the engineers who built these systems can't always explain how they work anymore.

Ruben Dominguez

May 14, 2026

The code looks fine. That is the problem.

Clean PR. Tests green. Knot in your stomach.

You read the diff. Properly indented. Variables make sense. Reads like a dream.

Then you try to trace what it actually does from start to finish.

You have no idea.

That is software development in 2026.

AI coding assistants gave us a strange new superpower: we can now generate more correct-looking code than our brains can comprehend.

The scariest problems do not crash the app. They do not throw errors. They slip past review because the code is beautiful.

We are flying blind. Trusting surface signals. Losing our grip on how our own systems work.

A quiet crisis.


1. Faster Doesn’t Mean Better, But We Keep Forgetting That

2. The Debt You Can’t See Grows the Quickest

3. Why Code Review Is Failing When It Matters Most

4. The Slow Disaster Nobody Voted For

5. What Governance Actually Looks Like When It Works

6. The Uncomfortable Truth About What This Means for Your Team

1. Faster Doesn’t Mean Better, But We Keep Forgetting That

For a while, this is easy to ignore.

Over time, it becomes impossible to pretend it isn’t there. There’s a real tension at the heart of AI-assisted development that most teams haven’t fully sat with yet.

Code churn rising steadily 2020 to 2024, accelerating sharply from 2023 onward marked as AI-influenced. — AI didn’t start the churn. It just turned up the speed. (Image source: Gitclear)

The experience of going faster and the reality of slowing down are not mutually exclusive. These tools have made both possible at once.

Most teams are only tracking one of them.

Where the time actually goes

Writing code has never been the bottleneck in professional software development. Understanding it is. Debugging it is. So is modifying code whose reasoning you can’t fully trace.

AI has made the fast part of development faster while making the slow parts measurably harder.

The numbers make this concrete.

Developers consistently report feeling around 20% faster with AI tools. Measured end to end, through review, integration and production fixes, teams frequently land roughly 19% slower.

That gap stops being surprising once you trace where the time actually goes. Generating code has become cheaper. Verifying and understanding it has become more demanding, because there is simply more of it moving through the system at once.

The planning problem nobody talks about

The deeper issue is what that gap does to how teams make decisions.

You feel faster, so you plan around output.

Deadlines tighten, backlogs look manageable, stakeholders see the commit velocity and feel good about it.

But the constraint doesn’t disappear. It moves into parts of the process that are harder to accelerate and easier to underestimate: the long review discussions, the bugs that take three times as long to isolate because the surrounding logic is unfamiliar, the features that keep spilling into the next sprint.

Uplevel study of 800 developers: GitHub Copilot increased bug rate by 41% with no improvement to efficiency or burnout. — More bugs. Same speed. Still burned out. (Image source: Uplevel)

2. The Debt You Can’t See Grows the Quickest

There’s a point where the issue stops being about speed and starts being about something harder to name.

The system keeps growing.

The code looks clean. Nothing obvious signals that anything is wrong. But if you ask a simple question, “Who on this team fully understands this part of the system end to end?”, the answers get vague very quickly.

What comprehension debt actually looks like

Google engineer Addy Osmani calls this comprehension debt: the growing gap between how much code exists and how much anyone genuinely understands.

Unlike traditional technical debt, which announces itself through slow builds and friction you can feel, comprehension debt breeds false confidence.

AI speeds up coding but lowers knowledge scores. Engineers who delegate to AI score 30% lower than those who learn from it. — Faster to finish. Less to show for it. (Image source: Addy Osmani)

The system moves. Tests pass.

Velocity looks fine.

The problem is that movement is happening on top of a layer fewer and fewer people can actually speak to.

Researcher Margaret-Anne Storey documented this with a student team that hit a wall seven weeks in, not because the code was messy, but because nobody could explain why decisions had been made or how the system was supposed to fit together.

The shared theory of the software had evaporated.

Layered pyramid from hardware to team social systems, showing how software and human interaction are interdependent. — It’s never just a code problem.

Margaret-Anne Storey’s Socio-Technical Model illustrates how software development relies on layers of human interaction. AI tools often disrupt the “Emergence” of understanding from the individual to the team level.

The 18-month arc

Multiple teams have now reported the same trajectory.

The first three months feel like a clear win. The team ships faster, backlogs shrink, the integration of AI tools feels like the right call.

By months four through nine, something changes. Reviews take longer. Changes cover more ground than expected. The code still gets approved, but reconstructing the intent behind it takes real effort.

Around months ten to fifteen, a bug takes longer to fix than it should. The code is readable, but tracing how pieces interact eats time nobody has.

By month sixteen or eighteen, teams start to hesitate. Parts of the codebase feel like territory you approach carefully, even when nothing looks obviously wrong.

That’s when it becomes clear. The system is no longer fully legible to the people responsible for it.

Three patterns drive this.

AI models evolve underneath you, so similar prompts produce inconsistent results over time.

Changes arrive larger and less scrutinized than before.

And AI-generated code looks competent. Clean, commented, sensibly named. That earns trust on appearance rather than understanding.

None of this shows up on a dashboard. The system appears healthy. But the ability to safely modify it is quietly eroding.

3. Why Code Review Is Failing When It Matters Most

Code review gets a bad reputation. It’s slow, it’s sometimes awkward and nobody loves having their work picked apart on a Friday afternoon.

It was never really about catching bugs. It’s one of the few moments where one engineer actually gets inside another’s head.

You start to see how they think.

Why they made that call and not a different one. Where they decided to stop.

Do that enough times, with the same people, on the same codebase and something you can’t really measure starts to form.

Everyone just kind of knows how the system behaves. Not because it’s documented somewhere.

Because they’ve lived in it together. Lose that and the code doesn’t suddenly break. It just becomes a place nobody fully owns anymore.

Tweet by Jarred Sumner predicting open source repos almost entirely maintained by LLMs in 2026. — And if nobody understands it, who owns it?

The scale problem

That function depends on a certain scale.

Reviews work when a person can reasonably hold the change in their head. Not every detail, but enough to trace the intent and spot where something feels off.

When that boundary is crossed, the nature of the review changes. It becomes less about understanding and more about scanning.

AI-assisted workflows push against that boundary constantly.

A change that might once have been 50 lines shows up as several hundred, covering multiple concerns at once.

Verifying it properly means reconstructing how the pieces fit together, not just checking syntax or style. Most reviewers don’t have the time.

We are already seeing the fallout of this in real time. Industry outlets like The New Stack recently reported on a growing crisis where engineering teams and open-source maintainers are actively “drowning in AI-generated code.”

Flowchart comparing PR generation (5-10 minutes) vs PR review (60-90+ minutes), showing the growing imbalance between writing and understanding code. — 10 minutes to write it. 90 minutes to understand it. That gap is the problem. (Image source: The New Stack)

So behavior adapts. Reviews become lighter.

People look for obvious issues, rely on tests as a proxy for correctness and move on.

PR size grew from 50 to 320 lines 2021 to 2025, while engineers able to debug without AI dropped from 91% to 38%. — More code. Less understood. Faster to ship. Slower to fix.

Nothing about this feels irresponsible in the moment. The problem is that the standard for approval has changed without anyone explicitly deciding it should.

The generational consequence

Strong engineering has always been built through friction. Write something, watch it break, build intuition through debugging.

When a tool handles most of the implementation, that loop compresses. The code works often enough that you stop interrogating why.

Meme: monkey puppet looking away captioned "when the system goes down and you have to debug the 500 lines of AI code you just blindly approved." — Every engineering team, eventually.

Over time, that reshapes what a team collectively knows.

You still have people who can produce code quickly. What becomes less certain is how many can take a failing system, step through it and find the fix without reaching for the same tools that built it.

Code review sits right at that intersection. Under more pressure than ever, at exactly the moment its role in maintaining shared understanding matters most.

4. The Slow Disaster Nobody Voted For

So far this probably sounds like a tooling problem. Something to fix, tweak, manage better.

But that’s not really what’s going on.

The harder truth is that most of the behaviour driving this makes complete sense.

Everyone is acting rationally.

That’s exactly what makes it so difficult to stop.

The deferral logic

If you assume that models will keep improving, deferring cleanup starts to look like the sensible choice. Why spend time simplifying something today when a better model will be able to read, refactor, or regenerate it more easily in a few months?

The promise of lunar habitats in 1969 vs the reality: NASA spinoff technologies like sunglasses, computer mouse and smoke detectors. — We were promised the moon. We got a better smoke detector. AI cleanup is the same story. (Image source: Gergely Orosz)

The cost of waiting appears to go down over time. So teams push forward.

They build faster, accept a little more opacity and trust that future tools will make sense of it when needed.

Each individual decision is easy to justify: shipping now has immediate value and cleaning up later feels like a smaller, more flexible cost.

Line chart showing perceived manageability declining while actual comprehension debt rises, crossing at month 12. — It feels fine. Until it doesn’t.

When everyone operates under that assumption, the system fills up with code that works but isn’t deeply understood. It doesn’t feel dangerous, because nothing breaks right away.

The subprime logic applied to software

This is the structure of the 2008 mortgage crisis applied to codebases. The risk wasn’t created by obviously reckless decisions.

It built up because the system rewarded short-term gains and made future costs look manageable.

Everyone acted in ways that made sense locally.

The failure came from how those decisions stacked together over time and from the assumption that conditions would keep moving in a direction that made the exposure feel safe.

The same structure is playing out here. Teams optimize for delivery because that’s what gets measured.

They rely on improving tools because that has been a reasonable assumption so far.

Over time, they accumulate a codebase that sits just beyond what they can confidently hold, but not far enough beyond to force a correction.

When that boundary is crossed, the options narrow quickly. What looked like flexibility earlier turns into constraint. And the cost you thought you were deferring doesn’t disappear.

When it finally shows up, it doesn’t trickle in. It lands all at once and by then it’s a much bigger mess than it ever needed to be.

Developer working late at a multi-monitor setup with tangled code and sticky notes, surrounded by energy drinks and takeout. — Shipping fast. Understanding it later. Maybe.

5. What Governance Actually Looks Like When It Works

By the time teams feel the drag, it’s already late.

The ones that avoid it tend to behave differently much earlier, often before the generated code becomes a large share of their workflow.

They don’t treat generation as the starting point.

They treat the system as the starting point.

Tangled cables transforming into clean organized lines, representing messy vs structured code. — Someone had to care enough to untangle it.

Architecture first, generation second

You can see this in how high-performing teams handle architectural decisions.

There’s usually a record of where the system is headed, often captured as architecture decision records.

Not a thick document nobody reads, but actual decisions that got written down and get checked on.

When new code comes in, whoever wrote it, whether a person or a model, it’s expected to fit that direction. Someone senior owns that.

They hold the line.

Most teams don’t work this way.

Architectural ownership tends to be everyone’s responsibility, which usually means it’s nobody’s.

Tiered review, not uniform scrutiny

Review works differently in these teams. Not every change gets treated the same, but the difference is explicit.

Small, low-risk changes move quickly.

Larger changes, especially those that introduce new patterns or touch critical paths, are slowed down on purpose.

What matters is not just whether the code runs, but whether the person who wrote it or approved it can explain how it works and why it was structured that way.

That expectation shows up in subtle ways: a pull request is less likely to move forward if the author cannot walk through the logic, even if the tests pass.

And the logic is pretty simple.

AI is great at handling the boring, repetitive stuff.

So let it.

The time you save there goes back into actually understanding the code that matters.

The code that, if it breaks, hurts.

The net cost is roughly neutral. The risk profile is not.
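To make the tiering concrete, here is a minimal sketch of how a team might encode it. Everything in it is an assumption for illustration: the 50-line threshold, the critical path prefixes, and the `PullRequest` fields are hypothetical and not taken from any particular tool or team.

```python
from dataclasses import dataclass

# Hypothetical values; tune them to your own codebase.
SMALL_CHANGE_MAX_LINES = 50
CRITICAL_PATH_PREFIXES = ("billing/", "auth/", "migrations/")


@dataclass
class PullRequest:
    lines_changed: int
    files: list[str]
    introduces_new_pattern: bool = False   # flagged by the author or a reviewer
    author_walkthrough_done: bool = False  # author explained the logic to a reviewer


def touches_critical_path(pr: PullRequest) -> bool:
    # str.startswith accepts a tuple of prefixes.
    return any(f.startswith(CRITICAL_PATH_PREFIXES) for f in pr.files)


def review_tier(pr: PullRequest) -> str:
    """Return 'fast' for small low-risk changes, 'deep' for everything else."""
    if pr.introduces_new_pattern or touches_critical_path(pr):
        return "deep"
    if pr.lines_changed > SMALL_CHANGE_MAX_LINES:
        return "deep"
    return "fast"


def can_merge(pr: PullRequest, tests_pass: bool) -> bool:
    """Green tests are necessary; deep-tier changes also need a walkthrough."""
    if not tests_pass:
        return False
    if review_tier(pr) == "deep":
        return pr.author_walkthrough_done
    return True
```

The point of the sketch is the last function: passing tests alone never unlocks a deep-tier change. Someone has to be able to explain it first.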

Measuring what actually matters

These teams also watch different numbers.

Velocity exists, but it’s not the only signal that matters.

Code churn, which is how often recently merged code needs to be substantially revised, is a reliable early indicator that comprehension debt is building.

How long it takes to fix a bug in AI-generated code tells you a lot.

If it’s taking forever, chances are nobody really understood that code in the first place.

Neither of these metrics is exotic.

Most teams just aren’t tracking them.

The result, counterintuitively, is that these teams end up moving faster over time. Because the system remains understandable, changes carry less hidden risk.

Tiered code review model and velocity crossover: quality tracking teams outpace speed-only teams over 18 months. — Slow down on purpose. Win in the long run.

Engineers spend less time rediscovering how things work.

Decisions hold instead of being undone.

6. The Uncomfortable Truth About What This Means for Your Team

These tools amplify whatever is already there.

Strong foundations, clear ownership, a team that actually understands how the pieces fit together. All of it gets faster and better.

But if those foundations were shaky to begin with, the same acceleration just pushes more code into the places that were already hard to work through.

The gaps don’t stay the same size.

They grow.

Strong teams accelerate, weak teams decline. Hidden gap: metrics look fine while team understanding erodes. — Same tools. Opposite outcomes. The metrics won’t warn you.

Decisions that were slightly unclear become harder to untangle.

Things that could have been sorted out early start to feel permanent.

Judgment matters more, not less

Over time, one thing becomes clear. Your team’s judgment has never mattered more.

Because the leverage on that judgment has never been higher.

Every decision about what to generate, what to accept and what to revisit carries more weight than it used to.

A mistake that once stayed local can now spread through the entire system before anyone realises what happened.

The tools, used carefully, deliver real value.

But the teams that hold onto that value over time are the ones who stay in control of what the code actually means. Not just how much of it exists.

The gap that doesn’t announce itself

For a while, everything still looks like progress.

The codebase grows, features ship, the metrics look fine. But the real gap, how much of the system anyone can actually account for, stays hidden.

It only becomes visible when something forces the question.

By then, the options are already narrower.

The tools did exactly what they were supposed to do. What built up quietly in the background was the cost of treating understanding as something that could always wait.

But it couldn’t.
