The AI Corner

How to 10x any AI skill using Karpathy's Autoresearch method

Karpathy built a loop that runs 100 experiments while he sleeps. The pattern works on anything you can measure. Here is how to run it yourself.

Ruben Dominguez
May 21, 2026

Karpathy’s Autoresearch - The Loop That Does Your Research While You Sleep

Last night I went to sleep with a problem I had been stuck on for a week.

This morning it was solved.

Not by a co-worker. Not by me at 3am. By a system that kept working after I closed the laptop, trying ideas I would never have bothered with, killing the ones that failed, keeping only what beat the bar I set before bed.

I woke up and reviewed the wins over coffee.

A man sits in bed holding coffee, looking at a laptop on a nearby desk displaying a dashboard showing all tasks completed overnight.
Waking up to find the work already done. The dream, right?

That is what Andrej Karpathy’s new Autoresearch method actually feels like.

And when Karpathy ships something, I pay attention. The man keeps showing us the future a year before the rest of us have a word for it.

Here is the part that got me: Autoresearch is not really a coding tool. It is a loop. And the second you understand the loop, you start seeing how much of your own week is still being done by hand.


📢 A quick word before we get into it:

Karpathy’s whole point is that the manual loop is the bottleneck. The work was never the problem. The hand-tuning was.

Access approvals are the same story.

Databricks outgrew the brittle Python scripts and spreadsheets they used to manage access to sensitive AI and data workloads.

With Opal, teams now request time-bound, just-enough access to Databricks, AWS, and Okta in minutes, not days. The result: a nearly 97% drop in median time to approve or deny access.

Engineers ship AI faster. Nobody loses control.

See how Databricks scaled safe!

That is the same lesson as the loop. Take the human out of the part that should have been automated all along.


Table of Contents

1. The Man Behind the Method

2. What Autoresearch Actually Is

3. Why It Went Viral and What the World Did With It

4. The Real Innovation Is Not the Code

5. How to Run Your Own Version of This

6. What This Is Actually Telling Us


1. The Man Behind the Method

Andrej Karpathy has been close to the center of modern AI more than once.

He co-founded OpenAI, led AI at Tesla during its push into autonomy, taught neural networks at Stanford until students started building companies from his lecture notes, and built tools like nanoGPT that made complex systems easier to understand and replicate.

Andrej Karpathy, OpenAI co-founder and the mind behind Autoresearch.

That alone explains why people watch what he does.

A Track Record of Shaping AI

What makes his work travel further, though, is something subtler. He has a habit of naming directions just as they start to matter. “Software 2.0” captured the change from writing rules to training models. “Vibe coding” described a looser, more exploratory way of working with AI systems. “Agentic engineering” pointed to software that operates with a degree of autonomy.

Karpathy's diagram of program space, showing Software 1.0 as a small point and Software 2.0 as a much larger region reached through optimization.
Karpathy's "Software 2.0" map: written code is a tiny dot, trained models cover the rest. (Image source: Karpathy on Medium)

“Jagged intelligence” gave language to systems that perform unevenly across tasks. None of these created the movement they describe.

They made it legible, and once named, the movement accelerated.

Giving Language to the Future

That pattern matters here.

Autoresearch did not land as an isolated experiment. It arrived as the next step in a line of ideas about how work changes when iteration itself becomes programmable.


2. What Autoresearch Actually Is

Start from what it feels like to use.

You write a short Markdown file that describes what you want improved and how to tell if it’s better.

You point a coding agent at a codebase.

Then you leave it alone.

Three-stage diagram showing the shift from vibe coding to agentic engineering to independent research, with decreasing human involvement.
The progression: from writing code, to directing agents, to advising research. (Image source: Datacamp)

By morning, there is a log of attempts. Dozens, sometimes over a hundred. Each one tried a variation, ran it for a fixed window, and recorded the outcome. The weak ones are gone. The ones that beat the current best are kept, committed, and ready for you to inspect. Nothing waited for your input once the loop started.

Flowchart of the Autoresearch ratchet loop: read program.md, propose a change, run for five minutes, keep it if the score improves, revert if not.
The ratchet. Test, score, keep or revert. Repeat all night. (Image source: Datacamp)

The Three-File Architecture

Under the hood, the structure is simple and deliberate. There are three files doing distinct jobs. prepare.py acts as the neutral judge. It defines how results are measured and is not touched by the agent. train.py is the sandbox. The agent rewrites it freely, proposing changes and testing them. program.md is the only file the human writes. It sets the objective, the constraints, and the success criteria in plain language.

Allegorical illustration of Autoresearch's three files: prepare.py as a blindfolded judge, train.py as the agent's sandbox, and program.md as the human author at work.
Three files, three roles. The judge, the sandbox, and the human who sets the rules.

The Relentless “Ratchet” Effect

The loop itself runs like a ratchet.

Each experiment gets a fixed budget, often around five minutes. At the end, the result is scored against the current best. If it improves, it stays and becomes the new baseline. If it does not, it is discarded without hesitation.

Every attempt is logged through git, which becomes both a memory and an audit trail.
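To make the ratchet concrete, here is a minimal sketch in Python. It is not Karpathy's code. The names `ratchet`, `propose`, and `evaluate` are hypothetical, the toy objective stands in for a real training run with a fixed time budget, and a plain list stands in for the git history. It assumes lower scores are better.

```python
import random


def ratchet(evaluate, propose, baseline, rounds, seed=0):
    """Keep a candidate only if it scores strictly lower (better) than the best so far."""
    rng = random.Random(seed)
    best, best_score = baseline, evaluate(baseline)
    log = []  # stand-in for the git history: (round, score, kept)
    for i in range(rounds):
        candidate = propose(best, rng)   # always start from the current best
        score = evaluate(candidate)      # a real run would train for a fixed window here
        kept = score < best_score
        if kept:
            best, best_score = candidate, score  # new baseline
        log.append((i, score, kept))     # failed attempts are logged, then discarded
    return best, best_score, log


def evaluate(x):
    """Toy objective (assumption): distance from 3, squared."""
    return (x - 3.0) ** 2


def propose(x, rng):
    """Toy proposal (assumption): nudge the current best by a small random step."""
    return x + rng.uniform(-1.0, 1.0)


best, score, log = ratchet(evaluate, propose, baseline=0.0, rounds=200)
print(f"kept {sum(kept for _, _, kept in log)} of {len(log)} attempts")
```

The only state that survives a round is the current best and the log. That is what lets the loop run unattended.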

In Karpathy’s own runs, this produced 126 experiments overnight, moving validation bits-per-byte from 0.9979 to 0.9697. Over two days, the system ran roughly 700 experiments and improved a benchmark by about 11 percent.
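For a sense of scale, the overnight numbers can be checked with a few lines of arithmetic. The five-minute budget per experiment is an assumption taken from the "often around five minutes" figure, and lower bits-per-byte is better.

```python
start_bpb, end_bpb = 0.9979, 0.9697

# Relative reduction in validation bits-per-byte over the overnight run.
relative_gain = (start_bpb - end_bpb) / start_bpb
print(f"{relative_gain:.2%}")  # 2.83%

# Wall-clock time if every experiment used a five-minute budget (assumption).
experiments, minutes_each = 126, 5
hours = experiments * minutes_each / 60
print(hours)  # 10.5
```

So the overnight run is a relative improvement of a little under 3 percent, and 126 five-minute experiments fit in about ten and a half hours.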

The agent surfaced optimizations Karpathy had not applied in more than twenty years of working on similar systems.

Sit with that for a moment.


3. Why It Went Viral and What the World Did With It

The reaction was immediate and disproportionate to the size of the repo. Within days, the project crossed tens of thousands of stars on GitHub, eventually landing around 66,000, with nearly 9,600 forks as people began adapting it to their own workflows.

Fortune gave it a name that stuck: the “Karpathy Loop.” That label matters less for branding and more for what it signaled. People were not just reading the code. They were recognizing a pattern they could reuse.

Flowchart of the Karpathy Loop: human writes program.md, agent modifies code, runs a five-minute experiment, keeps or discards based on the metric, then repeats.
The full Karpathy Loop. The human writes the spec once. The agent does the rest until the stopping criteria hit. (Image source: The New Stack)

Swarm Intelligence in Action

What follows is where the signal sits. A team behind Hyperspace AI set up a distributed version of the loop, running 333 experiments overnight across 35 agents.

When one agent found a better initialization strategy, it spread through a gossip-style protocol and was adopted by 23 others within hours.

In the process, these agents independently rediscovered optimization strategies that had taken human researchers years to formalize.

No coordination, no prior knowledge, just repeated evaluation under a shared objective.
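The details of Hyperspace's protocol are not described here, so the following is only a generic sketch of how gossip-style spreading works, under simple assumptions: 35 agents, each asking one random peer per round and copying the peer's result if it is better. The function names and the scores are made up for illustration.

```python
import random


def gossip_round(scores, rng):
    """Each agent asks one random peer and copies the peer's score if it is lower (better)."""
    n = len(scores)
    updated = list(scores)
    for agent in range(n):
        peer = rng.randrange(n)
        if scores[peer] < updated[agent]:
            updated[agent] = scores[peer]
    return updated


def rounds_until_spread(n_agents=35, seed=0, max_rounds=1000):
    """Count rounds until one agent's improvement has reached every other agent."""
    rng = random.Random(seed)
    scores = [1.0] * n_agents  # everyone starts at the same baseline
    scores[0] = 0.9            # one agent finds a better initialization
    for r in range(1, max_rounds + 1):
        scores = gossip_round(scores, rng)
        if all(s == 0.9 for s in scores):
            return r
    return None  # did not spread within max_rounds


print(rounds_until_spread())
```

No agent needs a global view. A good result travels because every agent keeps comparing itself to a neighbour under the same objective.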

Breaking Out of the AI Bubble

Outside of machine learning, the same structure held.

Eric Siu applied it to marketing experiments, moving from roughly 30 iterations a year to 36,500.

Infographic showing how Eric Siu applied the Autoresearch loop to marketing, scaling from 20 to 170+ experiments per month across 17 channels.
Eric Siu’s marketing version. Same loop, 30 experiments a year became 36,500. (Image source: Eric Siu)

At Shopify, Tobi Lütke adapted the loop internally and saw a 19 percent improvement in validation scores overnight.

A developer working on web performance used the same idea to reduce page load time from 1,100 milliseconds to 67 milliseconds over 67 rounds.

None of these people are machine learning researchers.

The pattern transferred because it has nothing to do with ML specifically. It has to do with measurement, iteration, and removing the human from the loop.


4. The Real Innovation Is Not the Code

The loop gets most of the attention because it is easy to see.

Experiments run, results improve, and logs fill up.

Split-screen comparison showing complex Python loop code on the left labeled "What people watch" and a short program.md spec on the right labeled "What actually decides the outcome."
The loop gets the attention. The Markdown file does the work.

But the part that actually determines whether any of this works sits in a far less impressive place: a plain Markdown file.

program.md is where the entire system is defined in English.

It dictates what can be changed, what stays fixed, how results are judged, and what counts as failure.

Everything the agent does traces back to this document.

This design assumes that a capable language model doesn’t need a heavy orchestration layer to behave coherently.

It just needs a well-specified problem.

The six-component prompt blueprint: defining the who and how, action-oriented goals, setting boundaries, injecting relevant knowledge, defining done, and the north star.
The six pieces of a prompt that actually holds up under repetition. (Image source: Ilert)

The ratchet loop only enforces discipline.

It does not decide what “better” means.

The New Bottleneck is English

This is where the shift in our role becomes tangible.

We are moving from writing code, to directing systems, to advising the research process itself. That sounds like a simple progression until you try to do it.

Most people can describe what they want in loose terms, but very few can write a specification that survives contact with repeated, automated iteration.

A vague instruction produces noise at scale.

A precise one produces compounding gains.

The real bottleneck is no longer running experiments.

It is writing an evaluation contract precise enough that the agent can optimize without cheating.

Illustrated book titled "LLM Evaluation Metrics" glowing in a moonlit field.
The unglamorous part of AI work is writing the rules that decide what good means.

What trade-offs are acceptable?

What should never be touched?

These decisions used to sit inside a person’s head, adjusted intuitively over time.

Now, they must be written down so a machine can enforce them without interpretation.
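One way to picture "written down so a machine can enforce them" is a small acceptance check. This is a hypothetical sketch, not part of Autoresearch. The `Contract` fields, the metric names, and the 16 GB memory cap are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    """Hypothetical evaluation contract; every field here is illustrative."""
    metric: str              # name of the score to minimise
    frozen_files: frozenset  # what should never be touched
    limits: dict             # hard caps on side metrics, i.e. acceptable trade-offs


def accept(contract, best_score, result, changed_files):
    """Accept only a strict improvement that breaks no rule in the contract."""
    if contract.frozen_files & set(changed_files):
        return False  # touched something off limits
    for name, cap in contract.limits.items():
        if result[name] > cap:
            return False  # improvement bought with an unacceptable trade-off
    return result[contract.metric] < best_score


contract = Contract("val_loss", frozenset({"prepare.py"}), {"peak_memory_gb": 16.0})

print(accept(contract, 1.00, {"val_loss": 0.98, "peak_memory_gb": 12.0}, ["train.py"]))    # True
print(accept(contract, 1.00, {"val_loss": 0.90, "peak_memory_gb": 20.0}, ["train.py"]))    # False
print(accept(contract, 1.00, {"val_loss": 0.90, "peak_memory_gb": 12.0}, ["prepare.py"]))  # False
```

The second and third calls show the kind of "win" a vague spec would let through: a better score bought by blowing a budget or by editing the judge.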

Most workflows simply are not built that way.

What the Loop Can’t Do

There are constraints the demos don’t hide.

The ratchet only moves forward.

It cannot deliberately step back to explore a worse configuration that might unlock a larger gain later, which limits certain kinds of discovery.
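A toy example shows the limitation. The landscape below is made up, lower is better, and `greedy_climb` is a hypothetical stand-in for a ratchet that only ever accepts strict improvements.

```python
def greedy_climb(landscape, start):
    """Move to a neighbour only when it is strictly better (lower); never step back."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(landscape)]
        best = min(neighbours, key=lambda p: landscape[p])
        if landscape[best] >= landscape[pos]:
            return pos  # no neighbour improves, so the ratchet stops here
        pos = best


# Index 2 is a local minimum; index 6 is the global one.
landscape = [5, 4, 3, 4, 5, 2, 1, 2]

print(greedy_climb(landscape, start=0))  # 2
print(greedy_climb(landscape, start=4))  # 6
```

Starting from index 0, the search settles at index 2 and never reaches the better value at index 6, because getting there means passing through worse scores first.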

There is also the usual risk of overfitting if the loop runs too long.

The system isn’t going to invent something out of nowhere.

What it does is find the kinds of small wins a patient, careful person would eventually stumble onto themselves.

That doesn’t make it less useful. If anything, it makes it more honest.

Autoresearch won’t hand you a breakthrough.

It just takes weeks of slow, careful tweaking and gets it done in a few hours, without you getting tired, bored, or quietly fooling yourself the way we all do when we’re the ones doing the work.


5. How to Run Your Own Version of This
