How to 10x any AI skill using Karpathy's Autoresearch method
Karpathy built a loop that runs 100 experiments while he sleeps. The pattern works on anything you can measure. Here is how to run it yourself.
Last night I went to sleep with a problem I had been stuck on for a week.
This morning it was solved.
Not by a co-worker. Not by me at 3am. By a system that kept working after I closed the laptop, trying ideas I would never have bothered with, killing the ones that failed, keeping only what beat the bar I set before bed.
I woke up and reviewed the wins over coffee.
That is what Andrej Karpathy’s new Autoresearch method actually feels like.
And when Karpathy ships something, I pay attention. The man keeps showing us the future a year before the rest of us have a word for it.
Here is the part that got me: Autoresearch is not really a coding tool. It is a loop. And the second you understand the loop, you start seeing how much of your own week is still being done by hand.
Table of Contents
1. The Man Behind the Method
2. What Autoresearch Actually Is
3. Why It Went Viral and What the World Did With It
4. The Real Innovation Is Not the Code
5. How to Run Your Own Version of This
6. What This Is Actually Telling Us
1. The Man Behind the Method
Andrej Karpathy has been close to the center of modern AI more than once.
He co-founded OpenAI, led AI at Tesla during its push into autonomy, taught neural networks at Stanford until students started building companies from his lecture notes, and built tools like nanoGPT that made complex systems easier to understand and replicate.
That alone explains why people watch what he does.
A Track Record of Shaping AI
What makes his work travel further, though, is something subtler. He has a habit of naming directions just as they start to matter. “Software 2.0” captured the change from writing rules to training models. “Vibe coding” described a looser, more exploratory way of working with AI systems. “Agentic engineering” pointed to software that operates with a degree of autonomy. “Jagged intelligence” gave language to systems that perform unevenly across tasks.
None of these created the movement they describe. They made it legible, and once named, the movement accelerated.
Giving Language to the Future
That pattern matters here.
Autoresearch did not land as an isolated experiment. It arrived as the next step in a line of ideas about how work changes when iteration itself becomes programmable.
2. What Autoresearch Actually Is
Start from what it feels like to use.
You write a short Markdown file that describes what you want improved and how to tell if it’s better.
You point a coding agent at a codebase.
Then you leave it alone.

By morning, there is a log of attempts. Dozens, sometimes over a hundred. Each one tried a variation, ran it for a fixed window, and recorded the outcome. The weak ones are gone. The ones that beat the current best are kept, committed, and ready for you to inspect. Nothing waited for your input once the loop started.

The Three-File Architecture
Under the hood, the structure is simple and deliberate. There are three files doing distinct jobs.
prepare.py acts as the neutral judge. It defines how results are measured and is not touched by the agent.
train.py is the sandbox. The agent rewrites it freely, proposing changes and testing them.
program.md is the only file the human writes. It sets the objective, the constraints, and the success criteria in plain language.
The Relentless “Ratchet” Effect
The loop itself runs like a ratchet.
Each experiment gets a fixed budget, often around five minutes. At the end, the result is scored against the current best. If it improves, it stays and becomes the new baseline. If it does not, it is discarded without hesitation.
Every attempt is logged through git, which becomes both a memory and an audit trail.
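The keep-or-discard rule described above can be sketched in a few lines of Python. This is a toy illustration, not Karpathy's code: `Ratchet`, `run_loop`, `propose`, and `evaluate` are hypothetical names, the in-memory `log` stands in for the git history, and `evaluate` is assumed to enforce the fixed time budget on its own.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Candidate = dict  # hypothetical: whatever the agent is allowed to change


@dataclass
class Ratchet:
    """Holds the best result so far. Lower score is better."""
    best_score: float
    best_candidate: Candidate
    # Stand-in for the git history: (candidate, score, kept) per experiment.
    log: List[Tuple[Candidate, float, bool]] = field(default_factory=list)

    def step(self, candidate: Candidate, score: float) -> bool:
        """Record one experiment and keep it only if it beats the baseline."""
        kept = score < self.best_score
        if kept:
            self.best_score = score
            self.best_candidate = candidate
        self.log.append((candidate, score, kept))
        return kept


def run_loop(propose: Callable[[Candidate], Candidate],
             evaluate: Callable[[Candidate], float],
             baseline: Candidate,
             n_experiments: int) -> Ratchet:
    """Propose a variation of the current best, score it, keep or discard."""
    ratchet = Ratchet(evaluate(baseline), baseline)
    for _ in range(n_experiments):
        candidate = propose(ratchet.best_candidate)
        ratchet.step(candidate, evaluate(candidate))
    return ratchet


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy problem: the "loss" is the squared distance of x from 3.0.
    result = run_loop(
        propose=lambda c: {"x": c["x"] + rng.uniform(-1.0, 1.0)},
        evaluate=lambda c: (c["x"] - 3.0) ** 2,
        baseline={"x": 0.0},
        n_experiments=126,
    )
    kept = sum(1 for _, _, was_kept in result.log if was_kept)
    print(f"best score {result.best_score:.4f}, kept {kept} of {len(result.log)}")
```

Note that every proposal starts from the current best, never from a discarded attempt. That is what makes it a ratchet rather than a random walk.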
In Karpathy’s own runs, this produced 126 experiments overnight, moving validation bits-per-byte (lower is better) from 0.9979 to 0.9697. Over two days, the system ran roughly 700 experiments and improved a benchmark by about 11 percent.
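For a sense of scale, the overnight figures can be checked with a little arithmetic. The sketch below assumes every run used the full five-minute budget with no overhead between runs, which the text does not state.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after` (positive means it went down)."""
    return (before - after) / before


# Figures quoted in the text above.
bpb_before, bpb_after = 0.9979, 0.9697
experiments = 126

# Assumption: each experiment consumed exactly its five-minute budget.
minutes_per_experiment = 5

bpb_drop = relative_drop(bpb_before, bpb_after)
wall_clock_hours = experiments * minutes_per_experiment / 60

print(f"bits-per-byte drop: {bpb_drop:.2%}")        # about 2.8%
print(f"wall-clock time:    {wall_clock_hours} h")   # 10.5 h
```

Under that assumption, the overnight run is about ten and a half hours of compute for a drop of a little under 3 percent. The gain comes from many small kept steps, not one large one.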
The agent surfaced optimizations Karpathy had not applied in more than twenty years of working on similar systems.
Sit with that for a moment.
3. Why It Went Viral and What the World Did With It
The reaction was immediate and disproportionate to the size of the repo. Within days, the project crossed tens of thousands of stars on GitHub, eventually landing around 66,000, with nearly 9,600 forks as people began adapting it to their own workflows.
Fortune gave it a name that stuck: the “Karpathy Loop.” That label matters less for branding and more for what it signaled. People were not just reading the code. They were recognizing a pattern they could reuse.

Swarm Intelligence in Action
The adaptations are where the real signal sits. A team behind Hyperspace AI set up a distributed version of the loop, running 333 experiments overnight across 35 agents.
When one agent found a better initialization strategy, it spread through a gossip-style protocol and was adopted by 23 others within hours.
In the process, these agents independently rediscovered optimization strategies that had taken human researchers years to formalize.
No coordination, no prior knowledge, just repeated evaluation under a shared objective.
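How a single good result can spread without a coordinator is easier to see in a toy simulation. The sketch below is not Hyperspace AI's protocol, whose details the text does not give. It assumes a simple push-pull gossip in which each agent contacts one random peer per round and both keep the better (lower) score. `gossip_round` and `spread` are hypothetical names.

```python
import random
from typing import List


def gossip_round(scores: List[float], rng: random.Random) -> List[float]:
    """Each agent in turn contacts one random peer; both keep the lower score."""
    updated = list(scores)
    n = len(updated)
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # shift so an agent never picks itself
        updated[i] = updated[j] = min(updated[i], updated[j])
    return updated


def spread(scores: List[float], rng: random.Random,
           max_rounds: int = 1000) -> List[int]:
    """Return how many agents hold the best score after each round."""
    best = min(scores)
    adopters = [sum(s == best for s in scores)]
    for _ in range(max_rounds):
        if adopters[-1] == len(scores):
            break  # everyone has adopted the best result
        scores = gossip_round(scores, rng)
        adopters.append(sum(s == best for s in scores))
    return adopters


if __name__ == "__main__":
    # 35 agents, one of which has just found a better (lower) score.
    start = [1.0] * 34 + [0.9]
    print(spread(start, random.Random(0)))
```

The adopter count can only grow from round to round, because an agent that holds the best score never trades it for a worse one. No agent needs a global view for the improvement to reach everyone.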
Breaking Out of the AI Bubble
Outside of machine learning, the same structure held.
Eric Siu applied it to marketing experiments, moving from roughly 30 iterations a year to 36,500.

At Shopify, Tobi Lütke adapted the loop internally and saw a 19 percent improvement in validation scores overnight.
A developer working on web performance used the same idea to reduce page load time from 1,100 milliseconds to 67 milliseconds over 67 rounds.
None of these people are machine learning researchers.
The pattern transferred because it has nothing to do with ML specifically. It has to do with measurement, iteration, and removing the human from the loop.
4. The Real Innovation Is Not the Code
The loop gets most of the attention because it is easy to see.
Experiments run, results improve, and logs fill up.
But the part that actually determines whether any of this works sits in a far less impressive place: a plain Markdown file.
program.md is where the entire system is defined in English.
It dictates what can be changed, what stays fixed, how results are judged, and what counts as failure.
Everything the agent does traces back to this document.
This design assumes that a capable language model doesn’t need a heavy orchestration layer to behave coherently.
It just needs a well-specified problem.

The ratchet loop only enforces discipline.
It does not decide what “better” means.
The New Bottleneck is English
This is where the shift in our role becomes tangible.
We are moving from writing code, to directing systems, to advising the research process itself. That sounds like a simple progression until you try to do it.
Most people can describe what they want in loose terms, but very few can write a specification that survives contact with repeated, automated iteration.
A vague instruction produces noise at scale.
A precise one produces compounding gains.
The real bottleneck is no longer running experiments.
It is writing an evaluation contract precise enough that the agent can optimize without cheating.
What trade-offs are acceptable?
What should never be touched?
These decisions used to sit inside a person’s head, adjusted intuitively over time.
Now, they must be written down so a machine can enforce them without interpretation.
Most workflows simply are not built that way.
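To make "written down so a machine can enforce them" concrete, here is a minimal sketch of an evaluation contract as data plus a check. In the setup described above, the contract lives in plain English inside program.md. The `EvalContract` class, its field names, the metric name `val_bpb`, and the 300-second budget are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class EvalContract:
    metric: str                 # the single number being optimized
    lower_is_better: bool
    budget_seconds: int         # fixed wall-clock budget per experiment
    editable: FrozenSet[str]    # files the agent may rewrite
    frozen: FrozenSet[str]      # files the agent must never touch

    def violations(self, changed_files: Iterable[str],
                   elapsed_seconds: float) -> List[str]:
        """List every way an experiment broke the contract."""
        problems = []
        for path in changed_files:
            if path in self.frozen:
                problems.append(f"touched frozen file: {path}")
            elif path not in self.editable:
                problems.append(f"touched unlisted file: {path}")
        if elapsed_seconds > self.budget_seconds:
            problems.append("exceeded time budget")
        return problems

    def accepts(self, new_score: float, best_score: float,
                changed_files: Iterable[str], elapsed_seconds: float) -> bool:
        """A result counts only if it is clean AND strictly better."""
        if self.violations(changed_files, elapsed_seconds):
            return False
        if self.lower_is_better:
            return new_score < best_score
        return new_score > best_score


# Hypothetical contract mirroring the three-file setup described earlier.
contract = EvalContract(
    metric="val_bpb",
    lower_is_better=True,
    budget_seconds=300,
    editable=frozenset({"train.py"}),
    frozen=frozenset({"prepare.py", "program.md"}),
)

if __name__ == "__main__":
    print(contract.accepts(0.9697, 0.9979, ["train.py"], 290))    # True
    print(contract.accepts(0.5000, 0.9979, ["prepare.py"], 290))  # False
```

The second call is the cheating case. The score looks spectacular, but the agent edited the judge, so the result is rejected before the score is even compared.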
What the Loop Can’t Do
There are constraints the demos don’t hide.
The ratchet only moves forward.
It cannot deliberately step back to explore a worse configuration that might unlock a larger gain later, which limits certain kinds of discovery.
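The forward-only limitation shows up even on a tiny made-up landscape. In the sketch below, `losses` is an invented list of scores indexed by configuration, and the search may only move to an adjacent configuration that is strictly better. Both the landscape and `ratchet_search` are hypothetical.

```python
from typing import List


def ratchet_search(losses: List[float], start: int) -> int:
    """Move to the best strictly-better neighbor until none exists."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(losses)]
        better = [p for p in neighbors if losses[p] < losses[pos]]
        if not better:
            return pos  # no neighbor improves: the ratchet stops here
        pos = min(better, key=lambda p: losses[p])


# Invented landscape: a shallow dip at index 2, the true minimum at index 7.
losses = [5, 4, 3, 4, 5, 2, 1, 0]

if __name__ == "__main__":
    print(ratchet_search(losses, start=0))  # 2: stuck in the shallow dip
    print(ratchet_search(losses, start=5))  # 7: reaches the true minimum
```

Starting from index 0, the search stops at index 2, because reaching the better region would first require accepting the worse scores at indices 3 and 4.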
There is also the usual risk of overfitting if the loop runs too long.
The system isn’t going to invent something out of nowhere.
What it does is find the kinds of small wins a patient, careful person would eventually stumble onto themselves.
That doesn’t make it less useful. If anything, it makes it more honest.
Autoresearch won’t hand you a breakthrough.
It just takes weeks of slow, careful tweaking and gets it done in a few hours, without you getting tired, bored, or quietly fooling yourself the way we all do when we’re the ones doing the work.








