Why AI Agent Learning Plateaus
From plateau to staircase: what breaks through harness optimization

Same model. Same budget. Same fork.
Andrej Karpathy's autoresearch loop converged at val_bpb 1.42 after forty experiments. Ours hit 1.39 and kept going.
Harness optimization produces a plateau. Rekursor produces a staircase.
The val_bpb edge is small, and not the point. The point is what each loop accumulated. Karpathy's blind loop tracks one signal per run: did val_bpb improve, yes or no. Ours accumulates more. That's the difference that compounds across runs and across domains.
This is a follow-up to our first post, where we argued that most capability gaps in agents are actually context gaps, and that a frozen model gets effectively smarter when given the right perception. This post takes the next step: what happens when an agent has exhausted what it can do with a fixed set of capabilities, and needs new ones it didn't start with.
The state of harness optimization
Autoresearch from Andrej Karpathy, open-sourced in March 2026, is the current frontier of autonomous ML experimentation. Soon after it came out, Harvey published the canonical enterprise validation in April: twelve legal tasks, with the average rubric score moving from 40.8% to 87.7%. Harvey also wrote about the pattern's canonical failure mode. When they ran their board resolutions task, it plateaued at 60% after three iterations and diverged on the fourth, the agent thrashing through hooks, skills, and utility scripts as the search space went unstable.
That is the shape of harness optimization at scale. Dramatic gains, then a plateau. Sometimes worse than a plateau. Harvey says they will next try reward shaping or regularization, which sounds reasonable but stays bounded inside the same paradigm.
The autoresearch loop itself is straightforward brute force: train for five minutes, measure the result, keep the change if it improved, discard it if it didn't. Repeat overnight. Garry Tan called the broader pattern "the cutting edge of recursive self-improvement" in his March piece on Karpathy's autoresearch release. Cursor and Anthropic have also shipped variations of the same approach in their coding agents.
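For concreteness, here is a minimal sketch of that control flow. The helper names are ours, not Karpathy's, and are passed in as callables; the shape of the loop is the point.

```python
def autoresearch_loop(train_and_eval, propose_change, apply_change, revert,
                      harness, n_experiments=40):
    """Greedy hill-climb: try one change per run, keep it iff val_bpb improves."""
    best_bpb = train_and_eval(harness)        # baseline 5-minute training run
    for _ in range(n_experiments):
        change = propose_change(harness)      # agent reads failures, hypothesizes a fix
        apply_change(harness, change)         # edit prompts, tools, hooks, code
        bpb = train_and_eval(harness)         # measure: one number comes out
        if bpb < best_bpb:                    # lower bits-per-byte is better
            best_bpb = bpb                    # a "keep": commit the change
        else:
            revert(harness, change)           # discard: roll back
    return best_bpb                           # one scalar is all the loop retains
```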
Don't get us wrong: autoresearch is a genuine advance, because the agent is not executing a fixed playbook. It reads its own failure feedback, hypothesizes fixes, and actually edits its own code. Early iterations fix obvious failures; later iterations develop genuine domain expertise.
But every team also seems to report the same thing about later iterations: the improvement curve flattens. On simple tasks the loop converges to near-perfect scores. On complex ones it plateaus. Sometimes it diverges. Harvey's board resolutions case is a clean example where the score got worse, not better.
This is not a bug in any particular implementation. It is a limitation of what harness optimization can do. The loop operates within a fixed set of levers: prompts, tools, skills, hooks, output structures. Once those levers are well-tuned, there is nothing left to tune. The agent has exhausted the improvements that are possible inside the bounded range in which it's operating.
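One way to see the bound: the harness is a finite configuration, and every experiment moves within it. A toy illustration, with generic lever names of our own choosing rather than any vendor's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """The fixed levers a harness-optimization loop is allowed to move."""
    system_prompt: str = ""
    tools: list[str] = field(default_factory=list)     # callable tool names
    skills: list[str] = field(default_factory=list)    # reusable procedures
    hooks: list[str] = field(default_factory=list)     # pre/post-step validations
    output_schema: dict = field(default_factory=dict)  # structured output spec

# Every experiment mutates one of these fields. The search space is huge
# but fixed in advance: a signal the harness has no field for is not
# reachable by any sequence of mutations.
```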
Three limits
Run the loop long enough and the same three limits appear.
The agent cannot perceive what it has no way to measure. A legal agent reviewing commercial leases, after full harness optimization, has excellent cross-document playbooks, validation hooks, and structured fact sheets for drafting. It has become a well-equipped associate. But if the critical signal is that a particular guaranty clause is structurally unusual compared to hundreds of similar clauses, and the agent has no way to see guaranty-clause anomaly at all, the optimization loop will try every combination of existing levers and eventually converge without noticing the problem. Harvey's board resolutions case is structurally adjacent: aggressive exploration of hooks, skills, and utility scripts that destabilized the search space rather than revealing the missing dimension. Tuning the harness does not give the agent new sight.
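To make "new sight" concrete: the missing capability here is a measurement. One plausible form, sketched below, is an anomaly score over clause embeddings; the inputs are assumed to come from any off-the-shelf sentence-embedding model, and nothing in the tuned harness computes anything like this, which is why no amount of tuning surfaces it.

```python
import numpy as np

def clause_anomaly_score(clause_vec: np.ndarray,
                         corpus_vecs: np.ndarray, k: int = 20) -> float:
    """How structurally unusual is this clause vs. hundreds of comparables?
    Mean cosine distance to its k nearest neighbors; higher = more anomalous."""
    sims = corpus_vecs @ clause_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(clause_vec) + 1e-9
    )
    nearest = np.sort(sims)[-k:]        # the k most similar comparable clauses
    return float(1.0 - nearest.mean())  # distance from its own neighborhood
```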
Every task starts from scratch. The optimized harness for lease review does not help the tax memo agent. Each of Harvey's twelve tasks was optimized independently. The customer service agent's improvements do not transfer to the fraud detection agent. Each task is its own island. Harness engineering is task-specific. Nothing accumulates. The cost of optimization scales linearly with the number of tasks.
Humans steer. Harvey closes their April writeup with the cleanest articulation of this paradigm: "Humans steer. Agents execute." The agent climbs the hill the human picks. Autoresearch requires someone to define what good looks like: rubrics, grading criteria, evaluation signals designed by human experts. The agent can climb once told which hill to climb. It cannot discover that it is climbing the wrong hill, or that a more important hill exists.
The same pattern at the frontier
This isn't limited to teams running autoresearch loops. Take a look at frontier model releases across 2025 and 2026. GPT-5 at 74.9% on SWE-bench Verified (August 2025). GPT-5.2 at 80.0% four months later (December 2025). Claude Opus follows the same trajectory. The cadence is quarterly. The gains are 3–8 points per release.
These labs have effectively unlimited resources and first-class access to every technique in the field. They are not under-investing. What they produce is the reliable, modest, incrementally-engineered progress that harness-space optimization produces when all the available levers have been well-tuned. The curve seems to have flattened at the frontier, at a higher absolute level than any team's autoresearch loop, but with the same shape and the same cadence of marginal improvement.
If the limit is real at the frontier, no amount of harness tuning at the application level escapes it. The same pattern that shapes Harvey's twelve tasks also shapes the field's quarterly frontier releases.
Rekursor
Where Harvey plans to reach for reward shaping or regularization, we take a different approach, one that doesn't plateau. It accumulates.
Five behaviors our system has that harness optimization can't produce:
1. Frozen-model compounding over long runs. We ran a frozen model on a financial filings task: classifying SEC EDGAR comment letters by disclosure deficiency severity, a domain with 2,400+ labeled letters and a long history of human-expert annotation. The model weights never changed. Performance began at 0.87× baseline and reached 1.46× over 5,000 sequential cycles. The curve isn't smooth: long flat sections punctuated by jumps (a detector sketch after this list makes the distinction concrete). Harness optimization on the same task flattens in the first few hundred cycles. We see a staircase.
[Figure: plateau vs. staircase. Harness optimization on a fixed task converges and flattens; Rekursor keeps improving in step changes, each step adding capability the previous level couldn't reach. Reference task: SEC EDGAR comment letter classification, frozen model.]
2. OK, but does compounding stop? The skeptical response to any compounding-learning claim is that every learning system saturates. We've seen the saturation, and our system works through it. On a 280-wave drug discovery campaign against a target the system was designed to optimize, it stopped finding new things to learn: for a stretch of cycles, nothing new was added. Then it began learning again, but learning a different kind of thing. When one objective stopped producing improvement, work resumed against a different one, and the cycle continued in a new regime. The saturation was local, not terminal.
3. Domain shift, no help. The system had been operating on text: SEC filings, scientific abstracts, news classification. We pointed it at a binding-affinity prediction task on a molecular target it had never seen. Initial results: no better than chance. With no domain-specific guidance and no architectural changes, performance reached AUC 0.844. The capabilities that had served it for SEC filings were the wrong tools for molecular structure. The system adapted.
4. Transfer across genuinely different domains. What transfers across domains is structural: not content, not documents, not transactions. What was learned during work on chemistry and biomedical literature, applied to SEC filings without modification, produced a +52% improvement on the relevant accuracy metric, with an 80% win rate across the test set of filing tasks. What was learned across three different data modalities simultaneously (SEC filings, molecular assays, and information extraction traces), applied to a frozen model on a held-out task, produced a mean +0.173 improvement. What gets learned isn't limited to the substrate it was learned on.
5. Cross-model transfer on agent benchmarks. SWE-bench (75 issues): excellent-completion rate 10.7% → 20%. ARC-AGI-2 (50 tasks): 24% → 30%. On a targeted Terminal-Bench 2.0 set where Opus 4.6 scored below 100% (including one task at 0%), every task flipped to pass without prompt engineering, in fewer steps than baseline on every passing task. What was learned from Opus transferred zero-shot to GLM 5.1: every targeted task passed, with no GLM-specific tuning. What gets learned isn't limited to the model it was learned from either.
As we showed in our prior post, the cross-model behavior traces to what each loop accumulates over time; an accumulation richer than a single score per run is what lets the Opus → GLM 5.1 jump transfer cleanly.
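The staircase claim in item 1 is checkable from the curve alone. A minimal sketch of the distinction, with illustrative window and threshold values rather than our production settings:

```python
import numpy as np

def find_steps(curve: np.ndarray, window: int = 50, jump: float = 0.03) -> list[int]:
    """Return cycle indices where mean performance jumps between adjacent
    windows: the steps of a staircase. A plateaued curve stops producing
    them after its initial rise; a staircase keeps producing them."""
    steps = []
    for i in range(window, len(curve) - window, window):
        before = curve[i - window:i].mean()   # trailing flat stretch
        after = curve[i:i + window].mean()    # leading stretch
        if after - before > jump:             # step up between the two
            steps.append(i)
    return steps
```

On the 5,000-cycle SEC EDGAR run described above, a harness-optimization trace yields steps only early in the run; the Rekursor trace keeps yielding them late.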
Recreating Karpathy's testbed
We ran our system on Karpathy's exact autoresearch setup to make a clean comparison.
The headline finding: over forty experiments on the same fork, Karpathy's loop accumulated one signal per run (did val_bpb improve, yes or no). Rekursor leaves each run knowing more than that. This is the substantive difference between the two approaches, and it's what the val_bpb numbers below don't capture.
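To make "knowing more" concrete, here is the contrast as data structures. The fields on the richer record are illustrative, not Rekursor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BlindRunRecord:
    """All that the greedy loop carries forward from one experiment."""
    improved: bool                  # did val_bpb go down, yes or no

@dataclass
class RichRunRecord:
    """Illustrative richer record: what a run can leave behind besides a bit."""
    val_bpb: float                                         # the score itself
    change: str                                            # what was tried
    observations: list[str] = field(default_factory=list)  # what was noticed en route
    hypotheses: list[str] = field(default_factory=list)    # why it did or didn't work
```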
Five experiments each. Karpathy's loop: val_bpb 1.4203, with one experiment kept (a "keep" is a change the loop measured as an improvement and committed to the harness). Ours: 1.4198, two keeps. The val_bpb difference is 0.0005 and not the point.
Forty experiments. Karpathy's loop converges in the 1.42 range. We hit 1.3909. The val_bpb improvement is real. It's not the point.
Rekursor keeps pace with Karpathy's results on the val_bpb metric, while also leaving the run with a richer record than the loop started with.
Setup, for the record: same fork, same model, same 5-minute budget, same seed, same val_bpb metric. Final val_bpb: Karpathy 1.4203, Rekursor 1.3909.
Where it doesn't work
The Rekursor approach works where the domain has discoverable structure that current tools miss. It fails where the structure isn't there, where simpler methods already capture the signal, or where the domain itself resists this kind of learning.
What's happening
What we're describing is a different kind of learning entirely. This isn't a rebranding of harness optimization with extra steps. It's a different curve, different kinds of improvement, different conditions for success.
Harness optimization improves how an agent uses a fixed range of capabilities. It works brilliantly within its range. It converges because the range is bounded. When it plateaus, no amount of additional iteration within the same range produces improvement.
What we observe past the plateau is not an extension of harness tuning. It's a different kind of learning. The staircase curve is the most compact evidence we have for this. A system that extends its capabilities produces step changes that a bounded system cannot produce, with each step revealing capabilities the system did not have at the previous level.
The domain-shift recovery is the most vivid evidence. A system that finds capabilities of a different kind when its existing ones don't apply is doing something qualitatively different from tuning within a known vocabulary.
The cross-domain transfer (chemistry to finance, three modalities combined into one frozen-model application, Opus to GLM) suggests that what the system has learned is not artifacts of specific domains or specific models. What transfers across domains is structural: patterns that hold up regardless of substrate.
The prediction
Teams running autoresearch and harness optimization should expect the following. Early gains will be dramatic. Middle-phase gains will be substantial. Later-phase gains will flatten, and sometimes reverse. The point of flattening varies by task and harness size. The existence of flattening does not. A team whose optimization is still improving meaningfully after several hundred iterations is either operating partly outside harness space already without naming it, or is running on a task where the underlying data contains structure that harness tuning happens to be discovering as a side effect.
When a harness optimization loop flattens, the next source of gains is not a better-tuned harness. It is a system whose record of what's happened gets richer over time, not summarized away. A system that can extend its own range, not just keep using the same one.
Where this sits
The autoresearch loop is real and valuable. Harvey, Karpathy, Cursor, Anthropic, and the teams publishing similar results are doing important work. Harness optimization should be in every deployment toolkit.
It is also a bounded method. Something comes after it.
"Humans steer. Agents execute." That's the paradigm Harvey closed with. It's accurate as a description of harness engineering. It's not the only paradigm available.
The plateau is what harness optimization produces. The staircase is what comes after. Each step is a capability the previous level couldn't see. The agents that matter in five years will not be the best-steered. They will be the ones that find capabilities on their own.
If your agent is plateauing, we'll show you exactly where it breaks and what improves.
Validated in production with paying clients and working across five domains: financial filings, drug discovery, neural network interpretability, grant writing, and terminal tasks.
Detailed architecture and additional benchmarks available on request. All capabilities described are under ongoing patent evaluation.
Amit Tandon | Founder