Beyond Frontier Model Performance for Legal AI Agents

A new axis for continual learning, and what it now does in legal AI on Harvey's benchmark.

Amit Tandon
Founder, Rekursor · 16 min read

Harvey released the Legal Agent Benchmark (LAB) earlier in May and published Initial Results on May 26, providing a baseline for how well frontier models perform on long-horizon legal work. Along with these results, Harvey, Baseten Research and Trajectory published additional findings on post-training open-source models with the Harvey benchmark. The common thread across all three is a shared and correct conviction: agents should learn from the work they do, not start fresh on every matter.

We share that conviction and have been building toward it on a different axis across multiple domains for many months. This is our second post on Harvey LAB, continuing from the first single-task result we published on May 20. We present five major results:

  • Held-out transfer across matters in the same practice area;
  • Autonomous revision that crosses the all-pass threshold on a previously-unseen task with no prior library exposure;
  • Autonomous skill generation from a firm's standard work product;
  • Results showing that revision is reliable and the system declines unsafe revisions — i.e., fixing what's wrong without breaking what's already right; and
  • A scaling-law curve showing that Rekursor's system can route the right skill from a large library to the task at hand where standard RAG breaks.

Our May 20 post on Harvey LAB showed our first example of autonomous revision of a Harvey LAB task result: using Harvey's published harness agent with claude-sonnet-4-6 and the Rekursor learning layer, we went from 45/48 to 48/48 ALL-PASS on scenario-01 after the learning layer converted a failed run into two reusable skills. That post was a first foray: audit-friendly, but a single before-and-after. This post extends that work in five directions on new held-out tasks the system had not seen, using Harvey's benchmark, harness, agent, and judge unchanged.

Weights, context, and a third axis

Continual learning for AI agents, as currently framed in the field, means moving along one of two axes.

The first is weights: take the signal from an agent's work and post-train the model on it, so the model itself gets better. This is where the field's energy and capital have gone: Harvey/Baseten's open-weight post-training, Trajectory's continuous-post-training infrastructure, and the broader move to host and improve open models inside a firm's boundary. It is serious work and it raises the floor for everything that runs on a stronger base.

The second is context: put more of the relevant material into the model's window at run time, through retrieval, long context, or in-context examples. It is immediate and useful, but the learning is transient. Nothing persists between calls; nothing compounds; the window is bounded. Both axes optimize the agent's use of perceptions the agent already has, what its training already taught it to look for and attend to.

We present a third axis: expansion of the perceptions themselves. Most capability gaps in agents are context gaps in a deeper sense than long-context retrieval addresses. The agent doesn't have the right dimensions of perception to see what the task requires. Harness optimization can rearrange and reweight the perceptions the agent already has, but it cannot give the agent new ones. Post-training can sharpen them, but it operates inside the perception space the training already built. Rekursor's learning layer expands that space: it converts the firm's past scored work into new dimensions of perception the agent reads at runtime as skills. The skills are the vehicle; the learning is in measurement space.

The third axis has properties neither of the others has. The skills accumulate as discrete, human-readable, customer-owned artifacts that persist between matters and persist unchanged on top of whatever base model sits underneath, whether closed, open-weight, or post-trained. They cannot overwrite each other the way weight updates can erase prior learning; each skill is a separate file the firm can read, edit, or disable independently. The artifacts themselves are immutable and addressable. You cannot open a weight update and read "this is the conflict rule we now apply," edit it when it is wrong, or switch it off for one matter and not another. You can do all of that with a skill that lives outside the model.
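To make the "separate, readable, switchable artifact" property concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Rekursor's storage format, which this post does not describe: the names `Skill`, `SkillLibrary`, `address`, and `active_for` are hypothetical, and a content hash stands in for whatever addressing scheme the real artifacts use.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill artifact: a named piece of human-readable text."""
    name: str
    body: str

    @property
    def address(self) -> str:
        # Content hash: editing the text yields a new artifact with a new
        # address instead of mutating the old one.
        digest = hashlib.sha256(f"{self.name}\n{self.body}".encode("utf-8"))
        return digest.hexdigest()[:12]


@dataclass
class SkillLibrary:
    """Hypothetical library: skills stored side by side, toggled per matter."""
    skills: dict = field(default_factory=dict)    # address -> Skill
    disabled: dict = field(default_factory=dict)  # matter_id -> set of addresses

    def add(self, skill: Skill) -> str:
        self.skills[skill.address] = skill
        return skill.address

    def disable(self, matter_id: str, address: str) -> None:
        self.disabled.setdefault(matter_id, set()).add(address)

    def active_for(self, matter_id: str) -> list:
        off = self.disabled.get(matter_id, set())
        return [s for addr, s in self.skills.items() if addr not in off]
```

Because each skill is stored under its own address, adding one cannot overwrite another, and disabling a skill for one matter leaves it active everywhere else.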

The third axis composes with the other axes. A firm could run a post-trained open model and load the artifacts on top, the weights lifting general capability while the skills carry the firm's own institutional knowledge in a form a partner can read. The point is that the third axis exists, that it is the right one for regulated work where the learning has to be inspectable, and that it works. The rest of this post is the evidence for that last claim.

Where Harvey and its partners point, and where the third axis comes in

The three releases from Harvey and partners each move the field's center of gravity in a different way, and the Rekursor layer sits alongside all three.

Harvey's Initial Results baselined frontier models on LAB under a strict all-pass standard, where a task passes only if every required rubric criterion passes. Three findings carry the most weight for what follows. First, legal work is nowhere near saturated: frontier models complete less than 10% of LAB tasks end-to-end in aggregate, with Opus 4.7 leading at 7.1%. Second, performance is jagged across practice areas: no single model wins everywhere, and the leader changes by practice-area grouping (GPT-5.5 in regulated and emerging-company work, Opus 4.7 in corporate transactions and funds, Sonnet 4.6 in privacy and tax). Third, frontier quality is expensive and slow: Opus 4.7 runs ~$50.90 per task and ~22 minutes wall-clock.
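The all-pass standard described above is easy to state in code. The sketch below is only an illustration: the `Criterion` fields and function names are hypothetical, and Harvey's actual rubric schema is not reproduced in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical rubric criterion; the field names are assumptions."""
    name: str
    required: bool
    passed: bool


def task_all_pass(criteria):
    """A task passes only if every required criterion passes."""
    return all(c.passed for c in criteria if c.required)


def all_pass_rate(tasks):
    """Fraction of tasks that meet the all-pass standard."""
    if not tasks:
        return 0.0
    return sum(task_all_pass(t) for t in tasks) / len(tasks)


# Illustrative data only: 49 of 50 required criteria passing still fails.
near_miss = [Criterion(f"c{i}", True, i != 0) for i in range(50)]
clean = [Criterion(f"c{i}", True, True) for i in range(50)]
```

Under this rule a 49/50 run counts the same as a 0/50 run, which is why a single closed residual can flip a task from fail to pass.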

Harvey's behavioral analysis of agent action traces also identifies self-correction as the single strongest positive signal: agents that change the deliverable after a review step gain 1.5 all-pass points on average; agents that draft but skip validation lose 1.2 points. The behaviors that correlate with success look like what a competent associate does: build context, draft, check against the record, then revise.

Harvey + Baseten Research's post-training result brings a smaller open-weight model into the frontier-band performance range on the same benchmark, with the model fully ownable by the firm. The point of that work, as published, is to give firms a closed-source-equivalent baseline they can host themselves at lower cost.

Trajectory's work, with Harvey among the launch partners, relates to infrastructure for continuously post-training agentic models. The point of that work, as published, is that the post-training loop should keep running as the firm produces more work, so the model stays current as the firm's practice evolves.

Two of Harvey's findings in particular shape the results we present below.

  • Their jaggedness finding implies that the durable thing to accumulate is the knowledge that sits above the model, portable across whichever model leads a given practice area today.
  • Their self-correction finding implies that the behavior separating passing from failing legal work is revision against the record.

The open question Harvey's analysis leaves is how to get a model to perform that revision reliably on the matters where it doesn't already. That is exactly what our learning layer supplies: self-correction made into a system property rather than an emergent one. The layer analyzes the failed run, produces an inspectable procedural artifact, and applies that artifact in a later run, all delivered autonomously by the system. The behavior Harvey identified isn't aspirational. Rekursor's learning system turns that behavior into an explicit system mechanism and demonstrates it on the tasks below.

What our learning layer adds: five results on Harvey's benchmark

We are presenting five results. Four are run on Harvey LAB directly, with Harvey's published harness, agent, and judge unchanged; the fifth is the scaling-law curve, run on SEC 10-K and 10-Q filings where we had a pre-existing candidate library large enough to test scaling behavior.

Result 1 — Library transfer, held-out within family. On corporate-governance/review-board-resolutions/scenario-02, a task in the same practice family the system had not seen, Sonnet 4.6 baselined at 45/49. We applied two skills the learning layer had previously generated from scenario-01 substrate that did not include scenario-02, and ran the agent under Harvey's published harness unchanged. Three of four residuals flipped. 45/49 → 48/49, +3, zero regressions, no new skill generated. The skills themselves carried the lift on a task the system never saw.

Those two skills came from our May 20 post on scenario-01, where the same claude-sonnet-4-6 agent went from 45/48 to 48/48 ALL-PASS after the learning layer converted a failed run into two reusable skills. This result matters most for the long run: a new matter inheriting the firm's institutional knowledge from prior matters and applying it to unseen ones, the way a senior partner's accumulated experience carries across new tasks.

Figure: Held-out library transfer on scenario-02 (corporate-governance/review-board-resolutions/scenario-02). Two skills generated from scenario-01 substrate seven days earlier (substrate that did not include scenario-02) lifted Sonnet 4.6 from a 45/49 baseline to 48/49 with zero regressions and no new skill generated. Three of four residuals flipped via library reuse.

Result 2 — Autonomous revision, clean all-pass closure. On corporate-governance/compare-corporate-bylaws-against-best-practice-governance-guidelines, a previously unseen task with no prior library exposure, Sonnet 4.6 baselined at 49/50, one criterion short of passing. Autonomous revision ran end-to-end with no human in the pipeline: no human authored the skill, selected the residual, or touched the harness or model. The result: 49/50 → 50/50 ALL-PASS, zero regressions. Cost ~$5; wall-clock ~10 minutes. A previously failing task crossed the all-pass threshold LAB actually grades against, and the new skill that closed the gap returned to the library for the next matter in the family.

Result 3 — Autonomous library generation from the firm's own data. In our demonstration on Harvey's own data, the learning layer read 22 already-scored corporate-governance agent runs (the substrate is Harvey's, not ours) and produced 7 distinct reusable skills at a generation cost of cents per skill. The same generation mechanism produced the skills that powered Results 1 and 2: the layer is autonomous, and the substrate is the firm's already-graded work product, not handwritten content.

The library is built from substrate the firm already has (partner redlines, judge prose, scored runs) without a separate labeling step. Cents per skill is the secondary advantage; the load-bearing point is that the system can build the library itself.

Result 4 — Revision is reliable, and the system knows when it isn't. The architectural test for autonomous revision is whether it can fix what's wrong on a matter without breaking what's already right. Running autonomous revision on scenario-02's remaining residual five times under identical configuration produced 48/49, 48/49, 48/49, 49/49, 49/49: five runs, all at or above the 48/49 library-transfer result, two reaching a perfect 49/49, with zero displacement of any baseline-passing criterion.

Revision quality is not uniform across our runs, and that turns out to be the useful part. In an earlier version of the revision on the same task, scores ranged down to 43/49, with displacement of baseline-passing criteria in some runs. The cases where a revision would have degraded something already passing are systematically distinguishable from the clean ones — the unsafe revisions are not random model noise, and the system can identify them before they ship. In practice that means the loop withholds a revision it cannot confirm is safe rather than risk a regression. The reliability question is therefore a detection problem the system already handles, not a missing capability. Broader held-out validation is underway.
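The acceptance rule implied here (fix something, displace nothing) can be sketched as a per-criterion comparison. This is a simplified illustration with hypothetical function names. It compares two already-scored runs, whereas the detection described above happens before a revision ships, by a mechanism this post does not detail.

```python
def displaced_criteria(baseline, revised):
    """Criteria that passed at baseline but fail (or are missing) after revision."""
    return sorted(k for k, ok in baseline.items() if ok and not revised.get(k, False))


def fixed_criteria(baseline, revised):
    """Criteria that failed (or were absent) at baseline and pass after revision."""
    return sorted(k for k, ok in revised.items() if ok and not baseline.get(k, False))


def accept_revision(baseline, revised):
    """Accept only if at least one criterion is fixed and none is displaced."""
    has_fix = bool(fixed_criteria(baseline, revised))
    has_regression = bool(displaced_criteria(baseline, revised))
    return has_fix and not has_regression


# Illustrative scores only: criterion name -> passed?
baseline = {"quorum": True, "signatures": True, "notice": False}
clean_revision = {"quorum": True, "signatures": True, "notice": True}
unsafe_revision = {"quorum": False, "signatures": True, "notice": True}
```

Here `accept_revision(baseline, clean_revision)` holds, while the unsafe revision is withheld even though its raw score matches the baseline, because it trades one passing criterion for another.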

Result 5 — A scaling-law curve: routing works where RAG breaks. Garry Tan publicly surfaced the resolver bottleneck for skill-based agents: when the library grows past a few hundred skills, the system can't reliably pick the right one for the task at hand, and standard RAG-based selection collapses in the range where it would actually need to work. The scaling-law curve below shows a path through the bottleneck range on this evaluated library. Routing precision holds as the library grows across the range where RAG goes to zero.

The curve was run on SEC 10-K and 10-Q filings, where we had a pre-existing skill library large enough to test scaling behavior. The architecture under test is the same one running Results 1-4 above.

Rekursor's result. Across libraries from 18 to 504 skills, roughly 28× growth, Rekursor's routing produced selections that the runs confirmed correct, with no degradation as the library scaled.

RAG's result. On the same matters, we took the candidate that a standard RAG-based selection would pick first by description similarity against the library. That top pick was the candidate that worked about 11% of the time at ~18 skills, ~2% at ~50 skills, and essentially never at ~200 skills and beyond. RAG breaks roughly an order of magnitude earlier than the public bottleneck discussion suggests.
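For readers who want the shape of the metric, a description-similarity baseline and its top-pick confirmed-correct rate can be sketched as follows. This is a toy stand-in, not the baseline used in the experiment: it assumes bag-of-words cosine similarity in place of an embedding model, and the skill ids, descriptions, and cases are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_pick(task_text, library):
    """library maps skill id -> description; returns the most similar id."""
    query = bag_of_words(task_text)
    return max(library, key=lambda sid: cosine(query, bag_of_words(library[sid])))


def top_pick_rate(cases, library):
    """cases: (task_text, confirmed_skill_id) pairs; returns the hit fraction."""
    if not cases:
        return 0.0
    hits = sum(top_pick(text, library) == confirmed for text, confirmed in cases)
    return hits / len(cases)


# Invented toy data, not drawn from the experiment described in this post.
library = {
    "lease-footnote": "check operating lease obligations in the footnotes",
    "revenue-recognition": "verify revenue recognition policy disclosure",
    "segment-reporting": "reconcile segment reporting totals to consolidated revenue",
}
cases = [
    ("review revenue recognition policy in the 10-K", "revenue-recognition"),
    ("reconcile consolidated revenue across segments", "segment-reporting"),
    ("confirm the policy disclosure for leases", "lease-footnote"),
]
```

In the third case the baseline picks `revenue-recognition` because its description shares more words with the task than the skill that actually applies. That is the failure mode at issue: similar-looking descriptions crowd out the correct candidate, and more of them appear as the library grows.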

Rekursor uses a different selection primitive than description-similarity retrieval. The load-bearing point is empirical: Rekursor's validated selections held as the candidate pool grew, while the RAG baseline degraded. RAG's mechanism degrades by construction as the candidate pool grows; ours doesn't. The scaling-law curve is the empirical anchor for the path through the bottleneck Tan named publicly. Reproducing the scaling behavior on a legal-substrate library is one of the next experiments.

Figure: Routing works where RAG breaks. Series: Rekursor (green); RAG, description-similarity baseline (red). Top-pick confirmed-correct rate vs candidate library size, on a skill library built from SEC 10-K and 10-Q filings, evaluated across pool sizes 18 to 504. Green: Rekursor's top pick was the candidate the run confirmed correct across the full range. Red: for a standard RAG baseline (description-similarity retrieval over candidate descriptions), the top pick was the candidate that worked about 11% of the time at ~18 skills, ~2% at ~50 skills, and essentially never at ~200 skills and beyond. RAG breaks roughly an order of magnitude earlier than the public bottleneck discussion suggests.

Composing with the weight layer

Our learning layer is structurally aligned with post-training. Post-training changes what the model is; we change what the model can perceive on a given matter. One layer learns in weight space, by adjusting parameters against gradient signal. The other learns in measurement space, converting the firm's graded data into reusable skills the agent reads at runtime. The two are orthogonal. A firm could run a post-trained, self-hosted open model (Nemotron, Baseten's 27B, any of them) and load these skills on top. The post-training lifts general legal capability; the measurement-space layer carries the firm's own reviewed judgment as skills a partner can read, edit, and stand behind. Sovereign open-weight post-training and readable owned skills are not the same property; they are complementary, and the strongest regulated deployments will likely want both.

The difference is in what each layer changes. Post-training from Baseten and Trajectory has thus far brought an open model up to frontier parity: the same quality as a closed frontier model, at lower cost, on weights a firm controls. We've demonstrated something categorically different. The results in this post are improvements over the frontier model's baseline performance: a frontier-grade model that failed a task on its own, brought to all-pass by our learning layer, without touching the model. One layer lowers the cost of reaching the frontier; the other raises performance past where the frontier model sits unaided.

The measurement-space layer is also more cost effective to run on its own, not just relative to post-training. Generating skills costs cents per skill (Result 3 above); running them at inference adds a small skill-load to existing agent cost. Post-training a frontier-quality open model costs many orders of magnitude more, before any inference. The two paths can be stacked, with a post-trained sovereign base lifted above its own ceiling by the new skills, and the stack is cost-effective because the measurement-space layer's marginal cost is small compared to the post-training itself.

But our learning layer doesn't require a post-trained base. It works on top of whatever model the firm is already running, closed or open. We have previously shown cross-model results on the coding substrate in a prior post; the same lift should, in principle, apply on top of any base. Validating that across legal models is one of the next experiments.

The token costs, and who this is for

As Harvey rightly points out, quality only counts if it's inside a budget customers can tolerate. That's another area where our learning layer shines, because it doesn't blow up token usage. A skill library can be built from data a firm already generates (redlines and the like). To demonstrate this, we created a 7-skill batch from 22 already-graded runs for cents of compute. At run time, the agent reads only the relevant skills, and the clean closures land in the single-digit dollars and minutes.

There are also demand signals for this kind of layer at the top of the market. Kirkland & Ellis recently announced it is committing $500M to build its own AI platform, with chair Jon Ballis telling the Financial Times the goal is to "take the collective intelligence of our institution and be able to deploy that throughout our firm," noting that widely available tools raise the floor for everyone, and that the floor is not where Kirkland competes.

The signal isn't that every firm will (or should) follow the in-house path; it's that the thing being built, a firm's accumulated work product made usable by agents, is now publicly priced as a category. Our system is one path to that capability that composes with the agent platforms firms already deploy, including Harvey's, rather than replacing them.

Where we've been, where we're going

The mechanism behind these results isn't limited to the legal AI domain. We've validated the same architecture across software engineering, drug discovery, regulatory comment analysis, and abstract reasoning. We are using legal AI as a wedge into the broader platform because the legal category has a high bar, significant public attention, and concentrated, regulated demand. The same axis works in any field with a scored work product, and we're building toward those fields (e.g., accounting, finance, healthcare, HR) from this foundation.

What today's post shows: both lift mechanisms of the loop demonstrated held-out on Harvey LAB; cost-effective library generation from a firm's already-graded work product; revision that the system can confirm safe before shipping; and a scaling-law curve for routing that is a step-function improvement over RAG. What's next for the platform is breadth across LAB tasks, the scaling behavior reproduced on a legal-substrate library at scale, and composition with post-trained open-weight bases, not the question of whether the third axis works.

Harvey built the benchmark and asked the right questions. The strongest of them, what behavior turns a failing legal deliverable into a passing one, has an answer, and we've now shown that answer working on matters the system never saw, reproducibly, with the library compounding across them. The next question is how far that compounding extends as the corpus grows from a handful of matters to a firm-sized one.

Run artifacts (transcripts, outputs, per-criterion scores and judge reasoning) available to partners under NDA. Two foundational provisional patents filed.


Continues the thread from our first post, the plateau-dynamics post, and the first Harvey LAB result.

If you're working with legal AI agents, whether Harvey, Legora, an in-house system, or your own stack, send us a trajectory and we'll show you whether the scored substrate yields reusable skills your agent can use on the next matter. If your firm has graded substrate beyond legal, the same applies; our domain diagnostic runs in minutes, and if it doesn't show signal, we'll say so.