Rekursor: AI Agents That Keep Getting Smarter

Amit Tandon
Founder, Rekursor · 5 min read

Rekursor is a learning layer that makes any frozen model smarter with every task it executes — without retraining, without weight changes, and with fewer tokens. The system is grounded in an empirical law we'll be publishing soon.


Results

Terminal-Bench 2.0 Tasks Flipped to Pass

[Chart: published pass rate vs. with Rekursor. 6 tasks where Opus 4.6's published pass rate is below 100%; all 6 flipped to PASS with Rekursor. configure-git-webserver: 0 passes in 5 published trials.]

Evaluation Substrate | Model | Baseline | With Rekursor
Terminal-Bench 2.0 | Claude Opus 4.6 | 65.4% | 86%*
Terminal-Bench 2.0 | GLM 5.1 | 52.4% | 100%†
SWE-bench (75 issues) | Claude Sonnet | 10.7% excellent | 20.0% excellent
EDGAR filings (621 10-K/Q filings) | Claude Sonnet | SIS 0.223 | SIS 1.201
Drug discovery (KRAS) | Claude Sonnet | Prior baseline | Continued improvement‡

*6 of 7 targeted tasks passed. Tasks selected where Opus's published pass rate is below 100%, including a task with a 0% published pass rate.

†8/8 selected tasks. Cross-model transfer from Opus store, zero GLM-specific tuning.

‡System autonomously discovered a new optimization objective when the original target saturated.

SIS measures the system's ability to discover and reliably validate new measurement dimensions. Higher is better; the theoretical floor is zero.

Five model families tested. Zero weight changes. Every model above its own published baseline.

One more thing: these results don't use our full Rekursor Engine multi-agent architecture. No harness. Yet.


Terminal-Bench 2.0: Opus 4.6

We targeted the tasks Opus struggles with most: those with below-100% published pass rates. The learning system flipped 6 of 7.

Task | Published rate | Result | Steps
configure-git-webserver | 0% | PASS | 42
cancel-async-tasks | 40% | PASS | 16
build-cython-ext | 60% | PASS | 111
count-dataset-tokens | 60% | FAIL | 16
path-tracing-reverse | 60% | PASS | 45
sqlite-with-gcov | 80% | PASS | 36
bn-fit-modify | 80% | PASS | 29

No model changes. No prompt engineering. On every passing task, the learning system completed in fewer steps than the baseline.

For context: the top system on the Terminal-Bench 2.0 leaderboard, Claude Mythos Preview, achieves 82%.


Cross-Model Transfer: GLM 5.1

In our own baseline runs, GLM 5.1 had failed cancel-async-tasks 8 consecutive times. With the store built from Opus failures (zero GLM-specific tuning, a different model family entirely), it passed in 7 steps. We then targeted the 8 tasks where GLM struggled most. The system passed every one.

The system doesn't care which model it's enhancing.


How It Works

The system maintains a persistent store that grows with every task it attempts: what the agent tried, where it failed, what patterns emerged. That record compounds automatically and informs every task that follows.
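
As a rough sketch only (every name below is hypothetical, not Rekursor's actual implementation), an append-only store of this shape captures the idea:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class TaskRecord:
    """One attempt: what was tried, where it failed, what pattern emerged."""
    task_id: str
    actions: list[str]              # what the agent tried
    failure_signal: Optional[str]   # where it failed (None on success)
    pattern: Optional[str]          # structural pattern extracted from the trace

@dataclass
class MeasurementStore:
    """Append-only: nothing is overwritten, so nothing is forgotten."""
    records: list[TaskRecord] = field(default_factory=list)

    def record(self, rec: TaskRecord) -> None:
        self.records.append(rec)    # compounds with every task attempted

    def context_for(self, task_id: str,
                    relevant: Callable[[TaskRecord, str], bool]) -> list[str]:
        """Surface patterns from prior tasks judged structurally relevant."""
        return [r.pattern for r in self.records
                if r.pattern is not None and relevant(r, task_id)]
```

The point of the sketch: learning lives in what is recorded and retrieved, not in the model's weights.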

This is not prompt engineering. Not retrieval. Not retraining.

Most capability gaps are context gaps. The model already has the capability; it just doesn't have the right dimensions of perception to see what the task requires.

configure-git-webserver had a 0% published pass rate for Opus. Not because Opus couldn't do it. Because it couldn't see it. The store provided the context. The capability was always there.

cancel-async-tasks: GLM 5.1 failed it 8 consecutive times in our own baseline runs. With the store: PASS in 7 steps. The model didn't get smarter between those runs. It got perception.

qemu-startup: Opus failed at 126 steps. With the store: PASS at 74 steps. The same model, the same task, 52 fewer steps — steps the model had been spending discovering context it could have been perceiving.

For git-leak-recovery — a task GLM had never solved — the store identified a relevant prior task and explained why the same approach applied. Git operations and PyTorch recovery share no surface similarity. The system recognized the underlying structure anyway. That's what keyword matching misses.
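
To make "structure, not surface" concrete, here is a minimal sketch with made-up feature vectors and scoring (Rekursor's actual matching is not published): keyword overlap scores the two tasks at zero, while a comparison in a structural measurement space scores them as close neighbors.

```python
import math

def keyword_overlap(a: set[str], b: set[str]) -> float:
    """Surface similarity: Jaccard overlap of task keywords."""
    return len(a & b) / max(len(a | b), 1)

def cosine(u: list[float], v: list[float]) -> float:
    """Structural similarity: closeness in a learned measurement space."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical structural features: e.g. "state must be reconstructed
# from a partial on-disk record" encodes similarly for both tasks.
git_leak   = [0.90, 0.10, 0.80]
torch_ckpt = [0.85, 0.20, 0.75]

print(keyword_overlap({"git", "repository", "secret"},
                      {"pytorch", "checkpoint", "tensor"}))  # 0.0
print(round(cosine(git_leak, torch_ckpt), 3))                # ~0.996
```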


What This Is

The model's weights never change. A measurement store grows with every task, every customer, every domain — and becomes more valuable over time. Five properties, all demonstrated empirically:

Unsupervised discovery. The system learns from failure signals that already exist: the task failed, the patch was rejected, the transaction was charged back. No labels required.

Recursive self-improvement. Most approaches to self-improvement require retraining the model. Ours doesn't. Every task the system attempts makes it better at the next one, automatically, without weight changes. Nothing is overwritten. The more it runs, the more it perceives — without catastrophic forgetting.

Cross-domain transfer. Dimensions discovered in one domain automatically transfer to others where the same structural pattern appears. A store built from grant-writing agent traces lifts regulatory document analysis. The store doesn't belong to a model or a domain. Every agent that touches it makes it richer for every agent that comes after.

Private by design. The store accumulates measurement functions, not customer data. What transfers across domains is structural pattern recognition — not content, not documents, not transactions. Every customer's data stays theirs.

Auditable. Every measurement has a readable provenance chain. For regulated industries, this is the only deployable version.
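
For the auditability property above, a minimal sketch of what a readable provenance chain could look like; the types and field names are illustrative assumptions, not Rekursor's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceStep:
    """One link in the chain: which task produced this evidence, and why it was kept."""
    task_id: str
    signal: str      # a failure signal that already existed (no labels)
    rationale: str   # human-readable reason this step shaped the dimension

@dataclass(frozen=True)
class Measurement:
    dimension: str
    chain: tuple[ProvenanceStep, ...]   # immutable, so the audit trail is stable

    def audit(self) -> str:
        """Render the chain for a human reviewer."""
        header = f"dimension: {self.dimension}"
        steps = [f"  {s.task_id}: {s.signal} -> {s.rationale}" for s in self.chain]
        return "\n".join([header, *steps])
```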


Two Patents. One Platform.

Two foundational patents filed covering context graphs, structured decision traces, measurement-space learning, autonomous dimension discovery, and cross-domain transfer.


About

Rekursor is built by a solo founder who has spent the last five years working at the frontier of applied AI. He is the creator of The Scroll, an AI and technology newsletter followed by thousands of investors and executives since 2020, and a graduate of Cornell University.

The next generation of consequential AI companies will be built by very small teams distinguished by very deep systems, not by headcount. If that's how you think about building, we'd like to hear from you.


Validated in production with paying clients, including a SUNY community college, across five domains: financial filings, drug discovery, neural network interpretability, grant writing, and terminal tasks. 4,500+ learned measurement dimensions accumulated across 4,400+ task executions so far.

If you're deploying AI agents in regulated, complex, or knowledge-intensive work, we'd welcome a conversation.

Amit Tandon | rekursor.com