AI Agents Are Not Magic. Here's What Actually Matters.

The basic version is becoming a commodity. What matters now is fit, oversight, and failure cost.

Amit Tandon
Founder, Rekursor · 11 min read

If you want to get an AI agent working inside your business, the noise around how to do it just got louder. Last week the two biggest AI labs each launched multi-billion-dollar enterprise deployment ventures with Wall Street firms: OpenAI created a new deployment company backed by more than $4B of initial investment, reportedly at a $10B pre-money valuation, while Anthropic formed a $1.5B venture with Blackstone, Hellman & Friedman, and Goldman Sachs to bring Claude into large-company operations. Those engagements, with embedded engineering teams, are for Fortune 500s and PE portfolio companies that can absorb a custom build. Everyone else has to figure it out without a forward-deployed engineering team in the office, working through the long tail of AI tools and SaaS add-ons trying to sell into the moment while the air is thick with hype.

This piece is a working filter for that moment. By the end you'll know what an AI agent actually is, what kinds exist, where the real differences live, and how to think about getting one working without getting buried in jargon.

The four shapes of AI tools

AI tools today come in roughly four shapes. Knowing which shape your problem actually needs is the difference between buying something useful and buying something expensive.

1. Chat. One question, one answer. You ask, it answers. ChatGPT, Claude, Gemini. Useful for first drafts, brainstorming, research, summarizing. The work is one-shot. Best when the AI is a thinking partner and you do the rest.

2. Workflow. Fixed steps, AI inside them. Predefined steps glued together by software, with AI doing the smart bits at each step. Read this email, classify it, route it, draft a response, wait for approval, send. The path is fixed. Most "AI features" inside SaaS products you already use are workflows. Best for repeatable processes with predictable shape; a minimal sketch of this shape appears at the end of this section.

3. Agent. The AI decides what to do, in a loop. This is where the vocabulary gets confused, because "agent" gets used for everything. The technical meaning: the AI decides what to do next, in a loop, until it thinks the task is done. It looks at your task, decides which of its available capabilities to use next (read a file, search the web, query a database, draft a document, send an email), uses that capability, looks at the result, and keeps going. Best suited to open-ended tasks where each task is different enough that you can't pre-script the steps. The good ones do extra work inside the loop to check themselves: tying outputs back to specific source documents so claims can't drift, validating against rules before producing a final answer, and catching their own mistakes. The basic ones just do the loop and trust the model.

4. Multi-agent. Several agents coordinating. Usually one that plans and several that execute, sometimes specialized by role. Best when the work genuinely decomposes into specialist tasks, when different parts need different rules or guardrails, or when a single agent loses coherence trying to hold the whole job in working memory. The thing to watch for is whether the design is bounded and supervised (an orchestrator with defined specialists and clear handoffs) or sprawling (agents spawning agents with no supervisor). The field has been moving away from sprawling swarms, which proved brittle and hard to debug, but bounded multi-agent designs are increasingly how serious institutional work gets done.

The honest answer for any given problem depends on the work, and it helps to think of it the way you'd think about staffing. A repeatable process with predictable shape is a workflow, a checklist someone follows the same way every time. A single open-ended task with one set of rules is usually a single agent, one capable employee handling a job end to end. Work that crosses departments, requires different validation logic at different stages, or has specialist subtasks that need their own guardrails is where multi-agent earns its keep, a small team with a manager, each person responsible for their part, with clear handoffs. The trap is being upsold complexity that doesn't match the work, in either direction: a multi-agent team for a job one capable employee could do, or one employee stretched across a job that genuinely needs a team.
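To make the workflow shape concrete, here's a minimal sketch of "fixed steps, AI inside them." The step names and the call_model helper are illustrative stand-ins rather than any particular product's pipeline; the point is that the path through the steps never changes, only the judgment inside each step does.

    # A minimal sketch of the "workflow" shape: the steps are fixed in code,
    # and the model only does the judgment work inside each step.

    def call_model(prompt: str) -> str:
        # Stand-in for a single call to whatever chat API you use;
        # swap in your provider's client here.
        return "(model output for: " + prompt.splitlines()[0] + ")"

    def handle_inbound_email(email_text: str) -> dict:
        # Step 1: classify. The model decides the category, nothing else.
        category = call_model(f"Classify this email as billing, support, or sales:\n{email_text}")

        # Step 2: draft. The model writes a reply for the chosen category.
        draft = call_model(f"Draft a short {category} reply to:\n{email_text}")

        # Step 3: gate. A human approves before anything is sent; the path never varies.
        return {"category": category, "draft": draft, "status": "awaiting_approval"}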

What an "agent" actually is

Let me deflate the mystique, because the agent category has gotten suspiciously magical-sounding.

At its core, an agent is two things put together: a model and a harness. The model is the AI itself: Claude, GPT, Gemini, whatever's under the hood. The harness is everything else, the software wrapped around the model that lets it actually do work. Tool execution, memory, state persistence, error recovery, the rules about what the agent is allowed to do and when it should stop. Anthropic describes Claude Code as a "general-purpose agent harness." "Harness" is the term the field has settled on, and it's worth knowing because it points at where the real engineering lives.

Inside the harness, the agent runs a loop:

Look at the task. Decide what to do next. Use a capability to do it. Look at the result. Decide what to do next. Use a capability. Look at the result.

That's it. The loop runs until the agent thinks it's done, or until you stop it. The capabilities are usually things like reading a file, searching the web, querying a database, writing a document, sending an email, running a calculation. The model doesn't do anything itself. It picks capabilities and reads results.
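Here's a minimal sketch of that loop, with the same caveat that applies throughout this piece: the tool table, the stop condition, and the call_model stub are illustrative, and a real harness layers memory, permissions, retries, and error recovery on top. What the sketch shows is the division of labor: the harness owns the loop and the list of capabilities; the model only chooses among them.

    # A minimal sketch of the agent loop: the model, not the code, decides
    # which capability to use next and when to stop. Everything here is
    # illustrative; real harnesses add memory, permissions, and recovery.

    def call_model(messages: list) -> dict:
        # Stand-in for a real model call. A real harness would get back either
        # {"action": "<tool name>", "input": "..."} or {"final": "<answer>"}.
        return {"final": "stub answer -- replace call_model with a real model client"}

    # The capabilities the agent is allowed to use. The harness, not the model,
    # controls what appears in this table.
    TOOLS = {
        "read_file": lambda path: open(path, encoding="utf-8").read(),
        "search_web": lambda query: "(search results would go here)",  # stand-in
    }

    def run_agent(task: str, max_steps: int = 20) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):              # the harness enforces a stop condition
            decision = call_model(messages)
            if "final" in decision:             # the model thinks it's done
                return decision["final"]
            tool = TOOLS[decision["action"]]    # pick a capability...
            result = tool(decision["input"])    # ...use it...
            messages.append({"role": "tool", "content": str(result)})
            # ...then show the model the result and ask again.
        return "Stopped: step limit reached without a final answer."

A real deployment swaps call_model for an actual model client and puts approval gates in front of risky capabilities like sending email, but the loop itself stays this simple.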

Nearly every major agent built in the last two years has converged on this design. The open-source harnesses that have crossed into mainstream developer awareness, including Peter Steinberger's OpenClaw, Nous Research's Hermes Agent, and Garry Tan's GStack, are all variations on the same architecture: model plus loop plus standard capabilities, with different bets about memory, scheduling, and skill accumulation layered on top. A skilled developer with a good model and one of these harnesses can stand up a working agent for most use cases in weeks rather than months.

The viral example last week made the point concretely. A former Latham & Watkins associate named Will Chen, a lawyer who happens to have coded as a hobby since university, released Mike, an open-source legal AI tool built in two weeks, claiming feature parity with Harvey ($11B valuation) and Legora ($5.5B valuation). The qualifier matters. Chen isn't your average lawyer, and he certainly isn't a typical operator at the firms his tool might compete with. He sits in the narrow intersection of domain expertise and serious coding skill that lets one person build a credible alternative to a billion-dollar product. That intersection is still rare, but the tools have made it less rare than it was a year ago. One quote making the rounds, from a former law-firm IT director, captures the second-order effect: "Mike doesn't kill Harvey or Legora, but it absolutely changes the negotiation. Once a working open-source alternative is sitting on GitHub, the conversation in renewal meetings moves from 'Is this magic?' to 'What exactly am I paying enterprise prices for?'"

You're not going to build Mike. But Mike, and the next dozen Mikes that will appear in other verticals, changes what you're paying for when you evaluate vendor pitches in your own space. The basic agent harness is rapidly commoditizing. The differences worth paying for sit elsewhere.

Where agents fail

The basic version works for an enormous range of work, specifically when three conditions hold:

  1. A human reviews the output before it does anything. Somebody reads the draft, the report, the email, the code, and catches mistakes before they propagate.
  2. The task fits in one sitting. The agent has all the relevant material in working memory at once.
  3. The cost of a wrong output is small. Mistakes are noticed quickly, fixed quickly, and don't compound.

That covers most internal AI use: productivity tools, drafts that get reviewed, exploratory analysis, code generation a developer reviews before shipping. For all of it, the basic version is fine.

The architecture has one specific failure mode that emerges when stakes go up. It produces fluent, confident output without an internal mechanism to verify that output is grounded, complete, or consistent with what was asked. This is fine when a human is the verifier. It fails when the human can't be, because the volume is too high, the deadline forces the draft out before anyone reads every paragraph, or the failures are subtle enough to slip past a tired reviewer.

The result is outputs that look right and aren't. A grant submission with a fabricated citation. A compliance report missing a required section. An analysis that contradicts itself between page two and page eight. A reconciliation report that miscategorizes a recurring vendor payment because the name resembles a different category. Every one has a real cost: a lost grant, an audit finding, a damaged client relationship. And the basic version has no internal defense against any of them. Logs show what happened after it happened. By then the output has shipped.

This isn't fringe. Industry analyses suggest the majority of enterprise AI agent projects never reach production, and the failure pattern is recognizable. A model that benchmarks well loses coherence after fifty or a hundred tool calls in a real task. Static benchmarks don't catch it. The longer and more complex the work, the wider the gap between models that look equivalent on a leaderboard. Harness engineering, the discipline of building the scaffolding that catches those failures, is where serious work in the field is moving precisely because the basic version isn't enough for high-stakes work.

The fixes are known, though they are not free. Ground outputs in specific source documents so claims cannot drift. Add validation steps inside the loop so work is checked against rules before it ships. Use orchestration designs that route specialist subtasks to specialist agents with their own guardrails. None of this is exotic; it's what serious institutional deployments are moving toward. It's also what most vendor pitches are not doing, which is why the question of how a tool prevents bad output matters more than whether it claims to.
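As a sketch of what "validation steps inside the loop" can mean in practice, here's an illustrative pre-ship check. The rules (required sections, a naive citation check) and helper names are assumptions standing in for whatever a real deployment would enforce. The shape is the point: a draft doesn't leave the harness until it passes explicit checks, and failures send it back for revision rather than out the door.

    # A sketch of checking work inside the loop before it ships. The rules and
    # the naive citation parsing are illustrative; real deployments tailor the
    # checks to the domain (citations for grants, required sections for
    # compliance reports, category rules for reconciliations, and so on).

    REQUIRED_SECTIONS = ["Budget", "Methodology", "Timeline"]   # example rule set

    def claims_with_sources(draft: str) -> list:
        # Naive illustration: treat each sentence as a claim and look for a
        # trailing "[source: name]" tag. Real systems do this far more carefully.
        pairs = []
        for sentence in draft.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            source = None
            if "[source:" in sentence:
                source = sentence.split("[source:")[1].rstrip("]").strip()
            pairs.append((sentence, source))
        return pairs

    def validate_draft(draft: str, source_documents: dict) -> list:
        problems = []

        # Grounding check: every cited source must exist in the material the
        # agent was actually given, so claims can't drift into fabrication.
        for claim, source in claims_with_sources(draft):
            if source is None or source not in source_documents:
                problems.append(f"Unsupported claim: {claim[:80]}")

        # Completeness check: required sections must be present before shipping.
        for section in REQUIRED_SECTIONS:
            if section.lower() not in draft.lower():
                problems.append(f"Missing required section: {section}")

        return problems   # non-empty means the loop revises instead of shipping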

This is the failure mode worth watching for, because it's the one humans are worst at catching. Output that's wrong and looks wrong gets caught immediately. Output that's wrong and looks right gets shipped.

Four questions to ask any vendor

A short evaluation framework, regardless of who's pitching you.

1. What happens when this tool's output is used without anyone re-reading it line by line? The answer reveals which architecture you're buying. Some vendors will give you a process answer: "our customers always review carefully." Others will give you a structural answer: "here's what the harness blocks before output ships." Both can be valid for the right use case, but you need to know which you're getting.

2. Which of my use cases get reviewed carefully, and which don't? List the work you'd want AI help with. Mark each one for review intensity. The unreviewed use cases are where the harness architecture matters. If everything gets reviewed carefully, optimize for UI and integration fit.

3. What does this tool cost when something goes wrong? A tool that ships occasional embarrassing mistakes can be more expensive than one that doesn't, once you account for rework, client trust, and the time spent reviewing every paragraph. The total cost of an AI tool includes its failure rate at your stakes.

4. Is the vendor selling architecture or vibes? "Our AI is smarter" is vibes. "Our AI uses GPT-5" is vibes. "Here's the specific failure mode our harness prevents, and here's how" is architecture. Architecture answers can be evaluated. Vibes can't.

The takeaway

The basic agent harness, model plus loop plus standard capabilities, is consolidating fast. The named open-source harnesses are converging on a common design, and Mike-style proof points are going to keep appearing across verticals. Within a year you'll get the basic version everywhere, and most vendors leaning on the basic version alone will look overpriced.

The differences worth paying for are at the edges: vertical knowledge that actually matches your work, integrations that match your stack, and, for high-stakes work, architectural guarantees about how the tool stays grounded, validates its own output, and improves over time.

For low-stakes work, pick the tool with the best UI for your team and don't overthink it. For work where wrong outputs have real consequences, the architecture question is the question that matters most, and the vendor's answer to it tells you whether you're buying a tool or a marketing campaign.


Amit Tandon | Founder, Rekursor