The Thirty Percent Confession

Last time I told you the AI industry is paying a tax it doesn’t have to pay — that a great deal of what we grandly call “AI” is really just looking things up, and we’ve chosen to do that looking-up on the most expensive silicon ever manufactured. A number of you wrote to say I was overstating it. Surely, you said, the people setting hundreds of billions of dollars on fire know something I don’t.

So this week I won’t argue with you. I’ll let one of the largest companies in enterprise software argue with you instead — because it already has, in a research paper it published itself and seems to have hoped you wouldn’t read too closely.

The company is Salesforce. The same Salesforce selling you “agents,” an “agentic enterprise,” a tireless digital workforce to set beside your human one. While one part of the building handled the marketing, another part — Salesforce AI Research, the people whose job is to measure things rather than sell them — built a test to find out how well today’s best AI can do something gloriously unglamorous: find the right piece of information when it’s scattered across the mess of a normal company. Slack threads. GitHub. Meeting transcripts. Documents nobody filed correctly. The stuff every real business actually runs on.

They named it HERB — the Heterogeneous Enterprise RAG Benchmark — and they didn’t build it on the cheap. It’s a synthetic but painstakingly realistic company: 530 employees across 30 products, generating 39,190 documents, messages, transcripts, and pull requests, strewn about the way they really would be. The paper is on arXiv. The data is on Hugging Face. Anyone can check my arithmetic, which is exactly why I’m happy to build a column on it.

Now, the number.

When Salesforce turned the best agentic retrieval systems money can buy loose on HERB, meaning top-tier models with planning and tool use, they scored 32.96 out of 100. (Thirty-three, if we round properly; the headline rounds down to thirty.)

A third. On a test of finding information that is definitely, provably somewhere in the building. Two times out of three, the most advanced AI on the market went hunting for an answer that existed and came back with the wrong one — or with confident nonsense.

Sit with that, because two floors up the marketing department is selling you an autonomous digital employee, and the research department just published evidence that the digital employee finds the right file about a third of the time.

But the score isn’t the part that should keep you up at night. Two findings underneath it are.

The first is the diagnosis Salesforce’s own researchers wrote down: the bottleneck isn’t the thinking, it’s the finding. The models could reason fine — they simply couldn’t retrieve the right material to reason over. The proof is brutal in its simplicity. When the researchers stopped making the system hunt and instead handed the model the company’s documents outright, the best one leapt from that miserable third to 76.55. Same model. Same questions. The only thing that changed was whether it had to find the evidence or was handed it.

Read that twice, because it’s the most important sentence published in enterprise AI this year and almost nobody noticed: the model was never the problem. The expensive part — the giant, GPU-devouring brain everyone is mortgaging the next decade to buy more of — is sitting there perfectly capable, tapping its foot, waiting for the cheap, dull, unglamorous retrieval layer to bring it the right paragraph. And the retrieval layer can’t.
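For anyone who wants to check the arithmetic behind that jump, here is a minimal sketch. It uses only the two scores quoted above; the constant and function names are mine, invented for illustration, and nothing here re-derives the scores from the paper or the dataset.

```python
# Scores as quoted in this column (out of 100). Names are illustrative only.
AGENTIC_RETRIEVAL_SCORE = 32.96  # the system has to find the evidence itself
EVIDENCE_GIVEN_SCORE = 76.55     # same model, evidence handed to it


def retrieval_gap(with_evidence, with_retrieval):
    """Return (points lost to retrieval, ratio of the two scores)."""
    gap = round(with_evidence - with_retrieval, 2)
    ratio = with_evidence / with_retrieval
    return gap, ratio


gap, ratio = retrieval_gap(EVIDENCE_GIVEN_SCORE, AGENTIC_RETRIEVAL_SCORE)
print(f"Points lost to retrieval: {gap}")
print(f"Evidence-given score is {ratio:.2f}x the retrieval score")
```

On these two figures alone, more than half of the achievable score disappears at the retrieval step, before the model does any reasoning at all.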

This is the whole ballgame, and it lands exactly where I left you last time. I claimed two-thirds of enterprise AI is really retrieval wearing intelligence as a costume. Here is Salesforce — not a friendly witness, but a company whose entire pitch depends on the opposite being true — confirming that retrieval is precisely where the enterprise falls apart, and that a bigger, smarter, hungrier model does not rescue you, because the model was already good enough.

The second finding is the one I find most damning, and it’s hiding in the dataset’s own structure. Of HERB’s 1,514 questions, only 815 have answers. The other 699 — nearly half — are unanswerable by design. Salesforce deliberately wrote hundreds of perfectly reasonable-sounding questions for which no supporting evidence exists anywhere in the simulated company, and then watched to see whether the AI would admit it didn’t know.

Think about what that means. HERB isn’t only a test of whether a system can find the answer. Nearly half of it is a test of whether the system knows when there isn’t one — whether, handed a plausible question and no facts to support it, it has the spine to say “I can’t find that” instead of manufacturing something that sounds right. That is the single most important behavior an enterprise needs from AI, and the one almost no system on the market reliably has. We even have a pet word for what they do instead. We call it hallucination, as though it were a charming quirk rather than the precise thing that makes the technology unusable for any job that matters.
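The split is easy to verify. A minimal sketch, using only the three counts quoted above (the variable and function names are hypothetical, chosen for this example):

```python
# Question counts as quoted in this column. Names are illustrative only.
TOTAL_QUESTIONS = 1514
ANSWERABLE = 815
UNANSWERABLE = 699


def unanswerable_share(unanswerable, total):
    """Fraction of the benchmark that has no supporting evidence."""
    return unanswerable / total


# The two groups should account for every question.
assert ANSWERABLE + UNANSWERABLE == TOTAL_QUESTIONS

share = unanswerable_share(UNANSWERABLE, TOTAL_QUESTIONS)
print(f"Unanswerable share: {share:.1%}")
```

That works out to a little over 46 percent, which is what "nearly half" means here.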

So put the two findings together. The industry’s answer to the first is “buy a bigger brain,” which the data says won’t help. And there is no brain you can buy that fixes the second, because confidently inventing answers isn’t a shortage of intelligence — it’s a property of an architecture that was never built to know the edge of its own knowledge.

Which brings me to the part I’m not going to fully tell you here, though I have the answer.

Suppose someone refused the assumptions. Suppose they decided retrieval wasn’t plumbing to be stapled onto a generator but the main event — the actual machine, built from scratch, running on the cheap, cool, abundant silicon I told you about last time, with the expensive brain held in reserve for the rare moment something genuinely must be generated rather than found. And suppose that same system was designed, from the ground up, to know the boundary of what it can support with evidence — to say “I don’t know” on the 699 as readily as it answers the 815.

And suppose that someone took HERB — Salesforce’s own brutal, public, no-mercy test — and ran it.

I’ll tell you only this. We didn’t score a third. We didn’t score forty. We more than doubled the ceiling that Salesforce’s best systems could reach, on the identical public benchmark, and did it while honoring the thing the benchmark’s harder half actually demands: knowing when to keep your mouth shut. Our number is real, it was measured against the same data anyone can download, and it does what three years of ever-bigger GPUs have conspicuously failed to do.

And no — before you ask — it isn’t 100, and I’d be wary of anyone who told you it was. Remember that nearly half of HERB has no answer at all. A system that posts a perfect score on a test like that hasn’t reached wisdom; it’s learned to bluff its way past the trick questions. Perfection was never the target — recall that the best model on earth, handed every document outright, still only clawed its way into the mid-70s. The target is different: a system that’s right when the evidence is there and says so when it isn’t. And a system like that pays for its honesty in points, because a scorecard can’t tell the difference between “I don’t know” and “I got it wrong.” The distance between that number and 100 isn’t the machine failing. Much of it is the machine refusing to lie — which, when you sit with it, is the whole point of the exercise.

That changes the question. The industry has been asking “which model?” for so long it forgot there was a prior question underneath: which architecture? HERB is Salesforce’s accidental admission that the model question is largely settled and largely beside the point, and that the next decade gets decided at the retrieval layer and the honesty layer, not inside the GPU.

A disclosure, as always

You should know I’m not a bystander. I co-founded a small company built on exactly the heresy in this column — that retrieval is the main event, that it belongs on cheap silicon, and that a system ought to know when to say it doesn’t know. So weigh my enthusiasm accordingly.

But notice what my conflict of interest cannot touch. The 32.96 is Salesforce’s number, not mine. The diagnosis that retrieval is the bottleneck was written by Salesforce’s researchers, not me. The choice to make nearly half the benchmark unanswerable was Salesforce’s, not mine. The most honest thing anyone has said about enterprise AI this year was a confession buried in a Salesforce research paper: the emperor’s brain is magnificent, the emperor cannot find his own files, and about half the time he doesn’t even know what he doesn’t know.

The whole industry heard that and went out to buy a bigger brain. I think that’s the most expensive mistake in the history of computing, and I’m going to show you why.