The LLM Does Not Use Tools - The Harness Makes Tool Use Possible

A week ago I posted a small visual demo about AI agents. It tries to make one simple point visible:

the LLM does not "use tools" by itself.

The model receives context and generates tokens. Some of those tokens may describe a tool call, but the model has not actually executed anything. A surrounding runtime has to parse the generated text, decide whether the call is valid, execute the tool, capture the result, and feed that result back into the next model context.

That surrounding runtime is what is called the harness.

In short:

- LLM = next-token generation
- Harness = state, control, validation, tool execution, and feedback
- Agent = LLM + Harness

The model predicts. The harness acts.

This distinction matters for reliability and safety. If an agent can query private data, call an API, send a message, or modify a file, the authority should not live in the model's prose. It should live in the harness: permissions, schemas, budgets, audit logs, validators, retries, and failure handling.

It also matters for evaluation. Many agent benchmark results are not properties of the model alone. They are properties of the model plus the harness around it: prompt format, tool interface, context policy, parser, oracle, retry logic, and stopping condition.

There is now good evidence for this. Terminal-Bench reports results as agent + model pairs, and the same model can move substantially depending on the agent scaffold around it.  SWE-ABS shows the oracle problem directly: after strengthening SWE-bench tests, about one in five previously "solved" patches were rejected, and the top score dropped from 78.8% to 62.2%. tau-bench makes reliability explicit with pass^k: even strong function-calling agents were below 50% on pass^1 and below 25% on pass^8 in the retail domain. The HAL GAIA leaderboard adds another angle: accuracy has to be read together with scaffold, tools, traces, and cost.

I made the demo to make the loop easier to see than to explain: context goes in, tokens come out, the harness detects a tool request, executes it deterministically, and feeds the result back.

Small demo here: https://apelov.github.io/agentic-demos/harness-llm-animation.html

If you build or evaluate agents, I think this boundary is worth keeping in mind. Not only "how smart is the model?", but "what harness is wrapped around it, and where does authority actually live?"

Alexander