May 2026 · Boris Vanin

How we build LLM pipelines

There are really only two base patterns. Everything else that looks like clever architecture is just some combination of them.

Two patterns

One agent in a loop. The model works step by step: it sees the result of the previous step and decides what to do next. Memory lives in the context, discipline lives in the system prompt. Good for tasks where the order of steps matters more than parallelism.

Several agents with aggregation. You run N models in parallel on different aspects of one task, then a separate step folds the results into a single answer. Good when the task decomposes honestly and the pieces don't depend on each other.

A real pipeline is always a mix. At the top sits an orchestrator making decisions; below it — parallel implementors; at the output — a verifier. And the loop from the first pattern doesn't go anywhere: the orchestrator always lives in a loop because it doesn't know in advance how many iterations it'll need.

When you can't avoid tools

A simple "generate text from a description" case doesn't need tools. But "update this large complex world model with new information" absolutely does.

Without tools, the model has to spit out the entire new state every time. That's expensive in tokens, unreliable (LLMs love to forget fields you didn't mention in the prompt), and miserable for humans — reading a diff between two walls of text is not realistic.

With tools, the model mutates state declaratively: addItem(...), updateNpc(id=42, ...), removeQuest(id=17). The context stays compact: the current version of the state plus a description of the changes is enough. It scales, it's auditable, and it logs cleanly.

Planner + implementors

The most reliable pattern we use: one strong planner decomposes the task into a list of actions with clear semantics, then an army of cheap implementors carries those actions out.

The key here is plan quality. If the planner emits "update the NPC named Harry" instead of "update NPC id=42, add trait X, remove status Y", you're done: implementors will guess, make mistakes, and propagate them down the chain. The other way around — if the plan is specific enough, the implementor can be Haiku, Flash, or any other inexpensive model. It doesn't have to be the smartest, it just has to execute neatly.

The savings are an order of magnitude. The implementor context is short and uniform — you can cache aggressively. If, that is, you got lucky with your provider.

State machine inside the loop

Inside a single agent loop it's useful to run a small state machine: "generate → review yourself → fix → done". Since system + tools is stable across iterations, the explicit cache stays warm — self-review becomes cheap.

Self-review quality is objectively lower than that of a big independent verifier: a model trusts its own output more than someone else's. But the cheapness pays off: gross mistakes get caught on the spot, and the number of big rounds with a full verifier drops noticeably.

Caches

In any LLM pipeline, caching is the main lever for cost control. Providers usually offer two modes, and both have tradeoffs even when they work perfectly.

Implicit. The provider notices that the prefix of your new request matches the prefix of a recent one and reuses the computation. Pros: zero developer effort, nothing to create or maintain. Cons: you depend fully on the provider's heuristics, there are no guarantees, and no "why didn't this hit" diagnostic either.

Explicit. You create a named cache object with a TTL and reference it by ID in subsequent requests. Pros: hits are predictable, you control what gets cached and when. Cons: an extra API mechanism, you pay for creation and storage, contents are immutable, you manage the TTL and clean up after yourself.

Both broken, no choice

Implicit — a lottery. Predictability is zero. On the fresh Gemini 3.5 Flash right after release our hit rate was around zero. On older models — 40–50% in the best case, and even that isn't promised. There's no "do this and you'll hit the cache" anywhere in the docs; you find out about a miss from the bill. The canonical bug — googleapis/python-genai#1880 (stable prefix, changing suffix, hit rate dancing between 40–60%); the recent funhouse — issue #2064 (on Gemini 3 Flash, between roughly 9K–17K prompt tokens, cached_content_token_count silently drops to zero); the parallel pain thread — "Has anyone gotten implicit caching to work?" on the developer forum.

Explicit — the game isn't worth the candle. The cache object has to include the full prompt config: system prompt, tool definitions, generation config. Any change to that part and you have to recreate the cache; contents are immutable. That cuts effectiveness by an order of magnitude right away: there's no shared "system prompt cache", every system + tools pair gets its own object. And in our domain even reuse between two runs of the same pipeline is almost impossible — because the cache fragments inside a single run: the planner goes with one toolset, the implementors with another, different implementors among themselves with different ones too. Each step lives with its own cache object, and meaningful matches between runs stay rare. On top of that — a TTL you have to extend, and cleanup after use, otherwise you're paying to store garbage.

Bottom line: you still pick the explicit one — it's the only one that gives any predictability. Just understand that effectiveness in a complex pipeline is modest, and you can only bake it into the economy where the system + tools combo is genuinely stable between requests — most often that's the iterations of a single agent loop. Don't lean on implicit caching in the base case; lean on explicit cautiously, with full awareness of its limits.

Keep it lean

The simplest architecture is the best one. If planner + implementors + a few basic tools for state mutation already solve your problem, you usually don't need to stack more abstractions on top. Complexity will find you on its own. Don't help.