Feb 18, 2026

AI Agents in Production — Lessons from the Field

Boring Code · 7 min read

Demos are easy. An agent that browses the web, writes code, and sends emails looks impressive in a 5-minute screencast. Shipping that same agent to production users who depend on it for real work is a different problem entirely.

What breaks in production

Reliability at the tails

LLM outputs are probabilistic. In development, you test the happy path. In production, you encounter the 1% of inputs that produce malformed JSON, infinite loops, or nonsensical tool calls. An agent that works 99% of the time still fails about 10 times a day at 1,000 tasks per day, and the count grows linearly with volume.

What works: Structured outputs with schema validation. If your agent must produce a function call or a JSON object, enforce the schema at the model layer (structured-output or constrained decoding, where your provider supports it) and validate again at the application layer before anything executes. Fail fast with clear error messages rather than silently producing garbage.
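The application-layer half of that advice can be sketched with the standard library alone. This is a minimal illustration, not our production validator: the `send_email` tool, its three fields, and the `ToolCallError` name are all hypothetical, and a real system would more likely use a schema library.

```python
import json
from dataclasses import dataclass


class ToolCallError(ValueError):
    """Raised when model output does not match the expected tool-call shape."""


@dataclass(frozen=True)
class SendEmailCall:
    # Hypothetical tool arguments, chosen only for illustration.
    to: str
    subject: str
    body: str


REQUIRED_FIELDS = {"to": str, "subject": str, "body": str}


def parse_send_email(raw: str) -> SendEmailCall:
    """Validate raw model output and fail fast with a specific message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"model output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ToolCallError(f"expected a JSON object, got {type(data).__name__}")

    missing = sorted(REQUIRED_FIELDS.keys() - data.keys())
    if missing:
        raise ToolCallError(f"missing required fields: {missing}")
    extra = sorted(data.keys() - REQUIRED_FIELDS.keys())
    if extra:
        raise ToolCallError(f"unexpected fields: {extra}")

    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected):
            raise ToolCallError(
                f"field {name!r} must be {expected.__name__}, "
                f"got {type(data[name]).__name__}"
            )
    return SendEmailCall(**data)
```

Rejecting unexpected fields is a deliberate choice here: a hallucinated argument is usually a sign the model misunderstood the tool, so it is safer to stop than to ignore it.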

Cost at scale

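An agent that makes 10 LLM calls per task costs roughly 10x as much as a single-call system, and often more, because each call in a loop tends to re-send a growing context. That's fine for a demo. At 1,000 tasks per day, that is 10,000 LLM calls a day and a budget line that needs to be justified.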
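A back-of-the-envelope version of that arithmetic is sketched below. Every number in it is a placeholder assumption (tokens per call, price per million tokens), not a measurement or a real price list, and it treats each call as the same size, which understates agents whose context grows with every step.

```python
def daily_cost_usd(
    tasks_per_day: int,
    calls_per_task: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimate daily spend, assuming every call uses the same number of tokens."""
    total_tokens = tasks_per_day * calls_per_task * tokens_per_call
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs: 2,000 tokens per call at a hypothetical $5 per million tokens.
single_call = daily_cost_usd(1_000, 1, 2_000, 5.0)
agent = daily_cost_usd(1_000, 10, 2_000, 5.0)
print(f"single call: ${single_call:.2f}/day, agent: ${agent:.2f}/day")
```

Plugging in your own per-step token counts is usually enough to see which steps deserve a cheaper model or a cache.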

What works: Cache aggressively. Many agent sub-tasks are semantically identical — looking up the same documentation, running the same classification. Cache by semantic similarity, not just exact match. Profile which steps drive cost and find cheaper alternatives for the expensive ones.
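A minimal sketch of a similarity-based cache follows. The `toy_embed` function is a stand-in (bag-of-words counts) so the example runs without dependencies; in practice you would swap in a real embedding model and a vector index. `SemanticCache` and the `0.9` default threshold are hypothetical, and the threshold in particular needs tuning, since a value that is too low returns answers to the wrong question.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[str]:
        """Return the closest cached value if it clears the threshold."""
        vec = toy_embed(query)
        best_score, best_value = 0.0, None
        for stored_vec, value in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_value = score, value
        if best_value is not None and best_score >= self.threshold:
            return best_value
        return None

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        """Serve from cache when a similar query exists; otherwise compute and store."""
        cached = self.lookup(query)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        value = compute(query)
        self.entries.append((toy_embed(query), value))
        return value
```

The `hits` and `misses` counters are there so the cache can report its own hit rate, which is the number that tells you whether a given step is worth caching at all.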

Observability

You cannot debug an agent you can't observe. Without traces, you have no idea why it failed on a particular input, how long each step took, or which tool calls produced unexpected results.

What works: Structured logging of every step — input, tool calls made, outputs, latency, token counts. We use a simple trace format: each agent run gets a trace ID, every step gets a span, every LLM call logs prompt + completion + cost. This makes debugging tractable.
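The trace format described above can be sketched in a few dozen lines. This is an illustrative reduction rather than our actual tracer: the `Trace` and `Span` names, the attribute keys, and the JSON-lines output are assumptions, and the token and cost values in the usage are placeholders.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    attributes: dict = field(default_factory=dict)
    latency_ms: float = 0.0
    error: Optional[str] = None


class Trace:
    """One agent run: a trace ID plus one span per step."""

    def __init__(self, sink=print):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._sink = sink  # anything that accepts one JSON string per span

    @contextmanager
    def span(self, name, **attributes):
        record = Span(self.trace_id, uuid.uuid4().hex[:16], name, dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            record.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            record.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)
            self._sink(json.dumps(asdict(record)))


# Usage: one span per LLM call, with prompt, completion, and cost attached.
trace = Trace()
with trace.span("llm_call", model="example-model") as s:
    s.attributes.update(
        prompt="Classify this ticket",
        completion="billing",
        prompt_tokens=12,       # placeholder value
        completion_tokens=1,    # placeholder value
        cost_usd=0.0,           # placeholder value
    )
```

Emitting each span in a `finally` block means failed steps are logged too, which is where most of the debugging value is.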

Patterns we've converged on

Short context windows, explicit handoffs. Long agent contexts drift. The agent loses track of its goal, starts referencing stale state, and produces inconsistent outputs. We break long tasks into smaller sub-agents with explicit state passing between them.
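One way to make handoffs explicit is to pass a small immutable state object between sub-agents instead of a shared, ever-growing transcript. The sketch below assumes that shape; `Handoff`, `research`, `draft`, and `run_pipeline` are hypothetical names, and the two stage functions stand in for real LLM-backed sub-agents.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Handoff:
    """Everything the next sub-agent is allowed to know."""
    goal: str
    facts: Tuple[str, ...] = ()
    draft: Optional[str] = None


SubAgent = Callable[[Handoff], Handoff]


def research(state: Handoff) -> Handoff:
    # A real sub-agent would call an LLM with a prompt built only from `state`.
    return replace(state, facts=state.facts + ("API v2 requires an auth header",))


def draft(state: Handoff) -> Handoff:
    text = f"{state.goal}: " + "; ".join(state.facts)
    return replace(state, draft=text)


def run_pipeline(goal: str, stages: Sequence[SubAgent]) -> Handoff:
    """Run sub-agents in order, each starting from the previous handoff only."""
    state = Handoff(goal=goal)
    for stage in stages:
        state = stage(state)
        if state.goal != goal:
            # Goal drift is caught at the boundary instead of many steps later.
            raise RuntimeError(f"{stage.__name__} changed the goal")
    return state


result = run_pipeline("Summarize API changes", [research, draft])
print(result.draft)
```

Because the state is frozen, a sub-agent cannot quietly mutate what an earlier one produced; it has to return a new handoff, which is easy to log and diff.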

Human-in-the-loop for high-stakes decisions. Not every decision should be made autonomously. We design agents with checkpoints — moments where the system surfaces a proposed action to a human before executing it. This is not a limitation. It's a feature that builds trust.
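A checkpoint can be as simple as a function that refuses to run a high-stakes action until an approver says yes. The sketch below assumes that design; `ProposedAction`, `console_approver`, and `execute_with_checkpoint` are hypothetical names, and a real system would route approval through a review queue or chat tool rather than `input()`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    name: str
    description: str
    high_stakes: bool


Approver = Callable[[ProposedAction], bool]


def console_approver(action: ProposedAction) -> bool:
    """Ask a human on the terminal; anything other than 'y' is a rejection."""
    answer = input(f"Approve {action.name}? {action.description} [y/N] ")
    return answer.strip().lower() == "y"


def execute_with_checkpoint(
    action: ProposedAction,
    run: Callable[[], None],
    approve: Approver,
) -> str:
    """Run low-stakes actions directly; gate high-stakes ones on approval."""
    if action.high_stakes and not approve(action):
        return "rejected"
    run()
    return "executed"
```

Defaulting to rejection on anything but an explicit yes keeps an unanswered or ambiguous prompt from turning into an executed action.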

Graceful degradation over autonomous recovery. When an agent fails, the tempting response is to have it retry automatically or try to self-correct. In production, autonomous recovery often makes things worse: a retry can repeat side effects, such as sending the same email twice, or build on the state that caused the failure. We prefer: detect failure, surface it clearly, route to a human or a simpler fallback.
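The detect, surface, route sequence can be sketched as a single wrapper. This is an assumption-laden illustration: `Outcome`, `run_with_degradation`, and the three status strings are hypothetical, and the fallback stands in for whatever simpler path you trust, such as a template or a single-call classifier.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("agent")


@dataclass
class Outcome:
    status: str  # "ok", "fallback", or "needs_human"
    result: Optional[str] = None
    error: Optional[str] = None


def run_with_degradation(
    task: str,
    agent: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> Outcome:
    """Try the agent once; on failure, log it and degrade instead of retrying."""
    try:
        return Outcome("ok", result=agent(task))
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
        logger.error("agent failed on task %r: %s", task, error)

    if fallback is not None:
        try:
            return Outcome("fallback", result=fallback(task), error=error)
        except Exception as exc:
            error = f"{error}; fallback failed with {type(exc).__name__}: {exc}"
            logger.error("fallback failed on task %r: %s", task, error)

    # Nothing automated worked: hand the task, with its error, to a person.
    return Outcome("needs_human", error=error)
```

The original error travels with the outcome even when the fallback succeeds, so a degraded result is never mistaken for a clean one.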

The honest assessment

AI agents are genuinely useful for a specific class of problems: tasks that are too complex for a single LLM call, too structured for a human to do efficiently, and tolerant enough of occasional errors to run autonomously. That's a real and growing category.

But they're not magic. The infrastructure — reliability, observability, cost controls, human oversight — is unglamorous work that determines whether the demo becomes a product people actually trust.

At Boring Code, this infrastructure is the work we do before the first agent runs in production.