Why Your Agent Loops Forever (And How to Design One That Doesn't)
An agent is a language model in a loop. The model generates a response. That response may include a tool call. The tool call executes and returns an observation. The observation is appended to the context. The model generates the next response. This continues until the model decides it is done.
The problem is that language models are not naturally inclined to decide they are done. They are trained to be helpful, which means continuing to generate rather than terminating. A model that stops prematurely has failed to complete the task. The training signal penalizes premature stopping more visibly than it penalizes excessive continuation, so the learned behavior is to keep going. Without explicit architectural constraints on termination, step count, and tool call structure, an agent loop will run until it exhausts its context window or encounters a tool call that fails repeatedly, at which point it either halts on an error or, more dangerously, generates a fabricated observation and continues as if the data were real.
The three failure modes are predictable. So are the fixes. This post covers both.
The Three Failure Modes
The first failure mode is the infinite loop. The agent calls a tool, receives a result, interprets the result as incomplete, calls a tool again with slightly different parameters, receives a similar result, and repeats. Without a maximum step count enforced by the loop orchestrator (not by the model), this continues indefinitely. Consider a web search agent trying to find a more authoritative source for a claim: it reformulates the query, gets plausible results, judges none of them authoritative enough, reformulates again. Or a code execution agent catching an error and re-running with minor modifications: it changes one variable name, observes the same error, changes another, and so on without converging. Or a database query agent that refines its SQL at each step but has no satisfaction criterion, no explicit condition under which it decides the data is good enough to proceed. In all three cases, the loop is not logically incorrect at any individual step. It is architecturally incomplete: the orchestrator has delegated termination to the model and the model has no reason to terminate.
The second failure mode is hallucinated tool calls. When a model encounters a tool schema it does not fully understand, or when the schema allows free-form arguments, the model generates a tool call with plausible-looking but invalid parameters. The tool returns an error. The model interprets the error message, generates another tool call with slightly different parameters, and may continue this process across many steps. Or, more dangerously, the model fabricates a plausible tool result without actually calling the tool, appending fake observations to its own context and proceeding as if the data were real. This second variant is harder to detect: the loop terminates normally, produces a confident final answer, and is entirely wrong because two of its five observations were invented. The fabrication mode is particularly common when the model has high prior confidence about the expected answer: it generates the observation that would make the final answer consistent rather than the observation the tool would actually return.
The third failure mode is state corruption. The agent accumulates observations across steps in its context window, but the context window is an unstructured string. When step 3's observation contradicts step 1's observation because a tool returned stale data in step 1, there is no mechanism for the model to detect and resolve the contradiction. The model may combine facts from inconsistent states in its final answer without flagging the inconsistency. More subtly, as the context window fills, the attention mechanism distributes its weight across an increasingly long sequence. Earlier observations receive less attention relative to recent ones, not because they are less important but because of the recency bias in how attention patterns form during inference. Critical facts from early steps are effectively forgotten not because they were truncated but because the model's attention has drifted away from them.
Each failure mode has a specific location in the loop and a specific fix. Infinite loops originate at the orchestrator, which has outsourced the termination decision to the model. The fix is an enforced step budget. Hallucinated calls originate at the tool schema, which gives the model too many degrees of freedom. The fix is narrow schemas with strict types. State corruption originates at the context window, which stores observations as an unstructured string with no conflict detection. The fix is a typed state object.
Tool Schema Design
Tool schemas are the interface between the model and the world. A schema that is too broad gives the model excessive degrees of freedom and increases the probability of invalid calls. A schema that is too narrow requires multiple tool calls for operations that could be combined, increasing latency and cost. The right level of specificity is narrow single-purpose tools with strict types.
Consider a vector database search tool. The broad version accepts a query string and an options object with any structure the model chooses to provide. The model must know that valid option fields are top_k, filter, and namespace. It must know that top_k should be an integer between 1 and 20. Nothing in the schema enforces these constraints, so the model might pass "top_k": "five" or "options": {"ranking": "bm25"} or omit required context entirely. Each invalid call produces an error, the model attempts a correction, and the retry loop is the infinite loop failure mode by another name.
The narrow version accepts exactly three parameters: a required string query, an optional integer top_k with a default of 5 and a maximum of 20, and an optional enum doc_type with values ["research", "legal", "news"]. There is nothing else to get wrong. The model cannot generate an invalid doc_type because the schema expresses the complete set of valid values. It cannot pass "top_k": "five" because the type is declared as integer. The schema has reduced the model's degrees of freedom to the set of valid actions.
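The narrow schema can be written down concretely. A minimal sketch in Python using an OpenAI-style function-calling schema; the tool name search_docs and the validate_call helper are illustrative, not from any specific API:

```python
# Narrow tool schema: every parameter typed, bounded, or enumerated.
# An OpenAI-style function definition expressed as a plain dict.
SEARCH_TOOL_SCHEMA = {
    "name": "search_docs",
    "description": "Search the vector database for relevant documents.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query text."},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
            "doc_type": {"type": "string", "enum": ["research", "legal", "news"]},
        },
        "required": ["query"],
    },
}

VALID_DOC_TYPES = ("research", "legal", "news")

def validate_call(args: dict) -> dict:
    """Server-side validation, applied regardless of the declared schema."""
    query = args["query"]
    top_k = int(args.get("top_k", 5))  # raises ValueError on "five"
    if not 1 <= top_k <= 20:
        raise ValueError(f"top_k out of range: {top_k}")
    doc_type = args.get("doc_type")
    if doc_type is not None and doc_type not in VALID_DOC_TYPES:
        raise ValueError(f"invalid doc_type: {doc_type}")
    return {"query": query, "top_k": top_k, "doc_type": doc_type}
```

The validator duplicates the schema's constraints on purpose: the schema reduces the model's degrees of freedom, and the server-side check catches anything that slips through.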
Use enum types for every parameter that has a fixed set of valid values. If a file operation tool can take actions read, write, or append, represent that as an enum, not a free string. The model will always select from the declared values rather than generating novel strings like write_append or READ. Use integers with explicit bounds for count parameters. Validate all inputs server-side regardless of schema constraints because the model can still hallucinate values that conform to the declared type but fall outside sensible ranges.
Return structured outputs from tools, not free-text strings. A tool that returns "Found 3 results: [result1], [result2], [result3]" as a plain string forces the model to parse the string to extract the data. A tool that returns a JSON array of objects with fields id, text, score, and metadata makes the structure explicit, reduces the probability of misinterpretation, and makes the observations straightforward to log and audit. The cost of implementing structured returns is one afternoon of engineering. The benefit is that every downstream consumer of the tool output (the model, the log system, the debugger) works with machine-readable data rather than a string that must be reverse-parsed.
Every optional free-form parameter is an opportunity for hallucination. Every required parameter the model must fill in correctly from memory is an opportunity for a wrong value. The goal is a schema where the set of structurally valid calls is as close as possible to the set of semantically meaningful calls.
Termination Conditions
A well-designed agent loop terminates on exactly three conditions: the model signals completion, the step budget is exhausted, or an unrecoverable error is encountered. All three need to be implemented in the orchestrator.
The step budget is the most important. It is a hard limit on the number of agent steps before the loop terminates, regardless of whether the model has indicated completion. Set it based on the expected number of steps for your task type: a question-answering agent that retrieves and synthesizes information should complete in 3 to 5 steps; a code generation and debugging agent may need 10 to 15 steps. Set the budget at 2 to 3x the expected step count to allow for retry logic, but enforce it unconditionally. When the budget is exceeded, the loop terminates and returns the best answer generated so far or an explicit failure signal. Do not let the orchestrator ask the model whether it should continue. The model will say yes.
Confidence thresholds are a soft termination condition. If the model assigns high confidence to its current answer, it may terminate before using its full budget. In practice, language model confidence estimates are poorly calibrated: a model that is 95% confident is wrong significantly more often than 5% of the time, particularly when it has been operating in a domain outside its training distribution or when its context contains internally inconsistent observations. Confidence thresholds are useful as a secondary signal to save compute on easy queries, but they cannot be the primary termination mechanism without the hard step budget as a backstop.
Stop tokens are the model's self-declared termination signal. When the model generates a designated token such as <FINAL_ANSWER> or a structured output that matches the expected final answer schema, the loop terminates immediately regardless of remaining budget. This requires the model to be prompted or fine-tuned to generate the stop token consistently, which is achievable with a carefully written system prompt that defines the expected output structure and gives explicit examples. Base models without task-specific prompting will not reliably generate stop tokens. With a good system prompt, the model can learn to use them consistently, and the result is an agent that terminates as soon as it has a good answer rather than consuming all remaining budget.
The combination of all three conditions gives you an agent that terminates quickly on easy tasks via the stop token, fails gracefully on tasks that hit the step budget, and does not continue past an unrecoverable tool error. The orchestrator code that implements this is simple: a counter that increments per step, checked before every LLM call, plus a parser that looks for the stop token in every model response.
State Management: Typed State vs. Context Window Reliance
The simplest state management strategy is to let the context window accumulate all observations. Every tool result is appended to the running conversation, and the model receives the full history at each step. This works adequately for short agents with 3 to 5 steps and small observations, but it has two failure modes at scale.
The first is context window exhaustion. An agent that retrieves multiple large documents, executes code with verbose output, or queries multiple database tables will fill its context window within 10 to 15 steps. The naive response is to truncate the oldest observations. The problem with truncation is that the observations most important to the current step may not be the most recent ones. The step 2 observation that establishes the user's data schema is more important than the step 9 observation showing the first failed query attempt, but a recency-based truncation strategy discards step 2 before step 9.
The second failure mode is contradiction propagation. An agent that collects observations from multiple sources with no explicit reconciliation mechanism will produce final answers that combine facts from inconsistent states. The model has no procedure for detecting that observation 4 contradicts observation 1, because the context window is a flat string with no semantic structure. The model treats the contradiction as a nuanced situation requiring synthesis and helpfully synthesizes a confident-sounding answer from incompatible data.
A typed state object is the architectural solution to both problems. It is a structured record (a Python dataclass, a TypeScript interface, or a Pydantic model) that the agent explicitly updates at each step. The state contains the canonical current values for all relevant variables: the task objective, the current plan, the observations collected so far (with timestamps and source tool), intermediate results, and any detected contradictions. The context window sent to the model at each step is constructed from the typed state rather than from the raw accumulated history.
from dataclasses import dataclass
from typing import Any, Literal

@dataclass
class AgentState:
    objective: str
    step: int
    plan: list[str]
    observations: list[Observation]        # each has: step, tool, result, timestamp
    intermediate_results: dict[str, Any]
    contradictions: list[tuple[int, int]]  # pairs of contradicting observation indices
    status: Literal["running", "complete", "failed", "budget_exceeded"]
With typed state, explicit conflict resolution becomes possible. When a new observation arrives, a conflict detection function compares the new value against the relevant fields in the current state. If a contradiction is detected, the function applies a resolution strategy: prefer the more recent value, prefer the value from the more authoritative source, or flag the contradiction for inclusion in the context so the model can reason about it explicitly. Without typed state, contradictions pass silently into the flat context string and the model may or may not notice them.
Typed state also enables state persistence across context windows. When the context window must be reset, the full observation history can be compressed into a state summary that preserves the essential facts without the verbose intermediate outputs. The model at step 16 can receive a state summary constructed from the typed state object rather than a truncated version of the raw conversation history that may be missing critical earlier observations.
The debugging value of typed state justifies its implementation cost on its own. An agent that fails silently with no state log is nearly impossible to diagnose. An agent that logs its full typed state at each step produces a complete audit trail: every decision the model made, what information it had at the time, what tools it called, and what those tools returned. When the agent produces a wrong final answer, reading the state log at each step tells you exactly which observation was wrong, at which step the reasoning diverged, and whether the failure was in the tool execution or in the model's interpretation of the result.
For production agents, typed state is not optional. It is the difference between a system you can debug and a system you can only restart.
ReAct vs. Plan-and-Execute
The ReAct pattern (Yao et al., 2022) structures each agent step as a three-part sequence: Thought, the model's reasoning about what to do next; Action, the tool call; and Observation, the tool's result. This interleaving of reasoning and action allows the model to adapt its plan dynamically based on each observation. After each tool call, the model re-evaluates whether its current plan still makes sense given the new information. ReAct works well for exploratory tasks where the path to the answer is not known in advance and each step's result meaningfully changes what the next step should be.
A financial research agent using ReAct might start with a broad search for a company's revenue figures, find an investor presentation, discover from that presentation that the company uses a non-standard revenue recognition policy, and pivot to a more specific search targeting their SEC filings rather than press releases. None of this adaptation was possible to plan upfront; it emerged from the observations. ReAct accommodates this because the model replans at every step.
The limitation of pure ReAct is the absence of a natural termination condition. The model reasons about whether it has enough information to answer, but "enough information" is a subjective judgment that the model will often resolve in favor of gathering more. Given a step budget, a ReAct agent will tend to exhaust it on almost every query, not because the query requires many steps but because the model is never fully satisfied; without a hard budget enforced by the orchestrator, it simply keeps going.
Plan-and-Execute (Wang et al., 2023) separates planning from execution. In the planning phase, the model generates a complete multi-step plan before any tools are called. In the execution phase, each step of the plan executes sequentially, often with a smaller, cheaper model handling the individual execution steps. Plan-and-Execute provides a natural termination condition: the agent is done when all plan steps have executed. This makes it easier to enforce time and cost budgets, because the number of steps is bounded by the plan length.
A report generation agent using Plan-and-Execute produces a plan with five steps: retrieve Q3 earnings report, retrieve Q2 earnings report, extract revenue figures from each, calculate growth rate, format final answer. Each step is straightforward enough that a 7B-parameter model can execute it reliably. The planning step uses the more capable model but only once; the execution steps use the cheaper model five times. Total inference cost is lower than ReAct on the same task because ReAct would interleave reasoning with every step using the full model.
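The two-phase structure can be sketched as follows. The planner and executor signatures are assumptions, standing in for a single call to the capable model and per-step calls to the cheaper one:

```python
from typing import Callable

def plan_and_execute(planner: Callable[[str], list[str]],
                     executor: Callable[[str, list[str]], str],
                     task: str,
                     max_plan_len: int = 10) -> list[str]:
    """Plan once with the capable model, then execute each step with the
    cheaper model. Termination is bounded by the plan length."""
    plan = planner(task)[:max_plan_len]   # planning phase: one expensive call
    results: list[str] = []
    for step in plan:                     # execution phase: bounded loop
        results.append(executor(step, results))  # prior results as context
    return results
```

The step budget here is implicit: the loop cannot run longer than the plan, which is exactly the termination property the pattern trades adaptability for.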
The limitation of Plan-and-Execute is that the plan may be wrong. If step 2 reveals that the Q2 earnings report uses a different accounting period than the Q3 report, the pre-generated plan has no mechanism for adapting. A ReAct agent would notice the discrepancy and adjust its next action; a Plan-and-Execute agent executes the remaining plan steps against potentially incompatible data. Mid-execution adaptation requires the orchestrator to detect when an observation contradicts the plan's assumptions and either re-plan from the current state or signal a failure.
The choice depends on task structure. For exploratory retrieval, investigation, and open-ended research tasks where the correct path emerges from the data, use ReAct: the adaptive replanning at each step is a feature, not overhead. For structured workflows with a known sequence of operations, including report generation, test suite execution, and data pipeline runs, use Plan-and-Execute: the predictable termination and the ability to use cheaper models for execution steps both matter. For tasks that require both exploration and structured execution, a hybrid is appropriate: use ReAct to produce a plan as the first phase, then execute that plan using the Plan-and-Execute pattern for the remaining phases. The exploratory phase benefits from adaptive replanning; the execution phase benefits from the step-budget clarity that a pre-specified plan provides.
Both patterns need explicit step budgets. The Plan-and-Execute budget is bounded by the plan length; the ReAct budget requires a hard limit in the orchestrator. A ReAct agent with no step budget is not a design choice but a deployment risk.
Observability: The Minimum Viable Tracing Schema
An agent in production without observability is a black box that produces outputs and occasionally runs up a bill. When it fails, you have no way to distinguish between a bad model response, an invalid tool call, a tool that returned incorrect data, and a state corruption that propagated undetected across five steps. All of these failures produce the same symptom: a wrong final answer. They require completely different fixes.
The minimum viable tracing schema for an agent step log has seven fields: timestamp, step number, the model input (the context sent to the model at this step, truncated to a manageable size for storage), the model output (the raw response, including any externalized reasoning), the tool called (name and full parameter values), the tool's return value, and the step duration in milliseconds. Every step appends one record to the log.
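As a sketch, the seven fields map onto a record like the following. The tool call is split into name and parameters, and the 4096-character truncation size is an arbitrary choice for illustration:

```python
from dataclasses import dataclass, asdict
import json

MAX_SNAPSHOT = 4096  # truncation size for the stored context; an assumption

@dataclass
class StepTrace:
    timestamp: float
    step: int
    model_input: str      # context sent to the model, truncated for storage
    model_output: str     # raw response, including externalized reasoning
    tool_name: str
    tool_params: dict     # full parameter values
    tool_result: str
    duration_ms: float

def log_step(log: list[str], trace: StepTrace) -> None:
    """Append one JSON record per agent step."""
    trace.model_input = trace.model_input[:MAX_SNAPSHOT]
    log.append(json.dumps(asdict(trace)))
```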
This schema enables four critical operations. Debugging failures by replaying the step log: given a wrong final answer, read the log from the first step and identify the exact step where the reasoning diverged or the tool returned bad data. Measuring per-step latency to identify which tool calls are slow: if step 4's tool call takes 800ms while all others take 50ms, the slow tool is the optimization target. Detecting infinite loops by counting repeated identical tool calls: if the same tool is called with the same parameters across three consecutive steps, the loop is converging nowhere and the orchestrator should terminate it. Auditing agent behavior for compliance or safety review: the full step log is a complete record of every action the agent took and every data source it accessed, which is a regulatory requirement in some domains.
At production scale, this log generates substantial data. A 10-step agent serving 10,000 requests per day produces 100,000 log entries per day. Each entry with a truncated context snapshot may be 2 to 10 KB. Total: 200 MB to 1 GB per day per agent type. Store it in a structured log system (Elasticsearch, ClickHouse, or a managed structured-logging service) with indexed fields for request ID, step number, and outcome. Set retention policies: 30 days for successful requests, 90 days for failed requests and for requests that exceeded the step budget.
A practical sampling strategy: store full step logs for 5% of successful requests, summary logs (final state only, no per-step detail) for the remaining 95%, and full logs for all requests that ended in an error or exceeded the step budget. This keeps storage costs manageable while ensuring that every failure is fully diagnosable. The 5% full-log sample covers normal behavior monitoring; the full logs on failures cover the cases you actually need to debug.
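The sampling decision itself is a few lines. The outcome strings here mirror the status values used earlier in this post; everything else is an illustrative assumption:

```python
import random

def should_store_full_log(outcome: str, sample_rate: float = 0.05) -> bool:
    """Full step logs for every failure and budget overrun;
    a sampled fraction (default 5%) of successful requests."""
    if outcome in ("failed", "budget_exceeded"):
        return True                       # failures are always fully diagnosable
    return random.random() < sample_rate  # sampled monitoring of normal behavior
```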
The implementation cost of this tracing schema is low. The orchestrator loop already has access to every piece of information in the schema: it constructs the model input, receives the model output, executes the tool calls, and measures step duration. Adding structured logging at each step is a matter of writing a log record at the point in the orchestrator where all seven fields are available. The return on that investment is the ability to answer the question "why did the agent produce this wrong answer" in minutes rather than days.
References
Yao, S., Zhao, J., Yu, D., et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023. https://arxiv.org/abs/2210.03629
Wang, L., Xu, W., Lan, Y., et al. "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." ACL, 2023. https://arxiv.org/abs/2305.04091
Schick, T., Dwivedi-Yu, J., Dessì, R., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023. https://arxiv.org/abs/2302.04761
Significant Gravitas. "Auto-GPT: An Autonomous GPT-4 Experiment." GitHub, 2023. https://github.com/Significant-Gravitas/AutoGPT
OpenAI. "Function Calling Documentation." OpenAI Platform, 2023. https://platform.openai.com/docs/guides/function-calling
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. "Gorilla: Large Language Model Connected with Massive APIs." arXiv, 2023. https://arxiv.org/abs/2305.15334