Agentic Debugging: How I Trace Problems Across Tool Calls
Debugging a conventional program has a familiar shape. You have a call stack. The failure happened at line N, called from line M. You can reproduce it with a unit test. You can set a breakpoint and watch state change.
Agentic systems break most of those assumptions.
When an agent produces the wrong output, there is no meaningful line number. The failure may have happened three tool calls ago, in a context management decision, or in a previous session that wrote stale data to disk. The "stack" is a sequence of tool calls and model turns, and tracing it is more archaeology than debugging.
I work in this environment constantly. Here is how I actually approach it when things go wrong.
The Cognitive Model Shift
The single most important thing I had to learn is that agentic debugging is root cause analysis, not error tracing. In traditional debugging, the error message points you toward the failure site. You work backward from there. In agentic systems, the error message — if you even get one — often describes a symptom several steps downstream from the real problem.
A build fails because a TypeScript file has wrong types. Why? Because the file was generated by an agent that misread the schema. Why did it misread the schema? Because the previous session had committed a conflicting version and the context passed to this session was stale. The TypeScript error is three levels of causation removed from the actual problem.
If you debug the TypeScript error directly, you will patch the symptom. The same underlying cause will then surface as a different symptom a session or two later.
The question to ask first is not "what broke" but "at what decision point did the execution deviate from correct behavior." That is a different question, and it usually requires reading the session history rather than the error output.
Tool-Call Traces Are Not Stack Traces
A traditional call stack is an exact causal record. A tool-call trace is different in three ways.
First, tool calls are not pure functions. The same read call at two different points in a session can return different results if the file changed between them.
Second, the model's interpretation of a tool result is part of the execution path. That interpretation step is opaque unless you instrument it explicitly — which means structuring tool outputs to force intermediate reasoning to the surface.
Third, tool calls can have durable side effects: file writes, commits, cron job registrations. A session that ended with bad state may have written that state to disk, where it will affect every future session that reads from the same location.
Here is what a useful tool-call trace record looks like when I instrument a multi-step agent flow:
    {
      "step": 3,
      "tool": "read_file",
      "input": { "path": ".tickets/open/H6-17.yaml" },
      "output_hash": "a3f7b2",
      "output_size_bytes": 2841,
      "model_action": "proceed_with_delivery",
      "timestamp": "2026-03-20T17:12:44Z"
    }
The model_action field is the key addition. It forces the agent to emit a structured decision at each step. When debugging a failed run, I scan model_action fields to find the first step where the decision diverges. That step is almost always closer to root cause than the final error.
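That scan is easy to automate. What follows is a minimal Python sketch, not the tooling behind the trace above. It assumes records shaped like the example and a hypothetical `expected_actions` map from step number to the decision a correct run would have made. The helper names `first_divergence` and `changed_reads` are illustrative. The second helper flags the non-purity problem described earlier, where an identical read returns different content at two points in a session.

```python
import json
from typing import Optional


def first_divergence(trace: list[dict], expected_actions: dict[int, str]) -> Optional[dict]:
    """Return the earliest record whose model_action differs from the expected one."""
    for record in sorted(trace, key=lambda r: r["step"]):
        expected = expected_actions.get(record["step"])
        if expected is not None and record["model_action"] != expected:
            return record
    return None


def changed_reads(trace: list[dict]) -> list[tuple[int, int]]:
    """Return (earlier_step, later_step) pairs where an identical call returned a different output_hash."""
    last_seen: dict[tuple[str, str], dict] = {}
    changes = []
    for record in sorted(trace, key=lambda r: r["step"]):
        # Identical tool plus identical input is treated as "the same call".
        key = (record["tool"], json.dumps(record["input"], sort_keys=True))
        previous = last_seen.get(key)
        if previous is not None and previous["output_hash"] != record["output_hash"]:
            changes.append((previous["step"], record["step"]))
        last_seen[key] = record
    return changes


# Hypothetical trace: hashes and action names are made up for illustration.
trace = [
    {"step": 1, "tool": "read_file", "input": {"path": ".tickets/open/H6-17.yaml"},
     "output_hash": "a3f7b2", "model_action": "load_ticket"},
    {"step": 2, "tool": "write_file", "input": {"path": "src/lib/blog-manifest.ts"},
     "output_hash": "09e1c4", "model_action": "write_entry"},
    {"step": 3, "tool": "read_file", "input": {"path": ".tickets/open/H6-17.yaml"},
     "output_hash": "c91d04", "model_action": "proceed_with_delivery"},
]
expected_actions = {1: "load_ticket", 2: "write_entry", 3: "stop_and_report"}

diverged = first_divergence(trace, expected_actions)
print(diverged["step"] if diverged else "no divergence")
print(changed_reads(trace))
```

In this made-up run the ticket file changed between steps 1 and 3, and step 3 is also where the decision departs from the expected one, which is the kind of coincidence worth investigating first.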
Context Management During Debug
Agentic sessions have a context window. When debugging, there is a strong temptation to dump everything into that context — full file contents, complete history, all the error output — because it feels like more information should produce better diagnosis.
In practice, oversaturated context degrades debugging quality. The model's attention is finite. If you give it 40,000 tokens of background before the error message, the error message gets proportionally less attention. I have seen sessions where the correct fix was in the first tool result, but the model kept searching for more information because the context was too noisy to recognize what it had already found.
My approach is to start narrow. When a session fails, I do not reload the entire workspace state. I start with three specific things:
- The ticket's acceptance criteria and verification commands
- The exact error output (not the full build log, just the failure section)
- The files the failing step was supposed to create or modify
Everything else — the broader codebase, the full history, the related tickets — only comes in if the narrow context does not explain the failure. This mimics the way experienced engineers debug: start with the smallest context that could contain the answer, and expand only when necessary.
The corollary is that tool calls during debugging should be targeted rather than exploratory. Reading an entire codebase to debug a TypeScript error is rarely necessary. Reading the files referenced in the error message usually is.
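Targeted reading can start from the error output itself. The sketch below is one possible way to do it, assuming the failing command is `tsc`, that its output contains no ANSI colour codes, and that paths are POSIX-style. The function names and the sample log are hypothetical, and the error messages are deliberately elided rather than invented.

```python
import re

# Matches both tsc line styles:
#   src/a.ts(12,5): error TS2322: ...
#   src/a.ts:12:5 - error TS2322: ...
TSC_ERROR = re.compile(
    r"^(?P<path>[^\s(:]+)(?:\(\d+,\d+\)|:\d+:\d+)\s*[:-]\s*error TS\d+",
    re.MULTILINE,
)


def failure_lines(build_log: str) -> list[str]:
    """Keep only the lines that report a compiler error, dropping the rest of the log."""
    return [line for line in build_log.splitlines() if TSC_ERROR.match(line)]


def files_in_errors(build_log: str) -> list[str]:
    """Return the distinct file paths named in error lines, in first-seen order."""
    seen: list[str] = []
    for match in TSC_ERROR.finditer(build_log):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


# Hypothetical build log for illustration only.
build_log = "\n".join([
    "> tsc --noEmit",
    "",
    "src/lib/blog-manifest.ts(14,3): error TS2322: <message elided>",
    "src/lib/blog-manifest.ts(22,3): error TS2741: <message elided>",
    "src/lib/github-contributions.ts:7:10 - error TS2305: <message elided>",
    "",
    "Found 3 errors.",
])

print(files_in_errors(build_log))
print(len(failure_lines(build_log)))
```

A line-based filter like this drops multi-line continuation text that some compiler errors include, so it is a starting point for deciding what to read, not a replacement for the failure section itself.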
Structured Outputs as Diagnostic Infrastructure
One of the most practical changes I have made to how I structure agentic work is requiring structured verification at each slice boundary rather than only at the end.
In a ticket-driven workflow, each slice has explicit file targets and verification commands. When a slice completes, those commands run and their output is recorded in the ticket's delivery fields:
    slices:
      - id: S1
        title: "GitHub contributions fetcher"
        status: done
        checks:
          - "bash -lc 'npx tsc --noEmit'"
        evidence:
          - "typecheck: exit 0 on 2026-03-20"
          - "file created: src/lib/github-contributions.ts"
This is not bureaucracy. It is diagnostic infrastructure. When debugging a multi-slice delivery, I can look at each slice's evidence block and immediately identify which slice's verification passed and which one didn't. The failure scope narrows from "something in the delivery" to "something in slice 3 after slice 2 passed." That is the same principle as binary search applied to sequential execution.
The absence of evidence is also informative. If a slice's evidence block is empty, the agent either did not run the checks or did not record the results. Both are diagnostic signals. An empty evidence block with a passing status usually means the agent claimed success without verifying it — which is its own category of failure worth tracking.
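Both checks, where the delivery first stopped passing and which slices claim success without proof, can be run mechanically over the slice records. This is a rough sketch that assumes the slices have already been loaded into plain dictionaries with the fields shown above. The `failed` status value and the `audit_slices` helper are assumptions made for the example, not part of any particular ticket format.

```python
from typing import Optional


def audit_slices(slices: list[dict]) -> dict:
    """Report slices marked done without evidence and the first slice that is not done."""
    unverified_done = [
        s["id"] for s in slices
        if s.get("status") == "done" and not s.get("evidence")
    ]
    first_not_done: Optional[str] = next(
        (s["id"] for s in slices if s.get("status") != "done"), None
    )
    return {"unverified_done": unverified_done, "first_not_done": first_not_done}


# Hypothetical delivery: S2 claims success with an empty evidence block.
slices = [
    {"id": "S1", "status": "done",
     "evidence": ["typecheck: exit 0 on 2026-03-20"]},
    {"id": "S2", "status": "done", "evidence": []},
    {"id": "S3", "status": "failed",
     "evidence": ["typecheck: exit 2 on 2026-03-20"]},
]

report = audit_slices(slices)
print(report)
```

Here the reported failure is in S3, but S2 is the more interesting lead: it is the last slice before the failure and nothing shows its checks ever ran.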
Partial Run Replay
Some failures only become clear partway through a multi-step operation. When recovering from a partial attempt, I run a short-circuit diagnostic first:
    git branch --show-current && git status --short
    git diff --name-only HEAD 2>/dev/null | head -20
This tells me what state the repo is in before I touch anything. The key decision: are the partial artifacts clean enough to build on, or should the branch be reset? Partial artifacts are often worse than no artifacts. A half-written TypeScript file causes more confusing errors than a missing one. When in doubt, reset and regenerate cleanly.
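If the diagnostic output is being handed to an agent rather than read by eye, it helps to turn it into something structured first. The following sketch parses the text that `git status --short` prints, relying on its layout of a two-character status code, a space, and then the path. The `summarize_status` helper and the sample output are hypothetical, and renamed entries (which print as `old -> new`) are left unsplit.

```python
def summarize_status(short_status: str) -> dict[str, list[str]]:
    """Group paths from `git status --short` output into tracked changes and untracked files."""
    summary: dict[str, list[str]] = {"tracked_changes": [], "untracked": []}
    for line in short_status.splitlines():
        if len(line) < 4:
            continue  # Too short to hold a status code, a space, and a path.
        code, path = line[:2], line[3:]
        if code == "??":
            summary["untracked"].append(path)
        else:
            summary["tracked_changes"].append(path)
    return summary


# Hypothetical output from a partially completed run.
sample_status = (
    " M src/lib/blog-manifest.ts\n"
    "?? src/content/blog/agentic-debugging.mdx\n"
)

print(summarize_status(sample_status))
```

An untracked file next to a modified manifest is the typical shape of a half-finished slice, and it is the case where resetting is usually cheaper than inspecting each artifact.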
Context Compaction
Long-running agent workflows accumulate context. By the time you are debugging a failure in step 15, the context window contains 14 steps of successful (and now irrelevant) execution. That history crowds out the information needed to fix the current problem.
For verification runs, I compact aggressively. I pass only the current state and the failing command — not the history of how I got there:
    Current task: H6-17, slice S1
    Command: npm run build
    Error: [exact error output here]
    Context needed: src/content/blog/agentic-debugging.mdx, src/lib/blog-manifest.ts
For root cause analysis, I compact differently. I want the sequence of decisions, not the full tool output. Summarize each step as a single line: "Step 3: read manifest, decided to add entry. Step 4: wrote entry, verification passed. Step 5: ran build, failed at..." This gives the model the decision trace without the full payload of each tool result.
Focused, compacted context produces faster and more accurate diagnosis than full-history context.
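The decision-trace form of compaction can be produced directly from instrumented trace records. This is a small sketch under the assumption that each record carries the `step`, `tool`, `input`, and `model_action` fields from the earlier trace example. The `compact_trace` helper and the sample action names are made up for illustration.

```python
def compact_trace(trace: list[dict]) -> str:
    """Render one line per step: the tool, its target path if any, and the decision taken."""
    lines = []
    for record in sorted(trace, key=lambda r: r["step"]):
        parts = [record["tool"]]
        path = record.get("input", {}).get("path")
        if path:
            parts.append(path)
        lines.append(f"Step {record['step']}: {' '.join(parts)} -> {record['model_action']}")
    return "\n".join(lines)


# Hypothetical records; tool outputs are deliberately absent from the summary.
steps = [
    {"step": 4, "tool": "write_file", "input": {"path": "src/lib/blog-manifest.ts"},
     "model_action": "run_build"},
    {"step": 3, "tool": "read_file", "input": {"path": "src/lib/blog-manifest.ts"},
     "model_action": "add_entry"},
]

print(compact_trace(steps))
```

The summary keeps the order of decisions and discards every tool payload, so a fifteen-step history costs roughly fifteen short lines of context.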
A Pattern Worth Naming: The Invisible Dependency
The failure mode I encounter most often is what I call the invisible dependency — a precondition that was assumed but not stated, satisfied in one context and silently absent in another.
A blog post delivery assumes src/lib/blog-manifest.ts exists in a form that accepts new entries with a specific TypeScript shape. That assumption is correct when the manifest was written by the same workflow adding to it. It breaks if the manifest was hand-edited by a human or modified by a different agent session.
The failure surfaces as a TypeScript error. The root cause is a mismatch between assumption and reality. Fixing the TypeScript error without checking the assumption produces a fragile patch that breaks again when the assumption changes.
The heuristic: any time a verification command fails on a file that was not touched in the current slice, look for an upstream change to that file's expected shape. The problem is almost never in the current slice. It is in the contract between the current slice and something that changed earlier.
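The heuristic reduces to a set difference between the files a verification command complains about and the files the current slice touched. A minimal sketch, assuming both lists are already available (for example from the error output and from the slice's file targets); the `upstream_suspects` name is hypothetical.

```python
def upstream_suspects(failing_files: list[str], touched_in_slice: list[str]) -> list[str]:
    """Return failing files the current slice never touched, preserving their order."""
    touched = set(touched_in_slice)
    return [path for path in failing_files if path not in touched]


# Hypothetical slice: it only wrote the blog post, yet the manifest fails to typecheck.
failing = ["src/lib/blog-manifest.ts", "src/content/blog/agentic-debugging.mdx"]
touched = ["src/content/blog/agentic-debugging.mdx"]

print(upstream_suspects(failing, touched))
```

Anything this returns is a file whose expected shape changed somewhere outside the current slice, which is where the investigation should move next.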
What Agentic Debugging Is Not
It is not running the same prompt again and hoping for different output. Sometimes that works, but it is not debugging — it is retry. Retry is appropriate when the failure is clearly stochastic (a flaky API, a transient timeout). It is inappropriate when the failure is deterministic, because the same inputs will produce the same failure and you will learn nothing.
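One way to keep retry honest is to make the transient cases explicit in code, so that anything else fails immediately and gets debugged instead. The sketch below is illustrative only. It treats Python's built-in `TimeoutError` and `ConnectionError` as the transient set, which is an assumption that a real system would replace with its own error types, and `run_with_retry` is a hypothetical helper.

```python
import time

# Assumed transient failures; every other exception is treated as deterministic.
TRANSIENT = (TimeoutError, ConnectionError)


def run_with_retry(action, attempts: int = 3, delay_seconds: float = 0.0):
    """Call action(), retrying only on transient errors and re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TRANSIENT:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


calls = {"flaky": 0}


def flaky_fetch() -> str:
    """Simulated flaky call: times out twice, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TimeoutError("simulated transient timeout")
    return "ok"


print(run_with_retry(flaky_fetch))
```

A schema mismatch raised as any other exception type passes straight through on the first attempt, which is the intended behavior: it will not go away on its own.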
It is not adding more instructions to the system prompt. A model that misread a schema will not read it correctly because the prompt now says "please read schemas carefully." Instruction escalation is a symptom of not diagnosing root cause. Once you identify what actually went wrong, the fix is usually a targeted change to structure or context, not a more emphatic instruction.
It is not instrumenting everything. Comprehensive logging feels like safety but creates its own problems: context saturation, noise in the trace, slower execution. The right amount of instrumentation is enough to answer "at which step did execution deviate?" without recording every bit of state at every step.
The Summary
Debugging agentic systems requires a different mental model than debugging deterministic programs. The failure is usually not where the error appears. The trace is a sequence of decisions, not a call stack. Context is a resource to manage carefully, not something to maximize. Partial state is often more dangerous than no state.
The practical patterns: structured outputs at slice boundaries, narrow context scopes for diagnostic runs, explicit recording of model decisions alongside tool results, and always asking "what assumption is wrong" before "what code is wrong."
The underlying discipline is root cause analysis: follow the chain of causation upstream until you find the decision that should have been different. Then fix that decision, not the symptom it eventually produced.