Weight: 15% of scored content
This domain covers managing conversation context across long interactions, designing escalation patterns, implementing error propagation in multi-agent systems, handling context in large codebases, designing human review workflows, and preserving information provenance during synthesis.
Task Statement 5.1: Manage conversation context to preserve critical information across long interactions
The Core Problem
As conversations grow, critical details get lost. Three specific risks:
Progressive Summarization Risk
When conversation history is condensed, numerical values, percentages, dates, and customer-stated expectations get reduced to vague summaries. A customer who said "I was promised a full refund within 3 business days" becomes "customer expects refund" — losing the specific commitment and timeline.
"Lost in the Middle" Effect
Models reliably process information at the beginning and end of long inputs but may omit findings from middle sections. In a multi-issue investigation, the first and last issues get proper attention while middle issues are dropped.
Tool Result Accumulation
Tool results accumulate in context and consume tokens disproportionately to their relevance. An order lookup might return 40+ fields when only 5 are relevant to the current issue.
Mitigation Strategies
Persistent "case facts" blocks: Extract critical transactional data (amounts, dates, order numbers, statuses) into a structured block that's included in every prompt, outside of summarized history:
<case_facts>
  <customer_id>C-12345</customer_id>
  <order_id>ORD-67890</order_id>
  <order_amount>$89.99</order_amount>
  <order_date>2025-01-15</order_date>
  <customer_expectation>Full refund within 3 business days</customer_expectation>
  <issues>
    <issue id="1" status="resolved">Wrong color received</issue>
    <issue id="2" status="investigating">Charged twice</issue>
  </issues>
</case_facts>
Trim verbose tool outputs to only relevant fields before they accumulate in context — keep order status and items from an order lookup, drop internal metadata.
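A minimal sketch of this trimming step, assuming a hypothetical order-lookup result; the field names and whitelist are illustrative, not from a real API:

```python
# Trim a verbose tool result to the fields relevant to the current issue
# before it enters the conversation context.

RELEVANT_FIELDS = {"order_id", "status", "items", "amount", "order_date"}

def trim_tool_result(raw_result: dict) -> dict:
    """Keep only the whitelisted fields; drop internal metadata."""
    return {k: v for k, v in raw_result.items() if k in RELEVANT_FIELDS}

raw = {
    "order_id": "ORD-67890",
    "status": "shipped",
    "items": ["blue mug"],
    "amount": "$89.99",
    "order_date": "2025-01-15",
    "warehouse_shard": "us-east-4",   # internal metadata: drop
    "db_row_version": 17,             # internal metadata: drop
}
trimmed = trim_tool_result(raw)
```

A per-tool whitelist like this keeps the trimming deterministic; an alternative is to have the model summarize the result, at the cost of another call.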
Position-aware organization: Place key findings summaries at the beginning of aggregated inputs. Organize detailed results with explicit section headers to mitigate the lost-in-the-middle effect.
Require metadata in structured outputs: Have subagents include dates, source locations, and methodological context so downstream synthesis agents can work with complete information.
Return structured data, not verbose reasoning: When downstream agents have limited context budgets, modify upstream agents to return key facts, citations, and relevance scores instead of verbose reasoning chains.
Connection to Other Domains
- Tool result handling in agentic loops → see Agentic Architecture & Orchestration
- Subagent output design → see Agentic Architecture & Orchestration
Task Statement 5.2: Design effective escalation and ambiguity resolution patterns
When to Escalate
Appropriate escalation triggers (based on explicit criteria, not sentiment):
| Trigger | Action |
|---|---|
| Customer explicitly requests a human agent | Escalate immediately — don't attempt resolution first |
| Policy gap or ambiguity (e.g., competitor price matching when policy only covers own-site adjustments) | Escalate — the agent shouldn't make policy decisions |
| Inability to make meaningful progress after investigation | Escalate with context of what was attempted |
| Issue is within the agent's capability but customer is frustrated | Offer resolution first; escalate only if customer reiterates preference for a human |
What NOT to Use as Escalation Signals
- Sentiment analysis — customer frustration doesn't correlate with case complexity. An easy return request from an angry customer shouldn't escalate, while a calm request hitting a policy gap should.
- Self-reported confidence scores — LLM confidence is poorly calibrated. The model may report high confidence on hard cases it's getting wrong.
Ambiguity Resolution: Multiple Customer Matches
When a tool lookup returns multiple matches (e.g., multiple "John Smith" accounts), the agent should ask for additional identifiers (email, phone, order number) rather than selecting based on heuristics. Guessing creates misidentification risk.
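A sketch of this disambiguation rule, with hypothetical record shapes and prompt wording:

```python
# When a customer lookup returns multiple matches, ask for another
# identifier instead of selecting one heuristically.

def resolve_customer(matches: list) -> dict:
    """Return the agent's next action: proceed, or ask for more identifiers."""
    if len(matches) == 1:
        return {"action": "proceed", "customer": matches[0]}
    if not matches:
        return {"action": "ask",
                "prompt": "I couldn't find an account. Could you share the "
                          "email or an order number on the account?"}
    return {"action": "ask",
            "prompt": "I found several accounts under that name. Could you "
                      "confirm the email, phone number, or a recent order number?"}

result = resolve_customer([
    {"customer_id": "C-12345", "name": "John Smith"},
    {"customer_id": "C-99887", "name": "John Smith"},
])
```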
Few-Shot Escalation Examples
Include explicit examples in the system prompt showing when to escalate vs. resolve:
<example type="escalate">
  Customer: "I was told I could get a price match for a competitor's deal"
  Action: Escalate — competitor price matching is not covered by our standard price adjustment policy. This is a policy gap.
</example>
<example type="resolve">
  Customer: "This is ridiculous! I've been waiting for a week for my refund!"
  Action: Investigate the refund status first. The customer is frustrated but the issue (delayed refund) is within our capability to resolve.
</example>
Connection to Other Domains
- Structured handoff protocols → see Agentic Architecture & Orchestration
- Hooks for enforcement → see Agentic Architecture & Orchestration
Task Statement 5.3: Implement error propagation strategies across multi-agent systems
Structured Error Context
When a subagent fails, it should return detailed error context to enable the coordinator to make intelligent recovery decisions:
{
  "status": "error",
  "failure_type": "timeout",
  "attempted_query": "impact of AI on music industry 2024-2025",
  "partial_results": [
    {"source": "example.com/article1", "summary": "AI in music production..."}
  ],
  "alternative_approaches": [
    "Try narrower query: 'AI music composition tools 2024'",
    "Try different search engine"
  ]
}
Access Failures vs. Valid Empty Results
| Situation | Meaning | Action |
|---|---|---|
| Timeout / service unavailable | Access failure — data may exist | Retry or try alternative |
| Search returned 0 results | Valid empty result — no data matches | Move on, note in synthesis |
These must be clearly distinguished in error reporting. A generic "search unavailable" hides whether the search timed out (retry) or legitimately found nothing (proceed).
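A minimal sketch of this distinction, assuming an outcome dict with hypothetical `error` and `results` keys:

```python
# Classify a search outcome so the coordinator knows whether to retry.

def classify_search_outcome(outcome: dict) -> str:
    """Return 'retry' for access failures, 'proceed' for valid empty results."""
    if outcome.get("error") in {"timeout", "service_unavailable"}:
        return "retry"      # access failure: data may exist, the lookup failed
    if outcome.get("results") == []:
        return "proceed"    # valid empty result: note the gap and move on
    return "proceed"        # results present: continue with synthesis

action = classify_search_outcome({"error": "timeout"})
```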
Anti-Patterns
- Silently suppressing errors (returning empty results as success) — the coordinator can't make informed decisions
- Terminating the entire workflow on a single subagent failure — partial results may still be valuable
- Generic error messages ("Operation failed") — hide the context needed for recovery
Local Recovery Before Propagation
Subagents should attempt local recovery for transient failures (retry with backoff). Only propagate errors that can't be resolved locally, including what was attempted and any partial results.
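One way to sketch this local-recovery step; the error type, backoff parameters, and error shape are assumptions for illustration:

```python
# Retry a transient failure with exponential backoff; only after retries are
# exhausted, propagate structured error context to the coordinator.
import time

def call_with_backoff(fn, max_attempts=3, base_delay=0.1):
    partial = None
    for attempt in range(max_attempts):
        try:
            return {"status": "ok", "result": fn()}
        except TimeoutError as exc:
            partial = getattr(exc, "partial_results", None)
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, ...
    # Local recovery failed: propagate with context, including partial results.
    return {
        "status": "error",
        "failure_type": "timeout",
        "attempts": max_attempts,
        "partial_results": partial or [],
    }
```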
Coverage Annotations in Synthesis
When some subagents succeed and others fail, the synthesis output should annotate which topic areas are well-supported vs. which have gaps due to unavailable sources.
Connection to Other Domains
- Structured error responses → see Tool Design & MCP Integration
- Coordinator-subagent patterns → see Agentic Architecture & Orchestration
Task Statement 5.4: Manage context effectively in large codebase exploration
Context Degradation
In extended sessions, models start giving inconsistent answers — referencing "typical patterns" rather than specific classes discovered earlier. This happens because earlier findings get pushed out of effective attention range.
Scratchpad Files
Have agents maintain scratchpad files that record key findings across context boundaries. The agent can reference these files for subsequent questions, counteracting context degradation:
# investigation-scratchpad.md
## Key Findings
- Auth module entry point: src/auth/index.ts
- 3 auth strategies: JWT, OAuth, API key (see src/auth/strategies/)
- Refund flow: controller → service → repository → database
- Critical dependency: auth middleware used by 47 endpoints
Subagent Delegation for Verbose Exploration
Delegate specific investigation questions to subagents (e.g., "find all test files," "trace refund flow dependencies") while the main agent preserves high-level coordination context. The subagent does the verbose exploration and returns a summary.
Summarize Before Spawning New Phases
Before starting a new exploration phase, summarize findings from the previous phase and inject that summary into the new phase's initial context. This prevents information loss at phase boundaries.
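A sketch of carrying findings across a phase boundary; `summarize` here is a stand-in for a model call, and the message shapes are hypothetical:

```python
# Inject a summary of the previous phase into the new phase's initial context.

def summarize(findings: list, max_items: int = 5) -> str:
    # Placeholder: a real implementation would call the model to condense.
    return "Previous phase findings:\n" + "\n".join(
        f"- {f}" for f in findings[:max_items]
    )

def start_phase(phase_name: str, prior_findings: list) -> list:
    """Build the new phase's initial context with the prior summary injected."""
    return [
        {"role": "system", "content": f"Phase: {phase_name}"},
        {"role": "user", "content": summarize(prior_findings)},
    ]

context = start_phase("analysis", [
    "Auth module entry point: src/auth/index.ts",
    "3 auth strategies: JWT, OAuth, API key",
])
```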
Structured State Persistence for Crash Recovery
Each agent exports its state to a known location, and the coordinator loads a manifest on resume. This enables recovery from crashes without losing progress:
{
  "phase": "analysis",
  "completed_files": ["auth.ts", "users.ts", "orders.ts"],
  "pending_files": ["payments.ts", "notifications.ts"],
  "key_findings": { ... }
}
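A minimal persistence sketch for this manifest; the path and schema mirror the example but are assumptions:

```python
# Save coordinator state atomically and reload it on resume.
import json, os

def save_manifest(path: str, state: dict) -> None:
    # Write to a temp file, then rename, so a crash mid-write
    # cannot leave a corrupt manifest behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_manifest(path: str) -> dict:
    if not os.path.exists(path):
        # Fresh start: no prior progress to recover.
        return {"phase": "start", "completed_files": [], "pending_files": []}
    with open(path) as f:
        return json.load(f)
```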
/compact for Token Management
Use /compact to reduce context usage during extended exploration sessions when the context fills with verbose discovery output. This compresses the conversation while preserving key information.
Connection to Other Domains
- Session management → see Agentic Architecture & Orchestration
- Plan mode for exploration → see Claude Code Configuration & Workflows
Task Statement 5.5: Design human review workflows and confidence calibration
The Hidden Risk of Aggregate Metrics
A system reporting 97% overall accuracy may be masking poor performance on specific document types or fields. Invoice dates might be 99.5% accurate while contract penalty clauses are only 80% accurate. Aggregate metrics hide these critical segments.
Stratified Random Sampling
Measure error rates using stratified random sampling across document types and fields, even for high-confidence extractions. This catches novel error patterns that overall accuracy metrics miss.
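A sketch of the per-stratum measurement, assuming an illustrative labeled-sample record shape:

```python
# Compute error rates per (document type, field) stratum so segment
# weaknesses are not hidden inside the aggregate accuracy number.
from collections import defaultdict

def error_rates_by_stratum(samples: list) -> dict:
    """samples: [{'doc_type': ..., 'field': ..., 'correct': bool}, ...]"""
    totals, errors = defaultdict(int), defaultdict(int)
    for s in samples:
        key = (s["doc_type"], s["field"])
        totals[key] += 1
        if not s["correct"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}

rates = error_rates_by_stratum([
    {"doc_type": "invoice", "field": "date", "correct": True},
    {"doc_type": "invoice", "field": "date", "correct": True},
    {"doc_type": "contract", "field": "penalty_clause", "correct": False},
    {"doc_type": "contract", "field": "penalty_clause", "correct": True},
])
```

In this toy sample the aggregate accuracy is 75%, while the contract penalty-clause stratum sits at a 50% error rate, which is exactly the gap stratification surfaces.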
Field-Level Confidence Scores
Have the model output confidence scores at the field level (not just document level), then calibrate review thresholds using labeled validation sets:
{
  "company_name": {"value": "Acme Corp", "confidence": 0.98},
  "revenue": {"value": 52000000, "confidence": 0.72},
  "fiscal_year": {"value": "2024", "confidence": 0.95}
}
Confidence-Based Routing
Route extractions to human review based on:
1. Low model confidence on specific fields
2. Ambiguous or contradictory source documents
3. Segments whose accuracy hasn't yet been validated
This prioritizes limited reviewer capacity on the documents most likely to contain errors.
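A sketch of this routing, using the field-level confidence shape shown above; the threshold and validated-segments set are assumptions to be calibrated on a labeled validation set:

```python
# Return the reasons an extraction needs human review (empty list = automate).

REVIEW_THRESHOLD = 0.90
VALIDATED_SEGMENTS = {"invoice"}  # segments with measured per-field accuracy

def needs_review(doc_type: str, fields: dict, source_flags: set) -> list:
    reasons = []
    low = [name for name, f in fields.items()
           if f["confidence"] < REVIEW_THRESHOLD]
    if low:
        reasons.append(f"low confidence: {', '.join(sorted(low))}")
    if "contradictory_source" in source_flags:
        reasons.append("ambiguous or contradictory source document")
    if doc_type not in VALIDATED_SEGMENTS:
        reasons.append(f"segment '{doc_type}' not yet validated")
    return reasons

reasons = needs_review(
    "contract",
    {"revenue": {"value": 52000000, "confidence": 0.72},
     "fiscal_year": {"value": "2024", "confidence": 0.95}},
    set(),
)
```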
Validation Before Reducing Human Review
Before automating a category of extractions, verify consistent performance by document type and field — not just overall. Automating based on aggregate accuracy risks silently degrading quality in specific segments.
Connection to Other Domains
- Validation-retry loops → see Prompt Engineering & Structured Output
- Structured output design → see Prompt Engineering & Structured Output
Task Statement 5.6: Preserve information provenance and handle uncertainty in multi-source synthesis
The Provenance Problem
During summarization steps, source attribution is lost when findings are compressed without preserving claim-source mappings. A synthesis that says "AI will create 5 million new jobs" without attribution is unverifiable.
Structured Claim-Source Mappings
Require subagents to output structured mappings that downstream agents must preserve:
{
  "claim": "AI will create 5 million new jobs by 2030",
  "source_url": "https://example.com/report",
  "document_name": "Future of Work Report 2024",
  "publication_date": "2024-03-15",
  "methodology": "Survey of 500 companies"
}
Handling Conflicting Sources
When two credible sources provide different statistics, annotate the conflict with source attribution rather than arbitrarily selecting one value:
{
  "topic": "AI market size 2025",
  "values": [
    {"value": "$190B", "source": "Gartner Report", "date": "2024-06"},
    {"value": "$225B", "source": "IDC Forecast", "date": "2024-08"}
  ],
  "conflict_note": "Discrepancy likely due to different market definitions"
}
Temporal Data
Include publication or data collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions. A 2023 report and a 2025 report may show different numbers because the situation changed, not because one is wrong.
Report Structure
Structure synthesis reports to distinguish:
- Well-established findings: multiple corroborating sources
- Contested findings: conflicting sources, annotated with attribution
- Single-source claims: flagged as requiring additional verification
Content-Appropriate Rendering
Render different content types in their natural format — financial data as tables, news as prose, technical findings as structured lists — rather than converting everything to a uniform format.
Connection to Other Domains
- Subagent context passing → see Agentic Architecture & Orchestration
- Coordinator evaluation and refinement → see Agentic Architecture & Orchestration