AI & Machine Learning

Domain 5: Context Management & Reliability (15%)

lex@lexgaines.com · 10 min read
Managing conversation context across long interactions, error propagation in multi-agent systems, human review workflows, and provenance tracking during synthesis.

Weight: 15% of scored content

This domain covers managing conversation context across long interactions, designing escalation patterns, implementing error propagation in multi-agent systems, handling context in large codebases, designing human review workflows, and preserving information provenance during synthesis.


Task Statement 5.1: Manage conversation context to preserve critical information across long interactions

The Core Problem

As conversations grow, critical details get lost. Three specific risks:

Progressive Summarization Risk

When conversation history is condensed, numerical values, percentages, dates, and customer-stated expectations get reduced to vague summaries. A customer who said "I was promised a full refund within 3 business days" becomes "customer expects refund" — losing the specific commitment and timeline.

"Lost in the Middle" Effect

Models reliably process information at the beginning and end of long inputs but may omit findings from middle sections. In a multi-issue investigation, the first and last issues get proper attention while middle issues are dropped.

Tool Result Accumulation

Tool results accumulate in context and consume tokens disproportionately to their relevance. An order lookup might return 40+ fields when only 5 are relevant to the current issue.

Mitigation Strategies

Persistent "case facts" blocks: Extract critical transactional data (amounts, dates, order numbers, statuses) into a structured block that's included in every prompt, outside of summarized history:

<case_facts>
  <customer_id>C-12345</customer_id>
  <order_id>ORD-67890</order_id>
  <order_amount>$89.99</order_amount>
  <order_date>2025-01-15</order_date>
  <customer_expectation>Full refund within 3 business days</customer_expectation>
  <issues>
    <issue id="1" status="resolved">Wrong color received</issue>
    <issue id="2" status="investigating">Charged twice</issue>
  </issues>
</case_facts>

Trim verbose tool outputs to only relevant fields before they accumulate in context — keep order status and items from an order lookup, drop internal metadata.
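A minimal sketch of this trimming step, assuming the tool returns a flat dict; the field names and the whitelist are illustrative, not a real order API:

```python
# Sketch: trim a verbose tool result to the fields relevant to the current
# issue before it accumulates in context. Field names are illustrative.

RELEVANT_ORDER_FIELDS = {"order_id", "status", "items", "amount", "order_date"}

def trim_tool_result(raw: dict, keep: set[str]) -> dict:
    """Keep only whitelisted fields; drop internal metadata and the rest."""
    return {k: v for k, v in raw.items() if k in keep}

raw_lookup = {
    "order_id": "ORD-67890", "status": "delivered", "amount": "$89.99",
    "items": ["blue sweater"], "order_date": "2025-01-15",
    "warehouse_shard": "us-east-4", "internal_trace_id": "abc123",
    # ...plus dozens of other internal fields in a real lookup
}
trimmed = trim_tool_result(raw_lookup, RELEVANT_ORDER_FIELDS)
```

Doing this at the tool boundary means every later prompt pays only for the five relevant fields, not the full payload.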

Position-aware organization: Place key findings summaries at the beginning of aggregated inputs. Organize detailed results with explicit section headers to mitigate the lost-in-the-middle effect.

Require metadata in structured outputs: Have subagents include dates, source locations, and methodological context so downstream synthesis agents can work with complete information.

Return structured data, not verbose reasoning: When downstream agents have limited context budgets, modify upstream agents to return key facts, citations, and relevance scores instead of verbose reasoning chains.


Task Statement 5.2: Design effective escalation and ambiguity resolution patterns

When to Escalate

Appropriate escalation triggers (based on explicit criteria, not sentiment):

  • Customer explicitly requests a human agent → Escalate immediately — don't attempt resolution first
  • Policy gap or ambiguity (e.g., competitor price matching when policy only covers own-site adjustments) → Escalate — the agent shouldn't make policy decisions
  • Inability to make meaningful progress after investigation → Escalate with context of what was attempted
  • Issue is within the agent's capability but customer is frustrated → Offer resolution first; escalate only if customer reiterates preference for a human
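A minimal sketch of criteria-driven escalation, assuming upstream analysis supplies explicit boolean flags and an attempt counter (the names here are illustrative):

```python
# Sketch: escalation driven by explicit criteria, never by sentiment.
# The flags are assumptions about what upstream analysis provides.

def should_escalate(requested_human: bool, policy_gap: bool,
                    attempts_without_progress: int,
                    max_attempts: int = 3) -> bool:
    if requested_human:   # explicit request: escalate immediately
        return True
    if policy_gap:        # the agent must not make policy decisions
        return True
    # escalate only after repeated failure to make meaningful progress
    return attempts_without_progress >= max_attempts
```

Note what is absent: no sentiment score and no self-reported confidence appear anywhere in the decision.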

What NOT to Use as Escalation Signals

  • Sentiment analysis — customer frustration doesn't correlate with case complexity. An easy return request from an angry customer shouldn't escalate, while a calm request hitting a policy gap should.
  • Self-reported confidence scores — LLM confidence is poorly calibrated. The model may report high confidence on hard cases it's getting wrong.

Ambiguity Resolution: Multiple Customer Matches

When a tool lookup returns multiple matches (e.g., multiple "John Smith" accounts), the agent should ask for additional identifiers (email, phone, order number) rather than selecting based on heuristics. Guessing creates misidentification risk.
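One way to sketch that rule, with hypothetical function and field names; the point is that the multiple-match branch always asks rather than guesses:

```python
# Sketch: when a lookup returns multiple matches, ask for another
# identifier instead of selecting one heuristically.

def resolve_customer(matches: list[dict]) -> dict:
    if len(matches) == 1:
        return {"action": "proceed", "customer": matches[0]}
    if not matches:
        return {"action": "ask",
                "prompt": "I couldn't find that account. Could you confirm "
                          "the email address on the account?"}
    # Multiple matches: never pick one by heuristics.
    return {"action": "ask",
            "prompt": "I found several accounts under that name. Could you "
                      "share the email, phone number, or an order number?"}
```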

Few-Shot Escalation Examples

Include explicit examples in the system prompt showing when to escalate vs. resolve:

<example type="escalate">
Customer: "I was told I could get a price match for a competitor's deal"
Action: Escalate. Competitor price matching is not covered by our standard price adjustment policy. This is a policy gap.
</example>

<example type="resolve">
Customer: "This is ridiculous! I've been waiting for a week for my refund!"
Action: Investigate the refund status first. The customer is frustrated but the issue (delayed refund) is within our capability to resolve.
</example>


Task Statement 5.3: Implement error propagation strategies across multi-agent systems

Structured Error Context

When a subagent fails, it should return detailed error context to enable the coordinator to make intelligent recovery decisions:

{
  "status": "error",
  "failure_type": "timeout",
  "attempted_query": "impact of AI on music industry 2024-2025",
  "partial_results": [
    {"source": "example.com/article1", "summary": "AI in music production..."}
  ],
  "alternative_approaches": [
    "Try narrower query: 'AI music composition tools 2024'",
    "Try different search engine"
  ]
}

Access Failures vs. Valid Empty Results

  • Timeout / service unavailable → access failure (data may exist) → retry or try an alternative
  • Search returned 0 results → valid empty result (no data matches) → move on; note in synthesis

These must be clearly distinguished in error reporting. A generic "search unavailable" hides whether the search timed out (retry) or legitimately found nothing (proceed).
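A minimal sketch of that distinction as a coordinator-side decision; the status values are assumptions, not any real tool's API:

```python
# Sketch: classify a search outcome so the coordinator knows whether
# to retry (access failure) or proceed (valid empty result).

ACCESS_FAILURES = {"timeout", "service_unavailable", "rate_limited"}

def next_action(result: dict) -> str:
    if result.get("status") in ACCESS_FAILURES:
        return "retry_or_alternative"    # data may exist; access failed
    if result.get("status") == "ok" and not result.get("hits"):
        return "note_empty_and_proceed"  # legitimately found nothing
    return "use_results"
```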

Anti-Patterns

  • Silently suppressing errors (returning empty results as success) — the coordinator can't make informed decisions
  • Terminating the entire workflow on a single subagent failure — partial results may still be valuable
  • Generic error messages ("Operation failed") — hide the context needed for recovery

Local Recovery Before Propagation

Subagents should attempt local recovery for transient failures (retry with backoff). Only propagate errors that can't be resolved locally, including what was attempted and any partial results.
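A sketch of local recovery with exponential backoff; `call` stands in for any flaky tool invocation, and the error payload follows the structured-error shape described earlier:

```python
# Sketch: retry transient failures locally with exponential backoff;
# propagate a structured error only after retries are exhausted.

import time

def with_retries(call, attempts: int = 3, base_delay: float = 0.5):
    errors = []
    for i in range(attempts):
        try:
            return {"status": "ok", "result": call()}
        except Exception as e:  # in practice, catch the tool's transient errors
            errors.append(str(e))
            time.sleep(base_delay * 2 ** i)  # 0.5s, 1s, 2s, ...
    # Could not recover locally: propagate with context of what was attempted.
    return {"status": "error", "failure_type": "retries_exhausted",
            "attempted": attempts, "errors": errors}
```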

Coverage Annotations in Synthesis

When some subagents succeed and others fail, the synthesis output should annotate which topic areas are well-supported vs. which have gaps due to unavailable sources.
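A minimal sketch of such annotation, assuming each subagent reports a per-topic status:

```python
# Sketch: annotate synthesis coverage when some subagents failed, so
# readers can see which areas are well-supported and which have gaps.

def coverage_notes(results: dict[str, dict]) -> dict[str, str]:
    return {topic: ("covered" if r["status"] == "ok"
                    else "gap: source unavailable")
            for topic, r in results.items()}
```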


Task Statement 5.4: Manage context effectively in large codebase exploration

Context Degradation

In extended sessions, models start giving inconsistent answers — referencing "typical patterns" rather than specific classes discovered earlier. This happens because earlier findings get pushed out of effective attention range.

Scratchpad Files

Have agents maintain scratchpad files that record key findings across context boundaries. The agent can reference these files for subsequent questions, counteracting context degradation:

# investigation-scratchpad.md
## Key Findings
- Auth module entry point: src/auth/index.ts
- 3 auth strategies: JWT, OAuth, API key (see src/auth/strategies/)
- Refund flow: controller → service → repository → database
- Critical dependency: auth middleware used by 47 endpoints

Subagent Delegation for Verbose Exploration

Delegate specific investigation questions to subagents (e.g., "find all test files," "trace refund flow dependencies") while the main agent preserves high-level coordination context. The subagent does the verbose exploration and returns a summary.

Summarize Before Spawning New Phases

Before starting a new exploration phase, summarize findings from the previous phase and inject that summary into the new phase's initial context. This prevents information loss at phase boundaries.
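The handoff can be sketched as building the next phase's initial messages from a condensed summary rather than the full transcript; in practice the summarization step would be an LLM call, elided here:

```python
# Sketch: inject a summary of prior findings into a new phase's initial
# context instead of carrying the full conversation across the boundary.

def start_next_phase(prev_findings: list[str], next_task: str) -> list[dict]:
    summary = "Findings so far:\n" + "\n".join(f"- {f}" for f in prev_findings)
    return [
        {"role": "user", "content": summary},    # condensed prior phase
        {"role": "user", "content": next_task},  # new phase instruction
    ]
```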

Structured State Persistence for Crash Recovery

Each agent exports its state to a known location, and the coordinator loads a manifest on resume. This enables recovery from crashes without losing progress:

{
  "phase": "analysis",
  "completed_files": ["auth.ts", "users.ts", "orders.ts"],
  "pending_files": ["payments.ts", "notifications.ts"],
  "key_findings": { ... }
}
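A minimal save/resume sketch around a manifest of that shape; the file location is an assumption:

```python
# Sketch: persist coordinator state to a known manifest location so a
# crashed run can resume without losing progress.

import json
from pathlib import Path

def save_manifest(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))

def resume(path: Path, default: dict) -> dict:
    """Load prior state if a manifest exists; otherwise start fresh."""
    return json.loads(path.read_text()) if path.exists() else default
```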

/compact for Token Management

Use /compact to reduce context usage during extended exploration sessions when the context fills with verbose discovery output. This compresses the conversation while preserving key information.


Task Statement 5.5: Design human review workflows and confidence calibration

The Hidden Risk of Aggregate Metrics

A system reporting 97% overall accuracy may be masking poor performance on specific document types or fields. Invoice dates might be 99.5% accurate while contract penalty clauses are only 80% accurate. Aggregate metrics hide these critical segments.

Stratified Random Sampling

Measure error rates using stratified random sampling across document types and fields, even for high-confidence extractions. This catches novel error patterns that overall accuracy metrics miss.
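A sketch of the sampling step, assuming each extraction record carries its document type and field name; strata are (document type, field) pairs and each gets a fixed audit quota:

```python
# Sketch: sample a fixed number of extractions per (doc_type, field)
# stratum, so rare segments get audited rather than being drowned out
# by the overall average.

import random
from collections import defaultdict

def stratified_sample(records: list[dict], per_stratum: int, seed: int = 0):
    strata = defaultdict(list)
    for r in records:
        strata[(r["doc_type"], r["field"])].append(r)
    rng = random.Random(seed)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```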

Field-Level Confidence Scores

Have the model output confidence scores at the field level (not just document level), then calibrate review thresholds using labeled validation sets:

{
  "company_name": {"value": "Acme Corp", "confidence": 0.98},
  "revenue": {"value": 52000000, "confidence": 0.72},
  "fiscal_year": {"value": "2024", "confidence": 0.95}
}

Confidence-Based Routing

Route extractions to human review based on:

  1. Low model confidence on specific fields
  2. Ambiguous or contradictory source documents
  3. Fields in segments whose accuracy hasn't yet been validated

This prioritizes limited reviewer capacity on the documents most likely to contain errors.
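A minimal routing sketch over field-level confidence scores like those above; the thresholds here are illustrative placeholders, not calibrated values (in practice they would come from a labeled validation set):

```python
# Sketch: flag individual fields for human review when confidence falls
# below a per-field threshold. Threshold values are illustrative only.

THRESHOLDS = {"company_name": 0.90, "revenue": 0.85, "fiscal_year": 0.90}

def fields_needing_review(extraction: dict) -> list[str]:
    # Unknown fields default to a conservative (high) threshold.
    return [f for f, v in extraction.items()
            if v["confidence"] < THRESHOLDS.get(f, 0.95)]

doc = {
    "company_name": {"value": "Acme Corp", "confidence": 0.98},
    "revenue": {"value": 52000000, "confidence": 0.72},
    "fiscal_year": {"value": "2024", "confidence": 0.95},
}
```

Only the low-confidence `revenue` field is routed to a reviewer; the high-confidence fields pass through.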

Validation Before Reducing Human Review

Before automating a category of extractions, verify consistent performance by document type and field — not just overall. Automating based on aggregate accuracy risks silently degrading quality in specific segments.


Task Statement 5.6: Preserve information provenance and handle uncertainty in multi-source synthesis

The Provenance Problem

During summarization steps, source attribution is lost when findings are compressed without preserving claim-source mappings. A synthesis that says "AI will create 5 million new jobs" without attribution is unverifiable.

Structured Claim-Source Mappings

Require subagents to output structured mappings that downstream agents must preserve:

{
  "claim": "AI will create 5 million new jobs by 2030",
  "source_url": "https://example.com/report",
  "document_name": "Future of Work Report 2024",
  "publication_date": "2024-03-15",
  "methodology": "Survey of 500 companies"
}

Handling Conflicting Sources

When two credible sources provide different statistics, annotate the conflict with source attribution rather than arbitrarily selecting one value:

{
  "topic": "AI market size 2025",
  "values": [
    {"value": "$190B", "source": "Gartner Report", "date": "2024-06"},
    {"value": "$225B", "source": "IDC Forecast", "date": "2024-08"}
  ],
  "conflict_note": "Discrepancy likely due to different market definitions"
}

Temporal Data

Include publication or data collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions. A 2023 report and a 2025 report may show different numbers because the situation changed, not because one is wrong.

Report Structure

Structure synthesis reports to distinguish:

  • Well-established findings — multiple corroborating sources
  • Contested findings — conflicting sources, annotated with attribution
  • Single-source claims — flagged as requiring additional verification

Content-Appropriate Rendering

Render different content types in their natural format — financial data as tables, news as prose, technical findings as structured lists — rather than converting everything to a uniform format.

Tags: CCA · context management · reliability · error handling · human review