Weight: 20% of scored content
This domain covers designing precise prompts, using few-shot examples for consistency, enforcing structured output with JSON schemas and tool_use, implementing validation-retry loops, batch processing strategies, and multi-pass review architectures.
Task Statement 4.1: Design prompts with explicit criteria to improve precision and reduce false positives
Explicit Criteria > Vague Instructions
Vague instructions like "be conservative" or "only report high-confidence findings" fail because they give the model no concrete criteria to apply. Explicit criteria define exactly what to flag and what to skip.
# Bad: vague
"Flag comments only when claimed behavior contradicts actual code behavior"
# Good: explicit
"Flag a comment as stale when:
1. The comment describes a return type that doesn't match the function signature
2. The comment references a parameter that no longer exists
3. The comment describes a control flow path that has been removed
Do NOT flag:
- Comments describing intent or rationale (even if implementation changed)
- TODOs or FIXMEs (these are intentionally aspirational)
- Comments on adjacent lines that may refer to nearby code"
The False Positive Trust Problem
High false positive rates in one category undermine developer confidence in all categories — even accurate ones. If your security findings are reliable but your style findings are noisy, developers start ignoring everything.
Strategies for Reducing False Positives
- Write specific review criteria that define which issues to report vs. skip — bugs and security yes, minor style and local patterns no
- Temporarily disable high false-positive categories while improving prompts for those categories, to restore trust
- Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification
Connection to Other Domains
- CI/CD code review integration → see Claude Code Configuration & Workflows
- Multi-pass review → see Task Statement 4.6
Task Statement 4.2: Apply few-shot prompting to improve output consistency and quality
Why Few-Shot Examples Work
Few-shot examples are the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results. They demonstrate:
- Ambiguous-case handling — show the model what to do when the right answer isn't obvious
- Output format — demonstrate the exact structure you want
- Judgment patterns — help the model generalize to novel cases rather than just matching pre-specified rules
Designing Effective Few-Shot Examples
Include 2–4 targeted examples for ambiguous scenarios. Each example should show reasoning for why one action was chosen over plausible alternatives:
<example>
<customer_message>I bought this last week and the zipper broke already. I want my money back.</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer mentions a recent purchase with a defect. Even though they mention "money back" (which might suggest jumping to process_refund), we must first locate the order to verify purchase date, amount, and return eligibility. Always look up the order before processing any refund.</reasoning>
</example>
<example>
<customer_message>Can you check if my order has shipped? Order #ORD-45678</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer provides an order ID and asks about shipping status. This is a direct order lookup — do NOT call get_customer first since we already have the order identifier.</reasoning>
</example>
Few-Shot Examples for Extraction Tasks
For data extraction, examples should demonstrate correct handling of varied document structures:
- Inline citations vs. bibliographies
- Narrative descriptions vs. structured tables
- Methodology sections vs. embedded details
Include examples showing correct extraction from documents where some fields are absent to train the model to return null rather than fabricating values.
Few-Shot Examples for Reducing Hallucination
Examples that show acceptable code patterns alongside genuine issues help the model distinguish false positives from real problems, enabling generalization rather than rigid pattern matching.
Connection to Other Domains
- Tool selection improvement → see Tool Design & MCP Integration
- Escalation calibration → see Context Management & Reliability
Task Statement 4.3: Enforce structured output using tool use and JSON schemas
tool_use with JSON Schemas: The Gold Standard
Using tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output. It eliminates JSON syntax errors entirely because the model produces tool call arguments that are validated against the schema.
Syntax Errors vs. Semantic Errors
| Error Type | Eliminated by tool_use? | Example |
|---|---|---|
| Syntax | Yes — schema validation prevents malformed JSON | Missing comma, unclosed bracket |
| Semantic | No — model can still make logical errors | Line items don't sum to total, value in wrong field |
tool_choice for Structured Output
| Setting | When to Use |
|---|---|
| tool_choice: "auto" | Model may return text instead of calling a tool |
| tool_choice: "any" | Guarantees the model calls some tool — use when multiple extraction schemas exist and document type is unknown |
| tool_choice: {"type": "tool", "name": "extract_metadata"} | Forces a specific extraction tool — use to ensure metadata extraction runs before enrichment steps |
Schema Design Best Practices
Required vs. optional fields: Make fields optional (nullable) when source documents may not contain the information. If a field is required, the model may fabricate a value to satisfy the schema rather than returning null.
{
"type": "object",
"properties": {
"company_name": {"type": "string"},
"revenue": {"type": ["number", "null"]},
"fiscal_year": {"type": ["string", "null"]},
"category": {
"type": "string",
"enum": ["technology", "healthcare", "finance", "manufacturing", "other"]
},
"category_detail": {
"type": ["string", "null"],
"description": "Specific sub-category when 'other' is selected"
}
},
"required": ["company_name", "category"]
}
Key patterns:
- Use "other" + detail string pattern for extensible categorization
- Use enum values like "unclear" for ambiguous cases
- Include format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting
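The schema above can be wired into a forced tool_use extraction request. A minimal sketch, assuming the Anthropic Messages API request shape; the tool name record_company and the model string are illustrative:

```python
# Sketch: forcing a specific extraction tool so every response is
# schema-shaped. Tool name and model string are illustrative.
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "revenue": {"type": ["number", "null"]},
        "category": {
            "type": "string",
            "enum": ["technology", "healthcare", "finance", "manufacturing", "other"],
        },
    },
    "required": ["company_name", "category"],
}

def build_extraction_request(document: str) -> dict:
    """Build a request dict that forces the extraction tool."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "tools": [{
            "name": "record_company",
            "description": "Record structured company data from a document.",
            "input_schema": COMPANY_SCHEMA,
        }],
        # Forcing this tool guarantees the output conforms to the schema
        "tool_choice": {"type": "tool", "name": "record_company"},
        "messages": [{"role": "user", "content": document}],
    }
```

Because revenue is nullable and not required, the model can return null for documents that omit it instead of fabricating a number.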
Connection to Other Domains
- Tool choice configuration → see Tool Design & MCP Integration
- Validation loops for semantic errors → see Task Statement 4.4
Task Statement 4.4: Implement validation, retry, and feedback loops for extraction quality
Retry-with-Error-Feedback
When extraction validation fails, append the specific validation errors to the prompt on retry to guide the model toward correction:
import json

# First attempt may fail validation
extraction = call_extraction_tool(document)
errors = validate(extraction)
if errors:
    # Retry with the specific validation errors appended as feedback
    retry_prompt = f"""
Original document: {document}

Your previous extraction had these errors:
{json.dumps(errors, indent=2)}

Please re-extract, correcting these specific issues.
"""
    extraction = call_extraction_tool(retry_prompt)
When Retries Won't Help
Retries are ineffective when the required information is simply absent from the source document. A format mismatch or structural error can be retried; missing data cannot. Track which errors are resolvable via retry (format issues) vs. which are not (information absent from source).
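This distinction can be encoded directly in the retry loop. A minimal sketch; the error-code strings are hypothetical labels your validator might emit:

```python
# Sketch: retry only when every error is a format issue the model can fix.
# Error codes are hypothetical validator labels, not a real API.
RETRYABLE = {"malformed_date", "wrong_field_type", "unnormalized_currency"}
NOT_RETRYABLE = {"field_absent_from_source"}

def should_retry(errors: list[dict]) -> bool:
    """True only if all errors are resolvable format issues."""
    return all(e["code"] in RETRYABLE for e in errors)
```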
The detected_pattern Field
Add a detected_pattern field to structured findings to enable systematic analysis of false positive patterns. When developers dismiss findings, you can analyze which patterns are being rejected and improve the prompt accordingly.
{
"finding": "Potential SQL injection",
"severity": "high",
"location": "src/api/users.js:42",
"detected_pattern": "string_concatenation_in_query",
"confidence": 0.85
}
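Once findings carry a detected_pattern field, dismissal analysis is a simple aggregation. A minimal sketch, assuming each finding record tracks whether a developer dismissed it:

```python
from collections import Counter

# Sketch: count which detected_pattern values developers dismiss most,
# to target prompt improvements at the noisiest patterns.
def dismissed_pattern_counts(findings: list[dict]) -> Counter:
    """Tally detected_pattern values among dismissed findings."""
    return Counter(
        f["detected_pattern"] for f in findings if f.get("dismissed")
    )
```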
Self-Correction Validation Flows
For numerical accuracy, extract both the stated value and a calculated value, then flag discrepancies:
{
"stated_total": 1250.00,
"calculated_total": 1230.00,
"conflict_detected": true,
"line_items": [...]
}
This lets downstream systems decide how to handle the inconsistency rather than silently accepting one value.
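The conflict flag can be computed deterministically after extraction rather than trusted from the model. A minimal sketch, assuming line items carry an amount field:

```python
# Sketch: recompute the total from line items and annotate the record,
# rather than silently accepting either value.
def check_totals(invoice: dict, tolerance: float = 0.01) -> dict:
    """Compare stated_total against the sum of line items."""
    calculated = round(sum(item["amount"] for item in invoice["line_items"]), 2)
    return {
        **invoice,
        "calculated_total": calculated,
        "conflict_detected": abs(calculated - invoice["stated_total"]) > tolerance,
    }
```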
Connection to Other Domains
- Structured error responses → see Tool Design & MCP Integration
- Human review routing → see Context Management & Reliability
Task Statement 4.5: Design efficient batch processing strategies
Message Batches API
| Feature | Detail |
|---|---|
| Cost savings | 50% reduction |
| Processing window | Up to 24 hours |
| Latency guarantee | None — no SLA |
| Multi-turn tool calling | Not supported within a single request |
| Request correlation | Use custom_id fields |
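Request correlation via custom_id looks roughly like the following. A minimal sketch, assuming the batches request shape of custom_id plus params; the model string and document IDs are illustrative:

```python
# Sketch: building batch requests keyed by custom_id so results can be
# matched back to their source documents. Model string is illustrative.
def build_batch_requests(documents: dict[str, str]) -> list[dict]:
    return [
        {
            "custom_id": doc_id,  # correlates each result to its input
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]
```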
When to Use Batch vs. Synchronous
| Workflow | API | Why |
|---|---|---|
| Pre-merge checks (blocking) | Synchronous | Developers are waiting; need guaranteed latency |
| Overnight technical debt reports | Batch | Latency-tolerant; 50% cost savings |
| Weekly audits | Batch | Non-blocking, can run overnight |
| Nightly test generation | Batch | No one waiting for results |
Calculating Batch Submission Frequency
If your SLA requires results within 30 hours and batch processing takes up to 24 hours, submit in 4-hour windows to ensure completion before the deadline (4 + 24 = 28 < 30).
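The worst case is a request arriving just after a submission: it waits one full window, then up to the full processing time. A minimal sketch of that arithmetic:

```python
# Sketch: worst-case latency for windowed batch submission.
def worst_case_latency(window_hours: float, processing_hours: float = 24) -> float:
    """An item arriving just after a submit waits one window plus processing."""
    return window_hours + processing_hours
```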
Handling Batch Failures
- Track failures using custom_id to identify which documents failed
- Resubmit only failed documents with appropriate modifications:
  - Chunk documents that exceeded context limits
  - Fix input formatting issues
- Don't resubmit the entire batch
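Selective resubmission is a filter over custom_id. A minimal sketch; the result dicts here are illustrative stand-ins for real batch results:

```python
# Sketch: keep only the requests whose results failed, matched by custom_id.
# Result status values are illustrative.
def failed_ids(results: list[dict]) -> set[str]:
    return {r["custom_id"] for r in results if r["status"] != "succeeded"}

def requests_to_resubmit(original: list[dict], results: list[dict]) -> list[dict]:
    """Resubmit only failures; fix inputs (chunking, formatting) first."""
    failed = failed_ids(results)
    return [req for req in original if req["custom_id"] in failed]
```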
Prompt Refinement Before Batch Processing
Before processing thousands of documents, refine your prompt on a sample set first. This maximizes first-pass success rates and reduces iterative resubmission costs.
Connection to Other Domains
- CI/CD integration → see Claude Code Configuration & Workflows
- Structured output schemas → see Task Statement 4.3
Task Statement 4.6: Design multi-instance and multi-pass review architectures
Self-Review Limitations
A model that generated code retains its reasoning context, making it less likely to question its own decisions. Self-review in the same session, even with explicit "review your work" instructions, is less effective than an independent review.
Independent Review Instances
Use a second, independent Claude instance (without the generator's reasoning context) for review. It approaches the code fresh and is more likely to catch subtle issues.
Multi-Pass Review for Large PRs
When a PR modifies many files, a single-pass review suffers from attention dilution — some files get detailed feedback while others get superficial comments, and the model may flag a pattern as problematic in one file while approving identical code elsewhere.
Split into:
1. Per-file local analysis passes — each file gets focused, consistent attention
2. Cross-file integration pass — examines data flow, API contracts, and cross-file consistency
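The two-phase structure can be sketched as an orchestration function; review_file and review_integration stand in for separate model calls and are hypothetical:

```python
# Sketch: per-file passes followed by one cross-file integration pass.
# The two review callables represent independent model calls.
def multi_pass_review(files: dict[str, str], review_file, review_integration) -> dict:
    # Phase 1: each file reviewed in isolation for consistent attention
    per_file = {path: review_file(path, src) for path, src in files.items()}
    # Phase 2: one pass over all files for data flow and API contracts
    cross_file = review_integration(files)
    return {"per_file": per_file, "cross_file": cross_file}
```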
Verification Passes with Confidence Scores
Have the model self-report confidence alongside each finding. This enables calibrated review routing — high-confidence findings go straight through, while low-confidence findings get human review.
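Routing on self-reported confidence reduces to a threshold split. A minimal sketch; the 0.8 threshold is illustrative and should be tuned against dismissal data:

```python
# Sketch: split findings into auto-approved vs. human-review buckets.
# Threshold is illustrative, not a recommended value.
def route_findings(findings: list[dict], threshold: float = 0.8):
    auto = [f for f in findings if f["confidence"] >= threshold]
    human_review = [f for f in findings if f["confidence"] < threshold]
    return auto, human_review
```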
Connection to Other Domains
- CI/CD code review → see Claude Code Configuration & Workflows
- Task decomposition (splitting reviews) → see Agentic Architecture & Orchestration
- Human review workflows → see Context Management & Reliability