Weight: 20% of scored content
This domain covers designing precise prompts, using few-shot examples for consistency, enforcing structured output with JSON schemas and tool_use, implementing validation-retry loops, batch processing strategies, and multi-pass review architectures.
Task Statement 4.1: Design prompts with explicit criteria to improve precision and reduce false positives
Explicit Criteria > Vague Instructions
Vague instructions like "be conservative" or "only report high-confidence findings" fail because they give the model no concrete criteria to apply. Explicit criteria define exactly what to flag and what to skip.
# Bad: vague
"Flag comments only when claimed behavior contradicts actual code behavior"
# Good: explicit
"Flag a comment as stale when:
1. The comment describes a return type that doesn't match the function signature
2. The comment references a parameter that no longer exists
3. The comment describes a control flow path that has been removed
Do NOT flag:
- Comments describing intent or rationale (even if implementation changed)
- TODOs or FIXMEs (these are intentionally aspirational)
- Comments on adjacent lines that may refer to nearby code"
The False Positive Trust Problem
High false positive rates in one category undermine developer confidence in all categories — even accurate ones. If your security findings are reliable but your style findings are noisy, developers start ignoring everything.
Strategies for Reducing False Positives
- Write specific review criteria that define which issues to report vs. skip — bugs and security yes, minor style and local patterns no
- Temporarily disable high false-positive categories while improving prompts for those categories, to restore trust
- Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification
Connection to Other Domains
- CI/CD code review integration → see Claude Code Configuration & Workflows
- Multi-pass review → see Task Statement 4.6
Task Statement 4.2: Apply few-shot prompting to improve output consistency and quality
Why Few-Shot Examples Work
Few-shot examples are the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results. They demonstrate:
- Ambiguous-case handling — show the model what to do when the right answer isn't obvious
- Output format — demonstrate the exact structure you want
- Judgment patterns — help the model generalize to novel cases rather than just matching pre-specified rules
Designing Effective Few-Shot Examples
Include 2–4 targeted examples for ambiguous scenarios. Each example should show reasoning for why one action was chosen over plausible alternatives:
<example>
<customer_message>I bought this last week and the zipper broke already. I want my money back.</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer mentions a recent purchase with a defect. Even though they mention "money back" (which might suggest jumping to process_refund), we must first locate the order to verify purchase date, amount, and return eligibility. Always look up the order before processing any refund.</reasoning>
</example>
<example>
<customer_message>Can you check if my order has shipped? Order #ORD-45678</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer provides an order ID and asks about shipping status. This is a direct order lookup — do NOT call get_customer first since we already have the order identifier.</reasoning>
</example>
Few-Shot Examples for Extraction Tasks
For data extraction, examples should demonstrate correct handling of varied document structures:
- Inline citations vs. bibliographies
- Narrative descriptions vs. structured tables
- Methodology sections vs. embedded details
Include examples showing correct extraction from documents where some fields are absent to train the model to return null rather than fabricating values.
Few-Shot Examples for Reducing Hallucination
Examples that show acceptable code patterns alongside genuine issues help the model distinguish false positives from real problems, enabling generalization rather than rigid pattern matching.
Connection to Other Domains
- Tool selection improvement → see Tool Design & MCP Integration
- Escalation calibration → see Context Management & Reliability
Task Statement 4.3: Enforce structured output using tool use and JSON schemas
tool_use with JSON Schemas: The Gold Standard
Using tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output. It eliminates JSON syntax errors entirely because the model produces tool call arguments that are validated against the schema.
Syntax Errors vs. Semantic Errors
| Error Type | Eliminated by tool_use? | Example |
|---|---|---|
| Syntax | Yes — schema validation prevents malformed JSON | Missing comma, unclosed bracket |
| Semantic | No — model can still make logical errors | Line items don't sum to total, value in wrong field |
tool_choice for Structured Output
| Setting | When to Use |
|---|---|
| tool_choice: "auto" | Model may return text instead of calling a tool |
| tool_choice: "any" | Guarantees the model calls some tool — use when multiple extraction schemas exist and document type is unknown |
| tool_choice: {"type": "tool", "name": "extract_metadata"} | Forces a specific extraction tool — use to ensure metadata extraction runs before enrichment steps |
Schema Design Best Practices
Required vs. optional fields: Make fields optional (nullable) when source documents may not contain the information. If a field is required, the model may fabricate a value to satisfy the schema rather than returning null.
{
"type": "object",
"properties": {
"company_name": {"type": "string"},
"revenue": {"type": ["number", "null"]},
"fiscal_year": {"type": ["string", "null"]},
"category": {
"type": "string",
"enum": ["technology", "healthcare", "finance", "manufacturing", "other"]
},
"category_detail": {
"type": ["string", "null"],
"description": "Specific sub-category when 'other' is selected"
}
},
"required": ["company_name", "category"]
}
Key patterns:
- Use "other" + detail string pattern for extensible categorization
- Use enum values like "unclear" for ambiguous cases
- Include format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting
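The schema above can be wired into a forced tool_use extraction request. A minimal sketch, assuming the Anthropic Messages API request shape; the tool name record_company and the model string are illustrative:

```python
# Sketch: forcing a specific extraction tool so every response is
# schema-shaped. Tool name and model string are illustrative.
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "revenue": {"type": ["number", "null"]},
        "category": {
            "type": "string",
            "enum": ["technology", "healthcare", "finance", "manufacturing", "other"],
        },
    },
    "required": ["company_name", "category"],
}

def build_extraction_request(document: str) -> dict:
    """Build a request dict that forces the extraction tool."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "tools": [{
            "name": "record_company",
            "description": "Record structured company data from a document.",
            "input_schema": COMPANY_SCHEMA,
        }],
        # Forcing this tool guarantees the output conforms to the schema
        "tool_choice": {"type": "tool", "name": "record_company"},
        "messages": [{"role": "user", "content": document}],
    }
```

Because revenue is nullable and not required, the model can return null for documents that omit it instead of fabricating a number.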
Connection to Other Domains
- Tool choice configuration → see Tool Design & MCP Integration
- Validation loops for semantic errors → see Task Statement 4.4
Task Statement 4.4: Implement validation, retry, and feedback loops for extraction quality
Retry-with-Error-Feedback
When extraction validation fails, append the specific validation errors to the prompt on retry to guide the model toward correction:
import json

# First attempt may fail validation
extraction = call_extraction_tool(document)
errors = validate(extraction)
if errors:
    # Retry with the specific validation errors appended as feedback
    retry_prompt = f"""
Original document: {document}

Your previous extraction had these errors:
{json.dumps(errors, indent=2)}

Please re-extract, correcting these specific issues.
"""
    extraction = call_extraction_tool(retry_prompt)
When Retries Won't Help
Retries are ineffective when the required information is simply absent from the source document. A format mismatch or structural error can be retried; missing data cannot. Track which errors are resolvable via retry (format issues) vs. which are not (information absent from source).
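This distinction can be encoded directly in the retry loop. A minimal sketch; the error-code strings are hypothetical labels your validator might emit:

```python
# Sketch: retry only when every error is a format issue the model can fix.
# Error codes are hypothetical validator labels, not a real API.
RETRYABLE = {"malformed_date", "wrong_field_type", "unnormalized_currency"}
NOT_RETRYABLE = {"field_absent_from_source"}

def should_retry(errors: list[dict]) -> bool:
    """True only if all errors are resolvable format issues."""
    return all(e["code"] in RETRYABLE for e in errors)
```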
The detected_pattern Field
Add a detected_pattern field to structured findings to enable systematic analysis of false positive patterns. When developers dismiss findings, you can analyze which patterns are being rejected and improve the prompt accordingly.
{
"finding": "Potential SQL injection",
"severity": "high",
"location": "src/api/users.js:42",
"detected_pattern": "string_concatenation_in_query",
"confidence": 0.85
}
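Once findings carry a detected_pattern field, dismissal analysis is a simple aggregation. A minimal sketch, assuming each finding record tracks whether a developer dismissed it:

```python
from collections import Counter

# Sketch: count which detected_pattern values developers dismiss most,
# to target prompt improvements at the noisiest patterns.
def dismissed_pattern_counts(findings: list[dict]) -> Counter:
    """Tally detected_pattern values among dismissed findings."""
    return Counter(
        f["detected_pattern"] for f in findings if f.get("dismissed")
    )
```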
Self-Correction Validation Flows
For numerical accuracy, extract both the stated value and a calculated value, then flag discrepancies:
{
"stated_total": 1250.00,
"calculated_total": 1230.00,
"conflict_detected": true,
"line_items": [...]
}
This lets downstream systems decide how to handle the inconsistency rather than silently accepting one value.
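The conflict flag can be computed deterministically after extraction rather than trusted from the model. A minimal sketch, assuming line items carry an amount field:

```python
# Sketch: recompute the total from line items and annotate the record,
# rather than silently accepting either value.
def check_totals(invoice: dict, tolerance: float = 0.01) -> dict:
    """Compare stated_total against the sum of line items."""
    calculated = round(sum(item["amount"] for item in invoice["line_items"]), 2)
    return {
        **invoice,
        "calculated_total": calculated,
        "conflict_detected": abs(calculated - invoice["stated_total"]) > tolerance,
    }
```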
Connection to Other Domains
- Structured error responses → see Tool Design & MCP Integration
- Human review routing → see Context Management & Reliability
Task Statement 4.5: Design efficient batch processing strategies
Message Batches API
| Feature | Detail |
|---|---|
| Cost savings | 50% reduction |
| Processing window | Up to 24 hours |
| Latency guarantee | None — no SLA |
| Multi-turn tool calling | Not supported within a single request |
| Request correlation | Use custom_id fields |
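Request correlation via custom_id looks roughly like the following. A minimal sketch, assuming the batches request shape of custom_id plus params; the model string and document IDs are illustrative:

```python
# Sketch: building batch requests keyed by custom_id so results can be
# matched back to their source documents. Model string is illustrative.
def build_batch_requests(documents: dict[str, str]) -> list[dict]:
    return [
        {
            "custom_id": doc_id,  # correlates each result to its input
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]
```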
When to Use Batch vs. Synchronous
| Workflow | API | Why |
|---|---|---|
| Pre-merge checks (blocking) | Synchronous | Developers are waiting; need guaranteed latency |
| Overnight technical debt reports | Batch | Latency-tolerant; 50% cost savings |
| Weekly audits | Batch | Non-blocking, can run overnight |
| Nightly test generation | Batch | No one waiting for results |
Calculating Batch Submission Frequency
If your SLA requires results within 30 hours and batch processing takes up to 24 hours, submit in 4-hour windows to ensure completion before the deadline (4 + 24 = 28 < 30).
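The worst case is a request arriving just after a submission: it waits one full window, then up to the full processing time. A minimal sketch of that arithmetic:

```python
# Sketch: worst-case latency for windowed batch submission.
def worst_case_latency(window_hours: float, processing_hours: float = 24) -> float:
    """An item arriving just after a submit waits one window plus processing."""
    return window_hours + processing_hours
```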
Handling Batch Failures
- Track failures using custom_id to identify which documents failed
- Resubmit only failed documents with appropriate modifications:
  - Chunk documents that exceeded context limits
  - Fix input formatting issues
- Don't resubmit the entire batch
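Selective resubmission is a filter over custom_id. A minimal sketch; the result dicts here are illustrative stand-ins for real batch results:

```python
# Sketch: keep only the requests whose results failed, matched by custom_id.
# Result status values are illustrative.
def failed_ids(results: list[dict]) -> set[str]:
    return {r["custom_id"] for r in results if r["status"] != "succeeded"}

def requests_to_resubmit(original: list[dict], results: list[dict]) -> list[dict]:
    """Resubmit only failures; fix inputs (chunking, formatting) first."""
    failed = failed_ids(results)
    return [req for req in original if req["custom_id"] in failed]
```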
Prompt Refinement Before Batch Processing
Before processing thousands of documents, refine your prompt on a sample set first. This maximizes first-pass success rates and reduces iterative resubmission costs.
Connection to Other Domains
- CI/CD integration → see Claude Code Configuration & Workflows
- Structured output schemas → see Task Statement 4.3
Task Statement 4.6: Design multi-instance and multi-pass review architectures
Self-Review Limitations
A model that generated code retains its reasoning context, making it less likely to question its own decisions. Self-review in the same session, even with explicit "review your work" instructions, is less effective than an independent review.
Independent Review Instances
Use a second, independent Claude instance (without the generator's reasoning context) for review. It approaches the code fresh and is more likely to catch subtle issues.
Multi-Pass Review for Large PRs
When a PR modifies many files, a single-pass review suffers from attention dilution — some files get detailed feedback while others get superficial comments, and the model may flag a pattern as problematic in one file while approving identical code elsewhere.
Split into:
1. Per-file local analysis passes — each file gets focused, consistent attention
2. Cross-file integration pass — examines data flow, API contracts, and cross-file consistency
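The two-phase structure can be sketched as an orchestration function; review_file and review_integration stand in for separate model calls and are hypothetical:

```python
# Sketch: per-file passes followed by one cross-file integration pass.
# The two review callables represent independent model calls.
def multi_pass_review(files: dict[str, str], review_file, review_integration) -> dict:
    # Phase 1: each file reviewed in isolation for consistent attention
    per_file = {path: review_file(path, src) for path, src in files.items()}
    # Phase 2: one pass over all files for data flow and API contracts
    cross_file = review_integration(files)
    return {"per_file": per_file, "cross_file": cross_file}
```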
Verification Passes with Confidence Scores
Have the model self-report confidence alongside each finding. This enables calibrated review routing — high-confidence findings go straight through, while low-confidence findings get human review.
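Routing on self-reported confidence reduces to a threshold split. A minimal sketch; the 0.8 threshold is illustrative and should be tuned against dismissal data:

```python
# Sketch: split findings into auto-approved vs. human-review buckets.
# Threshold is illustrative, not a recommended value.
def route_findings(findings: list[dict], threshold: float = 0.8):
    auto = [f for f in findings if f["confidence"] >= threshold]
    human_review = [f for f in findings if f["confidence"] < threshold]
    return auto, human_review
```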
Connection to Other Domains
- CI/CD code review → see Claude Code Configuration & Workflows
- Task decomposition (splitting reviews) → see Agentic Architecture & Orchestration
- Human review workflows → see Context Management & Reliability