
Domain 4: Prompt Engineering & Structured Output (20%)

lex@lexgaines.com · 9 min read
JSON schema design, tool_use and tool_choice, few-shot examples, validation-retry loops, and multi-pass review architectures for the CCA exam.

Weight: 20% of scored content

This domain covers designing precise prompts, using few-shot examples for consistency, enforcing structured output with JSON schemas and tool_use, implementing validation-retry loops, batch processing strategies, and multi-pass review architectures.


Task Statement 4.1: Design prompts with explicit criteria to improve precision and reduce false positives

Explicit Criteria > Vague Instructions

Vague instructions like "be conservative" or "only report high-confidence findings" fail because they give the model no concrete criteria to apply. Explicit criteria define exactly what to flag and what to skip.

# Bad: vague
"Flag comments only when claimed behavior contradicts actual code behavior"

# Good: explicit
"Flag a comment as stale when:
1. The comment describes a return type that doesn't match the function signature
2. The comment references a parameter that no longer exists
3. The comment describes a control flow path that has been removed

Do NOT flag:
- Comments describing intent or rationale (even if implementation changed)
- TODOs or FIXMEs (these are intentionally aspirational)
- Comments on adjacent lines that may refer to nearby code"

The False Positive Trust Problem

High false positive rates in one category undermine developer confidence in all categories — even accurate ones. If your security findings are reliable but your style findings are noisy, developers start ignoring everything.

Strategies for Reducing False Positives

  1. Write specific review criteria that define which issues to report vs. skip — bugs and security yes, minor style and local patterns no
  2. Temporarily disable high false-positive categories while improving prompts for those categories, to restore trust
  3. Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification
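The third strategy can be sketched in code. This is a minimal, hypothetical rubric (the level names, criteria, and code examples are illustrative, not from any real checklist) rendered into an explicit-criteria prompt section:

```python
# Hypothetical severity rubric: pairing each level with concrete criteria
# and a short code example is what makes classification consistent.
SEVERITY_RUBRIC = {
    "high": {
        "criteria": "Exploitable security flaw or data-loss bug",
        "example": 'query = "SELECT * FROM users WHERE id = " + user_id',
    },
    "medium": {
        "criteria": "Correctness bug reachable on an edge case",
        "example": "items[len(items)]  # off-by-one index",
    },
    "low": {
        "criteria": "Readability issue with no behavioral impact",
        "example": "x = x + 1  # could be x += 1",
    },
}

def build_severity_section(rubric: dict) -> str:
    """Render the rubric as an explicit-criteria block for a review prompt."""
    lines = ["Assign severity using these criteria:"]
    for level, spec in rubric.items():
        lines.append(f"- {level}: {spec['criteria']}")
        lines.append(f"  Example: {spec['example']}")
    return "\n".join(lines)
```

Embedding the rendered section in the system prompt gives the model the same concrete anchors a human reviewer would use.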



Task Statement 4.2: Apply few-shot prompting to improve output consistency and quality

Why Few-Shot Examples Work

Few-shot examples are the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results. They demonstrate:
- Ambiguous-case handling: show the model what to do when the right answer isn't obvious
- Output format: demonstrate the exact structure you want
- Judgment patterns: help the model generalize to novel cases rather than just matching pre-specified rules

Designing Effective Few-Shot Examples

Include 2–4 targeted examples for ambiguous scenarios. Each example should show reasoning for why one action was chosen over plausible alternatives:

<example>
<customer_message>I bought this last week and the zipper broke already. I want my money back.</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer mentions a recent purchase with a defect. Even though they mention "money back" (which might suggest jumping to process_refund), we must first locate the order to verify purchase date, amount, and return eligibility. Always look up the order before processing any refund.</reasoning>
</example>

<example>
<customer_message>Can you check if my order has shipped? Order #ORD-45678</customer_message>
<tool_selection>lookup_order</tool_selection>
<reasoning>Customer provides an order ID and asks about shipping status. This is a direct order lookup; do NOT call get_customer first since we already have the order identifier.</reasoning>
</example>

Few-Shot Examples for Extraction Tasks

For data extraction, examples should demonstrate correct handling of varied document structures:
- Inline citations vs. bibliographies
- Narrative descriptions vs. structured tables
- Methodology sections vs. embedded details

Include examples showing correct extraction from documents where some fields are absent to train the model to return null rather than fabricating values.
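A minimal sketch of such a pair, with hypothetical documents and fields: the second example's source text contains no revenue or fiscal year, and the expected extraction returns null for both rather than inventing values.

```python
import json

# Hypothetical few-shot pair for an extraction prompt. The second
# example demonstrates absent fields -> null, never fabricated values.
FEW_SHOT_EXAMPLES = [
    {
        "document": "Acme Corp reported revenue of $12M for fiscal year 2023.",
        "extraction": {"company_name": "Acme Corp",
                       "revenue": 12000000, "fiscal_year": "2023"},
    },
    {
        "document": "Acme Corp is a privately held technology company.",
        "extraction": {"company_name": "Acme Corp",
                       "revenue": None, "fiscal_year": None},  # absent -> null
    },
]

def render_examples(examples: list) -> str:
    """Format the examples in the XML-tagged style used above."""
    parts = []
    for ex in examples:
        parts.append(
            "<example>\n"
            f"<document>{ex['document']}</document>\n"
            f"<extraction>{json.dumps(ex['extraction'])}</extraction>\n"
            "</example>"
        )
    return "\n\n".join(parts)
```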

Few-Shot Examples for Reducing Hallucination

Examples that show acceptable code patterns alongside genuine issues help the model distinguish false positives from real problems, enabling generalization rather than rigid pattern matching.



Task Statement 4.3: Enforce structured output using tool use and JSON schemas

tool_use with JSON Schemas: The Gold Standard

Using tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output. It eliminates JSON syntax errors entirely because the model produces tool call arguments that are validated against the schema.

Syntax Errors vs. Semantic Errors

| Error Type | Eliminated by tool_use? | Example |
|---|---|---|
| Syntax | Yes; schema validation prevents malformed JSON | Missing comma, unclosed bracket |
| Semantic | No; the model can still make logical errors | Line items don't sum to total, value in wrong field |

tool_choice for Structured Output

| Setting | When to Use |
|---|---|
| tool_choice: "auto" | Model may return text instead of calling a tool |
| tool_choice: "any" | Guarantees the model calls some tool; use when multiple extraction schemas exist and document type is unknown |
| tool_choice: {"type": "tool", "name": "extract_metadata"} | Forces a specific extraction tool; use to ensure metadata extraction runs before enrichment steps |
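A sketch of forcing a specific extraction tool. The kwargs dict matches what you would pass to client.messages.create(...) in the Anthropic Python SDK; the model id is illustrative, and the helper assumes content blocks represented as plain dicts (the SDK actually returns typed objects with the same fields).

```python
# Hypothetical extraction tool; the schema mirrors the example below.
EXTRACT_TOOL = {
    "name": "extract_metadata",
    "description": "Extract company metadata from a document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "revenue": {"type": ["number", "null"]},
        },
        "required": ["company_name"],
    },
}

def build_extraction_request(document: str) -> dict:
    """Kwargs for a messages.create call that must call extract_metadata."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 1024,
        "tools": [EXTRACT_TOOL],
        # "any" would allow any tool; naming one guarantees this tool runs.
        "tool_choice": {"type": "tool", "name": "extract_metadata"},
        "messages": [{"role": "user", "content": document}],
    }

def tool_use_input(content_blocks: list, tool_name: str) -> dict:
    """Return the schema-validated arguments of the matching tool_use block."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block for {tool_name}")
```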

Schema Design Best Practices

Required vs. optional fields: Make fields optional (nullable) when source documents may not contain the information. If a field is required, the model may fabricate a value to satisfy the schema rather than returning null.

{
  "type": "object",
  "properties": {
    "company_name": {"type": "string"},
    "revenue": {"type": ["number", "null"]},
    "fiscal_year": {"type": ["string", "null"]},
    "category": {
      "type": "string",
      "enum": ["technology", "healthcare", "finance", "manufacturing", "other"]
    },
    "category_detail": {
      "type": ["string", "null"],
      "description": "Specific sub-category when 'other' is selected"
    }
  },
  "required": ["company_name", "category"]
}

Key patterns:
- Use the "other" + detail string pattern for extensible categorization
- Use enum values like "unclear" for ambiguous cases
- Include format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting
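A minimal hand-rolled check of the pattern above (in production a real JSON Schema validator such as the jsonschema package would do this): required fields must be present and non-null, enum fields must use a listed value, and nullable fields are simply not listed as required.

```python
# Simplified view of the schema above: only the parts this check needs.
SCHEMA = {
    "required": ["company_name", "category"],
    "enums": {"category": ["technology", "healthcare", "finance",
                           "manufacturing", "other"]},
}

def validate_extraction(data: dict, schema: dict) -> list:
    """Return human-readable validation errors; an empty list means valid."""
    errors = []
    for field in schema["required"]:
        if data.get(field) is None:
            errors.append(f"missing required field: {field}")
    for field, allowed in schema["enums"].items():
        value = data.get(field)
        if value is not None and value not in allowed:
            errors.append(f"{field} must be one of {allowed}")
    return errors
```

Returning errors as a list of strings feeds directly into the retry-with-error-feedback loop described in Task 4.4.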



Task Statement 4.4: Implement validation, retry, and feedback loops for extraction quality

Retry-with-Error-Feedback

When extraction validation fails, append the specific validation errors to the prompt on retry to guide the model toward correction:

import json

# First attempt fails validation
extraction = call_extraction_tool(document)
errors = validate(extraction)

if errors:
    # Retry with error feedback
    retry_prompt = f"""
    Original document: {document}

    Your previous extraction had these errors:
    {json.dumps(errors, indent=2)}

    Please re-extract, correcting these specific issues.
    """
    extraction = call_extraction_tool(retry_prompt)
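The snippet above can be wrapped in a bounded loop. Here call_extraction_tool and validate are stand-ins for your own extraction call and validator:

```python
import json

def extract_with_retries(document: str, call_extraction_tool, validate,
                         max_retries: int = 2):
    """Retry extraction, feeding validation errors back into the prompt."""
    prompt = document
    for _ in range(max_retries + 1):
        extraction = call_extraction_tool(prompt)
        errors = validate(extraction)
        if not errors:
            return extraction
        # Append the specific errors so the next attempt can correct them.
        prompt = (
            f"Original document: {document}\n\n"
            "Your previous extraction had these errors:\n"
            f"{json.dumps(errors, indent=2)}\n\n"
            "Please re-extract, correcting these specific issues."
        )
    raise ValueError(f"extraction still invalid after {max_retries} retries")
```

Capping retries matters: as the next section notes, some errors can never be fixed by retrying.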

When Retries Won't Help

Retries are ineffective when the required information is simply absent from the source document. A format mismatch or structural error can be retried; missing data cannot. Track which errors are resolvable via retry (format issues) vs. which are not (information absent from source).

The detected_pattern Field

Add a detected_pattern field to structured findings to enable systematic analysis of false positive patterns. When developers dismiss findings, you can analyze which patterns are being rejected and improve the prompt accordingly.

{
  "finding": "Potential SQL injection",
  "severity": "high",
  "location": "src/api/users.js:42",
  "detected_pattern": "string_concatenation_in_query",
  "confidence": 0.85
}
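The analysis this field enables can be a one-liner. A sketch, assuming findings are dicts with a dismissed flag recorded from developer feedback:

```python
from collections import Counter

def dismissed_pattern_counts(findings: list) -> Counter:
    """Tally detected_pattern values among findings developers dismissed."""
    return Counter(f["detected_pattern"] for f in findings if f.get("dismissed"))
```

Patterns with high dismissal counts point at the exact prompt criteria that need tightening.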

Self-Correction Validation Flows

For numerical accuracy, extract both the stated value and a calculated value, then flag discrepancies:

{
  "stated_total": 1250.00,
  "calculated_total": 1230.00,
  "conflict_detected": true,
  "line_items": [...]
}

This lets downstream systems decide how to handle the inconsistency rather than silently accepting one value.
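The downstream check is straightforward. A sketch, assuming line items are dicts with an amount field and a small tolerance for rounding:

```python
def check_totals(stated_total: float, line_items: list,
                 tolerance: float = 0.01) -> dict:
    """Recompute the total from line items and flag any discrepancy."""
    calculated = round(sum(item["amount"] for item in line_items), 2)
    return {
        "stated_total": stated_total,
        "calculated_total": calculated,
        "conflict_detected": abs(stated_total - calculated) > tolerance,
        "line_items": line_items,
    }
```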



Task Statement 4.5: Design efficient batch processing strategies

Message Batches API

| Feature | Detail |
|---|---|
| Cost savings | 50% reduction |
| Processing window | Up to 24 hours |
| Latency guarantee | None; no SLA |
| Multi-turn tool calling | Not supported within a single request |
| Request correlation | Use custom_id fields |

When to Use Batch vs. Synchronous

| Workflow | API | Why |
|---|---|---|
| Pre-merge checks (blocking) | Synchronous | Developers are waiting; need guaranteed latency |
| Overnight technical debt reports | Batch | Latency-tolerant; 50% cost savings |
| Weekly audits | Batch | Non-blocking; can run overnight |
| Nightly test generation | Batch | No one is waiting for results |
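Building the batch request list can be sketched as below: each entry pairs a custom_id with ordinary messages params so results can be correlated afterward. The list is what you would pass to client.messages.batches.create(requests=...) in the Anthropic SDK (that call is omitted here to keep the sketch runnable; the model id is illustrative).

```python
def build_batch_requests(documents: dict, system_prompt: str) -> list:
    """documents maps a stable id (e.g. a file path) to document text."""
    return [
        {
            "custom_id": doc_id,  # used later to match results to inputs
            "params": {
                "model": "claude-sonnet-4-5",  # illustrative model id
                "max_tokens": 2048,
                "system": system_prompt,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]
```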

Calculating Batch Submission Frequency

If your SLA requires results within 30 hours and batch processing can take up to 24 hours, the longest you can accumulate work before submitting is the difference: 30 - 24 = 6 hours. Submitting in 4-hour windows leaves margin, since the worst case is a document that arrives at the start of a window, waits 4 hours for submission, then takes the full 24 hours to process: 4 + 24 = 28 < 30.
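The same arithmetic as a sketch:

```python
def max_submission_window_hours(sla_hours: float,
                                processing_hours: float) -> float:
    """Longest a document may wait for the next batch and still meet the SLA."""
    return sla_hours - processing_hours

def window_meets_sla(window_hours: float, processing_hours: float,
                     sla_hours: float) -> bool:
    """Worst case: submitted at the end of a window, then full processing."""
    return window_hours + processing_hours <= sla_hours
```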

Handling Batch Failures

  • Track failures using custom_id to identify which documents failed
  • Resubmit only failed documents with appropriate modifications:
    • Chunk documents that exceeded context limits
    • Fix input formatting issues
  • Don't resubmit the entire batch
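Correlating results and collecting only the failures can be sketched as below; the result dicts are simplified stand-ins for the SDK's batch result objects, which carry the same custom_id and result-type information.

```python
def partition_results(results: list):
    """Split batch results into succeeded outputs and failed custom_ids."""
    succeeded, failed_ids = {}, []
    for result in results:
        if result["result"]["type"] == "succeeded":
            succeeded[result["custom_id"]] = result["result"]["message"]
        else:  # errored, canceled, or expired
            failed_ids.append(result["custom_id"])
    return succeeded, failed_ids
```

Only the documents behind failed_ids get modified and resubmitted; everything in succeeded is done.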

Prompt Refinement Before Batch Processing

Before processing thousands of documents, refine your prompt on a sample set first. This maximizes first-pass success rates and reduces iterative resubmission costs.



Task Statement 4.6: Design multi-instance and multi-pass review architectures

Self-Review Limitations

A model that generated code retains its reasoning context, making it less likely to question its own decisions. Self-review in the same session, even with explicit "review your work" instructions, is less effective than an independent review.

Independent Review Instances

Use a second, independent Claude instance (without the generator's reasoning context) for review. It approaches the code fresh and is more likely to catch subtle issues.

Multi-Pass Review for Large PRs

When a PR modifies many files, a single-pass review suffers from attention dilution — some files get detailed feedback while others get superficial comments, and the model may flag a pattern as problematic in one file while approving identical code elsewhere.

Split into:
1. Per-file local analysis passes: each file gets focused, consistent attention
2. Cross-file integration pass: examines data flow, API contracts, and cross-file consistency

Verification Passes with Confidence Scores

Have the model self-report confidence alongside each finding. This enables calibrated review routing — high-confidence findings go straight through, while low-confidence findings get human review.
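The routing step is a simple threshold split. A sketch, assuming each finding carries the self-reported confidence field shown earlier:

```python
def route_findings(findings: list, threshold: float = 0.8):
    """Auto-approve high-confidence findings; queue the rest for human review."""
    auto, human_review = [], []
    for finding in findings:
        bucket = auto if finding["confidence"] >= threshold else human_review
        bucket.append(finding)
    return auto, human_review
```

The threshold itself should be calibrated against dismissal data (see the detected_pattern discussion in Task 4.4) rather than guessed.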


Tags: CCA · prompt engineering · structured output · JSON schema · tool_choice