Prompt Engineering Patterns for Reliable AI Data Quality Assurance
My CEO walked into my office with a printout. “Your AI analysis said Q4 revenue was $2.3 million. Finance says it’s $2.1 million. Which is correct?”
I pulled up the original data. Finance was right. My AI assistant had confidently invented a number that looked plausible but was completely wrong.
That moment taught me something critical: AI tools are risky for data work because they tend to prioritize sounding helpful over being accurate. They’ll happily give you wrong numbers that look right.
The Problem With AI Data Analysis
I used to think AI hallucinations were only a problem in creative tasks. Then I started using LLMs for data analysis.
The failures were subtle but dangerous:
- Numeric hallucinations: The AI would invent “average order value: $47.32” when the real average was $52.18
- Context collapse: It would analyze all-time data when I asked for Q3, or mix currencies without conversion
- Inconsistent outputs: The same prompt, same data, different results across runs
- Silent logic errors: It would apply wrong formulas or skip data transformations
Each error looked reasonable. The outputs were well-formatted, the language was confident, and the insights seemed smart. But the numbers were wrong.
Traditional QA doesn’t catch these problems because AI outputs are non-deterministic. You can’t write a unit test for “did the AI make up numbers?”
What Actually Works: Building QA Into Prompts
After that CEO incident, I rebuilt my entire approach to AI data analysis. The key insight: validation belongs in the prompt itself, not just in post-processing.
Here are the patterns that eliminated 90% of data quality issues.
Pattern 1: Explicit Constraint Declaration
My old prompts were vague:
    Analyze the sales data and tell me what you find.

This gave the AI too much freedom to make assumptions. Now I start with hard boundaries:

    ## Task: Analyze sales data for Q3 2024

    ### Constraints
    - Date range: 2024-07-01 to 2024-09-30 ONLY
    - Currency: All values in USD
    - Regions: North America only (exclude all others)
    - Minimum threshold: Include only products with >100 units sold
    - Missing data: Mark as 'NULL' - do not estimate or interpolate

    ### Forbidden Actions
    - Do not invent numbers if data is unclear
    - Do not extrapolate beyond the date range
    - Do not mix currencies without explicit conversion

The constraint section acts as a contract. When the AI violates it, the output is obviously wrong because I can check against the stated rules.
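One way to make that contract checkable is to keep a machine-readable copy of the constraints and test the rows the AI reports against it. The following is a minimal sketch, not the author's tooling: the `Constraints` class, the `find_violations` function, and the row keys `date`, `currency`, and `region` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Constraints:
    """Machine-readable copy of the prompt's constraint section (hypothetical)."""
    start: date
    end: date
    currency: str
    region: str


def find_violations(rows: List[Dict[str, str]], c: Constraints) -> List[str]:
    """Return one message for every constraint a reported row breaks."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        day = date.fromisoformat(row["date"])
        if not (c.start <= day <= c.end):
            problems.append(f"row {i}: date {row['date']} is outside the range")
        if row["currency"] != c.currency:
            problems.append(f"row {i}: currency {row['currency']} is not {c.currency}")
        if row["region"] != c.region:
            problems.append(f"row {i}: region {row['region']} is not {c.region}")
    return problems


# Illustrative rows; the second one breaks the date and currency constraints.
q3 = Constraints(date(2024, 7, 1), date(2024, 9, 30), "USD", "North America")
rows = [
    {"date": "2024-08-15", "currency": "USD", "region": "North America"},
    {"date": "2024-10-02", "currency": "EUR", "region": "North America"},
]
for message in find_violations(rows, q3):
    print(message)
```

Keeping the constraints in one object means the same values can be rendered into the prompt and reused for the check, so the two cannot drift apart.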
Pattern 2: Output Schema Enforcement
Unstructured outputs are impossible to validate. I force the AI to return structured data with validation fields built in:
    ### Required Output Format
    Return JSON with this exact schema:

    {
      "summary": {
        "total_revenue": number,
        "units_sold": number,
        "top_product": string
      },
      "validation": {
        "record_count": number,
        "date_range_verified": boolean,
        "calculations_check": boolean
      },
      "insights": [
        {
          "finding": string,
          "data_point": string,
          "confidence": "high" | "medium" | "low"
        }
      ]
    }

The validation section forces the AI to self-report on quality checks. If calculations_check comes back false, I know there’s a problem.
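Asking for a schema only helps if the response is actually checked against it. As a rough sketch using only the standard library, the function below parses the raw response and reports every field that is missing or has the wrong type. The names `EXPECTED`, `ALLOWED_CONFIDENCE`, and `schema_errors` are assumptions made for this example, not part of any library.

```python
import json
from typing import List

# Expected sections and field types, mirroring the schema in the prompt.
NUMBER = (int, float)
EXPECTED = {
    "summary": {"total_revenue": NUMBER, "units_sold": NUMBER, "top_product": str},
    "validation": {
        "record_count": NUMBER,
        "date_range_verified": bool,
        "calculations_check": bool,
    },
}
ALLOWED_CONFIDENCE = ("high", "medium", "low")


def schema_errors(raw: str) -> List[str]:
    """Return a list of schema problems; an empty list means the shape is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"response is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors: List[str] = []
    for section, fields in EXPECTED.items():
        block = data.get(section)
        if not isinstance(block, dict):
            errors.append(f"missing section: {section}")
            continue
        for name, expected_type in fields.items():
            value = block.get(name)
            ok = isinstance(value, expected_type)
            # bool is a subclass of int in Python, so reject it for numeric fields.
            if expected_type is not bool and isinstance(value, bool):
                ok = False
            if not ok:
                errors.append(f"{section}.{name} has wrong type or is missing")

    insights = data.get("insights")
    if not isinstance(insights, list):
        errors.append("insights must be a list")
    else:
        for i, insight in enumerate(insights):
            if not isinstance(insight, dict) or insight.get("confidence") not in ALLOWED_CONFIDENCE:
                errors.append(f"insights[{i}].confidence is invalid")
    return errors
```

A shape check like this says nothing about whether the numbers are right; it only guarantees that later checks have something predictable to work with.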
Pattern 3: Chain-of-Verification
This pattern catches errors before they reach me. I add explicit verification steps at the end of every prompt:
    ### Verification Steps
    Before finalizing your analysis, verify:
    1. Sum of product revenues equals total_revenue
    2. All insights reference specific data points from the source
    3. No data points fall outside the specified date range
    4. All percentages are correctly calculated

    If any verification fails, note the issue in your response.

The AI now has to check its own work. About 30% of the time, it catches its own errors during this step and corrects them.
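The arithmetic steps in that list (1 and 4) can also be repeated outside the model, which guards against a self-check that passes when it should not. This is a minimal sketch under stated assumptions: the function names, the tolerances, and the sample figures are illustrative and do not come from the original analysis.

```python
import math
from typing import Dict, List


def verify_revenue_total(product_revenues: Dict[str, float],
                         reported_total: float,
                         tolerance: float = 0.01) -> List[str]:
    """Step 1: per-product revenues should sum to the reported total."""
    recomputed = math.fsum(product_revenues.values())
    if abs(recomputed - reported_total) > tolerance:
        return [f"total_revenue mismatch: reported {reported_total:.2f}, "
                f"recomputed {recomputed:.2f}"]
    return []


def verify_percentage(part: float, whole: float, reported_pct: float,
                      tolerance: float = 0.05) -> List[str]:
    """Step 4: a reported percentage should match part / whole * 100."""
    if whole == 0:
        return ["cannot compute a percentage of a zero total"]
    actual = part / whole * 100
    if abs(actual - reported_pct) > tolerance:
        return [f"percentage mismatch: reported {reported_pct:.2f}, "
                f"recomputed {actual:.2f}"]
    return []


# Made-up figures for illustration only.
revenues = {"Product A": 1200.50, "Product B": 799.50}
issues = verify_revenue_total(revenues, reported_total=2300.00)
issues += verify_percentage(part=1200.50, whole=2000.00, reported_pct=75.0)
for issue in issues:
    print(issue)
```

The tolerances exist because reported values are usually rounded; they should be tightened or loosened to match the precision the prompt asks for.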
Pattern 4: Reference Grounding
The most dangerous AI outputs are unsupported claims. I require citations for every insight:
    ### Citation Requirements
    For each insight:
    - Quote the exact data point that supports it
    - Include the row/column identifier or timestamp
    - If no direct evidence exists, mark confidence as "low"

    Example format:
    "Sales increased 15% in July" -> Data point: July 2024 row, sales column: $45,000 vs June $39,130

This forces the AI to ground its claims in actual data. When it can’t find evidence, the low confidence flag warns me.
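A citation is only useful if someone follows it. The sketch below checks the example citation in two ways: that the cited cell exists in the source with the cited value, and that the claimed percentage follows from the two cited numbers. The `source` dictionary keyed by (row, column) is a hypothetical stand-in for the real dataset, and the function names and tolerances are assumptions.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (row identifier, column name)


def citation_matches(source: Dict[Cell, float], cell: Cell, cited: float) -> bool:
    """True only if the cited cell exists and holds the cited value (to the cent)."""
    return cell in source and abs(source[cell] - cited) < 0.005


def percent_change_matches(previous: float, current: float,
                           claimed_pct: float, tolerance: float = 0.5) -> bool:
    """True if the claimed change agrees with the values it cites."""
    if previous == 0:
        return False
    actual = (current - previous) / previous * 100
    return abs(actual - claimed_pct) <= tolerance


# Hypothetical source table holding the two cells the example cites.
source = {
    ("2024-06", "sales"): 39130.0,
    ("2024-07", "sales"): 45000.0,
}

cells_ok = (citation_matches(source, ("2024-07", "sales"), 45000.0)
            and citation_matches(source, ("2024-06", "sales"), 39130.0))
claim_ok = percent_change_matches(previous=39130.0, current=45000.0, claimed_pct=15.0)
print(cells_ok, claim_ok)
```

For the example above, (45,000 - 39,130) / 39,130 is about 15.0%, so the claim is consistent with its own citation.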
Pattern 5: Adversarial Self-Critique
The best way to catch errors is to ask the AI to argue against itself:
    After completing your analysis, answer these questions:
    1. What assumptions in this analysis might be wrong?
    2. What data points contradict these conclusions?
    3. What alternative interpretations exist?
    4. Which claims have the weakest evidence?

This surfaces edge cases and logical problems the initial analysis missed.
A Real Example: Before and After
Here’s how I transformed a failing prompt.
The Old Way (Failed)
    Analyze the customer churn data and identify patterns.

Result: The AI claimed 23% churn rate. Actual was 18%. It had mixed monthly and annual customers in the calculation.
The New Way (Works)
    ## Task: Analyze customer churn for Q3 2024

    ### Constraints
    - Date range: 2024-07-01 to 2024-09-30
    - Customer types: Monthly and annual tracked SEPARATELY
    - Output: Churn rates for each cohort, then combined weighted average
    - Missing data: Exclude from calculations, report count

    ### Required Output
    {
      "monthly_customers": {
        "start_count": number,
        "churned": number,
        "churn_rate": number
      },
      "annual_customers": {
        "start_count": number,
        "churned": number,
        "churn_rate": number
      },
      "combined_churn_rate": number,
      "validation": {
        "data_completeness": number,
        "calculations_verified": boolean
      }
    }

    ### Verification
    1. churn_rate = churned / start_count * 100
    2. combined_churn_rate is weighted average, not simple average
    3. All percentages are < 100%

Result: Accurate calculations, clear methodology, validated output.
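The difference between a weighted and a simple average is easy to see with numbers. Here is a small sketch of the two verification formulas; the cohort sizes are invented for illustration and are not the figures from the incident above.

```python
from typing import List, Tuple


def churn_rate(churned: int, start_count: int) -> float:
    """churn_rate = churned / start_count * 100, as in the verification step."""
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    return churned / start_count * 100


def combined_churn_rate(cohorts: List[Tuple[int, int]]) -> float:
    """Weighted average: total churned over total starting customers.

    Each cohort is a (churned, start_count) pair.
    """
    total_churned = sum(churned for churned, _ in cohorts)
    total_start = sum(start for _, start in cohorts)
    return churn_rate(total_churned, total_start)


# Invented cohorts: a small, high-churn monthly group and a large annual one.
monthly = (250, 1000)   # (churned, start_count)
annual = (200, 4000)

weighted = combined_churn_rate([monthly, annual])
simple = (churn_rate(*monthly) + churn_rate(*annual)) / 2
print(f"weighted: {weighted:.1f}%  simple: {simple:.1f}%")
```

With these cohorts the simple average overstates churn (about 15% versus about 9%) because it gives the small monthly cohort the same weight as the much larger annual one.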
Building a Reusable Prompt Framework
I turned these patterns into a Python class that generates QA-enhanced prompts automatically:
    from typing import Dict, Any, List


    class DataQAPrompt:
        """Wrapper for QA-enhanced data analysis prompts"""

        def __init__(self, base_prompt: str, constraints: Dict[str, Any]):
            self.base_prompt = base_prompt
            self.constraints = constraints
            self.validation_steps: List[str] = []

        def add_validation(self, check: str) -> 'DataQAPrompt':
            """Add a validation step to the prompt"""
            self.validation_steps.append(check)
            return self

        def build(self) -> str:
            """Construct the full QA-enhanced prompt"""
            sections = [
                f"## Task: {self.base_prompt}",
                "\n### Constraints",
                self._format_constraints(),
            ]

            if self.validation_steps:
                sections.extend([
                    "\n### Verification Steps",
                    self._format_validations()
                ])

            return "\n".join(sections)

        def _format_constraints(self) -> str:
            return "\n".join([f"- {k}: {v}" for k, v in self.constraints.items()])

        def _format_validations(self) -> str:
            return "\n".join([f"{i+1}. {v}" for i, v in enumerate(self.validation_steps)])


    # Usage
    prompt = DataQAPrompt(
        base_prompt="Analyze customer churn data",
        constraints={
            "date_range": "2024-01-01 to 2024-12-31",
            "minimum_records": 1000,
            "output_format": "JSON"
        }
    )

    prompt.add_validation("Total customers = retained + churned")
    prompt.add_validation("Churn rate = churned / total * 100")
    prompt.add_validation("All percentages must be < 100%")

    print(prompt.build())

This ensures every data analysis prompt includes constraints and verification.
Validating AI Outputs Programmatically
Once the AI returns structured output, I validate it:
    interface DataAnalysisResult {
      summary: Record<string, number | string>;
      validation: {
        record_count: number;
        date_range_verified: boolean;
        calculations_check: boolean;
      };
      insights: Array<{
        finding: string;
        data_point: string;
        confidence: 'high' | 'medium' | 'low';
      }>;
    }

    function validateAnalysisResult(result: DataAnalysisResult): {
      valid: boolean;
      errors: string[];
    } {
      const errors: string[] = [];

      // Check validation flags the AI self-reported
      if (!result.validation.date_range_verified) {
        errors.push('Date range verification failed');
      }

      if (!result.validation.calculations_check) {
        errors.push('Calculations failed validation');
      }

      // Flag if too many low-confidence insights
      const lowConfidenceCount = result.insights.filter(
        i => i.confidence === 'low'
      ).length;

      if (lowConfidenceCount > result.insights.length / 2) {
        errors.push('Too many low-confidence insights - data may be insufficient');
      }

      // Check for missing evidence
      const missingEvidence = result.insights.filter(
        i => !i.data_point || i.data_point === 'N/A'
      );

      if (missingEvidence.length > 0) {
        errors.push(`${missingEvidence.length} insights lack data citations`);
      }

      return { valid: errors.length === 0, errors };
    }

This catches problems the AI’s self-checks missed.
The QA-Prompt Sandwich
I organize every prompt as a three-layer structure:
Top layer: Context and constraints
- Define the task scope
- Set hard boundaries
- Specify formats and requirements
Middle layer: The actual task
- The analysis or calculation request
- Data references
Bottom layer: Validation requirements
- Verification steps
- Self-critique questions
- Output validation rules
This structure makes prompts auditable. Anyone can read the prompt and understand what quality checks were requested.
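The three layers can be assembled mechanically so that no prompt ships without its bottom layer. This is a minimal sketch; the function name `build_sandwich` and the section headings it emits are assumptions for illustration, not a fixed format.

```python
from typing import List


def build_sandwich(context: List[str], task: str, validation: List[str]) -> str:
    """Assemble a prompt as constraints (top), task (middle), validation (bottom)."""
    if not validation:
        # Refuse to build a prompt that has no quality checks attached.
        raise ValueError("a QA prompt needs at least one validation requirement")

    top = "### Context and Constraints\n" + "\n".join(f"- {c}" for c in context)
    middle = "### Task\n" + task
    bottom = "### Validation Requirements\n" + "\n".join(
        f"{i}. {v}" for i, v in enumerate(validation, start=1)
    )
    return "\n\n".join([top, middle, bottom])


prompt_text = build_sandwich(
    context=["Date range: 2024-07-01 to 2024-09-30 ONLY", "Currency: All values in USD"],
    task="Analyze sales data for Q3 2024 and report total revenue by product.",
    validation=[
        "Sum of product revenues equals total_revenue",
        "Which claims have the weakest evidence?",
    ],
)
print(prompt_text)
```

Raising an error on an empty validation list is a deliberate choice here: it turns a forgotten quality check into a failure at build time instead of a silent gap in the prompt.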
What Still Requires Human Judgment
These patterns catch most errors, but not all. I still manually review:
- High-stakes decisions: Anything affecting revenue forecasts or strategic plans
- Novel analysis types: First-time prompts without established error patterns
- Edge case signals: When the AI flags low confidence or missing data
- Cross-check anomalies: When AI results differ from expectations
The goal isn’t to eliminate human review. It’s to focus human attention on genuinely uncertain outputs instead of checking every number.
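These review triggers can be written down as a routing rule so that the decision to escalate is consistent. The sketch below assumes the result follows the output schema from Pattern 2; the function name, the flags, and the 5% deviation threshold are hypothetical choices, not measured values.

```python
from typing import Any, Dict, List


def review_reasons(result: Dict[str, Any],
                   expected_revenue: float,
                   high_stakes: bool = False,
                   first_run: bool = False,
                   max_deviation: float = 0.05) -> List[str]:
    """Return the reasons a human should review this result (empty if none)."""
    reasons: List[str] = []
    if high_stakes:
        reasons.append("high-stakes decision")
    if first_run:
        reasons.append("novel analysis type")
    if any(i.get("confidence") == "low" for i in result.get("insights", [])):
        reasons.append("low-confidence insight flagged")

    # Compare against an independent expectation, such as a figure from finance.
    reported = result["summary"]["total_revenue"]
    if expected_revenue and abs(reported - expected_revenue) / abs(expected_revenue) > max_deviation:
        reasons.append("result differs from expectation")
    return reasons


# The figures from the opening anecdote: the AI said $2.3M, finance said $2.1M.
result = {
    "summary": {"total_revenue": 2_300_000.0},
    "insights": [{"finding": "Q4 revenue grew", "data_point": "N/A", "confidence": "low"}],
}
print(review_reasons(result, expected_revenue=2_100_000.0))
```

An empty list does not prove the analysis is right; it only means none of the listed triggers fired.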
Lessons Learned
Building this system took months of iteration. Key insights:
- Explicit beats implicit: Every assumption the AI might make should be stated as a constraint
- Structure enables validation: Free-form text is impossible to check; schemas make validation automatic
- Self-critique works: AI is surprisingly good at finding its own errors when prompted
- Document what fails: Every hallucination teaches you a new constraint to add
The CEO incident that started this journey led to a 90% reduction in data quality issues. The remaining 10% are caught by the validation layer or human review.
Most importantly, my AI-generated analyses are now trusted instead of questioned. The prompts themselves serve as documentation of what quality checks were performed.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can contact me by email.