
Overview

Parsefy provides field-level confidence scoring with evidence tracking. Every extracted field comes with:
  • A confidence score (0.0 to 1.0)
  • The source text evidence
  • The page number where it was found
  • A reason explaining the score
Our goal: 0% silent errors. If a required field can’t be extracted with sufficient confidence, the API either triggers a fallback model or fails with clear reasons. It never returns unreliable data silently.

The _meta Structure

Every extraction includes detailed metadata:
{
  "invoice_number": "INV-2024-0042",
  "date": "2024-01-15",
  "total": 1250.00,
  "vendor": "Acme Corp",
  "_meta": {
    "confidence_score": 0.94,
    "field_confidence": [
      { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "Invoice # INV-2024-0042" },
      { "field": "$.date", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Date: 01/15/2024" },
      { "field": "$.total", "score": 0.92, "reason": "Formatting ambiguous", "page": 1, "text": "Total: $1,250.00" },
      { "field": "$.vendor", "score": 0.90, "reason": "Inferred from header", "page": 1, "text": "Acme Corp" }
    ],
    "issues": []
  }
}
You don’t need to define _meta in your schema; it’s injected automatically.
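As a minimal sketch of reading this structure, the snippet below parses the sample response above with Python's standard json module and lists the fields that score under a cutoff. The helper name low_confidence_fields and the 0.93 cutoff are illustrative assumptions, not part of the Parsefy API.

```python
import json

# Sample extraction result, copied from the example above.
RAW = """
{
  "invoice_number": "INV-2024-0042",
  "date": "2024-01-15",
  "total": 1250.00,
  "vendor": "Acme Corp",
  "_meta": {
    "confidence_score": 0.94,
    "field_confidence": [
      { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "Invoice # INV-2024-0042" },
      { "field": "$.date", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Date: 01/15/2024" },
      { "field": "$.total", "score": 0.92, "reason": "Formatting ambiguous", "page": 1, "text": "Total: $1,250.00" },
      { "field": "$.vendor", "score": 0.90, "reason": "Inferred from header", "page": 1, "text": "Acme Corp" }
    ],
    "issues": []
  }
}
"""


def low_confidence_fields(result: dict, cutoff: float) -> list[str]:
    """Return the JSON paths of fields whose score is below `cutoff`.

    Hypothetical helper for illustration; not an SDK function.
    """
    entries = result.get("_meta", {}).get("field_confidence", [])
    return [entry["field"] for entry in entries if entry["score"] < cutoff]


result = json.loads(RAW)
print(low_confidence_fields(result, cutoff=0.93))  # ['$.total', '$.vendor']
```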

Field Confidence Object

Each entry in field_confidence contains:
| Property | Type | Description |
| --- | --- | --- |
| field | string | JSON path to the field (e.g., $.invoice_number) |
| score | number | Confidence score (0.0 to 1.0) |
| reason | string | Explanation: “Exact match”, “Inferred from header”, etc. |
| page | integer | Page number where the value was found |
| text | string | Source text evidence from the document |

Common Reasons

| Reason | Meaning |
| --- | --- |
| Exact match | Field value was found exactly as expected |
| Inferred from header | Value was derived from document header/context |
| Formatting ambiguous | Value found but format was unclear |
| Multiple values found | Several possible values were detected |
| Partially obscured | Some text was difficult to read |

Confidence Threshold

Control when the fallback model is triggered using confidence_threshold:
curl -X POST "https://api.parsefy.io/v1/extract" \
  -H "Authorization: Bearer pk_your_api_key" \
  -F "file=@invoice.pdf" \
  -F 'output_schema={
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string"},
      "date": {"type": "string"},
      "total": {"type": "number"},
      "vendor": {"type": "string"}
    },
    "required": ["invoice_number", "total"]
  }' \
  -F "confidence_threshold=0.85"
Default: 0.85
| Threshold | Behavior | Use Case |
| --- | --- | --- |
| Lower (e.g., 0.70) | Faster: Accepts Tier 1 results more often | High-volume, less critical |
| Higher (e.g., 0.95) | More accurate: Triggers Tier 2 fallback more often | Financial reconciliation |
Lower confidence_threshold = faster and cheaper (accepts Tier 1 more often). Higher confidence_threshold = more accurate but more expensive (triggers Tier 2 fallback more often).

Automatic Fallback

Parsefy uses a two-tier model architecture for reliability:
1. Tier 1 Extraction: Your document is first processed by a fast, efficient model.
2. Confidence Check: If any required field returns null or falls below confidence_threshold, the extraction is automatically re-run.
3. Tier 2 Fallback: A more powerful (and more expensive) model processes the document for improved accuracy.
Important: If a required field can’t be extracted with sufficient confidence, the API triggers the fallback model, which directly affects billing. See the section on Required vs Optional Fields.
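The confidence check described above can be sketched as a small predicate. This is only an illustration of the documented rule, not Parsefy's server-side implementation. The function name needs_fallback and its argument shapes are assumptions, and treating a missing score as 0.0 is a choice made for this sketch.

```python
from typing import Any


def needs_fallback(
    values: dict[str, Any],
    scores: dict[str, float],
    required: list[str],
    threshold: float = 0.85,
) -> bool:
    """Return True if any required field is null or scores below `threshold`.

    Hypothetical helper: mirrors the rule stated in the docs, nothing more.
    """
    for name in required:
        if values.get(name) is None:
            return True  # required field came back null
        if scores.get(name, 0.0) < threshold:
            return True  # required field is below the threshold (assumed 0.0 if unscored)
    return False


values = {"invoice_number": "INV-2024-0042", "total": 1250.00, "vendor": None}
scores = {"invoice_number": 0.98, "total": 0.92}

print(needs_fallback(values, scores, ["invoice_number", "total"]))                  # False
print(needs_fallback(values, scores, ["invoice_number", "total"], threshold=0.95))  # True
print(needs_fallback(values, scores, ["invoice_number", "total", "vendor"]))        # True
```

Note how the optional vendor field, although null, only matters once it is listed as required.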
The metadata.fallback_triggered field tells you if the fallback was used:
{
  "object": { ... },
  "metadata": {
    "processing_time_ms": 4500,
    "credits": 2,
    "fallback_triggered": true
  },
  "verification": { ... }
}
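To watch how often the fallback fires across a batch, a sketch like the following could aggregate the metadata block of each response. The helper summarize_usage is hypothetical. Apart from the first entry, which reuses the values shown above, the sample numbers are made up for illustration and do not reflect actual Parsefy pricing.

```python
def summarize_usage(responses: list[dict]) -> dict:
    """Aggregate fallback rate and credits from a list of response dicts.

    Hypothetical helper; assumes each response carries the `metadata`
    keys shown in the example above.
    """
    total = len(responses)
    fallbacks = sum(1 for r in responses if r["metadata"].get("fallback_triggered"))
    credits = sum(r["metadata"].get("credits", 0) for r in responses)
    return {
        "documents": total,
        "fallback_rate": fallbacks / total if total else 0.0,
        "credits": credits,
    }


# First entry uses the documented example; the rest are invented sample values.
responses = [
    {"metadata": {"processing_time_ms": 4500, "credits": 2, "fallback_triggered": True}},
    {"metadata": {"processing_time_ms": 1800, "credits": 1, "fallback_triggered": False}},
    {"metadata": {"processing_time_ms": 1700, "credits": 1, "fallback_triggered": False}},
    {"metadata": {"processing_time_ms": 1900, "credits": 1, "fallback_triggered": False}},
]

print(summarize_usage(responses))
# {'documents': 4, 'fallback_rate': 0.25, 'credits': 5}
```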

Required vs Optional Fields (Critical for Billing)

All fields are required by default in both SDKs. This is intentional for data safety, but it significantly impacts your costs.

Why This Matters

If a required field returns null or falls below the confidence_threshold, the API triggers the fallback model (Tier 2), which is significantly more expensive.
| User writes (SDK) | SDK converts to (JSON Schema) | API interprets as |
| --- | --- | --- |
| name: z.string() | required: ["name"] | Required: triggers fallback if low confidence |
| name: z.string().optional() | required: [] | Optional: won’t trigger fallback |

To Avoid Unexpected High Billing

Mark fields as optional if they might be missing in >20% of your documents:
import { z } from 'zod';

const schema = z.object({
  // REQUIRED - Always present on invoices, keep required
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount including tax'),

  // OPTIONAL - May not appear on all documents, mark optional!
  vendor: z.string().optional().describe('Vendor name'),        // Not all invoices have a vendor name
  tax_id: z.string().optional().describe('Tax ID number'),      // Rarely present
  notes: z.string().optional().describe('Additional notes'),    // Usually empty
  due_date: z.string().optional().describe('Payment due date'), // Sometimes missing
});
Rule of thumb: If a field might be missing in >20% of your documents, mark it as optional.
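One way to apply the 20% rule is to measure how often each field is actually missing in a sample of past extractions. The sketch below assumes you have such a sample as plain dictionaries. The helper suggest_optional and the sample data are hypothetical.

```python
def suggest_optional(samples: list[dict], max_missing_rate: float = 0.20) -> list[str]:
    """Return fields that are absent or None in more than `max_missing_rate` of samples.

    Hypothetical helper for applying the rule of thumb above.
    """
    if not samples:
        return []
    fields = sorted({key for sample in samples for key in sample})
    suggestions = []
    for field in fields:
        missing = sum(1 for sample in samples if sample.get(field) is None)
        if missing / len(samples) > max_missing_rate:
            suggestions.append(field)
    return suggestions


# Invented sample: vendor is missing in 2 of 5 documents, tax_id in 4 of 5.
samples = [
    {"invoice_number": "A-1", "total": 10.0, "vendor": "Acme", "tax_id": None},
    {"invoice_number": "A-2", "total": 20.0, "vendor": None, "tax_id": None},
    {"invoice_number": "A-3", "total": 30.0, "vendor": "Acme", "tax_id": "123"},
    {"invoice_number": "A-4", "total": 40.0, "vendor": None, "tax_id": None},
    {"invoice_number": "A-5", "total": 50.0, "vendor": "Acme", "tax_id": None},
]

print(suggest_optional(samples))  # ['tax_id', 'vendor']
```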

Score Interpretation

| Score | Level | Meaning | Recommended Action |
| --- | --- | --- | --- |
| 0.95 - 1.0 | Very High | All fields found with high certainty | Use directly |
| 0.90 - 0.94 | High | Minor uncertainties, excellent extraction | Use directly |
| 0.85 - 0.89 | Moderate | Some unclear fields | Review if critical |
| 0.70 - 0.84 | Low | Multiple issues detected | Manual review recommended |
| < 0.70 | Very Low | Significant problems | Results may be unreliable |
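The bands above can be turned into a lookup, sketched below. The function name confidence_level is hypothetical. Because the table lists bands to two decimals, the sketch assumes each band starts at its lower bound and runs up to the next one, so a score such as 0.945 falls into High.

```python
def confidence_level(score: float) -> str:
    """Map a confidence score (0.0 to 1.0) to the level names in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.95:
        return "Very High"
    if score >= 0.90:
        return "High"
    if score >= 0.85:
        return "Moderate"
    if score >= 0.70:
        return "Low"
    return "Very Low"


for score in (0.98, 0.94, 0.87, 0.82, 0.40):
    print(score, confidence_level(score))
```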

The Issues Array

The issues array contains human-readable descriptions of any problems encountered:
{
  "_meta": {
    "confidence_score": 0.82,
    "field_confidence": [...],
    "issues": [
      "Date format ambiguous: could be DD/MM/YYYY or MM/DD/YYYY",
      "Total amount unclear - multiple totals found",
      "Vendor name partially obscured"
    ]
  }
}

Using Confidence in Your Application

TypeScript Example

const { object, metadata, verification, error } = await client.extract({
  file: './invoice.pdf',
  schema,
  confidenceThreshold: 0.85,
  enableVerification: true, // Enable math verification
});

if (!error && object) {
  // Overall confidence from _meta
  console.log(`Overall confidence: ${object._meta.confidence_score}`);

  // Check individual field confidence
  object._meta.field_confidence.forEach((fc) => {
    console.log(`${fc.field}: ${fc.score} (${fc.reason}) - "${fc.text}"`);
    
    if (fc.score < 0.80) {
      console.warn(`Low confidence on ${fc.field}`);
    }
  });

  // Check for issues
  if (object._meta.issues.length > 0) {
    console.warn('Issues detected:', object._meta.issues);
  }

  // Check verification results
  if (verification) {
    console.log(`Verification: ${verification.status}`);
    verification.checks_run.forEach((check) => {
      console.log(`${check.type}: ${check.passed ? 'PASSED' : 'FAILED'}`);
    });
  }
}

Python Example

result = client.extract(
    file="document.pdf",
    schema=Invoice,
    confidence_threshold=0.85,
    enable_verification=True  # Enable math verification
)

if result.error is None:
    # Overall confidence from meta
    if result.meta:
        print(f"Overall confidence: {result.meta.confidence_score}")
        
        # Check individual field confidence
        for fc in result.meta.field_confidence:
            print(f"{fc.field}: {fc.score} ({fc.reason}) - '{fc.text}'")
            
            if fc.score < 0.80:
                print(f"  Low confidence on {fc.field}")
        
        # Check for issues
        if result.meta.issues:
            print("Issues:", result.meta.issues)
    
    # Check verification results
    if result.verification:
        print(f"Verification: {result.verification.status}")
        for check in result.verification.checks_run:
            print(f"{check.type}: {'PASSED' if check.passed else 'FAILED'}")

Best Practices

Mark Optional Fields

Any field that might be missing in >20% of documents should be optional to avoid unnecessary fallback triggers.

Set Appropriate Thresholds

Financial reconciliation may need 0.95+, while categorization might accept 0.80+.

Log Field Evidence

Store field_confidence for audit trails and debugging extraction issues.

Handle Low Confidence

Build workflows that route low-confidence extractions to human review.
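A routing rule for human review might look like the sketch below. It works on the _meta dictionary shown earlier. The function name route, the queue labels "auto" and "human_review", and the 0.85 review cutoff are all assumptions to adapt to your own workflow.

```python
def route(meta: dict, review_below: float = 0.85) -> str:
    """Return 'human_review' if the extraction looks doubtful, else 'auto'.

    Hypothetical helper; `meta` is the `_meta` object from an extraction.
    """
    if meta.get("issues"):
        return "human_review"  # any reported issue goes to a person
    if meta["confidence_score"] < review_below:
        return "human_review"  # overall score too low
    if any(fc["score"] < review_below for fc in meta.get("field_confidence", [])):
        return "human_review"  # at least one field is too uncertain
    return "auto"


clean_meta = {
    "confidence_score": 0.94,
    "field_confidence": [
        {"field": "$.invoice_number", "score": 0.98},
        {"field": "$.total", "score": 0.92},
    ],
    "issues": [],
}
doubtful_meta = {
    "confidence_score": 0.82,
    "field_confidence": [],
    "issues": ["Total amount unclear - multiple totals found"],
}

print(route(clean_meta))     # auto
print(route(doubtful_meta))  # human_review
```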
