Extracting Data

Overview

Parsefy extracts structured data from financial PDFs (invoices, receipts, bills) and returns validated JSON with field-level confidence and evidence. You define what data you want using a schema, and Parsefy returns structured data or fails with clear reasons.

Our goal: 0% silent errors. You get either validated output or clear failure reasons, never unreliable data passed through silently.

Basic extraction

import { Parsefy } from 'parsefy';
import * as z from 'zod';

const client = new Parsefy();

const schema = z.object({
  // REQUIRED - triggers fallback if below confidence threshold
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount including tax'),

  // OPTIONAL - won't trigger fallback if missing
  vendor: z.string().optional().describe('Vendor name'),
  due_date: z.string().optional().describe('Payment due date'),
});

const { object, metadata, verification, error } = await client.extract({
  file: './invoice.pdf',
  schema,
  enableVerification: true, // Enable math verification
});

if (!error && object) {
  console.log(object.invoice_number); // Fully typed!

  // Access field-level confidence from _meta
  console.log(`Overall confidence: ${object._meta.confidence_score}`);
  object._meta.field_confidence.forEach((fc) => {
    console.log(`${fc.field}: ${fc.score} (${fc.reason}) - "${fc.text}"`);
  });

  // Check verification results
  if (verification) {
    console.log(`Verification: ${verification.status}`);
  }
}

Confidence threshold

Control when the fallback (Tier 2) model is triggered. A required field whose confidence falls below the threshold triggers the fallback:

const { object, metadata } = await client.extract({
  file: './invoice.pdf',
  schema,
  confidenceThreshold: 0.85, // default
});

Threshold	Behavior	Use Case
Lower (e.g., 0.70)	Faster: Accepts Tier 1 results more often	High-volume, less critical
Higher (e.g., 0.95)	More accurate: Triggers Tier 2 fallback more often	Financial reconciliation

Default: 0.85
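
As a rough illustration of how the threshold interacts with required and optional fields, the decision can be pictured as follows. This is only a sketch of the behavior described on this page, not Parsefy's internal logic; `needsFallback` and the `FieldScore` type are hypothetical names introduced for the example.

```typescript
// Hypothetical shape: one confidence score per extracted field.
interface FieldScore {
  field: string; // JSON path, e.g. "$.total"
  score: number; // 0.0 to 1.0
}

// Returns true when any REQUIRED field is missing or scores below the threshold.
// Optional fields never trigger the fallback in this sketch.
function needsFallback(
  scores: FieldScore[],
  requiredFields: string[],
  threshold = 0.85,
): boolean {
  return requiredFields.some((name) => {
    const entry = scores.find((s) => s.field === `$.${name}`);
    return entry === undefined || entry.score < threshold;
  });
}

const scores: FieldScore[] = [
  { field: '$.invoice_number', score: 0.98 },
  { field: '$.total', score: 0.92 },
  { field: '$.vendor', score: 0.6 }, // optional, so it is ignored
];

// Both required fields are at or above 0.85, so no fallback.
const atDefault = needsFallback(scores, ['invoice_number', 'total']);
// With a stricter threshold, $.total (0.92) falls short and triggers the fallback.
const atStrict = needsFallback(scores, ['invoice_number', 'total'], 0.95);
```

Raising the threshold trades speed for accuracy: more documents are sent to the slower Tier 2 model, which matches the table above.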

File inputs

Parsefy accepts multiple file input types:

Input	Description
File path	`"./document.pdf"` - reads from disk
Buffer/bytes	In-memory file data
File object	Browser `File` from form input
Blob	Raw binary with MIME type

// File path
const fromPath = await client.extract({ file: './doc.pdf', schema });

// Buffer (Node.js)
import { readFileSync } from 'node:fs';
const fromBuffer = await client.extract({
  file: readFileSync('./doc.pdf'),
  schema,
});

// File object (browser)
const fileInput = document.querySelector<HTMLInputElement>('input[type="file"]');
const selected = fileInput?.files?.[0]; // undefined until the user picks a file
if (selected) {
  const fromBrowser = await client.extract({ file: selected, schema });
}

Response format

Every extraction returns structured data with field-level confidence:

{
  "object": {
    "invoice_number": "INV-2024-0042",
    "date": "2024-01-15",
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "vendor": "Acme Corp",
    "_meta": {
      "confidence_score": 0.94,
      "field_confidence": [
        { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "Invoice # INV-2024-0042" },
        { "field": "$.date", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Date: 01/15/2024" },
        { "field": "$.subtotal", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Subtotal: $1,150.00" },
        { "field": "$.tax", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Tax: $100.00" },
        { "field": "$.total", "score": 0.92, "reason": "Formatting ambiguous", "page": 1, "text": "Total: $1,250.00" },
        { "field": "$.vendor", "score": 0.90, "reason": "Inferred from header", "page": 1, "text": "Acme Corp" }
      ],
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 2340,
    "credits": 1,
    "fallback_triggered": false
  },
  "verification": {
    "status": "PASSED",
    "checks_passed": 1,
    "checks_failed": 0,
    "cannot_verify_count": 0,
    "checks_run": [
      {
        "type": "HORIZONTAL_SUM",
        "status": "PASSED",
        "fields": ["total", "subtotal", "tax"],
        "passed": true,
        "delta": 0.0,
        "expected": 1250.00,
        "actual": 1250.00
      }
    ]
  },
  "error": null
}

The `_meta` field

Every extraction includes quality metrics with evidence:

Property	Type	Description
`confidence_score`	number	Overall confidence (0.0 to 1.0)
`field_confidence`	array	Per-field confidence with evidence
`issues`	array	Warnings or anomalies detected

Field confidence object

Each entry in `field_confidence` contains:

Property	Type	Description
`field`	string	JSON path (e.g., `$.invoice_number`)
`score`	number	Confidence score (0.0 to 1.0)
`reason`	string	“Exact match”, “Inferred from header”, etc.
`page`	number	Page number where found
`text`	string	Source text evidence
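
One way to use these entries is to build a review list for a human, showing the weakest fields first together with their source evidence. The sketch below works on plain objects shaped like the table above; `reviewQueue` and the 0.95 cutoff are illustrative choices, not part of the Parsefy SDK.

```typescript
// Shape of one field_confidence entry, as listed in the table above.
interface FieldConfidence {
  field: string;
  score: number;
  reason: string;
  page: number;
  text: string;
}

// Collect entries below a review cutoff and format them for a human reviewer.
function reviewQueue(entries: FieldConfidence[], cutoff: number): string[] {
  return entries
    .filter((e) => e.score < cutoff)
    .sort((a, b) => a.score - b.score) // least confident first
    .map((e) => {
      const name = e.field.replace(/^\$\./, ''); // "$.total" -> "total"
      return `${name} (${e.score.toFixed(2)}, p.${e.page}): ${e.reason} | "${e.text}"`;
    });
}

const sample: FieldConfidence[] = [
  { field: '$.invoice_number', score: 0.98, reason: 'Exact match', page: 1, text: 'Invoice # INV-2024-0042' },
  { field: '$.total', score: 0.92, reason: 'Formatting ambiguous', page: 1, text: 'Total: $1,250.00' },
  { field: '$.vendor', score: 0.9, reason: 'Inferred from header', page: 1, text: 'Acme Corp' },
];

const queue = reviewQueue(sample, 0.95);
// queue[0] is 'vendor (0.90, p.1): Inferred from header | "Acme Corp"'
```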

Metadata

Processing information:

Property	Type	Description
`processing_time_ms`	integer	How long the extraction took, in milliseconds
`credits`	integer	Credits consumed (~1 per page)
`fallback_triggered`	boolean	Whether Tier 2 model was used

Verification (optional)

When `enableVerification` is set to `true`, the response includes math verification results:

Property	Type	Description
`status`	string	`PASSED`, `FAILED`, `PARTIAL`, `CANNOT_VERIFY`, or `NO_RULES`
`checks_passed`	integer	Number of checks that passed
`checks_failed`	integer	Number of checks that failed
`cannot_verify_count`	integer	Number of checks that could not be verified
`checks_run`	array	Details of each verification check
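
The `HORIZONTAL_SUM` check in the sample response compares `total` against `subtotal + tax`. The same arithmetic can be reproduced client-side as a sanity check. This is a sketch only: the `horizontalSum` helper and the one-cent tolerance are assumptions, and Parsefy's actual rule set and tolerance are not specified here.

```typescript
interface SumCheck {
  type: 'HORIZONTAL_SUM';
  passed: boolean;
  expected: number; // subtotal + tax
  actual: number;   // extracted total
  delta: number;    // |expected - actual|
}

// Work in integer cents to avoid floating-point drift such as 0.1 + 0.2 !== 0.3.
const toCents = (amount: number): number => Math.round(amount * 100);

// toleranceCents = 1 is an assumed default, not a documented Parsefy value.
function horizontalSum(
  subtotal: number,
  tax: number,
  total: number,
  toleranceCents = 1,
): SumCheck {
  const expectedCents = toCents(subtotal) + toCents(tax);
  const actualCents = toCents(total);
  const deltaCents = Math.abs(expectedCents - actualCents);
  return {
    type: 'HORIZONTAL_SUM',
    passed: deltaCents <= toleranceCents,
    expected: expectedCents / 100,
    actual: actualCents / 100,
    delta: deltaCents / 100,
  };
}

// Values from the sample response: 1150.00 + 100.00 = 1250.00, so delta is 0.
const ok = horizontalSum(1150.0, 100.0, 1250.0);
// A transposed total (1520 instead of 1250) fails with a delta of 270.
const bad = horizontalSum(1150.0, 100.0, 1520.0);
```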

Supported formats

Format	Extension	Processing
PDF	`.pdf`	Native multimodal AI (can “see” the document)
Microsoft Word	`.docx`	Converted to Markdown

Maximum file size: 10 MB
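
A client-side pre-flight check can reject unsupported or oversized files before any request is made. The sketch below assumes that "10 MB" means 10 × 1024 × 1024 bytes and that the format is judged by file extension; both are assumptions, and `preflight` is a hypothetical helper rather than an SDK function.

```typescript
// Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes; the exact limit may differ.
const MAX_BYTES = 10 * 1024 * 1024;
const SUPPORTED_EXTENSIONS = ['.pdf', '.docx'];

// Returns null when the file looks acceptable, otherwise a human-readable reason.
function preflight(fileName: string, sizeBytes: number): string | null {
  const dot = fileName.lastIndexOf('.');
  const ext = dot === -1 ? '' : fileName.slice(dot).toLowerCase();
  if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
    return `Unsupported format "${ext || fileName}": expected .pdf or .docx`;
  }
  if (sizeBytes > MAX_BYTES) {
    const sizeMb = (sizeBytes / 1024 / 1024).toFixed(1);
    return `File is ${sizeMb} MB; the limit is 10 MB`;
  }
  return null;
}

const accepted = preflight('invoice.PDF', 2000000);        // null (extension check is case-insensitive)
const wrongType = preflight('invoice.xlsx', 2000000);      // unsupported-format message
const tooLarge = preflight('scan.pdf', 12 * 1024 * 1024);  // size message
```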

Error handling

import { Parsefy, APIError, ValidationError, ParsefyError } from 'parsefy';

const client = new Parsefy();

// `schema` is the Zod schema defined in Basic extraction.
async function extractDocument() {
  try {
    const { object, error, metadata } = await client.extract({
      file: './document.pdf',
      schema,
    });

    // Extraction-level error (request succeeded, but extraction failed)
    if (error) {
      console.error(`Extraction failed: [${error.code}] ${error.message}`);
      console.log(`Fallback triggered: ${metadata.fallbackTriggered}`);
      console.log(`Issues: ${metadata.issues.join(', ')}`);
      return;
    }

    // Check for low confidence fields (per-field scores live in object._meta)
    const fieldConfidence = object?._meta.field_confidence ?? [];
    const lowConfidence = fieldConfidence.filter((fc) => fc.score < 0.80);
    if (lowConfidence.length > 0) {
      console.warn('Low confidence fields:', lowConfidence);
    }

    console.log('Success:', object);
  } catch (err) {
    // HTTP/Network errors
    if (err instanceof APIError) {
      console.error(`API Error ${err.statusCode}: ${err.message}`);
    } else if (err instanceof ValidationError) {
      console.error(`Validation Error: ${err.message}`);
    } else if (err instanceof ParsefyError) {
      console.error(`Parsefy Error: ${err.message}`);
    } else {
      throw err; // Not a Parsefy error: let it propagate
    }
  }
}

await extractDocument();
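
The error, confidence, and verification signals can be combined into a single routing decision. The policy below is one possible sketch, not a Parsefy recommendation: `triage`, `ExtractionSummary`, and the accept/review/reject rules are all hypothetical, and the summary object is something you would fill in yourself from the extraction response.

```typescript
type VerificationStatus = 'PASSED' | 'FAILED' | 'PARTIAL' | 'CANNOT_VERIFY' | 'NO_RULES';

// Hypothetical summary of one extraction response.
interface ExtractionSummary {
  error: { code: string; message: string } | null;
  confidenceScore: number | null;                 // _meta.confidence_score, or null if no object was returned
  verificationStatus: VerificationStatus | null;  // null when verification was not enabled
}

type Decision = 'accept' | 'review' | 'reject';

// Assumed policy: hard failures are rejected, anything uncertain goes to a human.
function triage(result: ExtractionSummary, minConfidence = 0.85): Decision {
  if (result.error !== null || result.confidenceScore === null) return 'reject';
  if (result.verificationStatus === 'FAILED') return 'reject';
  if (result.confidenceScore < minConfidence) return 'review';
  if (result.verificationStatus === 'PARTIAL' || result.verificationStatus === 'CANNOT_VERIFY') {
    return 'review';
  }
  return 'accept';
}

const clean = triage({ error: null, confidenceScore: 0.94, verificationStatus: 'PASSED' });
const shaky = triage({ error: null, confidenceScore: 0.78, verificationStatus: null });
const failed = triage({
  error: { code: 'EXAMPLE_CODE', message: 'placeholder error, not a real Parsefy code' },
  confidenceScore: null,
  verificationStatus: null,
});
// clean is 'accept', shaky is 'review', failed is 'reject'
```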

Next steps

Schema Basics

Learn how to define extraction schemas with required vs optional fields

Confidence Scores

Understanding field-level confidence and the fallback mechanism

Extraction Rules

Improve accuracy with custom rules

API Reference

Full endpoint documentation
