
What is a Schema?

A schema defines the structure of data you want to extract from your documents. Parsefy uses JSON Schema to understand exactly what fields to extract, their types, and any validation rules.
If you’re using our SDKs, you can define schemas using Pydantic models (Python) or Zod schemas (TypeScript) instead of raw JSON Schema.

Basic Structure

Every Parsefy schema is a JSON object with these key properties:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "What this field contains"
    }
  },
  "required": ["field_name"]
}

Schema Properties

| Property | Required | Description |
| --- | --- | --- |
| type | Yes | Always "object" for the root schema |
| properties | Yes | Object containing field definitions |
| required | No | Array of required field names |

Add description to each field to help the AI understand what to extract. Field-level descriptions are much more valuable than top-level schema descriptions.
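
If you build raw JSON Schema by hand, a quick local check of the three root properties can catch mistakes before you send a request. The sketch below is only an illustration: the `RootSchema` type and the `checkRootSchema` helper are hypothetical names, not part of the Parsefy SDKs.

```typescript
// Hypothetical types for illustration; not part of the Parsefy SDKs.
type FieldSchema = {
  type: string;
  description?: string;
};

type RootSchema = {
  type: string;
  properties?: Record<string, FieldSchema>;
  required?: string[];
};

// Returns a list of problems; an empty list means the basic shape looks right.
function checkRootSchema(schema: RootSchema): string[] {
  const problems: string[] = [];
  if (schema.type !== "object") {
    problems.push('root "type" must be "object"');
  }
  if (schema.properties === undefined) {
    problems.push('"properties" is missing');
  }
  // Every name listed in "required" should also be defined in "properties".
  const known = new Set(Object.keys(schema.properties ?? {}));
  for (const fieldName of schema.required ?? []) {
    if (!known.has(fieldName)) {
      problems.push(`"required" names an undefined field: ${fieldName}`);
    }
  }
  return problems;
}

const basicSchema: RootSchema = {
  type: "object",
  properties: {
    field_name: { type: "string", description: "What this field contains" },
  },
  required: ["field_name"],
};

const basicProblems = checkRootSchema(basicSchema);
```

For the basic schema shown earlier, `basicProblems` is empty. The check is deliberately shallow: it does not validate nested objects or field types.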

⚠️ Required vs Optional Fields (Critical for Billing)

All fields are required by default in both SDKs. This significantly impacts your costs because required fields that return null or fall below the confidence threshold trigger the expensive fallback model.

How It Works

| User writes (SDK) | SDK converts to (JSON Schema) | API interprets as |
| --- | --- | --- |
| name: z.string() | required: ["name"] | Required: triggers fallback if low confidence |
| name: z.string().optional() | required: [] | Optional: won’t trigger fallback |
| name: str (Python) | required: ["name"] | Required: triggers fallback if low confidence |
| name: str \| None = None | required: [] | Optional: won’t trigger fallback |
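
To make the mapping concrete, here is a simplified sketch of the conversion step. It does not use Zod or the real SDK code: `FieldSpec` and `toJsonSchema` are hypothetical stand-ins that only mimic how an `optional` marker decides whether a field ends up in `required`.

```typescript
// Hypothetical, simplified field description; the real SDKs accept Zod or Pydantic models.
type FieldSpec = {
  type: "string" | "number" | "integer" | "boolean";
  description: string;
  optional?: boolean;
};

type JsonSchema = {
  type: "object";
  properties: Record<string, { type: string; description: string }>;
  required: string[];
};

// Every field lands in "properties"; only non-optional fields land in "required".
function toJsonSchema(fields: Record<string, FieldSpec>): JsonSchema {
  const schema: JsonSchema = { type: "object", properties: {}, required: [] };
  for (const [fieldName, spec] of Object.entries(fields)) {
    schema.properties[fieldName] = { type: spec.type, description: spec.description };
    if (!spec.optional) {
      schema.required.push(fieldName);
    }
  }
  return schema;
}

const converted = toJsonSchema({
  invoice_number: { type: "string", description: "The invoice number" },
  vendor: { type: "string", description: "Vendor name", optional: true },
});
```

Here `converted.required` contains only `invoice_number`, while both fields appear under `properties`.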

Why This Matters

If a required field returns null or falls below the confidence_threshold:
  1. The API automatically triggers the fallback model (Tier 2)
  2. Tier 2 is significantly more expensive
  3. Your costs increase unexpectedly
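
The trigger condition can be sketched as a small function. This is an illustration of the rule described above, not Parsefy's actual implementation: the `ExtractedField` shape, the `fieldsTriggeringFallback` helper, and the 0.85 threshold are all assumptions made for the example, as is treating an absent field the same as a null one.

```typescript
// Hypothetical shape for illustration; the API's internal data model may differ.
type ExtractedField = {
  value: unknown; // null when nothing was extracted
  confidence: number; // between 0 and 1
};

// Returns the required fields that would trigger the fallback under the rule
// above: a null (or absent) value, or a confidence below the threshold.
function fieldsTriggeringFallback(
  result: Record<string, ExtractedField>,
  required: string[],
  confidenceThreshold: number,
): string[] {
  return required.filter((fieldName) => {
    const field = result[fieldName];
    if (field === undefined || field.value === null) return true;
    return field.confidence < confidenceThreshold;
  });
}

// Made-up first-pass result for the example.
const firstPass: Record<string, ExtractedField> = {
  invoice_number: { value: "INV-2024-001", confidence: 0.98 },
  total: { value: 1500, confidence: 0.62 },
  notes: { value: null, confidence: 0 },
};

// "notes" is optional, so it is not listed as required and is never checked.
const triggering = fieldsTriggeringFallback(firstPass, ["invoice_number", "total"], 0.85);
```

In this example only `total` triggers the fallback. If `notes` had been left as required, its null value would trigger it too, which is the cost the next section helps you avoid.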

Best Practice: Mark Optional Fields

import * as z from 'zod';

const invoiceSchema = z.object({
  // REQUIRED - Core financial data that's always present
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount including tax'),

  // OPTIONAL - May not appear on all invoices
  vendor: z.string().optional().describe('Vendor name'),
  tax_id: z.string().optional().describe('Tax ID or VAT number'),
  due_date: z.string().optional().describe('Payment due date'),
  notes: z.string().optional().describe('Additional notes'),
});
Rule of thumb: If a field might be missing in >20% of your documents, mark it as optional.
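
If you have a few labelled or previously extracted documents, you can estimate which fields cross that 20% line. The helper below is a hypothetical sketch (`suggestOptionalFields` is not an SDK function) and the sample records are invented for the example.

```typescript
// Hypothetical helper: list the fields that are null or absent in more than
// `maxMissingRate` of the sample records.
function suggestOptionalFields(
  samples: Array<Record<string, unknown>>,
  fieldNames: string[],
  maxMissingRate = 0.2,
): string[] {
  if (samples.length === 0) return [];
  return fieldNames.filter((fieldName) => {
    const missing = samples.filter(
      (sample) => sample[fieldName] === null || sample[fieldName] === undefined,
    ).length;
    return missing / samples.length > maxMissingRate;
  });
}

// Invented sample data: due_date is missing in 2 of 4 records.
const sampleInvoices: Array<Record<string, unknown>> = [
  { invoice_number: "A-1", total: 100, due_date: "2024-02-01" },
  { invoice_number: "A-2", total: 250, due_date: null },
  { invoice_number: "A-3", total: 75 },
  { invoice_number: "A-4", total: 310, due_date: "2024-03-15" },
];

const optionalCandidates = suggestOptionalFields(sampleInvoices, [
  "invoice_number",
  "total",
  "due_date",
]);
```

With these samples, `due_date` is the only candidate, since it is missing in half of the records. A handful of samples gives a rough estimate at best, so treat the result as a starting point.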

Field Types

Parsefy supports all standard JSON Schema types:
String:
{
  "invoice_number": {
    "type": "string",
    "description": "The invoice or receipt number"
  }
}
Number:
{
  "total_amount": {
    "type": "number",
    "description": "Total amount due in dollars"
  }
}
For integers only:
{
  "quantity": {
    "type": "integer",
    "description": "Number of items"
  }
}
Boolean:
{
  "is_paid": {
    "type": "boolean",
    "description": "Whether the invoice has been paid"
  }
}
Array:
{
  "line_items": {
    "type": "array",
    "description": "List of items on the invoice",
    "items": {
      "type": "object",
      "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "price": {"type": "number"}
      }
    }
  }
}
Object:
{
  "vendor": {
    "type": "object",
    "description": "Vendor information",
    "properties": {
      "name": {"type": "string"},
      "address": {"type": "string"},
      "phone": {"type": "string"}
    }
  }
}
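
Each fragment above is a single entry for the root schema's `properties` object, not a complete schema on its own. As a minimal sketch, here is how a few of them could be combined into one root schema in plain TypeScript. The variable names are illustrative, and which fields go in `required` is an assumption made for the example.

```typescript
// Each fragment is one entry in the root schema's "properties" object.
const invoiceNumber = { type: "string", description: "The invoice or receipt number" };
const totalAmount = { type: "number", description: "Total amount due in dollars" };
const isPaid = { type: "boolean", description: "Whether the invoice has been paid" };

const combinedSchema = {
  type: "object",
  properties: {
    invoice_number: invoiceNumber,
    total_amount: totalAmount,
    is_paid: isPaid,
  },
  // Assumption for this example: only these two fields are always present.
  required: ["invoice_number", "total_amount"],
};

// Serialize to the raw JSON form described under Basic Structure.
const combinedJson = JSON.stringify(combinedSchema, null, 2);
```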

Complete Financial Document Schema

Here’s a comprehensive invoice extraction schema with proper required/optional fields:
import * as z from 'zod';

const vendorSchema = z.object({
  name: z.string().describe('Company name'),
  address: z.string().optional().describe('Full address'),
  phone: z.string().optional().describe('Phone number'),
  email: z.string().optional().describe('Email address'),
  tax_id: z.string().optional().describe('Tax ID or VAT number'),
});

const lineItemSchema = z.object({
  description: z.string().describe('Item description'),
  quantity: z.number().int().describe('Number of items'),
  unit_price: z.number().describe('Price per unit'),
  amount: z.number().describe('Line total'),
});

const invoiceSchema = z.object({
  // REQUIRED - Core financial fields
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount due including tax'),
  currency: z.string().describe('3-letter currency code (USD, EUR, etc.)'),
  line_items: z.array(lineItemSchema).describe('List of line items'),

  // OPTIONAL - May not be on all invoices
  date: z.string().optional().describe('Invoice date in YYYY-MM-DD format'),
  due_date: z.string().optional().describe('Payment due date'),
  vendor: vendorSchema.optional().describe('Vendor information'),
  subtotal: z.number().optional().describe('Subtotal before tax'),
  tax: z.number().optional().describe('Tax amount'),
  payment_terms: z.string().optional().describe('e.g., Net 30'),
});

Best Practices

Use Descriptions

Always add description fields. They help the AI understand what to look for and where.

Be Specific

“Invoice date in YYYY-MM-DD format” is better than just “date”.

Mark Optional Carefully

Fields missing in >20% of documents should be optional to avoid costly fallbacks.

Use Appropriate Types

Use number for amounts, integer for counts, string for text.
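
To apply the first practice consistently, you could scan a raw JSON Schema for fields that lack a description, including fields nested in objects and arrays. The sketch below is one possible approach: `SchemaNode` and `findUndescribedFields` are hypothetical names, and the path format simply imitates JSONPath.

```typescript
// Hypothetical minimal JSON Schema node type for this sketch.
type SchemaNode = {
  type: string;
  description?: string;
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
};

// Walks the schema and returns JSONPath-like paths of fields without a description.
function findUndescribedFields(node: SchemaNode, path = "$"): string[] {
  const missing: string[] = [];
  for (const [fieldName, child] of Object.entries(node.properties ?? {})) {
    const childPath = `${path}.${fieldName}`;
    if (!child.description || child.description.trim() === "") {
      missing.push(childPath);
    }
    // Recurse into nested objects.
    missing.push(...findUndescribedFields(child, childPath));
    // Recurse into the element schema of arrays.
    if (child.items) {
      missing.push(...findUndescribedFields(child.items, `${childPath}[]`));
    }
  }
  return missing;
}

const lineItemsSchema: SchemaNode = {
  type: "object",
  properties: {
    line_items: {
      type: "array",
      description: "List of items on the invoice",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "integer", description: "Number of items" },
        },
      },
    },
  },
};

const undescribed = findUndescribedFields(lineItemsSchema);
```

In this example the only reported path is `$.line_items[].description`, the nested field that has a type but no description.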

Do’s and Don’ts

Do: write specific, self-explanatory descriptions.
{
  "total_amount": {
    "type": "number",
    "description": "The final total amount due, including tax"
  },
  "invoice_date": {
    "type": "string",
    "description": "The date the invoice was issued, preserve original format"
  }
}

The _meta Field

Parsefy automatically injects a _meta field into every extraction response with field-level confidence:
{
  "invoice_number": "INV-2024-001",
  "total": 1500.00,
  "_meta": {
    "confidence_score": 0.95,
    "field_confidence": [
      { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "INV-2024-001" },
      { "field": "$.total", "score": 0.92, "reason": "Exact match", "page": 1, "text": "Total: $1,500.00" }
    ],
    "issues": []
  }
}
You don’t need to include _meta in your schema; it’s added automatically.
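
A common use of `_meta` is to flag fields for manual review. The sketch below reads the structure shown in the example response. The types are inferred from that example rather than taken from an official type definition, and `lowConfidenceFields` and the 0.95 cutoff are hypothetical choices for illustration.

```typescript
// Types inferred from the example response; treat them as assumptions.
type FieldConfidence = {
  field: string; // JSONPath-style pointer such as "$.total"
  score: number;
  reason: string;
  page: number;
  text: string;
};

type ExtractionMeta = {
  confidence_score: number;
  field_confidence: FieldConfidence[];
  issues: unknown[];
};

// Returns the entries whose score is below `minScore`, lowest score first.
function lowConfidenceFields(meta: ExtractionMeta, minScore: number): FieldConfidence[] {
  return meta.field_confidence
    .filter((entry) => entry.score < minScore)
    .sort((a, b) => a.score - b.score);
}

// The _meta object from the example response.
const exampleMeta: ExtractionMeta = {
  confidence_score: 0.95,
  field_confidence: [
    { field: "$.invoice_number", score: 0.98, reason: "Exact match", page: 1, text: "INV-2024-001" },
    { field: "$.total", score: 0.92, reason: "Exact match", page: 1, text: "Total: $1,500.00" },
  ],
  issues: [],
};

const needsReview = lowConfidenceFields(exampleMeta, 0.95);
```

With a cutoff of 0.95, only the `$.total` entry (score 0.92) is returned. The `page` and `text` values on each entry can then point a reviewer to the source location.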
