Schema Basics

What is a Schema?

A schema defines the structure of data you want to extract from your documents. Parsefy uses JSON Schema to understand exactly what fields to extract, their types, and any validation rules.

If you’re using our SDKs, you can define schemas using Pydantic models (Python) or Zod schemas (TypeScript) instead of raw JSON Schema.

Basic Structure

Every Parsefy schema is a JSON object with these key properties:

{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "What this field contains"
    }
  },
  "required": ["field_name"]
}

Schema Properties

| Property | Required | Description |
| --- | --- | --- |
| `type` | Yes | Always `"object"` for the root schema |
| `properties` | Yes | Object containing field definitions |
| `required` | No | Array of required field names |

Add a `description` to each field to help the AI understand what to extract. Field-level descriptions are much more valuable than a top-level schema description.
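As a minimal sketch of the root properties described above, the following TypeScript checks a schema object for the most common structural mistakes. The names `FieldSchema`, `RootSchema`, and `checkRootSchema` are hypothetical and are not part of the Parsefy SDK; the checks only mirror the rules stated in this section.

```typescript
// Hypothetical types for illustration; not part of the Parsefy SDK.
interface FieldSchema {
  type: string;
  description?: string;
}

interface RootSchema {
  type: string;
  properties: Record<string, FieldSchema>;
  required?: string[];
}

// Returns a list of problems found in a root schema (empty when none are found).
function checkRootSchema(schema: RootSchema): string[] {
  const problems: string[] = [];
  if (schema.type !== "object") {
    problems.push('root "type" must be "object"');
  }
  // Every name listed in "required" should also be defined in "properties".
  for (const fieldName of schema.required ?? []) {
    if (!(fieldName in schema.properties)) {
      problems.push(`required field "${fieldName}" is not defined in "properties"`);
    }
  }
  // Descriptions are not mandatory, but this section recommends one per field.
  for (const [fieldName, field] of Object.entries(schema.properties)) {
    if (!field.description) {
      problems.push(`field "${fieldName}" has no description`);
    }
  }
  return problems;
}

const example: RootSchema = {
  type: "object",
  properties: {
    field_name: { type: "string", description: "What this field contains" },
  },
  required: ["field_name"],
};

console.log(checkRootSchema(example)); // prints []
```

A check like this could run in a unit test so that a schema with a missing description or a misspelled required field is caught before it is sent for extraction.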

⚠️ Required vs Optional Fields (Critical for Billing)

All fields are required by default in both SDKs. This significantly impacts your costs because required fields that return null or fall below the confidence threshold trigger the expensive fallback model.

How It Works

| User writes (SDK) | SDK converts to (JSON Schema) | API interprets as |
| --- | --- | --- |
| `name: z.string()` | `required: ["name"]` | Required: triggers fallback if low confidence |
| `name: z.string().optional()` | `required: []` | Optional: won’t trigger fallback |
| `name: str` (Python) | `required: ["name"]` | Required: triggers fallback if low confidence |
| `name: str \| None = None` | `required: []` | Optional: won’t trigger fallback |
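The mapping in the table can be sketched in plain TypeScript. This is an illustration of the idea only, not the SDK's actual conversion code: `FieldDef` and `toJsonSchema` are hypothetical names, and the real SDKs derive the schema from Zod or Pydantic definitions for you.

```typescript
// Hypothetical field definition used only for this illustration.
interface FieldDef {
  type: "string" | "number" | "integer" | "boolean";
  description: string;
  optional?: boolean; // omitted means "required", matching the SDK default
}

interface JsonSchema {
  type: "object";
  properties: Record<string, { type: string; description: string }>;
  required: string[];
}

// Builds a root JSON Schema; every field not marked optional lands in "required".
function toJsonSchema(fields: Record<string, FieldDef>): JsonSchema {
  const schema: JsonSchema = { type: "object", properties: {}, required: [] };
  for (const [fieldName, def] of Object.entries(fields)) {
    schema.properties[fieldName] = { type: def.type, description: def.description };
    if (!def.optional) {
      schema.required.push(fieldName);
    }
  }
  return schema;
}

const customerSchema = toJsonSchema({
  name: { type: "string", description: "Customer name" },
  nickname: { type: "string", description: "Preferred name", optional: true },
});

console.log(customerSchema.required); // prints [ 'name' ]
```

The point to notice is the default: a field ends up in `required` unless it is explicitly marked optional.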

Why This Matters

If a required field returns null or falls below the `confidence_threshold`:

1. The API automatically triggers the fallback model (Tier 2)
2. Tier 2 is significantly more expensive
3. Your costs increase unexpectedly
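The condition above can be expressed as a small predicate. This is a sketch of the rule as described in this section, not the API's actual implementation: `fieldsTriggeringFallback` is a hypothetical helper, and the threshold of 0.85 in the example is an arbitrary illustrative value.

```typescript
type Extracted = Record<string, unknown>;

// Returns the required fields that are null/missing or scored below the threshold.
// Optional fields are never considered, so they cannot trigger the fallback.
function fieldsTriggeringFallback(
  required: string[],
  values: Extracted,
  scores: Record<string, number>,
  confidenceThreshold: number,
): string[] {
  return required.filter((field) => {
    const value = values[field];
    const score = scores[field];
    const isNull = value === null || value === undefined;
    const isLowConfidence = score !== undefined && score < confidenceThreshold;
    return isNull || isLowConfidence;
  });
}

const triggering = fieldsTriggeringFallback(
  ["invoice_number", "total"], // required fields
  { invoice_number: "INV-2024-001", total: null, notes: null },
  { invoice_number: 0.98 },
  0.85, // illustrative threshold, not a documented default
);

console.log(triggering); // prints [ 'total' ]
```

Here `notes` is also null, but because it is not in the required list it does not count toward the fallback.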

Best Practice: Mark Optional Fields

import * as z from 'zod';

const invoiceSchema = z.object({
  // REQUIRED - Core financial data that's always present
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount including tax'),

  // OPTIONAL - May not appear on all invoices
  vendor: z.string().optional().describe('Vendor name'),
  tax_id: z.string().optional().describe('Tax ID or VAT number'),
  due_date: z.string().optional().describe('Payment due date'),
  notes: z.string().optional().describe('Additional notes'),
});

Rule of thumb: if a field might be missing from more than 20% of your documents, mark it as optional.
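One way to apply this rule of thumb is to measure how often each field is missing in a sample of past extractions. The sketch below assumes you have such a sample as plain objects; `missingRate` and `suggestOptional` are hypothetical helpers, and the sample data is made up for illustration.

```typescript
type Extraction = Record<string, unknown>;

// Fraction of samples in which the field is null or absent.
function missingRate(samples: Extraction[], field: string): number {
  if (samples.length === 0) return 0;
  const missing = samples.filter(
    (sample) => sample[field] === null || sample[field] === undefined,
  ).length;
  return missing / samples.length;
}

// Fields whose missing rate is strictly above the cutoff (20% by default).
function suggestOptional(samples: Extraction[], fields: string[], cutoff = 0.2): string[] {
  return fields.filter((field) => missingRate(samples, field) > cutoff);
}

// Made-up sample: due_date is missing in 2 of 5 documents, notes in 4 of 5.
const samples: Extraction[] = [
  { invoice_number: "A-1", total: 10, due_date: "2024-01-31", notes: null },
  { invoice_number: "A-2", total: 20, due_date: null, notes: null },
  { invoice_number: "A-3", total: 30, notes: "Rush order" },
  { invoice_number: "A-4", total: 40, due_date: "2024-03-01" },
  { invoice_number: "A-5", total: 50, due_date: "2024-03-09", notes: null },
];

console.log(suggestOptional(samples, ["invoice_number", "total", "due_date", "notes"]));
// prints [ 'due_date', 'notes' ]
```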

Field Types

Parsefy supports all standard JSON Schema types:

String

{
  "invoice_number": {
    "type": "string",
    "description": "The invoice or receipt number"
  }
}

Number

{
  "total_amount": {
    "type": "number",
    "description": "Total amount due in dollars"
  }
}

For integers only:

{
  "quantity": {
    "type": "integer",
    "description": "Number of items"
  }
}

Boolean

{
  "is_paid": {
    "type": "boolean",
    "description": "Whether the invoice has been paid"
  }
}

Array

{
  "line_items": {
    "type": "array",
    "description": "List of items on the invoice",
    "items": {
      "type": "object",
      "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "price": {"type": "number"}
      }
    }
  }
}

Nested Object

{
  "vendor": {
    "type": "object",
    "description": "Vendor information",
    "properties": {
      "name": {"type": "string"},
      "address": {"type": "string"},
      "phone": {"type": "string"}
    }
  }
}

Complete Financial Document Schema

Here’s a comprehensive invoice extraction schema with proper required/optional fields:

import * as z from 'zod';

const vendorSchema = z.object({
  name: z.string().describe('Company name'),
  address: z.string().optional().describe('Full address'),
  phone: z.string().optional().describe('Phone number'),
  email: z.string().optional().describe('Email address'),
  tax_id: z.string().optional().describe('Tax ID or VAT number'),
});

const lineItemSchema = z.object({
  description: z.string().describe('Item description'),
  quantity: z.number().int().describe('Number of items'),
  unit_price: z.number().describe('Price per unit'),
  amount: z.number().describe('Line total'),
});

const invoiceSchema = z.object({
  // REQUIRED - Core financial fields
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount due including tax'),
  currency: z.string().describe('3-letter currency code (USD, EUR, etc.)'),
  line_items: z.array(lineItemSchema).describe('List of line items'),

  // OPTIONAL - May not be on all invoices
  date: z.string().optional().describe('Invoice date in YYYY-MM-DD format'),
  due_date: z.string().optional().describe('Payment due date'),
  vendor: vendorSchema.optional().describe('Vendor information'),
  subtotal: z.number().optional().describe('Subtotal before tax'),
  tax: z.number().optional().describe('Tax amount'),
  payment_terms: z.string().optional().describe('e.g., Net 30'),
});

Best Practices

Use Descriptions

Always add description fields. They help the AI understand what to look for and where.

Be Specific

“Invoice date in YYYY-MM-DD format” is better than just “date”.

Mark Optional Carefully

Fields missing from more than 20% of documents should be optional to avoid costly fallbacks.

Use Appropriate Types

Use `number` for amounts, `integer` for counts, and `string` for text.

Do’s and Don’ts

Do

{
  "total_amount": {
    "type": "number",
    "description": "The final total amount due, including tax"
  },
  "invoice_date": {
    "type": "string",
    "description": "The date the invoice was issued, preserve original format"
  }
}

Don’t

{
  "total": {
    "type": "string"
  },
  "date": {
    "type": "string"
  }
}

Problems with the “Don’t” example:

- Missing descriptions
- `total` should be `number`, not `string`
- Vague field names without context

The `_meta` Field

Parsefy automatically injects a `_meta` field into every extraction response. It carries an overall confidence score and field-level confidence details:

{
  "invoice_number": "INV-2024-001",
  "total": 1500.00,
  "_meta": {
    "confidence_score": 0.95,
    "field_confidence": [
      { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "INV-2024-001" },
      { "field": "$.total", "score": 0.92, "reason": "Exact match", "page": 1, "text": "Total: $1,500.00" }
    ],
    "issues": []
  }
}

You don’t need to include _meta in your schema; it’s added automatically.
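As a minimal sketch of how the `_meta` block might be consumed, the TypeScript below lists fields whose score falls under a chosen threshold. The `FieldConfidence` and `Meta` types are inferred from the sample response above and are assumptions, not an official type definition; the shape of `issues` is unknown, so it is typed loosely. `lowConfidenceFields` is a hypothetical helper.

```typescript
// Types inferred from the sample response; treat them as assumptions.
interface FieldConfidence {
  field: string; // path such as "$.total"
  score: number;
  reason: string;
  page: number;
  text: string;
}

interface Meta {
  confidence_score: number;
  field_confidence: FieldConfidence[];
  issues: unknown[]; // element shape is not shown in the sample
}

// Returns the paths of fields scored strictly below the threshold.
function lowConfidenceFields(meta: Meta, threshold: number): string[] {
  return meta.field_confidence
    .filter((entry) => entry.score < threshold)
    .map((entry) => entry.field);
}

const meta: Meta = {
  confidence_score: 0.95,
  field_confidence: [
    { field: "$.invoice_number", score: 0.98, reason: "Exact match", page: 1, text: "INV-2024-001" },
    { field: "$.total", score: 0.92, reason: "Exact match", page: 1, text: "Total: $1,500.00" },
  ],
  issues: [],
};

console.log(lowConfidenceFields(meta, 0.95)); // prints [ '$.total' ]
```

A list like this could be used to route individual fields to manual review instead of rejecting the whole extraction.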

Next Steps

Extraction Rules

Learn how to use custom rules to improve accuracy

Confidence Scores

Understanding the confidence scoring system
