What is a Schema?
A schema defines the structure of data you want to extract from your documents. Parsefy uses JSON Schema to understand exactly what fields to extract, their types, and any validation rules.
If you’re using our SDKs, you can define schemas using Pydantic models (Python) or Zod schemas (TypeScript) instead of raw JSON Schema.
Basic Structure
Every Parsefy schema is a JSON object with these key properties:
{
"type" : "object" ,
"properties" : {
"field_name" : {
"type" : "string" ,
"description" : "What this field contains"
}
},
"required" : [ "field_name" ]
}
Schema Properties
Property Required Description typeYes Always "object" for the root schema propertiesYes Object containing field definitions requiredNo Array of required field names
Add description to each field to help the AI understand what to extract. Field-level descriptions are much more valuable than top-level schema descriptions.
⚠️ Required vs Optional Fields (Critical for Billing)
All fields are required by default in both SDKs. This significantly impacts your costs because required fields that return null or fall below the confidence threshold trigger the expensive fallback model.
How It Works
User writes (SDK) SDK converts to (JSON Schema) API interprets as name: z.string()required: ["name"]Required : triggers fallback if low confidencename: z.string().optional()required: []Optional : won’t trigger fallbackname: str (Python)required: ["name"]Required : triggers fallback if low confidencename: str | None = Nonerequired: []Optional : won’t trigger fallback
Why This Matters
If a required field returns null or falls below the confidence_threshold:
The API automatically triggers the fallback model (Tier 2)
Tier 2 is significantly more expensive
Your costs increase unexpectedly
Best Practice: Mark Optional Fields
TypeScript (Zod)
Python (Pydantic)
JSON Schema
const invoiceSchema = z . object ({
// REQUIRED - Core financial data that's always present
invoice_number: z . string (). describe ( 'The invoice number' ),
total: z . number (). describe ( 'Total amount including tax' ),
// OPTIONAL - May not appear on all invoices
vendor: z . string (). optional (). describe ( 'Vendor name' ),
tax_id: z . string (). optional (). describe ( 'Tax ID or VAT number' ),
due_date: z . string (). optional (). describe ( 'Payment due date' ),
notes: z . string (). optional (). describe ( 'Additional notes' ),
});
Rule of thumb: If a field might be missing in >20% of your documents, mark it as optional.
Field Types
Parsefy supports all standard JSON Schema types:
{
"invoice_number" : {
"type" : "string" ,
"description" : "The invoice or receipt number"
}
}
{
"total_amount" : {
"type" : "number" ,
"description" : "Total amount due in dollars"
}
}
For integers only: {
"quantity" : {
"type" : "integer" ,
"description" : "Number of items"
}
}
{
"is_paid" : {
"type" : "boolean" ,
"description" : "Whether the invoice has been paid"
}
}
{
"line_items" : {
"type" : "array" ,
"description" : "List of items on the invoice" ,
"items" : {
"type" : "object" ,
"properties" : {
"description" : { "type" : "string" },
"quantity" : { "type" : "integer" },
"price" : { "type" : "number" }
}
}
}
}
{
"vendor" : {
"type" : "object" ,
"description" : "Vendor information" ,
"properties" : {
"name" : { "type" : "string" },
"address" : { "type" : "string" },
"phone" : { "type" : "string" }
}
}
}
Complete Financial Document Schema
Here’s a comprehensive invoice extraction schema with proper required/optional fields:
TypeScript (Zod)
Python (Pydantic)
JSON Schema
import * as z from 'zod' ;
const vendorSchema = z . object ({
name: z . string (). describe ( 'Company name' ),
address: z . string (). optional (). describe ( 'Full address' ),
phone: z . string (). optional (). describe ( 'Phone number' ),
email: z . string (). optional (). describe ( 'Email address' ),
tax_id: z . string (). optional (). describe ( 'Tax ID or VAT number' ),
});
const lineItemSchema = z . object ({
description: z . string (). describe ( 'Item description' ),
quantity: z . number (). describe ( 'Number of items' ),
unit_price: z . number (). describe ( 'Price per unit' ),
amount: z . number (). describe ( 'Line total' ),
});
const invoiceSchema = z . object ({
// REQUIRED - Core financial fields
invoice_number: z . string (). describe ( 'The invoice number' ),
total: z . number (). describe ( 'Total amount due including tax' ),
currency: z . string (). describe ( '3-letter currency code (USD, EUR, etc.)' ),
line_items: z . array ( lineItemSchema ). describe ( 'List of line items' ),
// OPTIONAL - May not be on all invoices
date: z . string (). optional (). describe ( 'Invoice date in YYYY-MM-DD format' ),
due_date: z . string (). optional (). describe ( 'Payment due date' ),
vendor: vendorSchema . optional (). describe ( 'Vendor information' ),
subtotal: z . number (). optional (). describe ( 'Subtotal before tax' ),
tax: z . number (). optional (). describe ( 'Tax amount' ),
payment_terms: z . string (). optional (). describe ( 'e.g., Net 30' ),
});
Best Practices
Use Descriptions Always add description fields. They help the AI understand what to look for and where.
Be Specific “Invoice date in YYYY-MM-DD format” is better than just “date”.
Mark Optional Carefully Fields missing in >20% of documents should be optional to avoid costly fallbacks.
Use Appropriate Types Use number for amounts, integer for counts, string for text.
Do’s and Don’ts
{
"total_amount" : {
"type" : "number" ,
"description" : "The final total amount due, including tax"
},
"invoice_date" : {
"type" : "string" ,
"description" : "The date the invoice was issued, preserve original format"
}
}
{
"total" : {
"type" : "string"
},
"date" : {
"type" : "string"
}
}
Missing descriptions
total should be number, not string
Vague field names without context
Parsefy automatically injects a _meta field into every extraction response with field-level confidence:
{
"invoice_number" : "INV-2024-001" ,
"total" : 1500.00 ,
"_meta" : {
"confidence_score" : 0.95 ,
"field_confidence" : [
{ "field" : "$.invoice_number" , "score" : 0.98 , "reason" : "Exact match" , "page" : 1 , "text" : "INV-2024-001" },
{ "field" : "$.total" , "score" : 0.92 , "reason" : "Exact match" , "page" : 1 , "text" : "Total: $1,500.00" }
],
"issues" : []
}
}
You don’t need to include _meta in your schema; it’s added automatically.
Next Steps