
What Are Extraction Rules?

Extraction rules are custom instructions you can add to any field in your schema. They tell the AI exactly how to find and extract specific data, improving accuracy for complex or ambiguous documents.
Rules are a Parsefy extension to JSON Schema. The rules keyword is not part of the JSON Schema specification, but the Parsefy API fully supports it.

Adding Rules

Add a rules array to any field definition:
{
  "type": "object",
  "properties": {
    "invoice_date": {
      "type": "string",
      "description": "The invoice date",
      "rules": [
        "Look for labels like 'Invoice Date', 'Date', or 'Issued On'",
        "Format should be preserved exactly as shown in the document",
        "If multiple dates exist, use the one near 'Invoice Date' label"
      ]
    }
  }
}
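
If you build schemas in code rather than writing them by hand, a small helper can keep the rules array next to the type and description. The following is a minimal sketch using only the Python standard library. The helper name field_with_rules is hypothetical and is not part of any Parsefy SDK.

```python
import json


def field_with_rules(field_type, description, rules):
    """Build one field definition that carries a rules array.

    Hypothetical helper for illustration; not part of any SDK.
    """
    return {
        "type": field_type,
        "description": description,
        "rules": list(rules),
    }


schema = {
    "type": "object",
    "properties": {
        "invoice_date": field_with_rules(
            "string",
            "The invoice date",
            [
                "Look for labels like 'Invoice Date', 'Date', or 'Issued On'",
                "Format should be preserved exactly as shown in the document",
                "If multiple dates exist, use the one near 'Invoice Date' label",
            ],
        )
    },
}

# Serialize to the same JSON shape as the hand-written schema above.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```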

Rule Examples

Finding Specific Fields

{
  "amount_due": {
    "type": "number",
    "description": "Total amount due",
    "rules": [
      "This is usually the largest/most prominent amount on the invoice",
      "May be labeled 'Total', 'Amount Due', 'Balance Due', or 'Grand Total'",
      "Typically located at the bottom right of the document"
    ]
  }
}

Handling Ambiguity

{
  "customer_name": {
    "type": "string",
    "description": "Name of the customer",
    "rules": [
      "Look for 'Bill To', 'Customer', or 'Sold To' sections",
      "This is NOT the vendor/seller name",
      "Usually appears on the left side of the document"
    ]
  }
}

Date Formatting

{
  "due_date": {
    "type": "string",
    "description": "Payment due date",
    "rules": [
      "Look for 'Due Date', 'Payment Due', or 'Pay By' labels",
      "Preserve the exact format from the document",
      "If not found, return null - do not guess"
    ]
  }
}

Table Extraction

{
  "line_items": {
    "type": "array",
    "description": "List of items/services",
    "rules": [
      "Extract from the main itemized table/list",
      "Each row is one item - include all columns",
      "Skip header rows and summary rows"
    ],
    "items": {
      "type": "object",
      "properties": {
        "description": {
          "type": "string",
          "rules": ["Usually the longest text field in each row"]
        },
        "quantity": {
          "type": "integer",
          "rules": ["Look for 'Qty', 'Quantity', or 'Units' column"]
        },
        "unit_price": {
          "type": "number",
          "rules": ["Look for 'Price', 'Rate', or 'Unit Price' column"]
        },
        "total": {
          "type": "number",
          "rules": ["Look for 'Amount', 'Total', or 'Line Total' column"]
        }
      }
    }
  }
}

Complete Example

Here’s a receipt extraction schema with comprehensive rules:
{
  "type": "object",
  "properties": {
    "merchant": {
      "type": "object",
      "description": "Store/merchant information",
      "properties": {
        "name": {
          "type": "string",
          "description": "Store name",
          "rules": [
            "Usually at the very top of the receipt",
            "Often in large or bold text",
            "May include 'Inc.', 'LLC', etc."
          ]
        },
        "address": {
          "type": "string",
          "description": "Store address",
          "rules": [
            "Located below the store name",
            "Combine street, city, state, zip into one string"
          ]
        },
        "phone": {
          "type": "string",
          "description": "Store phone number",
          "rules": [
            "Usually near the address",
            "Format: (XXX) XXX-XXXX or XXX-XXX-XXXX"
          ]
        }
      }
    },
    "transaction": {
      "type": "object",
      "properties": {
        "date": {
          "type": "string",
          "description": "Transaction date",
          "rules": [
            "Look for 'Date:', 'Transaction Date', or standalone date",
            "Preserve original format"
          ]
        },
        "time": {
          "type": "string",
          "description": "Transaction time",
          "rules": [
            "Usually appears with or near the date",
            "Format: HH:MM AM/PM or 24-hour"
          ]
        },
        "receipt_number": {
          "type": "string",
          "description": "Receipt or transaction number",
          "rules": [
            "Look for 'Receipt #', 'Trans #', 'Order #'",
            "May be a long numeric or alphanumeric code"
          ]
        }
      }
    },
    "items": {
      "type": "array",
      "description": "Purchased items",
      "rules": [
        "Extract each line item from the receipt",
        "Include quantity if shown (e.g., '2 @ $5.00')",
        "Skip subtotal, tax, and total lines"
      ],
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "rules": ["Item description, may be abbreviated"]
          },
          "quantity": {
            "type": "integer",
            "rules": ["Default to 1 if not explicitly shown"]
          },
          "price": {
            "type": "number",
            "rules": ["Price for this line (may be unit or total)"]
          }
        }
      }
    },
    "subtotal": {
      "type": "number",
      "description": "Subtotal before tax",
      "rules": [
        "Look for 'Subtotal', 'Sub-total', or 'Sub Total'",
        "Appears before tax calculation"
      ]
    },
    "tax": {
      "type": "number",
      "description": "Tax amount",
      "rules": [
        "Look for 'Tax', 'Sales Tax', or tax percentage label",
        "Extract the dollar amount, not the percentage"
      ]
    },
    "total": {
      "type": "number",
      "description": "Total amount paid",
      "rules": [
        "The final amount, after tax",
        "Look for 'Total', 'Grand Total', 'Amount Due'",
        "Usually the most prominent number at the bottom"
      ]
    },
    "payment_method": {
      "type": "string",
      "description": "How the purchase was paid",
      "rules": [
        "Look for 'Visa', 'Mastercard', 'Cash', 'Debit', etc.",
        "May show last 4 digits of card",
        "Extract just the payment type"
      ]
    }
  },
  "required": ["merchant", "total"]
}
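
With rules spread across nested objects and array items, it can help to list them all in one place when reviewing a schema. The sketch below walks a schema dictionary and collects every rule together with the path of the field it belongs to. The function name collect_rules and the path notation (dots for nesting, [] for array items) are illustrative choices, not part of the Parsefy API. The sample schema is a trimmed-down version of the receipt schema above.

```python
def collect_rules(schema, path=""):
    """Yield (field_path, rule) pairs for every rules array in a schema."""
    for rule in schema.get("rules", []):
        yield (path or "<root>", rule)
    # Descend into object properties.
    for name, sub in schema.get("properties", {}).items():
        child = f"{path}.{name}" if path else name
        yield from collect_rules(sub, child)
    # Descend into the array item definition, if there is one.
    items = schema.get("items")
    if isinstance(items, dict):
        yield from collect_rules(items, f"{path}[]")


receipt_schema = {
    "type": "object",
    "properties": {
        "merchant": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "rules": ["Usually at the very top of the receipt"],
                }
            },
        },
        "items": {
            "type": "array",
            "rules": ["Skip subtotal, tax, and total lines"],
            "items": {
                "type": "object",
                "properties": {
                    "quantity": {
                        "type": "integer",
                        "rules": ["Default to 1 if not explicitly shown"],
                    }
                },
            },
        },
    },
}

all_rules = list(collect_rules(receipt_schema))
for field_path, rule in all_rules:
    print(f"{field_path}: {rule}")
```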

Rule Writing Tips

Be Specific

Mention exact labels and positions the AI should look for.

Handle Edge Cases

Include fallback instructions for when the primary approach doesn’t work.

Prevent Confusion

Explicitly state what NOT to extract to avoid mixing similar fields.

Keep It Short

Each rule should be one clear instruction. Multiple short rules beat one long paragraph.
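
One way to apply the "keep it short" tip consistently is a quick check before sending a schema. The sketch below is a rough heuristic, not a Parsefy feature: the function name lint_rules and the 20-word threshold are assumptions chosen for illustration, and the multi-sentence check can misfire on rules that contain abbreviations.

```python
def lint_rules(rules, max_words=20):
    """Return warnings for rules that look too long or pack in several instructions."""
    warnings = []
    for index, rule in enumerate(rules):
        text = rule.strip()
        if not text:
            warnings.append(f"rule {index}: empty")
            continue
        if len(text.split()) > max_words:
            warnings.append(f"rule {index}: longer than {max_words} words")
        # A period followed by a space before the end suggests several sentences.
        if ". " in text.rstrip("."):
            warnings.append(f"rule {index}: consider splitting into separate rules")
    return warnings


short_rules = [
    "Look for 'Due Date', 'Payment Due', or 'Pay By' labels",
    "If not found, return null - do not guess",
]
compound_rules = [
    "Look for the due date near the top. If it is missing, check the footer. "
    "Never use the invoice date."
]

short_warnings = lint_rules(short_rules)
compound_warnings = lint_rules(compound_rules)
print(short_warnings)
print(compound_warnings)
```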

Effective Rule Patterns

"rules": [
  "Usually at the top right of the document",
  "Located in the header section",
  "Found near the company logo"
]

Using Rules with SDKs

Python (Pydantic)

Pass json_schema_extra to Field to attach rules to a field:
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(
        description="The invoice number",
        json_schema_extra={
            "rules": [
                "Look for 'Invoice #', 'Inv No', or 'Reference'",
                "Usually at the top right of the document"
            ]
        }
    )
    
    total: float = Field(
        description="Total amount due",
        json_schema_extra={
            "rules": [
                "The final amount including tax",
                "Look for 'Total', 'Amount Due', or 'Balance Due'"
            ]
        }
    )
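
To confirm that the rules end up in the generated schema, you can inspect what Pydantic produces. This sketch assumes Pydantic v2, where model_json_schema() builds the JSON Schema and the keys in json_schema_extra are merged into the field's definition. It only inspects the schema locally and does not call the Parsefy API.

```python
import json

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    invoice_number: str = Field(
        description="The invoice number",
        json_schema_extra={
            "rules": [
                "Look for 'Invoice #', 'Inv No', or 'Reference'",
                "Usually at the top right of the document",
            ]
        },
    )


# model_json_schema() is the Pydantic v2 method that generates the JSON Schema.
schema = Invoice.model_json_schema()
invoice_number = schema["properties"]["invoice_number"]

# The rules array appears next to type and description in the field definition.
print(json.dumps(invoice_number, indent=2))
```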

TypeScript (Zod)

Use .describe() with detailed instructions:
import * as z from 'zod';

const invoiceSchema = z.object({
  invoice_number: z.string().describe(
    'The invoice number. Look for "Invoice #", "Inv No", or "Reference". Usually at the top right.'
  ),
  
  total: z.number().describe(
    'Total amount due. The final amount including tax. Look for "Total", "Amount Due", or "Balance Due".'
  ),
});
The Zod SDK doesn't support rules arrays directly, but you can put rule-like instructions in the description and the AI will follow them.
