POST /v1/extract
Extract
curl --request POST \
  --url https://api.example.com/v1/extract \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "output_schema": "<string>",
  "confidence_threshold": 123,
  "enable_verification": true
}
'
{
  "object": {
    "_meta": {
      "confidence_score": 123,
      "field_confidence": [
        {
          "field": "<string>",
          "score": 123,
          "reason": "<string>",
          "page": 123,
          "text": "<string>"
        }
      ],
      "issues": [
        {}
      ]
    }
  },
  "metadata": {
    "processing_time_ms": 123,
    "credits": 123,
    "fallback_triggered": true
  },
  "verification": {
    "status": "<string>",
    "checks_passed": 123,
    "checks_failed": 123,
    "cannot_verify_count": 123,
    "checks_run": [
      {
        "type": "<string>",
        "status": "<string>",
        "fields": [
          {}
        ],
        "passed": true,
        "delta": 123,
        "expected": 123,
        "actual": 123
      }
    ]
  },
  "error": {
    "code": "<string>",
    "message": "<string>"
  }
}

Overview

The /v1/extract endpoint is the primary way to extract structured data from financial documents (invoices, receipts, bills). It processes PDF and DOCX files according to your JSON Schema definition and returns validated JSON with field-level confidence and evidence.
Our goal is zero silent errors: you get either validated output with field-level evidence or a clear failure reason, never unreliable data passed off as valid.

Request

file
file
required
The document to extract data from.
  • Supported formats: PDF (.pdf), Microsoft Word (.docx)
  • Maximum size: 10 MB
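The limits above can be checked on the client before uploading, which saves a round trip that would end in a 400. The following is a minimal sketch; the helper name `precheck_file` is hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the page does not say whether the limit is binary or decimal.

```python
from pathlib import Path

# Formats and size limit taken from the parameter description above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}
# Assumption: "10 MB" means 10 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 10 * 1024 * 1024


def precheck_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    return problems


print(precheck_file("invoice.pdf", 250_000))
print(precheck_file("scan.png", 11 * 1024 * 1024))
```

This only inspects the file name and size; the server may still reject a file whose contents do not match its extension.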
output_schema
string
required
A JSON Schema string defining the structure of data to extract. See the Schema Guide for detailed documentation.
Important: Fields in the required array that return null or fall below confidence_threshold trigger the fallback model (Tier 2), which is more expensive.
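Because output_schema is sent as a string, it can be convenient to build the schema as a native structure and serialize it just before the request. The sketch below does that in Python and also checks that every name in the required array is actually declared under properties. The helper name `schema_to_form_value` is hypothetical, and the consistency check is a client-side convenience, not something this page says the API enforces.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice number"},
        "total": {"type": "number", "description": "Total amount including tax"},
        "vendor": {"type": "string", "description": "Vendor name"},
    },
    # Per the note above, fields listed here trigger the Tier 2 fallback
    # when they come back null or below confidence_threshold.
    "required": ["invoice_number", "total"],
}


def schema_to_form_value(schema: dict) -> str:
    """Serialize a schema dict to the string expected in output_schema."""
    declared = schema.get("properties", {})
    unknown = [name for name in schema.get("required", []) if name not in declared]
    if unknown:
        raise ValueError(f"required fields not declared in properties: {unknown}")
    return json.dumps(schema)


print(schema_to_form_value(schema))
```

Keeping "vendor" out of the required array means a missing vendor would not, by itself, cause a fallback.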
confidence_threshold
number
default:"0.85"
Minimum confidence score (0.0 to 1.0) required before accepting Tier 1 results.
  • Lower values (e.g., 0.70): Faster and cheaper (accepts Tier 1 results more often)
  • Higher values (e.g., 0.95): More accurate but more expensive (triggers Tier 2 fallback more often)
Default: 0.85
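To get a feel for how a given threshold would treat your documents, you can compare the per-field scores in a response against candidate values. This is a client-side sketch only: the function name `fields_below_threshold` is hypothetical, the sample scores are copied from the example response on this page, and the page does not spell out exactly how the service applies the threshold internally.

```python
def fields_below_threshold(response: dict, threshold: float = 0.85) -> list[tuple[str, float]]:
    """List (field, score) pairs from _meta.field_confidence scoring under the threshold."""
    meta = (response.get("object") or {}).get("_meta", {})
    return [
        (entry["field"], entry["score"])
        for entry in meta.get("field_confidence", [])
        if entry["score"] < threshold
    ]


sample = {
    "object": {
        "_meta": {
            "field_confidence": [
                {"field": "$.invoice_number", "score": 0.98},
                {"field": "$.total", "score": 0.92},
                {"field": "$.vendor", "score": 0.90},
            ]
        }
    }
}

print(fields_below_threshold(sample, 0.85))
print(fields_below_threshold(sample, 0.95))
```

With these scores, nothing falls below the default 0.85, while a threshold of 0.95 would flag both `$.total` and `$.vendor`.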
enable_verification
boolean
default:"false"
Enable math verification to ensure extracted numeric data is mathematically consistent. When enabled, Parsefy automatically:
  • Verifies totals match subtotals + tax
  • Validates line item sums
  • Performs shadow extraction for single-field verification
Default: false
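The first two checks in the list can be reproduced on the client, which is useful for understanding what a reported delta means. The sketch below is an assumption-laden illustration: the formulas are inferred from the check descriptions and from the `fields` lists in the example responses, the sign convention of the delta is a guess, and the function names are hypothetical. It is not a description of Parsefy's internal implementation.

```python
from decimal import Decimal


def _dec(value) -> Decimal:
    # Convert via str() so a float like 1150.0 becomes Decimal("1150.0")
    # instead of carrying binary floating-point noise into the sums.
    return Decimal(str(value))


def horizontal_sum_delta(obj: dict) -> Decimal:
    """total - (subtotal + tax); zero means the totals agree."""
    return _dec(obj["total"]) - (_dec(obj["subtotal"]) + _dec(obj["tax"]))


def vertical_sum_delta(obj: dict) -> Decimal:
    """subtotal - sum of line item amounts; zero means the line items add up."""
    items_total = sum((_dec(item["amount"]) for item in obj["line_items"]), Decimal("0"))
    return _dec(obj["subtotal"]) - items_total


# Values taken from the line-item example response on this page.
invoice = {
    "line_items": [{"amount": 1000.00}, {"amount": 150.00}],
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
}

print(horizontal_sum_delta(invoice))
print(vertical_sum_delta(invoice))
```

Both deltas come out as zero for this invoice. The shadow extraction check cannot be reproduced this way, since it depends on a second extraction pass on the server.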
Authorization
string
required
Bearer token authentication. Format: Bearer pk_your_api_key

Response

object
object
The extracted data matching your schema.
metadata
object
Processing information.
verification
object
Math verification results (only present if enable_verification was true).
error
object
Present only if extraction failed.

Examples

Basic Invoice Extraction with Confidence Threshold

curl -X POST "https://api.parsefy.io/v1/extract" \
  -H "Authorization: Bearer pk_your_api_key" \
  -F "file=@invoice.pdf" \
  -F 'output_schema={
    "type": "object",
    "properties": {
      "invoice_number": {
        "type": "string",
        "description": "The invoice number"
      },
      "date": {
        "type": "string",
        "description": "Invoice date"
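      },
      "subtotal": {
        "type": "number",
        "description": "Amount before tax"
      },
      "tax": {
        "type": "number",
        "description": "Tax amount"
      },
      "total": {
        "type": "number",
        "description": "Total amount including tax"
      },
      "vendor": {
        "type": "string",
        "description": "Vendor name"
      }
    },
    "required": ["invoice_number", "total"]
  }' \
  -F "confidence_threshold=0.85" \
  -F "enable_verification=true"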
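For readers not using curl, the same multipart request can be assembled with the Python standard library. This is a sketch under stated assumptions: the helper names `encode_multipart` and `build_request` are hypothetical, the URL and form field names are copied from the curl example, and the content type attached to the file part is guessed from the file name because the page does not say whether the server inspects it.

```python
import json
import mimetypes
import urllib.request
import uuid

API_URL = "https://api.parsefy.io/v1/extract"  # URL as used in the curl example


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    # Assumption: guessing the part's content type from the extension is acceptable.
    file_type = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_request(api_key: str, file_name: str, file_bytes: bytes,
                  schema: dict, confidence_threshold: float = 0.85) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/extract."""
    fields = {
        "output_schema": json.dumps(schema),
        "confidence_threshold": str(confidence_threshold),
    }
    body, content_type = encode_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_request(key, "invoice.pdf", data, schema)) as resp:
#       result = json.load(resp)
```

Building the request separately from sending it keeps the encoding logic testable without network access. Other form fields would be added to `fields` in the same way.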

Response with Field-Level Confidence

{
  "object": {
    "invoice_number": "INV-2024-0042",
    "date": "01/15/2024",
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "vendor": "Acme Corp",
    "_meta": {
      "confidence_score": 0.94,
      "field_confidence": [
        { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "Invoice # INV-2024-0042" },
        { "field": "$.date", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Date: 01/15/2024" },
        { "field": "$.subtotal", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Subtotal: $1,150.00" },
        { "field": "$.tax", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Tax: $100.00" },
        { "field": "$.total", "score": 0.92, "reason": "Formatting ambiguous", "page": 1, "text": "Total: $1,250.00" },
        { "field": "$.vendor", "score": 0.90, "reason": "Inferred from header", "page": 1, "text": "Acme Corp" }
      ],
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 2340,
    "credits": 1,
    "fallback_triggered": false
  },
  "verification": {
    "status": "PASSED",
    "checks_passed": 1,
    "checks_failed": 0,
    "cannot_verify_count": 0,
    "checks_run": [
      {
        "type": "HORIZONTAL_SUM",
        "status": "PASSED",
        "fields": ["total", "subtotal", "tax"],
        "passed": true,
        "delta": 0.0,
        "expected": 1250.00,
        "actual": 1250.00
      }
    ]
  }
}
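In the response above, the extracted fields and the `_meta` block share the same `object`, so downstream code usually wants to separate them before storing the data. A minimal sketch follows; the helper names `split_object` and `evidence_for` are hypothetical, and the sample is a trimmed copy of the response shown above.

```python
from typing import Optional


def split_object(response: dict) -> tuple[dict, dict]:
    """Return (extracted fields without _meta, the _meta block)."""
    obj = response.get("object") or {}
    data = {key: value for key, value in obj.items() if key != "_meta"}
    return data, obj.get("_meta", {})


def evidence_for(meta: dict, json_path: str) -> Optional[dict]:
    """Find the field_confidence entry for a path such as '$.total'."""
    for entry in meta.get("field_confidence", []):
        if entry.get("field") == json_path:
            return entry
    return None


response = {
    "object": {
        "invoice_number": "INV-2024-0042",
        "total": 1250.00,
        "_meta": {
            "confidence_score": 0.94,
            "field_confidence": [
                {"field": "$.total", "score": 0.92, "reason": "Formatting ambiguous",
                 "page": 1, "text": "Total: $1,250.00"},
            ],
            "issues": [],
        },
    }
}

data, meta = split_object(response)
hit = evidence_for(meta, "$.total")
print(data)
print(hit["page"], hit["text"])
```

The `page` and `text` values give a reviewer the exact source snippet to check when a score looks low.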

Complex Schema with Line Items

curl -X POST "https://api.parsefy.io/v1/extract" \
  -H "Authorization: Bearer pk_your_api_key" \
  -F "file=@invoice.pdf" \
  -F 'output_schema={
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string", "description": "Invoice number"},
      "vendor": {
        "type": "object",
        "properties": {
          "name": {"type": "string", "description": "Company name"},
          "address": {"type": "string", "description": "Address"}
        }
      },
      "line_items": {
        "type": "array",
        "description": "Line items on the invoice",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"},
            "amount": {"type": "number"}
          }
        }
      },
      "subtotal": {"type": "number"},
      "tax": {"type": "number"},
      "total": {"type": "number", "description": "Total amount due"}
    },
    "required": ["invoice_number", "total", "line_items"]
  }' \
  -F "confidence_threshold=0.85" \
  -F "enable_verification=true"

Response with Line Items and Verification

{
  "object": {
    "invoice_number": "INV-2024-0042",
    "vendor": {
      "name": "Acme Corp",
      "address": "123 Business Ave, New York, NY 10001"
    },
    "line_items": [
      {
        "description": "Professional Services",
        "quantity": 10,
        "unit_price": 100.00,
        "amount": 1000.00
      },
      {
        "description": "Software License",
        "quantity": 1,
        "unit_price": 150.00,
        "amount": 150.00
      }
    ],
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "_meta": {
      "confidence_score": 0.95,
      "field_confidence": [
        { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "INV-2024-0042" },
        { "field": "$.total", "score": 0.96, "reason": "Exact match", "page": 1, "text": "Total: $1,250.00" },
        { "field": "$.line_items[0].description", "score": 0.94, "reason": "Exact match", "page": 1, "text": "Professional Services" }
      ],
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 3200,
    "credits": 1,
    "fallback_triggered": false
  },
  "verification": {
    "status": "PASSED",
    "checks_passed": 2,
    "checks_failed": 0,
    "cannot_verify_count": 0,
    "checks_run": [
      {
        "type": "HORIZONTAL_SUM",
        "status": "PASSED",
        "fields": ["total", "subtotal", "tax"],
        "passed": true,
        "delta": 0.0,
        "expected": 1250.00,
        "actual": 1250.00
      },
      {
        "type": "VERTICAL_SUM",
        "status": "PASSED",
        "fields": ["subtotal", "line_items"],
        "passed": true,
        "delta": 0.0,
        "expected": 1150.00,
        "actual": 1150.00
      }
    ]
  }
}
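When the verification block is present, a caller typically wants a short list of the checks that need human review. The sketch below filters on the boolean `passed` flag rather than on the `status` string, because this page only shows the value "PASSED" and other status values would be guesses. The function name `checks_needing_review` and the failing sample values are hypothetical.

```python
from typing import Optional


def checks_needing_review(verification: Optional[dict]) -> list[str]:
    """Describe every check whose `passed` flag is not True."""
    if not verification:
        return []  # the block is absent when verification was not enabled
    notes = []
    for check in verification.get("checks_run", []):
        if check.get("passed") is True:
            continue
        notes.append(
            f"{check.get('type')} on {check.get('fields')}: "
            f"expected {check.get('expected')}, actual {check.get('actual')}, "
            f"delta {check.get('delta')}"
        )
    return notes


# Hypothetical block in which the line items do not add up to the subtotal.
verification = {
    "checks_run": [
        {"type": "HORIZONTAL_SUM", "fields": ["total", "subtotal", "tax"],
         "passed": True, "delta": 0.0, "expected": 1250.00, "actual": 1250.00},
        {"type": "VERTICAL_SUM", "fields": ["subtotal", "line_items"],
         "passed": False, "delta": 50.0, "expected": 1150.00, "actual": 1100.00},
    ]
}

for note in checks_needing_review(verification):
    print(note)
```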

Fallback Behavior

When a required field returns null or falls below confidence_threshold, the API automatically triggers the fallback model (Tier 2):
{
  "object": { ... },
  "metadata": {
    "processing_time_ms": 5500,
    "credits": 2,
    "fallback_triggered": true,
    "confidence_score": 0.97
  }
}
The fallback model (Tier 2) consumes more credits. To avoid unexpected costs, mark a field as optional if it is likely to be missing in more than 20% of your documents.
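One way to apply the 20% guideline is to measure how often each field came back null in a sample of earlier extractions and keep only the reliably present fields in the required array. This is a rough sketch: the function names are hypothetical, the sample data is made up for illustration, and it treats a null or absent value as "missing".

```python
def missing_rates(objects: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of sampled objects in which each field is null or absent."""
    if not objects:
        raise ValueError("need at least one sampled object")
    return {
        name: sum(1 for obj in objects if obj.get(name) is None) / len(objects)
        for name in fields
    }


def suggest_required(objects: list[dict], fields: list[str], max_missing: float = 0.20) -> list[str]:
    """Keep a field required only if it is missing in at most max_missing of the sample."""
    rates = missing_rates(objects, fields)
    return [name for name in fields if rates[name] <= max_missing]


# Illustrative sample of five previously extracted objects.
history = [
    {"invoice_number": "A-1", "total": 10.0, "vendor": "Acme Corp"},
    {"invoice_number": "A-2", "total": 12.5, "vendor": None},
    {"invoice_number": "A-3", "total": None, "vendor": "Acme Corp"},
    {"invoice_number": "A-4", "total": 99.0, "vendor": None},
    {"invoice_number": "A-5", "total": 45.0, "vendor": "Acme Corp"},
]

print(missing_rates(history, ["invoice_number", "total", "vendor"]))
print(suggest_required(history, ["invoice_number", "total", "vendor"]))
```

Here `vendor` is missing in 2 of 5 documents (40%), so it would be left optional, while `total` at exactly 20% stays required under a strict "more than 20%" reading.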

Error Responses

Invalid File Type (400)

{
  "detail": "Invalid file type. Supported formats: PDF, DOCX"
}

Invalid Schema (400)

{
  "detail": "Invalid JSON schema: Expecting property name enclosed in double quotes"
}

Unauthorized (401)

{
  "detail": "Invalid or missing API key"
}

Rate Limited (429)

{
  "detail": "Rate limit exceeded. Please retry after 1 second."
}

Extraction Failed (200 with error)

{
  "object": null,
  "metadata": {
    "processing_time_ms": 5000,
    "credits": 1,
    "fallback_triggered": true
  },
  "error": {
    "code": "EXTRACTION_FAILED",
    "message": "Unable to extract data from document"
  }
}
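Because a failed extraction is returned with HTTP 200, checking the status code alone is not enough; the body has to be inspected for the `error` key. A minimal sketch of that check follows. The `ExtractionError` class, the `unwrap` function, and the `"NO_OBJECT"` code used for a null object without an error block are all hypothetical names introduced for illustration.

```python
class ExtractionError(Exception):
    """Raised when a 200 response carries an error instead of data."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap(response: dict) -> dict:
    """Return the extracted object, or raise if the body reports a failure."""
    error = response.get("error")
    if error:
        raise ExtractionError(error.get("code", "UNKNOWN"), error.get("message", ""))
    if response.get("object") is None:
        # Hypothetical code for a body with neither data nor an error block.
        raise ExtractionError("NO_OBJECT", "response contained no object")
    return response["object"]


failed = {
    "object": None,
    "metadata": {"processing_time_ms": 5000, "credits": 1, "fallback_triggered": True},
    "error": {"code": "EXTRACTION_FAILED", "message": "Unable to extract data from document"},
}

try:
    unwrap(failed)
except ExtractionError as exc:
    print(exc.code, "-", exc.message)
```

Note that the failed response still reports credits in `metadata`, so callers may want to log that block even when raising.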

Rate Limits

  • Request Rate: 1 request per second per IP
  • File Size: Maximum 10 MB
See Rate Limits for more details.
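A simple way to stay under the request rate is to space calls on the client so that a 429 is rarely seen. The sketch below is one possible approach, not an official client: the `Throttle` class is hypothetical, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive calls."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay that was applied."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Usage: call throttle.wait() immediately before each request.
throttle = Throttle(min_interval=1.0)
```

Since the limit is applied per IP, several processes behind the same address would need to share one throttle or otherwise coordinate, and handling an occasional 429 with a retry is still advisable.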