Extract - Parsefy

POST /v1/extract

curl --request POST \
  --url https://api.parsefy.io/v1/extract \
  --header 'Authorization: Bearer pk_your_api_key' \
  --form 'file=@invoice.pdf' \
  --form 'output_schema=<string>' \
  --form 'confidence_threshold=0.85' \
  --form 'enable_verification=true'

{
  "object": {
    "_meta": {
      "confidence_score": 123,
      "field_confidence": [
        {
          "field": "<string>",
          "score": 123,
          "reason": "<string>",
          "page": 123,
          "text": "<string>"
        }
      ],
      "issues": [
        {}
      ]
    }
  },
  "metadata": {
    "processing_time_ms": 123,
    "credits": 123,
    "fallback_triggered": true
  },
  "verification": {
    "status": "<string>",
    "checks_passed": 123,
    "checks_failed": 123,
    "cannot_verify_count": 123,
    "checks_run": [
      {
        "type": "<string>",
        "status": "<string>",
        "fields": [
          {}
        ],
        "passed": true,
        "delta": 123,
        "expected": 123,
        "actual": 123
      }
    ]
  },
  "error": {
    "code": "<string>",
    "message": "<string>"
  }
}

Overview

The /v1/extract endpoint is the primary way to extract structured data from financial documents (invoices, receipts, bills). It processes PDF and DOCX files according to your JSON Schema definition and returns validated JSON with field-level confidence and evidence.

Our goal: 0% silent errors. You get validated output with field-level evidence or a clear failure reason, never unreliable data passed through silently.

Request

file

required

The document to extract data from.

Supported formats: PDF (.pdf), Microsoft Word (.docx)
Maximum size: 10 MB
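The file limits above can be checked on the client before uploading, which avoids spending a request on a file that would be rejected. The following is a minimal sketch; the helper name check_upload is hypothetical, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption, since the page does not say how the limit is counted.

```python
from pathlib import Path

# Limits taken from the documentation above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}
# Assumption: "10 MB" means 10 * 1024 * 1024 bytes; the API may count differently.
MAX_FILE_BYTES = 10 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    return problems
```

This only inspects the file name and size; the server may still reject a file whose content does not match its extension.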

output_schema

string

required

A JSON Schema string defining the structure of the data to extract. See the Schema Guide for detailed documentation.

Important: If a field listed in the required array comes back null or scores below confidence_threshold, the fallback model (Tier 2) is triggered, which is more expensive.
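Because output_schema is a string rather than a nested object, a schema held as a dictionary has to be serialized before it is sent. The sketch below shows one way to prepare the text form fields; build_form_fields is a hypothetical helper, and sending the boolean as lowercase "true"/"false" is an assumption. The file part is attached separately by whatever HTTP client you use.

```python
import json


def build_form_fields(schema: dict, confidence_threshold: float = 0.85,
                      enable_verification: bool = False) -> dict[str, str]:
    """Serialize request parameters into multipart form field values."""
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    return {
        # output_schema is sent as a JSON string, not as a nested object.
        "output_schema": json.dumps(schema),
        "confidence_threshold": str(confidence_threshold),
        # Assumption: booleans are sent as lowercase "true"/"false".
        "enable_verification": "true" if enable_verification else "false",
    }


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice number"},
        "total": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total"],
}
fields = build_form_fields(invoice_schema)
```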

confidence_threshold

number

default:"0.85"

Minimum confidence score (0.0 to 1.0) required before accepting Tier 1 results.

Lower values (e.g., 0.70): Faster and cheaper (accepts Tier 1 results more often)
Higher values (e.g., 0.95): More accurate but more expensive (triggers Tier 2 fallback more often)

Default: 0.85
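To get a feel for how the threshold interacts with the required array, the documented rule can be modeled on the client side. This is a sketch of the rule as described on this page, not the service's actual implementation; the function name and the choice to treat a field with no reported score as 0.0 are assumptions.

```python
def fields_triggering_fallback(extracted: dict, field_scores: dict[str, float],
                               required: list[str],
                               threshold: float = 0.85) -> list[str]:
    """Return required fields that are null or score below the threshold."""
    triggering = []
    for name in required:
        value = extracted.get(name)
        # Assumption: a field with no reported score is treated as score 0.0.
        score = field_scores.get(name, 0.0)
        if value is None or score < threshold:
            triggering.append(name)
    return triggering


extracted = {"invoice_number": "INV-2024-0042", "total": 1250.00, "vendor": None}
scores = {"invoice_number": 0.98, "total": 0.92}

# With the default threshold neither required field is a problem.
at_default = fields_triggering_fallback(extracted, scores, ["invoice_number", "total"])
# Raising the threshold to 0.95 makes "total" (0.92) a fallback trigger.
at_strict = fields_triggering_fallback(extracted, scores, ["invoice_number", "total"], 0.95)
```

Under this model, only fields in the required array matter: the null vendor above has no effect unless vendor is marked required.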

enable_verification

boolean

default:"false"

Enable math verification to ensure extracted numeric data is mathematically consistent. When enabled, Parsefy automatically:

Verifies totals match subtotals + tax
Validates line item sums
Performs shadow extraction for single-field verification

Default: false
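The first two checks listed above amount to simple arithmetic, which can be sketched as follows. This is an illustration only: the sum_check helper, the 0.01 tolerance, and reporting delta as an absolute difference are assumptions, and shadow extraction is not covered.

```python
def sum_check(check_type: str, parts: list[float], reported: float,
              fields: list[str], tolerance: float = 0.01) -> dict:
    """Compare the sum of parts (expected) with the reported value (actual)."""
    expected = round(sum(parts), 2)
    # Assumption: delta is the absolute difference, rounded to cents.
    delta = round(abs(expected - reported), 2)
    passed = delta <= tolerance
    return {
        "type": check_type,
        "status": "PASSED" if passed else "FAILED",
        "fields": fields,
        "passed": passed,
        "delta": delta,
        "expected": expected,
        "actual": reported,
    }


# Total should equal subtotal + tax.
total_check = sum_check("HORIZONTAL_SUM", [1150.00, 100.00], 1250.00,
                        ["total", "subtotal", "tax"])
# Subtotal should equal the sum of the line item amounts.
lines_check = sum_check("VERTICAL_SUM", [1000.00, 150.00], 1150.00,
                        ["subtotal", "line_items"])
```

Running the same arithmetic locally can be a useful cross-check when verification is disabled to save processing time.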

Authorization

string

required

Bearer token authentication. Format: Bearer pk_your_api_key

Response

object

The extracted data matching your schema.

_meta

object

Quality metrics and field-level evidence for the extraction.

confidence_score

number

Overall confidence level from 0.0 to 1.0

field_confidence

array

Per-field confidence with evidence

field

string

JSON path to the field (e.g., $.invoice_number)

score

number

Confidence score (0.0 to 1.0)

reason

string

Explanation of the score, e.g., “Exact match” or “Inferred from header”

page

integer

Page number where the value was found

text

string

Source text evidence from the document

issues

array

Array of issue descriptions (strings)
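One common use of field_confidence is routing weaker fields to a human reviewer. The sketch below assumes the response has already been parsed into a dictionary; the fields_needing_review name and the 0.95 review cutoff are illustrative choices, not part of the API.

```python
def fields_needing_review(response: dict, min_score: float = 0.95) -> list[dict]:
    """Return field_confidence entries below min_score, weakest first."""
    # "object" can be null when extraction fails, so fall back to an empty dict.
    meta = (response.get("object") or {}).get("_meta", {})
    weak = [entry for entry in meta.get("field_confidence", [])
            if entry.get("score", 0.0) < min_score]
    return sorted(weak, key=lambda entry: entry.get("score", 0.0))


sample_response = {
    "object": {
        "invoice_number": "INV-2024-0042",
        "total": 1250.00,
        "vendor": "Acme Corp",
        "_meta": {
            "confidence_score": 0.94,
            "field_confidence": [
                {"field": "$.invoice_number", "score": 0.98, "reason": "Exact match",
                 "page": 1, "text": "Invoice # INV-2024-0042"},
                {"field": "$.total", "score": 0.92, "reason": "Formatting ambiguous",
                 "page": 1, "text": "Total: $1,250.00"},
                {"field": "$.vendor", "score": 0.90, "reason": "Inferred from header",
                 "page": 1, "text": "Acme Corp"},
            ],
            "issues": [],
        },
    }
}
review_queue = fields_needing_review(sample_response)
```

Each returned entry still carries its page and text, so a reviewer can jump straight to the source evidence.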

metadata

object

Processing information.

processing_time_ms

integer

Total processing time in milliseconds

credits

integer

Credits used (~1 per page, more if fallback triggered)

fallback_triggered

boolean

Whether the fallback model (Tier 2) was used

verification

object

Math verification results (only present if enable_verification was true).

status

string

Overall status: PASSED, FAILED, PARTIAL, CANNOT_VERIFY, or NO_RULES

checks_passed

integer

Number of verification checks that passed

checks_failed

integer

Number of verification checks that failed

cannot_verify_count

integer

Number of checks that could not be verified

checks_run

array

Detailed results for each verification check

type

string

Type of check: HORIZONTAL_SUM or VERTICAL_SUM

status

string

Check status: PASSED, FAILED, or CANNOT_VERIFY

fields

array

Fields involved in this check

passed

boolean

Whether the check passed

delta

number

Difference between expected and actual values

expected

number

Expected value from the verification rule

actual

number

Actual value extracted from the document
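When the overall status is not PASSED, the entries in checks_run show which rule went wrong and by how much. A minimal sketch of summarizing them, assuming a parsed verification object (or None when verification was not enabled); the function name and message format are hypothetical.

```python
from typing import Optional


def describe_unpassed_checks(verification: Optional[dict]) -> list[str]:
    """Return one line per check whose status is FAILED or CANNOT_VERIFY."""
    if not verification:
        # The verification object is absent unless enable_verification was true.
        return []
    lines = []
    for check in verification.get("checks_run", []):
        if check.get("status") == "PASSED":
            continue
        fields = ", ".join(str(name) for name in check.get("fields", []))
        lines.append(
            f"{check.get('type')} {check.get('status')}: fields=[{fields}] "
            f"expected={check.get('expected')} actual={check.get('actual')} "
            f"delta={check.get('delta')}"
        )
    return lines
```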

error

object

Present only if extraction failed.

code

string

Error code: EXTRACTION_FAILED, LLM_ERROR, PARSING_ERROR, TIMEOUT_ERROR

message

string

Human-readable error message

Examples

Basic Invoice Extraction with Confidence Threshold

curl -X POST "https://api.parsefy.io/v1/extract" \
  -H "Authorization: Bearer pk_your_api_key" \
  -F "file=@invoice.pdf" \
  -F 'output_schema={
    "type": "object",
    "properties": {
      "invoice_number": {
        "type": "string",
        "description": "The invoice number"
      },
      "date": {
        "type": "string",
        "description": "Invoice date"
      },
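      "subtotal": {
        "type": "number",
        "description": "Subtotal before tax"
      },
      "tax": {
        "type": "number",
        "description": "Tax amount"
      },
      "total": {
        "type": "number",
        "description": "Total amount including tax"
      },
      "vendor": {
        "type": "string",
        "description": "Vendor name"
      }
    },
    "required": ["invoice_number", "total"]
  }' \
  -F "confidence_threshold=0.85" \
  -F "enable_verification=true"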

Response with Field-Level Confidence

{
  "object": {
    "invoice_number": "INV-2024-0042",
    "date": "01/15/2024",
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "vendor": "Acme Corp",
    "_meta": {
      "confidence_score": 0.94,
      "field_confidence": [
        { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "Invoice # INV-2024-0042" },
        { "field": "$.date", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Date: 01/15/2024" },
        { "field": "$.subtotal", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Subtotal: $1,150.00" },
        { "field": "$.tax", "score": 0.95, "reason": "Exact match", "page": 1, "text": "Tax: $100.00" },
        { "field": "$.total", "score": 0.92, "reason": "Formatting ambiguous", "page": 1, "text": "Total: $1,250.00" },
        { "field": "$.vendor", "score": 0.90, "reason": "Inferred from header", "page": 1, "text": "Acme Corp" }
      ],
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 2340,
    "credits": 1,
    "fallback_triggered": false
  },
  "verification": {
    "status": "PASSED",
    "checks_passed": 1,
    "checks_failed": 0,
    "cannot_verify_count": 0,
    "checks_run": [
      {
        "type": "HORIZONTAL_SUM",
        "status": "PASSED",
        "fields": ["total", "subtotal", "tax"],
        "passed": true,
        "delta": 0.0,
        "expected": 1250.00,
        "actual": 1250.00
      }
    ]
  }
}

Complex Schema with Line Items

curl -X POST "https://api.parsefy.io/v1/extract" \
  -H "Authorization: Bearer pk_your_api_key" \
  -F "file=@invoice.pdf" \
  -F 'output_schema={
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string", "description": "Invoice number"},
      "vendor": {
        "type": "object",
        "properties": {
          "name": {"type": "string", "description": "Company name"},
          "address": {"type": "string", "description": "Address"}
        }
      },
      "line_items": {
        "type": "array",
        "description": "Line items on the invoice",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"},
            "amount": {"type": "number"}
          }
        }
      },
      "subtotal": {"type": "number"},
      "tax": {"type": "number"},
      "total": {"type": "number", "description": "Total amount due"}
    },
    "required": ["invoice_number", "total", "line_items"]
  }' \
  -F "confidence_threshold=0.85" \
  -F "enable_verification=true"

Response with Line Items and Verification

{
  "object": {
    "invoice_number": "INV-2024-0042",
    "vendor": {
      "name": "Acme Corp",
      "address": "123 Business Ave, New York, NY 10001"
    },
    "line_items": [
      {
        "description": "Professional Services",
        "quantity": 10,
        "unit_price": 100.00,
        "amount": 1000.00
      },
      {
        "description": "Software License",
        "quantity": 1,
        "unit_price": 150.00,
        "amount": 150.00
      }
    ],
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "_meta": {
      "confidence_score": 0.95,
      "field_confidence": [
        { "field": "$.invoice_number", "score": 0.98, "reason": "Exact match", "page": 1, "text": "INV-2024-0042" },
        { "field": "$.total", "score": 0.96, "reason": "Exact match", "page": 1, "text": "Total: $1,250.00" },
        { "field": "$.line_items[0].description", "score": 0.94, "reason": "Exact match", "page": 1, "text": "Professional Services" }
      ],
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 3200,
    "credits": 1,
    "fallback_triggered": false
  },
  "verification": {
    "status": "PASSED",
    "checks_passed": 2,
    "checks_failed": 0,
    "cannot_verify_count": 0,
    "checks_run": [
      {
        "type": "HORIZONTAL_SUM",
        "status": "PASSED",
        "fields": ["total", "subtotal", "tax"],
        "passed": true,
        "delta": 0.0,
        "expected": 1250.00,
        "actual": 1250.00
      },
      {
        "type": "VERTICAL_SUM",
        "status": "PASSED",
        "fields": ["subtotal", "line_items"],
        "passed": true,
        "delta": 0.0,
        "expected": 1150.00,
        "actual": 1150.00
      }
    ]
  }
}

Fallback Behavior

When a required field returns null or falls below confidence_threshold, the API automatically triggers the fallback model (Tier 2):

{
  "object": { ... },
  "metadata": {
    "processing_time_ms": 5500,
    "credits": 2,
    "fallback_triggered": true,
    "confidence_score": 0.97
  }
}

The fallback model (Tier 2) consumes more credits. To avoid unexpected costs, leave a field out of the required array if it might be missing in more than 20% of your documents.
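The 20% guideline can be applied to a sample of past extractions. The sketch below assumes you have kept the extracted object payloads from earlier runs; the function name is hypothetical, and a field counts as missing when it is absent or null.

```python
def fields_to_make_optional(documents: list[dict], required: list[str],
                            max_missing_rate: float = 0.20) -> dict[str, float]:
    """Return required fields whose missing rate exceeds max_missing_rate."""
    if not documents:
        return {}
    flagged = {}
    for name in required:
        # A field is counted as missing when it is absent or null.
        missing = sum(1 for doc in documents if doc.get(name) is None)
        rate = missing / len(documents)
        if rate > max_missing_rate:
            flagged[name] = rate
    return flagged


past_objects = [
    {"invoice_number": "A-1", "date": "01/15/2024", "po_number": "PO-7"},
    {"invoice_number": "A-2", "date": "01/16/2024", "po_number": None},
    {"invoice_number": "A-3", "date": None, "po_number": "PO-9"},
    {"invoice_number": "A-4", "date": "01/18/2024"},
    {"invoice_number": "A-5", "date": "01/19/2024", "po_number": "PO-11"},
]
candidates = fields_to_make_optional(past_objects, ["invoice_number", "date", "po_number"])
```

Here po_number is missing in two of five documents and is flagged, while date, missing in exactly one of five, sits at the 20% boundary and is not.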

Error Responses

Invalid File Type (400)

{
  "detail": "Invalid file type. Supported formats: PDF, DOCX"
}

Invalid Schema (400)

{
  "detail": "Invalid JSON schema: Expecting property name enclosed in double quotes"
}

Unauthorized (401)

{
  "detail": "Invalid or missing API key"
}

Rate Limited (429)

{
  "detail": "Rate limit exceeded. Please retry after 1 second."
}

Extraction Failed (200 with error)

{
  "object": null,
  "metadata": {
    "processing_time_ms": 5000,
    "credits": 1,
    "fallback_triggered": true
  },
  "error": {
    "code": "EXTRACTION_FAILED",
    "message": "Unable to extract data from document"
  }
}
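Because an extraction failure arrives with HTTP 200, checking the status code alone is not enough. The sketch below handles both error shapes shown above, given a status code and an already-parsed JSON body; ExtractionError and unwrap_extraction are hypothetical names.

```python
class ExtractionError(Exception):
    """Raised for HTTP-level errors and for 200 responses that carry an error."""


def unwrap_extraction(status_code: int, body: dict) -> dict:
    """Return the extracted object, or raise ExtractionError with the reason."""
    if status_code != 200:
        # 400, 401 and 429 responses carry a single "detail" string.
        raise ExtractionError(f"HTTP {status_code}: {body.get('detail', 'unknown error')}")
    error = body.get("error")
    if error:
        # Extraction failures return 200 with an "error" object and a null "object".
        raise ExtractionError(f"{error.get('code')}: {error.get('message')}")
    return body["object"]
```

Note that a failed extraction still reports credits in metadata, so callers tracking usage may want to read that field before raising.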

Rate Limits

Request Rate: 1 request per second per IP
File Size: Maximum 10 MB

See Rate Limits for more details.
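A simple way to stay under the request rate is to space calls at least one second apart on the client. The following is a minimal sketch with an injectable clock so it can be tested without real waiting; the Throttle class is hypothetical, and since the limit is enforced per IP, a per-process throttle is only an approximation when several processes share an address.

```python
import time


class Throttle:
    """Space calls at least min_interval seconds apart."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous permitted call

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self._min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when this call was actually allowed to proceed.
        self._last_call = now + delay
        return delay
```

Call wait() immediately before each request. A 429 response can still occur, in which case retrying after the delay stated in the error message is the safer option.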
