Overview

Parsefy extracts structured data from PDF and DOCX files. You describe the data you want with a schema, and Parsefy returns JSON that matches it.

Basic extraction

import { Parsefy } from 'parsefy';
import * as z from 'zod';

const client = new Parsefy();

const schema = z.object({
  invoice_number: z.string().describe('The invoice number'),
  total: z.number().describe('Total amount'),
});

const { object, error } = await client.extract({
  file: './invoice.pdf',
  schema,
});

File inputs

Parsefy accepts multiple file input types:
Input         Description
File path     "./document.pdf" - reads from disk
Buffer/bytes  In-memory file data
File object   Browser File from form input
Blob          Raw binary with MIME type

// File path
const fromPath = await client.extract({ file: './doc.pdf', schema });

// Buffer (Node.js)
import { readFileSync } from 'fs';
const fromBuffer = await client.extract({
  file: readFileSync('./doc.pdf'),
  schema,
});

// File object (browser); assumes the user has already selected a file
const fileInput = document.querySelector('input[type="file"]');
const fromFile = await client.extract({
  file: fileInput.files[0],
  schema,
});

Response format

Every extraction returns:
{
  "object": {
    "invoice_number": "INV-2024-001",
    "total": 1500.00,
    "_meta": {
      "confidence_score": 0.95,
      "issues": []
    }
  },
  "metadata": {
    "processing_time_ms": 2340,
    "input_tokens": 1520,
    "output_tokens": 89,
    "credits": 1,
    "fallback_triggered": false
  }
}

The _meta field

Every extraction includes quality metrics:
  • confidence_score: 0.0 to 1.0 indicating extraction certainty
  • issues: Array of any concerns encountered
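
One way to use these metrics is as a gate: accept high-confidence results automatically and send the rest to manual review. The sketch below assumes the `_meta` shape from the response example. The `needsReview` helper, the 0.8 threshold, and treating `issues` as an array of strings are illustrative assumptions, not part of Parsefy.

```typescript
// Shape of the _meta field as shown in the response example.
// Assumption: issues are plain strings; the example only shows an empty array.
interface ExtractionMeta {
  confidence_score: number;
  issues: string[];
}

// Hypothetical helper: flags a result for manual review when confidence is
// below the threshold or when any issue was reported.
function needsReview(meta: ExtractionMeta, minConfidence = 0.8): boolean {
  return meta.confidence_score < minConfidence || meta.issues.length > 0;
}

const sample = {
  invoice_number: 'INV-2024-001',
  total: 1500.0,
  _meta: { confidence_score: 0.95, issues: [] as string[] },
};

console.log(needsReview(sample._meta)); // false
```

The right threshold depends on how costly a wrong value is for your use case; a stricter value such as 0.9 sends more documents to review.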

Metadata

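Processing information:
  • processing_time_ms: How long the extraction took, in milliseconds
  • input_tokens: Tokens consumed by the input
  • output_tokens: Tokens generated in the output
  • credits: Credits consumed (~1 per page)
  • fallback_triggered: Whether the fallback model was used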
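
When processing many documents, these fields can be aggregated to track cost and latency. The following is a minimal sketch, not part of the SDK: `summarize` and `BatchSummary` are hypothetical names, and the field names follow the snake_case JSON in the response example (the SDK may expose camelCase equivalents).

```typescript
// Subset of the metadata object shown in the response example.
interface ExtractionMetadata {
  processing_time_ms: number;
  credits: number;
  fallback_triggered: boolean;
}

interface BatchSummary {
  totalCredits: number;
  totalTimeMs: number;
  fallbackCount: number;
}

// Hypothetical helper: totals credits, processing time, and fallback usage
// across the metadata of several extractions.
function summarize(batch: ExtractionMetadata[]): BatchSummary {
  return batch.reduce<BatchSummary>(
    (acc, m) => ({
      totalCredits: acc.totalCredits + m.credits,
      totalTimeMs: acc.totalTimeMs + m.processing_time_ms,
      fallbackCount: acc.fallbackCount + (m.fallback_triggered ? 1 : 0),
    }),
    { totalCredits: 0, totalTimeMs: 0, fallbackCount: 0 },
  );
}

// Made-up sample values for illustration only.
const summary = summarize([
  { processing_time_ms: 2340, credits: 1, fallback_triggered: false },
  { processing_time_ms: 4100, credits: 3, fallback_triggered: true },
]);

console.log(summary.totalCredits); // 4
console.log(summary.totalTimeMs); // 6440
console.log(summary.fallbackCount); // 1
```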

Supported formats

Format          Extension  Processing
PDF             .pdf       Native multimodal AI (can "see" the document)
Microsoft Word  .docx      Converted to Markdown

Maximum file size: 10 MB
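
Checking the extension and size before uploading avoids spending a request on a file that will be rejected. The sketch below is one possible pre-flight check. `preflight` is a hypothetical helper, and interpreting "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the limit may instead be decimal (10,000,000 bytes).

```typescript
// Assumption: 10 MB is treated as binary megabytes.
const MAX_FILE_BYTES = 10 * 1024 * 1024;
const SUPPORTED_EXTENSIONS = ['.pdf', '.docx'];

// Hypothetical helper: returns a list of problems, which is empty when the
// file looks acceptable. It checks only the name and size, not the contents.
function preflight(fileName: string, sizeBytes: number): string[] {
  const problems: string[] = [];
  const dot = fileName.lastIndexOf('.');
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : '';

  if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
    problems.push(`Unsupported extension: ${ext || '(none)'}`);
  }
  if (sizeBytes > MAX_FILE_BYTES) {
    problems.push(`File is ${sizeBytes} bytes; the limit is ${MAX_FILE_BYTES}`);
  }
  return problems;
}

console.log(preflight('invoice.PDF', 250_000)); // empty array: file looks acceptable
console.log(preflight('notes.txt', 250_000)); // one problem: unsupported extension
```

Returning a list of problems rather than a boolean lets the caller show every reason a file was rejected at once.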

Error handling

const { object, error, metadata } = await client.extract({
  file: './document.pdf',
  schema,
});

if (error) {
  // Extraction-level error (request succeeded, but extraction failed)
  console.error(`[${error.code}] ${error.message}`);
  // Metadata is still available for debugging
  console.log(`Tokens used: ${metadata.inputTokens}`);
} else {
  // Success
  console.log(object);
}
