Document Extractor Agent

Overview

The Document Extractor Agent extracts structured information from unstructured text or documents. It uses an LLM to convert raw input into JSON output that conforms to a predefined schema.

Key Features

  • Structured Output: Forces the LLM to output valid JSON using Pydantic schemas
  • Multi-format Support: Processes both raw text and documents (PDF, TIFF)
  • Smart Document Processing: Automatically detects and handles scanned vs. digital documents
  • Example-based Learning: Supports few-shot learning through examples

Configuration

Here's a basic Document Extractor agent configuration:

Invoice Processing Example

{
  "name": "InvoiceProcessor",
  "description": "Extracts invoice information",
  "agentType": "document_extractor",
  "config": {
    "prompt": "Extract invoice details from the document.",
    "output_schema": {
      "title": "Invoice",
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "vendor": {"type": "string"}
      },
      "required": ["invoice_number", "amount"]
    },
    "examples": [
      {
        "input": "Invoice #12345 from Acme Corp dated 2024-03-15 for $499.99",
        "extraction": {
          "invoice_number": "12345",
          "date": "2024-03-15",
          "amount": 499.99,
          "vendor": "Acme Corp"
        }
      }
    ],
    "max_tokens": 2080,
    "model_id": "anthropic.claude-3-5-sonnet-20240307-v1:0"
  }
}
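
Client code may want to sanity-check what the agent returns against the same output_schema. The following is a minimal sketch, not the platform's own validation: it covers only the "required" and primitive "type" keywords used in the invoice schema, and check_extraction is a hypothetical helper name. A complete JSON Schema validator would cover far more keywords.

```python
# Minimal sketch: check an extraction against a small subset of JSON Schema
# ("required" plus primitive "type" keywords). Not the platform's validator.

INVOICE_SCHEMA = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "vendor": {"type": "string"},
    },
    "required": ["invoice_number", "amount"],
}

# Python types accepted for each JSON Schema primitive used above.
_TYPE_MAP = {"string": (str,), "number": (int, float)}


def check_extraction(extraction: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the extraction passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in extraction:
            problems.append(f"missing required field: {field}")
    for field, value in extraction.items():
        spec = schema.get("properties", {}).get(field)
        if spec is None:
            # Unknown fields are ignored, matching JSON Schema's default.
            continue
        expected = _TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"wrong type for {field}: expected {spec['type']}")
    return problems
```

An extraction that passes returns an empty list, so callers can treat any non-empty result as a reason to retry or to flag the document.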

Patient Notes Processing Example

{
  "name": "PatientNotesProcessor",
  "description": "Medical notes extraction and classification agent",
  "agentType": "document_extractor",
  "config": {
    "prompt": "Extract structured information from patient notes. Include symptoms, diagnosis, and recommended actions. Mark status as pending_review for any notes containing severe symptoms or unclear diagnosis.",
    "output_schema": {
      "title": "PatientNote",
      "type": "object",
      "properties": {
        "patient_id": {"type": "string", "description": "Patient identifier"},
        "visit_date": {"type": "string", "description": "Date of visit"},
        "symptoms": {
          "type": "array",
          "items": {"type": "string"},
          "description": "List of reported symptoms"
        },
        "diagnosis": {"type": "string", "description": "Primary diagnosis"},
        "recommended_actions": {
          "type": "array",
          "items": {"type": "string"},
          "description": "Recommended follow-up actions"
        },
        "status": {
          "type": "string",
          "enum": ["complete", "pending_review"],
          "description": "Review status of the note"
        }
      },
      "required": ["patient_id", "visit_date", "symptoms", "status"]
    },
    "examples": [
      {
        "input": "Patient ID: P123 seen on 2024-03-15. Reports mild headache and fatigue for 3 days. Likely viral infection. Recommend rest and fluids.",
        "extraction": {
          "patient_id": "P123",
          "visit_date": "2024-03-15",
          "symptoms": ["headache", "fatigue"],
          "diagnosis": "viral infection",
          "recommended_actions": ["rest", "increase fluid intake"],
          "status": "complete"
        }
      },
      {
        "input": "Patient ID: P456 on 2024-03-16. Severe chest pain, shortness of breath. ECG shows irregular pattern. Further cardiac evaluation needed.",
        "extraction": {
          "patient_id": "P456",
          "visit_date": "2024-03-16",
          "symptoms": ["severe chest pain", "shortness of breath"],
          "diagnosis": "possible cardiac condition - evaluation pending",
          "recommended_actions": ["cardiac evaluation", "immediate follow-up"],
          "status": "pending_review"
        }
      }
    ],
    "max_tokens": 2080,
    "model_id": "anthropic.claude-3-5-sonnet-20240307-v1:0"
  }
}

Tip

The second example is labeled pending_review because it describes severe symptoms and an unclear diagnosis. Examples like this show the agent when to flag a note, which helps prioritize cases that need immediate attention.
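
Downstream code can use the status field to route notes for review. The sketch below assumes the agent's extractions have already been collected into a list of dicts shaped like the PatientNote schema; split_by_status is a hypothetical helper, not part of the platform.

```python
def split_by_status(extractions: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition extractions into (needs_review, complete)."""
    needs_review, complete = [], []
    for item in extractions:
        # Conservative assumption: anything not explicitly "complete",
        # including a missing status, is routed to review.
        if item.get("status") == "complete":
            complete.append(item)
        else:
            needs_review.append(item)
    return needs_review, complete
```

Routing unknown statuses to review is a design choice for this sketch: a missed flag on a medical note is costlier than an extra review.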

Using the Document Extractor Agent

Text Input

POST /v1/agents/{agent_id}/invoke-document-extractor
{
"text": "Receipt from Walmart dated 2024-03-20. Total amount: $123.45"
}

Document Input

POST /v1/agents/{agent_id}/invoke-document-extractor
{
"pathToDocument": "s3://your-bucket/path/to/document.pdf"
}
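
Both request shapes above can be built from client code. The following sketch uses only the Python standard library and constructs the request without sending it. BASE_URL and build_invoke_request are hypothetical; the source shows only the path and body, so the host, any authentication headers, and the rule that exactly one input is supplied are all assumptions.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical host; the documentation only gives the path.
BASE_URL = "https://api.example.com"


def build_invoke_request(agent_id: str, *, text: Optional[str] = None,
                         path_to_document: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the extractor endpoint."""
    # Assumption: exactly one of the two inputs is supplied per call.
    if (text is None) == (path_to_document is None):
        raise ValueError("provide exactly one of text or path_to_document")
    if text is not None:
        body = {"text": text}
    else:
        body = {"pathToDocument": path_to_document}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/agents/{agent_id}/invoke-document-extractor",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; any credentials your deployment requires would need to be added to the headers first.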

Document Upload

Currently, you can use our temporary document upload endpoint to get the S3 path. Note that this endpoint will be deprecated once the Content Lake integration is complete.

Document Processing Flow

  1. Input Detection

    • Determines whether the input is text or a document
    • Validates the document format (PDF/TIFF only)
  2. Document Analysis

    • Detects whether the document is scanned or digital
  3. Processing Strategy

    • Digital Documents: Direct text extraction
    • Scanned Documents: Multimodal LLM processing

Experimental Feature

Multimodal document processing is currently experimental. Future updates will move text extraction to the ingestion phase in the Content Lake.
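
The processing flow can be read as a small routing decision. The sketch below is an illustration of that logic, not the agent's actual implementation. choose_strategy, the strategy names, and the has_text_layer flag (standing in for whatever scanned-versus-digital detection the agent performs) are hypothetical, as is accepting ".tif" alongside ".tiff".

```python
from pathlib import PurePosixPath
from typing import Optional

# ".tif" is an assumption; the documentation only says PDF and TIFF.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".tif"}


def choose_strategy(text: Optional[str] = None,
                    path_to_document: Optional[str] = None,
                    has_text_layer: bool = False) -> str:
    """Pick a processing route following the three documented steps."""
    # Step 1: input detection and format validation.
    if text is not None:
        return "text"
    if path_to_document is None:
        raise ValueError("no input provided")
    suffix = PurePosixPath(path_to_document).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {suffix or 'none'}")
    # Steps 2 and 3: digital documents get direct text extraction,
    # scanned documents go to multimodal LLM processing.
    if has_text_layer:
        return "direct_text_extraction"
    return "multimodal_llm"
```

For example, choose_strategy(path_to_document="s3://your-bucket/path/to/document.pdf", has_text_layer=True) selects direct text extraction, while the same path without a text layer selects the multimodal route.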

Best Practices

  1. Schema Design

    • Keep schemas focused and specific
    • Include clear property descriptions
    • Mark required fields appropriately
  2. Example Selection

    • Provide diverse, representative examples
    • Include edge cases
    • Match your use case closely

    "examples": [
      {
        "input": "Simple case...",
        "extraction": { /* simple structure */ }
      },
      {
        "input": "Complex case with missing fields...",
        "extraction": { /* handles missing data */ }
      }
    ]

  3. Prompt Engineering

    • Be specific about extraction requirements
    • Include validation rules
    • Mention format expectations
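
Some of these practices can be checked mechanically before an agent is deployed. The sketch below is a hypothetical lint helper (lint_examples is not a platform function) that assumes the config layout shown earlier: it reports few-shot examples that omit required fields, and enum values that no example demonstrates.

```python
def lint_examples(schema: dict, examples: list[dict]) -> list[str]:
    """Return warnings about few-shot examples; an empty list means none found."""
    warnings = []
    required = schema.get("required", [])
    for index, example in enumerate(examples):
        extraction = example.get("extraction", {})
        missing = [name for name in required if name not in extraction]
        if missing:
            warnings.append(f"example {index}: missing required fields {missing}")
    # Diversity check: every enum value should appear in at least one example.
    for name, spec in schema.get("properties", {}).items():
        if "enum" not in spec:
            continue
        seen = {ex.get("extraction", {}).get(name) for ex in examples}
        unseen = [value for value in spec["enum"] if value not in seen]
        if unseen:
            warnings.append(f"{name}: no example covers {unseen}")
    return warnings
```

Run against the patient notes configuration, this would report nothing, since both status values appear in the examples and each example includes every required field.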

Limitations

  • Only supports PDF and TIFF formats
  • Document upload endpoint is temporary
  • Multimodal processing is experimental
  • Maximum file size: 10MB
  • Requires Claude 3.5 Sonnet for multimodal capabilities

Future Enhancements

  • Content Lake integration
  • Additional document format support
  • Improved multimodal processing
  • Batch processing capabilities
  • Enhanced validation options