Document Extractor Agent

Overview

The Document Extractor Agent extracts structured information from unstructured text or documents. It uses an LLM to convert raw input into JSON output that conforms to a predefined schema.

Key Features

- Structured Output: Constrains the LLM to emit valid JSON using Pydantic schemas
- Multi-format Support: Processes both raw text and documents (PDF, TIFF)
- Smart Document Processing: Automatically detects whether a document is scanned or digital and handles it accordingly
- Example-based Learning: Supports few-shot learning through examples

Configuration

The following examples show basic Document Extractor agent configurations:

Invoice Processing Example

{
    "name": "InvoiceProcessor",
    "description": "Extracts invoice information",
    "agentType": "document_extractor",
    "config": {
        "prompt": "Extract invoice details from the document.",
        "output_schema": {
            "title": "Invoice",
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "date": {"type": "string"},
                "amount": {"type": "number"},
                "vendor": {"type": "string"}
            },
            "required": ["invoice_number", "amount"]
        },
        "examples": [
            {
                "input": "Invoice #12345 from Acme Corp dated 2024-03-15 for $499.99",
                "extraction": {
                    "invoice_number": "12345",
                    "date": "2024-03-15",
                    "amount": 499.99,
                    "vendor": "Acme Corp"
                }
            }
        ],
        "max_tokens": 2080,
        "model_id": "anthropic.claude-3-5-sonnet-20240307-v1:0"
    }
}
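
Because the agent's output is constrained by output_schema, downstream code can re-check a result before using it. The following is a minimal sketch, using only the Python standard library, that checks required fields and basic types against the Invoice schema above. The helper name check_extraction is hypothetical and is not part of the agent's API; a full JSON Schema validator covers many more keywords than this.

```python
# Hypothetical client-side check; not part of the Document Extractor API.
INVOICE_SCHEMA = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "vendor": {"type": "string"},
    },
    "required": ["invoice_number", "amount"],
}

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float), "array": list, "object": dict}


def check_extraction(extraction: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in extraction:
            problems.append(f"missing required field: {field}")
    for field, value in extraction.items():
        spec = schema["properties"].get(field)
        if spec is None:
            continue  # fields outside the schema are ignored in this sketch
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"wrong type for {field}: expected {spec['type']}")
    return problems
```

Calling check_extraction on the example extraction above returns an empty list, while an extraction without amount reports the missing field.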

Patient Notes Processing Example

{
    "name": "PatientNotesProcessor",
    "description": "Medical notes extraction and classification agent",
    "agentType": "document_extractor",
    "config": {
        "prompt": "Extract structured information from patient notes. Include symptoms, diagnosis, and recommended actions. Mark status as pending_review for any notes containing severe symptoms or unclear diagnosis.",
        "output_schema": {
            "title": "PatientNote",
            "type": "object",
            "properties": {
                "patient_id": {"type": "string", "description": "Patient identifier"},
                "visit_date": {"type": "string", "description": "Date of visit"},
                "symptoms": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of reported symptoms"
                },
                "diagnosis": {"type": "string", "description": "Primary diagnosis"},
                "recommended_actions": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Recommended follow-up actions"
                },
                "status": {
                    "type": "string",
                    "enum": ["complete", "pending_review"],
                    "description": "Review status of the note"
                }
            },
            "required": ["patient_id", "visit_date", "symptoms", "status"]
        },
        "examples": [
            {
                "input": "Patient ID: P123 seen on 2024-03-15. Reports mild headache and fatigue for 3 days. Likely viral infection. Recommend rest and fluids.",
                "extraction": {
                    "patient_id": "P123",
                    "visit_date": "2024-03-15",
                    "symptoms": ["headache", "fatigue"],
                    "diagnosis": "viral infection",
                    "recommended_actions": ["rest", "increase fluid intake"],
                    "status": "complete"
                }
            },
            {
                "input": "Patient ID: P456 on 2024-03-16. Severe chest pain, shortness of breath. ECG shows irregular pattern. Further cardiac evaluation needed.",
                "extraction": {
                    "patient_id": "P456",
                    "visit_date": "2024-03-16",
                    "symptoms": ["severe chest pain", "shortness of breath"],
                    "diagnosis": "possible cardiac condition - evaluation pending",
                    "recommended_actions": ["cardiac evaluation", "immediate follow-up"],
                    "status": "pending_review"
                }
            }
        ],
        "max_tokens": 2080,
        "model_id": "anthropic.claude-3-5-sonnet-20240307-v1:0"
    }
}

Tip

Notice how the second example is marked as pending_review because of its severe symptoms and unclear diagnosis. Together with the instruction in the prompt, this example shows the model which notes to flag, which helps prioritize cases that need immediate attention.
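
Once notes have been extracted, the status field can drive a simple review queue. The sketch below is one possible way to do that on the client side; the function name split_by_status is hypothetical, and treating every status other than "complete" as needing review is an assumption chosen to err on the side of caution.

```python
def split_by_status(notes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition extracted PatientNote dicts into (needs_review, complete)."""
    needs_review, complete = [], []
    for note in notes:
        # Assumption: anything not explicitly "complete" goes to review,
        # including notes whose status is missing or unrecognized.
        if note.get("status") == "complete":
            complete.append(note)
        else:
            needs_review.append(note)
    return needs_review, complete


notes = [
    {"patient_id": "P123", "status": "complete"},
    {"patient_id": "P456", "status": "pending_review"},
]
needs_review, complete = split_by_status(notes)
```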

Using the Document Extractor Agent

Text Input

POST /v1/agents/{agent_id}/invoke-document-extractor
{
    "text": "Receipt from Walmart dated 2024-03-20. Total amount: $123.45"
}
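
As an illustration, the request above could be built in Python with the standard library as sketched below. The base URL, the agent ID, and the JSON Content-Type header are assumptions for the example; authentication is not covered here, so add whatever headers your deployment requires.

```python
import json
import urllib.request

# Assumption: placeholder host; substitute your deployment's base URL.
BASE_URL = "https://api.example.com"


def build_text_request(agent_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the text-input endpoint."""
    url = f"{BASE_URL}/v1/agents/{agent_id}/invoke-document-extractor"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


request = build_text_request(
    "my-agent-id",  # hypothetical agent ID
    "Receipt from Walmart dated 2024-03-20. Total amount: $123.45",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```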

Document Input

POST /v1/agents/{agent_id}/invoke-document-extractor
{
    "pathToDocument": "s3://your-bucket/path/to/document.pdf"
}
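
Before sending a document request, it can help to confirm that pathToDocument is a well-formed S3 path. The sketch below does this with urllib.parse; the helper names parse_s3_path and build_document_payload are hypothetical, and the agent may apply its own, different validation on the server side.

```python
from urllib.parse import urlparse


def parse_s3_path(path_to_document: str) -> tuple[str, str]:
    """Split an s3://bucket/key path into (bucket, key), or raise ValueError."""
    parsed = urlparse(path_to_document)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 path: {path_to_document!r}")
    return parsed.netloc, key


def build_document_payload(path_to_document: str) -> dict:
    """Return the request body for document input after a sanity check."""
    parse_s3_path(path_to_document)  # raises ValueError on malformed paths
    return {"pathToDocument": path_to_document}
```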

Document Upload

To obtain the S3 path, upload the document through our temporary document upload endpoint. Note that this endpoint will be deprecated once the Content Lake integration is complete.

Document Processing Flow

Input Detection
- Determines if input is text or document
- Validates document format (PDF/TIFF only)
Document Analysis
- Detects whether the document is scanned or digital
Processing Strategy
- Digital Documents: Direct text extraction
- Scanned Documents: Multimodal LLM processing
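
The flow above can be sketched as a small routing function. This is only an illustration of the decision order: how the agent actually distinguishes scanned from digital documents is not documented here, so the text_layer check, the accepted file extensions, and the route names are all assumptions.

```python
SUPPORTED_EXTENSIONS = (".pdf", ".tif", ".tiff")  # assumed spellings for PDF/TIFF


def choose_route(payload: dict, text_layer: str = "") -> str:
    """Pick a processing route for an invoke payload.

    `text_layer` stands in for whatever text a parser could read directly
    from the file. Using it to separate digital from scanned documents is
    an assumed heuristic, not the agent's documented behavior.
    """
    # Step 1: input detection (text vs. document) and format validation.
    if "text" in payload:
        return "text"
    path = payload.get("pathToDocument", "")
    if not path.lower().endswith(SUPPORTED_EXTENSIONS):
        raise ValueError("only PDF and TIFF documents are supported")
    # Steps 2 and 3: document analysis, then processing strategy.
    if text_layer.strip():
        return "direct_text_extraction"  # digital document
    return "multimodal_llm"  # scanned document
```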

Experimental Feature

Multimodal document processing is currently experimental. Future updates will move text extraction to the ingestion phase in the Content Lake.

Best Practices

Schema Design
- Keep schemas focused and specific
- Include clear property descriptions
- Mark required fields appropriately
Example Selection
- Provide diverse, representative examples
- Include edge cases
- Match your use case closely

"examples": [
    {
        "input": "Simple case...",
        "extraction": { /* simple structure */ }
    },
    {
        "input": "Complex case with missing fields...",
        "extraction": { /* handles missing data */ }
    }
]
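
To make the role of examples concrete, the sketch below shows one way few-shot examples could be rendered into a prompt. The agent's real prompt template is not documented here, so the layout and the function name render_few_shot_prompt are assumptions. The second example omits the optional date and vendor fields to show how missing data might be represented.

```python
import json

EXAMPLES = [
    {
        "input": "Invoice #12345 from Acme Corp dated 2024-03-15 for $499.99",
        "extraction": {
            "invoice_number": "12345",
            "date": "2024-03-15",
            "amount": 499.99,
            "vendor": "Acme Corp",
        },
    },
    {
        # Edge case: no date or vendor in the text, so only required fields appear.
        "input": "Inv. 777, total due 10.00",
        "extraction": {"invoice_number": "777", "amount": 10.0},
    },
]


def render_few_shot_prompt(prompt: str, examples: list[dict], new_input: str) -> str:
    """Join the instruction, the worked examples, and the new input (assumed layout)."""
    parts = [prompt, ""]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i} input: {example['input']}")
        parts.append(f"Example {i} extraction: {json.dumps(example['extraction'])}")
        parts.append("")
    parts.append(f"Input: {new_input}")
    parts.append("Extraction:")
    return "\n".join(parts)
```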

Prompt Engineering
- Be specific about extraction requirements
- Include validation rules
- Mention format expectations
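
A format expectation stated in the prompt is easier to rely on when the client also checks it. The sketch below pairs a hypothetical prompt wording with a strict YYYY-MM-DD check; neither the prompt text nor the helper is_iso_date comes from the agent itself.

```python
import re
from datetime import date

# Hypothetical prompt wording that states format expectations explicitly.
PROMPT = (
    "Extract invoice details from the document. "
    "Return dates as YYYY-MM-DD. "
    "Return amount as a number without currency symbols."
)


def is_iso_date(value: str) -> bool:
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    # The regex enforces the exact layout; fromisoformat rejects impossible dates.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```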

Limitations

- Only supports PDF and TIFF formats
- Document upload endpoint is temporary
- Multimodal processing is experimental
- Maximum file size: 10MB
- Requires Claude 3.5 Sonnet for multimodal capabilities
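
The size and format limits can be checked locally before uploading. The sketch below inspects the file's leading bytes rather than its extension. It assumes that "10MB" means 10 * 1024 * 1024 bytes, recognizes only classic TIFF headers (not BigTIFF), and uses hypothetical helper names.

```python
from typing import Optional

MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB


def sniff_format(data: bytes) -> Optional[str]:
    """Guess the format from magic bytes; returns 'pdf', 'tiff', or None."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    # Classic TIFF: little-endian "II*\0" or big-endian "MM\0*".
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    return None


def preflight_problems(data: bytes) -> list[str]:
    """Return reasons the file would likely be rejected; empty means none found."""
    problems = []
    if len(data) > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds the 10MB limit")
    if sniff_format(data) is None:
        problems.append("not a PDF or TIFF file")
    return problems
```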

Future Enhancements

- Content Lake integration
- Additional document format support
- Improved multimodal processing
- Batch processing capabilities
- Enhanced validation options
