Document Extractor Agent
Overview
The Document Extractor Agent extracts structured information from unstructured text or documents. It uses an LLM to convert raw input into JSON output that conforms to a predefined schema.
Key Features
- Structured Output: Forces the LLM to output valid JSON using Pydantic schemas
- Multi-format Support: Processes both raw text and documents (PDF, TIFF)
- Smart Document Processing: Automatically detects and handles scanned vs. digital documents
- Example-based Learning: Supports few-shot learning through examples
Configuration
Here's a basic Document Extractor agent configuration:
Invoice Processing Example
{
  "name": "InvoiceProcessor",
  "description": "Extracts invoice information",
  "agentType": "document_extractor",
  "config": {
    "prompt": "Extract invoice details from the document.",
    "output_schema": {
      "title": "Invoice",
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "vendor": {"type": "string"}
      },
      "required": ["invoice_number", "amount"]
    },
    "examples": [
      {
        "input": "Invoice #12345 from Acme Corp dated 2024-03-15 for $499.99",
        "extraction": {
          "invoice_number": "12345",
          "date": "2024-03-15",
          "amount": 499.99,
          "vendor": "Acme Corp"
        }
      }
    ],
    "max_tokens": 2080,
    "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0"
  }
}
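The `required` list and the property types in `output_schema` define what a valid extraction looks like. As a minimal sketch, the check below applies those two rules on the client side using only the standard library. The helper name `check_extraction` is hypothetical and not part of the agent's API, and it only handles the flat `string` and `number` types used in this schema.

```python
# Hypothetical client-side check, not part of the agent's API.
INVOICE_SCHEMA = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "vendor": {"type": "string"},
    },
    "required": ["invoice_number", "amount"],
}

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float)}


def check_extraction(schema: dict, extraction: dict) -> list[str]:
    """Return a list of problems; an empty list means the extraction passes."""
    problems = []
    for name in schema.get("required", []):
        if name not in extraction:
            problems.append(f"missing required field: {name}")
    for name, value in extraction.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

Optional fields such as `date` and `vendor` may be absent without triggering a problem, which mirrors how `required` is declared in the configuration.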
Patient Notes Processing Example
{
  "name": "PatientNotesProcessor",
  "description": "Medical notes extraction and classification agent",
  "agentType": "document_extractor",
  "config": {
    "prompt": "Extract structured information from patient notes. Include symptoms, diagnosis, and recommended actions. Mark status as pending_review for any notes containing severe symptoms or unclear diagnosis.",
    "output_schema": {
      "title": "PatientNote",
      "type": "object",
      "properties": {
        "patient_id": {"type": "string", "description": "Patient identifier"},
        "visit_date": {"type": "string", "description": "Date of visit"},
        "symptoms": {
          "type": "array",
          "items": {"type": "string"},
          "description": "List of reported symptoms"
        },
        "diagnosis": {"type": "string", "description": "Primary diagnosis"},
        "recommended_actions": {
          "type": "array",
          "items": {"type": "string"},
          "description": "Recommended follow-up actions"
        },
        "status": {
          "type": "string",
          "enum": ["complete", "pending_review"],
          "description": "Review status of the note"
        }
      },
      "required": ["patient_id", "visit_date", "symptoms", "status"]
    },
    "examples": [
      {
        "input": "Patient ID: P123 seen on 2024-03-15. Reports mild headache and fatigue for 3 days. Likely viral infection. Recommend rest and fluids.",
        "extraction": {
          "patient_id": "P123",
          "visit_date": "2024-03-15",
          "symptoms": ["headache", "fatigue"],
          "diagnosis": "viral infection",
          "recommended_actions": ["rest", "increase fluid intake"],
          "status": "complete"
        }
      },
      {
        "input": "Patient ID: P456 on 2024-03-16. Severe chest pain, shortness of breath. ECG shows irregular pattern. Further cardiac evaluation needed.",
        "extraction": {
          "patient_id": "P456",
          "visit_date": "2024-03-16",
          "symptoms": ["severe chest pain", "shortness of breath"],
          "diagnosis": "possible cardiac condition - evaluation pending",
          "recommended_actions": ["cardiac evaluation", "immediate follow-up"],
          "status": "pending_review"
        }
      }
    ],
    "max_tokens": 2080,
    "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0"
  }
}
Notice that the second example is marked as pending_review because of the severe symptoms and unclear diagnosis, following the rule stated in the prompt. Flagging such notes helps prioritize cases that need immediate attention.
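Once extractions carry a `status` value, a caller can use it to separate notes that need a human look from those that do not. The sketch below is one possible way to do that downstream; the function `route_by_status` and the shape of the `notes` list are assumptions for illustration, and only the two enum values come from the schema above.

```python
from collections import defaultdict

# The two values allowed by the "status" enum in the output schema.
ALLOWED_STATUSES = {"complete", "pending_review"}


def route_by_status(extractions: list[dict]) -> dict[str, list[str]]:
    """Group patient IDs by review status, rejecting values outside the enum."""
    queues = defaultdict(list)
    for note in extractions:
        status = note["status"]
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        queues[status].append(note["patient_id"])
    return dict(queues)


# Hypothetical extraction results, trimmed to the fields used here.
notes = [
    {"patient_id": "P123", "status": "complete"},
    {"patient_id": "P456", "status": "pending_review"},
]
queues = route_by_status(notes)
print(queues.get("pending_review", []))
```

Checking the value against the enum on the client side is a cheap guard in case a response ever falls outside the schema.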
Using the Document Extractor Agent
Text Input
POST /v1/agents/{agent_id}/invoke-document-extractor
{
"text": "Receipt from Walmart dated 2024-03-20. Total amount: $123.45"
}
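As an illustration of the text-input call, the following sketch builds the POST request with Python's standard library without sending it. The host in `BASE_URL` and the agent ID are placeholders, and authentication headers are omitted because they depend on the deployment.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, not from the docs


def build_text_request(agent_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a raw-text extraction."""
    url = f"{BASE_URL}/v1/agents/{agent_id}/invoke-document-extractor"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_text_request(
    "my-agent-id",  # placeholder agent ID
    "Receipt from Walmart dated 2024-03-20. Total amount: $123.45",
)
# Sending is left out on purpose; with valid credentials it could look like:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```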
Document Input
POST /v1/agents/{agent_id}/invoke-document-extractor
{
"pathToDocument": "s3://your-bucket/path/to/document.pdf"
}
Currently, you can use our temporary document upload endpoint to get the S3 path. Note that this endpoint will be deprecated once the Content Lake integration is complete.
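Because the document request takes an S3 path, it can help to catch malformed paths before calling the endpoint. The helper below is a hypothetical client-side sketch; the only name taken from the documentation is the `pathToDocument` field.

```python
from urllib.parse import urlparse


def build_document_payload(s3_uri: str) -> dict:
    """Return the request body for a document stored in S3.

    Raises ValueError unless the URI has the form s3://bucket/key.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected s3://bucket/key, got {s3_uri!r}")
    return {"pathToDocument": s3_uri}


payload = build_document_payload("s3://your-bucket/path/to/document.pdf")
```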
Document Processing Flow
- Input Detection
  - Determines if input is text or document
  - Validates document format (PDF/TIFF only)
- Document Analysis
- Processing Strategy
  - Digital Documents: Direct text extraction
  - Scanned Documents: Multimodal LLM processing
The multimodal document processing is currently experimental. Future updates will move text extraction to the ingestion phase in the Content Lake.
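The flow above can be sketched as a small decision function. This is an illustration, not the service's implementation: the `has_text_layer` flag stands in for the document analysis step, whose actual detection method is not described here, and treating `.tif` as an accepted extension is an assumption.

```python
from enum import Enum
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff"}  # ".tif" is an assumed alias


class Strategy(Enum):
    RAW_TEXT = "raw_text"                   # input was plain text
    DIRECT_TEXT = "direct_text_extraction"  # digital document
    MULTIMODAL = "multimodal_llm"           # scanned document


def choose_strategy(request: dict, has_text_layer: bool = False) -> Strategy:
    """Pick a processing strategy for one request body.

    `has_text_layer` is a placeholder for the result of document analysis.
    """
    # Input detection: text or document?
    if "text" in request:
        return Strategy.RAW_TEXT
    path = request.get("pathToDocument")
    if path is None:
        raise ValueError("request needs either 'text' or 'pathToDocument'")
    # Format validation: PDF/TIFF only.
    if PurePosixPath(path).suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported format: {path}")
    # Processing strategy: digital documents get direct text extraction,
    # scanned documents go to the multimodal LLM.
    return Strategy.DIRECT_TEXT if has_text_layer else Strategy.MULTIMODAL
```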
Best Practices
- Schema Design
  - Keep schemas focused and specific
  - Include clear property descriptions
  - Mark required fields appropriately
- Example Selection
  - Provide diverse, representative examples
  - Include edge cases
  - Match your use case closely

  "examples": [
    {
      "input": "Simple case...",
      "extraction": { /* simple structure */ }
    },
    {
      "input": "Complex case with missing fields...",
      "extraction": { /* handles missing data */ }
    }
  ]
- Prompt Engineering
  - Be specific about extraction requirements
  - Include validation rules
  - Mention format expectations
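Two of the schema design points, clear property descriptions and correctly marked required fields, lend themselves to an automated check. The function below is a hypothetical lint helper for flat schemas like the ones in the configuration examples; it is a sketch for use before deploying an agent, not a feature of the platform.

```python
def lint_schema(schema: dict) -> list[str]:
    """Return warnings for a flat JSON Schema object."""
    warnings = []
    properties = schema.get("properties", {})
    # Every property should explain itself to the LLM.
    for name, spec in properties.items():
        if not spec.get("description"):
            warnings.append(f"{name}: missing description")
    # Every required name must actually be defined.
    for name in schema.get("required", []):
        if name not in properties:
            warnings.append(f"{name}: required but not defined in properties")
    return warnings


# Deliberately flawed schema used only to exercise the checks.
draft_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "amount": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}
for warning in lint_schema(draft_schema):
    print(warning)
```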
Limitations
- Only supports PDF and TIFF formats
- Document upload endpoint is temporary
- Multimodal processing is experimental
- Maximum file size: 10MB
- Requires Claude 3.5 Sonnet for multimodal capabilities
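The size limit can be checked locally before a file is uploaded. The sketch below assumes that "10MB" means 10 × 1024 × 1024 bytes, which the documentation does not specify, and both function names are hypothetical.

```python
import os

MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024  # assumes "10MB" means 10 MiB


def within_size_limit(size_bytes: int) -> bool:
    """True if a non-empty file of this size fits under the documented limit."""
    return 0 < size_bytes <= MAX_FILE_SIZE_BYTES


def check_local_file(path: str) -> None:
    """Raise ValueError if the file at `path` is empty or too large."""
    size = os.path.getsize(path)
    if not within_size_limit(size):
        raise ValueError(
            f"{path} is {size} bytes; limit is {MAX_FILE_SIZE_BYTES} bytes"
        )
```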
Future Enhancements
- Content Lake integration
- Additional document format support
- Improved multimodal processing
- Batch processing capabilities
- Enhanced validation options