RAG (Retrieval-Augmented Generation)
Overview
RAG combines the power of retrieval-based systems with generative AI models to provide accurate, context-aware responses. Instead of relying solely on the LLM's training data, RAG fetches relevant information from your document repository before generating responses.
Basic RAG Flow
Configuration
Here's how to configure a RAG agent with our enhanced architecture:
{
    "name": "DocumentHelper",
    "description": "Document assistant",
    "agentType": "rag",
    "notes": "Agent for document queries",
    "config": {
        "filter_value": {},
        "limit": 5,
        "system_prompt": "Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: ",
        "llm_model_id": "amazon.nova-micro-v1:0",
        "inference_config": {
            "max_tokens": 4000,
            "temperature": 0.7
        }
    }
}
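The system_prompt contains newlines, which must be escaped in JSON. One way to avoid escaping them by hand is to build the configuration programmatically and serialize it. The following is a minimal sketch in Python using only the standard library. The helper name build_rag_agent_config is hypothetical, the field names and values are copied from the example above, and it assumes the {context_str} and {query_str} placeholders are sent as literal text for the service to fill in.

```python
import json

# Prompt template copied from the configuration above. The placeholders are
# kept as literal text; this sketch assumes the service substitutes them.
SYSTEM_PROMPT = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_rag_agent_config(name, description, limit=5, temperature=0.7):
    """Return a RAG agent definition as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        "description": description,
        "agentType": "rag",
        "notes": "Agent for document queries",
        "config": {
            "filter_value": {},
            "limit": limit,
            "system_prompt": SYSTEM_PROMPT,
            "llm_model_id": "amazon.nova-micro-v1:0",
            "inference_config": {"max_tokens": 4000, "temperature": temperature},
        },
    }


agent = build_rag_agent_config("DocumentHelper", "Document assistant")
# json.dumps escapes the newlines in the prompt, so the output is valid JSON.
payload = json.dumps(agent, indent=4)
```

The resulting payload string can be used wherever the JSON configuration is expected.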
Max Retries
When an agent invocation errors for any reason (e.g., the input is too large or a network issue occurs),
the client you're using to make the request could time out, since each retry adds to the total response time.
To reduce this risk, set max_retries in the inference_config to a low number (e.g., 2).
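As a small illustration, the sketch below adds max_retries to an existing agent definition without mutating the original. It assumes max_retries is a direct key of config.inference_config, as the text above suggests; the helper name with_max_retries is hypothetical.

```python
import copy


def with_max_retries(agent_definition, max_retries=2):
    """Return a copy of the agent definition with max_retries set.

    Assumes max_retries lives directly under config.inference_config.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    updated = copy.deepcopy(agent_definition)
    inference = updated.setdefault("config", {}).setdefault("inference_config", {})
    inference["max_retries"] = max_retries
    return updated


original = {"config": {"inference_config": {"max_tokens": 4000, "temperature": 0.7}}}
patched = with_max_retries(original, max_retries=2)
```

Working on a deep copy keeps the original definition unchanged, which makes it easy to compare behavior with and without the retry limit.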
Best Practices
- Document Preparation
  - Ensure documents are properly chunked
  - Maintain consistent formatting
  - Include metadata for filtering
- Query Formation
  - Be specific in queries
  - Use natural language
  - Include relevant context
- Filter Usage
  - The filterValue parameter follows the Semantic API filtering structure
  - For detailed filter options and syntax, refer to the Semantic API Documentation
  - Common filters include document type, department, and date ranges
  - Combine with semantic search for better results

The filtering capabilities are provided by the Semantic API. Check its documentation for the complete list of supported filters and the proper filter syntax.
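A request body that combines a query with filters can be assembled as in the sketch below. The messages and filterValue keys and the department and documentType filters follow the request examples on this page. The helper name build_invoke_body is hypothetical, and the set of valid filter keys is defined by the Semantic API, not by this snippet.

```python
import json


def build_invoke_body(question, filters=None):
    """Build the JSON body for an invoke request (hypothetical helper)."""
    body = {"messages": [{"role": "user", "content": question}]}
    if filters:
        # Drop filters whose value is None so only real constraints are sent.
        cleaned = {key: value for key, value in filters.items() if value is not None}
        if cleaned:
            body["filterValue"] = cleaned
    return body


body = build_invoke_body(
    "What's in our HR policy about vacation days?",
    {"department": "HR", "documentType": "policy"},
)
print(json.dumps(body, indent=4))
```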
Streaming Support
RAG agents support streaming responses through the unified /v1/agents/{agent_id}/versions/{version_id-or-latest}/invoke-stream endpoint:
POST /v1/agents/{agent_id}/versions/{version_id-or-latest}/invoke-stream
{
    "messages": [
        {
            "role": "user",
            "content": "What's in our HR policy about vacation days?"
        }
    ],
    "filterValue": {
        "department": "HR",
        "documentType": "policy"
    }
}
When using streaming, responses come in chunks. Each chunk is a valid JSON object containing a portion of the complete response. The concatenation of all chunks forms the complete text response, which is not in JSON format.
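Client-side reassembly of a streamed response might look like the sketch below. It assumes each chunk arrives as one JSON object per line and that the text portion sits under a key named text. That key name is an assumption, not part of the documented schema, so adjust text_key to match the chunks you actually receive. The sample input is made up for illustration.

```python
import json


def join_stream_chunks(lines, text_key="text"):
    """Concatenate the text carried by each JSON chunk.

    `lines` is any iterable of str or bytes, one JSON object per item.
    `text_key` is an assumed field name; adjust it to the real chunk schema.
    """
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get(text_key, ""))
    return "".join(parts)


# Made-up sample chunks, mixing str and bytes as an HTTP client might yield.
sample = ['{"text": "Hello, "}', "", b'{"text": "world"}']
answer = join_stream_chunks(sample)
```

Each chunk is parsed as JSON on its own, while the joined result is treated as plain text.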
Deep Search
You can enhance RAG agents with Deep Search capabilities for more comprehensive analysis.
To enable Deep Search, include enable_deep_search: true in your request to the standard invoke-agent or invoke-agent-stream endpoints.
hx_env_id is also required; however, it will be automatically extracted from the JWT token if your authentication is set up properly.
Example Deep Search Request
POST /v1/agents/{agent_id}/invoke-agent
{
    "messages": [
        {
            "role": "user",
            "content": "What's in our HR policy about vacation days? Provide a detailed analysis."
        }
    ],    
    "enable_deep_search": true,
    "filterValue": {
        "department": "HR",
        "documentType": "policy"
    }
}
- Deep Search performs a more thorough, multi-step research process and can take several minutes to respond.
- It is ideal for complex queries that require in-depth analysis across multiple documents.
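Because Deep Search can take several minutes, the client needs a generous timeout. The sketch below builds, but does not send, a Deep Search request with Python's standard urllib. The host in BASE_URL, the bearer-token header, and the 600-second timeout are assumptions for illustration. The endpoint path and body fields follow the example request above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host
DEEP_SEARCH_TIMEOUT_S = 600  # assumed budget for a multi-minute response


def deep_search_request(agent_id, question, token, filters=None):
    """Build (but do not send) a Deep Search request."""
    body = {
        "messages": [{"role": "user", "content": question}],
        "enable_deep_search": True,
    }
    if filters:
        body["filterValue"] = filters
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/agents/{agent_id}/invoke-agent",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed JWT bearer auth
        },
        method="POST",
    )


req = deep_search_request(
    "my-agent",
    "What's in our HR policy about vacation days? Provide a detailed analysis.",
    "JWT_HERE",
    {"department": "HR", "documentType": "policy"},
)
# To send: urllib.request.urlopen(req, timeout=DEEP_SEARCH_TIMEOUT_S)
```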
Because streaming returns data chunk by chunk, failures that occur mid-stream are difficult to report to the client. The returned status code may therefore be 200 even when an error occurred. We do our best to include the error in the response, but this is not always possible, and silent failures can occur. Check the logs to confirm whether an error occurred.
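Since a 200 status does not guarantee success, a client can inspect every chunk instead of trusting the status code. The sketch below is one defensive approach. It assumes that an in-band failure appears as a chunk with an error key and that text sits under a text key. Both key names are assumptions, as is the StreamError class. Silent failures that produce no error chunk cannot be detected this way and still require checking the logs.

```python
import json


class StreamError(RuntimeError):
    """Raised when a streamed chunk is malformed or reports a failure."""


def iter_text_or_raise(lines, text_key="text", error_key="error"):
    """Yield the text of each chunk, raising on malformed or error chunks.

    Both key names are assumed; adjust them to the real chunk schema.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        try:
            chunk = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise StreamError(f"malformed chunk: {raw!r}") from exc
        if error_key in chunk:
            raise StreamError(str(chunk[error_key]))
        yield chunk.get(text_key, "")


# Made-up sample: one good chunk followed by an in-band error.
received = []
try:
    for piece in iter_text_or_raise(['{"text": "partial "}', '{"error": "boom"}']):
        received.append(piece)
    failed = False
except StreamError:
    failed = True
```

Text received before the failure stays available in received, so the caller can decide whether to show the partial answer or discard it.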