AI-Ready Content Starts Here - Introducing Knowledge Enrichment
AI isn’t failing because models aren’t good enough. It’s failing because the data isn’t.
Today, we’re launching Knowledge Enrichment, a breakthrough API that transforms fragmented, unstructured enterprise content into deeply structured, semantically rich, AI-ready data. Powered by Hyland’s proven Document Filters and AI, it delivers what developers and architects have been missing: clean, contextual, vectorized output across 600+ file types, ready for direct integration into your AI stack.
If you’re building AI agents, retrieval-augmented generation (RAG) systems, or enterprise LLMs, stop what you’re doing. This changes everything.
Why AI Is Only As Good As Its Input
Large language models are powerful, but they work best when the content they see is accurate, complete, and grounded in the original source.
That’s not how most enterprise content looks. It’s buried in PDFs, spreadsheets, presentations, and proprietary formats. It lives across ECMs, network drives, and cloud repositories. And even when you extract it, traditional methods often strip away layout, context, and structure, leaving behind only raw text.
Some teams try to solve this by prompting LLMs to infer structure, meaning, or metadata from that text. But LLMs don’t actually know what was in the file; they’re guessing. Their outputs are based on pattern recognition and probability, not ground truth.
Knowledge Enrichment does the opposite. It pulls structure, semantics, and content directly from the file itself using deterministic extraction, not AI hallucination. Backed by Hyland Document Filters, it knows what was a table, what was a heading, what was a footer, and what content belonged together, with no inference required.
The result is structured outputs you can trust, and clean, contextual content your LLMs don’t have to guess about.
Under the Hood: How Knowledge Enrichment Transforms Content
Knowledge Enrichment is built on two core components:
1. Hyland Document Filters - The Content Intelligence Foundation
At the foundation of Knowledge Enrichment is Hyland Document Filters, the same high-performance SDK trusted by leading security tools, compliance platforms, and search products. It processes over 600 file formats across documents, emails, CAD files, and more, with zero reliance on external services or cloud APIs.
Document Filters doesn’t just extract text. It:
- Preserves layout and hierarchy (e.g., columns, tables, headers)
- Maintains semantic zones (titles, footnotes, sidebars)
- Identifies metadata, annotations, embedded objects, and more
This is the raw substrate Knowledge Enrichment builds on: a consistent, portable, and accurate foundation for AI structuring.
2. Contextual Structuring and Semantic Enrichment
On top of that, Knowledge Enrichment performs deep analysis and transforms content into structured outputs using AI where needed:
Capability | Description
---|---
Entity Extraction | Identifies named entities (people, orgs, places) with labeling and context |
Table Structuring | Extracts and normalizes tables with headers, rows, and data types |
Contextual Chunking | Segments content into meaningfully grouped sections based on structure and semantics |
Summarization | Generates document-level summaries |
Classification | Tags documents by type, topic, or intent using AI classifiers |
Contextual Metadata | Derives rich metadata by interpreting document meaning, layout, and semantics, far beyond basic file properties |
Embeddings Generation | Creates vector representations for content, enabling RAG and clustering |
These outputs are returned in Markdown or JSON formats and are ready for direct ingestion into:
- Vector databases (e.g., Pinecone, Weaviate, FAISS)
- Data lakes (e.g., Delta Lake, S3, ADLS)
- Data catalogs (e.g., Unity Catalog, Informatica, Collibra)
- MLOps pipelines (e.g., LangChain, LlamaIndex, Haystack)
- LLMs (e.g., OpenAI, Claude, Llama)
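To make the ingestion step concrete, here is a minimal sketch of how chunk-plus-embedding records might be searched once they come back from the API. The `Chunk` fields (`text`, `embedding`, `page`) and the `top_k` helper are assumptions made for illustration, not the documented response schema, and the in-memory list stands in for a real vector database such as those listed above.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical record shape; the real response schema may differ.
    text: str
    embedding: list[float]
    page: int


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k chunks whose embeddings are most similar to the query."""
    return sorted(chunks, key=lambda c: cosine(query, c.embedding), reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
chunks = [
    Chunk("Policy coverage limits", [1.0, 0.0, 0.0], page=2),
    Chunk("Claim filing deadlines", [0.0, 1.0, 0.0], page=5),
    Chunk("Coverage exclusions", [0.9, 0.1, 0.0], page=3),
]
results = top_k([1.0, 0.0, 0.0], chunks)
```

Keeping the page number alongside each vector is what lets a retrieval step hand back a source location along with the matching text.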
Real-World Workflows: Where Knowledge Enrichment Delivers
Insurance – Claims Intake and Routing
Scanned documents, handwritten notes, and damage photos are all enriched into structured JSON with extracted metadata (claim number, policy ID, repair estimate) and chunked content for routing or summarization. LLMs consume the cleaned data to assist adjusters or validate claims.
Legal – Discovery Preparation and Context Extraction
Contracts, pleadings, and emails are parsed into structured sections with semantic labeling and summaries. The output can be ingested into eDiscovery tools or used to fine-tune LLMs that assist with clause comparison or timeline construction.
Financial Services – RAG-Powered Agent Workflows
Annual reports, prospectuses, and policies are vectorized and embedded with layout-preserved markup. RAG agents can then retrieve precisely the right section for question answering or summarization with full traceability back to the source.
Built for Developers: API Access from Day One
Knowledge Enrichment is designed as a developer-first API. Every feature is exposed via simple REST endpoints.
Example: Convert a file to markdown and get chunked content with embeddings
curl -L 'https://knowledge-enrichment.ai.experience.hyland.com/latest/api/data-curation/presign' \
-H 'Content-Type: application/json' \
-H 'Accept: text/json' \
-H 'Authorization: Bearer <token>' \
-d '{
"normalization": {
"quotations": true
},
"chunking": true,
"embedding": true
}'
The response contains structured content in Markdown, as well as contextually chunked content and embeddings. Each element is aligned by page and coordinate so you can trace insights back to the original document with full fidelity. See more information on this endpoint in the documentation.
Note: The above UI was created to showcase the capabilities of Knowledge Enrichment and is not part of the Knowledge Enrichment product.
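For teams calling the API from application code rather than a shell, the first curl request could be assembled in Python roughly as follows. This is a sketch under stated assumptions: the URL, headers, and payload fields are copied from the curl example, while the `build_presign_request` helper and its parameters are hypothetical names introduced here. The request is only built, not sent.

```python
import json
import urllib.request

# URL copied from the curl example above.
PRESIGN_URL = (
    "https://knowledge-enrichment.ai.experience.hyland.com"
    "/latest/api/data-curation/presign"
)


def build_presign_request(
    token: str, chunking: bool = True, embedding: bool = True
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "normalization": {"quotations": True},
        "chunking": chunking,
        "embedding": embedding,
    }
    # json.dumps always emits valid JSON, so no trailing commas can slip in.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        PRESIGN_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "text/json",
            "Authorization": f"Bearer {token}",
        },
    )


# "<token>" is a placeholder; substitute a real bearer token before sending.
request = build_presign_request("<token>")
# To send it: urllib.request.urlopen(request), then read the response body.
```

Serializing the payload from a dictionary avoids hand-written JSON mistakes, which are easy to make when editing a request body inline.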
Example: Generate a summary and additional metadata of a file
curl -L 'https://cin-context-api.experience.hyland.com/context/api/content/process' \
-H 'Content-Type: application/json' \
-H 'Accept: text/plain' \
-H 'Authorization: Bearer <token>' \
-d '{
"objectKeys": [
"string"
],
"actions": [
"image-description",
"image-metadata-generation"
],
"kSimilarMetadata": [
{
"estimate_details": {
"job_number": "R-2024-0568",
"creation_date": "2024-06-15",
"expiration_date": "2024-07-15",
"estimate_total": "8,750.00",
"status": "pending"
},
"property": {
"address": "123 Main Street",
"city": "Springfield",
"state": "IL",
"zip": "62701",
"year_built": "1995",
"roof_size_sqft": "2,400"
},
"damage_assessment": {
"damage_cause": "hail_storm",
"date_of_damage": "2024-05-20",
"affected_areas": "southwest_slope|ridge_caps|flashing",
"severity": "moderate",
"roof_condition": "significantly damaged",
"potential_cause": "recent storm or wind event",
"damage_types": "water intrusion|shingle displacement|structural exposure",
"additional_risk": "potential hidden damage in surrounding roof areas",
"urgency_level": "moderate to high"
},
"repair_scope": {
"materials": "asphalt_shingles|underlayment|flashing",
"warranty_period": "15 years",
"estimated_completion_time": "3 days",
"repair_recommendations": [
"replace damaged shingles",
"inspect and repair flashing",
"clean gutters"
]
}
}
]
}'
The response contains a JSON representation of the information we asked for from the document, including a summary and additional metadata interpreted from the document. See more information on this endpoint in the documentation.
Note: The above UI was created to showcase the capabilities of Knowledge Enrichment and is not part of the Knowledge Enrichment product.
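The example metadata in the request packs multi-valued fields into pipe-delimited strings such as `southwest_slope|ridge_caps|flashing`. If the metadata you get back follows the same convention (an assumption, since the response schema is not shown here), a small helper like the hypothetical `split_pipes` below could expand those strings into lists before downstream use.

```python
from typing import Any


def split_pipes(value: Any) -> Any:
    """Recursively expand pipe-delimited strings into lists; leave other values unchanged."""
    if isinstance(value, dict):
        return {key: split_pipes(item) for key, item in value.items()}
    if isinstance(value, list):
        return [split_pipes(item) for item in value]
    if isinstance(value, str) and "|" in value:
        return [part.strip() for part in value.split("|")]
    return value


# Excerpt of the example metadata from the request above.
metadata = {
    "damage_assessment": {
        "damage_cause": "hail_storm",
        "affected_areas": "southwest_slope|ridge_caps|flashing",
    },
    "repair_scope": {
        "materials": "asphalt_shingles|underlayment|flashing",
        "repair_recommendations": ["replace damaged shingles", "clean gutters"],
    },
}
normalized = split_pipes(metadata)
```

Single-valued fields such as `damage_cause` pass through untouched, so the helper can be applied to a whole metadata object without special cases.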
Flexible Deployment Across the Enterprise
Whether you’re building an LLM-powered copilot or modernizing legacy document automation, Knowledge Enrichment unlocks new possibilities.
Popular Patterns:
- RAG Pipelines – Enrich large document sets, vectorize, and index
- Pre-Training Data Prep – Generate structured corpora from enterprise content
- Compliance Automation – Extract key fields for audit, alerting, and validation
- LLM Fine-Tuning – Create summaries and entity-tagged data for better models
No Centralization Needed: Enrich Content Where It Lives
Knowledge Enrichment doesn't require content centralization to deliver value. It's built to operate across:
- ECM platforms (Hyland, OpenText, SharePoint, Box)
- Cloud storage (S3, Azure Blob, Google Cloud Storage)
- On-prem document stores (file shares, FTP, local repositories)
Its lightweight API model allows you to point it at content wherever it lives, without migration.
600+ File Formats, 0 Headaches
Enterprise content is messy. That’s why Knowledge Enrichment inherits the full breadth of file support from Document Filters, including:
- Documents: DOCX, PDF, PPTX, RTF, ODT, EPUB
- Spreadsheets: XLSX, CSV, NUMBERS
- Emails: MSG, EML, OLK14
- CAD & Engineering: DWG, DXF, DGN
- Text and Markup: XML, JSON, HTML
- Raster Image: PSD, DCM
- Vector Image: INDD, VSDX
Through services outside of Document Filters, Knowledge Enrichment also supports additional formats, including:
- Images: JPEG, PNG, TIFF
- Audio: FLAC, M4A, MP3, WAV
- Video: MP4, WebM
Each format is parsed with layout, structure, and content zones preserved, giving AI systems a richer, more accurate context to reason with.
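When batching a mixed repository, it can help to know up front which group a file falls into. The sketch below uses only the sample extensions listed above (a small subset of the 600+ supported formats). The `classify` helper and the `document_filters` / `other_services` labels are hypothetical names for illustration, not part of the API.

```python
from pathlib import Path
from typing import Optional

# Extension groups taken from the lists above; a sample, not the full set.
DOCUMENT_FILTERS = {
    "documents": {"docx", "pdf", "pptx", "rtf", "odt", "epub"},
    "spreadsheets": {"xlsx", "csv", "numbers"},
    "emails": {"msg", "eml", "olk14"},
    "cad": {"dwg", "dxf", "dgn"},
    "markup": {"xml", "json", "html"},
    "raster": {"psd", "dcm"},
    "vector": {"indd", "vsdx"},
}
OTHER_SERVICES = {
    "images": {"jpeg", "png", "tiff"},
    "audio": {"flac", "m4a", "mp3", "wav"},
    "video": {"mp4", "webm"},
}


def classify(filename: str) -> Optional[tuple[str, str]]:
    """Return (service, category) for a listed extension, or None if it is not in the sample."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for service, groups in (
        ("document_filters", DOCUMENT_FILTERS),
        ("other_services", OTHER_SERVICES),
    ):
        for category, extensions in groups.items():
            if ext in extensions:
                return service, category
    return None


pdf_route = classify("report.PDF")
audio_route = classify("call.mp3")
```

A `None` result here only means the extension is missing from this sample table, not that the format is unsupported.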
Fuel for Agents, Automation, and the Next Generation of AI
As enterprise AI shifts from model development to agent deployment, the need for structured, explainable, and semantically meaningful content becomes critical. Knowledge Enrichment creates the substrate that enables agents to:
- Understand document context, not just isolated facts
- Respond based on intent-rich information rather than unstructured noise
- Justify answers with traceable references back to source documents
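As an illustration of the last point, an agent could attach page and coordinate references to each answer it produces. The `SourceRef` fields and the `cite` helper below are hypothetical, as are the file name and coordinates. The sketch only assumes that each enriched element carries a page number and a bounding region, as described for the API response earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    # Hypothetical fields; the real response schema may name these differently.
    document: str
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in page units


def cite(answer: str, refs: list[SourceRef]) -> str:
    """Append numbered source references to an answer so each claim can be traced."""
    lines = [answer]
    for number, ref in enumerate(refs, start=1):
        lines.append(f"[{number}] {ref.document}, page {ref.page}, region {ref.bbox}")
    return "\n".join(lines)


# Illustrative values only; not taken from a real document.
cited = cite(
    "The estimate total is 8,750.00.",
    [SourceRef("estimate.pdf", 1, (72.0, 540.0, 300.0, 560.0))],
)
```

Carrying the reference through to the final answer is what allows a reviewer to open the source page and confirm the claim.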
Whether you’re building a smart assistant for internal operations or an external-facing customer support AI, the difference between helpful and harmful will be the data layer underneath.
Get Started: Structure Smarter, Deploy Faster
Knowledge Enrichment is available now as part of the Content Innovation Cloud.
The future of enterprise AI won’t be defined by prompts; it’ll be defined by the data you feed the models. With Knowledge Enrichment, you finally have the ability to deliver content that’s not just extracted, but structured, enriched, and ready for reasoning.
Let your AI start smarter. Let your content speak with context.
Explore the full developer documentation or express interest in getting access.