Skip to main content

Document Filters 25.2 Release

· 4 min read
Nabih Metri
Nabih Metri
Document Filters Product Manager

Document Filters 25.2 advances our shift-left strategy by enhancing traceability, data integrity, and extensibility within content pipelines. With this release, Document Filters becomes the first solution to embed positional metadata directly into Markdown for all our supported formats, setting a new benchmark for transparency and explainability in AI and search-driven workflows. We’ve also improved Markdown’s handling of complex tables, enabling seamless extraction of structured data from even the most irregular layouts. In addition, table extraction is now supported for XFA PDFs, a long-standing challenge for automation and compliance initiatives. Finally, a new custom OCR callback interface gives teams the freedom to integrate any OCR engine into their workflow, unlocking multilingual, domain-specific, and image-heavy content for broader automation. Each of these updates contributes to cleaner, more connected data earlier in the process—reducing errors, manual fixes, and integration complexity. Let’s take a closer look at what’s new.

Watch as we walk through a few of the new features in the 25.2 release of Document Filters.

Positional Metadata in Markdown and JSON

Document Filters now supports embedding positional metadata in both Markdown and JSON outputs, allowing each extracted element to be traced back to its original location in the source file. This includes paragraphs, tables, lists, and bounding boxes that are critical details for high-trust workflows like intelligent search, summarization, and document auditing. For AI systems, this positional context improves explainability, allowing answers or insights to be visually or programmatically tied back to the source, boosting confidence in model outputs.

Hyland Document Filters - Positional Metadata Hyland Document Filters - Positional Metadata in Markdown

To enable this, set the MARKDOWN_INCLUDE_LOCATIONS or JSON_INCLUDE_BOUNDS option in your configuration.

Complex Table Support in Markdown

Tables with embedded lists, multi-line content, or irregular layouts (complex tables) are now preserved more faithfully in Markdown using sanitized inline HTML. This improves support for documents like financial reports, regulatory filings, and academic papers where real-world tables often defy simple formatting.

Hyland Document Filters - Complex Tables Hyland Document Filters - Complex Tables in Markdown

Previously, these tables could be flattened or misinterpreted, requiring downstream rework. Now, data integrity is maintained across all formats, making Markdown more reliable for analytics pipelines and downstream AI processing, especially for models that ingest tabular data for classification or trend analysis.

Table Extraction from XFA PDFs

XFA-based PDF files, which are common in enterprise forms and legacy systems, can now be parsed for tabular data using Document Filters. This opens new possibilities for automating extraction from static or interactive PDFs that were previously difficult to analyze.

For industries like government, insurance, and healthcare, where XFA PDFs are still in use, this enhancement accelerates automation without requiring expensive format conversions or manual intervention.

Custom OCR Integration with Callback Support

A new OCR callback interface lets teams plug in any OCR engine, including open-source tools like EasyOCR or services like AWS Textract, directly into the Document Filters workflow. This supports multilingual, domain-specific, or high-accuracy OCR strategies, while maintaining a consistent output structure.

The callback is lightweight and language-agnostic, making it ideal for AI pipelines that require customized text recognition from images, scanned forms, or handwritten documents. Whether you’re building classification models, search tools, or extraction engines, this update gives you full control over how OCR fits into your document processing flow.

Document Filters Resources