
Document Filters 24.3 Release

Nabih Metri
Document Filters Product Manager

We're excited to announce the latest release of Document Filters, packed with powerful new features designed to enhance your document processing capabilities. This update introduces a JSON Output Type for structured data handling, a Markdown Output Type for streamlined document conversion, advanced PDF Table Extraction for improved data accuracy, and MSI Installer Sub-File Extraction for comprehensive file analysis. Additionally, we've added community-inspired support for Hancom Hangul HWPX text extraction and HD rendering. Read on to discover how these new features can elevate your workflows and drive better results.

Watch as we walk through a few of the new features in the 24.3 release of Document Filters.

PDF Table Extraction

Document Filters now supports the identification and extraction of tables from untagged PDF files. This feature preserves the logical structure of tables, rows, and cells, ensuring accurate table detection for PDFs. Preserving the structures of tables allows AI/ML systems to improve their data quality and accuracy when processing documents.


Detecting and extracting table information from an invoice. The image on the left shows how information is extracted with table detection disabled, while the image on the right shows how it is extracted with table detection enabled.

JSON Output Type

Document Filters now includes support for a JSON output type, which structures document data in a detailed, hierarchical format. This enhancement facilitates seamless integration with AI and other JSON-compatible applications, ensuring efficient parsing and utilization of document content for improved AI/ML-driven data analysis and processing.


Converting an invoice with a table into JSON, and keeping its table structure.
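To illustrate why hierarchical output is convenient downstream, here is a minimal Python sketch that walks a JSON document and collects each table as a list of records. The JSON shape used here (the "type", "children", "rows", and "cells" keys) is invented for illustration and is not the actual Document Filters schema, so the key names would need adjusting to match the real output.

```python
import json

# Hypothetical JSON shape, invented for this example.
# The real Document Filters output schema may differ.
SAMPLE_JSON = """
{
  "type": "document",
  "children": [
    {"type": "paragraph", "text": "Invoice 1001"},
    {"type": "table", "rows": [
      {"cells": ["Item", "Qty", "Price"]},
      {"cells": ["Widget", "2", "4.50"]},
      {"cells": ["Gadget", "1", "9.00"]}
    ]}
  ]
}
"""


def iter_tables(node):
    """Yield every dict tagged as a table, searching depth first."""
    if isinstance(node, dict):
        if node.get("type") == "table":
            yield node
        for value in node.values():
            yield from iter_tables(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tables(item)


def table_to_records(table):
    """Treat the first row as the header and map each later row to a dict."""
    rows = [row["cells"] for row in table["rows"]]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, cells)) for cells in body]


document = json.loads(SAMPLE_JSON)
tables = [table_to_records(table) for table in iter_tables(document)]
```

Because the table arrives as nested rows and cells rather than as a flat run of text, the consumer does not need to guess where columns begin and end.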

Markdown Output Type

Document Filters can now convert documents to a Markdown output type, providing a lightweight and efficient way to present formatted content. Markdown is well suited to AI/ML systems because its lightweight, structured composition reduces computing costs and simplifies data processing.


Converting an invoice with a table into Markdown, and keeping its table structure.
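As a rough sketch of how a preserved table can be consumed, the Python below parses a Markdown pipe table into a list of dictionaries. The invoice text is made up for this example, and the parser assumes a single simple table with a header row, a delimiter row, and no escaped pipe characters inside cells.

```python
# Made-up invoice text for illustration; not actual Document Filters output.
INVOICE_MD = """\
# Invoice 1001

| Item   | Qty | Price |
|--------|-----|-------|
| Widget | 2   | 4.50  |
| Gadget | 1   |       |
"""


def split_row(line):
    """Split one pipe-delimited row into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(markdown):
    """Parse a single pipe table into a list of dicts keyed by header cell."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    # lines[1] is the delimiter row (|---|---|), so data starts at index 2.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


invoice_rows = parse_markdown_table(INVOICE_MD)
```

The empty Price cell in the last row stays an empty string, so a missing value remains visible to whatever system reads the table next.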

Markdown AI Use Case


Here we passed the Markdown generated from the invoice to ChatGPT with the prompt "Complete this invoice and display it in a readable format." ChatGPT correctly identified the context of the table and filled in the missing values. Using Markdown with table detection gave ChatGPT better context than plain text and required fewer resources than full JSON, HTML, or XML.
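The resource point can be illustrated with a small Python sketch that serializes the same made-up table as Markdown and as JSON and compares their lengths. Character count is only a rough proxy for model tokens, and the JSON here is a plain list of records rather than the richer structure a full JSON export would carry, so treat the comparison as indicative only.

```python
import json

# Made-up table data for illustration.
sample_rows = [
    {"Item": "Widget", "Qty": "2", "Price": "4.50"},
    {"Item": "Gadget", "Qty": "1", "Price": "9.00"},
]


def to_markdown(rows):
    """Render a list of same-keyed dicts as a Markdown pipe table."""
    header = list(rows[0])
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[key] for key in header) + " |")
    return "\n".join(lines)


markdown_text = to_markdown(sample_rows)
json_text = json.dumps(sample_rows)

# Markdown states the column names once; JSON repeats them in every record.
print(len(markdown_text), len(json_text))
```

The gap grows with the number of rows, since each additional JSON record repeats every key while each additional Markdown row adds only cell values and separators.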

MSI Installer Sub-File Extraction

Our latest update introduces support for sub-file extraction from MSI installer files. This addition enables better handling and analysis of MSI files by extracting embedded files for further processing, enhancing the overall file analysis workflow in security systems.

Hancom Hangul HWPX Text and HD Support (Community-Inspired)

In response to community feedback, we have introduced text extraction and HD rendering support for Hancom Hangul HWPX files. This update makes it easier to process and use content from these files, and it improves the rendering fidelity of complex Korean-language documents while maintaining their original formatting and layout.

Document Filters Resources