
Document Filters 25.2 Release

· 4 min read
Nabih Metri
Document Filters Product Manager

Document Filters 25.2 advances our shift-left strategy by enhancing traceability, data integrity, and extensibility within content pipelines. With this release, Document Filters becomes the first solution to embed positional metadata directly into Markdown for all our supported formats, setting a new benchmark for transparency and explainability in AI and search-driven workflows. We’ve also improved Markdown’s handling of complex tables, enabling seamless extraction of structured data from even the most irregular layouts. In addition, table extraction is now supported for XFA PDFs, a long-standing challenge for automation and compliance initiatives. Finally, a new custom OCR callback interface gives teams the freedom to integrate any OCR engine into their workflow, unlocking multilingual, domain-specific, and image-heavy content for broader automation. Each of these updates contributes to cleaner, more connected data earlier in the process—reducing errors, manual fixes, and integration complexity. Let’s take a closer look at what’s new.

Beyond Magic Numbers: The Complexity of File Type Identification

· 8 min read
Ben Truscott
Document Filters Principal Engineer

In the realm of enterprise software, managing and processing files from diverse sources is a common challenge. Whether you're developing AI-driven solutions, building compliance-focused applications, or ensuring data security, the ability to accurately identify file types is crucial. The files you encounter could be anything—from legacy documents dating back to the 1980s and 1990s to modern formats uploaded from smartphones or cloud services.

When people think about identifying a file type, they often assume that the first few bytes—commonly known as a "magic number"—are enough to determine what kind of file they’re dealing with. While this works for some formats, it’s far from a general rule. Modern file formats frequently use container structures that obscure their actual content. For example, many file types—including Microsoft Office documents (DOCX, XLSX, PPTX) and EPUB ebooks—are essentially ZIP archives with structured data inside. Similarly, older Microsoft formats like DOC and XLS rely on the Compound File Binary (CFB) format, which acts like a mini file system within a file. At a glance, these container formats don’t immediately reveal what kind of document they hold.
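The container problem described above can be illustrated with a short sketch. This is not the Document Filters API, and the function names `sniff` and `make_zip` and the returned labels are hypothetical, chosen only for this example. It uses Python's standard library to show why a magic number alone is not enough: the ZIP and CFB signatures identify the container, and a second step has to look inside to decide what the file actually is.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header signature
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # Compound File Binary signature


def sniff(data: bytes) -> str:
    """Return a coarse, illustrative type label for an in-memory file."""
    if data.startswith(CFB_MAGIC):
        # The signature only proves the container. Telling DOC from XLS
        # requires reading the internal streams, which this sketch skips.
        return "cfb-container"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
                # EPUB declares its type in a "mimetype" entry.
                if "mimetype" in names:
                    if zf.read("mimetype").strip() == b"application/epub+zip":
                        return "epub"
                # Office Open XML packages carry a content-types manifest;
                # the top-level folder hints at the application.
                if "[Content_Types].xml" in names:
                    for prefix, label in (("word/", "docx"),
                                          ("xl/", "xlsx"),
                                          ("ppt/", "pptx")):
                        if any(n.startswith(prefix) for n in names):
                            return label
                    return "ooxml-unknown"
        except zipfile.BadZipFile:
            return "zip-damaged"
        return "zip"
    return "unknown"


def make_zip(entries: dict) -> bytes:
    """Build a small ZIP archive in memory (helper for the demo below)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in entries.items():
            zf.writestr(name, payload)
    return buf.getvalue()


if __name__ == "__main__":
    docx_like = make_zip({"[Content_Types].xml": b"<Types/>",
                          "word/document.xml": b"<w/>"})
    plain_zip = make_zip({"notes.txt": b"hello"})
    # Both files start with the same four bytes, yet they are different types.
    print(docx_like[:4] == plain_zip[:4], sniff(docx_like), sniff(plain_zip))
```

Both test archives share the same leading bytes, so only the inspection of their entries separates them. A production identifier would need far more than this, for example validating the manifest contents and parsing CFB stream names.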

Document Filters 25.1 Release

· 3 min read
Nabih Metri
Document Filters Product Manager

Document Filters 25.1 continues our 'shift-left' strategy by enhancing structured data extraction earlier in the pipeline, reducing the need for downstream corrections and transformations. This release introduces expanded structured output with improved heading detection and list recognition across multiple formats, ensuring cleaner, more reliable data for AI/ML workflows and business applications. PDF processing is also more precise, with better list mapping and automatic text unwrapping for a more natural reading experience. Additionally, file type identification now works even when only part of a file is available, minimizing unnecessary data transfers and improving processing efficiency. By delivering cleaner, more structured data from the start, Document Filters 25.1 helps streamline integration, reduce complexity, and optimize large-scale document workflows. Let’s take a closer look at what’s new.

Document Filters 24.4 Release

· 4 min read
Nabih Metri
Document Filters Product Manager

The latest release of Hyland Document Filters introduces features that streamline document processing and enhance efficiency, supporting the broader 'shift-left' strategy. By empowering users to control data earlier in the workflow, these updates reduce complexity and improve performance across AI/ML applications. New content cleaning options simplify data preparation, making it easier to generate machine-friendly content, while the Simplified JSON output format accelerates data extraction and processing. Additionally, the new text-mode Markdown support lowers resource consumption, allowing for more efficient handling of large documents. With the addition of a Python package, users can also integrate Document Filters seamlessly into development workflows, enhancing overall productivity and workflow efficiency.

Shifting Left with Document Filters – A Vision for the Future

· 6 min read
Nabih Metri
Document Filters Product Manager

As the digital landscape continues to evolve, the need for efficient document processing has never been greater. Applications demand accurate, structured data that's ready for immediate use, reducing the steps required to prepare it. At Hyland, we've embraced this challenge with our Document Filters product. Our strategy? Shift left.

Document Filters 24.3 Release

· 4 min read
Nabih Metri
Document Filters Product Manager

We're excited to announce the latest release of Document Filters, packed with powerful new features designed to enhance your document processing capabilities. This update introduces a JSON Output Type for structured data handling, a Markdown Output Type for streamlined document conversion, advanced PDF Table Extraction for improved data accuracy, and MSI Installer Sub-File Extraction for comprehensive file analysis. Additionally, we've added community-inspired support for Hancom Hangul HWPX text extraction and HD rendering. Read on to discover how these new features can elevate your workflows and drive better results.

Exploring the Document Comparison APIs

· 7 min read
Ben Truscott
Document Filters Principal Engineer

The release of Hyland Document Filters 24.2 marks a significant milestone with the introduction of powerful Document Comparison APIs. These APIs are designed to help developers implement robust document comparison within their applications, making it easier to identify and manage changes across various document types.

Document Filters 24.2 Release

· 3 min read
Nabih Metri
Document Filters Product Manager

The new 24.2 release of Hyland Document Filters introduces a range of community-influenced features designed to streamline document comparison, improve accessibility, and integrate advanced OCR technology. With these updates, Document Filters continues to evolve as a versatile tool that adapts to the diverse needs of its users.

Document Filters for AI Solutions

· 2 min read
Nabih Metri
Document Filters Product Manager

In the dynamic realm of AI development, Hyland's Document Filters is a game-changer, offering developers a versatile toolkit for file identification, content extraction, document transformation, and document conversion across more than 600 file formats. Our presentation underscores its pivotal role in AI solutions, from enhancing data security with robust redaction features to streamlining operations and reducing costs. AI companies that integrate Document Filters into their enterprise offerings demonstrate the toolkit's potential to revolutionize AI applications, ensuring efficiency, security, and compliance in our data-driven future.

Converting Documents with Comments

· 3 min read
Nabih Metri
Document Filters Product Manager

In software development, where every line of code counts, the Hyland Document Filters SDK is a beacon of efficiency for document conversion. The SDK is designed to streamline conversion while ensuring that embedded comments (the lifeblood of project collaboration, packed with critical feedback and key insights) are not just preserved but seamlessly integrated into the converted documents. It’s a solution that resonates with the developer community, offering a way to enhance digital workflows with precision and ease. By incorporating this SDK, developers can tackle document conversion with confidence that the collaborative essence of their documents remains intact, bolstering the robustness of their applications.