Document Filters 26.2 Release

Nabih Metri
Product Manager

The hardest document processing bugs to catch are the silent ones: a file processed without error, output generated without warning, and yet something is wrong. Scanned PDFs returning empty text because annotations tricked the OCR logic. Bounding boxes offset from the wrong origin. Sensitivity labels missing because Microsoft moved where they are stored. Document Filters 26.2 is a release focused on correctness. Each fix closes a real gap encountered in production workflows, and together they make the library more reliable across the full range of content types and use cases it handles.

MSIP Sensitivity Label Detection

Microsoft recently changed where MSIP (Microsoft Information Protection) metadata is stored within the Office document file structure. Documents created or re-saved in newer Office versions write the label to the new location, which meant label extraction was returning empty values for any file produced by a modern Office client.

Document Filters now searches both the original and the new location, so sensitivity labels such as "Confidential", "Internal Only", and "Public" are extracted consistently regardless of which Office version produced the file. This matters for any workflow that uses classification metadata to control access, route documents, enforce compliance policies, or decide what content is eligible for indexing or AI ingestion.
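The release notes do not name the two storage locations, so the following is only a sketch of the "check both places" idea, not Document Filters' implementation. It assumes the older location is the MSIP_Label_* custom properties in docProps/custom.xml and the newer one is a docMetadata/LabelInfo.xml part; both part names, the attribute layout, and the function name read_label_ids are assumptions made for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumed part names; the article does not say where the labels live.
LEGACY_PART = "docProps/custom.xml"
NEWER_PART = "docMetadata/LabelInfo.xml"


def _local(name: str) -> str:
    """Strip an XML namespace from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def read_label_ids(source) -> list[str]:
    """Return label ids found in either assumed location of an OOXML file.

    `source` is a path or a binary file-like object.
    """
    found: list[str] = []
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())

        if NEWER_PART in names:
            root = ET.fromstring(package.read(NEWER_PART))
            for el in root.iter():
                if _local(el.tag) != "label":
                    continue
                for key, value in el.attrib.items():
                    if _local(key) == "id":
                        found.append(value.strip("{}"))

        if LEGACY_PART in names:
            root = ET.fromstring(package.read(LEGACY_PART))
            for el in root.iter():
                if _local(el.tag) != "property":
                    continue
                match = re.fullmatch(
                    r"MSIP_Label_(.+)_Enabled", el.attrib.get("name", "")
                )
                if match:
                    found.append(match.group(1))

    # Keep first-seen order and drop ids reported by both locations.
    return list(dict.fromkeys(found))
```

Reading both parts and merging the results, rather than stopping at the first hit, means a file that carries a label in only one location still yields a value.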

Accurate OCR on Annotated PDFs

A scanned document converted to PDF and stamped with a free-text annotation (such as "Approved", "Reviewed", a date, or a reviewer's name) is a routine artifact of document workflows. Before this release, Document Filters detected that annotation text, concluded the PDF already had a text layer, and skipped OCR entirely. The result was output containing nothing but the annotation string, with no error raised and no indication that the underlying image content was never processed.

The fix corrects the classification logic so that free-text annotations are not treated as a document text layer. Image-only PDFs now run OCR correctly regardless of what annotations are present.
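The decision described above can be sketched with a small, hypothetical page model. The Page class and both functions are illustrative names, not part of the Document Filters API. The point is only that annotation text and page-content text are tracked separately, and that only the latter counts as a text layer.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical model: text drawn by the page content itself versus
    # text carried by annotations such as free-text stamps.
    content_text: str = ""
    annotation_texts: list[str] = field(default_factory=list)
    has_images: bool = False


def needs_ocr(page: Page) -> bool:
    """Run OCR when the page has images but no real text layer.

    Annotation text is deliberately ignored: a stamp such as "Approved"
    says nothing about whether the scanned image has been recognized.
    """
    return page.has_images and not page.content_text.strip()


def needs_ocr_before_fix(page: Page) -> bool:
    """The pre-fix behaviour as described: any text at all suppresses OCR."""
    any_text = bool(page.content_text.strip()) or any(
        t.strip() for t in page.annotation_texts
    )
    return page.has_images and not any_text
```

For a scanned page stamped "Approved", needs_ocr_before_fix returns False (OCR skipped) while needs_ocr returns True.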

Corrected Markdown Bounding Boxes for Textbox Content

When using MARKDOWN_INCLUDE_LOCATIONS, bounding boxes for text inside textboxes were reported relative to the textbox's own origin rather than the page. A textbox positioned at (400, 300) on the page containing text at (10, 5) within the box would report coordinates of (10, 5) instead of the page-relative (410, 305), placing the highlight near the top-left of the page rather than where the text actually appears.

Bounding boxes are now always page-relative, ensuring that any feature depending on text location data (search result highlighting, redaction, citation display, or spatial analysis) receives accurate coordinates for textbox content in presentations, dashboards, and designed layouts.
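The conversion from textbox-relative to page-relative coordinates is a simple translation. The sketch below uses hypothetical names (Box, to_page_coords) and assumes the textbox is neither rotated nor scaled; it is not the library's internal code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


def to_page_coords(local: Box, container_origin: tuple[float, float]) -> Box:
    """Translate a box from its container's coordinate space to the page.

    Only the position shifts; width and height are unchanged because the
    container is assumed to be unrotated and unscaled.
    """
    origin_x, origin_y = container_origin
    return Box(local.x + origin_x, local.y + origin_y, local.width, local.height)
```

With the numbers from the example above, to_page_coords(Box(10, 5, 80, 12), (400, 300)) places the text at (410, 305); the width and height of 80 and 12 are made up for the illustration.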

Format Identification and Conversion Reliability

Several format identification and conversion issues have been resolved:

PNG misidentified as WMF: PNG files were incorrectly classified as Windows Metafile, routing them through a vector rendering path that produces incorrect output for raster images. Now correctly identified as PNG.

EML misidentified as plain text: EML files treated as plain text were serialized as raw RFC 822 content, producing 100+ page PDFs filled with headers and MIME boundaries instead of the actual email message. Now correctly processed as email.

DOCX to Markdown conversion failure: DOCX files with certain table structures failed Markdown conversion with the error "Table children must be table rows". These now convert successfully.

ODT header and footer extraction: ODT files generated by Microsoft Word were not returning headers or footers despite a partial fix in 26.1. This is now fully resolved for ODT files regardless of origin.
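For the PNG and EML items above, content-based identification can be sketched as follows. This is an illustration of signature sniffing in general, not how Document Filters identifies formats. The PNG signature and the placeable-WMF magic number are well-known constants; the email check is a deliberately simple heuristic, and the function names and header list are assumptions.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
WMF_PLACEABLE_MAGIC = b"\xd7\xcd\xc6\x9a"  # 0x9AC6CDD7, little-endian

# Header names that commonly open an RFC 822 style message.
EMAIL_HEADERS = {
    b"from", b"to", b"subject", b"date",
    b"message-id", b"mime-version", b"received",
}


def looks_like_email(data: bytes, min_headers: int = 2) -> bool:
    """Heuristic: the leading header block names several mail headers."""
    head = data[:8192].replace(b"\r\n", b"\n").split(b"\n\n", 1)[0]
    seen = set()
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            continue  # folded continuation of the previous header
        name, sep, _ = line.partition(b":")
        if sep and name.strip().lower() in EMAIL_HEADERS:
            seen.add(name.strip().lower())
    return len(seen) >= min_headers


def sniff_format(data: bytes) -> str:
    """Check binary signatures first, then fall back to text heuristics."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(WMF_PLACEABLE_MAGIC):
        return "wmf"
    if looks_like_email(data):
        return "eml"
    return "unknown"
```

Testing exact binary signatures before looser text heuristics keeps a raster image from being claimed by a weaker match, and checking for mail headers before falling back to plain text keeps an email from being dumped as raw RFC 822 content.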

Apple Keynote Rendering Improvements

Two fidelity issues with Apple Keynote files have been addressed. Long paragraphs that were cut at page boundaries due to missing keepOnPage style handling are now preserved correctly. List content that was shifting position or disappearing from output is also fixed, improving the accuracy of both HD and Markdown output for Keynote presentations.

OCR DPI Scaling Fix

When using Tesseract OCR or a custom OCR engine with the GRAPHIC_DPI option set, OCR text was placed at incorrect coordinates in rendered output due to a scaling error applied to position data. The scaling is now correct, ensuring OCR text aligns with visual content at all DPI settings.
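As a rough illustration of the kind of scaling involved, the sketch below converts a box reported in pixels at a given render DPI into points (72 per inch). The article does not state which units Document Filters uses internally, so the units, the function name, and the (x, y, width, height) layout are all assumptions.

```python
POINTS_PER_INCH = 72.0


def pixels_to_points(box_px: tuple[float, float, float, float], dpi: float):
    """Scale an (x, y, width, height) box from pixels at `dpi` to points.

    Position and size use the same factor; scaling only one of them, or
    using the wrong DPI, shifts the text away from the image beneath it.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return tuple(value * POINTS_PER_INCH / dpi for value in box_px)
```

At 144 DPI each pixel is half a point, so a word found at pixel (288, 144) belongs at (144, 72) in points; at 72 DPI the coordinates pass through unchanged.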

Security Updates

This release includes patches for CVEs in bundled third-party libraries, keeping Document Filters current with the latest security standards.
