Getting Started with Document Filters

Document Filters is a comprehensive SDK designed for software developers looking to integrate robust file identification, data extraction, document transformation, and format conversion features into their applications.

This SDK is available as Dynamic Link Libraries (DLLs) for Windows and Shared Objects (SOs) for UNIX-based systems. With Document Filters, developers can:

Identify a wide range of file types.
Extract text and metadata from hundreds of document formats.
Retrieve sub-documents and attachments from various formats, including MS Office documents, ZIP files, RARs, 7-Zips, ISOs, CABs, PSTs, and OSTs.
Convert popular document formats into High-Definition outputs that preserve styles, layouts, and images, with support for formats such as TIFF, HTML, PDF, and Structured XML.
Utilize Canvas and Drawing functions for document markup, permanent annotations, and redaction.

For sample code and header definitions in C++, C#, HTML, Java, and Python, see Hyland Document Filters on GitHub.

In This Section

Getting Started with .NET	Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your .NET applications. Follow these instructions to set up your environment.
Getting Started with C	Hyland Document Filters allows you to integrate powerful document processing capabilities into your C applications. Below are instructions on how to set up your application depending on your build system, such as CMake, Visual Studio, or Make.
Getting Started with C++	Hyland Document Filters allows you to integrate powerful document processing capabilities into your C++ applications. Below are instructions on how to set up your application depending on your build system, such as CMake, Visual Studio, or Make.
Getting Started with COM	Hyland Document Filters offers powerful document processing capabilities that can be seamlessly integrated into your Windows applications using COM. However, using COM involves extra setup steps, so first-level bindings (e.g., .NET, Java, or C++) should be chosen when available for easier integration.
Getting Started with Java	Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Java applications. Follow these instructions to set up your environment.
Getting Started with Python	Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Python applications. Follow these instructions to set up your environment.
About Accessibility Info Extraction	Document Filters supports the extraction of accessibility information from MS Office documents, aiding in the development of accessible products. Alt text is extracted for images, drawings, objects, word art, smart art, charts, icons, shapes, and text boxes across Word (DOC and DOCX), PowerPoint (PPT and PPTX), Excel (XLS and XLSX), and Visio (VSDX) files. The extracted alt text is surfaced in various output formats -- HTML5, Classic HTML, XML, and PDF -- when the appropriate options for creating accessible output are provided.
About Conversion Profiles	Hyland Document Filters allows you to customize conversion settings through a configuration file named `ISYS11df.ini`. Place this file in the same folder as the executable. If present, its settings are applied during processing unless the application explicitly overrides them in the `Open` or `Canvas` methods.
About Custom Streams and Extended Streams	Hyland Document Filters supports customizable streams, allowing you to read from storage systems not natively recognized by Document Filters. For example, you may want to read files directly from a database or an FTP site. Additionally, you can utilize a specialized type of custom stream known as an Extended Stream. Extended Streams help Document Filters retrieve additional information about your stream when needed, such as handling requests for specific parts of a multi-part archive.
About File Type Identification	File type identification is the process of determining a file's format based on its content rather than its extension. This is crucial for accurate processing, security scanning, and compatibility with different software applications. Hyland Document Filters identifies file types by analyzing a file’s byte stream, starting with the first 2048 bytes and reading additional data if necessary. It applies a combination of signature-based detection, heuristics, and container recognition to determine the format.
About Fonts	To effectively render a document into HD format, Document Filters must have access to the appropriate fonts to accurately represent each character on the page. Fonts not only provide the visual representation of characters but also contain size and measurement information necessary for proper content positioning.
About Multi-Part Archives	Multi-part archives, such as certain ZIP and RAR files, consist of two or more files packaged together. Document Filters supports the processing of these multi-part archives, allowing users to easily manage complex file structures.
About Multithreading	Document Filters may be run in a multithreaded application with minimal effort. Following a basic set of rules is essential to avoid issues such as data corruption or race conditions.
About Optical Character Recognition (OCR)	Document Filters provides comprehensive support for Optical Character Recognition (OCR), enabling users to extract text from a variety of document formats. It includes built-in support for the Tesseract OCR engine, a widely used open-source OCR solution. Additionally, Document Filters allows for the use of other versions of Tesseract, giving you the flexibility to choose the version that best fits your needs.
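
The signature-based step mentioned under About File Type Identification can be illustrated with a small, generic sketch. This is not the Document Filters API: the names `SIGNATURES`, `identify_bytes`, and `identify_file` are hypothetical, the signature table is a tiny sample of well-known magic numbers, and the heuristics and container recognition that the SDK applies are left out. Only the 2048-byte initial read is taken from the description above.

```python
from typing import Optional

# Hypothetical, illustrative signature table: (byte offset, magic bytes, label).
# A real identifier would carry far more entries plus heuristics.
SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"PK\x03\x04", "zip-container"),  # also the outer format of DOCX, XLSX, PPTX
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-compound"),  # legacy DOC, XLS, PPT
    (0, b"Rar!\x1a\x07", "rar"),
    (0, b"7z\xbc\xaf\x27\x1c", "7z"),
]

# Size of the initial read, matching the 2048 bytes mentioned in the text.
HEADER_SIZE = 2048


def identify_bytes(header: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None if none match."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Identify a file from its leading bytes, ignoring its extension."""
    with open(path, "rb") as fh:
        return identify_bytes(fh.read(HEADER_SIZE))
```

A label such as `zip-container` only narrows the candidates. Telling a DOCX from a plain ZIP requires looking inside the container, which is where container recognition would take over from signature matching.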

Content Enrichment

Overview

The Content Enrichment section introduces key features that enhance the processing of documents by identifying and extracting valuable structural elements such as tables, headers, and footers. These capabilities allow users to transform raw document data into a more organized and meaningful format, facilitating improved data extraction and analysis. By leveraging these features, you can enrich your documents with critical insights and streamline workflows, making it easier to work with complex content.

Table Detection

Table Detection is a feature of Hyland Document Filters that enables automatic extraction and processing of tables from various document types. This functionality is particularly valuable when working with file formats that inherently store structured tabular data, such as Microsoft Office documents (Word, Excel, PowerPoint) and other similar productivity applications.

Output Formats

Overview	Hyland Document Filters offers a variety of output modes, including text, images, PDF, and XML. This section covers some of these options in more detail.
JSON Output	Hyland Document Filters provides robust options for generating JSON output from various document formats. By configuring the output settings, you can customize the generated JSON to suit your specific needs, from data structures to metadata inclusion.
Markdown Output	Hyland Document Filters provides powerful options for generating Markdown output from various document formats. By configuring the output settings, you can customize the generated Markdown to suit your specific needs, from table styles to metadata inclusion.
PDF Output	Hyland Document Filters offers robust features for generating PDF output from a variety of document formats. By adjusting the output settings, you can tailor the resulting PDF to meet your specific requirements, including layout preferences, security features, and metadata inclusion.