Getting started with Document Filters

Document Filters is a comprehensive SDK designed for software developers looking to integrate robust file identification, data extraction, document transformation, and format conversion features into their applications.

This SDK is available as Dynamic Link Libraries (DLLs) for Windows and Shared Objects (SOs) for UNIX-based systems. With Document Filters, developers can:

  • Identify a wide range of file types.
  • Extract text and metadata from hundreds of document formats.
  • Retrieve sub-documents and attachments from various formats, including MS Office documents, ZIP files, RARs, 7-Zips, ISOs, CABs, PSTs, and OSTs.
  • Convert popular document formats into High-Definition outputs that preserve styles, layouts, and images, with support for formats such as TIFF, HTML, PDF, and Structured XML.
  • Utilize Canvas and Drawing functions for document markup, permanent annotations, and redaction.

For sample code and header definitions in C++, C#, HTML, Java, and Python, see the Hyland Document Filters repository on GitHub.

In This Section

Getting Started with .NET

Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your .NET applications. Follow these instructions to set up your environment.

Getting Started with C

Hyland Document Filters allows you to integrate powerful document processing capabilities into your C applications. Below are instructions on how to set up your application depending on your build system, such as CMake, Visual Studio, or Make.

Getting Started with C++

Hyland Document Filters allows you to integrate powerful document processing capabilities into your C++ applications. Below are instructions on how to set up your application depending on your build system, such as CMake, Visual Studio, or Make.

Getting Started with COM

Hyland Document Filters offers powerful document processing capabilities that can be seamlessly integrated into your Windows applications using COM. However, using COM involves extra setup steps, so first-level bindings (e.g., .NET, Java, or C++) should be chosen when available for easier integration.

Getting Started with Java

Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Java applications. Follow these instructions to set up your environment.

Getting Started with Python

Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Python applications. Follow these instructions to set up your environment.

About Accessibility Info Extraction

Document Filters supports the extraction of accessibility information from MS Office documents, aiding in the development of accessible products. Alt text is extracted for images, drawings, objects, word art, smart art, charts, icons, shapes, and text boxes in Word (DOC and DOCX), PowerPoint (PPT and PPTX), Excel (XLS and XLSX), and Visio (VSDX) files. The extracted alt text is surfaced in several output formats (HTML5, Classic HTML, XML, and PDF) when the appropriate options for creating accessible output are provided.

About Conversion Profiles

Hyland Document Filters allows you to customize conversion settings through a configuration file named ISYS11df.ini. Place this file in the same folder as the executable. If the file is present, its settings are applied during processing unless the application explicitly overrides them in the Open or Canvas methods.
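The precedence described above (file settings apply unless the application overrides them) can be sketched in Python. This is only an illustration of the merge order, not the SDK's own loader: the section name `ExampleSection` and the keys `EXAMPLE_OPTION_A` and `EXAMPLE_OPTION_B` are hypothetical placeholders, and the real contents of ISYS11df.ini are not shown here.

```python
import configparser


def load_profile(ini_text, section):
    """Parse INI text and return one section as a plain dict (empty if absent)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case as written in the file
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return {}
    return dict(parser[section])


def effective_options(profile, overrides):
    """Start from the file's settings, then let application overrides win."""
    merged = dict(profile)
    merged.update(overrides)
    return merged


# Hypothetical file contents; section and key names are placeholders.
ini_text = """
[ExampleSection]
EXAMPLE_OPTION_A = from-file
EXAMPLE_OPTION_B = from-file
"""

profile = load_profile(ini_text, "ExampleSection")
options = effective_options(profile, {"EXAMPLE_OPTION_B": "from-app"})
```

Here `options` keeps the file's value for the first key and takes the application's value for the second, which mirrors the override rule stated above.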

About Custom Streams and Extended Streams

Hyland Document Filters supports customizable streams, allowing you to read from storage systems not natively recognized by Document Filters. For example, you may want to read files directly from a database or an FTP site. Additionally, you can utilize a specialized type of custom stream known as an Extended Stream. Extended Streams help Document Filters retrieve additional information about your stream when needed, such as handling requests for specific parts of a multi-part archive.
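To make the idea of a custom stream concrete, the sketch below wraps a document stored as a BLOB in a SQLite database in a standard Python read-only, seekable stream. It assumes a hypothetical `documents` table with `id` and `body` columns, and it loads the whole BLOB into memory for simplicity. How such an object is handed to Document Filters depends on the language binding and is not shown; Extended Stream behavior is also out of scope here.

```python
import io
import sqlite3


class BlobStream(io.RawIOBase):
    """Read-only, seekable stream over a BLOB stored in SQLite.

    The table and column names are hypothetical placeholders.
    """

    def __init__(self, conn, doc_id):
        super().__init__()
        row = conn.execute(
            "SELECT body FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        self._data = bytes(row[0])
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, buffer):
        # Copy as many bytes as fit into the caller's buffer; 0 signals EOF.
        chunk = self._data[self._pos:self._pos + len(buffer)]
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = len(self._data) + offset
        else:
            raise ValueError("invalid whence value")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def tell(self):
        return self._pos
```

A consumer that only needs `read`, `seek`, and `tell` can treat this object like an ordinary file, which is the essential contract a custom stream has to satisfy.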

About File Type Identification

File type identification is the process of determining a file's format based on its content rather than its extension. This is crucial for accurate processing, security scanning, and compatibility with different software applications. Hyland Document Filters identifies file types by analyzing a file’s byte stream, starting with the first 2048 bytes and reading additional data if necessary. It applies a combination of signature-based detection, heuristics, and container recognition to determine the format.
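The combination of signature-based detection and container recognition can be illustrated with a small Python sketch. This is not the Document Filters implementation or its API: the signature table is a tiny hand-picked sample of well-known magic numbers, the function names are made up for this example, and the heuristics step is omitted.

```python
import io
import zipfile
from typing import Optional

HEADER_SIZE = 2048  # size of the initial read mentioned above

# A few well-known magic numbers; illustrative and far from exhaustive.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip-container"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),
]


def refine_zip(data: bytes) -> str:
    """Container recognition: look inside a ZIP for Office Open XML parts."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "zip"
    if "word/document.xml" in names:
        return "docx"
    if "xl/workbook.xml" in names:
        return "xlsx"
    if "ppt/presentation.xml" in names:
        return "pptx"
    return "zip"


def identify(data: bytes) -> Optional[str]:
    """Match the leading bytes against known signatures, ignoring any extension."""
    header = data[:HEADER_SIZE]
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            if name == "zip-container":
                # The header alone is ambiguous, so read further into the file.
                return refine_zip(data)
            return name
    return None
```

The ZIP branch shows why reading past the initial header is sometimes necessary: a DOCX and a plain ZIP archive share the same leading bytes and differ only in what the container holds.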

About Fonts

To effectively render a document into HD format, Document Filters must have access to the appropriate fonts to accurately represent each character on the page. Fonts not only provide the visual representation of characters but also contain size and measurement information necessary for proper content positioning.

About Multi-Part Archives

Multi-part archives, such as certain ZIP and RAR files, consist of two or more files that together form a single archive. Document Filters supports processing these multi-part archives, so applications can work with content that is split across several files.

About Multithreading

Document Filters may be run in a multithreaded application with minimal effort. Following a basic set of rules is essential to avoid issues such as data corruption or race conditions.

About Optical Character Recognition (OCR)

Document Filters provides comprehensive support for Optical Character Recognition (OCR), enabling users to extract text from a variety of document formats. It includes built-in support for the Tesseract OCR engine, a widely used open-source OCR solution. Additionally, Document Filters allows for the use of other versions of Tesseract, giving you the flexibility to choose the version that best fits your needs.

Content Enrichment

Overview

The Content Enrichment section introduces key features that enhance the processing of documents by identifying and extracting valuable structural elements such as tables, headers, and footers. These capabilities allow users to transform raw document data into a more organized and meaningful format, facilitating improved data extraction and analysis. By leveraging these features, you can enrich your documents with critical insights and streamline workflows, making it easier to work with complex content.

Table Detection

Table Detection is a feature of Hyland Document Filters that enables automatic extraction and processing of tables from various document types. This functionality is particularly valuable when working with file formats that inherently store structured tabular data, such as Microsoft Office documents (Word, Excel, PowerPoint) and files from similar productivity applications.

Output Formats

Overview

Hyland Document Filters offers a variety of output modes, including text, images, PDF, and XML. This section delves into some of these options in more detail.

JSON Output

Hyland Document Filters provides robust options for generating JSON output from various document formats. By configuring the output settings, you can customize the generated JSON to suit your specific needs, from data structures to metadata inclusion.

Markdown Output

Hyland Document Filters provides powerful options for generating Markdown output from various document formats. By configuring the output settings, you can customize the generated Markdown to suit your specific needs, from table styles to metadata inclusion.

PDF Output

Hyland Document Filters offers robust features for generating PDF output from a variety of document formats. By adjusting the output settings, you can tailor the resulting PDF to meet your specific requirements, including layout preferences, security features, and metadata inclusion.