About Optical Character Recognition (OCR)

Document Filters provides comprehensive support for Optical Character Recognition (OCR), enabling users to extract text from a variety of document formats. It includes built-in support for Tesseract, a widely used open-source OCR engine, and can also be configured to use other versions of Tesseract so that you can choose the version that best fits your needs.

Moreover, Document Filters is designed to integrate with external OCR engines, providing even greater versatility in processing documents. This means you can configure Document Filters to work with any OCR solution that meets your requirements, whether it's a different version of Tesseract or a completely separate OCR engine.

Enable OCR by passing the OCR=ON option to the IGR_Open_File_Ex function or the Extractor::Open method. The option is available for both text-mode and high-definition output.

The OCR=ON option has no effect on formats that are not supported.

Supported Graphic Types

BMP, BRK, CALS, CGM, CUR, DCX, EMF, EPS, GEM, GIF, ICO, IFF, IMNET, JEDICS, JPG, JPK, JXR, MACPAINT, MDI, MSPAINT, NCR, PBM, PCX, PICT, PNG, PSP, SCANNED PDFS, SVM, TGA, TIFF, WBMP, WEBP, WMF, WPG, XBM, XPM, XWD.

When processing a PDF file, only pages that do not contain a text layer will be processed by the OCR engine.

For optimal results, ensure that the input images have sufficient quality, with a recommended resolution of at least 300 dpi. Document Filters includes features to automatically detect images that may be too low in resolution for effective OCR, helping you avoid potential issues during processing.
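
The 300 dpi recommendation can be checked with simple arithmetic when the physical size of the scanned page is known. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and it is not the detection logic that Document Filters uses internally.

```python
RECOMMENDED_DPI = 300  # recommendation stated above


def effective_dpi(pixel_width, physical_width_inches):
    """Dots per inch implied by an image's pixel width and the page's physical width."""
    return pixel_width / physical_width_inches


def meets_recommendation(pixel_width, physical_width_inches):
    """True when the implied resolution is at least the recommended 300 dpi."""
    return effective_dpi(pixel_width, physical_width_inches) >= RECOMMENDED_DPI


# A US Letter page is 8.5 inches wide.
print(effective_dpi(2550, 8.5), meets_recommendation(2550, 8.5))
print(effective_dpi(1700, 8.5), meets_recommendation(1700, 8.5))
```

In this example a 2,550-pixel-wide scan of a Letter page works out to 300 dpi, while a 1,700-pixel-wide scan of the same page works out to 200 dpi and falls below the recommendation.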

To enhance the effectiveness of the OCR process, Document Filters skips any image with a width of less than 1,000 pixels. You can adjust this threshold by specifying OCR_MIN_WIDTH in the document open flags.
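
The option strings shown on this page follow a NAME=VALUE;NAME=VALUE shape. As a minimal sketch, a small helper can assemble such a string so that a threshold like OCR_MIN_WIDTH is set in one place. The helper name build_options is hypothetical and is not part of Document Filters; it only produces the string you would pass when opening a document.

```python
def build_options(options):
    """Join option names and values into a NAME=VALUE;NAME=VALUE string.

    Booleans are written as ON/OFF, matching the OCR=ON style used on this page.
    """
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        parts.append(f"{name.upper()}={value}")
    return ";".join(parts)


# Lower the width threshold from the default 1,000 pixels to 800.
print(build_options({"OCR": True, "OCR_MIN_WIDTH": 800}))
```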

Using the built-in version of Tesseract

Document Filters integrates with the Tesseract OCR engine 3.02.02 to provide OCR as an optional processing step to extract text from document image formats.

Prerequisites

To utilize the built-in version of Tesseract, ensure that the base set of training data is available. This data is provided in the ISYSreadersocr.dat file, included in the assets.zip archive available on GitHub with each release. The ISYSreadersocr.dat file contains the English and OSD (Orientation and Script Detection) training sets.

The ISYSreadersocr.dat file must be present in the directory specified in the call to either Initialize or Init_Instance. The simplest approach is to place the file in the same directory as the ISYS11df.[dll|so|dylib] file and pass . as the directory argument in the initialization call.

Language Support

To perform OCR on documents in other languages, you can manually install the necessary dictionaries from the Tesseract OCR project site. Ensure that you download the language files that match the version of Tesseract in use (3.02.02). To install a dictionary file, place the .traineddata file in your application directory.

To specify the language for OCR processing, set the OCR_LANGUAGE=xyz option, where xyz is the three-letter language code. For example:

  • English = eng
  • French = fra
  • German = deu
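
Before requesting an additional language, it can help to confirm that its dictionary file is actually present. The sketch below assumes the Tesseract naming convention of one file per language, named after the language code with a .traineddata extension, and that the files live in the application directory as described above. The function names are hypothetical. English and OSD ship inside ISYSreadersocr.dat, so this check is only meaningful for other languages.

```python
from pathlib import Path


def missing_traineddata(app_dir, languages):
    """Return the language codes whose <code>.traineddata file is not in app_dir."""
    root = Path(app_dir)
    return [code for code in languages
            if not (root / f"{code}.traineddata").is_file()]


def language_option(code):
    """Build an option string that enables OCR and selects one language."""
    return f"OCR=ON;OCR_LANGUAGE={code}"


missing = missing_traineddata(".", ["fra", "deu"])
if missing:
    print("Missing dictionaries:", ", ".join(missing))
print(language_option("fra"))
```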

Why Tesseract 3.02.02

The embedded version of Tesseract, 3.02.02, is selected for compatibility across all platforms supported by Document Filters. This specific release ensures optimal support and consistent output. However, you can also configure Document Filters to utilize a system-installed version of Tesseract if preferred.

Disclaimer

Document Filters employs an unaltered version of the Tesseract OCR engine source code and does not guarantee the accuracy of the recognized text or the performance of the engine. While this integration aims to streamline your experience, Tesseract OCR remains a third-party engine and is not a native component of Document Filters. Users are encouraged to integrate any OCR engine that meets their specific needs.

Example Usage in C#

With the ISYSreadersocr.dat in place, you can enable OCR in your code by passing the OCR=ON option when opening a document.

using Hyland.DocumentFilters;

var api = new Hyland.DocumentFilters.Api();
api.Initialize("YOUR_LICENSE_KEY_HERE", ".");

using var doc = api.OpenExtractor("image.tiff", OpenMode.Paginated, options: "OCR=ON");

while (!doc.EndOfStream)
{
    var text = doc.GetText(4096);
    Console.Out.WriteLine(text);
}

Using a different version of Tesseract

As mentioned earlier, the version of Tesseract bundled with Document Filters is 3.02.02. You can instead set up Document Filters to use a version of Tesseract installed on the host system. When using the host system's Tesseract, ensure that it matches the architecture of your application or environment (e.g., 32-bit or 64-bit) and that it is supplied as a shared library or DLL.

The installation process for Tesseract depends on your operating system. Refer to the Tesseract Installation Documentation for details.

To set up Document Filters to support a different version of Tesseract, you must configure it with the details of the alternate OCR engine. This can be done through either the INI file or environment variables.

Before you start, you will need to know the name and path of the shared library. On Linux, you can find it by running find /usr -iname "*tesseract*.so". For example, on Ubuntu 20.04 the command reports /usr/lib/x86_64-linux-gnu/libtesseract.so.
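
As a cross-platform alternative to searching the file system by hand, the Python standard library can ask the platform's own library lookup for Tesseract and report the bitness of the running process. This is a rough sketch: the bitness shown is that of the Python interpreter executing the script, which only matches your application if both run under the same architecture, and find_library may return None even when a library exists outside the standard search locations.

```python
import ctypes.util
import struct


def process_bitness():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def locate_tesseract():
    """Ask the platform's library search for Tesseract; None when nothing is found."""
    return ctypes.util.find_library("tesseract")


print(f"{process_bitness()}-bit process, Tesseract library: {locate_tesseract()}")
```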

You can either replace the default OCR engine with the system version, or add a new "named" OCR engine, which can then be specified with the OCR_ENGINE={name} setting. The approach to registering the engine is identical, and is keyed off of the name. Naming your engine TESSERACT will replace the default engine.

Disclaimer

Document Filters includes a convenient pass-through integration with the open-source Tesseract OCR engine as an ancillary service. While we provide this integration to streamline your experience, Tesseract OCR remains an unaltered, third-party engine and is not a native component of Document Filters. Document Filters makes no claims or warranties regarding the accuracy of the recognized text or the performance of Tesseract OCR, or any other OCR engine. Document Filters users are welcome to integrate any OCR engine of their choice to suit their specific needs.

Registering via INI file

Registering an OCR engine via the ISYS11df.ini file is the simplest way to permanently set up the engine for the host system, and it's easy to manage across environments. Multiple OCR engines can be registered, each in its own unique section prefixed with OCR: (e.g., [OCR:tesseract5]).

ISYS11df.ini
[OCR:tesseract5]
enabled=true
tesseract_lib=libtesseract.so.5
tesseract_dll=tesseract53.dll
tesseract_properties=textord_min_blobs_in_row=4 textord_spline_minblobs=8

Each value within the section is described below:

| Name | Description |
| --- | --- |
| enabled | Enables or disables the OCR engine. When set to true, the engine can be used by the application. |
| tesseract_lib | Specifies the name of the Tesseract shared library for Linux/macOS. This can be a fully qualified path or relative, relying on standard shared library discovery rules. |
| tesseract_dll | Specifies the name of the Tesseract DLL for Windows. This can be a fully qualified path or relative, relying on standard DLL discovery rules. |
| tesseract_properties | A space-separated list of options to pass to Tesseract during initialization. These options can control Tesseract behavior, such as text layout analysis. For example, textord_min_blobs_in_row and textord_spline_minblobs adjust text line detection parameters. |
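
To sanity-check a section before deploying it, the values can be read back with a generic INI parser. The sketch below uses Python's configparser on the example section above and splits tesseract_properties into individual settings. It illustrates the layout of the file only; it is not how Document Filters itself reads ISYS11df.ini, and read_engine is a hypothetical helper name.

```python
import configparser

INI_TEXT = """\
[OCR:tesseract5]
enabled=true
tesseract_lib=libtesseract.so.5
tesseract_dll=tesseract53.dll
tesseract_properties=textord_min_blobs_in_row=4 textord_spline_minblobs=8
"""


def read_engine(ini_text, name):
    """Return the settings of the [OCR:<name>] section as a plain dictionary."""
    # Only "=" separates keys from values, so colons in values are left alone.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(ini_text)
    section = parser[f"OCR:{name}"]
    # tesseract_properties is a space-separated list of key=value pairs.
    raw_properties = section.get("tesseract_properties", "")
    properties = dict(item.split("=", 1) for item in raw_properties.split())
    return {
        "enabled": section.getboolean("enabled", fallback=False),
        "tesseract_lib": section.get("tesseract_lib"),
        "tesseract_dll": section.get("tesseract_dll"),
        "properties": properties,
    }


print(read_engine(INI_TEXT, "tesseract5"))
```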

Registering via Environment Variables

Registering an OCR engine using environment variables is particularly convenient in containerized deployments, where modifying configuration files is not practical, and for testing different configurations. It lets you switch between OCR engines and settings quickly, without editing configuration files.

The environment variable names must follow the pattern DOCFILTERS_OCR_{NAME}__{PROPERTY}, where {NAME} is the OCR engine's name, and {PROPERTY} corresponds to a setting from the INI file. Both {NAME} and {PROPERTY} must be uppercase, and a double underscore (__) separates them.

| Name | Description |
| --- | --- |
| DOCFILTERS_OCR_{NAME}__ENABLED | Enables or disables the OCR engine. When set to true, the engine can be used by the application. |
| DOCFILTERS_OCR_{NAME}__TESSERACT_LIB | Specifies the name of the Tesseract shared library for Linux/macOS. This can be a fully qualified path or relative, relying on standard shared library discovery rules. |
| DOCFILTERS_OCR_{NAME}__TESSERACT_DLL | Specifies the name of the Tesseract DLL for Windows. This can be a fully qualified path or relative, relying on standard DLL discovery rules. |
| DOCFILTERS_OCR_{NAME}__TESSERACT_PROPERTIES | A space-separated list of options to pass to Tesseract during initialization. These options can control Tesseract behavior, such as text layout analysis. For example, textord_min_blobs_in_row and textord_spline_minblobs adjust text line detection parameters. |
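
The naming pattern is mechanical, so variable names can be derived from an engine name and the INI property names rather than typed by hand. The following sketch shows one way to do that; env_var_name and env_exports are hypothetical helpers, and the export lines they print are ordinary shell syntax.

```python
def env_var_name(engine, prop):
    """Apply the DOCFILTERS_OCR_{NAME}__{PROPERTY} pattern, uppercasing both parts."""
    return f"DOCFILTERS_OCR_{engine.upper()}__{prop.upper()}"


def env_exports(engine, settings):
    """Render one shell export line per setting."""
    return [f'export {env_var_name(engine, key)}="{value}"'
            for key, value in settings.items()]


for line in env_exports("tesseract5", {
    "enabled": "true",
    "tesseract_lib": "libtesseract.so.5",
}):
    print(line)
```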

Precedence of Environment Variables

Environment variables override any values stored in the INI file. If both an environment variable and a setting in the INI file define the same property, the environment variable takes precedence and will be applied at runtime.
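
The precedence rule amounts to a dictionary merge in which environment values win. The sketch below models that rule for a single engine so you can predict which value will apply; it is an illustration of the rule as stated here, not the product's implementation, and effective_settings is a hypothetical name.

```python
def effective_settings(engine, ini_section, environ):
    """Merge INI values with matching environment variables; the environment wins."""
    prefix = f"DOCFILTERS_OCR_{engine.upper()}__"
    merged = dict(ini_section)
    for key, value in environ.items():
        if key.startswith(prefix):
            # Strip the prefix and map the property back to its INI spelling.
            merged[key[len(prefix):].lower()] = value
    return merged


ini_section = {"enabled": "true", "tesseract_lib": "libtesseract.so.4"}
environ = {
    "DOCFILTERS_OCR_TESSERACT5__TESSERACT_LIB": "libtesseract.so.5",
    "PATH": "/usr/bin",
}
print(effective_settings("tesseract5", ini_section, environ))
```

Here the INI file names libtesseract.so.4, but the environment variable replaces it with libtesseract.so.5, while the enabled value from the INI file is kept because nothing overrides it.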

Example Configuration

To register a new OCR engine called tesseract5 using environment variables, set the following variables. The engine name is written in uppercase, as the naming pattern requires:

export DOCFILTERS_OCR_TESSERACT5__ENABLED=true
export DOCFILTERS_OCR_TESSERACT5__TESSERACT_LIB="libtesseract.so.5"
export DOCFILTERS_OCR_TESSERACT5__TESSERACT_DLL="tesseract53.dll"
export DOCFILTERS_OCR_TESSERACT5__TESSERACT_PROPERTIES="textord_min_blobs_in_row=4 textord_spline_minblobs=8"

Using Environment Variables in Containers

When working with containers, you can inject these environment variables during container startup. For example, in Docker:

docker run -e DOCFILTERS_OCR_TESS5__ENABLED=true \
           -e DOCFILTERS_OCR_TESS5__TESSERACT_LIB=/usr/local/lib/libtesseract.so.5 \
           your-container-image

Or in Kubernetes, you can define these in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ocr-config
data:
  DOCFILTERS_OCR_TESS5__ENABLED: "true"
  DOCFILTERS_OCR_TESS5__TESSERACT_LIB: "/usr/local/lib/libtesseract.so.5"

Example Usage in C#

After configuring the OCR engine, you can enable OCR in your code by passing the OCR=ON option when opening a document.

using Hyland.DocumentFilters;

var api = new Hyland.DocumentFilters.Api();
api.Initialize("YOUR_LICENSE_KEY_HERE", ".");

using var doc = api.OpenExtractor("image.tiff", OpenMode.Paginated, options: "OCR=ON;OCR_ENGINE=tesseract5");

while (!doc.EndOfStream)
{
    var text = doc.GetText(4096);
    Console.Out.WriteLine(text);
}

Using a Different OCR Engine via Callback

Document Filters can integrate with alternative OCR engines by invoking a callback into your application. This allows you to tightly integrate an OCR engine written in the same language as your application.

To provide a custom OCR engine, implement a callback for the OcrImage hook, or handle the IGR_Open_Ex callback for the IGR_OPEN_CALLBACK_ACTION_OCR_IMAGE action. This mechanism is available in C, C++17, Java, .NET, and Python.

Next, enable OCR support by passing OCR=on when opening the file, and set OCR_ENGINE=CUSTOM to trigger your callback.

Your application will then be called whenever an image needs to be OCRed.

At that point, your application is responsible for calling AddText for each word, along with its bounding box and any optional style information. If your OCR engine supports layout detection, you can also call StartBlock and EndBlock to mark the start and end of paragraphs, columns, or other layout regions. If it does not, Document Filters’ built-in layout detection will be used instead.
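
The order of calls described above can be sketched without reference to any particular binding. In the example below, RecordingSink is a stand-in defined purely for illustration: it is not the object Document Filters passes to your callback, and the real AddText, StartBlock, and EndBlock signatures are not shown on this page and may differ. The sketch only shows the sequence: open a block, report each word with its bounding box, close the block.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its bounding box in image pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


class RecordingSink:
    """Hypothetical stand-in that records calls instead of forwarding them."""

    def __init__(self):
        self.calls = []

    def start_block(self):
        self.calls.append(("StartBlock",))

    def add_text(self, word):
        box = (word.x, word.y, word.width, word.height)
        self.calls.append(("AddText", word.text, box))

    def end_block(self):
        self.calls.append(("EndBlock",))


def emit_blocks(blocks, sink):
    """Report each layout block, and each word inside it, in reading order."""
    for words in blocks:
        sink.start_block()
        for word in words:
            sink.add_text(word)
        sink.end_block()


sink = RecordingSink()
emit_blocks([[Word("Hello", 36, 92, 60, 24), Word("world", 110, 92, 90, 24)]], sink)
for call in sink.calls:
    print(call)
```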

Using a different OCR engine via External Process

Document Filters can integrate with alternative OCR engines by invoking an external process. To ensure proper interaction with the OCR engine, the engine must meet the following requirements:

  • Read PNG files from disk.
  • Write HOCR output to disk.

Configuration for External OCR Engines

An external OCR engine is registered with the same configuration format as a Tesseract engine, with additional options specific to the external process:

| Name | Description |
| --- | --- |
| exe | The executable name for the OCR engine on Windows. |
| proc | The executable name for the OCR engine on all other platforms. |
| args | Command-line arguments to pass to the external process. |
| output_ext | The optional file extension appended to the output filename. |

Command-Line Argument Variables

The args value can utilize specific variables that will be expanded at runtime:

| Variable | Description |
| --- | --- |
| ${inputFile} | The filename of the temporary file that contains the image to be processed. |
| ${outputFile} | The filename of a temporary file where the output is to be written. |
| ${language} | The OCR language as passed from the user. |
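
To preview what a given args value will look like once the variables are filled in, the expansion can be reproduced with Python's string.Template, which happens to use the same ${name} placeholder syntax. This is a sketch for checking your own configuration: how Document Filters itself expands and tokenizes the string is not described here, the file paths are made up, and shlex.split follows POSIX quoting rules, so Windows paths containing backslashes would need different handling.

```python
import shlex
from string import Template


def expand_args(args, input_file, output_file, language):
    """Fill in the ${...} variables and split the result into an argument list."""
    expanded = Template(args).substitute(
        inputFile=input_file,
        outputFile=output_file,
        language=language,
    )
    return shlex.split(expanded)


ARGS = '-l "${language}" "${inputFile}" "${outputFile}"'
print(expand_args(ARGS, "/tmp/page 1.png", "/tmp/page-1-out", "eng"))
```

Quoting each variable in args, as in this example, keeps a path that contains spaces together as a single argument.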

Example Configuration

To configure an OCR engine that utilizes the Tesseract command-line interface instead of a shared library, you might set up your configuration like this:

ISYS11df.ini
[OCR:tesseract_cli]
enabled=true
exe=tesseract.exe
proc=/usr/bin/tesseract
args=-c "tessedit_create_hocr=1" -c "hocr_font_info=1" -l "${language}" "${inputFile}" "${outputFile}"
output_ext=.hocr

In this example:

  • The option -c "tessedit_create_hocr=1" selects the HOCR output format, and -c "hocr_font_info=1" adds font information to that output.
  • The Tesseract CLI automatically appends a .hocr suffix to the output filename, which is why we specify output_ext=.hocr in the configuration.

Activating the External OCR Engine

To utilize this external engine within your application, you can set the options when opening a document:

"OCR=on;OCR_ENGINE=tesseract_cli"

Troubleshooting External OCR Engines

If you encounter issues when using a different OCR engine, consider the following steps:

  1. Verify Executable Paths: Ensure that the paths for exe and proc are correct and that the executable is accessible from the environment where Document Filters runs.
  2. Check Arguments: Review the arguments passed in the args configuration to confirm they are valid and properly formatted. Pay attention to quotes and escape characters if needed.
  3. Output File Permissions: Ensure that the process has the necessary permissions to write to the output file location.
  4. Logging and Error Messages: Enable detailed logging in your application to capture any error messages returned by the external OCR engine. This can provide insights into what may be going wrong.
  5. Testing the External Process: Manually test running the OCR engine from the command line with the same arguments to ensure it functions correctly outside of the Document Filters context.
  6. Review HOCR Output: After processing, inspect the generated HOCR output for correctness and completeness. If issues persist, consider adjusting the command-line options based on Tesseract’s documentation.
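
For step 6, a short script can list the words and bounding boxes found in an HOCR file, which makes empty or malformed output easy to spot. The sketch below relies on the HOCR convention that each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The sample markup is hand-written for illustration and is not output captured from any engine.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            match = BBOX.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(markup):
    """Parse HOCR markup and return the recognized words with their boxes."""
    parser = HocrWords()
    parser.feed(markup)
    return parser.words


SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
<span class='ocr_line' title='bbox 36 92 200 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 200 116; x_wconf 91'>world</span>
</span></div>"""

for text, box in hocr_words(SAMPLE):
    print(text, box)
```

An empty list from a file you expected to contain text suggests that the engine wrote a different format, or wrote to a different path than the one Document Filters reads.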

Inline OCR of Embedded Images

In addition to performing OCR on full-page raster inputs (for example, scanned TIFF pages or image-only PDF pages), Document Filters can now extract text from images that are embedded inline within supported document formats. This capability helps retain meaningful textual content that appears inside screen captures, diagrams, slide artwork, pasted photos, and other inline graphics.

Enable this behavior by supplying the options:

OCR=ON;OCR_INLINE_IMAGES=ON

You may combine this with any other OCR options (such as OCR_MIN_WIDTH or OCR_REORIENT_PAGES). When enabled, each qualifying embedded image is passed through the OCR pipeline and any recognized text is merged into the output flow (e.g., Markdown, JSON, HTML overlay) without altering the visual appearance of the original rendered image content.

Supported Formats

Inline image OCR is currently supported for: email and web archive formats (HTML, MSG, EML, MHT, ICS/VCAL, OLK15) and office productivity formats (Apple Keynote, Apple Numbers, Apple Pages, HWPX, OpenOffice, RTF, Microsoft Word, WordPerfect, Microsoft Visio, Microsoft PowerPoint).

| Option | Description |
| --- | --- |
| OCR_INLINE_IMAGES | Enables OCR on embedded (inline) images inside supported document pages. Default: OFF. Requires OCR=ON. |
| OCR_MIN_WIDTH | Minimum pixel width for an image to be considered for OCR. Smaller images are skipped. Default: 1,000 pixels. |
| OCR_REORIENT_PAGES | Attempts to detect and correct rotation prior to recognition (applies to inline images as well). |

Notes:

  • If OCR is not enabled, OCR_INLINE_IMAGES is ignored.
  • Images are only pre‑filtered by width (OCR_MIN_WIDTH). If an image meets the width threshold but contains no readable text (e.g., a low‑contrast logo, photograph, gradient, or decorative icon), the OCR engine may simply return no words; this is expected and not an additional skip rule.
  • Caching strategies within the implementation prevent re‑OCR of identical embedded images encountered multiple times in a single document session.
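
The caching note can be pictured as a lookup keyed on image content. The sketch below is a conceptual illustration only and makes no claim about how Document Filters implements its cache: it hashes the image bytes and calls the recognizer just once per distinct image. OcrCache and the recognize function passed to it are hypothetical.

```python
import hashlib


class OcrCache:
    """Remember OCR results by content hash so identical images are recognized once."""

    def __init__(self, recognize):
        self._recognize = recognize   # function: image bytes -> recognized text
        self._results = {}
        self.misses = 0               # number of times the recognizer actually ran

    def text_for(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._recognize(image_bytes)
        return self._results[key]


# A fake recognizer stands in for a real OCR engine in this demonstration.
cache = OcrCache(lambda data: data.decode("ascii").upper())
for image in (b"logo", b"chart", b"logo"):
    cache.text_for(image)
print("recognizer calls:", cache.misses)
```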

Example (C#) – Extracting Markdown With Inline Image OCR

using Hyland.DocumentFilters;

var api = new Api();
api.Initialize("YOUR_LICENSE_KEY", ".");

using var doc = api.OpenExtractor("example.docx", OpenMode.Paginated,
    options: "OCR=ON;OCR_INLINE_IMAGES=ON;OCR_MIN_WIDTH=800");

while (!doc.EndOfStream)
{
    var md = doc.GetText(8192); // Markdown output
    System.Console.WriteLine(md);
}

Example (Command Options)

OCR=ON;OCR_INLINE_IMAGES=ON;OCR_MIN_WIDTH=800

When to Use Inline Image OCR

Use OCR_INLINE_IMAGES=ON when downstream workflows (search, summarization, semantic enrichment, AI classification) benefit from textual content that is visually embedded but not otherwise present in the document’s natural text layer. Leaving it OFF avoids extra processing when inline graphics are unlikely to contain meaningful text.

Performance & Tuning

Keep runtime in check by:

  • Enabling OCR_INLINE_IMAGES only where inline image text adds value.
  • Raising or lowering OCR_MIN_WIDTH to balance coverage vs. cost.
  • Using OCR_CACHE for repeated / duplicate documents.
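
When choosing an OCR_MIN_WIDTH value, it can help to estimate how many embedded images would reach the OCR engine at each candidate threshold. The sketch below does that for a list of image widths, using the skip rule stated earlier (images narrower than the threshold are skipped). The widths are made-up sample data and coverage is a hypothetical helper.

```python
def coverage(widths, min_width):
    """Fraction of images that are wide enough to be sent to the OCR engine."""
    if not widths:
        return 0.0
    kept = sum(1 for width in widths if width >= min_width)
    return kept / len(widths)


sample_widths = [320, 640, 800, 1024, 1920]
for threshold in (1000, 800, 600):
    print(threshold, coverage(sample_widths, threshold))
```

A lower threshold increases coverage at the cost of more OCR work, much of it on small images that are less likely to contain readable text.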