Getting Started with Python¶

Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Python applications. Follow these instructions to set up your environment.

Clone and Include the Document Filters Repository¶

The Document Filters GitHub repository contains the necessary files and libraries.

Note

Before version 22.2, Python bindings for Document Filters were only available through shared object packages, supporting Python versions up to 3.3. Version 22.2 introduced ctypes-based Python bindings, which are compatible with Python versions 3.7 and later, while still supporting the legacy shared object bindings. Starting from version 24.4, the shared object bindings have been deprecated and removed.

Installing the Bindings¶

The Python bindings are provided as source in the bindings/python directory of the Document Filters GitHub repository.

The easiest way to work with them is to install them as a system package. Alternatively, you can include them within your application by copying the directory.

You can install the Python bindings in three ways:

1. Install Directly from the Directory¶

Navigate to the bindings/python directory and run the following command:

pip install .

2. Install from GitHub¶

You can also install the bindings directly from GitHub using the following command:

pip install git+https://github.com/Hyland/DocumentFilters.git@<desired-tag-or-branch>#subdirectory=bindings/python

Replace <desired-tag-or-branch> with the specific tag or branch you want to use.

3. Add to `requirements.txt`¶

If you're using a requirements.txt file for your project, you can add the following line to include the Python bindings:

DocumentFilters @ git+https://github.com/Hyland/DocumentFilters.git@<desired-tag-or-branch>#subdirectory=bindings/python

Again, replace <desired-tag-or-branch> with the specific tag or branch you want to use.

After updating the requirements.txt file, run the following command to install the dependencies:

pip install -r requirements.txt
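Whichever installation method you use, a quick check that Python can find the package may save time later. The sketch below uses only the standard library; the helper name bindings_installed is hypothetical, and the module name DocumentFilters is taken from the import statements used on this page. It only confirms that the Python package is importable, not that the native shared libraries can be loaded.

```python
import importlib.util


def bindings_installed(module_name: str = "DocumentFilters") -> bool:
    """Return True if a top-level module with this name is on the import path."""
    # find_spec returns None when no such top-level module can be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if bindings_installed():
        print("DocumentFilters bindings found")
    else:
        print("DocumentFilters bindings not found; re-run pip install")
```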

Initializing and Calling Document Filters¶

Once the package is installed, you can begin using it in your application.

Python

from DocumentFilters import *

df = Api()
df.Initialize("YOUR_LICENSE_KEY_HERE", ".")

The code above imports the public names of the DocumentFilters package into the current namespace, then creates an instance of the Api class and initializes it with a license key. Replace "YOUR_LICENSE_KEY_HERE" with your actual license key.

The second argument specifies the directory where Document Filters should search for resources like configuration files and fonts. The . indicates that it should look in the same directory as the Document Filters shared libraries.
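To avoid hard-coding the license key in source, one option is to read it from an environment variable. This is a minimal sketch, not part of the Document Filters API: the variable name DF_LICENSE_KEY and the helpers load_license_key and create_api are hypothetical. Only Api() and Initialize(key, resource_dir) come from the example above.

```python
import os


def load_license_key(env_var: str = "DF_LICENSE_KEY") -> str:
    """Read the license key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your license key")
    return key


def create_api(resource_dir: str = "."):
    """Create and initialize an Api instance, as in the example above."""
    # Imported here so this module can be loaded even without the bindings.
    from DocumentFilters import Api

    df = Api()
    df.Initialize(load_license_key(), resource_dir)
    return df
```

Calling create_api() then replaces the two initialization lines shown earlier.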

Extracting Text¶

Once the Document Filters library is initialized, you can begin extracting text from documents. The following Python code snippet demonstrates how to load a document and extract its contents using the Document Filters API. This example focuses on extracting text from a Word document (.doc file).

Python

from DocumentFilters import *

df = Api()
df.Initialize("YOUR_LICENSE_KEY_HERE", ".")

with df.GetExtractor("filename.doc") as doc:
  doc.Open(IGR_BODY_AND_META)

  while not doc.EOF:
    text = doc.GetText(4096, True)
    print(text)

In this code snippet, the file filename.doc is loaded into an extractor named doc. Because a with statement is used, the extractor is closed automatically when the block exits.

The extractor is opened with the IGR_BODY_AND_META option, which allows for the extraction of both the document's body text and its metadata.

The loop then repeatedly calls GetText until the extractor indicates that it has reached the end of the file (EOF).
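If you want the whole document as one string rather than printed chunks, the same loop can be wrapped in a small helper. This is a sketch under stated assumptions: read_all_text is a hypothetical name, GetText is assumed to return a str, and the second argument (True) is copied unchanged from the example above.

```python
def read_all_text(extractor, chunk_size: int = 4096) -> str:
    """Collect text from an opened extractor until it reports EOF."""
    chunks = []
    while not extractor.EOF:
        # Same call as in the example above: chunk size, then True.
        chunks.append(extractor.GetText(chunk_size, True))
    # Joining once at the end avoids repeated string concatenation.
    return "".join(chunks)
```

Inside the with block from the example, after doc.Open(IGR_BODY_AND_META), the loop becomes text = read_all_text(doc).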

Converting a Document¶

After initializing the Document Filters library, you can convert documents into different formats, such as PDF. The following Python code snippet demonstrates how to load a Word document (.doc file) and convert it into a PDF using the Document Filters API.

Python

from DocumentFilters import *

df = Api()
df.Initialize("YOUR_LICENSE_KEY_HERE", ".")

with df.GetExtractor("filename.doc") as doc:
  with df.MakeOutputCanvas("output.pdf", IGR_DEVICE_IMAGE_PDF) as canvas:
    doc.Open(IGR_FORMAT_IMAGE)

    canvas.RenderPages(doc)

This code snippet loads filename.doc into an extractor named doc. Again, the with block ensures that the extractor is closed automatically when the block exits.

Additionally, a new canvas object of type IGR_DEVICE_IMAGE_PDF is created, which will also be closed automatically when it is no longer needed.

The extractor is opened with the IGR_FORMAT_IMAGE option, indicating that the file should be converted into an image-based output, triggering pagination.

Finally, RenderPages is called to render each page from doc onto the canvas. Alternatively, you could iterate over the pages manually and call RenderPage for each one.

Did you know?

You can render more than one document to a canvas. If you want to stitch multiple files together, simply load each document into its own Extractor, then call RenderPage/s onto a single canvas.
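The stitching tip can be sketched as follows. The function name stitch_documents is hypothetical; it uses only the calls shown in the conversion example (GetExtractor, MakeOutputCanvas, Open, RenderPages). Keeping every extractor open until the canvas has been closed is a cautious assumption that mirrors the nesting order of that example, not a documented requirement.

```python
from contextlib import ExitStack


def stitch_documents(df, input_paths, output_path):
    """Render several documents, in order, onto a single PDF canvas."""
    # Imported here so this module can be loaded even without the bindings.
    from DocumentFilters import IGR_DEVICE_IMAGE_PDF, IGR_FORMAT_IMAGE

    with ExitStack() as stack:
        # One extractor per input file; all are closed when the stack exits.
        docs = [stack.enter_context(df.GetExtractor(path)) for path in input_paths]
        with df.MakeOutputCanvas(output_path, IGR_DEVICE_IMAGE_PDF) as canvas:
            for doc in docs:
                doc.Open(IGR_FORMAT_IMAGE)
                canvas.RenderPages(doc)
```

With an initialized Api instance df, stitch_documents(df, ["first.doc", "second.doc"], "combined.pdf") would write both inputs to one output file; the file names here are placeholders.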

Troubleshooting¶

libISYS11df.so: cannot open shared object file: No such file or directory

If you see an error similar to the one above, Python was unable to locate the Document Filters shared libraries. The libraries are located using standard dlopen rules.

This can often be worked around by adding the path containing the libraries to the LD_LIBRARY_PATH environment variable.
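To see whether the library is actually present in one of the LD_LIBRARY_PATH directories, a small standard-library check can help. This is a diagnostic sketch only: find_shared_library is a hypothetical helper, the file name is taken from the error message above, and the check does not reproduce the full dlopen search (system library directories and the loader cache are not consulted). Note that on Linux the dynamic loader reads LD_LIBRARY_PATH when the process starts, so the variable generally needs to be set before launching Python rather than from within the running script.

```python
import os
from pathlib import Path


def find_shared_library(filename: str = "libISYS11df.so", search_path=None):
    """Return the first matching file in a colon-separated path list, or None."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for entry in search_path.split(":"):
        if not entry:
            # Empty entries are skipped in this simplified check.
            continue
        candidate = Path(entry) / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_shared_library()
    print(found if found else "libISYS11df.so not found on LD_LIBRARY_PATH")
```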
