Getting Started with Python¶
Hyland Document Filters provides robust document processing capabilities that can be easily integrated into your Python applications. Follow these instructions to set up your environment.
Clone and Include the Document Filters Repository¶
The Document Filters GitHub repository contains the necessary files and libraries.
Note
Before version 22.2, Python bindings for Document Filters were only available through shared object packages, supporting Python versions up to 3.3. Version 22.2 introduced ctypes-based Python bindings, which are compatible with Python versions 3.7 and later, while still supporting the legacy shared object bindings. Starting from version 24.4, the shared object bindings have been deprecated and removed.
Installing the Bindings¶
The Python bindings are provided as source in the bindings/python directory of the Document Filters GitHub repository.
The easiest way to work with them is to install them as a system package. Alternatively, you can include them in your application by copying the directory.
You can install the Python bindings in three ways:
1. Install Directly from the Directory¶
Navigate to the bindings/python directory and run the following command:
pip install .
2. Install from GitHub¶
You can also install the bindings directly from GitHub using the following command:
pip install git+https://github.com/Hyland/DocumentFilters.git@<desired-tag-or-branch>#subdirectory=bindings/python
Replace <desired-tag-or-branch> with the specific tag or branch you want to use.
3. Add to requirements.txt¶
If you're using a requirements.txt file for your project, you can add the following line to include the Python bindings:
DocumentFilters @ git+https://github.com/Hyland/DocumentFilters.git@<desired-tag-or-branch>#subdirectory=bindings/python
Again, replace <desired-tag-or-branch> with the specific tag or branch you want to use.
After updating the requirements.txt file, run the following command to install the dependencies:
pip install -r requirements.txt
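Whichever installation method you choose, it can be useful to confirm that the package is visible to the interpreter you plan to run. The following is a minimal sketch that uses only the Python standard library; the helper name bindings_installed is hypothetical, and the package name DocumentFilters is taken from the import described in the next section.

```python
import importlib.util


def bindings_installed(package: str = "DocumentFilters") -> bool:
    """Return True if the named top-level package is on the import path."""
    # find_spec locates the package without importing it, so this check
    # does not try to load the native Document Filters libraries.
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("DocumentFilters importable:", bindings_installed())
```

This only checks that Python can find the package. It does not verify that the Document Filters shared libraries can be loaded.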
Initializing and Calling Document Filters¶
Once the package is installed, you can begin using it in your application.
The initialization code imports the DocumentFilters package into the global scope, then creates an instance of the Api class and initializes it with a license key. Replace "YOUR_LICENSE_KEY_HERE" with your actual license key.
The second argument specifies the directory where Document Filters should search for resources such as configuration files and fonts. A value of "." indicates that it should look in the same directory as the Document Filters shared libraries.
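As a rough sketch of the steps just described, initialization can be wrapped in a small helper. The Api class name follows the wording of this section. The Initialize method name and the create_api helper are assumptions made for illustration and should be checked against the bindings you installed.

```python
def create_api(license_key: str, resource_dir: str = "."):
    """Create and initialize a Document Filters API object.

    Assumptions (not confirmed by this page): the package exposes a class
    named Api, and that class has an Initialize(license_key, resource_dir)
    method taking the license key first and the resource directory second.
    """
    # Imported lazily so that a missing installation fails at call time.
    import DocumentFilters

    api = DocumentFilters.Api()                # assumed class name
    api.Initialize(license_key, resource_dir)  # assumed method name
    return api
```

A caller would then write something like api = create_api("YOUR_LICENSE_KEY_HERE") and reuse the returned object for every document it processes.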
Extracting Text¶
Once the Document Filters library is initialized, you can begin extracting text from documents. The following Python code snippet demonstrates how to load a document and extract its contents using the Document Filters API. This example focuses on extracting text from a Word document (.doc file).
In this example, the file filename.doc is loaded into an extractor named doc. Because a with statement is used, the extractor is closed automatically when the block exits.
The extractor is opened with the IGR_BODY_AND_META option, which allows for the extraction of both the document's body text and its metadata.
The loop then repeatedly calls GetText until the extractor indicates that it has reached the end of the file (EOF).
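The read loop described above can be sketched as a helper that works on an extractor that has already been created. GetText comes from the text above. The Open and getEOF method names, the chunk-size argument to GetText, and the extract_text helper itself are assumptions for illustration, so verify them against the installed bindings.

```python
def extract_text(doc, open_flags, chunk_size: int = 4096) -> str:
    """Read all text from an extractor by calling GetText until EOF.

    doc        -- an extractor object, created by the caller
    open_flags -- open options, for example the IGR_BODY_AND_META constant
    chunk_size -- assumed to be the maximum characters returned per call
    """
    doc.Open(open_flags)          # assumed method name
    parts = []
    while not doc.getEOF():       # assumed method name
        parts.append(doc.GetText(chunk_size))
    # Joining once at the end avoids repeated string concatenation.
    return "".join(parts)
```

A caller would typically create the extractor inside a with block and pass it to this helper, so that the extractor is still closed automatically when the block exits.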
Converting a Document¶
After initializing the Document Filters library, you can convert documents into different formats, such as PDF. The following Python code snippet demonstrates how to load a Word document (.doc file) and convert it into a PDF using the Document Filters API.
This example loads filename.doc into an extractor named doc. Again, using a with block ensures that the extractor is closed automatically when the block exits.
Additionally, a new canvas object of type IGR_DEVICE_IMAGE_PDF is created, which will also be closed automatically when it is no longer needed.
The extractor is opened with the IGR_FORMAT_IMAGE option, indicating that the file should be converted into an image-based output, which triggers pagination.
Finally, RenderPages is called to render each page from doc onto the canvas. Alternatively, you could iterate over the pages manually and call RenderPage for each one.
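The conversion flow above can be sketched as a helper that receives an extractor and a canvas that the caller has already created. RenderPages and the use of with blocks come from the text above. The Open method name, the exact RenderPages(doc) call shape, and the convert_document helper are assumptions for illustration.

```python
def convert_document(doc, canvas, open_flags) -> None:
    """Render every page of one extractor onto a canvas, then close both.

    doc        -- an extractor object, created by the caller
    canvas     -- an output canvas, for example one of type IGR_DEVICE_IMAGE_PDF
    open_flags -- open options, for example the IGR_FORMAT_IMAGE constant
    """
    # Context managers exit in reverse order, so the canvas is closed
    # (finishing the output file) before the extractor is closed.
    with doc, canvas:
        doc.Open(open_flags)     # assumed method name
        canvas.RenderPages(doc)  # call shape assumed from the text above
```

Keeping object creation outside the helper means the same function works for any output device type the caller chooses.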
Did you know?
You can render more than one document to a canvas. If you want to stitch multiple files together, simply load each document into its own Extractor, then call RenderPage or RenderPages onto a single canvas.
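One possible way to stitch several documents together is sketched below. It assumes the caller has already created one extractor per input file and a single canvas. As before, the Open method name and the RenderPages(doc) call shape are assumptions, and render_documents is a hypothetical helper.

```python
def render_documents(extractors, canvas, open_flags) -> int:
    """Render the pages of each extractor, in order, onto one shared canvas.

    Returns the number of documents rendered. The canvas is left open so
    the caller decides when the combined output is finished.
    """
    rendered = 0
    for doc in extractors:
        # Close each extractor as soon as its pages have been rendered.
        with doc:
            doc.Open(open_flags)     # assumed method name
            canvas.RenderPages(doc)  # call shape assumed from the text above
        rendered += 1
    return rendered
```

Pages appear in the output in the same order as the extractors are supplied.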
Troubleshooting¶
libISYS11df.so: cannot open shared object file: No such file or directory
If you see an error similar to the one above, it means that Python was unable to locate the Document Filters shared libraries. Standard dlopen rules are used to locate the libraries.
You can often work around this by adding the directory that contains the libraries to the LD_LIBRARY_PATH environment variable.
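To check whether LD_LIBRARY_PATH already points at the library, a small diagnostic along the following lines may help. It is only a sketch: find_shared_library is a hypothetical helper, and the file name is copied from the error message above. It inspects LD_LIBRARY_PATH entries only, so an empty result does not rule out the other locations that dlopen searches.

```python
import os
from pathlib import Path


def find_shared_library(name: str = "libISYS11df.so", search_path=None):
    """Return the search-path directories that contain the named file.

    search_path defaults to the current LD_LIBRARY_PATH value and is
    split on the platform's path separator (a colon on Linux).
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries produced by leading or trailing separators.
        if entry and (Path(entry) / name).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    print(find_shared_library() or "libISYS11df.so not found on LD_LIBRARY_PATH")
```

On Linux the dynamic loader typically reads LD_LIBRARY_PATH when the process starts, so set the variable in the shell before launching Python rather than changing os.environ from inside the running script.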