How do I convert a PDF to Markdown with Table Detection?

In this Article

This sample shows how to use the Hyland Document Filters SDK to convert a PDF document into HiDef Markdown with table detection enabled. It walks through initializing the Document Filters API, opening the PDF document, and rendering its content into well-structured Markdown output.

What You Will Learn:

API Initialization: Understand how to initialize the Hyland Document Filters API with a valid license code to enable document processing.
Document Opening: Learn how to open a PDF document for conversion to HiDef Markdown using the OpenExtractor method, ensuring the use of the correct open mode (e.g., Paginated) and enabling table detection with specific options.
Output Canvas Creation: Discover how to create an output canvas using the MakeOutputCanvas method, specifying the desired output file name and canvas type (e.g., MARKDOWN) to generate the Markdown structure.
Content Rendering: Learn how to render the document's pages into the output canvas, efficiently transforming the document's content, including detected tables, into Markdown format.
Resource Management: See how using statements in .NET, and the equivalent scoped constructs in the other samples (try-with-resources in Java, with blocks in Python), ensure that document and canvas objects are properly disposed of after use.

By following this sample, you will become familiar with the basics of setting up the Document Filters API, converting PDF documents to HiDef Markdown with table detection, and efficiently rendering document content for web applications.

See Table Detection for more details.

Converting a PDF to Markdown with Table Detection

app.cs
using Hyland.DocumentFilters;

var api = new Hyland.DocumentFilters.Api();
api.Initialize("License Code", ".");

var documentOptions = "PDF_TABLE_DETECTION=ON;";

var canvasOptions = "MARKDOWN_SIMPLE_TABLE_STYLE=GRID;";
canvasOptions += "MARKDOWN_COMPLEX_TABLE_STYLE=HTML;";
canvasOptions += "MARKDOWN_INCLUDE_FOOTERS=OFF;";
canvasOptions += "MARKDOWN_INCLUDE_HEADERS=OFF;";
canvasOptions += "MARKDOWN_INCLUDE_FIELDS=OFF;";

using var doc = api.OpenExtractor("invoice.pdf", OpenMode.Paginated, documentOptions);
using var canvas = api.MakeOutputCanvas("output.md", CanvasType.MARKDOWN, canvasOptions);

canvas.RenderPages(doc);

See our C# samples on GitHub

App.java
import com.perceptive.documentfilters.*;

public class App
{
    public static void main(String[] args) throws Exception
    {
        DocumentFilters df = new DocumentFilters();
        df.Initialize("License Code", ".");

        String documentOptions = "PDF_TABLE_DETECTION=ON;";

        String canvasOptions = "MARKDOWN_SIMPLE_TABLE_STYLE=GRID;";
        canvasOptions += "MARKDOWN_COMPLEX_TABLE_STYLE=HTML;";
        canvasOptions += "MARKDOWN_INCLUDE_FOOTERS=OFF;";
        canvasOptions += "MARKDOWN_INCLUDE_HEADERS=OFF;";
        canvasOptions += "MARKDOWN_INCLUDE_FIELDS=OFF;";

        try (Extractor doc = df.GetExtractor("invoice.pdf"))
        {
            try (Canvas canvas = df.MakeOutputCanvas("output.md", isys_docfiltersConstants.IGR_DEVICE_MARKDOWN, canvasOptions))
            {
                doc.Open(isys_docfiltersConstants.IGR_FORMAT_IMAGE, documentOptions);

                for (int i = 0, c = doc.GetPageCount(); i < c; ++i)
                {
                    try (Page page = doc.GetPage(i)) {
                        canvas.RenderPage(page);
                    }
                }
            }
        }
    }
}

See our Java samples on GitHub

app.py
from DocumentFilters import *

api = DocumentFilters()
api.Initialize("License Code", ".")

document_options = (
    "PDF_TABLE_DETECTION=ON;"
)

canvas_options = (
    "MARKDOWN_SIMPLE_TABLE_STYLE=GRID;"
    "MARKDOWN_COMPLEX_TABLE_STYLE=HTML;"
    "MARKDOWN_INCLUDE_FOOTERS=OFF;"
    "MARKDOWN_INCLUDE_HEADERS=OFF;"
    "MARKDOWN_INCLUDE_FIELDS=OFF;"
)

with api.OpenExtractor("invoice.pdf", mode=IGR_FORMAT_IMAGE, options=document_options) as doc:
    with api.MakeOutputCanvas("output.md", canvasType=IGR_DEVICE_MARKDOWN, options=canvas_options) as canvas:
        canvas.RenderPages(doc)

See our Python samples on GitHub
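
The option strings in these samples are semicolon-terminated KEY=VALUE pairs. As a minimal sketch, the helper below assembles such a string from a dictionary, which can make a longer option list easier to review and change. The name build_options is hypothetical and is not part of the Document Filters API; the option names are the ones used in the samples on this page.

```python
def build_options(options):
    """Join a dict into the 'KEY=VALUE;' string form used by the samples.

    Hypothetical helper, not part of the Document Filters API.
    Boolean values are written as ON/OFF; everything else is used as-is.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        parts.append(f"{key}={value};")
    return "".join(parts)


document_options = build_options({"PDF_TABLE_DETECTION": True})

canvas_options = build_options({
    "MARKDOWN_SIMPLE_TABLE_STYLE": "GRID",
    "MARKDOWN_COMPLEX_TABLE_STYLE": "HTML",
    "MARKDOWN_INCLUDE_FOOTERS": False,
    "MARKDOWN_INCLUDE_HEADERS": False,
    "MARKDOWN_INCLUDE_FIELDS": False,
})

print(document_options)  # PDF_TABLE_DETECTION=ON;
```

The resulting strings can be passed wherever the samples pass document_options and canvas_options.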

app.cpp
#include <DocumentFiltersObjects.h>
#include <iostream>

int main() {
    try {
        // Create and initialize the API object
        Hyland::DocFilters::Api api;
        api.Initialize("License Code", ".");

        std::wstring documentOptions = L"PDF_TABLE_DETECTION=ON;";

        std::wstring canvasOptions = L"MARKDOWN_SIMPLE_TABLE_STYLE=GRID;";
        canvasOptions += L"MARKDOWN_COMPLEX_TABLE_STYLE=HTML;";
        canvasOptions += L"MARKDOWN_INCLUDE_FOOTERS=OFF;";
        canvasOptions += L"MARKDOWN_INCLUDE_HEADERS=OFF;";
        canvasOptions += L"MARKDOWN_INCLUDE_FIELDS=OFF;";

        // Open the input file
        Hyland::DocFilters::Extractor doc = api.OpenExtractor("invoice.pdf", Hyland::DocFilters::OpenMode::Paginated, 0, documentOptions);

        // Create the output canvas 
        Hyland::DocFilters::Canvas canvas = api.MakeOutputCanvas("output.md", Hyland::DocFilters::CanvasType::MARKDOWN, canvasOptions);

        // Render all pages to the output
        canvas.RenderPages(doc);
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1; // Indicate an error
    }

    return 0; // Successful execution
}

See our C++ samples on GitHub

app.c
#include <DocumentFilters.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// License code, input file name, and output file name
const char* license_code = "License Code";
const char* input_file = "invoice.pdf";
const char* output_file = "output.md";

// Prototype for the UTF-8 to UCS-2 conversion helper defined below
IGR_UCS2* UCS2(const char* src, IGR_UCS2** dest);

int main() {
    // Initialization of status and control blocks
    Instance_Status_Block isb = { 0 };
    Error_Control_Block ecb = { 0 };
    IGR_SHORT instance = 0;
    IGR_LONG caps = 0, type = 0, docHandle = 0, res = 0, pageCount = 0;
    IGR_HCANVAS canvasHandle = 0;
    IGR_HPAGE pageHandle = 0;
    IGR_UCS2* tempBuffer = NULL, *tempBuffer2 = NULL, *tempBuffer3 = NULL;

    // Set license code
    strncpy(isb.Licensee_ID1, license_code, sizeof(isb.Licensee_ID1) - 1);
    // Initialize instance
    Init_Instance(0, ".", &isb, &instance, &ecb);

    // Options applied when opening the document
    const char* document_options = "PDF_TABLE_DETECTION=ON;";

    // Options applied to the Markdown output canvas
    const char* canvas_options = "MARKDOWN_SIMPLE_TABLE_STYLE=GRID;"
        "MARKDOWN_COMPLEX_TABLE_STYLE=HTML;"
        "MARKDOWN_INCLUDE_FOOTERS=OFF;"
        "MARKDOWN_INCLUDE_HEADERS=OFF;"
        "MARKDOWN_INCLUDE_FIELDS=OFF;";

    // Open the document file
    res = IGR_Open_File_Ex(UCS2(input_file, &tempBuffer), IGR_FORMAT_IMAGE, UCS2(document_options, &tempBuffer2), &caps, &type, &docHandle, &ecb);
    if (res != IGR_OK)
        goto error;

    // Create the output canvas
    res = IGR_Make_Output_Canvas(IGR_DEVICE_MARKDOWN, UCS2(output_file, &tempBuffer), UCS2(canvas_options, &tempBuffer3), &canvasHandle, &ecb);
    if (res != IGR_OK)
        goto error;

    // Count the number of printable pages
    res = IGR_Get_Page_Count(docHandle, &pageCount, &ecb);
    if (res != IGR_OK)
        goto error;

    // Render the pages to the canvas
    for (IGR_LONG i = 0; i < pageCount; i++) {
        // Keep the result in res so the error handler can report it
        res = IGR_Open_Page(docHandle, i, &pageHandle, &ecb);
        if (res != IGR_OK)
            goto error;

        IGR_Render_Page(pageHandle, canvasHandle, &ecb);
        IGR_Close_Page(pageHandle, &ecb);
    }

    goto cleanup;

error:
    // Print error message
    if (ecb.Msg[0] != 0)
        fprintf(stderr, "Error: %s\n", ecb.Msg);
    else
        fprintf(stderr, "Error: %d\n", res);
cleanup:
    // Free allocated resources
    if (tempBuffer) free(tempBuffer);
    if (tempBuffer2) free(tempBuffer2);
    if (tempBuffer3) free(tempBuffer3);
    if (canvasHandle) IGR_Close_Canvas(canvasHandle, &ecb);
    if (docHandle) IGR_Close_File(docHandle, &ecb);
    // Report failure to the caller when any step returned an error
    return res == IGR_OK ? 0 : 1;
}

// Convert a UTF8 string to UCS2
IGR_UCS2* UCS2(const char* src, IGR_UCS2** dest) {
    size_t len = strlen(src);
    size_t destSize = len * 2 + 2;

    // Release any buffer left over from a previous conversion
    if (*dest)
        free(*dest);

    *dest = malloc(destSize);
    if (!*dest) 
        return NULL;

    // Perform the conversion
    UTF8_to_Widechar_Ex(src, len, *dest, destSize);
    return *dest;
}

Reviewing the Output

Let's examine the following PDF invoice file.

With PDF_TABLE_DETECTION enabled, the content is accurately represented in table format:

| **Description** | **Quantity** | **Unit Price** | **Cost** |
| --------------- | ------------ | -------------- | -------- |
| Item 1          | 55           |   100          |   5,500  |
| Item 2          | 13           |   90           |   1,170  |
| Item 3          | 25           |   50           |   1,250  |

|   |     |          |        |
| - | --- | -------- | ------ |
|   |     | Subtotal |  7,920 |
|   | Tax | 8.25%    |   653  |
|   |     | Total    |  8,573 |

Thank you for your business. It’s a pleasure to work with you on your project.  
Your next order will ship in 30 days.

When PDF_TABLE_DETECTION is not enabled, the line items in the table are treated as standard text, resulting in the following Markdown output:

Item 1

55   100   5,500

Item 2

13   90   1,170

Item 3

25   50   1,250

Subtotal  7,920

Tax 8.25%   653

Total  8,573

Thank you for your business. It’s a pleasure to work with you on your project.  
Your next order will ship in 30 days.

Note

Table detection is not an exact science; it involves interpreting the original intent from the available information in the file. As a result, it may occasionally misidentify content as a table or fail to detect existing tables.
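
Because detection can miss a table, it can be worth checking the generated Markdown before relying on it. The following is a minimal sketch in plain Python, with no dependency on Document Filters, that extracts pipe-style tables like those in the sample output above. The function names are hypothetical, and the check assumes pipe tables only; tables written as HTML (which the MARKDOWN_COMPLEX_TABLE_STYLE=HTML setting suggests for complex tables) would not be counted.

```python
import re


def _split_row(line):
    """Split one pipe-table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def _is_delimiter_row(line):
    """True for a header separator row such as '| --- | :---: |'."""
    return "|" in line and all(
        re.fullmatch(r":?-+:?", cell) for cell in _split_row(line)
    )


def extract_pipe_tables(markdown):
    """Return each pipe table as a list of rows (header first, separator dropped)."""
    lines = markdown.splitlines()
    tables = []
    i = 0
    while i < len(lines):
        starts_table = (
            "|" in lines[i]
            and i + 1 < len(lines)
            and _is_delimiter_row(lines[i + 1])
        )
        if starts_table:
            rows = [_split_row(lines[i])]
            i += 2  # skip the header and its separator row
            while i < len(lines) and "|" in lines[i]:
                rows.append(_split_row(lines[i]))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


# Inline samples keep the sketch self-contained; in practice the text
# would be read from the generated file (output.md in the samples).
detected = """\
| **Description** | **Quantity** | **Unit Price** | **Cost** |
| --------------- | ------------ | -------------- | -------- |
| Item 1          | 55           |   100          |   5,500  |
| Item 2          | 13           |   90           |   1,170  |

Thank you for your business.
"""
undetected = "Item 1\n\n55   100   5,500\n"

tables = extract_pipe_tables(detected)
print(len(tables), tables[0][1])        # 1 ['Item 1', '55', '100', '5,500']
print(extract_pipe_tables(undetected))  # []
```

An empty result for a document known to contain a table is a hint that detection did not fire for that file and that the output needs manual review.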