About JSON Output¶

Hyland Document Filters provides robust options for generating JSON output from various document formats. By configuring the output settings, you can customize the generated JSON to suit your specific needs, from data structures to metadata inclusion.

Hyland Document Filters supports three JSON schemas: Full, Simplified, and MDAST. The Full schema contains the complete object hierarchy, representing the original document's DOM. The Simplified schema (option value PIPELINE) flattens that structure, which makes it well suited to AI applications. The MDAST schema is a JSON representation of the Markdown output.

What is JSON¶

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is widely used for APIs, configuration files, and data storage due to its simplicity and interoperability with various programming languages.
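As a small illustration of the format, the sketch below converts a Python dictionary to JSON text and back using the standard `json` module. The record and its keys are invented for this example and do not reflect any Document Filters schema.

```python
import json

# A made-up record for illustration; these keys are not a Document Filters schema.
record = {
    "title": "Quarterly Report",
    "pages": 12,
    "tags": ["finance", "2024"],
    "reviewed": False,
}

text = json.dumps(record)      # Python dict -> JSON text
restored = json.loads(text)    # JSON text -> Python dict

print(text)
print(restored == record)
```

The round trip preserves the key-value pairs and the array, which is what makes JSON convenient to exchange between programs written in different languages.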

JSON's structured format, with its key-value pairs and arrays, makes it more than just a data representation tool. Its flexibility and ease of use make it an ideal format for various advanced applications in AI, analytics, and data processing systems. Here are some key use cases for JSON beyond traditional data interchange:

JSON in AI and Machine Learning Systems
- Training Data Generation: JSON, with its structured format, is useful for generating training data for AI models that require labeled datasets. Its ability to represent complex data structures helps in organizing data hierarchically, making it easier to identify features and relationships.
  - Example: A machine learning model trained to classify documents can easily parse JSON to focus on relevant fields and values.
- Natural Language Processing (NLP) Tasks: JSON documents can be preprocessed by NLP engines for various tasks, such as sentiment analysis, named entity recognition, or topic modeling. Its clear structure allows AI models to focus on content while ignoring irrelevant information.
- AI-Driven Data Transformation: JSON is used in systems where AI transforms data from one format to another, such as converting datasets into different structures or summarizing large datasets. JSON's format facilitates easier processing and manipulation of data.

JSON in Analytics and Reporting Systems
- Structured Data Aggregation: In analytics and reporting systems, JSON is often used as a lightweight way to store structured data, allowing analysts to include explanations alongside visualizations. JSON's text format enables easy integration with reporting tools that generate automated reports.
- Data-Driven Documentation: JSON is ideal for automatically generating documentation that evolves as data changes. When combined with automated tools, JSON can be used to create real-time updated reports, financial documents, or data dashboards where content is generated based on underlying data models.

JSON in Data Processing Systems
- Data Comparison and Diffing: JSON's text-based format is well suited to data comparison (diffing) systems, which need to detect changes between different data versions. Because JSON is plain text, it can be compared with standard text-diffing tools, which is harder with binary or proprietary formats. Systems can highlight changes in data structure or content, facilitating version control.
- Content Cleaning and Normalization: In data processing pipelines, JSON can be used as an intermediate format for content cleaning and normalization. JSON simplifies the structure of datasets, allowing processing systems to focus on cleaning up data, such as removing duplicates or formatting inconsistencies before further analysis.
- Automated Documentation Generation: JSON serves as a common format in automated documentation generation systems where structured content like API responses, configurations, and data specifications are programmatically converted into documentation. Systems can dynamically generate JSON based on data inputs, enabling easy updates.
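The data comparison use case can be sketched with Python's standard library. This is a minimal illustration, not part of Document Filters: the helper names `canonical_lines` and `json_diff` are hypothetical, and the sample documents are invented. Serializing with sorted keys first means that a difference in key order alone does not show up as a change.

```python
import difflib
import json


def canonical_lines(data):
    """Serialize with sorted keys and fixed indentation so equal data gives equal text."""
    return json.dumps(data, indent=2, sort_keys=True).splitlines()


def json_diff(old, new):
    """Return unified-diff lines between two JSON-compatible values."""
    return list(
        difflib.unified_diff(
            canonical_lines(old),
            canonical_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Invented sample data: same keys in a different order, one changed value.
old_doc = {"title": "Draft", "pages": 3}
new_doc = {"pages": 3, "title": "Final"}

for line in json_diff(old_doc, new_doc):
    print(line)
```

Only the `title` line is reported as changed; reordering the keys has no effect on the diff.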

Creating JSON Output¶

Hyland Document Filters can generate JSON output, a structured data representation, from a variety of source formats. JSON creation is supported only in Hi-Def mode, which preserves the document's structure and semantics during conversion.

To create JSON output in Hi-Def mode, follow these steps:

1. Select Your Source Document: Choose the document you wish to convert to JSON. Ensure it is a supported format for Hi-Def mode.
2. Configure Output Settings: Set your desired output options, including:
   - Security settings (e.g., permissions for accessing the data)
   - Metadata to include (e.g., title, author, keywords)
3. Create JSON Canvas: Ensure that Hi-Def mode is selected in your conversion settings to define the schema for the JSON output. This schema will be used to accurately represent the data structure of the source document.
4. Render Data to Canvas: Initiate the rendering process, where the system will convert each relevant element of the document to the JSON format, ensuring fidelity in structure and content.
5. Review the Output: Once the JSON is created, review the data to ensure that the structure, metadata, and overall representation meet your expectations.

By following these steps, you can effectively create professional-grade JSON documents that retain the integrity of your original content, making them suitable for sharing, processing, and integration with other systems.
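The review step can be partly automated. The sketch below makes no assumption about the keys a given schema produces: it only confirms that the text parses as JSON and counts the kinds of values it contains. The helper names `summarize` and `review` are hypothetical, and the sample text is a placeholder rather than real Document Filters output.

```python
import json
from collections import Counter


def summarize(value, counts=None):
    """Recursively count the Python types of all values in a parsed JSON tree."""
    if counts is None:
        counts = Counter()
    counts[type(value).__name__] += 1
    if isinstance(value, dict):
        for child in value.values():
            summarize(child, counts)
    elif isinstance(value, list):
        for child in value:
            summarize(child, counts)
    return counts


def review(json_text):
    """Parse the text (raises json.JSONDecodeError if malformed) and summarize it."""
    return summarize(json.loads(json_text))


# Placeholder content; real output follows whichever schema was selected.
sample_text = '{"a": [1, 2, {"b": "x"}], "c": null}'
print(review(sample_text))
```

A quick summary like this can flag an unexpectedly empty or malformed result before a closer manual review.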

Explore our tutorials on creating JSON output to enhance your workflow:

How do I convert a document to a JSON file?

JSON Schema and Formatting¶

This section outlines the options available for configuring the JSON schema and formatting in Hyland Document Filters. It covers how to select the output schema for JSON generation and whether to enable formatted output for improved readability.

JSON Schema Options¶

JSON_OUTPUT_SCHEMA¶

JSON_OUTPUT_SCHEMA determines the schema to be used when generating JSON output.

FULL: Represents the complete document object model.
PIPELINE: Represents a simplified, flatter version of the document object model (the Simplified schema).
MDAST: Provides a JSON representation of the Markdown output.
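Because the descriptive schema names (Full, Simplified, MDAST) differ slightly from the option values (FULL, PIPELINE, MDAST), a small lookup can help avoid mix-ups. The following is a minimal sketch; `JsonSchema`, `FRIENDLY_NAMES`, and `schema_option` are hypothetical names for this example and are not part of the Document Filters API. It only produces the `JSON_OUTPUT_SCHEMA=<value>` text and does not call the product.

```python
from enum import Enum


class JsonSchema(Enum):
    """Option values listed for JSON_OUTPUT_SCHEMA."""
    FULL = "FULL"          # complete document object model
    PIPELINE = "PIPELINE"  # simplified, flatter structure
    MDAST = "MDAST"        # JSON representation of the Markdown output


# Descriptive names used in the prose, mapped to option values.
FRIENDLY_NAMES = {
    "full": JsonSchema.FULL,
    "simplified": JsonSchema.PIPELINE,
    "mdast": JsonSchema.MDAST,
}


def schema_option(name: str) -> str:
    """Return a 'JSON_OUTPUT_SCHEMA=<value>' pair for a descriptive or option name."""
    key = name.strip().lower()
    if key in FRIENDLY_NAMES:
        schema = FRIENDLY_NAMES[key]
    else:
        schema = JsonSchema(key.upper())  # raises ValueError for unknown names
    return f"JSON_OUTPUT_SCHEMA={schema.value}"


print(schema_option("Simplified"))
```

Rejecting unknown names early is a design choice of this sketch; how the product itself treats an unrecognized value is not covered here.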

JSON_FORMAT_OUTPUT¶

JSON_FORMAT_OUTPUT specifies whether the JSON output should be formatted with newlines and indentation. While this increases the file size, it enhances human readability.

true: Output will include newlines and indentation.
false: Output will not include newlines and indentation.
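The trade-off between the two settings can be seen with Python's standard `json` module, used here as a stand-in. The exact whitespace that Document Filters emits is not specified in this section, so the indentation width below is an assumption, and the sample object is invented.

```python
import json

# Invented sample data, not real Document Filters output.
sample = {"type": "paragraph", "text": "Hello", "children": [1, 2, 3]}

compact = json.dumps(sample, separators=(",", ":"))  # no newlines or indentation
formatted = json.dumps(sample, indent=2)             # newlines and indentation

print(len(compact), len(formatted))
print(json.loads(compact) == json.loads(formatted))
```

Both strings parse to the same data; formatting changes only the size and readability of the text.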

Content Inclusion¶

The Content Inclusion options control which elements appear in the generated JSON output. They specify whether elements such as bookmarks, fields, headers, footers, images, links, and metadata are included, providing flexibility in how the final output is structured. Additionally, users can choose the format for including metadata, enhancing compatibility with different systems. By customizing these settings, users can create JSON documents that meet their specific needs.

Content Inclusion Options¶

JSON_INCLUDE_BOUNDS¶

This option controls the inclusion of element bounds information in the generated JSON.

Default value: ON, which includes bounds information in the output.

JSON_INCLUDE_BOOKMARKS¶

This option controls the inclusion of bookmarks in the generated JSON.

Default value: ON, which includes bookmarks in the output.

JSON_INCLUDE_DOC_METADATA_PER_ELEMENT¶

This option enables or disables the inclusion of document-level metadata for each element in the generated JSON.

Default value: OFF, which does not include metadata for each element.

JSON_INCLUDE_ELEMENT_ID¶

This option controls whether an element ID is included in the generated JSON output.

Default value: OFF, which does not include element IDs.

JSON_INCLUDE_FIELDS¶

This option determines whether fields are included in the generated JSON.

Default value: ON, which includes fields in the output.

JSON_INCLUDE_FOOTERS¶

This option specifies whether page footers are included in the JSON output. The available values for this option are:

OFF (default): Does not include footers.
ON: Includes footers for all pages.
FIRST: Includes the footer from the first page only.

JSON_INCLUDE_HEADERS¶

This option specifies whether page headers are included in the JSON output. The available values for this option are:

OFF (default): Does not include headers.
ON: Includes headers for all pages.
FIRST: Includes the header from the first page only.

JSON_INCLUDE_IMAGE_DATA¶

This option controls the inclusion of image data in the generated JSON.

Default value: ON, which includes image data in the output.

JSON_INCLUDE_IMAGES¶

This option controls the inclusion of images in the generated JSON.

Default value: ON, which includes images in the output.

JSON_INCLUDE_LINKS¶

This option determines whether hyperlinks are included in the generated JSON.

Default value: ON, which includes links in the output.

JSON_INCLUDE_METADATA¶

This option specifies whether document metadata is included in the JSON output.

Default value: OFF, which does not include metadata.

JSON_INCLUDE_STYLES¶

This option determines whether element style information is included in the generated JSON.

Default value: ON, which includes style information in the output.

JSON_INCLUDE_WHITESPACE¶

This option enables or disables the inclusion of whitespace words in the generated JSON.

Default value: OFF, which does not include whitespace words.

JSON_INCLUDE_WORDS¶

This option determines whether word-level information is included in the generated JSON.

Default value: OFF, which does not include word-level information.

JSON_INCLUDE_FORMATTING¶

This option determines whether text formatting (e.g., bold, italic) is included in the JSON output.

Default value: ON, which includes text formatting in the output.
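The defaults listed above can be collected in one table, which makes it easy to see which settings actually need to be changed. The sketch below is illustrative only: `DEFAULTS`, `EXTRA_VALUES`, and `build_options` are hypothetical names, and joining `NAME=value` pairs with semicolons is an assumption about how options are passed, so check the API reference for the syntax your binding expects.

```python
# Defaults as listed in this section.
DEFAULTS = {
    "JSON_INCLUDE_BOUNDS": "ON",
    "JSON_INCLUDE_BOOKMARKS": "ON",
    "JSON_INCLUDE_DOC_METADATA_PER_ELEMENT": "OFF",
    "JSON_INCLUDE_ELEMENT_ID": "OFF",
    "JSON_INCLUDE_FIELDS": "ON",
    "JSON_INCLUDE_FOOTERS": "OFF",
    "JSON_INCLUDE_HEADERS": "OFF",
    "JSON_INCLUDE_IMAGE_DATA": "ON",
    "JSON_INCLUDE_IMAGES": "ON",
    "JSON_INCLUDE_LINKS": "ON",
    "JSON_INCLUDE_METADATA": "OFF",
    "JSON_INCLUDE_STYLES": "ON",
    "JSON_INCLUDE_WHITESPACE": "OFF",
    "JSON_INCLUDE_WORDS": "OFF",
    "JSON_INCLUDE_FORMATTING": "ON",
}

# Values beyond ON/OFF that this section lists for specific options.
EXTRA_VALUES = {
    "JSON_INCLUDE_FOOTERS": {"FIRST"},
    "JSON_INCLUDE_HEADERS": {"FIRST"},
}


def build_options(overrides):
    """Return 'NAME=value' pairs for the settings that differ from the defaults."""
    pairs = []
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown option: {name}")
        value = value.upper()
        allowed = {"ON", "OFF"} | EXTRA_VALUES.get(name, set())
        if value not in allowed:
            raise ValueError(f"{name} does not accept {value}")
        if value != DEFAULTS[name]:
            pairs.append(f"{name}={value}")
    # Assumption: pairs are joined with ';'. Verify against the API reference.
    return ";".join(pairs)


print(build_options({"JSON_INCLUDE_HEADERS": "first", "JSON_INCLUDE_WORDS": "ON"}))
```

Settings that already match the default are omitted from the result, so the output stays short and shows only the deliberate changes.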

Content Cleaning¶

The Content Cleaning feature allows users to improve the readability and machine-friendliness of generated content, making it ideal for downstream processing in AI/ML systems. This feature offers multiple configurable cleaning options, including removing non-ASCII characters and normalizing quotes.

Content Cleaning Option¶

JSON_CLEAN_CONTENT¶

JSON_CLEAN_CONTENT specifies an unordered list of cleaning functions to apply to each element's text when generating simplified JSON output with option JSON_OUTPUT_SCHEMA=PIPELINE.

The cleaning functions in the list are separated by commas, and their names are case-insensitive.

The default value is an empty string, which applies no cleaning functions.

Cleaning Functions¶

clean_non_ascii_chars: Removes non-ASCII characters, leaving only standard ASCII characters.
normalize_quotes: Standardizes a variety of Unicode single and double quotes by replacing them with ASCII single and double quotes.

The cleaning functions are applied in the following fixed order:

normalize_quotes
clean_non_ascii_chars

Examples¶

Option value normalize_quotes applies the normalize_quotes function.
Option value clean_non_ascii_chars,normalize_quotes applies normalize_quotes first and then clean_non_ascii_chars, following the fixed order above regardless of the order in which the functions are listed.
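The behavior described in this section can be approximated in a few lines of Python. This is a sketch of the documented semantics, not the product's implementation: `QUOTE_MAP`, `FIXED_ORDER`, and `apply_cleaning` are hypothetical names, the quote table covers only four common characters (the full set handled by the product is not listed here), and raising an error on unknown function names is a choice made for this example.

```python
# Illustrative quote table; the product's full set of quote characters is not listed here.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # left/right single quotation marks
    "\u201c": '"', "\u201d": '"',   # left/right double quotation marks
})


def normalize_quotes(text):
    """Replace Unicode quotes in QUOTE_MAP with ASCII quotes."""
    return text.translate(QUOTE_MAP)


def clean_non_ascii_chars(text):
    """Drop every character outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)


# Fixed application order stated in this section.
FIXED_ORDER = [
    ("normalize_quotes", normalize_quotes),
    ("clean_non_ascii_chars", clean_non_ascii_chars),
]


def apply_cleaning(option_value, text):
    """Apply the requested functions in the fixed order, ignoring the listed order."""
    requested = {name.strip().lower() for name in option_value.split(",") if name.strip()}
    unknown = requested - {name for name, _ in FIXED_ORDER}
    if unknown:
        # Assumption: this sketch rejects unknown names; the product may behave differently.
        raise ValueError(f"unknown cleaning functions: {sorted(unknown)}")
    for name, func in FIXED_ORDER:
        if name in requested:
            text = func(text)
    return text


print(apply_cleaning("clean_non_ascii_chars,normalize_quotes", "\u201cCaf\u00e9\u201d"))
```

The fixed order matters: if clean_non_ascii_chars ran first, curly quotes would be deleted before normalize_quotes could convert them to ASCII quotes.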