
Processing Options

All pipeline behavior can be customized by supplying a JSON options object in the body of the /presign request. Every field is optional; omitting a field falls back to the environment-level default.

POST /presign
Content-Type: application/json
Authorization: Bearer <token>
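As a rough illustration of the request shape above, the following Python sketch builds (but does not send) a POST /presign request with a JSON options body. The base URL and the helper name `build_presign_request` are hypothetical and not part of the API; only the path, the two headers, and the JSON body come from this page.

```python
import json
import urllib.request

# Hypothetical base URL; replace with your environment's endpoint.
BASE_URL = "https://api.example.com"


def build_presign_request(options: dict, token: str) -> urllib.request.Request:
    """Build (without sending) a POST /presign request carrying the options object."""
    body = json.dumps(options).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/presign",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# An empty dict would request all environment-level defaults.
req = build_presign_request({"chunking": True}, token="<token>")
```

Passing the request to `urllib.request.urlopen(req)` would send it; the response format is not covered in this section.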

Reference

The following sections describe each processing option accepted in the /presign request body.

normalization

Type: object | Default: environment-configured

Controls Unicode normalization applied to extracted text before further processing. Normalization replaces visually similar characters with their canonical ASCII equivalents, which improves downstream chunking, search, and embedding quality.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| quotations | boolean | true | Replaces "smart" (curly) quotation marks and apostrophes with their straight ASCII equivalents (" and '). |
| dashes | boolean | true | Replaces en dashes (–, U+2013) and em dashes (—, U+2014) with a standard ASCII hyphen-minus (-). |
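
To make the effect of the two flags concrete, here is a minimal Python sketch of the kind of substitution described above. It is not the service's implementation: the function name `normalize` and the exact set of characters mapped are assumptions chosen for illustration.

```python
# Curly quotes/apostrophes and dashes mapped to ASCII.
# The exact character set the service handles is an assumption.
QUOTE_MAP = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
DASH_MAP = {"\u2013": "-", "\u2014": "-"}


def normalize(text: str, quotations: bool = True, dashes: bool = True) -> str:
    """Apply the enabled substitutions in a single pass over the text."""
    table = {}
    if quotations:
        table.update(QUOTE_MAP)
    if dashes:
        table.update(DASH_MAP)
    return text.translate(str.maketrans(table))
```

With both flags left at true, curly quotes and both dash types come back as plain ASCII; setting a flag to false leaves that character class untouched.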

chunking

Type: boolean | Default: false

Enables or disables the chunking stage of the pipeline. When false, the extracted text is returned as a single block and the embedding stage is skipped regardless of the embedding flag.


chunking_strategy

Type: string | Default: context

The algorithm used to split text into chunks when chunking is true.

| Value | Description |
| --- | --- |
| context | Text-aware chunking that respects sentence and paragraph boundaries, producing semantically coherent chunks. |
| fixed | Fixed-size chunking that splits text into uniform-sized chunks. Use when consistent chunk sizes are required. |
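
The difference between the two strategies can be sketched in Python as follows. This is an illustrative approximation only: the function names, the sentence-splitting regular expression, and the greedy packing rule are assumptions, and the service's actual context algorithm may differ.

```python
import re


def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive slices of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_context(text: str, size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `size` characters.

    A single sentence longer than `size` is kept whole as its own chunk.
    """
    # Split on blank lines (paragraph breaks) or after sentence-ending punctuation.
    parts = re.split(r"\n\s*\n|(?<=[.!?])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The fixed variant may cut a word or sentence in half but guarantees uniform sizes; the context variant never splits a sentence, so chunk lengths vary.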

chunk_size

Type: integer

Target character count for each chunk when chunking is true. Must be a positive integer no greater than the selected embedding model's maximum chunk size. Values outside this range or non-integer values fall back to the model's configured default chunk size.
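
The fallback rule above can be expressed as a small Python function. This is a sketch of the documented behavior, not the service's code; `resolve_chunk_size`, `model_max`, and `model_default` are hypothetical names, and the real limits depend on the selected embedding model.

```python
def resolve_chunk_size(value, model_max: int, model_default: int) -> int:
    """Return `value` if it is a positive integer no greater than `model_max`,
    otherwise fall back to `model_default`."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    if is_integer and 0 < value <= model_max:
        return value
    return model_default
```

For example, with an assumed model maximum of 2048 and default of 1024, a requested size of 1500 is used as given, while 0, 5000, or "1500" all resolve to 1024.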


embedding

Type: boolean | Default: false

Enables or disables the embedding stage of the pipeline. Requires chunking to also be true; if chunking is disabled, no embeddings are generated.
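
The dependency between the chunking and embedding flags can be summarized with a short Python sketch. The helper name `effective_stages` is hypothetical, and the code assumes the environment-level defaults match the documented default of false for both flags.

```python
def effective_stages(options: dict) -> dict:
    """Work out which stages would run for a given options object."""
    # Assumes both flags default to false when omitted.
    chunking = options.get("chunking", False) is True
    # Embedding only runs when chunking is also enabled.
    embedding = chunking and options.get("embedding", False) is True
    return {"chunking": chunking, "embedding": embedding}
```

Setting only `"embedding": true` therefore yields no embeddings, because the chunking stage never runs.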


embeddings_model

Type: string | Default: environment default

Identifier of the embedding model to use. Must be one of the models available in the environment's configured allow-list. When omitted, the environment's default model is used (for example, cohere.embed-multilingual-v3).


json_schema

Type: string or false | Default: false

Controls whether a structured JSON representation of the document is included in the pipeline output, and which schema variant to use. Set to false or omit the field to exclude JSON output entirely.

| Value | Description |
| --- | --- |
| false | No JSON output. |
| MDAST | Markdown Abstract Syntax Tree: a structured representation of the document's Markdown content following the MDAST specification. |
| FULL | Full document JSON including all extracted metadata, structural elements, and content. |
| PIPELINE | Internal pipeline representation intended for debugging and integration testing. Includes intermediate processing artifacts. |

pii

Type: object or false | Default: false

Controls Personally Identifiable Information (PII) processing. Set to false or omit the field to skip PII processing entirely.

When PII processing is enabled, supply an object with the following fields:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | string | Yes | | Sets the processing mode (see below). |
| entity_redaction | boolean | No | false | Controls whether named entities (such as people, organisations, or locations) are also redacted. Requires mode to be redaction. |

mode values:

| Value | Description |
| --- | --- |
| detection | Identifies and annotates PII entities in the output without modifying the source text. |
| redaction | Replaces detected PII with placeholder tokens, removing sensitive data from the pipeline output. |
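
A client could check a pii value against the rules above before sending the request. The Python sketch below is one way to do that; the function name `validate_pii` and the error messages are hypothetical, and how the service itself responds to an invalid combination is not specified on this page.

```python
VALID_MODES = {"detection", "redaction"}


def validate_pii(pii) -> list[str]:
    """Return a list of problems; an empty list means the value matches the documented shape."""
    # false (or an omitted field, represented here as None) skips PII processing.
    if pii is False or pii is None:
        return []
    if not isinstance(pii, dict):
        return ["pii must be false or an object"]
    problems = []
    mode = pii.get("mode")
    if not isinstance(mode, str) or mode not in VALID_MODES:
        problems.append("mode is required and must be 'detection' or 'redaction'")
    if pii.get("entity_redaction") is True and mode != "redaction":
        problems.append("entity_redaction requires mode 'redaction'")
    return problems
```

For instance, `{"mode": "detection", "entity_redaction": true}` is flagged because entity redaction is only meaningful in redaction mode.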

Examples

The following examples show common /presign request bodies.

All defaults (minimal request)

{}

All options explicitly set

{
  "normalization": {
    "quotations": true,
    "dashes": true
  },
  "chunking": true,
  "chunking_strategy": "context",
  "chunk_size": 2000,
  "embedding": true,
  "embeddings_model": "cohere.embed-multilingual-v3",
  "json_schema": "PIPELINE",
  "pii": {
    "mode": "redaction",
    "entity_redaction": false
  }
}

PII detection only (no redaction)

{
  "normalization": {
    "quotations": true,
    "dashes": true
  },
  "chunking": true,
  "chunk_size": 1500,
  "embedding": true,
  "json_schema": false,
  "pii": {
    "mode": "detection"
  }
}

Chunking only (no embeddings)

{
  "normalization": {
    "quotations": true,
    "dashes": true
  },
  "chunking": true,
  "chunking_strategy": "fixed",
  "chunk_size": 1000,
  "embedding": false,
  "json_schema": "MDAST",
  "pii": false
}