PII Handling Details

Overview

The Data Curation API includes built-in Personally Identifiable Information (PII) processing capabilities to help organizations maintain compliance and data privacy when processing unstructured content.

PII processing supports the following modes:

Processing Mode	Description
Detection	Identifies PII entities in the extracted content and reports them in the output.
Redaction	Replaces each detected PII occurrence in the output content with a mask.

When enabled, masking is performed prior to any enabled chunking or embedding steps.

Supported PII Types

The Data Curation API supports the following categories of PII detection:

Detection Type	Description
Pattern-Based	Uses deterministic rules and regular expressions.
Named Entity Recognition (NER)	Uses a machine-learning model to identify entities in text.

These categories are toggled independently (see Enabling PII Processing).

Pattern-Based PII Types

The following types are detected using deterministic pattern matching and are included whenever PII processing is enabled.

PII Type	Description
Addresses	Street addresses including a street number; street name; and a street designator, such as Street, Avenue, Boulevard, Lane, Drive, Court, Plaza, Terrace, Place, Trail. Optional unit identifiers, such as Apt, Suite, Unit, Floor, and Building, are also recognized. Example: `123 Main Street Apt 4B`.
Email Addresses	Standard email addresses in `user@domain.tld` format. Supports alphanumeric characters, dots, hyphens, underscores, and plus signs in the local part.
Phone Numbers	US phone numbers in 10-digit formats with optional country code prefix (`+1`). Recognized separators include hyphens, dots, spaces, and parentheses around the area code. Examples: `555-123-4567`, `(555) 123.4567`, `+1 5551234567`.
Social Security Numbers (SSN)	US Social Security Numbers in 9-digit format. Supports hyphen-separated (for example, `123-45-6789`), space-separated (for example, `123 45 6789`), and consecutive digit (for example, `123456789`) formats. Separators must be consistent within a single number.
Credit Card Numbers	Payment card numbers in 13-, 15-, 16-, or 19-digit formats. Supports Visa, Mastercard, American Express, Discover, JCB, Diners Club, and UnionPay card shapes. Recognized formats include consecutive digits (for example, `4111111111111111`), hyphen-separated groups (for example, `4111-1111-1111-1111`), space-separated groups (for example, `4111 1111 1111 1111`), and the 4-6-5 grouping used by American Express (for example, `3782-822463-10005`).
CVV Numbers	Card verification codes (3–4 digits) detected when preceded by a keyword, such as `CVV`, `CVC`, `CSC`, `CID`, `security code`, or `card code`. Also recognized in JSON fields (for example, `"cvv": "123"`) and query string parameters (for example, `?cvv=123`).
Credit Card Issuers	Explicit mentions of card network names: `Visa`, `MasterCard`, `American Express`, `Amex`, `Discover`, `JCB`, `Diners Club`, `UnionPay`, and `Maestro`.
US Bank Account Numbers	Numeric sequences of 8, 10, 11, 12, 14, or 17 consecutive digits surrounded by whitespace or string boundaries. Digit lengths that overlap with SSNs (9) or credit cards (13, 15, 16, 19) are excluded to prevent conflicts.
IBAN Codes	International Bank Account Numbers beginning with a two-letter country code followed by two check digits and up to 30 alphanumeric characters (for example, `GB29NWBK60161331926819`).
Passwords / Secrets	Passwords, API keys, tokens, and secrets detected through multiple methods: key-value pairs with keywords (for example, `password=`, `api_key:`, or `token=`), JSON fields containing credential keywords, URLs with embedded credentials (for example, `://user:pass@host`), JWT tokens (for example, `eyJ...`), Base64-encoded tokens, and high-entropy strings that exhibit characteristics of generated secrets.
IP Addresses	IPv4 addresses in dotted-decimal notation (for example, `192.168.1.1`). Each octet is matched as 1–3 digits separated by dots.
URLs	Web addresses beginning with `http://`, `https://`, or `www.` followed by a domain name with a valid TLD.
Zip Codes	US ZIP codes in 5-digit (for example, `12345`) or ZIP+4 (for example, `12345-6789`) format. Codes with prefixes that are unassigned in the US postal system are excluded to reduce false positives.
Date of Birth	Dates in `MM/DD/YYYY` or `MM-DD-YYYY` format, detected only when preceded by a contextual keyword, such as `birth`, `born`, `dob`, `date of birth`, or `birthdate`.
Gender Information	Gender identity terms, including but not limited to: `transgender`, `cisgender`, `non-binary`, `genderqueer`, `genderfluid`, `two-spirit`, `pangender`, and related compound forms. Standalone terms such as `male` or `female` are not matched.
Passport IDs	US passport numbers matching the format of one letter followed by 7 digits (for example, `A1234567`). The letter range excludes certain characters (`Q`, `X`, `Z`) per US passport conventions.
Medical License IDs	Medical license identifiers in the format of one letter followed by 5–7 digits (for example, `D123456`) or two digits, a hyphen, and five digits (for example, `12-34567`).
Prefixes and Titles	Personal titles and honorifics, including common forms (`Mr.`, `Mrs.`, `Ms.`, `Dr.`, `Prof.`, `Rev.`), military ranks (`Capt.`, `Col.`, `Gen.`, `Lt.`, `Maj.`, `Sgt.`), nobility titles (`Baron`, `Duke`, `Earl`, `Marquis`), religious titles (`Rabbi`, `Imam`, `Cardinal`, `Bishop`, `Archbishop`, `Pope`), and civic titles (`President`, `Chancellor`, `Dean`, `Judge`, `Justice`, `Ambassador`).
Crypto Wallet Addresses	Ethereum addresses consisting of a `0x` prefix followed by 40 hexadecimal characters, and general cryptocurrency addresses starting with `bc1`, `1`, or `3` followed by 25–39 Base58 characters.
Bitcoin Addresses	Bitcoin-specific addresses in Legacy format (starting with `1` or `3`, 25–34 Base58 characters) and Bech32/SegWit format (starting with `bc1`, 39–59 lowercase alphanumeric characters).
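
To make two of the rules above concrete, here is a minimal Python sketch of how the SSN separator-consistency rule and the bank account length exclusion could be expressed. The regular expressions and the function names `find_ssns` and `find_bank_accounts` are illustrative assumptions; they are not the patterns the Data Curation API actually uses.

```python
import re

# Assumed SSN pattern: 3-2-4 digits. The captured separator (hyphen, space,
# or nothing) must be repeated via the \1 backreference, so mixed
# separators such as "123-45 6789" do not match.
SSN_RE = re.compile(r"(?<!\d)\d{3}([- ]?)\d{2}\1\d{4}(?!\d)")

# Lengths listed for US bank account numbers. 9, 13, 15, 16, and 19 are
# absent because they overlap with SSNs and payment card numbers.
BANK_ACCOUNT_LENGTHS = {8, 10, 11, 12, 14, 17}

# A run of digits bounded by whitespace or the start/end of the string.
DIGIT_RUN_RE = re.compile(r"(?<!\S)\d+(?!\S)")


def find_ssns(text: str) -> list[str]:
    """Return SSN-shaped substrings with consistent separators."""
    return [m.group(0) for m in SSN_RE.finditer(text)]


def find_bank_accounts(text: str) -> list[str]:
    """Return digit runs whose length is one of the listed account lengths."""
    return [
        m.group(0)
        for m in DIGIT_RUN_RE.finditer(text)
        if len(m.group(0)) in BANK_ACCOUNT_LENGTHS
    ]
```

Filtering by length after matching a generic digit run keeps the exclusion list readable in one place instead of encoding it in the regular expression.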

Named Entity Recognition (NER) Types

The following types are detected using a Named Entity Recognition (NER) model. NER-based detection is controlled separately by the entity_redaction option and is disabled by default.

PII Type	Description
Person Names	Full or partial names of individuals
Organization Names	Names of companies, agencies, or institutions
Locations	Geographic locations, such as cities, states, or countries
Affiliations	National, religious, or political group identifiers

Note

Because NER detection and redaction are not as deterministic as other PII redaction methods, using them may yield false positives or unexpected results. Review NER output carefully before relying on it in production workflows.

Enabling PII Processing

PII processing is controlled by the pii field in the JSON request body sent to the POST /presign endpoint. If the pii field is omitted from the request, it defaults to false and no PII processing occurs.

For the full request schema and available options, see the Endpoints reference for the /presign endpoint.

Output Structure

When PII processing is enabled and matches are found, the results JSON returned from the pipeline includes two additional top-level fields: pii_matches and pii_match_type_counts. If no PII is detected, these fields are omitted from the output.

pii_matches

An array of match objects describing each detected PII occurrence. Each object contains the following fields:

Field	Type	Description
`start`	`integer`	Start character offset of the match in the original text.
`end`	`integer`	End character offset of the match in the original text.
`match_type`	`string`	The PII category that was detected (for example, `SSN`, `EMAIL`, `PERSON`).
`redacted_start`	`integer` or `null`	Start character offset of the mask in the redacted text. Populated only in redaction mode; `null` in detection mode.
`redacted_end`	`integer` or `null`	End character offset of the mask in the redacted text. Populated only in redaction mode; `null` in detection mode.
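
As an illustration of how a client might model a match object, the following Python sketch wraps the fields in a small dataclass. The class name `PiiMatch` and its helper methods are hypothetical. The sketch also assumes that offsets are zero-based and that `end` is exclusive, which is inferred from the example outputs in this document rather than stated by the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PiiMatch:
    """Hypothetical client-side view of one entry in pii_matches."""

    start: int
    end: int
    match_type: str
    redacted_start: Optional[int] = None  # null in detection mode
    redacted_end: Optional[int] = None    # null in detection mode

    @classmethod
    def from_dict(cls, raw: dict) -> "PiiMatch":
        # .get() tolerates entries that leave the redacted_* keys out.
        return cls(
            start=raw["start"],
            end=raw["end"],
            match_type=raw["match_type"],
            redacted_start=raw.get("redacted_start"),
            redacted_end=raw.get("redacted_end"),
        )

    def original_text(self, text: str) -> str:
        """Slice the matched span, assuming zero-based, end-exclusive offsets."""
        return text[self.start:self.end]
```

For example, applying `original_text` to the sentence `Contact John Smith at john.smith@example.com or 555-123-4567.` with `start=22` and `end=44` yields the email address.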

pii_match_type_counts

A summary object where each key is a match_type string and the value is the number of times that type was detected. This provides a quick overview without iterating through the full pii_matches array.
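
The relationship between the two fields can be sketched in Python: the summary is what you would get by tallying `match_type` across `pii_matches`. The function name `count_match_types` is an assumption made for illustration.

```python
from collections import Counter


def count_match_types(pii_matches: list[dict]) -> dict[str, int]:
    """Tally match_type values, mirroring the shape of pii_match_type_counts."""
    return dict(Counter(m["match_type"] for m in pii_matches))
```

A client could use a function like this as a sanity check that the summary it received agrees with the full match list.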

Detection Mode Output

In detection mode, the markdown.output field contains the original, unmodified text. The pii_matches array reports where PII was found, but no masking is applied.

{
  "markdown": {
    "output": "Contact John Smith at john.smith@example.com or 555-123-4567."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": null,
      "redacted_end": null
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}

Redaction Mode Output

In redaction mode, each detected PII occurrence in markdown.output is replaced with the mask [PII]. The pii_matches array includes redacted_start and redacted_end fields indicating where each mask appears in the masked text.

{
  "markdown": {
    "output": "Contact [PII] at [PII] or [PII]."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": 8,
      "redacted_end": 13
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": 17,
      "redacted_end": 22
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": 26,
      "redacted_end": 31
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}
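
The way the redacted offsets relate to the original offsets can be reproduced with a short Python sketch. It is not the API's implementation: the function name `redact` is hypothetical, and the sketch assumes zero-based, end-exclusive offsets and non-overlapping matches.

```python
MASK = "[PII]"


def redact(text: str, matches: list[dict], mask: str = MASK) -> tuple[str, list[dict]]:
    """Replace each matched span with the mask and compute redacted offsets.

    Assumes matches do not overlap and use end-exclusive offsets.
    """
    pieces = []
    updated = []
    cursor = 0  # position in the original text
    shift = 0   # running difference between redacted and original offsets
    for m in sorted(matches, key=lambda item: item["start"]):
        pieces.append(text[cursor:m["start"]])
        pieces.append(mask)
        redacted_start = m["start"] + shift
        updated.append(
            {**m, "redacted_start": redacted_start, "redacted_end": redacted_start + len(mask)}
        )
        # Each replacement changes the text length by this amount.
        shift += len(mask) - (m["end"] - m["start"])
        cursor = m["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), updated
```

Running this on the detection-mode example text and offsets reproduces the masked string and the `redacted_start` / `redacted_end` values shown above, because every mask is five characters long regardless of the length of the text it replaces.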

Note

The pii_matches and pii_match_type_counts fields are only present when at least one PII match is found. If no PII is detected, the output contains only the standard fields (for example, markdown, json).
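
Because both fields can be absent, client code should not index them directly. A minimal Python sketch of a defensive reader follows; the function name `summarize_pii` and the shape of its return value are assumptions made for illustration.

```python
def summarize_pii(result: dict) -> dict:
    """Read the optional PII fields from a results object without raising KeyError."""
    matches = result.get("pii_matches", [])
    counts = result.get("pii_match_type_counts", {})
    return {"has_pii": bool(matches), "total": len(matches), "counts": counts}
```

With this approach, a result that contains only the standard fields is treated the same as a result with zero matches.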

Limitations

PII detection and redaction accuracy depends on the quality of the extracted text from the source document.
Redaction masks PII in the output; it does not modify the original uploaded file.
