PII Handling Details
Overview
The Data Curation API includes built-in Personally Identifiable Information (PII) processing capabilities to help organizations maintain compliance and data privacy when processing unstructured content.
PII processing supports the following modes:
| Processing Mode | Description |
|---|---|
| Detection | Identifies PII entities in the extracted content and reports them in the output. |
| Redaction | Replaces each detected PII occurrence in the output content with a mask. |
When redaction is enabled, masking is performed before any enabled chunking or embedding steps.
Supported PII Types
The Data Curation API supports the following categories of PII detection:
| Detection Type | Description |
|---|---|
| Pattern-Based | Uses deterministic rules and regular expressions. |
| Named Entity Recognition (NER) | Uses a machine-learning model to identify entities in text. |
These categories are toggled independently (see Enabling PII Processing).
Pattern-Based PII Types
The following types are detected using deterministic pattern matching and are included whenever PII processing is enabled.
| PII Type | Description |
|---|---|
| Addresses | Street addresses including a street number; street name; and a street designator, such as Street, Avenue, Boulevard, Lane, Drive, Court, Plaza, Terrace, Place, Trail. Optional unit identifiers, such as Apt, Suite, Unit, Floor, and Building, are also recognized. Example: 123 Main Street Apt 4B. |
| Email Addresses | Standard email addresses in user@domain.tld format. Supports alphanumeric characters, dots, hyphens, underscores, and plus signs in the local part. |
| Phone Numbers | US phone numbers in 10-digit formats with an optional country code prefix (+1). Recognized separators include hyphens, dots, spaces, and parentheses around the area code. Examples: 555-123-4567, (555) 123.4567, +1 5551234567. |
| Social Security Numbers (SSN) | US Social Security Numbers in 9-digit format. Supports hyphen-separated (for example, 123-45-6789), space-separated (for example, 123 45 6789), and consecutive digit (for example, 123456789) formats. Separators must be consistent within a single number. |
| Credit Card Numbers | Payment card numbers in 13-, 15-, 16-, or 19-digit formats. Supports Visa, Mastercard, American Express, Discover, JCB, Diners Club, and UnionPay card shapes. Recognized formats include consecutive digits (for example, 4111111111111111), hyphen-separated groups (for example, 4111-1111-1111-1111), space-separated groups (for example, 4111 1111 1111 1111), and the 4-6-5 grouping used by American Express (for example, 3782-822463-10005). |
| CVV Numbers | Card verification codes (3–4 digits) detected when preceded by a keyword, such as CVV, CVC, CSC, CID, security code, or card code. Also recognized in JSON fields (for example, "cvv": "123") and query string parameters (for example, ?cvv=123). |
| Credit Card Issuers | Explicit mentions of card network names: Visa, MasterCard, American Express, Amex, Discover, JCB, Diners Club, UnionPay, and Maestro. |
| US Bank Account Numbers | Numeric sequences of 8, 10, 11, 12, 14, or 17 consecutive digits surrounded by whitespace or string boundaries. Digit lengths that overlap with SSNs (9) or credit cards (13, 15, 16, 19) are excluded to prevent conflicts. |
| IBAN Codes | International Bank Account Numbers beginning with a two-letter country code followed by two check digits and up to 30 alphanumeric characters (for example, GB29NWBK60161331926819). |
| Passwords / Secrets | Passwords, API keys, tokens, and secrets detected through multiple methods: key-value pairs with keywords (for example, password=, api_key:, or token=), JSON fields containing credential keywords, URLs with embedded credentials (for example, ://user:pass@host), JWT tokens (for example, eyJ...), Base64-encoded tokens, and high-entropy strings that exhibit characteristics of generated secrets. |
| IP Addresses | IPv4 addresses in dotted-decimal notation (for example, 192.168.1.1). Each octet is matched as 1–3 digits separated by dots. |
| URLs | Web addresses beginning with http://, https://, or www. followed by a domain name with a valid TLD. |
| Zip Codes | US ZIP codes in 5-digit (for example, 12345) or ZIP+4 (for example, 12345-6789) format. Codes with prefixes that are unassigned in the US postal system are excluded to reduce false positives. |
| Date of Birth | Dates in MM/DD/YYYY or MM-DD-YYYY format, detected only when preceded by a contextual keyword, such as birth, born, dob, date of birth, or birthdate. |
| Gender Information | Gender identity terms, including but not limited to: transgender, cisgender, non-binary, genderqueer, genderfluid, two-spirit, pangender, and related compound forms. Simple terms like "male" or "female" in isolation are not matched. |
| Passport IDs | US passport numbers matching the format of one letter followed by 7 digits (for example, A1234567). The letter range excludes certain characters (Q, X, Z) per US passport conventions. |
| Medical License IDs | Medical license identifiers in the format of one letter followed by 5–7 digits (for example, D123456) or two digits, a hyphen, and five digits (for example, 12-34567). |
| Prefixes and Titles | Personal titles and honorifics, including common forms (Mr., Mrs., Ms., Dr., Prof., Rev.), military ranks (Capt., Col., Gen., Lt., Maj., Sgt.), nobility titles (Baron, Duke, Earl, Marquis), religious titles (Rabbi, Imam, Cardinal, Bishop, Archbishop, Pope), and civic titles (President, Chancellor, Dean, Judge, Justice, Ambassador). |
| Crypto Wallet Addresses | Ethereum addresses consisting of the 0x prefix followed by 40 hexadecimal characters, and general cryptocurrency addresses starting with bc1, 1, or 3 followed by 25–39 Base58 characters. |
| Bitcoin Addresses | Bitcoin-specific addresses in Legacy format (starting with 1 or 3, 25–34 Base58 characters) and Bech32/SegWit format (starting with bc1, 39–59 lowercase alphanumeric characters). |
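To illustrate the kind of deterministic matching described in the table above, the following Python sketch implements two of the rules: the SSN rule that requires consistent separators, and the bank account rule that excludes digit lengths reserved for SSNs and card numbers. The regular expression and the names SSN_RE, find_ssns, and is_bank_account_candidate are assumptions made for illustration; they are not the patterns the Data Curation API actually uses.

```python
import re

# Hypothetical pattern, not the API's own. The backreference \1 forces the
# second separator to equal the first (hyphen, space, or nothing).
SSN_RE = re.compile(r"(?<!\d)\d{3}([- ]?)\d{2}\1\d{4}(?!\d)")

# Digit lengths listed for US bank account numbers in the table above.
BANK_ACCOUNT_LENGTHS = {8, 10, 11, 12, 14, 17}


def find_ssns(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of SSN-shaped substrings in text."""
    return [m.span() for m in SSN_RE.finditer(text)]


def is_bank_account_candidate(token: str) -> bool:
    """True if a whitespace-delimited token is all ASCII digits of an allowed length."""
    return token.isascii() and token.isdigit() and len(token) in BANK_ACCOUNT_LENGTHS
```

With this sketch, 123-45-6789, 123 45 6789, and 123456789 all match, while the mixed-separator form 123-45 6789 does not.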
Named Entity Recognition (NER) Types
The following types are detected using a Named Entity Recognition (NER) model. NER-based detection is controlled separately by the entity_redaction option and is disabled by default.
| PII Type | Description |
|---|---|
| Person Names | Full or partial names of individuals |
| Organization Names | Names of companies, agencies, or institutions |
| Locations | Geographic locations, such as cities, states, or countries |
| Affiliations | National, religious, or political group identifiers |
Because NER-based detection and redaction are less deterministic than pattern-based methods, they may yield false positives or other unexpected results. Review NER output carefully before relying on it in production workflows.
Enabling PII Processing
PII processing is controlled by the pii field in the JSON request body sent to the POST /presign endpoint. If the pii field is omitted from the request, it defaults to false and no PII processing occurs.
For the full request schema and available options, see the Endpoints reference for the /presign endpoint.
Output Structure
When PII processing is enabled and matches are found, the results JSON returned from the pipeline includes two additional top-level fields: pii_matches and pii_match_type_counts. If no PII is detected, these fields are omitted from the output.
pii_matches
An array of match objects describing each detected PII occurrence. Each object contains the following fields:
| Field | Type | Description |
|---|---|---|
| start | integer | Start character offset of the match in the original text. |
| end | integer | End character offset of the match in the original text. |
| match_type | string | The PII category that was detected (for example, SSN, EMAIL, PERSON). |
| redacted_start | integer or null | Start character offset of the mask in the redacted text. Set to null in detection mode. |
| redacted_end | integer or null | End character offset of the mask in the redacted text. Set to null in detection mode. |
pii_match_type_counts
A summary object where each key is a match_type string and the value is the number of times that type was detected. This provides a quick overview without iterating through the full pii_matches array.
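The relationship between the two fields can be sketched in Python: tallying match_type across pii_matches should reproduce pii_match_type_counts. The function name count_match_types is hypothetical, and the sample data is taken from the examples in this section.

```python
from collections import Counter


def count_match_types(pii_matches: list[dict]) -> dict[str, int]:
    """Tally match_type values across a pii_matches array."""
    return dict(Counter(m["match_type"] for m in pii_matches))


# Sample matches, reduced to the fields this sketch needs.
matches = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
counts = count_match_types(matches)
```

A check like this can serve as a sanity test when post-processing results, for example after filtering out match types you do not care about.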
Detection Mode Output
In detection mode, the markdown.output field contains the original, unmodified text. The pii_matches array reports where PII was found, but no masking is applied.
```json
{
  "markdown": {
    "output": "Contact John Smith at john.smith@example.com or 555-123-4567."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": null,
      "redacted_end": null
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}
```
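In the example above, the offsets behave as zero-based positions with an exclusive end, which lines up with Python string slicing. The following sketch recovers the matched text from a detection-mode result under that assumption. The helper name extract_matches is hypothetical, and whether offsets count Unicode code points (as Python does) for non-ASCII text is not confirmed here.

```python
def extract_matches(text: str, pii_matches: list[dict]) -> list[tuple[str, str]]:
    """Return (match_type, matched_text) pairs by slicing text[start:end]."""
    return [(m["match_type"], text[m["start"]:m["end"]]) for m in pii_matches]


# Values copied from the detection mode example.
text = "Contact John Smith at john.smith@example.com or 555-123-4567."
matches = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
found = extract_matches(text, matches)
# found == [("PERSON", "John Smith"),
#           ("EMAIL", "john.smith@example.com"),
#           ("PHONE", "555-123-4567")]
```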
Redaction Mode Output
In redaction mode, each detected PII occurrence in markdown.output is replaced with the mask [PII]. The pii_matches array includes redacted_start and redacted_end fields indicating where each mask appears in the masked text.
Masking is performed prior to any enabled chunking or embedding steps.
```json
{
  "markdown": {
    "output": "Contact [PII] at [PII] or [PII]."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": 8,
      "redacted_end": 13
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": 17,
      "redacted_end": 22
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": 26,
      "redacted_end": 31
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}
```
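The relationship between the original offsets and the redacted offsets in this example can be reproduced with a short Python sketch: each span is replaced by the fixed-length mask, so later offsets shift by the accumulated length difference. The function apply_mask is a hypothetical illustration that assumes non-overlapping matches and zero-based, end-exclusive offsets; it is not the service's implementation.

```python
MASK = "[PII]"


def apply_mask(text: str, pii_matches: list[dict], mask: str = MASK):
    """Replace each [start, end) span with mask and compute the mask's offsets."""
    pieces = []   # fragments of the masked text, in order
    redacted = []  # copies of the matches with redacted offsets filled in
    cursor = 0    # position reached in the original text
    out_len = 0   # length of the masked text built so far
    for m in sorted(pii_matches, key=lambda m: m["start"]):
        unchanged = text[cursor:m["start"]]
        pieces.append(unchanged)
        out_len += len(unchanged)
        pieces.append(mask)
        redacted.append(
            {**m, "redacted_start": out_len, "redacted_end": out_len + len(mask)}
        )
        out_len += len(mask)
        cursor = m["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), redacted


text = "Contact John Smith at john.smith@example.com or 555-123-4567."
matches = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
masked_text, masked_matches = apply_mask(text, matches)
# masked_text == "Contact [PII] at [PII] or [PII]."
```

Running the sketch on the detection-mode example yields the same redacted_start and redacted_end values shown above (8 to 13, 17 to 22, and 26 to 31).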
The pii_matches and pii_match_type_counts fields are only present when at least one PII match is found. If no PII is detected, the output contains only the standard fields (for example, markdown, json).
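Because both fields can be absent, client code should not index them directly. A minimal Python sketch of defensive handling follows; the function summarize_pii and the shape of its return value are hypothetical.

```python
def summarize_pii(result: dict) -> dict:
    """Summarize PII findings, treating missing fields as 'no PII detected'."""
    matches = result.get("pii_matches", [])
    counts = result.get("pii_match_type_counts", {})
    return {"has_pii": bool(matches), "total": len(matches), "by_type": counts}


# A result with no PII fields, as returned when nothing is detected.
clean = summarize_pii({"markdown": {"output": "No sensitive data here."}})
# clean == {"has_pii": False, "total": 0, "by_type": {}}
```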
Limitations
- PII detection and redaction accuracy depends on the quality of the extracted text from the source document.
- Redaction masks PII in the output; it does not modify the original uploaded file.