PII Handling Details

Overview

The Data Curation API includes built-in Personally Identifiable Information (PII) processing capabilities to help organizations maintain compliance and data privacy when processing unstructured content.

PII processing supports the following modes:

| Processing Mode | Description |
| --- | --- |
| Detection | Identifies PII entities in the extracted content and reports them in the output. |
| Redaction | Masks detected PII and removes it from the output content. |

When redaction is enabled, masking is performed before any enabled chunking or embedding steps.

Supported PII Types

The Data Curation API supports the following categories of PII detection:

| Detection Type | Description |
| --- | --- |
| Pattern-Based | Uses deterministic rules and regular expressions. |
| Named Entity Recognition (NER) | Uses a machine-learning model to identify entities in text. |

These categories are toggled independently (see Enabling PII Processing).

Pattern-Based PII Types

The following types are detected using deterministic pattern matching and are included whenever PII processing is enabled.

| PII Type | Description |
| --- | --- |
| Addresses | Street addresses that include a street number, a street name, and a street designator, such as Street, Avenue, Boulevard, Lane, Drive, Court, Plaza, Terrace, Place, or Trail. Optional unit identifiers, such as Apt, Suite, Unit, Floor, and Building, are also recognized. Example: 123 Main Street Apt 4B. |
| Email Addresses | Standard email addresses in user@domain.tld format. Supports alphanumeric characters, dots, hyphens, underscores, and plus signs in the local part. |
| Phone Numbers | US phone numbers in 10-digit formats with an optional country code prefix (+1). Recognized separators include hyphens, dots, spaces, and parentheses around the area code. Examples: 555-123-4567, (555) 123.4567, +1 5551234567. |
| Social Security Numbers (SSN) | US Social Security Numbers in 9-digit format. Supports hyphen-separated (for example, 123-45-6789), space-separated (for example, 123 45 6789), and consecutive-digit (for example, 123456789) formats. Separators must be consistent within a single number. |
| Credit Card Numbers | Payment card numbers in 13-, 15-, 16-, or 19-digit formats. Supports Visa, Mastercard, American Express, Discover, JCB, Diners Club, and UnionPay card shapes. Recognized formats include consecutive digits (for example, 4111111111111111), hyphen-separated groups (for example, 4111-1111-1111-1111), space-separated groups (for example, 4111 1111 1111 1111), and the 4-6-5 grouping used by American Express (for example, 3782-822463-10005). |
| CVV Numbers | Card verification codes (3–4 digits) detected when preceded by a keyword, such as CVV, CVC, CSC, CID, security code, or card code. Also recognized in JSON fields (for example, "cvv": "123") and query string parameters (for example, ?cvv=123). |
| Credit Card Issuers | Explicit mentions of card network names: Visa, MasterCard, American Express, Amex, Discover, JCB, Diners Club, UnionPay, and Maestro. |
| US Bank Account Numbers | Numeric sequences of 8, 10, 11, 12, 14, or 17 consecutive digits surrounded by whitespace or string boundaries. Digit lengths that overlap with SSNs (9) or credit cards (13, 15, 16, 19) are excluded to prevent conflicts. |
| IBAN Codes | International Bank Account Numbers beginning with a two-letter country code followed by two check digits and up to 30 alphanumeric characters (for example, GB29NWBK60161331926819). |
| Passwords / Secrets | Passwords, API keys, tokens, and secrets detected through multiple methods: key-value pairs with keywords (for example, password=, api_key:, or token=), JSON fields containing credential keywords, URLs with embedded credentials (for example, ://user:pass@host), JWT tokens (for example, eyJ...), Base64-encoded tokens, and high-entropy strings that exhibit characteristics of generated secrets. |
| IP Addresses | IPv4 addresses in dotted-decimal notation (for example, 192.168.1.1). Each octet is matched as 1–3 digits separated by dots. |
| URLs | Web addresses beginning with http://, https://, or www. followed by a domain name with a valid TLD. |
| Zip Codes | US ZIP codes in 5-digit (for example, 12345) or ZIP+4 (for example, 12345-6789) format. Codes with prefixes that are unassigned in the US postal system are excluded to reduce false positives. |
| Date of Birth | Dates in MM/DD/YYYY or MM-DD-YYYY format, detected only when preceded by a contextual keyword, such as birth, born, dob, date of birth, or birthdate. |
| Gender Information | Gender identity terms, including but not limited to: transgender, cisgender, non-binary, genderqueer, genderfluid, two-spirit, pangender, and related compound forms. Simple terms like "male" or "female" in isolation are not matched. |
| Passport IDs | US passport numbers matching the format of one letter followed by 7 digits (for example, A1234567). The letter range excludes certain characters (Q, X, Z) per US passport conventions. |
| Medical License IDs | Medical license identifiers in the format of one letter followed by 5–7 digits (for example, D123456) or two digits, a hyphen, and five digits (for example, 12-34567). |
| Prefixes and Titles | Personal titles and honorifics, including common forms (Mr., Mrs., Ms., Dr., Prof., Rev.), military ranks (Capt., Col., Gen., Lt., Maj., Sgt.), nobility titles (Baron, Duke, Earl, Marquis), religious titles (Rabbi, Imam, Cardinal, Bishop, Archbishop, Pope), and civic titles (President, Chancellor, Dean, Judge, Justice, Ambassador). |
| Crypto Wallet Addresses | Ethereum addresses consisting of a 0x prefix followed by 40 hexadecimal characters, and general cryptocurrency addresses starting with bc1, 1, or 3 followed by 25–39 Base58 characters. |
| Bitcoin Addresses | Bitcoin-specific addresses in Legacy format (starting with 1 or 3, 25–34 Base58 characters) and Bech32/SegWit format (starting with bc1, 39–59 lowercase alphanumeric characters). |
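
To make the "separators must be consistent" rule for Social Security Numbers concrete, here is a minimal Python sketch. The regular expression and the names SSN_SKETCH and find_ssn_like are illustrative assumptions, not the patterns the Data Curation API actually uses; the sketch only shows how a backreference can enforce one separator style within a single number.

```python
import re

# Hypothetical pattern, for illustration only (not the API's real rule).
# Group 2 captures the first separator (hyphen, space, or nothing) and the
# backreference \2 requires the second separator to be identical.
SSN_SKETCH = re.compile(r"(?<!\d)(\d{3})([- ]?)(\d{2})\2(\d{4})(?!\d)")


def find_ssn_like(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for each SSN-shaped span in text."""
    return [(m.start(), m.end(), m.group(0)) for m in SSN_SKETCH.finditer(text)]


samples = ["123-45-6789", "123 45 6789", "123456789", "123-45 6789"]
results = {sample: find_ssn_like(sample) for sample in samples}

for sample, found in results.items():
    print(f"{sample!r}: {found}")
```

The first three samples match in full, while the mixed-separator sample "123-45 6789" produces no match. The lookarounds keep the pattern from firing inside longer digit runs.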

Named Entity Recognition (NER) Types

The following types are detected using a Named Entity Recognition (NER) model. NER-based detection is controlled separately by the entity_redaction option and is disabled by default.

| PII Type | Description |
| --- | --- |
| Person Names | Full or partial names of individuals |
| Organization Names | Names of companies, agencies, or institutions |
| Locations | Geographic locations, such as cities, states, or countries |
| Affiliations | National, religious, or political group identifiers |
Note: Because NER-based detection and redaction are not as deterministic as the pattern-based methods, they may yield false positives or unexpected results. Review NER output carefully before relying on it in production workflows.

Enabling PII Processing

PII processing is controlled by the pii field in the JSON request body sent to the POST /presign endpoint. If the pii field is omitted from the request, it defaults to false and no PII processing occurs.

For the full request schema and available options, see the Endpoints reference for the /presign endpoint.
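
As a rough illustration, the Python sketch below builds a request body with the pii field set. It assumes the field accepts a boolean, which the stated default of false suggests; the helper name build_presign_body is hypothetical, and any other fields required by /presign are left to the caller because they are not described here.

```python
import json


def build_presign_body(base_fields: dict, enable_pii: bool = False) -> str:
    """Serialize a request body, setting only the documented `pii` field.

    `base_fields` stands in for whatever other fields the /presign schema
    requires; this sketch does not assume any of them.
    """
    body = dict(base_fields)  # copy so the caller's dict is not mutated
    body["pii"] = enable_pii  # assumption: a boolean toggle
    return json.dumps(body)


payload = build_presign_body({}, enable_pii=True)
print(payload)
```

The resulting string would be sent as the body of the POST /presign request; consult the Endpoints reference for the remaining fields and for how the processing mode is selected.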

Output Structure

When PII processing is enabled and matches are found, the results JSON returned from the pipeline includes two additional top-level fields: pii_matches and pii_match_type_counts. If no PII is detected, these fields are omitted from the output.

pii_matches

An array of match objects describing each detected PII occurrence. Each object contains the following fields:

| Field | Type | Description |
| --- | --- | --- |
| start | integer | Start character offset of the match in the original text. |
| end | integer | End character offset of the match in the original text. In the examples in this document, the end offset is exclusive. |
| match_type | string | The PII category that was detected (for example, SSN, EMAIL, PERSON). |
| redacted_start | integer or null | Start character offset of the mask in the redacted text. Populated only in redaction mode; null in detection mode. |
| redacted_end | integer or null | End character offset of the mask in the redacted text. Populated only in redaction mode; null in detection mode. |

pii_match_type_counts

A summary object where each key is a match_type string and the value is the number of times that type was detected. This provides a quick overview without iterating through the full pii_matches array.
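
The relationship between the two fields can be sketched in a few lines of Python. The function name count_match_types is an illustrative assumption; the sample matches are taken from the example output in this document.

```python
from collections import Counter


def count_match_types(pii_matches: list[dict]) -> dict[str, int]:
    """Tally how many times each match_type appears in pii_matches."""
    return dict(Counter(match["match_type"] for match in pii_matches))


matches = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
counts = count_match_types(matches)
print(counts)
```

Recomputing the counts this way can serve as a sanity check that a stored pii_matches array and its pii_match_type_counts summary agree.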

Detection Mode Output

In detection mode, the markdown.output field contains the original, unmodified text. The pii_matches array reports where PII was found, but no masking is applied.

{
  "markdown": {
    "output": "Contact John Smith at john.smith@example.com or 555-123-4567."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": null,
      "redacted_end": null
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": null,
      "redacted_end": null
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}
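
The offsets in the example above can be used to slice the matched text out of markdown.output. The following Python sketch assumes, as the example values imply, that start is inclusive and end is exclusive; the function name matched_spans is hypothetical.

```python
def matched_spans(text: str, pii_matches: list[dict]) -> list[tuple[str, str]]:
    """Return (match_type, matched_text) pairs using the reported offsets."""
    return [
        (match["match_type"], text[match["start"]:match["end"]])
        for match in pii_matches
    ]


text = "Contact John Smith at john.smith@example.com or 555-123-4567."
pii_matches = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
spans = matched_spans(text, pii_matches)
for match_type, value in spans:
    print(f"{match_type}: {value}")
```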

Redaction Mode Output

In redaction mode, each detected PII occurrence in markdown.output is replaced with the mask [PII]. The pii_matches array includes redacted_start and redacted_end fields indicating where each mask appears in the masked text.

Masking is performed prior to any enabled chunking or embedding steps.

{
  "markdown": {
    "output": "Contact [PII] at [PII] or [PII]."
  },
  "pii_matches": [
    {
      "start": 8,
      "end": 18,
      "match_type": "PERSON",
      "redacted_start": 8,
      "redacted_end": 13
    },
    {
      "start": 22,
      "end": 44,
      "match_type": "EMAIL",
      "redacted_start": 17,
      "redacted_end": 22
    },
    {
      "start": 48,
      "end": 60,
      "match_type": "PHONE",
      "redacted_start": 26,
      "redacted_end": 31
    }
  ],
  "pii_match_type_counts": {
    "PERSON": 1,
    "EMAIL": 1,
    "PHONE": 1
  }
}
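
To show how the redacted offsets relate to the original ones, the Python sketch below replaces each matched span with the [PII] mask and tracks where each mask lands. It is an illustration of the offset arithmetic, not the API's implementation; the function name apply_masks is hypothetical, and the sketch assumes matches do not overlap.

```python
MASK = "[PII]"


def apply_masks(text: str, pii_matches: list[dict]) -> tuple[str, list[dict]]:
    """Mask each match and return (redacted_text, matches_with_redacted_offsets).

    Assumes matches are non-overlapping and that end offsets are exclusive.
    """
    pieces = []
    enriched = []
    cursor = 0   # position in the original text
    out_len = 0  # length of the redacted text built so far
    for match in sorted(pii_matches, key=lambda m: m["start"]):
        kept = text[cursor:match["start"]]  # unmasked text before this match
        pieces.append(kept)
        out_len += len(kept)
        enriched.append({
            **match,
            "redacted_start": out_len,
            "redacted_end": out_len + len(MASK),
        })
        pieces.append(MASK)
        out_len += len(MASK)
        cursor = match["end"]
    pieces.append(text[cursor:])  # trailing text after the last match
    return "".join(pieces), enriched


original = "Contact John Smith at john.smith@example.com or 555-123-4567."
found = [
    {"start": 8, "end": 18, "match_type": "PERSON"},
    {"start": 22, "end": 44, "match_type": "EMAIL"},
    {"start": 48, "end": 60, "match_type": "PHONE"},
]
redacted_text, redacted_matches = apply_masks(original, found)
print(redacted_text)
```

Run on the example text, this reproduces the masked string and the redacted_start and redacted_end values shown in the JSON above.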
Note: The pii_matches and pii_match_type_counts fields are present only when at least one PII match is found. If no PII is detected, the output contains only the standard fields (for example, markdown, json).

Limitations

  • PII detection and redaction accuracy depends on the quality of the extracted text from the source document.
  • Redaction masks PII in the output; it does not modify the original uploaded file.