How to Redact Social Security Numbers from OpenAI API Calls (Python)
If your application sends user-generated text to the OpenAI API, it will eventually send a Social Security number. Support tickets, form fields, chat messages — SSNs show up in production data constantly. This guide shows three ways to catch and redact them before they reach any LLM provider.
The Problem
When you call client.chat.completions.create(), the entire prompt, including any PII embedded in it, is sent to OpenAI's servers. Even if your account has a zero-data-retention arrangement with OpenAI, the data still transits their infrastructure, which may violate HIPAA, CCPA, GDPR, or your internal data classification policies.
from openai import OpenAI

client = OpenAI()

# This sends "John Smith, SSN 123-45-6789" to OpenAI
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": "Summarize this support ticket: Customer John Smith "
                   "(SSN 123-45-6789) called about billing issue #4521."
    }]
)

Approach 1: Regex Pattern Matching
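The fastest approach. It catches SSNs in standard formats (XXX-XX-XXXX, XXX XX XXXX, XXXXXXXXX) and skips ranges that are never issued (area 000, 666, or 900-999; group 00; serial 0000). It misses free-text PII like names or addresses, and because it accepts nine bare digits it can also flag unrelated nine-digit numbers such as order IDs.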
import re

SSN_PATTERN = re.compile(
    r'\b(?!000|666|9\d{2})\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b'
)

def redact_ssn(text: str) -> str:
    return SSN_PATTERN.sub("[SSN_REDACTED]", text)

# Usage
prompt = "Customer SSN is 123-45-6789 and their card is 4111-1111-1111-1111"
safe_prompt = redact_ssn(prompt)
# "Customer SSN is [SSN_REDACTED] and their card is 4111-1111-1111-1111"
# ^ SSN caught, but credit card missed

Limitation: Regex only catches SSNs. You'd need separate patterns for credit cards, phone numbers, email addresses, driver's licenses, passport numbers, IBAN codes, and every other PII type. Maintaining dozens of regex patterns is error-prone and misses context-dependent PII like names.
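To make the maintenance cost concrete, here is a minimal sketch of what "separate patterns" looks like once you go beyond SSNs. The `PII_PATTERNS` registry, the `redact_patterns` helper, and the email and card patterns are illustrative assumptions, not exhaustive matchers; only the SSN pattern comes from the example above.

```python
import re

# Hypothetical registry of patterns. Longer patterns (cards) run first so a
# shorter pattern cannot partially consume them.
PII_PATTERNS = {
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "SSN": re.compile(
        r"\b(?!000|666|9\d{2})\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b"
    ),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def redact_patterns(text: str) -> str:
    """Replace every match of each registered pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(redact_patterns(
    "SSN 123-45-6789, card 4111-1111-1111-1111, mail jane@example.com"
))
```

Every new entity type means another entry, another set of edge cases, and another ordering concern, which is the maintenance burden described above. Names and addresses still pass through untouched.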
Approach 2: NLP-Based Detection (Presidio)
Microsoft Presidio uses NLP models + pattern matching to detect 30+ entity types including SSNs, credit cards, names, addresses, and more. Much broader coverage than regex.
# Requires the presidio-analyzer and presidio-anonymizer packages plus a
# spaCy English model (en_core_web_lg by default).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Build the engines once at startup; loading the NLP model is slow.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=[
            "US_SSN", "CREDIT_CARD", "PHONE_NUMBER",
            "EMAIL_ADDRESS", "PERSON", "US_DRIVER_LICENSE",
        ],
    )
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# Usage
prompt = "Customer John Smith (SSN 123-45-6789) called about billing."
safe = redact_pii(prompt)
# Expected shape: "Customer <PERSON> (SSN <US_SSN>) called about billing."
# Note: Presidio's built-in US_SSN recognizer may reject well-known sample
# numbers such as 123-45-6789; if <US_SSN> does not appear, test with a
# realistic value instead.

Trade-off: Presidio adds ~30-200ms latency per request (depending on text length and model). You also need to deploy and maintain the NLP models. Works well for batch processing; can be tight for real-time chat.
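Latency figures like the range quoted above depend heavily on text length, model, and hardware, so it is worth measuring on your own traffic. The following is a minimal sketch using only the standard library; `measure_latency_ms` is a hypothetical helper, and it accepts any `str -> str` redaction function, such as `redact_pii` or `redact_ssn`.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(
    redact: Callable[[str], str], sample: str, runs: int = 50
) -> dict:
    """Call redact(sample) `runs` times and summarize latency in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        redact(sample)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    # Nearest-rank style 95th percentile, clamped to the last element.
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "max_ms": timings[-1],
    }


# Example with a trivial stand-in redactor; swap in your real function.
print(measure_latency_ms(lambda s: s.replace("123-45-6789", "[SSN]"), "SSN 123-45-6789"))
```

Run it with representative inputs (short chat messages and long tickets alike) and look at the p95 rather than the median when deciding whether the overhead fits a real-time path.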
Approach 3: Gateway-Level Redaction (Two-Line Change)
Instead of adding redaction code to every API call, route requests through an AI gateway that automatically detects and redacts PII before forwarding to OpenAI. Changing two lines of client setup gives you detection for 30+ entity types, including SSNs, credit cards, names, and addresses.
from openai import OpenAI

# Change these two lines; everything else stays the same
client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="aisg_your_key_here",
)

# SSNs, credit cards, names, addresses: all auto-redacted
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": "Customer John Smith (SSN 123-45-6789) called about "
                   "billing issue #4521. Card ending 4111-1111-1111-1111."
    }]
)
# What OpenAI sees: "Customer [PERSON] (SSN [US_SSN]) called about
# billing issue #4521. Card ending [CREDIT_CARD]."

Which Approach Should You Use?
| Criteria | Regex | Presidio NLP | Gateway |
|---|---|---|---|
| Entity coverage | SSN only (per pattern) | 30+ entity types | 30+ entity types |
| Setup time | 5 minutes | 30-60 minutes | 2 minutes |
| Added latency | < 1ms | 30-200ms | < 50ms |
| Catches names/addresses | No | Yes | Yes |
| Code changes required | Per API call | Per API call | 2 lines total |
| Maintenance | High (pattern updates) | Medium (model updates) | None |
| Works with all providers | Manual per provider | Manual per provider | Automatic |
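The "per API call" cost listed for the regex and Presidio approaches can be reduced by centralizing redaction in one helper that every call goes through. This is a minimal sketch under that assumption; `redact_messages` is a hypothetical name, and it only handles messages whose `content` is a plain string.

```python
from typing import Callable


def redact_messages(
    messages: list[dict], redact: Callable[[str], str]
) -> list[dict]:
    """Return a copy of `messages` with each string `content` passed through `redact`."""
    safe = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            # Build a new dict so the caller's original message is not mutated.
            message = {**message, "content": redact(content)}
        # Non-string content (for example, a list of multimodal parts) is passed
        # through unchanged here and would need its own handling.
        safe.append(message)
    return safe
```

With this in place, a call site becomes `client.chat.completions.create(model="gpt-4.1", messages=redact_messages(messages, redact_ssn))`, and swapping `redact_ssn` for `redact_pii` changes the detection backend without touching the call sites.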
Beyond SSNs: Other PII You Should Redact
SSNs are the most obvious, but production LLM traffic contains much more PII:
- Credit/debit card numbers — Luhn-validated, all major networks
- Phone numbers — US, UK, international formats
- Email addresses — including corporate domains
- Person names — NLP-based, handles "John Smith" and "Dr. Jane Doe"
- Physical addresses — street, city, state, ZIP
- Driver's license numbers — state-specific formats
- Medical record numbers — HIPAA-relevant
- IBAN / bank account numbers — EU banking identifiers
- Passport numbers — multi-country formats
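The list above mentions Luhn validation for card numbers. As a rough sketch of what that check involves (how any particular tool implements it is not shown here, and `luhn_valid` is a hypothetical helper), the Luhn checksum filters out most random digit runs that merely look like card numbers:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    # Assumed length bounds for payment card numbers.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

Running a checksum after a pattern match reduces false positives, such as 16-digit order or tracking numbers being redacted as cards.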
Stop writing PII regex patterns
AI Security Gateway auto-redacts 30+ entity types from every API call. Two lines of code, under 50ms latency. Works with OpenAI, Anthropic, Google, Meta, and 8+ more providers.