How to Redact Social Security Numbers from OpenAI API Calls (Python)

May 29, 2026·6 min read·security

If your application sends user-generated text to the OpenAI API, it will eventually send a Social Security number. Support tickets, form fields, chat messages — SSNs show up in production data constantly. This guide shows three ways to catch and redact them before they reach any LLM provider.

The Problem

When you call client.chat.completions.create(), the entire prompt, including any PII embedded in it, is sent to OpenAI's servers. Even if your account has a zero-data-retention arrangement with OpenAI, the data still transits their infrastructure, which may violate HIPAA, CCPA, GDPR, or your internal data classification policies.

The risk: raw PII in prompts
from openai import OpenAI
client = OpenAI()

# This sends "John Smith, SSN 123-45-6789" to OpenAI
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": "Summarize this support ticket: Customer John Smith "
                   "(SSN 123-45-6789) called about billing issue #4521."
    }]
)

Approach 1: Regex Pattern Matching

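The fastest approach. It catches SSNs in the standard formats (XXX-XX-XXXX, XXX XX XXXX, XXXXXXXXX) and skips numbers the SSA never issues (area 000, 666, or 900-999; group 00; serial 0000). It works for known patterns but misses free-text PII like names or addresses, and the unformatted nine-digit form can also match unrelated values such as order or account numbers.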

Regex-based SSN redaction
import re

SSN_PATTERN = re.compile(
    r'\b(?!000|666|9\d{2})\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b'
)

def redact_ssn(text: str) -> str:
    return SSN_PATTERN.sub("[SSN_REDACTED]", text)

# Usage
prompt = "Customer SSN is 123-45-6789 and their card is 4111-1111-1111-1111"
safe_prompt = redact_ssn(prompt)
# "Customer SSN is [SSN_REDACTED] and their card is 4111-1111-1111-1111"
#  ^ SSN caught, but credit card missed

Limitation: Regex only catches SSNs. You'd need separate patterns for credit cards, phone numbers, email addresses, driver's licenses, passport numbers, IBAN codes, and every other PII type. Maintaining dozens of regex patterns is error-prone and misses context-dependent PII like names.
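To make the maintenance cost concrete, here is a minimal sketch of what the regex approach looks like once a second and third entity type are added. The EMAIL and US_PHONE patterns and the redact_all helper are illustrative assumptions, not part of the original example, and each pattern is deliberately simple enough to miss edge cases.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
# Each one is simplistic and will miss real-world variants.
PII_PATTERNS = {
    "SSN": re.compile(
        r"\b(?!000|666|9\d{2})\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b"
    ),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_PHONE": re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}(?!\d)"),
}

def redact_all(text: str) -> str:
    """Apply every pattern in turn, replacing matches with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text

sample = "Reach Jane at jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
print(redact_all(sample))
# Reach Jane at [EMAIL_REDACTED] or [US_PHONE_REDACTED]; SSN [SSN_REDACTED].
```

The name "Jane" passes through untouched, which is the context-dependent gap that no amount of additional patterns closes.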

Approach 2: NLP-Based Detection (Presidio)

Microsoft Presidio uses NLP models + pattern matching to detect 30+ entity types including SSNs, credit cards, names, addresses, and more. Much broader coverage than regex.

Presidio-based PII redaction
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

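# Build the engines once at startup; loading the NLP model is slow.
# The default AnalyzerEngine expects a spaCy English model to be installed.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()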

def redact_pii(text: str) -> str:
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=[
            "US_SSN", "CREDIT_CARD", "PHONE_NUMBER",
            "EMAIL_ADDRESS", "PERSON", "US_DRIVER_LICENSE",
        ],
    )
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

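# Usage
prompt = "Customer John Smith (SSN 123-45-6789) called about billing."
safe = redact_pii(prompt)
# "Customer <PERSON> (SSN <US_SSN>) called about billing."
# Caveat: Presidio's built-in US_SSN recognizer may discard well-known
# placeholder values such as 123-45-6789 as invalid, so verify this
# output with realistic synthetic test data before relying on it.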

Trade-off: Presidio adds ~30-200ms latency per request (depending on text length and model). You also need to deploy and maintain the NLP models. Works well for batch processing; can be tight for real-time chat.
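Whether that latency is acceptable depends on your own text lengths and hardware, so it is worth measuring rather than assuming. The harness below is a rough sketch using only the standard library; measure_latency_ms and the stand-in regex_redactor are hypothetical names introduced here so the example runs without Presidio installed.

```python
import re
import statistics
import time
from typing import Callable

def measure_latency_ms(redactor: Callable[[str], str], text: str, runs: int = 200) -> dict:
    """Time repeated calls to `redactor`; return median and p95 latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        redactor(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"median_ms": statistics.median(samples), "p95_ms": samples[p95_index]}

# Stand-in redactor (assumption) so the sketch is self-contained.
# Swap in redact_pii from the Presidio example to measure the real cost.
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def regex_redactor(text: str) -> str:
    return _SSN.sub("[SSN_REDACTED]", text)

stats = measure_latency_ms(regex_redactor, "Customer SSN is 123-45-6789. " * 20)
print(f"median={stats['median_ms']:.3f} ms  p95={stats['p95_ms']:.3f} ms")
```

Run it against text that resembles your production prompts; the p95 figure is usually the one that matters for real-time chat.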

Approach 3: Gateway-Level Redaction (Two-Line Change)

Instead of adding redaction code to every API call, route requests through an AI gateway that detects and redacts PII before forwarding them to OpenAI. You change two lines of client configuration and get coverage for 30+ entity types, including SSNs, credit cards, names, and addresses.

Gateway-level redaction with AI Security Gateway
from openai import OpenAI

# Change these two lines — everything else stays the same
client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="aisg_your_key_here",
)

# SSNs, credit cards, names, addresses — all auto-redacted
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": "Customer John Smith (SSN 123-45-6789) called about "
                   "billing issue #4521. Card ending 4111-1111-1111-1111."
    }]
)
# What OpenAI sees: "Customer [PERSON] (SSN [US_SSN]) called about
#   billing issue #4521. Card ending [CREDIT_CARD]."

Which Approach Should You Use?

| Criteria | Regex | Presidio NLP | Gateway |
|---|---|---|---|
| Entity coverage | SSN only (per pattern) | 30+ entity types | 30+ entity types |
| Setup time | 5 minutes | 30-60 minutes | 2 minutes |
| Added latency | < 1 ms | 30-200 ms | < 50 ms |
| Catches names/addresses | No | Yes | Yes |
| Code changes required | Per API call | Per API call | 2 lines total |
| Maintenance | High (pattern updates) | Medium (model updates) | None |
| Works with all providers | Manual per provider | Manual per provider | Automatic |
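If you stay with the regex or Presidio approach, the "per API call" cost can be reduced by funnelling every request through one helper. The sketch below is one possible shape, not a prescribed one: redact_messages is a hypothetical helper that takes any redactor function (here the regex redact_ssn from Approach 1) and returns a cleaned copy of the messages list.

```python
import re
from typing import Callable

SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d{2})\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b"
)

def redact_ssn(text: str) -> str:
    return SSN_PATTERN.sub("[SSN_REDACTED]", text)

def redact_messages(messages: list[dict], redactor: Callable[[str], str]) -> list[dict]:
    """Return a copy of `messages` with every string `content` redacted.

    Non-string content (for example multi-part message content) is passed
    through unchanged in this sketch and would need its own handling.
    """
    cleaned = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            # Build a new dict so the caller's original message is not mutated.
            message = {**message, "content": redactor(content)}
        cleaned.append(message)
    return cleaned

messages = [
    {"role": "system", "content": "You summarize support tickets."},
    {"role": "user", "content": "Customer (SSN 123-45-6789) called about billing issue #4521."},
]
safe_messages = redact_messages(messages, redact_ssn)
# Pass safe_messages as messages=... to client.chat.completions.create().
```

Because the redactor is a parameter, the same helper works with redact_pii from the Presidio example.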

Beyond SSNs: Other PII You Should Redact

SSNs are the most obvious, but production LLM traffic contains much more PII:

  • Credit/debit card numbers — Luhn-validated, all major networks
  • Phone numbers — US, UK, international formats
  • Email addresses — including corporate domains
  • Person names — NLP-based, handles "John Smith" and "Dr. Jane Doe"
  • Physical addresses — street, city, state, ZIP
  • Driver's license numbers — state-specific formats
  • Medical record numbers — HIPAA-relevant
  • IBAN / bank account numbers — EU banking identifiers
  • Passport numbers — multi-country formats
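As an example of what "Luhn-validated" means for card numbers, the sketch below finds digit runs of plausible card length and redacts only those that pass the Luhn checksum. The CARD_CANDIDATE pattern, the 13-19 digit range, and the function names are assumptions made for illustration.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

def redact_cards(text: str) -> str:
    """Redact candidates that pass Luhn; leave other long digit runs alone."""
    return CARD_CANDIDATE.sub(
        lambda m: "[CARD_REDACTED]" if luhn_valid(m.group()) else m.group(),
        text,
    )

print(redact_cards("Card 4111-1111-1111-1111 on order 1234567890123456"))
# Card [CARD_REDACTED] on order 1234567890123456
```

The checksum is what keeps a 16-digit order number from being redacted as a card, which a bare digit-count regex cannot do.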

Stop writing PII regex patterns

AI Security Gateway auto-redacts 30+ entity types from every API call. Two lines of code, under 50ms latency. Works with OpenAI, Anthropic, Google, Meta, and 8+ more providers.
