What is semantic caching for LLM APIs?

Semantic caching stores LLM responses keyed by the DLP-cleaned content of the request, together with the model and project. When the same cleaned prompt is sent again, the cached response is returned without calling the LLM provider, so that request incurs no token cost.

Does semantic caching work with PII redaction?

Yes. AI Security Gateway runs full DLP processing (PII redaction, prompt injection detection) before computing the cache key. This means the cache operates on cleaned prompts, and PII is never stored in the cache.

How much can semantic caching save on LLM costs?

Cost savings scale with the share of requests served from the cache. Agentic workloads with fixed system prompts, support chatbots, and batch pipelines commonly exhibit 20-40% prompt duplication. When those duplicated requests are cacheable (non-streaming, no tool calls), LLM spend falls by a roughly similar proportion, with zero code changes.

Does semantic caching add latency?

Cache hits are faster than normal requests — sub-millisecond response time versus hundreds of milliseconds for a provider round-trip. Cache misses add negligible overhead from the cache lookup.

Semantic Prompt Caching

Identical DLP-cleaned prompts are served from a distributed cache at zero LLM cost. Cache hits bypass the provider entirely — sub-millisecond response time, zero tokens consumed, zero dollars spent.

Semantic caching works automatically at the gateway layer. When the same prompt (after PII redaction and normalization) has already been seen for the same model and project, the cached response is returned without a provider call. Enabling it requires no SDK changes, no application code changes, and no configuration.

The Problem

Production AI workloads frequently send identical or near-identical prompts to LLM providers. Common scenarios include:

• Multiple users asking the same question to a support chatbot
• Agent frameworks re-sending the same system prompt with minor variations
• Batch processing pipelines that run the same template across identical inputs
• Development and testing environments replaying the same prompts repeatedly

Typical impact: Agentic workloads, support chatbots, and batch pipelines commonly exhibit 20-40% prompt duplication rates. At scale, this can represent thousands of dollars per month spent regenerating responses the provider has already produced, with no improvement in output quality.

How It Works

DLP processing first

Every request passes through the full DLP pipeline (PII redaction, prompt injection detection) before cache evaluation. The cache key is derived from the cleaned prompt, not the raw input.

Cache key derivation

Cache key derivation accounts for the model, cleaned content, and project context. Cache hits are always project-scoped and never shared across accounts. Requests with different models, different content, or different projects always get independent cache entries.
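
The derivation described above can be sketched as follows. This is a minimal illustration, assuming SHA-256 over a canonical JSON encoding of the three inputs. The function name `derive_cache_key`, the field names, and the choice of hash are assumptions made for the example, not the gateway's documented implementation.

```python
import hashlib
import json


def derive_cache_key(project_id: str, model: str, cleaned_messages: list[dict]) -> str:
    """Hypothetical key derivation: hash of project, model, and DLP-cleaned content."""
    # Canonical JSON (sorted keys, fixed separators) so that equal inputs
    # always serialize to the same bytes.
    payload = json.dumps(
        {"project": project_id, "model": model, "messages": cleaned_messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    # A cryptographic digest is one-way: the prompt cannot be read back from it.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


messages = [{"role": "user", "content": "What is GDPR?"}]
key_a = derive_cache_key("proj_a", "oah/llama-4-maverick", messages)
key_a_again = derive_cache_key("proj_a", "oah/llama-4-maverick", messages)
key_b = derive_cache_key("proj_b", "oah/llama-4-maverick", messages)
```

Here `key_a` and `key_a_again` are equal, while `key_b` differs because the project differs, which is how project-scoped isolation falls out of the key itself.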

Cache lookup

The key is checked against a distributed cache shared across all proxy instances. On a hit, the cached response is returned immediately without contacting any LLM provider.

Cache storage

On a cache miss, the request is forwarded to the provider normally. Successful non-streaming responses that contain no tool calls are stored in the cache with a configurable TTL for future hits.
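
The lookup and storage steps can be sketched as below. This is an illustrative stand-in only: `TTLCache` is an in-memory dictionary rather than a distributed cache, `handle_request` and `fake_provider` are hypothetical names, and the cacheability checks (streaming, errors, tool calls) are omitted for brevity.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for the distributed cache (illustrative only)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() >= expires_at:
            # Expired entries are dropped and treated as a miss.
            del self._entries[key]
            return None
        return response

    def put(self, key: str, response: dict) -> None:
        self._entries[key] = (time.monotonic() + self.ttl_seconds, response)


def handle_request(
    key: str, cache: TTLCache, call_provider: Callable[[], dict]
) -> tuple[dict, bool]:
    """Return (response, cache_hit). The provider is called only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    response = call_provider()
    cache.put(key, response)
    return response, False


provider_calls = []


def fake_provider() -> dict:
    # Stub standing in for a real LLM provider call.
    provider_calls.append(1)
    return {"choices": [{"message": {"content": "stub answer"}}]}


cache = TTLCache(ttl_seconds=60)
first, first_hit = handle_request("key-1", cache, fake_provider)
second, second_hit = handle_request("key-1", cache, fake_provider)
```

The second call with the same key is served from the cache, so the stub provider is invoked only once.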

How matching works: Cache matching is performed on DLP-cleaned prompt content. Prompts that differ only in PII values are treated as cache-equivalent after normalization. For example, two users asking the same question with different personal details produce the same cache key after redaction. This is exact-match on the normalized output, not embedding-based semantic similarity, so prompts that are merely reworded or paraphrased do not match.
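
A toy example of why redaction makes such prompts cache-equivalent. The single email regex and the `[EMAIL]` placeholder are assumptions for illustration; real DLP detection covers far more than this, and the gateway's actual placeholders are not documented here.

```python
import hashlib
import re

# Toy pattern for illustration only; real PII detection is much broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(prompt: str) -> str:
    """Replace email addresses with a fixed placeholder."""
    return EMAIL.sub("[EMAIL]", prompt)


def key_for(prompt: str) -> str:
    """Hash the redacted prompt, so the key never depends on the PII value."""
    return hashlib.sha256(redact(prompt).encode("utf-8")).hexdigest()


prompt_a = "Reset the password for alice@example.com"
prompt_b = "Reset the password for bob@example.org"
prompt_c = "Delete the account for alice@example.com"

same_key = key_for(prompt_a) == key_for(prompt_b)       # differ only in PII
different_key = key_for(prompt_a) != key_for(prompt_c)  # differ in wording
```

Both `same_key` and `different_key` evaluate to `True`: the first two prompts collapse to the same redacted text, while the third differs in its non-PII wording and therefore gets its own entry.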

Privacy Architecture

Semantic caching is built on the same privacy-first principles as the rest of the gateway:

✓ DLP runs before caching — PII is redacted before any cache key is computed or any response is stored
✓ Project-scoped isolation — cache entries are scoped to the originating project; one project's cache is never accessible to another
✓ One-way key derivation — cache keys are cryptographic hashes; the original prompt content cannot be reconstructed from the key
✓ Configurable TTL — cached entries expire automatically; no permanent storage of LLM responses
✓ Encryption in transit — all cache communication uses TLS encryption

Cache Hit Response

Cache hits return the same response format as a normal LLM response. The aisg_metadata object indicates a cache hit with associated cost and latency savings:

Response — cache hit

{
  "id": "chatcmpl-abc123",
  "model": "oah/llama-4-maverick",
  "choices": [ ... ],
  "aisg_metadata": {
    "request_id": "req_def456",
    "cache_hit": true,
    "mode": "cached",
    "latency_ms": 2,
    "upstream_latency_ms": 0,
    "cost_usd": 0.0,
    "pii_detected": false,
    "dlp_latency_ms": 8
  }
}

Response headers also indicate a cache hit:

Response headers

x-aisg-cache: HIT
x-aisg-request-id: req_def456
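
For clients that call the gateway over plain HTTP rather than through an SDK, a check along these lines could be used. This is a sketch: the helper name `is_cache_hit` is hypothetical, and treating any header value other than `HIT` as a miss is an assumption, since only the `HIT` value is shown above.

```python
import json


def is_cache_hit(headers: dict[str, str], body: str) -> bool:
    """Return True if either the header or the metadata marks a cache hit."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-aisg-cache", "").upper() == "HIT":
        return True
    metadata = json.loads(body).get("aisg_metadata", {})
    return bool(metadata.get("cache_hit", False))


headers = {"X-AISG-Cache": "HIT", "X-AISG-Request-Id": "req_def456"}
body = '{"id": "chatcmpl-abc123", "aisg_metadata": {"cache_hit": true, "cost_usd": 0.0}}'
hit = is_cache_hit(headers, body)
```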

What Gets Cached

Cached

✓ Non-streaming chat completions
✓ Successful responses (HTTP 200)
✓ Responses without tool/function calls

Not cached

— Streaming responses (SSE)
— Error responses
— Responses containing tool calls
— Image generation requests

Streaming responses are excluded because they require real-time token delivery. Tool call responses are excluded because they typically produce side effects that should not be replayed from cache.
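
The rules above amount to a simple predicate. The sketch below assumes an OpenAI-style response shape (`choices[].message.tool_calls`) and uses a hypothetical `endpoint` label; neither is confirmed as the gateway's internal representation.

```python
def is_cacheable(endpoint: str, stream: bool, status: int, response: dict) -> bool:
    """Hypothetical predicate mirroring the cached / not-cached lists."""
    if endpoint != "chat.completions":
        # Other request types, such as image generation, are not cached.
        return False
    if stream:
        # Streaming (SSE) responses are not cached.
        return False
    if status != 200:
        # Error responses are not cached.
        return False
    for choice in response.get("choices", []):
        if choice.get("message", {}).get("tool_calls"):
            # Tool calls may trigger side effects and are never replayed.
            return False
    return True


plain = {"choices": [{"message": {"content": "GDPR is an EU regulation."}}]}
with_tool = {"choices": [{"message": {"tool_calls": [{"id": "call_1"}]}}]}

cache_plain = is_cacheable("chat.completions", False, 200, plain)
cache_tool = is_cacheable("chat.completions", False, 200, with_tool)
```

With these inputs `cache_plain` is `True` and `cache_tool` is `False`.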

Integration

Semantic caching requires no code changes. If you're already using the AI Security Gateway (AISG) proxy, caching is active by default. You can detect cache hits in your application by inspecting the aisg_metadata object on the response:

Python — detecting cache hits

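from aisg import AISG

# The request is identical to an uncached call; caching happens at the gateway.
client = AISG()
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)

# aisg_metadata reports whether this response was served from the cache.
metadata = response.aisg_metadata
if metadata.get("cache_hit"):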
    print(f"Cache hit — {metadata['latency_ms']}ms, $0 cost")
else:
    print(f"Cache miss — {metadata['latency_ms']}ms")

Cost Impact

Cache hits cost $0: no tokens are consumed and no provider API call is made. The response is served directly from the distributed cache. Workloads with 20-40% prompt duplication typically see LLM spend fall by a similar proportion, provided the duplicated requests are cacheable (non-streaming, no tool calls), with zero changes to application code.

Cache hit rates vary by workload type. Agentic frameworks with fixed system prompts, customer support chatbots, and batch processing pipelines typically see the highest hit rates. Conversational applications with unique user messages see lower hit rates but still benefit when users ask common questions.
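
As a rough back-of-the-envelope estimate, savings can be modeled as spend multiplied by hit rate. The sketch below assumes cached and uncached requests have the same average cost, which will not hold exactly for real workloads. The function name and the dollar figures are illustrative, not measurements.

```python
def estimated_monthly_savings(monthly_spend_usd: float, hit_rate: float) -> float:
    """Estimate savings, assuming hits and misses cost the same on average."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Every cache hit avoids one provider call, so savings scale linearly.
    return monthly_spend_usd * hit_rate


# Illustrative only: $10,000 of monthly LLM spend at the ends of the 20-40% range.
low = estimated_monthly_savings(10_000, 0.20)
high = estimated_monthly_savings(10_000, 0.40)
```

Under these assumptions the estimate ranges from about $2,000 to about $4,000 per month. The actual figure depends on how many of the duplicated requests are cacheable.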

Distributed Architecture

The cache layer is shared across all proxy instances, ensuring consistent cache hits regardless of which instance handles the request. This is critical for horizontally scaled deployments where requests are load-balanced across multiple nodes.

• Shared state — a cache entry written by one proxy instance is immediately available to all others
• Automatic failover — if the cache layer is temporarily unavailable, requests fall through to the LLM provider transparently; no errors, no downtime
• Self-hosted support — self-hosted deployments require a Redis-compatible cache backend (Redis 7+, Valkey, or Dragonfly). Managed AISG handles cache infrastructure automatically
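
The fall-through behavior when the cache layer is unavailable can be sketched as a fail-open lookup. This is an assumption-laden illustration: the function names are hypothetical, and the builtin `ConnectionError` and `TimeoutError` stand in for whatever exceptions a real cache client raises.

```python
from typing import Callable, Optional


def get_with_fallthrough(
    key: str,
    cache_get: Callable[[str], Optional[dict]],
    call_provider: Callable[[], dict],
) -> tuple[dict, bool]:
    """Return (response, cache_hit); a cache outage is treated as a miss."""
    try:
        cached = cache_get(key)
    except (ConnectionError, TimeoutError):
        # Builtin exceptions stand in for the errors a real cache client
        # would raise when the backend is unreachable.
        cached = None
    if cached is not None:
        return cached, True
    return call_provider(), False


def unavailable_cache(key: str) -> Optional[dict]:
    # Simulates a cache backend that cannot be reached.
    raise ConnectionError("cache backend unreachable")


def stub_provider() -> dict:
    # Stub standing in for a real LLM provider call.
    return {"choices": [{"message": {"content": "stub answer"}}]}


response, cache_hit = get_with_fallthrough("key-1", unavailable_cache, stub_provider)
```

The caller still receives a normal provider response with `cache_hit` set to `False`; the outage only costs the savings a hit would have provided.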