Semantic Prompt Caching
Identical DLP-cleaned prompts are served from a distributed cache at zero LLM cost. Cache hits bypass the provider entirely: responses return in a few milliseconds, with zero tokens consumed and zero dollars spent.
Semantic caching works automatically at the gateway layer. When the same prompt (after PII redaction and normalization) has been seen before, the cached response is returned instantly. No SDK changes, no application code changes, no configuration required.
The Problem
Production AI workloads frequently send identical or near-identical prompts to LLM providers. Common scenarios include:
- Multiple users asking the same question to a support chatbot
- Agent frameworks re-sending the same system prompt with minor variations
- Batch processing pipelines that run the same template across identical inputs
- Development and testing environments replaying the same prompts repeatedly
Typical impact: Agentic workloads, support chatbots, and batch pipelines commonly exhibit 20-40% prompt duplication rates. At scale, this represents thousands of dollars in unnecessary LLM spend per month — with no change in output quality.
How It Works
DLP processing first
Every request passes through the full DLP pipeline (PII redaction, prompt injection detection) before cache evaluation. The cache key is derived from the cleaned prompt, not the raw input.
Cache key derivation
Cache key derivation accounts for the model, cleaned content, and project context. Cache hits are always project-scoped and never shared across accounts. Requests with different models, different content, or different projects always get independent cache entries.
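A minimal sketch of how a project-scoped key could be derived, assuming a SHA-256 hash over the project ID, model name, and cleaned messages. The function name `derive_cache_key`, the field layout, and the choice of SHA-256 are illustrative assumptions, not the gateway's documented scheme.

```python
import hashlib
import json


def derive_cache_key(project_id: str, model: str, cleaned_messages: list[dict]) -> str:
    """Hypothetical key derivation: hash project, model, and DLP-cleaned content."""
    # Canonical JSON (sorted keys, fixed separators) so equal inputs serialize identically.
    payload = json.dumps(
        {"project": project_id, "model": model, "messages": cleaned_messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


messages = [{"role": "user", "content": "What is GDPR?"}]
key_a = derive_cache_key("proj_a", "oah/llama-4-maverick", messages)
key_b = derive_cache_key("proj_b", "oah/llama-4-maverick", messages)
# key_a and key_b differ: identical content in different projects never shares an entry.
```

Because the project ID is part of the hashed payload, isolation between projects falls out of the key itself rather than relying on a separate access check.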
Cache lookup
The key is checked against a distributed cache shared across all proxy instances. On a hit, the cached response is returned immediately without contacting any LLM provider.
Cache storage
On a cache miss, the request is forwarded to the provider normally. Successful non-streaming responses are stored in the cache with a configurable TTL for future hits.
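The lookup and storage steps can be sketched roughly as follows. `TTLCache` is an in-memory stand-in for the distributed cache, and `handle_request` and `call_provider` are hypothetical names; the sketch assumes `call_provider` returns a successful non-streaming response.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for the distributed cache (illustrative only)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry has outlived its TTL; drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (time.monotonic() + self.ttl_seconds, value)


def handle_request(
    cache: TTLCache, key: str, call_provider: Callable[[], dict]
) -> tuple[dict, bool]:
    """Return (response, cache_hit). The provider is only called on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    # Miss: forward to the provider, then store the response for future hits.
    response = call_provider()
    cache.set(key, response)
    return response, False
```

A second call with the same key inside the TTL window returns the stored response without invoking `call_provider` again.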
How matching works: Cache matching is performed on DLP-cleaned prompt content. Prompts that differ only in PII values are treated as cache-equivalent after normalization — for example, two users asking the same question with different personal details produce the same cache key after redaction. This is exact-match on the normalized output, not embedding-based semantic similarity.
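To illustrate why prompts that differ only in PII collapse to one key, here is a toy sketch. The regex-based `redact` function is a deliberately simplified stand-in for the real DLP pipeline, and the `[EMAIL]` placeholder, whitespace collapsing, and SHA-256 hashing are assumptions made for the example.

```python
import hashlib
import re

# Toy stand-in for the DLP pipeline: replaces email addresses with a placeholder.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(prompt: str) -> str:
    cleaned = EMAIL_PATTERN.sub("[EMAIL]", prompt)
    # Collapse runs of whitespace so trivial spacing differences do not change the key.
    return " ".join(cleaned.split())


def key_for(prompt: str) -> str:
    # Exact-match hashing of the normalized text; no embeddings involved.
    return hashlib.sha256(redact(prompt).encode("utf-8")).hexdigest()


prompt_1 = "Reset the password for alice@example.com"
prompt_2 = "Reset the password for  bob@example.org"
prompt_3 = "Delete the account for alice@example.com"
# prompt_1 and prompt_2 share a key; prompt_3 does not, because its wording differs.
```

Any difference that survives normalization, such as a reworded question, produces a different key and therefore a miss.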
Privacy Architecture
Semantic caching is built on the same privacy-first principles as the rest of the gateway:
- ✓ DLP runs before caching — PII is redacted before any cache key is computed or any response is stored
- ✓ Project-scoped isolation — cache entries are scoped to the originating project; one project's cache is never accessible to another
- ✓ One-way key derivation — cache keys are cryptographic hashes; the original prompt content cannot be reconstructed from the key
- ✓ Configurable TTL — cached entries expire automatically; no permanent storage of LLM responses
- ✓ Encryption in transit — all cache communication uses TLS encryption
Cache Hit Response
Cache hits return the same response format as a normal LLM response. The aisg_metadata object indicates a cache hit with associated cost and latency savings:
{
  "id": "chatcmpl-abc123",
  "model": "oah/llama-4-maverick",
  "choices": [ ... ],
  "aisg_metadata": {
    "request_id": "req_def456",
    "cache_hit": true,
    "mode": "cached",
    "latency_ms": 2,
    "upstream_latency_ms": 0,
    "cost_usd": 0.0,
    "pii_detected": false,
    "dlp_latency_ms": 8
  }
}
Response headers also indicate a cache hit:
x-aisg-cache: HIT
x-aisg-request-id: req_def456
What Gets Cached
Cached
- ✓ Non-streaming chat completions
- ✓ Successful responses (HTTP 200)
- ✓ Responses without tool/function calls
Not cached
- Streaming responses (SSE)
- Error responses
- Responses containing tool calls
- Image generation requests
Streaming responses are excluded because they require real-time token delivery. Tool call responses are excluded because they typically produce side effects that should not be replayed from cache.
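The cached and not-cached lists above can be expressed as a single predicate. This is a hedged sketch: `is_cacheable` is a hypothetical helper, it assumes OpenAI-style chat completion request and response dictionaries (`stream`, `choices`, `message.tool_calls`), and it does not model image generation requests.

```python
def is_cacheable(request: dict, status_code: int, response: dict) -> bool:
    """Hypothetical eligibility check for storing a chat completion response."""
    if request.get("stream"):
        return False  # streaming (SSE) responses are not cached
    if status_code != 200:
        return False  # only successful responses are cached
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # Tool or function calls may trigger side effects, so they are not replayed.
        if message.get("tool_calls") or message.get("function_call"):
            return False
    return True
```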
Integration
Semantic caching requires no code changes. If you're already using the AISG proxy, caching is active by default. You can detect cache hits in your application by inspecting the metadata:
from aisg import AISG

client = AISG()

response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)

# aisg_metadata carries the gateway's cache and latency details for this request.
metadata = response.aisg_metadata
if metadata.get("cache_hit"):
    print(f"Cache hit: {metadata['latency_ms']}ms, $0 cost")
else:
    print(f"Cache miss: {metadata['latency_ms']}ms")

Cost Impact
Cache hits cost $0: no tokens are consumed and no provider API call is made, because the response is served directly from the distributed cache. Workloads with 20-40% prompt duplication can see LLM spend fall by a similar proportion, provided the duplicated requests are cacheable (non-streaming, without tool calls) and repeat within the TTL. No application code changes are needed.
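As a back-of-the-envelope illustration, savings can be estimated as spend multiplied by hit rate. The $10,000 monthly spend below is a made-up figure, and the sketch assumes every request costs roughly the same, which real workloads will only approximate.

```python
def estimated_monthly_savings(monthly_spend_usd: float, cache_hit_rate: float) -> float:
    """Rough estimate: assumes uniform per-request cost and $0 cost for cache hits."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return monthly_spend_usd * cache_hit_rate


# Hypothetical workload spending $10,000 per month before caching.
low = estimated_monthly_savings(10_000, 0.20)   # 20% hit rate
high = estimated_monthly_savings(10_000, 0.40)  # 40% hit rate
print(f"Estimated savings: ${low:,.0f} to ${high:,.0f} per month")
```

The realized hit rate is usually somewhat below the raw duplication rate, since the first occurrence of each prompt is always a miss.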
Cache hit rates vary by workload type. Agentic frameworks with fixed system prompts, customer support chatbots, and batch processing pipelines typically see the highest hit rates. Conversational applications with unique user messages see lower hit rates but still benefit when users ask common questions.
Distributed Architecture
The cache layer is shared across all proxy instances, ensuring consistent cache hits regardless of which instance handles the request. This is critical for horizontally scaled deployments where requests are load-balanced across multiple nodes.
- Shared state: a cache entry written by one proxy instance is immediately available to all others
- Automatic failover: if the cache layer is temporarily unavailable, requests fall through to the LLM provider transparently, with no errors and no downtime
- Self-hosted support: self-hosted deployments require a Redis-compatible cache backend (Redis 7+, Valkey, or Dragonfly); managed AISG handles cache infrastructure automatically
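The fall-through behavior described above can be sketched as treating a cache outage as an ordinary miss. `CacheUnavailable`, `lookup_with_fallthrough`, and the callable parameters are hypothetical names used only for illustration, not part of any AISG API.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the backend cannot be reached."""


def lookup_with_fallthrough(
    cache_get: Callable[[str], Optional[dict]],
    key: str,
    call_provider: Callable[[], dict],
) -> tuple[dict, bool]:
    """Return (response, cache_hit); a cache outage never surfaces to the caller."""
    try:
        cached = cache_get(key)
    except CacheUnavailable:
        cached = None  # treat an outage as a miss
    if cached is not None:
        return cached, True
    return call_provider(), False
```

The caller sees the same return shape in all three cases (hit, miss, outage), so an unavailable cache only costs the saved latency and spend.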