How Semantic Caching Cuts LLM API Costs by 15-40% Without Code Changes
The short version
If the same DLP-cleaned prompt has been seen before, return the cached response instantly. No LLM call, no tokens consumed, no cost. AI Security Gateway now does this automatically at the proxy layer — backed by a distributed cache shared across all instances. Workloads with 20-40% prompt duplication typically see equivalent reductions in LLM spend, with zero application changes.
The Problem: You're Paying for the Same Answer Twice
Production AI applications send duplicate prompts more often than most teams realize. The duplication isn't always obvious — it's hidden across users, sessions, and instances:
1. Support chatbots: hundreds of customers asking the same question about your pricing, refund policy, or API limits. Each one triggers a full LLM round-trip.
2. Agent frameworks: LangChain, CrewAI, and AutoGPT agents resend the same system prompt with every tool call. A 10-step agent chain sends the same 2,000-token system prompt 10 times.
3. Batch pipelines: classification, summarization, and extraction jobs running the same template over identical inputs. Retries and error recovery compound the problem.
4. Dev/staging: developers and QA testers replaying the same prompts during debugging and testing. This isn't wasted work, but it is wasted money.
Example scenario: A mid-size team running a customer support chatbot on GPT-4.1 ($2/1M input, $8/1M output) with 50,000 queries/day and a 30% duplicate rate would spend ~$3,600/month on answers it has already generated. At Claude Sonnet 4 pricing ($3/$15), that rises to ~$8,100/month. Actual savings depend on your workload's specific duplication rate.
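The arithmetic behind the scenario can be sketched as follows. The per-query token counts (2,000 input, 500 output) are an assumption, chosen because they reproduce the ~$3,600 figure at the GPT-4.1 rates; the scenario itself does not state them. The Sonnet 4 figure seems to imply a different mix (for instance, 1,000 input and 1,000 output tokens per query gives $8,100), so treat both numbers as rough estimates. A 30-day month is also assumed.

```python
def monthly_duplicate_spend(queries_per_day, duplicate_rate,
                            input_tokens, output_tokens,
                            input_price_per_m, output_price_per_m,
                            days=30):
    """Estimate monthly USD spent on prompts that were already answered.

    Token counts per query are assumptions supplied by the caller;
    prices are USD per 1M tokens.
    """
    cost_per_query = (input_tokens * input_price_per_m
                      + output_tokens * output_price_per_m) / 1_000_000
    duplicates_per_month = queries_per_day * duplicate_rate * days
    return duplicates_per_month * cost_per_query


# GPT-4.1 rates from the scenario, with assumed 2,000 in / 500 out tokens.
gpt41 = monthly_duplicate_spend(50_000, 0.30, 2_000, 500, 2.0, 8.0)
print(f"GPT-4.1 duplicate spend: ${gpt41:,.0f}/month")
```

Plugging in your own measured token counts and duplicate rate gives a more useful number than either headline figure.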
Why Not Just Cache in Your Application?
You can. But it creates problems that a gateway-level cache avoids:
Every service reimplements it
If you have 3 microservices calling LLMs, you need caching logic in all 3. Different languages, different cache backends, different bugs.
Cache keys don't account for DLP
If you redact PII before sending to the LLM, the cache key should be computed on the cleaned prompt — not the raw input. Otherwise, two requests with different PII but the same intent generate different cache keys despite producing the same redacted prompt.
No cross-instance sharing
In-process caches (LRU dicts, functools.lru_cache) don't share state across instances. Instance A caches a response; instance B re-requests it from the provider.
Cache invalidation is your problem
TTL management, eviction policies, memory pressure — you own all of it.
A gateway-level cache solves all four. Every LLM request passes through the proxy regardless — so caching at that layer is automatic, universal, DLP-aware, and distributed.
How AI Security Gateway Does It
Every request through the AISG proxy follows this pipeline:
DLP runs first
PII redaction, prompt injection detection, and policy enforcement happen before anything else. The cache never sees raw PII.
Cache key is derived
Cache key derivation accounts for the model, cleaned content, and project context. Cache hits are always project-scoped and never shared across accounts.
Cache lookup
The key is checked against a distributed cache shared across all proxy instances. Hit? Return the response immediately. Miss? Forward to the provider.
Cache write (on miss)
Successful non-streaming responses are stored with a configurable TTL. The next identical request — from any instance, any user in the same project — gets a cache hit.
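The four steps above can be sketched as a small toy model. This is not AISG's implementation: the class name, the `clean` and `call_provider` callables, the SHA-256 key over project, model, and cleaned messages, and the plain dict standing in for the distributed cache are all assumptions made for illustration. TTL expiry is left out for brevity.

```python
import hashlib
import json


class CachingProxy:
    """Toy model of the pipeline: DLP -> key -> lookup -> write on miss."""

    def __init__(self, clean, call_provider):
        self.clean = clean                  # hypothetical DLP function: str -> str
        self.call_provider = call_provider  # hypothetical upstream LLM call
        self.store = {}                     # stand-in for the distributed cache

    def _key(self, project_id, model, cleaned_messages):
        # Hash project, model, and cleaned content together (assumed scheme).
        payload = json.dumps([project_id, model, cleaned_messages], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, project_id, model, messages):
        # 1. DLP runs first, so nothing below ever sees the raw content.
        cleaned = [{**m, "content": self.clean(m["content"])} for m in messages]
        # 2. Derive the cache key from the cleaned request.
        key = self._key(project_id, model, cleaned)
        # 3. Cache lookup: return immediately on a hit.
        if key in self.store:
            return {"response": self.store[key], "cache_hit": True}
        # 4. Miss: forward to the provider, then store the response.
        response = self.call_provider(model, cleaned)
        self.store[key] = response
        return {"response": response, "cache_hit": False}


provider_calls = []

def fake_provider(model, messages):
    provider_calls.append(model)
    return "stub answer"

proxy = CachingProxy(clean=str.strip, call_provider=fake_provider)
msgs = [{"role": "user", "content": "What is GDPR?"}]
first = proxy.complete("project-a", "oah/llama-4-maverick", msgs)
second = proxy.complete("project-a", "oah/llama-4-maverick", msgs)
other = proxy.complete("project-b", "oah/llama-4-maverick", msgs)
```

In this sketch the second call from the same project is a hit, while the identical prompt from a different project is a miss, mirroring the project scoping described above.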
What a cache hit looks like
{
  "model": "oah/llama-4-maverick",
  "choices": [ ... ],
  "aisg_metadata": {
    "request_id": "req_abc123",
    "cache_hit": true,
    "mode": "cached",
    "latency_ms": 2,
    "upstream_latency_ms": 0,
    "cost_usd": 0.0,
    "dlp_latency_ms": 8,
    "pii_detected": false
  }
}
The response body is identical to what the LLM would return. The aisg_metadata object tells you it was a cache hit: cost_usd: 0.0, upstream_latency_ms: 0, and a total latency measured in single-digit milliseconds instead of hundreds.
Why “DLP-Aware” Caching Matters
Most caching implementations hash the raw prompt. This is wrong if you're doing any form of PII processing. Consider two requests with different personal details but identical intent:
# Request A
"My name is Alice Johnson, SSN 123-45-6789. What's my balance?"
# Request B
"My name is Bob Smith, SSN 987-65-4321. What's my balance?"
# After DLP redaction, both become:
"My name is [PERSON], SSN [US_SSN]. What's my balance?"
After DLP redaction, both requests are semantically identical. A DLP-aware cache computes the key on the redacted version, so Request B gets a cache hit from Request A's response. A naive cache would treat them as different prompts and pay for both.
This is particularly impactful for customer-facing applications where many users ask the same question with different personal details. The DLP pipeline strips the PII, the cache handles the deduplication, and you pay for one LLM call instead of hundreds.
Technical note: Cache matching is performed on DLP-cleaned prompt content. Prompts that differ only in PII values are treated as cache-equivalent after normalization. This is exact-match on the normalized output, not embedding-based semantic similarity — the “semantic” equivalence comes from the DLP normalization step, not from vector embeddings.
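A minimal sketch of that normalization effect, using the two requests from the example. The regular expressions are toy stand-ins for a real DLP engine (the name pattern only handles the literal phrase "My name is First Last"), and hashing the text with SHA-256 is an assumed key scheme, not the gateway's documented one.

```python
import hashlib
import re

# Toy detectors, for illustration only.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
NAME_RE = re.compile(r"(?<=My name is )[A-Z][a-z]+ [A-Z][a-z]+")


def redact(prompt):
    """Replace SSN-shaped strings and the toy name pattern with placeholders."""
    prompt = SSN_RE.sub("[US_SSN]", prompt)
    return NAME_RE.sub("[PERSON]", prompt)


def key_for(text):
    """Exact-match key: a SHA-256 hash of the text (assumed scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


request_a = "My name is Alice Johnson, SSN 123-45-6789. What's my balance?"
request_b = "My name is Bob Smith, SSN 987-65-4321. What's my balance?"

naive_match = key_for(request_a) == key_for(request_b)
dlp_aware_match = key_for(redact(request_a)) == key_for(redact(request_b))
print(f"naive keys match: {naive_match}, DLP-aware keys match: {dlp_aware_match}")
```

Hashing the raw prompts yields two different keys, while hashing the redacted prompts yields one, which is the whole difference between a naive cache and a DLP-aware one.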
What Gets Cached (and What Doesn't)
Cached
- Non-streaming chat completions
- Successful responses (HTTP 200)
- Responses without tool/function calls
Not cached
- Streaming responses (SSE)
- Error responses
- Tool call responses (side effects)
- Image generation requests
Streaming is excluded because cache hits need to return the full response immediately — buffering a stream defeats the purpose. Tool calls are excluded because they produce side effects (database writes, API calls) that should not be replayed from cache.
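The rules above can be read as a single predicate. The sketch below assumes OpenAI-style request and response fields (`stream`, `choices[].message.tool_calls`, the legacy `function_call`) and the `/v1/chat/completions` path; the function name and signature are hypothetical and do not describe the gateway's internals.

```python
CHAT_COMPLETIONS_PATH = "/v1/chat/completions"  # OpenAI-style path, assumed here


def is_cacheable(path, request_body, status_code, response_body):
    """Return True only for successful, non-streaming, tool-free chat completions."""
    if path != CHAT_COMPLETIONS_PATH:
        return False  # e.g. image generation requests
    if request_body.get("stream"):
        return False  # streaming (SSE) responses are not cached
    if status_code != 200:
        return False  # error responses are not cached
    for choice in response_body.get("choices", []):
        message = choice.get("message") or {}
        if message.get("tool_calls") or message.get("function_call"):
            return False  # tool calls imply side effects; never replay them
    return True


plain = {"choices": [{"message": {"role": "assistant", "content": "hi"}}]}
with_tool = {"choices": [{"message": {"role": "assistant", "content": None,
                                      "tool_calls": [{"id": "call_1"}]}}]}
print(is_cacheable(CHAT_COMPLETIONS_PATH, {}, 200, plain))
print(is_cacheable(CHAT_COMPLETIONS_PATH, {}, 200, with_tool))
```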
Distributed, Not In-Memory
Per-process in-memory caches fall short in horizontally scaled production deployments, which run multiple proxy instances behind a load balancer. An in-memory cache on Instance A is invisible to Instance B.
AISG uses a distributed cache layer shared across all proxy instances. A cache entry written by one instance is immediately available to every other instance. This means:
- ✓ Cache hits work regardless of which instance handles the request
- ✓ Cache hit rates scale with total traffic, not per-instance traffic
- ✓ If the cache layer is temporarily unavailable, requests fall through transparently — no errors, no downtime
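The fall-through behavior in the last point is commonly called failing open. One way to sketch it, assuming a backend object that exposes `get` and `set` (both class names below are hypothetical, and production code would catch the client library's specific connection errors rather than `Exception`):

```python
class FailOpenCache:
    """Wrap a cache backend so that backend failures look like cache misses."""

    def __init__(self, backend):
        self.backend = backend  # any object exposing get(key) and set(key, value)

    def get(self, key):
        try:
            return self.backend.get(key)
        except Exception:
            # Cache outage: report a miss so the request goes to the provider.
            return None

    def set(self, key, value):
        try:
            self.backend.set(key, value)
        except Exception:
            # Losing a write only costs a future cache hit, so it is ignored.
            pass


class DownBackend:
    """Hypothetical backend that simulates an unreachable cache."""

    def get(self, key):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")


cache = FailOpenCache(DownBackend())
cache.set("k", "v")        # the error is swallowed
result = cache.get("k")    # treated as a miss
```

The trade-off is that an outage silently raises provider spend, so a real deployment would likely also emit a metric or log line in the `except` branches.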
Self-hosted deployments require a Redis-compatible cache backend (Redis 7+, Valkey, or Dragonfly). Managed AISG handles the cache infrastructure automatically with encryption in transit.
Integration: Zero Code Changes
If you're already routing through the AISG proxy, caching is active by default. There's nothing to enable, no SDK to update, no configuration to set. Your existing code works as-is:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="your-aisg-key",
)
# First call: cache miss -> provider round-trip (~400ms)
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)
# Second identical call: cache hit -> cached response (~2ms, $0)
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)
If you want to detect cache hits in your application logic (for analytics, logging, or conditional behavior), check the metadata:
from aisg import AISG
client = AISG()
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)
meta = response.aisg_metadata
if meta.get("cache_hit"):
    print(f"Cache hit: {meta['latency_ms']}ms, $0")
else:
    print(f"Cache miss: {meta['latency_ms']}ms")
Privacy Guarantees
Caching and privacy aren't in tension — they're complementary when the architecture is right:
- ✓ DLP runs before caching — PII is redacted before any cache key is computed or response is stored. The cache only ever holds cleaned content.
- ✓ One-way key derivation — cache keys are cryptographic hashes. The original prompt cannot be reconstructed from the key.
- ✓ Project isolation — cache entries are scoped to the originating project. One project's cache is never accessible to another.
- ✓ Automatic expiry — cached entries expire via configurable TTL. No permanent storage of LLM responses.
- ✓ Encryption in transit — all cache communication uses TLS.
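Two of these guarantees, project isolation and automatic expiry, are easy to illustrate in miniature. The class below is a sketch under assumptions: an in-memory dict replaces the distributed cache, entries are keyed by a `(project_id, key)` tuple, and the clock is injectable so expiry can be demonstrated without sleeping. None of these names come from AISG.

```python
import time


class ProjectScopedTTLCache:
    """Toy cache whose entries are scoped to a project and expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (project_id, key) -> (expires_at, value)

    def set(self, project_id, key, value):
        self._entries[(project_id, key)] = (self._clock() + self._ttl, value)

    def get(self, project_id, key):
        entry = self._entries.get((project_id, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so it is not stored indefinitely.
            del self._entries[(project_id, key)]
            return None
        return value


now = [0.0]  # fake clock, advanced manually
cache = ProjectScopedTTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("project-a", "abc123", "cached response")

same_project = cache.get("project-a", "abc123")   # visible to its own project
other_project = cache.get("project-b", "abc123")  # invisible to another project
now[0] = 61.0
after_expiry = cache.get("project-a", "abc123")   # gone once the TTL has passed
```

A Redis-compatible backend would typically delegate expiry to the store itself by setting a per-key TTL on write, rather than checking timestamps on read as this sketch does.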
How Caching Combines with Other AISG Features
Semantic caching doesn't exist in isolation. It's one layer in the AISG proxy pipeline, and it interacts with other features in useful ways:
Loop protection
Recursive agent loops are detected and killed before they can fill the cache with junk entries. The loop guard fires before the cache lookup.
Budget enforcement
Cache hits cost $0 and don't count against your spending cap. More cache hits means more budget headroom for requests that actually need provider calls.
Smart routing
Cache misses still benefit from multi-provider smart routing. If the cheapest provider is down, the request falls back automatically.
Webhook notifications
Cache hit/miss metrics are available in the project dashboard. Track your cache hit rate alongside DLP violations, injection attempts, and budget usage.
When Semantic Caching Won't Help
Caching is not a silver bullet. It has the most impact on workloads with high prompt repetition. It adds minimal value for:
- Unique conversational messages: free-form user conversations where every message is different will see very low cache hit rates.
- Streaming-only workloads: if 100% of your requests use streaming (SSE), nothing gets cached. Consider using non-streaming for batch/internal calls.
- Tool-heavy agents: agents that rely entirely on tool calls won't see cache hits because tool-call responses are excluded.
That said, even predominantly unique workloads benefit from caching the “long tail” of repeated requests — system prompt evaluations, health checks, and administrative queries that happen more often than you think.
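Before relying on caching, it can help to estimate how repetitive a workload actually is. The sketch below computes the fraction of requests in a log that repeat an earlier identical prompt, which is an upper bound on the hit rate of an exact-match cache that never expires entries. The function name is hypothetical, and the input is assumed to be a list of already-cleaned prompt strings for a single project and model.

```python
from collections import Counter


def estimated_hit_rate(prompts):
    """Fraction of requests that repeat an earlier identical prompt.

    Upper bound for an exact-match cache: TTL expiry, streaming, and
    tool-call exclusions would all lower the real rate.
    """
    if not prompts:
        return 0.0
    counts = Counter(prompts)
    # The first occurrence of each prompt is a miss; every later one could hit.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(prompts)


log = ["What is GDPR?", "Reset my password", "What is GDPR?",
       "What is GDPR?", "Pricing tiers?"]
rate = estimated_hit_rate(log)
print(f"Potential cache hit rate: {rate:.0%}")
```

Running this over a day of real, DLP-cleaned traffic gives a quick read on whether a workload falls in the high-repetition category or in one of the low-value cases listed above.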
Getting Started
If you're already using AI Security Gateway, semantic caching is active. Check your project dashboard to see your cache hit rate. If you're not using AISG yet:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="your-aisg-key",  # Get one at aisecuritygateway.ai
)
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
# PII redaction, prompt injection blocking, budget enforcement,
# loop protection, and semantic caching -- all automatic.
For full technical details, see the Semantic Caching documentation.
Want to self-host this?
AI Security Gateway is open source. Deploy the core AI security proxy on your own infrastructure — PII redaction, prompt injection blocking, and secret detection included. No account required.