How Semantic Caching Cuts LLM API Costs by 15-40% Without Code Changes

June 1, 2026 · 8 min read · engineering

The short version

If the same DLP-cleaned prompt has been seen before, return the cached response instantly. No LLM call, no tokens consumed, no cost. AI Security Gateway now does this automatically at the proxy layer — backed by a distributed cache shared across all instances. Workloads with 20-40% prompt duplication typically see equivalent reductions in LLM spend, with zero application changes.

The Problem: You're Paying for the Same Answer Twice

Production AI applications send duplicate prompts more often than most teams realize. The duplication isn't always obvious — it's hidden across users, sessions, and instances:

1. Support chatbots — hundreds of customers asking the same question about your pricing, refund policy, or API limits. Each one triggers a full LLM round-trip.
2. Agent frameworks — LangChain, CrewAI, and AutoGPT agents resend the same system prompt with every tool call. A 10-step agent chain sends the same 2,000-token system prompt 10 times.
3. Batch pipelines — classification, summarization, and extraction jobs running the same template over identical inputs. Retries and error recovery compound the problem.
4. Dev/staging — developers and QA testers replaying the same prompts during debugging and testing. This isn't wasted work, but it is wasted money.

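Example scenario: A mid-size team running a customer support chatbot on GPT-4.1 ($2/1M input, $8/1M output) with 50,000 queries/day and a 30% duplicate rate sends about 450,000 duplicate queries a month. At roughly $0.008 per query (for example, 1,000 input and 750 output tokens), that is ~$3,600/month spent on answers it has already generated. At Claude Sonnet 4 pricing ($3/$15), the same token mix costs 1.5x to 1.875x as much, or roughly $5,400 to $6,750/month. Actual savings depend on your workload's duplication rate and token counts per query.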
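
The arithmetic behind an estimate like this can be sketched in a few lines. The scenario does not state token counts per query, so the 1,000 input and 750 output tokens below are an assumption chosen for illustration, and monthly_duplicate_spend is a hypothetical helper, not part of any AISG tooling.

```python
def monthly_duplicate_spend(
    queries_per_day: int,
    duplicate_rate: float,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    days_per_month: int = 30,
) -> float:
    """Estimate monthly spend (USD) on queries that duplicate earlier ones."""
    cost_per_query = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    duplicates_per_month = queries_per_day * duplicate_rate * days_per_month
    return duplicates_per_month * cost_per_query


# Assumed token mix (not from the scenario): 1,000 input + 750 output tokens.
gpt_41 = monthly_duplicate_spend(50_000, 0.30, 1_000, 750, 2.0, 8.0)
print(f"GPT-4.1 duplicate spend: ${gpt_41:,.0f}/month")
```

Swapping in your own token counts, duplicate rate, and model prices gives a first-order estimate of what a cache could save.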

Why Not Just Cache in Your Application?

You can. But it creates problems that a gateway-level cache avoids:

Every service reimplements it

If you have 3 microservices calling LLMs, you need caching logic in all 3. Different languages, different cache backends, different bugs.

Cache keys don't account for DLP

If you redact PII before sending to the LLM, the cache key should be computed on the cleaned prompt — not the raw input. Otherwise, two requests with different PII but the same intent generate different cache keys despite producing the same redacted prompt.

No cross-instance sharing

In-process caches (LRU dicts, functools.lru_cache) don't share state across instances. Instance A caches a response; instance B re-requests it from the provider.

Cache invalidation is your problem

TTL management, eviction policies, memory pressure — you own all of it.

A gateway-level cache solves all four. Every LLM request passes through the proxy regardless — so caching at that layer is automatic, universal, DLP-aware, and distributed.

How AI Security Gateway Does It

Every request through the AISG proxy follows this pipeline:

DLP runs first

PII redaction, prompt injection detection, and policy enforcement happen before anything else. The cache never sees raw PII.

Cache key is derived

Cache key derivation accounts for the model, cleaned content, and project context. Cache hits are always project-scoped and never shared across accounts.
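
As a rough illustration of what such a derivation could look like, the sketch below hashes the project, model, and cleaned messages together. AISG's actual key format is not described here: the SHA-256 hash, the JSON canonicalization, and the derive_cache_key name are all assumptions.

```python
import hashlib
import json


def derive_cache_key(project_id: str, model: str, cleaned_messages: list[dict]) -> str:
    """Hash the project, model, and DLP-cleaned messages into one cache key."""
    # Canonical JSON (sorted keys, fixed separators) so that logically equal
    # requests always serialize to the same bytes.
    payload = json.dumps(
        {"project": project_id, "model": model, "messages": cleaned_messages},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Hypothetical key layout: the project prefix namespaces entries per project.
    return f"cache:{project_id}:{digest}"


messages = [{"role": "user", "content": "What is GDPR?"}]
print(derive_cache_key("proj_a", "oah/llama-4-maverick", messages))
```

Because the project ID feeds both the prefix and the hash input, two projects sending the same prompt to the same model never collide on a key.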

Cache lookup

The key is checked against a distributed cache shared across all proxy instances. Hit? Return the response immediately. Miss? Forward to the provider.

Cache write (on miss)

Successful non-streaming responses are stored with a configurable TTL. The next identical request — from any instance, any user in the same project — gets a cache hit.
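
The lookup and write steps can be sketched as follows. This is a minimal illustration, not the gateway's implementation: TTLCache is an in-memory stand-in for the distributed cache, handle_request is a hypothetical name, and the 300-second default TTL is an arbitrary assumption.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for the shared cache; entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: dict[str, tuple[float, dict]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # Lazily evict the expired entry.
            return None
        return value

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, value)


def handle_request(
    key: str,
    cache: TTLCache,
    call_provider: Callable[[], dict],
    ttl_seconds: float = 300.0,  # Assumed default, for illustration only.
) -> tuple[dict, bool]:
    """Return (response, cache_hit); on a miss, call the provider and store it."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    response = call_provider()
    cache.set(key, response, ttl_seconds)
    return response, False
```

Injecting the clock keeps expiry testable without sleeping; a real deployment would delegate expiry to the cache backend instead.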

What a cache hit looks like

Response metadata on cache hit

{
  "model": "oah/llama-4-maverick",
  "choices": [ ... ],
  "aisg_metadata": {
    "request_id": "req_abc123",
    "cache_hit": true,
    "mode": "cached",
    "latency_ms": 2,
    "upstream_latency_ms": 0,
    "cost_usd": 0.0,
    "dlp_latency_ms": 8,
    "pii_detected": false
  }
}

The response body is identical to what the LLM returned on the original miss. The aisg_metadata object tells you it was a cache hit: cache_hit: true, cost_usd: 0.0, upstream_latency_ms: 0, and a latency_ms in single-digit milliseconds instead of hundreds.

Why “DLP-Aware” Caching Matters

Most caching implementations hash the raw prompt. This is wrong if you're doing any form of PII processing. Consider two requests with different personal details but identical intent:

Two requests with different PII, same intent

# Request A
"My name is Alice Johnson, SSN 123-45-6789. What's my balance?"

# Request B
"My name is Bob Smith, SSN 987-65-4321. What's my balance?"

# After DLP redaction, both become:
"My name is [PERSON], SSN [US_SSN]. What's my balance?"

After DLP redaction, both requests are textually identical. A DLP-aware cache computes the key on the redacted version, so Request B gets a cache hit from Request A's response. A naive cache would treat them as different prompts and pay for both.

This is particularly impactful for customer-facing applications where many users ask the same question with different personal details. The DLP pipeline strips the PII, the cache handles the deduplication, and you pay for one LLM call instead of hundreds.

Technical note: Cache matching is performed on DLP-cleaned prompt content. Prompts that differ only in PII values are treated as cache-equivalent after normalization. This is exact-match on the normalized output, not embedding-based semantic similarity — the “semantic” equivalence comes from the DLP normalization step, not from vector embeddings.
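
The effect is easy to reproduce with a toy redactor. The two regular expressions below are deliberately simplistic stand-ins for a real DLP engine (an assumption for illustration only), and hashing with SHA-256 is likewise assumed rather than documented behavior.

```python
import hashlib
import re

# Toy detectors, far simpler than a production DLP pipeline.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
NAME_PATTERN = re.compile(r"(?<=My name is )[A-Z][a-z]+ [A-Z][a-z]+")


def redact(prompt: str) -> str:
    """Replace SSNs and the name after 'My name is' with placeholder tokens."""
    prompt = SSN_PATTERN.sub("[US_SSN]", prompt)
    return NAME_PATTERN.sub("[PERSON]", prompt)


def naive_key(prompt: str) -> str:
    """Key computed on the raw prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def dlp_aware_key(prompt: str) -> str:
    """Key computed on the redacted prompt."""
    return hashlib.sha256(redact(prompt).encode("utf-8")).hexdigest()


request_a = "My name is Alice Johnson, SSN 123-45-6789. What's my balance?"
request_b = "My name is Bob Smith, SSN 987-65-4321. What's my balance?"

print(naive_key(request_a) == naive_key(request_b))          # False
print(dlp_aware_key(request_a) == dlp_aware_key(request_b))  # True
```

The comparison is still an exact match on bytes; the redaction step is what makes the two prompts equal.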

What Gets Cached (and What Doesn't)

Cached

Non-streaming chat completions
Successful responses (HTTP 200)
Responses without tool/function calls

Not cached

Streaming responses (SSE)
Error responses
Tool call responses (side effects)
Image generation requests

Streaming is excluded because cache hits need to return the full response immediately — buffering a stream defeats the purpose. Tool calls are excluded because they produce side effects (database writes, API calls) that should not be replayed from cache.
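
These rules amount to a small eligibility check. The sketch below assumes OpenAI-style request and response shapes (the stream flag, choices[].message.tool_calls, and the legacy function_call field); is_cacheable and the endpoint path check are hypothetical, not AISG's actual code.

```python
CHAT_COMPLETIONS_PATH = "/v1/chat/completions"


def is_cacheable(path: str, request: dict, status_code: int, response: dict) -> bool:
    """Decide whether a provider response is eligible to be stored in the cache."""
    if path != CHAT_COMPLETIONS_PATH:
        return False  # e.g. image generation endpoints
    if request.get("stream"):
        return False  # streaming (SSE) responses are not cached
    if status_code != 200:
        return False  # never cache errors
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        if message.get("tool_calls") or message.get("function_call"):
            return False  # tool calls imply side effects
    return True


ok_response = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}
print(is_cacheable(CHAT_COMPLETIONS_PATH, {"stream": False}, 200, ok_response))
```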

Distributed, Not In-Memory

Per-instance in-memory caches fall short in production. Any horizontally scaled deployment runs multiple proxy instances behind a load balancer, and an in-memory cache on Instance A is invisible to Instance B.

AISG uses a distributed cache layer shared across all proxy instances. A cache entry written by one instance is immediately available to every other instance. This means:

✓ Cache hits work regardless of which instance handles the request
✓ Cache hit rates scale with total traffic, not per-instance traffic
✓ If the cache layer is temporarily unavailable, requests fall through transparently — no errors, no downtime
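
The fall-through behavior in the last point is commonly implemented as a fail-open wrapper around the cache client. The sketch below shows the general pattern only; CacheUnavailable, safe_get, and safe_set are hypothetical names, and the cache object is assumed to expose get and set methods.

```python
import logging
from typing import Optional

logger = logging.getLogger("cache")


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the backend is unreachable."""


def safe_get(cache, key: str) -> Optional[dict]:
    """Treat a cache outage as a miss so the request still reaches the provider."""
    try:
        return cache.get(key)
    except CacheUnavailable:
        logger.warning("cache unavailable; treating lookup as a miss")
        return None


def safe_set(cache, key: str, value: dict, ttl_seconds: float) -> None:
    """Skip the write on a cache outage instead of failing the request."""
    try:
        cache.set(key, value, ttl_seconds)
    except CacheUnavailable:
        logger.warning("cache unavailable; skipping write")
```

With this shape, an outage costs only the savings from missed hits; callers never see a cache error.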

Self-hosted deployments require a Redis-compatible cache backend (Redis 7+, Valkey, or Dragonfly). Managed AISG handles the cache infrastructure automatically with encryption in transit.

Integration: Zero Code Changes

If you're already routing through the AISG proxy, caching is active by default. There's nothing to enable, no SDK to update, no configuration to set. Your existing code works as-is:

Python — works with caching, no changes needed

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="your-aisg-key",
)

# First call: cache miss -> provider round-trip (~400ms)
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)

# Second identical call: cache hit -> cached response (~2ms, $0)
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)

If you want to detect cache hits in your application logic (for analytics, logging, or conditional behavior), check the metadata:

Python — detecting cache hits with the AISG SDK

from aisg import AISG

client = AISG()
response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "What is GDPR?"}],
)

meta = response.aisg_metadata
if meta.get("cache_hit"):
    print(f"Cache hit: {meta['latency_ms']}ms, $0")
else:
    print(f"Cache miss: {meta['latency_ms']}ms")
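
For analytics, the same metadata can be aggregated into a hit rate. The helper below is a hypothetical sketch that assumes you collect each response's aisg_metadata as a dict; the sample records are illustrative values, not measurements.

```python
def cache_hit_rate(metadata_records: list[dict]) -> float:
    """Fraction of responses whose aisg_metadata reported a cache hit."""
    if not metadata_records:
        return 0.0
    hits = sum(1 for meta in metadata_records if meta.get("cache_hit"))
    return hits / len(metadata_records)


# Illustrative records shaped like the aisg_metadata object shown earlier.
records = [
    {"cache_hit": True, "latency_ms": 2},
    {"cache_hit": False, "latency_ms": 410},
    {"cache_hit": True, "latency_ms": 3},
    {"cache_hit": False, "latency_ms": 395},
]
print(f"hit rate: {cache_hit_rate(records):.0%}")
```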

Privacy Guarantees

Caching and privacy aren't in tension — they're complementary when the architecture is right:

✓ DLP runs before caching — PII is redacted before any cache key is computed or response is stored. The cache only ever holds cleaned content.
✓ One-way key derivation — cache keys are cryptographic hashes. The original prompt cannot be reconstructed from the key.
✓ Project isolation — cache entries are scoped to the originating project. One project's cache is never accessible to another.
✓ Automatic expiry — cached entries expire via configurable TTL. No permanent storage of LLM responses.
✓ Encryption in transit — all cache communication uses TLS.

How Caching Combines with Other AISG Features

Semantic caching doesn't exist in isolation. It's one layer in the AISG proxy pipeline, and it interacts with other features in useful ways:

Loop protection

Recursive agent loops are detected and killed before they can fill the cache with junk entries. The loop guard fires before the cache lookup.

Budget enforcement

Cache hits cost $0 and don't count against your spending cap. More cache hits means more budget headroom for requests that actually need provider calls.

Smart routing

Cache misses still benefit from multi-provider smart routing. If the cheapest provider is down, the request falls back automatically.

Webhook notifications

Cache hit/miss metrics are available in the project dashboard. Track your cache hit rate alongside DLP violations, injection attempts, and budget usage.

When Semantic Caching Won't Help

Caching is not a silver bullet. It has the most impact on workloads with high prompt repetition. It adds minimal value for:

• Unique conversational messages — free-form user conversations where every message is different will see very low cache hit rates.
• Streaming-only workloads — if 100% of your requests use streaming (SSE), nothing gets cached. Consider using non-streaming for batch/internal calls.
• Tool-heavy agents — agents that rely entirely on tool calls won't see cache hits because tool-call responses are excluded.

That said, even predominantly unique workloads benefit from caching the “long tail” of repeated requests — system prompt evaluations, health checks, and administrative queries that happen more often than you think.

Getting Started

If you're already using AI Security Gateway, semantic caching is active. Check your project dashboard to see your cache hit rate. If you're not using AISG yet:

Start using AISG in 2 lines

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="your-aisg-key",  # Get one at aisecuritygateway.ai
)

response = client.chat.completions.create(
    model="oah/llama-4-maverick",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
# PII redaction, prompt injection blocking, budget enforcement,
# loop protection, and semantic caching -- all automatic.

For full technical details, see the Semantic Caching documentation.

Want to self-host this?

AI Security Gateway is open source. Deploy the core AI security proxy on your own infrastructure — PII redaction, prompt injection blocking, and secret detection included. No account required.
