After learning about LRU caching and where it fits in modern systems, I moved on to the next question: how do you keep caches correct when data changes? This led me down the rabbit hole of cache invalidation, and specifically the tricky world of distributed cache invalidation in microservices.
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton (variations of this quote are collected in Martin Fowler’s article, Two Hard Things)
After digging into this, I’m starting to understand why.
The Basic Problem
When you have data in two places - your database (source of truth) and your cache - what happens when the source data changes? Your cache now has stale data. How do you handle this?
For a single service with its own cache, this is manageable. But in microservices, the problem multiplies. Imagine you have an Order Service, Inventory Service, Product Service, and Recommendation Service. Each has its own Redis cache. When Product Service updates a product’s price, how do the other services know their cached product data is now stale?
Main Invalidation Strategies
Before getting into the distributed problem, it’s worth understanding the basic patterns:
Cache-Aside (Lazy Loading)
This is the most common pattern. On reads, you check cache first, and if it’s a miss, you read from the database and populate the cache. On writes, you update the database and then invalidate (delete) the cache entry. The next read will be a cache miss and reload fresh data.
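Here’s a minimal sketch of cache-aside in the same style as the snippets later in this post (the cache and db clients are assumed, not any particular library):

// Read path: check the cache, fall back to the database on a miss
function getProduct(productId) {
  const key = `product:${productId}`;
  const cached = cache.get(key);
  if (cached) return cached;                 // hit: serve straight from cache

  const product = db.getProduct(productId);  // miss: read the source of truth
  cache.set(key, product);                   // populate for the next reader
  return product;
}

// Write path: update the database, then delete (not update) the cache entry
function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);
  cache.delete(`product:${productId}`);      // next read repopulates with fresh data
}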
You might wonder: why not populate the cache immediately after the write instead of waiting for the next read? There are a few reasons this is typically avoided:
- You might not need it - if this data is rarely read, you’ve wasted effort caching it
- Race conditions - if another write happens before the cache update completes, you could cache stale data
- Write performance - writes become slower since you’re doing both the DB update and the cache write synchronously

That said, if you know the data will be read immediately after writes (like a user updating their profile and viewing it), pre-populating the cache can make sense. That’s essentially what the write-through pattern does.
When to use it: Most general-purpose caching scenarios. This is your default.
Write-Through
Every write goes through the cache, which immediately writes to the database synchronously. The cache is always consistent with the database, but writes are slower.
When to use it: When you need strong consistency and can tolerate slower writes.
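A rough sketch of what this looks like at the application level, using the same assumed cache and db clients (a dedicated write-through caching layer would hide this behind its own API):

// Write-through sketch: every write updates the database and the cache
// together, synchronously, so cached reads always reflect the latest write.
function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);        // persist first
  cache.set(`product:${productId}`, newData);  // cache now matches the database
}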
Write-Back (Write-Behind)
Writes go to the cache and return immediately. The cache asynchronously flushes to the database later, often in batches. Writes are very fast, but there’s a risk of data loss if the cache crashes before flushing.
When to use it: High-write scenarios where you can tolerate some data loss risk.
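A toy sketch of the idea, again with assumed cache and db clients - real write-back is usually handled by the caching layer itself rather than application code:

// Write-back sketch: writes hit the cache and a dirty-key buffer immediately;
// a background timer flushes the buffer to the database in batches.
const dirty = new Map();

function updateProduct(productId, newData) {
  cache.set(`product:${productId}`, newData);  // fast path: cache only
  dirty.set(productId, newData);               // remember what still needs persisting
}

setInterval(() => {
  for (const [productId, data] of dirty) {
    db.updateProduct(productId, data);         // flush asynchronously
    dirty.delete(productId);
  }
}, 5000);                                      // anything still in `dirty` is lost if the process crashes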
TTL (Time-To-Live)
Instead of actively invalidating, just set an expiration time on cached data. Simple, but you’re accepting that data can be stale for up to the TTL duration.
When to use it: When you can tolerate temporary staleness, often combined with other strategies.
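With Redis this is a one-liner - the example below assumes an ioredis-style client and a 60-second budget for staleness:

// Expire the entry after 60 seconds instead of invalidating it explicitly;
// the worst case is 60 seconds of stale data.
redis.set(`product:${productId}`, JSON.stringify(product), 'EX', 60);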
The Microservices Challenge
Here’s where it gets interesting. In microservices, you can’t just have one service call other services directly to tell them to invalidate their caches. That would create tight coupling, slow down your writes, and fail to scale.
The standard approach is event-driven invalidation using a message bus like Kafka, RabbitMQ, or AWS SNS/SQS:
// Product Service publishes event
function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);
  cache.delete(`product:${productId}`);
  
  // Publish event - fire and forget
  eventBus.publish('product.updated', {
    productId: productId,
    timestamp: Date.now(),
    fields: ['price', 'description']
  });
}
 
// Other services subscribe
// Inventory Service listener
eventBus.subscribe('product.updated', (event) => {
  cache.delete(`product:${event.productId}`);
  cache.delete(`inventory:${event.productId}`);
});

This decouples services, makes writes fast (async), and makes it easy to add new services that just subscribe to relevant events.
The downside is eventual consistency - there’s a delay between the update and invalidation across services. Event delivery becomes critical, and debugging distributed flows gets harder.
Alternative Approaches
Redis Pub/Sub
If all services share the same Redis instance, you can use Redis’s built-in pub/sub for invalidation messages. It’s simpler than a full message queue, but less robust: there’s no message persistence, so a subscriber that’s down simply misses the message.
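A minimal sketch of the idea, assuming a node-redis v4 style subscriber (the channel name and payload shape are made up for illustration):

// Publisher side (e.g. Product Service)
function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);
  redis.publish('cache-invalidation', JSON.stringify({ key: `product:${productId}` }));
}

// Subscriber side (any service sharing the Redis instance)
subscriber.subscribe('cache-invalidation', (message) => {
  const { key } = JSON.parse(message);
  localCache.delete(key);  // fire-and-forget: if this service was down, the message is gone
});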
Cache Versioning
Instead of invalidating, version your cache keys. Store a version number in a shared location, and when you update data, increment the version. Old cached data at the previous version becomes effectively invalid without actively deleting anything.
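A sketch of versioned keys, following the synchronous-looking style of the other snippets here (the version lives in Redis and is bumped on every write):

// "Invalidate" by moving the version forward; readers build cache keys from
// the current version, so older entries are simply never read again.
function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);
  redis.incr(`product:${productId}:version`);
}

function getProduct(productId) {
  const version = redis.get(`product:${productId}:version`) || 0;
  const key = `product:${productId}:v${version}`;
  // ... normal cache-aside lookup using the versioned key; give entries a TTL
  // so abandoned versions eventually age out of memory ...
}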
Hybrid Approach
Most production systems combine multiple strategies for defense in depth:
function updateProduct(productId, newData) {
  // 1. Update database
  db.updateProduct(productId, newData);
  
  // 2. Invalidate local cache
  localCache.delete(`product:${productId}`);
  
  // 3. Invalidate shared Redis cache
  redis.del(`product:${productId}`);
  
  // 4. Publish event for other services
  kafka.publish('product.updated', { productId, timestamp: Date.now() });
  
  // 5. Set TTL as safety net
  redis.set(`product:${productId}:lastUpdated`, Date.now(), 'EX', 3600);
}

This combines direct invalidation (fast path), event-driven (decoupled), timestamp comparison (safety net), and TTL (ultimate fallback).
The Hard Problems
Event Ordering
Events can arrive out of order. Say a product’s price is updated from $100 to $200 and the corresponding events reach a consumer in reverse; if the consumer acts on the event contents (or a read races the invalidations), it can end up re-caching the old $100 price. Now your cache has stale data even though every event was processed.
Solutions include putting version numbers in events, comparing timestamps before acting on an event, or leaning on message-queue ordering guarantees such as Kafka’s per-partition ordering (with the product ID as the partition key).
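For example, a consumer can refuse to act on anything older than what it has already seen - a sketch, assuming the publisher adds a monotonically increasing version field to each event:

// Ignore stale events: only act if the event is newer than the last one
// applied for that product.
const lastApplied = new Map();

eventBus.subscribe('product.updated', (event) => {
  const prev = lastApplied.get(event.productId) || 0;
  if (event.version <= prev) return;               // out-of-order or duplicate: skip
  lastApplied.set(event.productId, event.version);
  cache.delete(`product:${event.productId}`);
});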
Partial Failures
Your database update succeeds, but Redis delete fails due to a network issue. Your event publishes successfully. Now Redis has stale data, but other services invalidate correctly. This is where TTL as a fallback becomes critical.
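One pragmatic mitigation is to treat the cache delete as best-effort and let TTL (plus the event) clean up eventually - a sketch, with the metrics client assumed as in the observability examples later:

function updateProduct(productId, newData) {
  db.updateProduct(productId, newData);
  try {
    redis.del(`product:${productId}`);
  } catch (err) {
    // Don't fail the write: the TTL bounds how long this entry can stay stale,
    // but emit a metric so a spike in failures is visible.
    metrics.increment('cache.invalidation_failure');
  }
  eventBus.publish('product.updated', { productId, timestamp: Date.now() });
}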
Aggregate Invalidation
If you cache “top 10 products,” when do you invalidate it? When any product updates (too aggressive)? Only when a top-10 product updates (need to track membership)? On a schedule (stale between refreshes)? This is genuinely hard.
The common approach is using short TTLs for aggregates and longer TTLs for individual items.
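In practice that can be as simple as two different expirations (values here are illustrative):

// The aggregate is cheap to rebuild and hard to invalidate precisely: short TTL.
redis.set('products:top10', JSON.stringify(topProducts), 'EX', 60);

// Individual items also get event-driven invalidation, so a long TTL is just a backstop.
redis.set(`product:${productId}`, JSON.stringify(product), 'EX', 3600);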
Cross-Region Caching
Services deployed in different regions (like US-East and EU-West) typically each have their own regional Redis instance for lower latency. Even though the application servers are stateless, the shared cache layer (Redis) is region-specific. An update in US-East needs to invalidate the cache in EU-West’s Redis instance. You need to set up infrastructure to handle this, such as:
- A global message bus like Kafka with cross-region replication (your services publish invalidation events that get consumed across regions)
- Redis replication across regions (though this replicates data, not invalidation commands)
- Or simply accept eventual consistency across regions with TTL as your fallback

Observability is Critical
With distributed invalidation, debugging becomes difficult. You need proper observability tools to understand what’s happening across your system.
Distributed Tracing
Tools like Jaeger, Zipkin, or AWS X-Ray let you follow a single request across multiple services. You instrument your code to create “spans” that track each operation:
function updateProduct(productId, newData) {
  const span = tracer.startSpan('updateProduct');
  span.setTag('productId', productId);
  
  db.updateProduct(productId, newData);
  redis.del(`product:${productId}`);
  kafka.publish('product.updated', { productId, traceId: span.context().traceId });
  
  span.finish();
}

When you publish the event, you include the trace ID. Other services continue the trace when they receive the event. This lets you see the full timeline: database update → cache delete → event publish → event received by service A → service A cache delete, etc.
Event Monitoring
This involves tracking your message bus metrics. Most message systems (Kafka, RabbitMQ, SQS) provide metrics like:
- Events published vs. events consumed (are messages being lost?)
- Consumer lag (how far behind are your consumers?)
- Failed message processing (are invalidations failing?)

You can use tools like Kafka’s built-in monitoring, Prometheus + Grafana, or cloud provider dashboards (AWS CloudWatch, etc.).
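On top of the broker’s own metrics, it helps to emit counters from both sides of the bus so published vs. consumed can be compared on one dashboard - a sketch, with the metrics client assumed as in the examples below:

// Count every invalidation event published...
function publishInvalidation(topic, payload) {
  eventBus.publish(topic, payload);
  metrics.increment('invalidation.published', { topic });
}

// ...and every one consumed, so a growing gap between the two shows up quickly.
eventBus.subscribe('product.updated', (event) => {
  metrics.increment('invalidation.consumed', { topic: 'product.updated' });
  cache.delete(`product:${event.productId}`);
});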
Cache Hit Rate Metrics
You instrument your cache access code to track hits vs misses:
function getProduct(productId) {
  const cached = cache.get(`product:${productId}`);
  
  if (cached) {
    metrics.increment('cache.hit', { key: 'product' });
    return cached;
  }
  
  metrics.increment('cache.miss', { key: 'product' });
  const product = db.getProduct(productId);
  cache.set(`product:${productId}`, product);
  return product;
}

You send these metrics to a system like Prometheus, Datadog, or CloudWatch. A sudden drop in hit rate might indicate your invalidation isn’t working properly.
Staleness Detection
This requires storing metadata about when data was cached vs when it was last updated:
function getProduct(productId) {
  let cached = cache.get(`product:${productId}`);  // let, since it may be cleared below
  const lastUpdated = db.getLastUpdated(productId);  // indexed query
  
  if (cached && cached.cachedAt < lastUpdated) {
    metrics.increment('cache.stale_detected');
    cache.delete(`product:${productId}`);
    cached = null;
  }
  
  // ... rest of cache logic
}

You can also periodically run jobs that sample cache entries and compare them to the database, alerting when staleness exceeds thresholds.
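Such a job could look something like this - db.getRandomProductIds is a hypothetical helper, and the interval and sample size are arbitrary:

// Every 5 minutes, sample 100 cached products and compare the time they were
// cached against the time the row was last updated in the database.
setInterval(() => {
  const sampleIds = db.getRandomProductIds(100);
  for (const id of sampleIds) {
    const cached = cache.get(`product:${id}`);
    if (!cached) continue;
    if (cached.cachedAt < db.getLastUpdated(id)) {
      metrics.increment('cache.stale_sampled');   // alert if this rate climbs
    }
  }
}, 5 * 60 * 1000);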
What I Learned
Cache invalidation is genuinely one of the hardest problems in distributed systems. The complexity comes not from any single pattern being difficult, but from the combinatorial explosion of failure modes when you have multiple services, multiple cache layers, network delays, and partial failures.
The partial failure problem particularly stands out. When your database update succeeds but your event publish fails, or vice versa, you end up with inconsistent state across your system. The common solution I kept encountering is the transactional outbox pattern - instead of publishing events directly, you write them to a database table in the same transaction as your data update. A separate background process then reads from that table and publishes to your message bus, retrying until successful. This guarantees that if the database write succeeds, the event will eventually publish.
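A sketch of the idea - the transaction helper and outbox table functions here are hypothetical stand-ins for whatever your database client provides:

// The event row is written in the same transaction as the data change,
// so they commit or roll back together.
function updateProduct(productId, newData) {
  db.transaction((tx) => {
    tx.updateProduct(productId, newData);
    tx.insertOutbox({ topic: 'product.updated', payload: { productId }, createdAt: Date.now() });
  });
}

// A separate relay drains the outbox and publishes, retrying until it succeeds.
setInterval(() => {
  for (const row of db.getUnpublishedOutboxRows()) {
    eventBus.publish(row.topic, row.payload);
    db.markOutboxPublished(row.id);
  }
}, 1000);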
The key insight is that perfect consistency is often impossible or too expensive. Most production systems accept eventual consistency and build in multiple layers of defense: direct invalidation for the fast path, events for decoupling, timestamps for detection, and TTL as the ultimate fallback. Understanding these patterns helps you make better tradeoffs. Do you need strong consistency or can you tolerate a few seconds of stale data? Is this data read-heavy or write-heavy? How critical is it if the cache is wrong?
Like the LRU cache journey, this reinforced that building distributed systems is about understanding tradeoffs and picking the right tool for your specific constraints. There’s no one-size-fits-all solution - just patterns you can combine based on your needs.
Cheers!