Inside DragonflyDB: Solving Redis's Long-Tail Cache Problem
How DragonflyDB reduces Redis-style cache pollution under long-tail access patterns.
The Long-Tail Distribution Problem
Most cache workloads are not evenly distributed.
A small number of keys get accessed constantly, while millions of other keys are touched rarely. This pattern is called a long-tail distribution.
It looks something like this:
Few keys → very hot
Many keys → rarely accessed
Modern applications naturally generate this pattern:
- trending posts
- viral videos
- popular APIs
- recommendation systems
- millions of low-traffic user records
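To make the shape of this traffic concrete, here is a minimal Python sketch that draws keys from a heavy-tailed (Pareto-style) distribution. The key names, constants, and distribution parameter are illustrative, not measurements from any real system.

```python
# Minimal sketch: a heavy-tailed access pattern over up to one million keys.
# All names and parameters here are illustrative.
import random
from collections import Counter

random.seed(42)

NUM_KEYS = 1_000_000
NUM_REQUESTS = 100_000

accesses = Counter()
for _ in range(NUM_REQUESTS):
    rank = random.paretovariate(1.1)            # heavy-tailed key rank: small ranks dominate
    key = f"key:{min(int(rank), NUM_KEYS)}"
    accesses[key] += 1

top10 = sum(count for _, count in accesses.most_common(10))
print(f"distinct keys touched: {len(accesses)}")
print(f"share of traffic on the 10 hottest keys: {top10 / NUM_REQUESTS:.0%}")
```

A handful of keys absorb most of the requests, while the rest of the keys are each touched once or twice. That is the long tail.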
This becomes a problem for traditional LRU caching.
What is LRU?
LRU stands for Least Recently Used.
When the cache becomes full, the least recently accessed key gets evicted.
Simple idea:
Recently used key → keep it
Old unused key → evict it
This works well for many workloads.
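For reference, here is a minimal sketch of an exact LRU cache in Python, built on OrderedDict. It mirrors the textbook policy described above; it is not Redis's internal implementation.

```python
# Minimal sketch of an exact LRU cache built on OrderedDict.
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()            # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)           # recently used key -> keep it
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # old unused key -> evict it
```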
But long-tail traffic exposes an important weakness.
The Core Problem
Imagine your cache contains:
Hot keys:
A, B, C
Rare keys:
X1, X2, X3...
Now suppose a burst of new requests arrives:
X1001
X1002
X1003
...
Redis inserts all these keys into the cache.
Since they were recently accessed, LRU temporarily considers them important.
Meanwhile, slightly older but genuinely valuable keys like A or B may now appear “less recent”.
Result:
Useful hot keys get evicted
Cold one-time keys stay
This is called cache pollution.
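A small, self-contained sketch makes the effect visible. The capacity and key names are deliberately tiny and purely illustrative: after a burst of one-off keys, the previously hot keys are gone.

```python
# Self-contained sketch of the burst above, using a tiny 5-slot exact LRU cache.
# Capacity and key names are illustrative.
from collections import OrderedDict

CAPACITY = 5
cache = OrderedDict()

def put(key, value):
    if key in cache:
        cache.move_to_end(key)
    cache[key] = value
    if len(cache) > CAPACITY:
        cache.popitem(last=False)             # evict the least recently used key

for hot in ["A", "B", "C"]:                   # hot keys, last touched before the burst
    put(hot, f"value-{hot}")

for i in range(1001, 1011):                   # burst of one-off keys X1001, X1002, ...
    put(f"X{i}", "cold-value")

print("A" in cache)                           # False: the hot key was pushed out
print(list(cache))                            # only the most recent cold keys remain
```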
Why Cache Pollution Matters
When genuinely hot keys get evicted, future requests miss the cache and fall back to the database or application layer.
Under sustained long-tail traffic, this can create cache thrashing:
insert cold key
↓
evict useful key
↓
cache miss
↓
reload hot key
↓
repeat
The cache still consumes memory, but stores fewer genuinely reusable items.
As hit rate drops, backend load and latency increase.
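As a rough, purely illustrative calculation: a cache serving 100,000 requests per second at a 95% hit rate sends 5,000 misses per second to the backend. If pollution drags the hit rate down to 70%, the backend suddenly absorbs 30,000 misses per second, a sixfold increase, while the cache uses exactly as much memory as before.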
Perfect LRU vs Redis LRU
A perfect LRU cache would maintain a globally ordered list of keys:
Most recently used
↓
A → B → C → D
↓
Least recently used
Every single access updates this ordering.
When eviction is needed:
Evict the key at the end
This is called a perfect global LRU because the cache always knows the exact usage order of every key.
The problem is scalability.
For millions of keys and huge request rates:
- every read potentially mutates internal eviction metadata
- synchronization overhead increases
- bookkeeping becomes expensive
- contention increases
Redis avoids this cost using approximate LRU.
Instead of maintaining one exact global ordering, Redis stores lightweight access timestamps and samples random keys during eviction.
Simplified version:
Pick random keys
↓
Estimate which one is oldest
↓
Evict it
This is significantly faster and cheaper.
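Here is a minimal sketch of the sampling idea, assuming per-key last-access timestamps held in a plain dictionary. It illustrates the mechanism only and is not Redis's code; in real Redis the sample size is controlled by the maxmemory-samples directive.

```python
# Minimal sketch of sampled ("approximate") LRU eviction over per-key timestamps.
# This illustrates the idea only; it is not Redis's implementation.
import random
import time

SAMPLE_SIZE = 5              # Redis exposes a similar knob as maxmemory-samples

last_access = {}             # key -> timestamp of the most recent access

def touch(key):
    last_access[key] = time.monotonic()

def evict_one():
    # Look at a few random keys and evict the one that looks oldest.
    sample = random.sample(list(last_access), min(SAMPLE_SIZE, len(last_access)))
    victim = min(sample, key=last_access.get)
    del last_access[victim]
    return victim

for k in ["A", "B", "C", "X1", "X2", "X3", "X4", "X5"]:
    touch(k)
touch("A")                   # A is now the most recently used key
print(evict_one())           # probably an older key, but only probably
```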
But the approximation has a blind spot. During large scans or bursts of one-hit keys, sampled eviction can pick victims that still have genuine reuse value, removing keys the workload actively depends on.
DragonflyDB’s Approach
DragonflyDB reduces cache pollution using a multi-stage eviction design.
Instead of immediately trusting every newly inserted key, Dragonfly separates the cache into two regions.
Dragonfly’s cache implementation, called Dash Cache, builds on the 2Q algorithm from the 1994 paper “2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm”. But it refines the idea for modern scale:
- It is resistant to fluctuations of recent traffic, unlike LRU.
- It does not require random sampling or other approximations like Redis.
- It has zero memory overhead per item.
- It has very small O(1) run-time overhead.
This is a significant departure from both perfect LRU and approximate LRU.
1. Probationary Buffer
New keys enter here first.
Think of this as a temporary holding area.
First access
↓
Placed into probationary buffer
At this stage, Dragonfly assumes:
“This key might be useful, but we do not know yet.”
The probationary buffer holds only a small portion of the total cache space — less than 10%. All newly added items compete with each other inside this constrained region.
Most long-tail keys are accessed only once.
So many of them naturally expire or get evicted from the probationary region without polluting the main cache.
2. Protected Buffer
If a key gets accessed again, it gets promoted.
Accessed again
↓
Move to protected buffer
The protected buffer contains keys that have demonstrated real reuse. These keys are much less likely to be evicted.
What happens when the protected buffer is full and a new key is promoted? The least recently used item in the protected buffer is demoted back to the probationary buffer rather than removed from the cache entirely. It gets a second chance.
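Here is a minimal sketch of this two-segment idea as a small Python class: new keys land in probation, a second access promotes them, and an overflowing protected segment demotes its least recently used entry back to probation. It is a simplified illustration of the behaviour described above, not DragonflyDB's actual Dash Cache implementation.

```python
# Minimal sketch of a two-segment (2Q-style) cache: a small probationary
# segment for new keys and a protected segment for keys with proven reuse.
# A simplified illustration, not DragonflyDB's Dash Cache code.
from collections import OrderedDict


class TwoSegmentCache:
    def __init__(self, capacity: int, probation_fraction: float = 0.1):
        self.probation_cap = max(1, int(capacity * probation_fraction))
        self.protected_cap = capacity - self.probation_cap
        self.probation = OrderedDict()    # new, unproven keys (LRU order)
        self.protected = OrderedDict()    # keys accessed more than once

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            # Second access: the key has proven reuse, so promote it.
            value = self.probation.pop(key)
            self._insert_protected(key, value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
            return
        # New (or still unproven) keys go to the probationary segment.
        self.probation[key] = value
        self.probation.move_to_end(key)
        if len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)    # one-hit keys fall out here

    def _insert_protected(self, key, value):
        self.protected[key] = value
        self.protected.move_to_end(key)
        if len(self.protected) > self.protected_cap:
            # Demote the protected LRU back to probation: a second chance,
            # not a full eviction.
            old_key, old_value = self.protected.popitem(last=False)
            self.put(old_key, old_value)
```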
This creates an important distinction:
Recently seen
vs
Actually valuable
Traditional LRU treats both similarly.
DragonflyDB does not.
Why This Works Better
Imagine a cache that can hold 10,000 keys.
Your workload looks like this:
- Hot keys: user:100, user:200, user:300, each accessed 1,000 times per minute.
- Long-tail keys: user:X1, user:X2, user:X3, … each accessed once and never again.
Now a batch job suddenly requests 15,000 unique keys.
With Redis LRU:
All 15,000 keys enter the cache. Because they were recently accessed, LRU considers them important. Your genuinely hot keys now look “older” relative to the flood of new entries. Approximate LRU sampling starts evicting hot keys. Within minutes, cache hit rate can drop dramatically.
With DragonflyDB Dash Cache:
The 15,000 new keys enter the probationary buffer (under 10% of total space). Since each is accessed only once, they naturally get evicted from this small buffer. The protected buffer containing your hot keys remains untouched. Your cache hit rate stays at stable levels.
This is the difference between recently seen and actually valuable.
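To see the contrast end to end, here is a small simulation of exactly this scenario. It assumes the LRUCache and TwoSegmentCache sketches from the earlier sections are defined in the same file; the capacity, key names, and batch size mirror the example above.

```python
# Simulation of the scenario above. Assumes the LRUCache and TwoSegmentCache
# sketches from the earlier sections are defined in the same file.

def access(cache, key, value="v"):
    # Cache-aside pattern: on a miss, load the value and insert it.
    if cache.get(key) is None:
        cache.put(key, value)

def run(cache):
    hot_keys = ["user:100", "user:200", "user:300"]

    # Hot keys see repeated traffic, so they are established in the cache
    # (and, in the two-segment cache, promoted to the protected buffer).
    for _ in range(3):
        for key in hot_keys:
            access(cache, key, "hot-value")

    # Batch job: 15,000 unique keys, each requested exactly once.
    for i in range(15_000):
        access(cache, f"user:X{i}", "cold-value")

    return [k for k in hot_keys if cache.get(k) is not None]

print("plain LRU, hot keys still cached:  ", run(LRUCache(10_000)))
print("two-segment, hot keys still cached:", run(TwoSegmentCache(10_000)))
```

With the LRU sketch, the list of surviving hot keys comes back empty. With the two-segment sketch, all three hot keys survive, because the burst never left the small probationary region.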
Under long-tail workloads:
Millions of one-hit keys arrive
In Redis LRU:
Cold keys compete equally with hot keys
In DragonflyDB:
Cold keys mostly stay trapped inside the probationary region
Only repeatedly accessed keys earn long-term protection.
Most importantly:
cache hit rate stays stable
And that is the real optimization.
Final Takeaway
The problem with long-tail workloads is not Redis's speed.
Redis is extremely fast.
The real issue is this:
LRU mistakes temporary activity for long-term value.
That leads to cache pollution, lower hit rates, and unnecessary backend load.
DragonflyDB improves this by being more selective about what deserves to stay in cache.
And at modern scale, that distinction matters a lot.