TL;DR: Open addressing crushed chaining with sequential keys — 19x faster on insert. Then we switched to normally distributed keys and open addressing became 11,000x slower. The “best” hash map depends entirely on your key distribution, and the best-case benchmark is the one you should trust least.
I’m learning performance engineering by building data structures from scratch and benchmarking them. I’m working with Claude (an LLM) as a collaborator — it provides the textbook analysis, I write the code, and we both make predictions before running anything.
This post is about the first thing we built: hash maps. We wrote four implementations, predicted how they’d perform, ran benchmarks with sequential keys, and declared open addressing the winner. Then we changed the key distribution and watched the winner become 11,000x slower.
Source code: seanwilliamcarroll/ds — raw benchmark data: release
The implementations
Four int → int hash maps, each using std::hash<int> — which on most
compilers is the identity function. We confirmed this by inspecting the compiled
assembly: hash(42) compiles to literally 42, no computation at all.1
- Chaining — separate chaining with linked lists. Each node heap-allocated.
- Linear Probing (LP) — open addressing with tombstone deletion.
- Robin Hood (RH) — open addressing with displacement insertion and backshift deletion (no tombstones).
- std::unordered_map — the standard library baseline.
The scenarios
Four operations, all at load factor 0.75 with N = 65,536 entries:
| Scenario | What’s timed |
|---|---|
| Insert | Insert N keys into an empty map |
| FindHit | Find all N keys (all present) |
| FindMiss | Find N keys that aren’t in the map |
| EraseAndFind | Erase half the entries, then find the survivors |
All benchmarks ran on Apple Silicon M4, compiled with LLVM/Clang 21 at -O3. Medians of 10 repetitions.
The three key distributions
We test each scenario with three key distributions. This is the variable that matters — and the one I almost didn’t think to vary.
Sequential: Keys 0, 1, 2, …, N-1. With identity hash and a power-of-two table, these map to consecutive slots. This is the best case for open addressing — the hardware prefetcher handles linear scans perfectly.
Uniform random: N unique keys drawn uniformly from [0, 10N). Scattered across the table. Each lookup is a potential cache miss regardless of data layout.
Normal (clustered): Keys drawn from a normal distribution centered at N/2 with standard deviation N/8. About 68% of keys cluster into roughly N/4 consecutive slots around the center of the table.
Our predictions
Before running anything, we wrote down what we expected.
I knew chaining — it’s the textbook default — and I’d heard the complaints about
std::unordered_map being slow. Linear probing and Robin Hood were new to me.
Claude provided the textbook analysis of clustering, probe distances, and
tombstone behavior.
I worried about Robin Hood: the displacement chains during insert and the backshifting during delete sounded expensive. I also worried about LP: wouldn’t clusters degrade lookup?
Claude worried about clustering: primary clustering should hurt LP under non-ideal conditions.
We wrote a prediction matrix:
| Scenario | Chaining | LP | RH | std |
|---|---|---|---|---|
| Insert | medium | fast | medium — swap overhead | slow |
| FindHit | slow — pointer chase | medium — clusters | fast — uniform probes | slow |
| FindMiss | slow — walk full chain | medium | fast — early termination | slow |
| EraseAndFind | medium | slow — tombstones hurt | fast — clean table | medium |
Our key beliefs:
- Chaining degrades gracefully; open addressing has sharp failure modes
stdis always slowest (chaining overhead + stdlib bloat)
Act 1: Sequential keys
Sequential keys first. This is where most hash map benchmarks begin and end.
Results (cpu_time, nanoseconds)
Bold marks the winner in each row.
| Scenario | LP | RH | Chain | std |
|---|---|---|---|---|
| Insert | 117.4K | 197.7K | 2.30M | 1.20M |
| FindHit | 45.6K | 48.7K | 51.7K | 97.6K |
| FindMiss | 48.8K | 32.6K | 37.6K | 80.7K |
| EraseAndFind | 21.2K | 25.9K | 319.6K | 354.3K |
LP wins three of four scenarios. RH takes FindMiss. Open addressing dominates.
Analysis
Insert — LP is fastest, as predicted. But it’s 20x faster than chaining. We
expected maybe 2-3x. And our chaining is slower than std — we predicted the
opposite. std::unordered_map’s allocator handles per-node allocation better
than our naive make_unique calls.
FindHit — We predicted RH fastest thanks to its more uniform probe distribution. In reality, LP, RH, and Chain are all within 13% of each other. The table is about 780KB at this size — well beyond L1 cache (128KB on M4). At this scale, the cost of a cache miss dominates the number of probes. Whether you probe 2 slots or 4, you’re paying for the same L2 cache hit.
FindMiss — RH wins as predicted. Chaining takes second — an empty bucket is an instant “not found,” and at load 0.75 a quarter of buckets are empty.
EraseAndFind — This is the prediction we were most confident about, and most wrong. We said LP would be slow because tombstones degrade subsequent finds. We said RH would be fast because backshift deletion keeps the table clean. Reality: LP is fastest, 15x faster than chaining. Tombstone damage is real in theory — but with sequential keys, LP has both advantages: probes are short (keys scatter perfectly) and each probe is cache-friendly (a linear scan through a flat array). Chaining has neither: it chases pointers through scattered heap nodes, and each node is a potential cache miss. The constant factor of a cache miss dwarfs the algorithmic difference in probe count.
The tempting conclusion
Open addressing dominates. LP wins three of four scenarios. Chaining is 15-19x slower on insert and erase. Case closed?
This is where many performance blog posts would end. We almost did. But I kept coming back to the same question: these keys are 0, 1, 2, …, N-1, hashed by the identity function into consecutive slots. How often does that actually happen? Almost every optimization here is a tradeoff based on an assumption about the input. What happens when the assumption breaks?
Act 2: Uniform random keys
Same scenarios, but now keys are uniformly random instead of sequential. No more consecutive slots. No more prefetcher-friendly linear scans. Every lookup is a potential cache miss.
Results (cpu_time, nanoseconds)
| Scenario | LP | RH | Chain | std |
|---|---|---|---|---|
| Insert | 698.3K | 1.01M | 4.33M | 2.10M |
| FindHit | 127.3K | 138.3K | 149.4K | 356.0K |
| FindMiss | 423.4K | 289.4K | 272.9K | 589.0K |
| EraseAndFind | 102.3K | 62.7K | 566.3K | 477.0K |
Analysis
The sequential results were partly cache flattery — the hardware prefetcher was doing work that open addressing got credit for. Removing that subsidy reveals how much of Act 1’s gap was the algorithm and how much was the hardware.
LP’s insert advantage drops from 20x to 6x over chaining. FindHit stays close — LP, RH, and Chain are within 17% of each other. RH takes EraseAndFind from LP. And the asymmetry is consistent: random keys hurt open addressing far more than chaining. LP’s insert slows down 6x from sequential; chaining’s slows down 1.9x. LP’s FindMiss is the most dramatic — 8.7x slower, as scattered miss probes through tombstones are expensive when every probe is a cache miss.
The lesson: you can’t separate the algorithm from the hardware. Sequential keys didn’t just test the hash map — they tested the prefetcher. Open addressing still wins on absolute numbers here, but its margin is thinner than Act 1 suggested, and it’s not clear how much further it can erode.
Act 3: Normal keys
After the uniform results, I kept thinking: how often does a real system insert sequential keys into a hash map? With identity hash, the distribution of keys is the distribution of slots. And integer keys cluster in plenty of real scenarios — auto-increment IDs, timestamps, geographic coordinates, sensor readings.
So we tried a normal distribution: keys centered at N/2, standard deviation N/8. About 68% of keys cluster into roughly N/4 consecutive slots. At load factor 0.75, open addressing normally expects ~4 probes per operation. What happens when thousands of keys hash to the same neighborhood?
Results (cpu_time, nanoseconds)
| Scenario | LP | RH | Chain | std |
|---|---|---|---|---|
| Insert | 755M !! | 2,049M !! | 4.37M | 2.11M |
| FindHit | 525M !! | 608M !! | 493K | 363K |
| FindMiss | 2.18M !! | 2.46M !! | 50.8K | 344K |
| EraseAndFind | 789M !! | 52.5M !! | 589K | 449K |
The !! numbers are catastrophic — millions of nanoseconds to two seconds
per batch on just 65,536 entries. Every scenario is catastrophic for open
addressing — including FindMiss, which held up well in Acts 1 and 2.
Analysis
Open addressing with identity hash went quadratic.
LP’s Insert takes 755 million nanoseconds — 6,431x slower than with sequential keys. LP’s FindHit takes 525 million — 11,500x slower. These aren’t “slow.” They’re broken. The normal distribution creates a massive cluster around N/2, and with identity hash mapping those keys directly to those slots, every insert and find has to probe through the entire cluster.
Robin Hood is even worse on some operations. RH Insert takes 2 billion nanoseconds. Robin Hood’s displacement rule — “take from the rich, give to the poor” — means each insert into the dense cluster displaces an element, which displaces another, propagating through thousands of entries. Its backshift deletion (shifting elements backward on erase to avoid tombstones) causes the same cascading effect in reverse.
Chaining barely notices. Chain FindHit goes from 51.7K (sequential) to 493K (normal) — a 9.5x slowdown, not an 11,000x one. The clustered buckets have longer chains, but each chain is independent. There’s no cascading effect. We predicted chaining would degrade gracefully. We were right — we just didn’t appreciate what that would be worth until we saw the alternative.
And std::unordered_map wins — the implementation we predicted would always
be slowest. Here it’s the fastest on Insert, FindHit, and EraseAndFind. std is
chaining under the hood, so it’s immune to the clustering catastrophe. Its
node-based allocation, which was a liability with sequential keys, is irrelevant
when the alternative is probing through 50,000 occupied slots.
FindMiss completes the picture. With sequential and uniform keys, FindMiss was the one scenario where open addressing held up well — RH’s early termination and LP’s short probes kept miss lookups fast. I expected the same with normal keys. The miss keys hash to completely different slots than the hit keys (we verified: 0% home slot overlap). How could a cluster hurt you if your keys don’t even hash near it?
It turns out open addressing’s displacement mechanism spreads the cluster far beyond its home slots. The hit keys cluster around home slots near N/2, but collisions push displaced entries into neighboring slots, which push their neighbors further out. At N=65,536 (table size 131,072), the cluster extends across 23,000+ contiguous occupied slots — roughly 18% of the table. Miss keys that hash to “empty” regions land inside this occupied run and have to probe all the way to the far edge.
We added probe counting to LP to confirm:
| N | Avg probes per miss | Max probes per miss |
|---|---|---|
| 256 | 1.4 | 53 |
| 4,096 | 4.6 | 839 |
| 65,536 | 64.7 | 23,425 |
Every lookup correctly returns “not found.” It just takes thousands of probes to get there. This isn’t a cache problem — we ran a warm-up pass to pre-populate the cache and it made zero difference. It’s genuine O(cluster_size) probing.
Robin Hood doesn’t escape either. I initially expected RH’s early termination to save it — a miss lookup should stop as soon as it sees an occupant with a shorter probe distance. But the miss keys land among entries that have been displaced far from their home slots. These displaced entries have very high probe distances, so early termination never fires — the occupant’s probe distance is always greater than the searcher’s. RH scores 2.46M on FindMiss/Normal, slightly worse than LP’s 2.18M.
Only chaining is immune. Chaining’s collisions are vertical (longer chains at the same bucket), not horizontal (spilling into neighbors). An empty bucket stays empty no matter how dense the neighboring buckets are. Chain scores 50.8K — barely different from its 37.6K with sequential keys.
The normal distribution doesn’t just make your keys slow to find — it pollutes the entire table, making unrelated keys slow too. The cluster is contagious.
Closing
| Scenario | Sequential | Uniform | Normal |
|---|---|---|---|
| Insert | LP | LP | std |
| FindHit | LP | LP | std |
| FindMiss | RH | Chain | Chain |
| EraseAndFind | LP | RH | std |
We started with reasonable predictions, benchmarked with sequential keys, and thought we understood hash map performance. Then we changed the key distribution and every conclusion inverted.
The thing that sticks with me is how confident the sequential results felt. LP was 19x faster — that’s not a marginal difference you can explain away. It felt like a settled question. But the result was an artifact of the best-case input, and the best case is the one you should trust least.
Every optimization is a tradeoff built on an assumption: about the input, the hardware, the access pattern. Open addressing assumes keys scatter reasonably across the table. Identity hash assumes the keys are already well-distributed. When those assumptions hold, you get 19x wins. When they don’t, you get 11,000x losses. The question isn’t “which hash map is fastest?” It’s “what does your data actually look like, and how badly does your design degrade when you’re wrong about that?”
N = 65,536. Apple Silicon M4 (P-core, 128KB L1, 16MB L2). LLVM/Clang 21.1.8, -O3. Load factor 0.75. Medians of 10 repetitions. All code is on GitHub.
Footnotes
-
At -O3 the compiler constant-folds
std::hash<int>()(42)to the literal 42. At -O0 it emits a function call, but the function body isldrsw x0, [sp, #4]; ret— load the int, sign-extend tosize_t, return it. Seescripts/verify_identity_hash.sh. ↩