Sean William Carroll

Benchmarking Tries, Part 2: What "Arena" Actually Means

TL;DR: My first “arena” trie wasn’t an arena at all — fixing that revealed real cache locality effects I’d only assumed before. Then the disassembly kept raising new questions, and each one led to another variant worth testing.

Note: This post corrects errors from the original trie benchmark post. The “arena” in that post used vector<unique_ptr<Node>>, which still scatters nodes across the heap — the insert speedup was real, but the cache locality explanation was wrong. I’m leaving the original as-is for reference.


Contents

  1. The mistake
  2. Five implementations
  3. Insert: arenas win, deque wins biggest
  4. Search hits: cache locality is real (this time)
  5. Search misses: when one instruction is the whole hot path
  6. Zero-as-null: fixing what the disassembly showed us
  7. Data-oriented layout: separating hot and cold fields
  8. The rest of the picture
  9. Conclusion

Source code: seanwilliamcarroll/ds — raw benchmark data: release, debug


The mistake

In the previous post, I compared two trie implementations: an “arena” that stored nodes in a vector<unique_ptr<Node>>, and a PtrTrie where each node owns its children via unique_ptr. I attributed the arena’s insert speedup partly to cache locality — nodes being contiguous in memory.

The problem: vector<unique_ptr<Node>> is not an arena. Each make_unique<Node>() is a separate heap allocation. The vector holds the unique_ptr objects contiguously — 8-byte pointers packed together — but the nodes themselves are scattered across the heap, exactly like PtrTrie. There may have been some partial cache locality benefit from the contiguous pointer array itself (sequential pointer loads could prefetch well), but the nodes those pointers referred to had no locality guarantees whatsoever. The insert speedup was real (fewer allocation events due to vector’s geometric growth vs individual make_unique calls), but the cache locality story I told about node layout was wrong.

Once I realized this, the obvious question: what happens when you build a real arena?


Five implementations

What started as three implementations grew to five as the benchmarks kept raising questions.

The first three were the straightforward correction — build real arenas and compare:

IndexArenaTrie (sentinel): Nodes live directly in a vector<Node> — actual contiguous memory. Children store size_t indices into the vector rather than pointers. Indices survive reallocation, so the vector can grow freely. This is the “real” arena: nodes are physically adjacent in memory. A missing child is represented by a sentinel value (numeric_limits<size_t>::max()).

DequeArenaTrie: Nodes live in a deque<Node>. A deque allocates in fixed-size chunks — nodes within a chunk are contiguous, but chunks themselves may be scattered. Children store Node* pointers, which are safe because deque guarantees pointer stability on push_back. A middle ground: better locality than scattered heap allocations, but not fully contiguous.

PtrTrie: The baseline from the original post. Each node owns its children via array<unique_ptr<Node>, 26>. Every new node is an individual heap allocation. Nodes end up wherever the allocator puts them.

A side benefit of the index-based approach: it eliminated the const_cast pattern I used in the other implementations. Since find_last_node_index returns a size_t, there’s no const/non-const pointer overload to deduplicate.

The next two came from looking at the disassembly. The search miss results showed IndexArena’s sentinel check compiling to cmn + b.eq (two instructions), while PtrTrie’s nullptr check got cbz — a fused compare-and-branch (one instruction). That raised the question: what if we used zero as our null index instead?

IndexArenaTrie (zero-as-null): Same code, templated with a bool UseSentinelValue parameter. When false, NULL_INDEX = 0 and the root node lives at index 1 (reserving index 0 as “null”). Children default-initialize to 0, making the null check eligible for cbz.

Then, after seeing how much the per-node layout affected cache behavior, I wanted to try something I’d seen in CppCon talks about data-oriented design — organizing data by access pattern rather than logical grouping. I hadn’t tried it before, but the trie seemed like a clean case.

DataOrientedIndexArenaTrie: Same zero-as-null strategy, but is_end_of_word is pulled out of the Node struct into a parallel vector<bool>. Node shrinks from 216 bytes¹ to 208 bytes (just the children array, no padding). During search traversal, every cache line is 100% useful data.


Insert: arenas win, deque wins biggest

| Benchmark | 256 | 4K | 64K |
|---|---|---|---|
| Index Dense (sentinel) | 16μs | 154μs | 3.5ms |
| DataOriented Dense | 16μs | 132μs | 3.0ms |
| Deque Dense | 12μs | 151μs | 1.8ms |
| Ptr Dense | 44μs | 544μs | 11.5ms |
| Index Sparse (sentinel) | 18μs | 232μs | 3.9ms |
| DataOriented Sparse | 16μs | 220μs | 3.3ms |
| Deque Sparse | 13μs | 190μs | 2.8ms |
| Ptr Sparse | 50μs | 672μs | 19.4ms |

All arenas beat PtrTrie convincingly — IndexArena by ~3.3x, DequeArena by ~6.4x at 64K dense. The surprise is that DequeArena is the fastest inserter, nearly twice as fast as IndexArena at 64K.

Why? IndexArena’s vector<Node> occasionally reallocates and copies the entire array when it grows. Each Node is 216 bytes¹, so at 64K nodes that’s ~13.5MB being copied on resize. DequeArena’s deque never moves existing nodes — it just allocates a new chunk and updates its internal bookkeeping. The reallocation cost outweighs the locality benefit during construction.

DataOriented is slightly faster than the sentinel IndexArena (3.0ms vs 3.5ms at 64K dense) — smaller 208-byte nodes mean less data to copy on each reallocation.

This is a tradeoff the old post missed entirely, because the old “arena” wasn’t paying this cost either — it was just shuffling 8-byte pointers on resize, not 216-byte nodes.


Search hits: cache locality is real (this time)

| Benchmark | 256 | 4K | 64K |
|---|---|---|---|
| Index Hit (sentinel) | 912ns | 23μs | 580μs |
| DataOriented Hit | 984ns | 23μs | 397μs |
| Deque Hit | 1.2μs | 21μs | 930μs |
| Ptr Hit | 1.1μs | 18μs | 878μs |

This is the result the old post was looking for but couldn’t find. At 64K words, IndexArena (sentinel) searches are 1.5x faster than PtrTrie, and DataOriented pushes that to 2.2x. The old post reported identical search times — because both implementations scattered nodes on the heap, there was no locality difference to measure.

Now there is. IndexArena’s nodes are contiguous in a vector<Node>. When searching for a word, you follow a path of ~6 nodes. In PtrTrie, those 6 nodes could be anywhere in a ~13.5MB heap region. In IndexArena, they’re within a contiguous block — the CPU’s hardware prefetcher can anticipate sequential access, and nodes created close in time (which often means structurally close in the trie) end up physically close in memory.

DequeArena falls in between, as expected: nodes within a chunk get locality benefits, but cross-chunk traversals don’t.

The effect is size-dependent. At 256 words, the entire trie fits comfortably in L2 cache (4 MiB per core on this hardware) regardless of layout, so there’s no meaningful difference. At 64K, the working set exceeds cache, and physical layout starts to matter.

There’s a wrinkle, though. Looking at the full range of sizes, IndexArena is actually ~20-30% slower than PtrTrie at 512 and 4K, roughly tied at 32K, and only pulls ahead decisively at 64K. The disassembly² reveals the cost: IndexArena pays a madd (multiply-add by 216) every iteration to convert an index to an address. Since 216 isn’t a power of 2, the compiler can’t reduce this to a shift — it’s a real multiply. PtrTrie skips this entirely: the pointer is the address.

At mid-range sizes where everything fits in cache, this per-iteration overhead dominates. At 64K, cache locality overwhelms it.


Search misses: when one instruction is the whole hot path

| Benchmark | 256 | 4K | 64K |
|---|---|---|---|
| Index Miss (sentinel) | 127ns | 2.0μs | 32μs |
| Deque Miss | 108ns | 1.7μs | 26μs |
| Ptr Miss | 77ns | 1.1μs | 17μs |

Misses tell a different story: PtrTrie is fastest across all sizes. This makes sense — a miss on “zzzzzzz” in a trie of “a”-prefixed words fails at the root node, one array lookup and done. The benchmark repeats this n times on the same hot node. No traversal means no cache effects — just raw per-lookup overhead.

Looking at the disassembly², PtrTrie gets cbz — a fused compare-and-branch-if-zero — where IndexArena’s sentinel needs cmn + b.eq (two instructions). My initial guess was that this explained the gap, but disassembling the isolated comparison³ showed both are actually single-instruction checks (cmp x8, #0 vs cmn x0, #1). And disassembling the actual child access (node->children[idx]) produces identical code for both — an add with a left-shift-by-3 and a ldr. So the per-check cost is the same; the miss benchmark difference comes from the overall loop structure, where PtrTrie avoids the madd multiply entirely.

But the cbz observation raised a question worth testing: what if the null check could use cbz?


Zero-as-null: fixing what the disassembly showed us

The sentinel check (cmn + b.eq) costs two instructions in the search loop where a zero comparison gets the fused cbz. Could we eliminate that overhead by using 0 as our null index?

I templated IndexArenaTrie with a bool UseSentinelValue parameter. When false, NULL_INDEX = 0 and the root node lives at index 1 (reserving index 0 as “null”). Children default-initialize to 0, and the null check becomes a comparison against zero — eligible for cbz.

Search hits and inserts showed no meaningful difference — the madd multiply and cache effects dominate. But the miss benchmark told a clear story:

| Size | Sentinel | Zero-null | Speedup |
|---|---|---|---|
| 256 | 127ns | 70ns | 1.81x |
| 512 | 249ns | 138ns | 1.80x |
| 4K | 1.97μs | 1.01μs | 1.95x |
| 32K | 15.6μs | 8.0μs | 1.94x |
| 64K | 31.6μs | 16.0μs | 1.98x |

A consistent ~1.9x speedup across all sizes. That’s much larger than a single instruction should account for — until you consider that the miss benchmark’s hot path is literally: load one child index, compare to null, repeat. The null comparison is the bottleneck, so cbz vs cmn + b.eq is the difference between one and two instructions on the critical path.

In debug builds the effect vanishes (sentinel 1.77ms vs zero-null 1.80ms at 64K) — unoptimized function call overhead masks the instruction-level difference, same pattern we saw with search hits.

For any real workload where misses traverse more than one node, this difference would be negligible. But it’s a nice confirmation that the disassembly predictions hold up — and it set the stage for the last experiment.


Data-oriented layout: separating hot and cold fields

All the implementations so far store is_end_of_word inside the Node struct. But think about what happens during a search traversal: you visit ~6 nodes, and at each one you only look at the children array to find the next index. You never check is_end_of_word until the very last node. That bool — plus its 7 bytes of alignment padding — is dead weight in every cache line you pull in along the way.

This is the textbook case for data-oriented design: organize data by how it’s accessed, not by what it logically belongs to. I’d seen this idea in CppCon talks but hadn’t tried it myself. The change is simple: pull is_end_of_word out of Node and into a parallel vector<bool>, indexed the same way. Node shrinks from 216 bytes to 208 bytes.

Comparing against IndexArenaZeroNull (same null strategy, same traversal logic — the only difference is where the bool lives):

Search hits:

| Size | IndexArena | DataOriented | Speedup |
|---|---|---|---|
| 256 | 904ns | 984ns | 0.92x |
| 4K | 24.9μs | 23.4μs | 1.07x |
| 32K | 204μs | 190μs | 1.07x |
| 64K | 582μs | 397μs | 1.47x |

A 1.47x improvement at 64K from removing 8 bytes per node. The total memory savings is modest — 13.0MB vs 13.5MB for the children arrays — but every cache line pulled in during traversal is now 100% useful data. No padding, no cold bools.

The same size-dependent pattern holds: no difference at 256 (fits in cache regardless), growing advantage as the working set exceeds cache.

Insert:

| Size | IndexArena | DataOriented | Speedup |
|---|---|---|---|
| 64K | 3.51ms | 3.02ms | 1.16x |

Also faster — smaller nodes mean less data to copy on vector reallocation.

Search misses and prefix collection: Identical to IndexArenaZeroNull, as expected — misses don’t traverse far enough to hit cache effects, and prefix collection operates on a small subtree.

Debug flips the result: DataOriented is slower in debug — 6.14ms vs 5.63ms for search hits, 19.0ms vs 14.3ms for inserts at 64K. The extra vector<bool> bookkeeping (separate indexing, an additional push_back per insert) adds overhead that the optimizer eliminates in release. In debug, that abstraction cost outweighs the cache benefit.


The rest of the picture

Prefix collection is identical across all five implementations. getWordsWithPrefix("aaa") collects terminal nodes from a small subtree that fits in cache regardless of the overall trie’s memory layout.

| Benchmark | 256 | 4K | 64K |
|---|---|---|---|
| Index (sentinel) | 75ns | 325ns | 2.2μs |
| Deque | 74ns | 319ns | 2.1μs |
| Ptr | 74ns | 314ns | 2.1μs |

Debug vs release reveals what the optimizer is and isn’t responsible for: in debug, function-call overhead dominates and the layout effects mostly vanish (as we saw with zero-as-null and the data-oriented split); in release, the optimizer strips that abstraction cost away, and memory layout is what’s left to measure.


Conclusion

The old post’s conclusion was: “the optimizer is excellent at removing abstraction overhead but can’t change your data’s memory layout.” That’s still true — but the irony is that I didn’t have any data layout difference to measure. Both implementations scattered nodes on the heap, so of course search was identical.

Building a real arena and then iterating on it told a more interesting story: contiguous nodes genuinely speed up search once the working set outgrows cache, but inserts pay for vector reallocation; a deque splits the difference; a single fused compare-and-branch can nearly double a one-instruction miss microbenchmark; and moving one cold bool out of the hot node bought another 1.47x on search hits.

The lesson I keep learning: measure before concluding. The old post had the right instinct about cache effects but the wrong evidence. It took building an actual arena — and then not stopping there — to see whether the theory held up.


Footnotes

Footnotes

  1. 26 × 8-byte children (208) + 1-byte bool + 7 bytes alignment padding = 216. All three Node types are the same size despite storing different child types (size_t, Node*, unique_ptr<Node>) — they’re all 8 bytes on this platform. Verified with:

    #include <array>    // std::array
    #include <cstddef>  // offsetof, size_t
    #include <cstdio>   // printf

    struct IndexNode {
        bool is_end_of_word = false;
        std::array<size_t, 26> children;
    };
    printf("IndexNode: %zu bytes\n", sizeof(IndexNode));  // 216
    printf("  is_end_of_word offset: %zu\n", offsetof(IndexNode, is_end_of_word));  // 0
    printf("  children offset: %zu\n", offsetof(IndexNode, children));  // 8
  2. Compiled with c++ -std=c++17 -O2 -S -o search_loop.s search_loop.cpp on Apple Silicon, then extracted the inner loops with awk '/^__Z10index_find/,/\.cfi_endproc/' search_loop.s and awk '/^__Z8ptr_find/,/\.cfi_endproc/' search_loop.s:

    IndexArena:

    mov    w11, #216                  ; multiplier (hoisted before loop)
    ; loop body:
    ldrsb  x12, [x9]                 ; load char
    madd   x13, x0, x11, x8          ; node_addr = index * 216 + base
    add    x12, x13, x12, lsl #3     ; + char_offset * 8
    sub    x12, x12, #768            ; adjust for 'a' offset
    ldr    x0, [x12]                 ; load child index
    cmn    x0, #1                    ; compare to sentinel
    b.eq   exit                      ; branch if NULL_INDEX

    PtrTrie:

    ; loop body:
    ldrsb  x10, [x8]                 ; load char
    add    x10, x0, x10, lsl #3      ; node_ptr + char_offset * 8
    sub    x10, x10, #768            ; adjust for 'a' offset
    ldr    x0, [x10]                 ; load child pointer
    cbz    x0, exit                  ; compare-and-branch-if-zero

    Source used to generate the disassembly:

    // IndexNode as defined in footnote 1; NULL_INDEX as defined in footnote 3.
    __attribute__((noinline))
    size_t index_find(const std::vector<IndexNode>& nodes, const std::string& word) {
        size_t current = 0;
        for (char c : word) {
            size_t child = nodes[current].children[c - 'a'];
            if (child == NULL_INDEX) { return NULL_INDEX; }
            current = child;
        }
        return current;
    }
    
    __attribute__((noinline))
    const PtrNode* ptr_find(const PtrNode* root, const std::string& word) {
        const PtrNode* current = root;
        for (char c : word) {
            const PtrNode* child = current->children[c - 'a'];
            if (child == nullptr) { return nullptr; }
            current = child;
        }
        return current;
    }
  3. Compiled with c++ -std=c++17 -O2 -S -o sentinel_check.s sentinel_check.cpp on Apple Silicon, then extracted with grep -A 8 'check_nullptr\|check_sentinel' sentinel_check.s:

    ; nullptr check              ; sentinel check
    cmp  x8, #0                  cmn  x0, #1
    cset w0, eq                  cset w0, eq

    cmn x0, #1 adds 1 and checks for zero — true exactly when x0 == 0xFFFF...FFFF. One instruction each, no constant load needed. Source:

    #include <cstddef>  // size_t
    #include <limits>   // std::numeric_limits
    #include <memory>   // std::unique_ptr

    static constexpr size_t NULL_INDEX = std::numeric_limits<size_t>::max();
    
    __attribute__((noinline))
    bool check_nullptr(const std::unique_ptr<int>& ptr) {
        return ptr == nullptr;
    }
    
    __attribute__((noinline))
    bool check_sentinel(size_t index) {
        return index == NULL_INDEX;
    }
