Heap Common Core

Memory Hierarchy

Modern computer systems use a hierarchy of memory types with vastly different speeds and sizes. Understanding this hierarchy is crucial to writing performant code and understanding why memory fragmentation impacts performance.

CPU Cache Levels

Between the CPU and main RAM, there are typically three levels of cache memory:

  • L1 Cache: The smallest (32-64 KB per core) and fastest cache, usually accessed in 3-4 CPU cycles. Split into separate instruction and data caches.
  • L2 Cache: Medium size (256 KB – 1 MB per core), accessed in approximately 10-20 cycles. Usually unified (both instructions and data).
  • L3 Cache: Largest (2-32 MB shared across cores), accessed in 40-75 cycles. Shared among all CPU cores.

For comparison, main RAM access typically takes 200-300 cycles, making it 50-100 times slower than L1 cache access.

Cache Lines

Cache memory is organized into fixed-size units called cache lines. On modern x86 and x64 processors, cache lines are typically 64 bytes in size.

When the CPU reads a single byte from memory, it actually loads the entire 64-byte cache line containing that byte. This means accessing adjacent memory addresses is much faster than accessing scattered addresses, as adjacent accesses likely hit the same cache line.

Spatial Locality

Spatial locality refers to the tendency of programs to access memory addresses that are close to each other. CPUs take advantage of this by prefetching adjacent cache lines and by loading entire cache lines on every memory access.

Good spatial locality means:

  • Data structures are tightly packed with no gaps
  • Related data items are stored adjacently
  • Algorithms access memory sequentially when possible

Temporal Locality

Temporal locality refers to the tendency to access the same memory location multiple times within a short period. Caches exploit this by keeping recently accessed data available for quick re-access.

What is Memory Fragmentation?

Memory fragmentation occurs when free memory is divided into many small, non-contiguous blocks. There are two types of fragmentation:

  • External Fragmentation: Free memory exists but is scattered in small chunks, making it impossible to satisfy large allocation requests even though the total free memory is sufficient.
  • Internal Fragmentation: Allocated memory blocks are larger than requested, wasting space within each allocation. For example, allocating 17 bytes but receiving a 32-byte block wastes 15 bytes.

How Fragmentation Affects Cache Performance

Memory fragmentation has several negative impacts on cache performance:

  • Poor Spatial Locality: Related objects may be scattered across different memory regions, preventing efficient cache line usage. Instead of loading multiple related objects in one cache line, each object requires a separate cache miss.
  • Cache Pollution: When objects are spread across many addresses, more cache lines are used to store the same amount of useful data. This reduces the effective cache size available for the working set.
  • Increased Cache Misses: Scattered allocations lead to more cache misses as the CPU must fetch data from slower memory more frequently.
  • TLB Thrashing: The Translation Lookaside Buffer (TLB), which caches virtual-to-physical address translations, becomes less effective when memory spans many non-contiguous pages.

Performance Impact Example

Consider allocating 1000 small objects:

  • Contiguous allocation: All objects in a tight array span roughly 16 KB of memory (assuming 16 bytes per object). This uses approximately 250 cache lines (64 bytes each). Sequential access benefits from cache prefetching.
  • Fragmented allocation: Objects scattered across 1 MB of address space use many more cache lines and pages. Random access patterns cause frequent cache misses, potentially making operations 10-100x slower.

Real-World Timing Comparison

Approximate latencies for different memory access patterns:

  • L1 cache hit: 1 ns (nanosecond)
  • L2 cache hit: 3 ns
  • L3 cache hit: 10-20 ns
  • RAM access: 100 ns
  • Page fault (disk): 1-10 ms (milliseconds) – over a million times slower than an L1 cache hit!

How Heaps Address Fragmentation

Well-designed heap managers implement strategies to minimize fragmentation:

  • Coalescing: Merging adjacent free blocks into larger blocks to reduce external fragmentation.
  • Size Classes: Grouping allocations by size to reduce internal fragmentation and improve locality.
  • Segregated Free Lists: Maintaining separate free lists for different size ranges, which speeds up allocation and improves packing.
  • Low Fragmentation Heap (LFH): Windows’ LFH subsystem specifically targets small allocations with tight packing to maximize cache efficiency.

Best Practices for Application Developers

To minimize fragmentation and maximize cache performance:

  • Allocate objects of similar lifetimes together
  • Use memory pools for frequently allocated objects of the same size
  • Prefer arrays of objects over arrays of pointers to objects
  • Free memory in a timely manner to allow reuse
  • Consider structure padding and alignment to avoid false sharing
  • Access data structures sequentially when possible