Memory Regions - Cache
#memory | #lowlevel | #cache

Cache, in simple terms, is a small buffer of data that sits between different parts of a computer system to speed things up, like memorizing your lines right before a scene. It's temporary, but crucial for the performance.
This concept appears everywhere: memory caches, system caches, browser caches, application caches—you name it. They exist across multiple layers, often unnoticed, quietly making things faster.
In CPU architectures, cache memory refers to a tiny, high-speed memory located close to the CPU. It holds frequently accessed data and instructions, reducing the need to fetch them from "slower" RAM and helping the system perform more efficiently.
Ever noticed how a program opens faster the second time? That's caching at work. But it's not that the entire program is stored in CPU cache. Instead, parts of it may sit in the CPU cache or in the disk (page) cache, which is larger and slower than the CPU cache but still much faster than reading from disk.
Meanwhile, the CPU cache is constantly updated with the hottest data your processor needs right now, accelerating repeated operations and making things feel snappy.
Key characteristics
- Located near the CPU and organized into levels (L1, L2, L3, L4), with L1 being the fastest and smallest;
- Acts as a buffer between RAM and the CPU, minimizing latency;
- Stores frequently accessed data and instructions to reduce redundant memory fetches;
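The sizes of these levels vary by CPU. On Linux with glibc (an assumption; other systems expose this information differently), they can be queried at runtime with `sysconf`. A minimal sketch:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // glibc-specific _SC_LEVEL* names; values are in bytes (0 or -1 if unknown)
    printf("L1d cache size: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache size:  %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache size:  %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```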
Limitations
- Very fast but limited in size, typically kilobytes or a few megabytes;
- More expensive than RAM per unit of storage;
- One of the biggest problems in software engineering is cache invalidation;
Example 1
```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>     // aligned_alloc, free
#include <x86intrin.h>  // _mm_clflush, __rdtscp

#define CACHE_LINE_SIZE 64

int main(void) {
    unsigned int aux;
    int size = 1024;
    char *buffer = aligned_alloc(CACHE_LINE_SIZE, size);
    if (!buffer) return 1;

    buffer[0] = 1; // Ensure the page is mapped in

    // Flush the line from the cache hierarchy to force a cache miss
    _mm_clflush(&buffer[0]);

    // Measure time of first access (cache miss)
    uint64_t start = __rdtscp(&aux);
    volatile char dummy = buffer[0];
    uint64_t end = __rdtscp(&aux);
    printf("Cache miss access time: %lu cycles\n", (unsigned long)(end - start));

    // Measure time of second access (cache hit)
    start = __rdtscp(&aux);
    dummy = buffer[0];
    end = __rdtscp(&aux);
    printf("Cache hit access time: %lu cycles\n", (unsigned long)(end - start));

    free(buffer);
    return 0;
}
```
Raw Output
```
Cache miss access time: 765 cycles
Cache hit access time: 45 cycles
```
The exact cycle counts vary by CPU, but hits will be significantly faster than misses. This example demonstrates how the cache works by measuring the time it takes to access data that is either in the cache (hit) or not (miss). The first access incurs a cache miss and takes more cycles, while the second access is a cache hit and completes in far fewer cycles.
Key concept
Reading or writing a memory address often causes the CPU to fetch the cache line containing that address into the cache (if caching is enabled).
Why does this prove it?
- The first access after `_mm_clflush()` is a cache miss (the data is not in L1/L2);
- The second access is much faster, which strongly indicates a cache hit;
- This proves that simply reading a value causes it to be loaded into cache;
Warning
The read `volatile char dummy = buffer[0];` causes the value to be loaded into the cache unless caching is disabled or bypassed via hardware features.
Notes
- Requires an x86 CPU and a compiler supporting `rdtscp` and `_mm_clflush`;
- `volatile` is important to prevent compiler optimization;
- This isn't 100% foolproof at the hardware level, but it's a good approximation;
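Where `rdtscp` and `_mm_clflush` aren't available, a rougher but portable sketch of the same idea (for POSIX systems) is to evict a small buffer by streaming through a much larger one, and to time whole passes with `clock_gettime` rather than single accesses. The 64 MB eviction buffer and 64-byte stride below are assumptions about typical last-level cache and line sizes, and the results are much noisier than the cycle-accurate version:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define SMALL (16 * 1024)        // Region we time (assumed to fit in L1/L2)
#define BIG   (64 * 1024 * 1024) // Assumed larger than the last-level cache
#define LINE  64                 // Assumed cache line size

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Touch one byte per (assumed) cache line and return elapsed nanoseconds
static uint64_t walk(volatile uint8_t *buf, size_t len) {
    uint64_t t0 = now_ns();
    for (size_t i = 0; i < len; i += LINE)
        (void)buf[i];
    return now_ns() - t0;
}

int main(void) {
    uint8_t *small = malloc(SMALL);
    uint8_t *big = malloc(BIG);
    if (!small || !big) return 1;

    for (size_t i = 0; i < SMALL; i++) small[i] = (uint8_t)i; // page in
    for (size_t i = 0; i < BIG; i++)   big[i]   = (uint8_t)i; // page in

    walk(big, BIG); // Stream through the big buffer to (hopefully) evict `small`

    printf("Cold pass over 16 KB: %lu ns\n", (unsigned long)walk(small, SMALL));
    printf("Warm pass over 16 KB: %lu ns\n", (unsigned long)walk(small, SMALL));

    free(small);
    free(big);
    return 0;
}
```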
Example 2
```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>
#include <x86intrin.h> // For __rdtsc()

#define KB 1024
#define MB (1024 * KB)
#define MAX_SIZE_MB 8 // Test up to 8 MB (exceeds a typical L3)
#define STEP_KB 4     // Smallest array size tested (4 KB)
#define STRIDE 64     // Typical cache line size

uint8_t *array; // Dynamically allocated array

// Read the Time-Stamp Counter
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}

// Measure access time for a given array size
void test_cache_size(size_t array_size) {
    uint64_t start, end;
    volatile uint8_t temp;

    start = rdtsc();
    for (size_t i = 0; i < array_size; i += STRIDE) {
        temp = array[i]; // Load one byte per cache line (forces cache load)
    }
    end = rdtsc();

    printf("Size: %6zu KB | Access Time: %10" PRIu64 " cycles\n",
           array_size / KB, end - start);
}

int main(void) {
    printf("L1 Cache Detection Test\n");

    // Allocate a large memory block
    array = (uint8_t *)malloc((size_t)MAX_SIZE_MB * MB);
    if (!array) {
        printf("Memory allocation failed!\n");
        return 1;
    }

    // Touch all memory to ensure it's actually mapped in
    for (size_t i = 0; i < (size_t)MAX_SIZE_MB * MB; i++) {
        array[i] = 0;
    }

    printf("\nTesting cache access times:\n");

    // Measure access time for progressively larger sizes
    for (size_t size = STEP_KB * KB; size <= (size_t)MAX_SIZE_MB * MB; size *= 2) {
        test_cache_size(size);
    }

    free(array);
    return 0;
}
```
Raw Output
```
L1 Cache Detection Test

Testing cache access times:
Size: 4 KB | Access Time: 1086 cycles
Size: 8 KB | Access Time: 1354 cycles
Size: 16 KB | Access Time: 3091 cycles
Size: 32 KB | Access Time: 4386 cycles
Size: 64 KB | Access Time: 9324 cycles
Size: 128 KB | Access Time: 22141 cycles
Size: 256 KB | Access Time: 48771 cycles
Size: 512 KB | Access Time: 91847 cycles
Size: 1024 KB | Access Time: 175971 cycles
Size: 2048 KB | Access Time: 433171 cycles
Size: 4096 KB | Access Time: 741997 cycles
Size: 8192 KB | Access Time: 1511005 cycles
```
Cache levels
The table below, organized by access time (cycles) and array size (KB), shows how access cost changes across the different cache regions. Access time grows with the working-set size and jumps sharply once the data no longer fits in a given cache level.
This is a practical demonstration of how cache memory works in a CPU: as the data size increases, accesses spill from L1 to L2 and L3, and finally to main memory (RAM).
Access times are measured in cycles, a common way to evaluate cache performance. The fewer the cycles, the faster the access, which indicates the data is likely being served from a faster cache region.
Access Time (cycles) | Size (KB)¹ | Cache Region / Level²
---|---|---
1086 | 4 KB | L1 Cache
1354 | 8 KB | L1 Cache
3091 | 16 KB | L1 Cache
4386 | 32 KB | L1 / L2 Transition
9324 | 64 KB | L2 Cache
22141 | 128 KB | L2 Cache
48771 | 256 KB | L2 Cache
91847 | 512 KB | L2 / L3 Transition
175971 | 1024 KB (1 MB) | L3 Cache
433171 | 2048 KB (2 MB) | L3 Cache
741997 | 4096 KB (4 MB) | L3 Cache
1511005 | 8192 KB (8 MB) | Main Memory (RAM)
¹ Keep in mind different CPUs may have cache size variations.
² Some architectures may even have an L4 cache.
Explanation of each level
- L1 Cache (4 KB - 16 KB)
  - Very low cycle counts (~1K-3K).
  - Data fits entirely into L1 cache → fastest access.
- L1 → L2 Transition (32 KB)
  - Jump in cycles (~4.4K), indicating L1 is full.
  - Some accesses go to L2, increasing latency.
- L2 Cache (64 KB - 512 KB)
  - Noticeable slowdown (~9K-49K cycles).
  - Data still fits in L2, but it's slower than L1.
- L2 → L3 Transition (512 KB - 1 MB)
  - Big jump in latency (~49K → 92K → 176K cycles).
  - This means L2 is now full and some data is moving to L3.
- L3 Cache (1 MB - 4 MB)
  - Latency increases gradually (~176K → 742K cycles).
  - L3 is slower but still much faster than RAM.
- RAM Access (8 MB+)
  - Massive jump (~1.5M cycles) → we're now in main memory (DRAM).
  - RAM is orders of magnitude slower than cache, but still faster than disk.
A fun analogy
Imagine you're studying for an exam. The Desk, Bookshelf, and Library:
- L1 Cache = Your Desk: The notes and books you keep right on your desk are instantly accessible. You grab them without moving—this is your L1 cache: tiny, super-fast, but limited space.
- L2 Cache = Bookshelf Nearby: If your desk is full, you reach for a bookshelf a few steps away. It holds more books but takes a bit longer to access; this is your L2 cache.
- Main Memory (RAM) = The Library: If you need something not on your desk or shelf, you walk all the way to the library. It has everything, but it's much slower to get there—this is your RAM.
When you organize your study session so that you mostly use what's on your desk, you work much faster. If you constantly need to walk to the library, you waste a lot of time. CPUs work the same way: the closer the data (L1/L2), the faster the program runs.
Now, imagine you update your notes, but forget to update the copy on your desk or bookshelf. Later, you grab the old version by mistake and study the wrong material. This is similar to cache invalidation in web development: if a website's cache isn't properly cleared when content changes, users might see outdated information. Keeping all copies in sync is tricky and a common source of inconsistency.
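To make that concrete, here is a toy sketch (hypothetical names, not any real caching library) of a value that is cached once and never invalidated, so readers keep getting the stale copy until the cache is explicitly cleared:

```c
#include <stdio.h>

static int db_value = 10;   // "Source of truth" (e.g., the database)
static int cached_value;    // The cached copy (e.g., a web cache)
static int cache_valid = 0;

// Returns the cached copy if present; never checks whether db_value changed
int read_value(void) {
    if (!cache_valid) {
        cached_value = db_value; // Fill the cache on first read
        cache_valid = 1;
    }
    return cached_value;
}

int main(void) {
    printf("First read:  %d\n", read_value()); // 10, cache gets filled
    db_value = 42;                             // The "content" changes...
    printf("Second read: %d\n", read_value()); // ...but we still see the stale 10
    cache_valid = 0;                           // Explicit invalidation fixes it
    printf("Third read:  %d\n", read_value()); // 42
    return 0;
}
```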
Final thoughts
A solid grasp of cache memory is essential for writing high-performance software. By understanding how CPUs access and store data, developers can design algorithms and data structures that take full advantage of the cache hierarchy, reducing latency and improving throughput.
Optimizing for cache is especially valuable in demanding fields like gaming, scientific computing, and real-time applications, where every cycle counts. Thoughtful memory access patterns and cache-aware programming can lead to substantial gains in speed and efficiency, directly impacting user experience and system scalability.
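As a closing sketch of what cache-aware programming looks like in practice, consider traversal order. Summing a 2D array row by row touches memory sequentially, so one fetched cache line serves many accesses; summing it column by column jumps thousands of bytes between accesses and misses far more often. The matrix size here is an arbitrary assumption, but on most machines the row-major loop is noticeably faster:

```c
#include <stdio.h>
#include <time.h>

#define N 4096 // Arbitrary matrix dimension (assumption), ~64 MB of ints

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    static int matrix[N][N]; // Zero-initialized, lives in BSS
    struct timespec t0, t1;
    long long sum = 0;

    // Row-major: consecutive elements share cache lines
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += matrix[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Row-major sum:    %.3f s\n", seconds(t0, t1));

    // Column-major: each access lands on a different cache line
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += matrix[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Column-major sum: %.3f s\n", seconds(t0, t1));

    printf("(sum = %lld)\n", sum); // Use the result so the loops aren't optimized away
    return 0;
}
```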