
Mistake #6: The Cache (catch)
You’ve heard “cache matters,” but here’s the twist: your loops are lying to your CPU. The way you traverse a multi-dimensional array can open up a 10x speed gap that feels like a mystery and leaves you questioning reality.
Row-Major vs. Column-Major Access
“Iterating over a 2D array is the same whether I go row-by-row or column-by-column. Right?”
Memory is laid out linearly, and the CPU fetches it in fixed-size chunks called cache lines (typically 64 bytes). Traverse against the grain and every single step forces the CPU to pull in a new cache line.
Example in C
// A "tiny" 1024x1024 matrix
int matrix[1024][1024];
// Fast: Row-major traversal (cache-friendly)
for (int i = 0; i < 1024; i++) {
    for (int j = 0; j < 1024; j++) {
        matrix[i][j] = i + j;
    }
}
// Slow: Column-major traversal (cache-hostile)
for (int j = 0; j < 1024; j++) {
    for (int i = 0; i < 1024; i++) {
        matrix[i][j] = i + j;
    }
}
The result:
- Row-major: ~5ms (data flows like a river).
- Column-major: ~50ms (CPU drowns in cache misses).
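If you want to reproduce numbers like these yourself, a minimal timing harness might look like the sketch below. Exact figures depend on your CPU, compiler, and flags, and an aggressive optimizer can interchange or drop these loops entirely, so treat it purely as an illustration.
// Minimal timing sketch (illustrative; compile with mild optimization, e.g. -O1)
#include <stdio.h>
#include <time.h>

static int matrix[1024][1024];

int main(void) {
    clock_t t0 = clock();
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            matrix[i][j] = i + j;               // row-major: contiguous writes
    clock_t t1 = clock();
    for (int j = 0; j < 1024; j++)
        for (int i = 0; i < 1024; i++)
            matrix[i][j] = i + j;               // column-major: strided writes
    clock_t t2 = clock();
    printf("row-major:    %.1f ms\n", 1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.1f ms\n", 1000.0 * (t2 - t1) / CLOCKS_PER_SEC);
    printf("checksum: %d\n", matrix[123][456]); // keep the stores from being optimized away
    return 0;
}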
Why it’s worse than you think:
In C/C++, arrays are row-major, but other languages make the opposite choice, and using the wrong traversal order there carries exactly the same penalty.
The Plot Twist: Your Programming Language is Gaslighting You
In C and Python (NumPy default), arrays use row-major order. But in Fortran, MATLAB, and Julia, arrays are column-major. If you assume the wrong layout, your loops will be slow for no apparent reason.
Python Example
import numpy as np
# Row-major (C-style) → Fast for row-wise loops
row_major = np.zeros((1024, 1024), order='C')
# Column-major (Fortran-style) → Fast for column-wise loops
col_major = np.zeros((1024, 1024), order='F')
# ❌ Slow: Column-wise access on a row-major array
for j in range(1024):
    for i in range(1024):
        row_major[i, j] = i + j  # Cache-miss chaos!
Why this is a problem:
- Row-major (default in NumPy) expects row-wise access, but the loop accesses it column-wise, causing cache misses.
- Fortran-style arrays are stored column-first, so row-wise loops will be slow instead.
The Fix:
- Match the array order to your access pattern using order='C' (row-major) or order='F' (column-major).
- Convert an existing array's layout with np.asarray(a, order='C') or np.asarray(a, order='F') if needed.
The Multidimensional Illusion: 3D+ Arrays
“3D arrays are just 2D arrays with extra steps. No big deal.”
Each extra dimension adds another stride. A 3D array in C is an array of arrays of arrays, stored as one contiguous block. Traverse the “wrong” dimension and every access jumps by a huge stride instead of touching neighboring bytes, killing locality.
Example: 3D Array Traversal in C
// ✅ Fast: Iterate in Row-Major Order (Innermost Dimension Last)
int space[256][256][256];
for (int x = 0; x < 256; x++) {
    for (int y = 0; y < 256; y++) {
        for (int z = 0; z < 256; z++) {
            space[x][y][z] = x + y + z; // Smooth memory access
        }
    }
}
Here the innermost loop varies z, the last index, so it walks through contiguous memory and makes full use of every cache line it pulls in.
// ❌ Slow: Iterate in the Wrong Order (Innermost Dimension First)
for (int z = 0; z < 256; z++) {
    for (int y = 0; y < 256; y++) {
        for (int x = 0; x < 256; x++) {
            space[x][y][z] = x + y + z; // Constant cache misses
        }
    }
}
Why this is bad:
- The loop changes x fastest, but x is the outermost dimension in memory, so every iteration jumps far ahead instead of stepping to the next element.
- Instead of streaming through contiguous memory, nearly every access lands on a cold cache line.
- Penalty: up to 100x slower for large 3D arrays!
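To make that jump concrete, here is a tiny check (a sketch reusing the space array from the example above) that prints how far one step in x moves through memory: 256 × 256 ints, i.e. 256 KB, versus the 64 bytes of a single cache line.
#include <stddef.h>
#include <stdio.h>

int space[256][256][256]; // same shape as the example above (~64 MB)

int main(void) {
    // Byte distance between space[x][y][z] and space[x+1][y][z]
    ptrdiff_t step = (char *)&space[1][0][0] - (char *)&space[0][0][0];
    printf("stepping x by 1 moves %td bytes through memory\n", step); // 262144 bytes = 256 KB
    return 0;
}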
The Nuclear Option: Cache-Aware Algorithms
For extreme performance (game engines, HPC), you need to design for cache lines:
Split arrays into small blocks that fit in L1/L2 cache.
// Process the matrix in 8x8 tiles so each tile stays resident in cache
// (illustrative body: a blocked transpose of matrix into out)
static int out[1024][1024];
for (int i = 0; i < 1024; i += 8) {
    for (int j = 0; j < 1024; j += 8) {
        // Transpose tile [i, i+8) x [j, j+8)
        for (int ti = i; ti < i + 8; ti++)
            for (int tj = j; tj < j + 8; tj++)
                out[tj][ti] = matrix[ti][tj];
    }
}
Prefer Structure of Arrays (SoA) over Array of Structures for SIMD.
// Slow for SIMD: Array of Structures (AoS)
struct Particle { float x, y, z; };
struct Particle particles[1000000];

// Fast for SIMD: Structure of Arrays (SoA)
struct Particles {
    float x[1000000];
    float y[1000000];
    float z[1000000];
};
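To see why the SoA layout pays off, here is a minimal sketch of the same operation written against both layouts (the function names and the dx parameter are illustrative, not from the original). The AoS loop uses only 4 of every 12 bytes it drags into cache, while the SoA loop reads one dense run of floats that the compiler can turn into SIMD loads and stores.
// Move every particle along x, once per layout
void move_aos(struct Particle *p, int n, float dx) {
    for (int i = 0; i < n; i++)
        p[i].x += dx;      // 12-byte stride between the floats we actually use
}

void move_soa(struct Particles *p, int n, float dx) {
    for (int i = 0; i < n; i++)
        p->x[i] += dx;     // contiguous floats: cache-friendly and easy to vectorize
}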