Fundamentals of High-Performance Computing Hardware and Optimization
Modern HPC Hardware and Algorithmic Concepts
Pipeline Hazards
Pipeline hazards constrain instruction-level parallelism.
- Structural hazard: Occurs when multiple instructions compete for the same hardware unit (e.g., execution unit).
- Data hazard: Arises when an instruction must wait for operands produced by earlier instructions.
- Control hazard: Happens when the processor cannot determine the next instruction to fetch due to branches.
Branch Execution Cost
for (int idx = 0; idx < N; ++idx)
    arr[idx] = rand() % 100;

volatile int total = 0;
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < 50)
        total += arr[idx];
Sample assembly:
        mov rbx, -4000000
        jmp loop_body
skip_inc:
        add rbx, 4
        jz done
loop_body:
        mov eax, dword ptr [rbx + arr + 4000000]
        cmp eax, 49
        jg skip_inc
        add dword ptr [rsp + 12], eax
        jmp skip_inc
done:
Cost contributors:
- Control hazard every two cycles → ~19 CPU cycles per occurrence.
- Memory load (mov eax, ...) → ~5 CPU cycles.
- Store to volatile memory on taken branch → ~4 CPU cycles.
Varying the threshold:
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < limit)
        total += arr[idx];
At limit = 50, misprediction penalty is high (~14 cycles). At limit = 85–90, prediction errors cost less than execution time, producing a lower average cycle count.
Sorting the array improves predictor accuracy:
std::sort(arr, arr + N);
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < 50)
        total += arr[idx];
Predictors recognize more complex patterns beyond alternating branches. For small N (e.g., 1000 unsorted), the predictor memorizes the sequence, achieving ~4 cycles per iteration — slightly better than the sorted case due to fewer state transitions.
Branch replacement via predicate (cmov):
for (int idx = 0; idx < N; ++idx)
    total += (arr[idx] < 50 ? arr[idx] : 0);
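A minimal sketch of the predicated form next to its arithmetic-masking equivalent (function names are hypothetical; whether the compiler actually emits cmov or a mask depends on the target and flags):

```cpp
// Branchless accumulation: the ternary typically compiles to cmov,
// so no branch prediction is involved on the hot path.
int sumBelow(const int* arr, int n, int limit) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += (arr[i] < limit ? arr[i] : 0);
    return total;
}

// Arithmetic-masking equivalent: (arr[i] < limit) is 0 or 1;
// negating it yields an all-zeros or all-ones mask.
int sumBelowMasked(const int* arr, int n, int limit) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += arr[i] & -(int)(arr[i] < limit);
    return total;
}
```

Both versions trade the misprediction penalty for executing the addition on every element, which wins when the branch is unpredictable.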
CPU Microarchitecture Details
Branch Prediction Unit (BPU)
In steady state, the BPU steers instruction fetch. It holds a 12K-entry Branch Target Buffer (BTB) storing branch destinations and metadata. Each cycle it predicts direction and supplies the next fetch address.
Decoded Stream Buffer (DSB)
Also called the μop cache. Stores already-decoded μops to avoid repeated decode work. Sits parallel to the L1 I-cache and is checked on each fetch address from the BPU. Frequently used sequences hit here, supplying up to 8 μops/cycle from 4K entries.
Reorder Buffer (ROB)
Central to out-of-order execution with 512 entries. Functions:
- Register renaming via Physical Register File (PRF) and Register Alias Table (RAT).
- Resource allocation: assigns execution units and target registers, up to 6 μops/cycle.
- Tracks speculative execution; retirement occurs in order, freeing ROB slots, up to 8 μops/cycle.
Scheduler / Reservation Station (RS)
Monitors operand readiness for μops and dispatches them when resources free. Fewer entries than ROB, dispatch rate up to 6 μops/cycle.
Memory Access Optimizations
Layout Strategies
- For small arrays, sequential scan can beat binary search.
- Prefer Structure of Arrays (SOA) over Array of Structures (AOS):
// AOS
struct Point { int x, y, z; };
Point pts[N];
// SOA
struct Points { int x[N], y[N], z[N]; };
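A sketch of why SOA wins when only one field is touched per pass (counts and function names are illustrative):

```cpp
#include <cstddef>

constexpr std::size_t kN = 1024; // hypothetical element count

// AOS: summing x drags the interleaved y/z data through the cache too.
struct PointAOS { int x, y, z; };

// SOA: summing x streams through one dense array, using every byte
// of each fetched cache line.
struct PointsSOA { int x[kN], y[kN], z[kN]; };

long sumX_AOS(const PointAOS* pts, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += pts[i].x;
    return s;
}

long sumX_SOA(const PointsSOA& pts, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += pts.x[i];
    return s;
}
```

With 12-byte AOS elements, only a third of each 64-byte line holds useful data for this pass; the SOA version also vectorizes more readily.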
Eytzinger Layout for Binary Search
Improves cache behavior for search-intensive workloads.
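A sketch of the idea, assuming a 1-indexed BFS (Eytzinger) array where node k's children sit at 2k and 2k+1; function names are hypothetical:

```cpp
#include <vector>
#include <cstddef>

// Build the Eytzinger layout from a sorted array via in-order traversal.
// The top levels of the tree occupy a contiguous prefix of the array,
// so the first several probes of a search share few cache lines --
// unlike the scattered midpoints of classic binary search.
void eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
               std::size_t& i, std::size_t k = 1) {
    if (k <= sorted.size()) {
        eytzinger(sorted, out, i, 2 * k);     // left subtree
        out[k] = sorted[i++];                 // this node
        eytzinger(sorted, out, i, 2 * k + 1); // right subtree
    }
}

// Lower bound: smallest element >= x, or -1 if none exists.
int searchEytzinger(const std::vector<int>& t, int x) {
    std::size_t k = 1, best = 0; // slot 0 is unused: "not found"
    while (k < t.size()) {
        if (t[k] >= x) { best = k; k = 2 * k; } // remember, go left
        else           { k = 2 * k + 1; }       // go right
    }
    return best ? t[best] : -1;
}
```

`out` must have `sorted.size() + 1` slots because index 0 is unused.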
Struct Size Tuning
- Bitfields reduce footprint:
struct Compact {
    unsigned a : 4;
    unsigned b : 2;
    unsigned c : 2;
};
- Align to cache lines:
struct alignas(64) CacheAligned { /* members */ };
- Avoid false sharing:
struct ThreadData {
    int valA;             // modified by thread 1
    alignas(64) int valB; // modified by thread 2
};
- Fit working set within cache dimensions.
Explicit Prefetching
Hide DRAM latency by fetching ahead:
size_t pos = rand_gen(generator);
for (int i = 0; i < N; ++i) {
    int val = data[pos];
    pos = rand_gen(generator);       // compute the next index early
    __builtin_prefetch(&data[pos]);  // fetch it while we compute
    heavyCompute(val);
}
Hardware prefetchers detect regular access patterns; OOO execution issues early loads.
Large Pages (HugePages)
Linux options:
- Explicit Huge Pages (EHP): Pre-reserved, static, minimal fragmentation, no swap.
- Global Transparent Huge Pages (THP): Auto-promoted, easy for testing, may affect other processes.
- Per-process THP: Uses madvise; avoids global impact but incurs background compaction overhead.
Low-latency domains favor EHP; databases and VMs often use per-process THP.
Computation Optimizations
Critical paths prioritize latency; others maximize throughput.
Breaking Data Dependencies
Parallelize independent chains:
Before:
struct Body { float px, py, speed; };

class RandGen {
    uint32_t state;
public:
    RandGen(uint32_t s) : state(s) {}
    uint32_t next() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }
};

void moveBodies(std::vector<Body>& items, uint32_t s) {
    RandGen gen(s);
    for (int step = 0; step < STEPS; ++step)
        for (auto& p : items) {
            uint32_t ang = gen.next();
            float rad = ang * RAD_SCALE;
            p.px += cosApprox(rad) * p.speed;
            p.py += sinApprox(rad) * p.speed;
        }
}
After (dual streams):
void moveBodies(std::vector<Body>& items, uint32_t s1, uint32_t s2) {
    RandGen g1(s1), g2(s2);
    for (int step = 0; step < STEPS; ++step)
        for (size_t i = 0; i + 1 < items.size(); i += 2) {
            uint32_t a1 = g1.next(), a2 = g2.next();
            float r1 = a1 * RAD_SCALE, r2 = a2 * RAD_SCALE;
            items[i].px += cosApprox(r1) * items[i].speed;
            items[i].py += sinApprox(r1) * items[i].speed;
            items[i+1].px += cosApprox(r2) * items[i+1].speed;
            items[i+1].py += sinApprox(r2) * items[i+1].speed;
        }
}
Inlining Guidance
Use [[gnu::always_inline]] (GCC/Clang) or __forceinline (MSVC) to hint aggressive inlining.
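A portable sketch of the hint (macro and function names are hypothetical):

```cpp
// Wrap the compiler-specific spellings behind one macro.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#else
  #define FORCE_INLINE [[gnu::always_inline]] inline
#endif

// A small, hot helper worth inlining: the call overhead would
// otherwise dwarf the work itself.
FORCE_INLINE int squarePlusOne(int x) {
    return x * x + 1;
}
```

These are hints with teeth: GCC/Clang will inline even at -O0, but an uninlinable body (e.g., recursion) still produces a call.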
Loop Optimizations
- Invariant Code Motion:
for (int i = 0; i < N; ++i) {
    auto coeff = c[i];
    for (int j = 0; j < N; ++j)
        a[j] = b[j] * coeff;
}
- Unrolling (avoid introducing dependencies):
for (int i = 0; i + 1 < N; i += 2) {
    a[i] = b[i] * c[i];
    a[i+1] = b[i+1] * c[i+1];
}
Modern CPUs effectively unroll on their own via OOO execution; manual unrolling is rarely needed.
- Strength Reduction: Replace expensive index math:
int j = 0;
for (int i = 0; i < N; ++i) {
    a[i] = b[j] * c[i]; // j tracks i * 10, replacing a multiply per iteration
    j += 10;
}
- Loop Peeling: Handle small fixed iterations before general loop.
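A sketch of peeling (the `prefixMax` example is hypothetical): the first iteration is pulled out so the general loop body needs no special case.

```cpp
// Loop peeling: handle iteration 0 separately so the main loop can
// reference a[i-1] unconditionally, with no in-loop branch for i == 0.
void prefixMax(const int* b, int* a, int n) {
    if (n == 0) return;
    a[0] = b[0];                 // peeled first iteration
    for (int i = 1; i < n; ++i)  // uniform general loop
        a[i] = a[i-1] > b[i] ? a[i-1] : b[i];
}
```

Compilers peel the same way to make a loop's trip count or alignment uniform before vectorizing.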
Advanced Loop Transformations
- Interchange: Swap nested loops to improve spatial locality.
- Blocking/Tiling: Process data in cache-sized chunks.
Example tiling:
for (int ii = 0; ii < N; ii += 8)
    for (int jj = 0; jj < N; jj += 8)
        for (int i = ii; i < ii + 8; ++i)
            for (int j = jj; j < jj + 8; ++j)
                a[i][j] += b[j][i];
- Fusion & Unroll-Fuse: Combine loops and unroll to increase ILP.
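A minimal fusion sketch (`sumAndMax` is a hypothetical example): two separate passes over the same array become one, halving memory traffic while the two computations fill independent execution ports.

```cpp
#include <utility>

// Fused: one traversal computes both the sum and the maximum,
// instead of one loop per result. Assumes n >= 1.
std::pair<long, int> sumAndMax(const int* a, int n) {
    long sum = 0;
    int mx = a[0];
    for (int i = 0; i < n; ++i) {
        sum += a[i];
        if (a[i] > mx) mx = a[i];
    }
    return {sum, mx};
}
```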
Vectorization Constraints
Vectorization fails with:
- Read-after-write dependency:
void dep(int *A, int n) {
    for (int i = 1; i < n; ++i)
        A[i] = A[i-1] * 2;
}
- Non-associative floating-point reductions (unless -ffast-math):
float sumArr(float *a, unsigned n) {
    float acc = 0.0f;
    for (unsigned i = 0; i < n; ++i)
        acc += a[i];
    return acc;
}
Branch Prediction Enhancements
- Lookup tables instead of conditional branches.
- Arithmetic masking instead of branches.
- Predication (cmov) to eliminate jumps.
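A sketch of the lookup-table technique (names and the byte-classification task are illustrative): an unpredictable branch ladder becomes a single indexed load.

```cpp
#include <cstdint>

// Classify a byte as digit/letter/other with one table lookup
// instead of a chain of range-check branches.
enum CharClass : uint8_t { OTHER = 0, DIGIT = 1, LETTER = 2 };

struct ClassTable {
    uint8_t t[256]{};
    constexpr ClassTable() {
        for (int c = '0'; c <= '9'; ++c) t[c] = DIGIT;
        for (int c = 'a'; c <= 'z'; ++c) t[c] = LETTER;
        for (int c = 'A'; c <= 'Z'; ++c) t[c] = LETTER;
    }
};

constexpr ClassTable kTable{}; // built entirely at compile time

uint8_t classify(uint8_t c) {
    return kTable.t[c]; // one load, zero branches
}
```

The 256-byte table fits in four cache lines, so the load is effectively free once the table is hot.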
Machine Code Layout Tuning
- Use likely/unlikely macros to bias layout away from cold paths such as exception handling.
- Align basic blocks to cache line boundaries.
- Split hot/cold code sections.
- Reduce ITLB misses via compact code layout.
Low-Latency Techniques
Prevent Page Faults
Pre-touch pages:
char *buf = (char *)malloc(sz);
long pg = sysconf(_SC_PAGESIZE);
for (size_t i = 0; i < sz; i += pg)
    buf[i] = 0;
Lock into RAM:
mallopt(M_MMAP_MAX, 0);
mallopt(M_TRIM_THRESHOLD, -1);
mallopt(M_ARENA_MAX, 1);
mlockall(MCL_CURRENT | MCL_FUTURE);
Avoid TLB Shootdowns
TLB coherence relies on kernel inter-processor interrupts (IPIs) that force remote cores to invalidate stale entries (e.g., via INVLPG). Updates to shared virtual addresses through munmap, mprotect, or madvise trigger shootdowns. Monitor via /proc/interrupts.
Floating-Point Performance Pitfalls
Be aware of slow transcendental functions and of denormal (subnormal) operands, which can fall back to slow microcode paths; flush-to-zero (FTZ/DAZ) modes mitigate the latter.
Last-Level Cache Sensitivity
Working set fitting in LLC reduces DRAM traffic; profile to identify sensitivity.
Quantitative Cache and RAM Behavior
Strided Access Effects
- L2 bandwidth drops ~15% due to bus contention.
- RAM throughput halved under strided patterns.
- Write mixes induce extra reads (write-allocate), reducing effective bandwidth.
Cache Bypass
Non-temporal (streaming) stores provide a direct-to-RAM path that bypasses the cache for data with no reuse, avoiding cache pollution and improving effective bandwidth.
NUMA Impact
On multi-socket systems, local memory access avoids inter-socket latency; 4–5 cores benefit from exclusive L3 per socket.
Cache Associativity Effects
Power-of-two strides map to few cache sets → conflict misses.
Mapping schemes:
- Fully associative: block can occupy any cache line.
- Direct mapped: block maps to block_number % cache_lines.
- Set associative: block maps to block_number % number_of_sets; free placement within set.
Indexing example: 6-bit offset, 12-bit index, 46-bit tag.
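A sketch of that address split (struct and function names are hypothetical): 6 offset bits cover a 64 B line, 12 index bits select one of 4096 sets, and the remaining 46 bits form the tag.

```cpp
#include <cstdint>

// Decompose a 64-bit address per the 6/12/46 split above.
struct CacheAddr { uint64_t tag; uint32_t set; uint32_t offset; };

CacheAddr splitAddress(uint64_t addr) {
    return {
        addr >> 18,                    // tag: bits 18..63
        uint32_t((addr >> 6) & 0xFFF), // set index: bits 6..17
        uint32_t(addr & 0x3F)          // offset within the 64 B line
    };
}
```

Because the set index is taken from low address bits, two addresses a power-of-two stride apart (>= 2^18 here) land in the same set, which is exactly the conflict-miss pattern described above.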
Huge Pages Benefit
Reduce TLB pressure and page walk overhead for large contiguous allocations.
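A per-process THP sketch for Linux (the allocator name and size are illustrative; madvise with MADV_HUGEPAGE is best-effort and may be refused if THP is disabled):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Ask the kernel to back one anonymous mapping with huge pages,
// leaving system-wide THP policy untouched. Returns nullptr on failure.
void* allocHugeHint(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Best-effort hint: if it fails, the mapping still works with 4 KiB pages.
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}
```

Requesting a multiple of 2 MiB gives the kernel a chance to promote the whole range to huge pages.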