Fundamentals of High-Performance Computing Hardware and Optimization
Modern HPC Hardware and Algorithmic Concepts
Pipeline Hazards
Pipeline hazards constrain instruction-level parallelism.
- Structural hazard: Occurs when multiple instructions compete for the same hardware unit (e.g., execution unit).
- Data hazard: Arises when an instruction must wait for operands produced by earlier instructions.
- Control hazard: Happens when the processor cannot determine the next instruction to fetch due to branches.
Branch Execution Cost
for (int idx = 0; idx < N; ++idx)
    arr[idx] = rand() % 100;

volatile int total = 0;
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < 50)
        total += arr[idx];
Sample assembly:
        mov rbx, -4000000
        jmp loop_body
skip_inc:
        add rbx, 4
        jz done
loop_body:
        mov eax, dword ptr [rbx + arr + 4000000]
        cmp eax, 49
        jg skip_inc
        add dword ptr [rsp + 12], eax
        jmp skip_inc
done:
Cost contributors:
- Control hazard every two cycles → ~19 CPU cycles per occurrence.
- Memory load (mov eax, ...) → ~5 CPU cycles.
- Store to volatile memory on taken branch → ~4 CPU cycles.
Varying the threshold:
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < limit)
        total += arr[idx];
At limit = 50, misprediction penalty is high (~14 cycles). At limit = 85–90, prediction errors cost less than execution time, producing a lower average cycle count.
Sorting the array improves predictor accuracy:
std::sort(arr, arr + N);
for (int idx = 0; idx < N; ++idx)
    if (arr[idx] < 50)
        total += arr[idx];
Predictors recognize more complex patterns beyond alternating branches. For small N (e.g., 1000 unsorted), the predictor memorizes the sequence, achieving ~4 cycles per iteration — slightly better than the sorted case due to fewer state transitions.
Branch replacement via predicate (cmov):
for (int idx = 0; idx < N; ++idx)
    total += (arr[idx] < 50 ? arr[idx] : 0);
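A minimal sketch of the predicated form next to its arithmetic-masking equivalent (function names are hypothetical; whether the compiler actually emits cmov or a mask depends on the target and flags):

```cpp
// Branchless accumulation: the ternary typically compiles to cmov,
// so no branch prediction is involved on the hot path.
int sumBelow(const int* arr, int n, int limit) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += (arr[i] < limit ? arr[i] : 0);
    return total;
}

// Arithmetic-masking equivalent: (arr[i] < limit) is 0 or 1;
// negating it yields an all-zeros or all-ones mask.
int sumBelowMasked(const int* arr, int n, int limit) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += arr[i] & -(int)(arr[i] < limit);
    return total;
}
```

Both versions trade the misprediction penalty for executing the addition on every element, which wins when the branch is unpredictable.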
CPU Microarchitecture Details
Branch Prediction Unit (BPU)
In steady state, the BPU steers instruction fetch. It holds a 12K-entry Branch Target Buffer (BTB) storing branch destinations and metadata. Each cycle it predicts direction and supplies the next fetch address.
Decoded Stream Buffer (DSB)
Also called the μop cache. Stores already-decoded μops to avoid repeated decode work. Sits parallel to the L1 I-cache and is checked on each fetch address from the BPU. Frequently used sequences hit here, supplying up to 8 μops/cycle from 4K entries.
Reorder Buffer (ROB)
Central to out-of-order execution with 512 entries. Functions:
- Register renaming via Physical Register File (PRF) and Register Alias Table (RAT).
- Resource allocation: assigns execution units and target registers, up to 6 μops/cycle.
- Tracks speculative execution; retirement occurs in order, freeing ROB slots, up to 8 μops/cycle.
Scheduler / Reservation Station (RS)
Monitors operand readiness for μops and dispatches them when resources free. Fewer entries than ROB, dispatch rate up to 6 μops/cycle.
Memory Access Optimizations
Layout Strategies
- For small arrays, sequential scan can beat binary search.
- Prefer Structure of Arrays (SOA) over Array of Structures (AOS):
// AOS
struct Point { int x, y, z; };
Point pts[N];
// SOA
struct Points { int x[N], y[N], z[N]; };
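A sketch of why SOA wins when only one field is touched per pass (counts and function names are illustrative):

```cpp
#include <cstddef>

constexpr std::size_t kN = 1024; // hypothetical element count

// AOS: summing x drags the interleaved y/z data through the cache too.
struct PointAOS { int x, y, z; };

// SOA: summing x streams through one dense array, using every byte
// of each fetched cache line.
struct PointsSOA { int x[kN], y[kN], z[kN]; };

long sumX_AOS(const PointAOS* pts, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += pts[i].x;
    return s;
}

long sumX_SOA(const PointsSOA& pts, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += pts.x[i];
    return s;
}
```

With 12-byte AOS elements, only a third of each 64-byte line holds useful data for this pass; the SOA version also vectorizes more readily.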
Eytzinger Layout for Binary Search
Improves cache behavior for search-intensive workloads.
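A sketch of the idea, assuming a 1-indexed BFS (Eytzinger) array where node k's children sit at 2k and 2k+1; function names are hypothetical:

```cpp
#include <vector>
#include <cstddef>

// Build the Eytzinger layout from a sorted array via in-order traversal.
// The top levels of the tree occupy a contiguous prefix of the array,
// so the first several probes of a search share few cache lines --
// unlike the scattered midpoints of classic binary search.
void eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
               std::size_t& i, std::size_t k = 1) {
    if (k <= sorted.size()) {
        eytzinger(sorted, out, i, 2 * k);     // left subtree
        out[k] = sorted[i++];                 // this node
        eytzinger(sorted, out, i, 2 * k + 1); // right subtree
    }
}

// Lower bound: smallest element >= x, or -1 if none exists.
int searchEytzinger(const std::vector<int>& t, int x) {
    std::size_t k = 1, best = 0; // slot 0 is unused: "not found"
    while (k < t.size()) {
        if (t[k] >= x) { best = k; k = 2 * k; } // remember, go left
        else           { k = 2 * k + 1; }       // go right
    }
    return best ? t[best] : -1;
}
```

`out` must have `sorted.size() + 1` slots because index 0 is unused.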
Struct Size Tuning
- Bitfields reduce footprint:
struct Compact {
    unsigned a : 4;
    unsigned b : 2;
    unsigned c : 2;
};
- Align to cache lines:
struct alignas(64) CacheAligned { /* members */ };
- Avoid false sharing:
struct ThreadData {
    int valA;             // modified by thread 1
    alignas(64) int valB; // modified by thread 2
};
- Fit working set within cache dimensions.
Explicit Prefetching
Hide DRAM latency by fetching ahead:
size_t pos = rand_gen(generator);
for (int i = 0; i < N; ++i) {
    int val = data[pos];
    pos = rand_gen(generator);       // compute the next index early
    __builtin_prefetch(&data[pos]);  // fetch it while we compute
    heavyCompute(val);
}
Hardware prefetchers detect regular access patterns; OOO execution issues early loads.
Large Pages (HugePages)
Linux options:
- Explicit Huge Pages (EHP): Pre-reserved, static, minimal fragmentation, no swap.
- Global Transparent Huge Pages (THP): Auto-promoted, easy for testing, may affect other processes.
- Per-process THP: Uses madvise; avoids global impact but incurs background compaction overhead.
Low-latency domains favor EHP; databases and VMs often use per-process THP.
Computation Optimizations
Critical paths prioritize latency; others maximize throughput.
Breaking Data Dependencies
Parallelize independent chains:
Before:
struct Body { float px, py, speed; };

class RandGen {
    uint32_t state;
public:
    RandGen(uint32_t s) : state(s) {}
    uint32_t next() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }
};

void moveBodies(std::vector<Body>& items, uint32_t s) {
    RandGen gen(s);
    for (int step = 0; step < STEPS; ++step)
        for (auto& p : items) {
            uint32_t ang = gen.next();
            float rad = ang * RAD_SCALE;
            p.px += cosApprox(rad) * p.speed;
            p.py += sinApprox(rad) * p.speed;
        }
}
After (dual streams):
void moveBodies(std::vector<Body>& items, uint32_t s1, uint32_t s2) {
    RandGen g1(s1), g2(s2);
    for (int step = 0; step < STEPS; ++step)
        for (size_t i = 0; i + 1 < items.size(); i += 2) {
            uint32_t a1 = g1.next(), a2 = g2.next();
            float r1 = a1 * RAD_SCALE, r2 = a2 * RAD_SCALE;
            items[i].px += cosApprox(r1) * items[i].speed;
            items[i].py += sinApprox(r1) * items[i].speed;
            items[i+1].px += cosApprox(r2) * items[i+1].speed;
            items[i+1].py += sinApprox(r2) * items[i+1].speed;
        }
}
Inlining Guidance
Use [[gnu::always_inline]] (GCC/Clang) or __forceinline (MSVC) to hint aggressive inlining.
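A portable sketch of the hint (macro and function names are hypothetical):

```cpp
// Wrap the compiler-specific spellings behind one macro.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#else
  #define FORCE_INLINE [[gnu::always_inline]] inline
#endif

// A small, hot helper worth inlining: the call overhead would
// otherwise dwarf the work itself.
FORCE_INLINE int squarePlusOne(int x) {
    return x * x + 1;
}
```

These are hints with teeth: GCC/Clang will inline even at -O0, but an uninlinable body (e.g., recursion) still produces a call.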
Loop Optimizations
- Invariant Code Motion:
for (int i = 0; i < N; ++i) {
    auto coeff = c[i];
    for (int j = 0; j < N; ++j)
        a[j] = b[j] * coeff;
}
- Unrolling (avoid introducing dependencies):
for (int i = 0; i + 1 < N; i += 2) {
    a[i] = b[i] * c[i];
    a[i+1] = b[i+1] * c[i+1];
}
Modern CPUs effectively unroll on their own via OOO execution; manual unrolling is rarely needed.
- Strength Reduction: Replace expensive index math:
int j = 0;
for (int i = 0; i < N; ++i) {
    a[i] = b[j] * c[i]; // j tracks i * 10, replacing a multiply per iteration
    j += 10;
}
- Loop Peeling: Handle small fixed iterations before general loop.
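A sketch of peeling (the `prefixMax` example is hypothetical): the first iteration is pulled out so the general loop body needs no special case.

```cpp
// Loop peeling: handle iteration 0 separately so the main loop can
// reference a[i-1] unconditionally, with no in-loop branch for i == 0.
void prefixMax(const int* b, int* a, int n) {
    if (n == 0) return;
    a[0] = b[0];                 // peeled first iteration
    for (int i = 1; i < n; ++i)  // uniform general loop
        a[i] = a[i-1] > b[i] ? a[i-1] : b[i];
}
```

Compilers peel the same way to make a loop's trip count or alignment uniform before vectorizing.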
Advanced Loop Transformations
- Interchange: Swap nested loops to improve spatial locality.
- Blocking/Tiling: Process data in cache-sized chunks.
Example tiling:
for (int ii = 0; ii < N; ii += 8)
    for (int jj = 0; jj < N; jj += 8)
        for (int i = ii; i < ii + 8; ++i)
            for (int j = jj; j < jj + 8; ++j)
                a[i][j] += b[j][i];
- Fusion & Unroll-Fuse: Combine loops and unroll to increase ILP.
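A minimal fusion sketch (`sumAndMax` is a hypothetical example): two separate passes over the same array become one, halving memory traffic while the two computations fill independent execution ports.

```cpp
#include <utility>

// Fused: one traversal computes both the sum and the maximum,
// instead of one loop per result. Assumes n >= 1.
std::pair<long, int> sumAndMax(const int* a, int n) {
    long sum = 0;
    int mx = a[0];
    for (int i = 0; i < n; ++i) {
        sum += a[i];
        if (a[i] > mx) mx = a[i];
    }
    return {sum, mx};
}
```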
Vectorization Constraints
Vectorization fails with:
- Read-after-write dependency:
void dep(int *A, int n) {
    for (int i = 1; i < n; ++i)
        A[i] = A[i-1] * 2;
}
- Non-associative floating-point reductions (unless -ffast-math):
float sumArr(float *a, unsigned n) {
    float acc = 0.0f;
    for (unsigned i = 0; i < n; ++i)
        acc += a[i];
    return acc;
}
Branch Prediction Enhancements
- Lookup tables instead of conditional branches.
- Arithmetic masking instead of branches.
- Predication (cmov) to eliminate jumps.
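A sketch of the lookup-table technique (names and the byte-classification task are illustrative): an unpredictable branch ladder becomes a single indexed load.

```cpp
#include <cstdint>

// Classify a byte as digit/letter/other with one table lookup
// instead of a chain of range-check branches.
enum CharClass : uint8_t { OTHER = 0, DIGIT = 1, LETTER = 2 };

struct ClassTable {
    uint8_t t[256]{};
    constexpr ClassTable() {
        for (int c = '0'; c <= '9'; ++c) t[c] = DIGIT;
        for (int c = 'a'; c <= 'z'; ++c) t[c] = LETTER;
        for (int c = 'A'; c <= 'Z'; ++c) t[c] = LETTER;
    }
};

constexpr ClassTable kTable{}; // built entirely at compile time

uint8_t classify(uint8_t c) {
    return kTable.t[c]; // one load, zero branches
}
```

The 256-byte table fits in four cache lines, so the load is effectively free once the table is hot.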
Machine Code Layout Tuning
- Use likely/unlikely macros to bias layout away from cold paths such as exception handling.
- Align basic blocks to cache line boundaries.
- Split hot/cold code sections.
- Reduce ITLB misses via compact code layout.
Low-Latency Techniques
Prevent Page Faults
Pre-touch pages:
char *buf = (char *)malloc(sz);
long pg = sysconf(_SC_PAGESIZE);
for (size_t i = 0; i < sz; i += pg)
    buf[i] = 0;
Lock into RAM:
mallopt(M_MMAP_MAX, 0);
mallopt(M_TRIM_THRESHOLD, -1);
mallopt(M_ARENA_MAX, 1);
mlockall(MCL_CURRENT | MCL_FUTURE);
Avoid TLB Shootdowns
TLB coherence relies on kernel inter-processor interrupts (IPIs) that force remote cores to invalidate stale entries (e.g., via INVLPG). Updates to shared virtual addresses through munmap, mprotect, or madvise trigger shootdowns. Monitor via /proc/interrupts.
Floating-Point Performance Pitfalls
Be aware of slow transcendental functions and of denormal (subnormal) operands, which can fall back to slow microcode paths; flush-to-zero (FTZ/DAZ) modes mitigate the latter.
Last-Level Cache Sensitivity
Working set fitting in LLC reduces DRAM traffic; profile to identify sensitivity.
Quantitative Cache and RAM Behavior
Strided Access Effects
- L2 bandwidth drops ~15% due to bus contention.
- RAM throughput halved under strided patterns.
- Write mixes induce extra reads (write-allocate), reducing effective bandwidth.
Cache Bypass
Non-temporal (streaming) stores provide a direct-to-RAM path that bypasses the cache for data with no reuse, avoiding cache pollution and improving effective bandwidth.
NUMA Impact
On multi-socket systems, local memory access avoids inter-socket latency; 4–5 cores benefit from exclusive L3 per socket.
Cache Associativity Effects
Power-of-two strides map to few cache sets → conflict misses.
Mapping schemes:
- Fully associative: block can occupy any cache line.
- Direct mapped: block maps to block_number % cache_lines.
- Set associative: block maps to block_number % number_of_sets; free placement within set.
Indexing example: 6-bit offset, 12-bit index, 46-bit tag.
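A sketch of that address split (struct and function names are hypothetical): 6 offset bits cover a 64 B line, 12 index bits select one of 4096 sets, and the remaining 46 bits form the tag.

```cpp
#include <cstdint>

// Decompose a 64-bit address per the 6/12/46 split above.
struct CacheAddr { uint64_t tag; uint32_t set; uint32_t offset; };

CacheAddr splitAddress(uint64_t addr) {
    return {
        addr >> 18,                    // tag: bits 18..63
        uint32_t((addr >> 6) & 0xFFF), // set index: bits 6..17
        uint32_t(addr & 0x3F)          // offset within the 64 B line
    };
}
```

Because the set index is taken from low address bits, two addresses a power-of-two stride apart (>= 2^18 here) land in the same set, which is exactly the conflict-miss pattern described above.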
Huge Pages Benefit
Reduce TLB pressure and page walk overhead for large contiguous allocations.
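A per-process THP sketch for Linux (the allocator name and size are illustrative; madvise with MADV_HUGEPAGE is best-effort and may be refused if THP is disabled):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Ask the kernel to back one anonymous mapping with huge pages,
// leaving system-wide THP policy untouched. Returns nullptr on failure.
void* allocHugeHint(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Best-effort hint: if it fails, the mapping still works with 4 KiB pages.
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}
```

Requesting a multiple of 2 MiB gives the kernel a chance to promote the whole range to huge pages.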