CUDA Thread Hierarchy: Organizing Grids, Blocks, and Threads for Parallel Execution
When a kernel is invoked from the host, the CUDA runtime instantiates a collection of threads on the device to execute the kernel code in parallel. These threads are arranged in a hierarchical structure that facilitates both scalability and cooperation: the grid, the thread block, and the individual thread.
A grid encompasses all threads generated by a single kernel launch. Every thread within a given grid has access to the same global memory space. The grid is partitioned into thread blocks, which represent collaborative units of execution. Threads within the same block can synchronize their execution via __syncthreads() and exchange data through high-speed shared memory. However, no built-in synchronization mechanism exists between threads residing in different blocks.
To distinguish themselves within this hierarchy, threads utilize four built-in variables available exclusively in device code:
blockIdx: A uint3 structure identifying the block's coordinates within the grid (components: x, y, z)
threadIdx: A uint3 structure identifying the thread's coordinates within its block (components: x, y, z)
blockDim: A dim3 structure specifying the dimensions of the block (threads per block in each dimension)
gridDim: A dim3 structure specifying the dimensions of the grid (blocks per grid in each dimension)
Both uint3 and dim3 are built-in structures containing three unsigned integers. The uint3 structure is defined as:
struct __device_builtin__ uint3 {
    unsigned int x, y, z;
};
The dim3 structure extends this with constructors that default unspecified dimensions to 1:
struct __device_builtin__ dim3 {
    unsigned int x, y, z;
    __host__ __device__ dim3(unsigned int vx = 1,
                             unsigned int vy = 1,
                             unsigned int vz = 1)
        : x(vx), y(vy), z(vz) {}
};
</struct>
CUDA supports up to three-dimensional indexing for both grids and blocks, enabling intuitive mapping to multidimensional datasets such as images or volumetric data. When launching kernels with fewer than three dimensions, any unspecified coordinates automatically initialize to 1.
The following program demonstrates a two-dimensional configuration where threads calculate their global coordinates:
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void analyzeHierarchy() {
    // Extract thread coordinates within block
    unsigned int tx = threadIdx.x;
    unsigned int ty = threadIdx.y;

    // Extract block coordinates within grid
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;

    // Extract dimension information
    unsigned int bdx = blockDim.x;
    unsigned int bdy = blockDim.y;
    unsigned int gdx = gridDim.x;
    unsigned int gdy = gridDim.y;

    printf("Thread(%u,%u) in Block(%u,%u) | BlockSize: %ux%u | GridSize: %ux%u\n",
           tx, ty, bx, by, bdx, bdy, gdx, gdy);

    // Compute global 2D coordinates
    unsigned int globalCol = bx * bdx + tx;
    unsigned int globalRow = by * bdy + ty;
    printf("Global Matrix Position: (row=%u, col=%u)\n\n", globalRow, globalCol);
}
int main() {
    // Configure a 2D grid to process an 8x6 matrix
    dim3 threadsPerBlock(4, 3);
    dim3 numBlocks(2, 2);

    printf("Launch Configuration:\n");
    printf("Grid: %u x %u blocks\n", numBlocks.x, numBlocks.y);
    printf("Block: %u x %u threads\n\n", threadsPerBlock.x, threadsPerBlock.y);

    analyzeHierarchy<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Note that printf from device code requires compute capability 2.0 (Fermi) or newer, so the kernel must be compiled for at least that architecture. Recent CUDA toolkits have dropped support for sm_20, so on a current toolkit substitute the lowest target your compiler accepts (for example sm_52):
nvcc -arch=sm_20 hierarchyDemo.cu -o hierarchyDemo
./hierarchyDemo