CUDA Thread Hierarchy: Organizing Grids, Blocks, and Threads for Parallel Execution
When a kernel is invoked from the host, the CUDA runtime instantiates a collection of threads on the device to execute the kernel code in parallel. These threads are arranged in a hierarchical structure that facilitates both scalability and cooperation: the grid, the thread block, and the individual thread.
A grid encompasses all threads generated by a single kernel launch. Every thread within a given grid has access to the same global memory space. The grid is partitioned into thread blocks, which represent collaborative units of execution. Threads within the same block can synchronize their execution via __syncthreads() and exchange data through high-speed shared memory. However, no built-in synchronization mechanism exists between threads residing in different blocks.
To distinguish themselves within this hierarchy, threads utilize four built-in variables available exclusively in device code:
blockIdx: A uint3 structure identifying the block's coordinates within the grid (components: x, y, z)
threadIdx: A uint3 structure identifying the thread's coordinates within its block (components: x, y, z)
blockDim: A dim3 structure specifying the dimensions of the block (threads per block in each dimension)
gridDim: A dim3 structure specifying the dimensions of the grid (blocks per grid in each dimension)
Both uint3 and dim3 are built-in structures containing three unsigned integers. The uint3 structure is defined as:
struct __device_builtin__ uint3 {
    unsigned int x, y, z;
};
The dim3 structure extends this with constructors that default unspecified dimensions to 1:
struct __device_builtin__ dim3 {
    unsigned int x, y, z;
    __host__ __device__ dim3(unsigned int vx = 1,
                             unsigned int vy = 1,
                             unsigned int vz = 1)
        : x(vx), y(vy), z(vz) {}
};
</struct>
CUDA supports up to three-dimensional indexing for both grids and blocks, enabling intuitive mapping to multidimensional datasets such as images or volumetric data. When launching kernels with fewer than three dimensions, any unspecified coordinates automatically initialize to 1.
The following program demonstrates a two-dimensional configuration where threads calculate their global coordinates:
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void analyzeHierarchy() {
    // Extract thread coordinates within block
    unsigned int tx = threadIdx.x;
    unsigned int ty = threadIdx.y;

    // Extract block coordinates within grid
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;

    // Extract dimension information
    unsigned int bdx = blockDim.x;
    unsigned int bdy = blockDim.y;
    unsigned int gdx = gridDim.x;
    unsigned int gdy = gridDim.y;

    printf("Thread(%u,%u) in Block(%u,%u) | BlockSize: %ux%u | GridSize: %ux%u\n",
           tx, ty, bx, by, bdx, bdy, gdx, gdy);

    // Compute global 2D coordinates
    unsigned int globalCol = bx * bdx + tx;
    unsigned int globalRow = by * bdy + ty;
    printf("Global Matrix Position: (row=%u, col=%u)\n\n", globalRow, globalCol);
}
int main() {
    // Configure a 2D grid to process an 8x6 matrix
    dim3 threadsPerBlock(4, 3);
    dim3 numBlocks(2, 2);

    printf("Launch Configuration:\n");
    printf("Grid: %u x %u blocks\n", numBlocks.x, numBlocks.y);
    printf("Block: %u x %u threads\n\n", threadsPerBlock.x, threadsPerBlock.y);

    analyzeHierarchy<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Note that printf from device code requires compute capability 2.0 (Fermi) or newer, so the kernel must be compiled for at least that architecture. Recent CUDA toolkits have dropped support for sm_20, so on a current toolkit substitute the lowest target your compiler accepts (for example sm_52):
nvcc -arch=sm_20 hierarchyDemo.cu -o hierarchyDemo
./hierarchyDemo