Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

CUDA Thread Hierarchy: Organizing Grids, Blocks, and Threads for Parallel Execution

Tech 1

When a kernel is invoked from the host, the CUDA runtime instantiates a collection of threads on the device to execute the kernel code in parallel. These threads are arranged in a hierarchical structure that facilitates both scalability and cooperation: the grid, the thread block, and the individual thread.

A grid encompasses all threads generated by a single kernel launch. Every thread within a given grid has access to the same global memory space. The grid is partitioned into thread blocks, which represent collaborative units of execution. Threads within the same block can synchronize their execution via __syncthreads() and exchange data through high-speed shared memory. However, no synchronization mehcanism exists between threads residing in different blocks.

To distinguish themselves within this hierarchy, threads utilize four built-in variables available exclusively in device code:

  • blockIdx: A uint3 structure identifying the block's coordinates within the grid (components: x, y, z)
  • threadIdx: A uint3 structure identifying the thread's coordinates within its block (components: x, y, z)
  • blockDim: A dim3 structure specifying the dimensions of the block (threads per block in each dimension)
  • gridDim: A dim3 structure specifying the dimensions of the grid (blocks per grid in each dimension)

Both uint3 and dim3 are built-in structures containing three unsigned integers. The uint3 structure is defined as:

struct __device_builtin__ uint3 {
    unsigned int x, y, z;
};

The dim3 structure extends this with constructors that default unspecified dimensions to 1:

struct __device_builtin__ dim3 {
    unsigned int x, y, z;
    __host__ __device__ dim3(unsigned int vx = 1, 
                             unsigned int vy = 1, 
                             unsigned int vz = 1) 
        : x(vx), y(vy), z(vz) {}
};

CUDA supports up to three-dimensional indexing for both grids and blocks, enabling intuitive mapping to multidimensional datasets such as images or volumetric data. When launching kernels with fewer than three dimensions, any unspecified coordinates automatically initialize to 1.

The following program demonstrates a two-dimensional configuration where threads calculate their global coordinates:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void analyzeHierarchy() {
    // Extract thread coordinates within block
    unsigned int tx = threadIdx.x;
    unsigned int ty = threadIdx.y;
    
    // Extract block coordinates within grid
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;
    
    // Extract dimension information
    unsigned int bdx = blockDim.x;
    unsigned int bdy = blockDim.y;
    unsigned int gdx = gridDim.x;
    unsigned int gdy = gridDim.y;
    
    printf("Thread(%u,%u) in Block(%u,%u) | BlockSize: %ux%u | GridSize: %ux%u\n",
           tx, ty, bx, by, bdx, bdy, gdx, gdy);
           
    // Compute global 2D coordinates
    unsigned int globalCol = bx * bdx + tx;
    unsigned int globalRow = by * bdy + ty;
    printf("Global Matrix Position: (row=%u, col=%u)\n\n", globalRow, globalCol);
}

int main() {
    // Configure a 2D grid to process an 8x6 matrix
    dim3 threadsPerBlock(4, 3);
    dim3 numBlocks(2, 2);
    
    printf("Launch Configuration:\n");
    printf("Grid: %d x %d blocks\n", numBlocks.x, numBlocks.y);
    printf("Block: %d x %d threads\n\n", threadsPerBlock.x, threadsPerBlock.y);
    
    analyzeHierarchy<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    
    cudaDeviceReset();
    return 0;
}

Note that printf from device code requires Fermi architecture (SM 2.0) or newer. Compile with architecture specification:

nvcc -arch=sm_20 hierarchyDemo.cu -o hierarchyDemo
./hierarchyDemo

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.