Troubleshooting NCCL Initialization Failures in Docker Environments
Deploying multi-GPU workloads inside Docker containers often relies on the NVIDIA Collective Communications Library (NCCL) for inter-process synchronization. A frequent obstacle arises when a job scales beyond two GPUs, producing an `ncclCommInitRank failed` runtime error. This issue typically stems from system shared memory constraints rather than GPU VRAM limitations.
Reproduction Scenario
Consider an environment equipped with four GeForce GTX 1080 accelerators using the `nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0` image. A distributed training script initialized via MPI might be structured as follows:
```python
from nnabla import communicators
from nnabla.ext_utils import get_extension_context


def initialize_communicator():
    # Create a CUDA/cuDNN extension context for this process.
    backend_engine = "cudnn"
    compute_context = get_extension_context(backend_engine)
    # NCCL-backed communicator; initialization sets up the
    # communication rings across all MPI ranks.
    distributed_comm = communicators.MultiProcessCommunicator(compute_context)
    distributed_comm.initialize()
    return distributed_comm


if __name__ == "__main__":
    comm = initialize_communicator()
    print(f"Rank: {comm.rank}, World Size: {comm.size}")
```
Executing this script with `mpiexec -np 4 python script.py` triggers a `RuntimeError` during the initialization phase. The stack trace references `multi_process_data_parallel_communicator.cu`, confirming the failure originates within the NCCL layer.
Diagnostic Analysis
Enabling verbose logging provides clarity on the failure mode. Re-running the job with `NCCL_DEBUG=INFO` set exposes warnings related to shared memory allocation:

```
NCCL WARN Call to posix_fallocate failed : No space left on device
NCCL WARN Error while creating shared memory segment ...
```
Inspection of GPU memory via `nvidia-smi` shows adequate VRAM availability, indicating the bottleneck is not on the device. The error `No space left on device` refers to the host's shared memory filesystem mounted at `/dev/shm` inside the container.
Root Cause
Docker containers default to a `/dev/shm` size of 64MB. NCCL requires larger shared memory segments to manage communication rings between multiple GPUs. When the requested segment size exceeds the available 64MB, initialization fails.
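To confirm this from inside the container, the capacity of `/dev/shm` can be checked directly. The following is a minimal sketch using only the Python standard library; the 64 MiB threshold mirrors the Docker default described above:

```python
import shutil

# Inspect the shared-memory filesystem that the NCCL warnings point at.
usage = shutil.disk_usage("/dev/shm")
total_mib = usage.total // (1024 * 1024)
free_mib = usage.free // (1024 * 1024)
print(f"/dev/shm total: {total_mib} MiB, free: {free_mib} MiB")

# Docker's 64 MiB default is typically too small for multi-GPU NCCL rings.
if total_mib <= 64:
    print("Shared memory is at the Docker default; NCCL segment creation may fail.")
```

If the reported total is 64 MiB, the container is running with the default allocation and one of the strategies below is needed.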
Resolution Strategies
Three methods exist to mitigate this shared memory limitation:
- Disable Shared Memory: Configure NCCL to avoid using shared memory segments by setting the environment variable `NCCL_SHM_DISABLE=1`. For NCCL versions older than 2.7, also set `NCCL_P2P_LEVEL=SYS`. This can be done via `/etc/nccl.conf` or `export` commands. Note that bypassing shared memory may reduce communication efficiency.
- Bind Mount Host Directory: Map the host's shared memory into the container using the volume flag `-v /dev/shm:/dev/shm`. While effective, this approach may leave orphaned files on the host system after the container exits.
- Increase Container Limit: Allocate more shared memory to the container at runtime using the `--shm-size` flag. For example, `docker run --shm-size=256m ...` provides sufficient space for NCCL segments without modifying the host filesystem structure.
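For the first strategy, the environment variables can also be set programmatically, provided this happens before the communicator (and thus NCCL) is initialized. A minimal sketch; whether `NCCL_P2P_LEVEL` is required depends on the NCCL version, per the < 2.7 caveat above:

```python
import os

# Strategy 1: disable NCCL's shared-memory transport.
# These must be set before the communicator is created, since NCCL
# reads them at initialization time.
os.environ["NCCL_SHM_DISABLE"] = "1"

# For NCCL releases older than 2.7, peer-to-peer traffic must also be
# routed through the system (SYS) path.
os.environ.setdefault("NCCL_P2P_LEVEL", "SYS")

print(os.environ["NCCL_SHM_DISABLE"], os.environ["NCCL_P2P_LEVEL"])
```

Placing these lines at the top of the training script keeps the workaround self-contained, at the cost of the reduced communication efficiency noted above.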
Utilizing the `--shm-size` flag is generally recommended for isolated environments, as it ensures resource adequacy while maintaining container hygiene.
Related Resources
- NVIDIA NCCL GitHub Issue #290
- PaddlePaddle Pull Request #28484
- Horovod GitHub Issue #2395
- NVIDIA NCCL GitHub Issue #406