
Debugging Multi-GPU Torch Runs

Posted on January 27, 2025 • Tags: llms pytorch multi-gpu

To debug a distributed (i.e., multi-GPU) PyTorch run, first run:

export TORCH_DISTRIBUTED_DEBUG=DETAIL

…then run your script.
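
For example, with the torchrun launcher the variable can also be set inline for a single run (train.py here stands in for your own training script):

TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 train.py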

With this variable set, your script will throw an error as soon as ranks issue mismatched collectives, rather than stalling indefinitely. The error will look something like:

RuntimeError: 
  Detected mismatch between collectives on ranks. 
  Rank 1 is running collective: 
    CollectiveFingerPrint(SequenceNumber=102, OpType=ALLREDUCE, TensorShape=[2361600], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), 
  but Rank 0 is running collective: 
    CollectiveFingerPrint(SequenceNumber=0, OpType=REDUCE).
  Collectives differ in the following aspects:
    Sequence number: 102 vs 0
    Op type: ALLREDUCE vs REDUCE
    Tensor shapes: [2361600] vs
    Tensor dtypes: BFloat16 vs
    Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)) vs
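
Reading the error above: rank 1 is issuing its 103rd collective (an ALLREDUCE over a 2361600-element BFloat16 tensor, typical of a gradient sync) while rank 0 is issuing a REDUCE as its very first collective, so the two ranks have diverged in the code they are executing. Below is a minimal sketch of the kind of bug that produces this class of error; the tensor shape, backend, and script structure are illustrative assumptions, not taken from the original run. Launch it with the torchrun command above to see the behavior:

import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    x = torch.ones(4, device="cuda")
    if dist.get_rank() == 0:
        # Rank 0 issues a REDUCE...
        dist.reduce(x, dst=0)
    else:
        # ...while every other rank issues an ALLREDUCE.
        # Without TORCH_DISTRIBUTED_DEBUG=DETAIL, the job deadlocks here;
        # with it, the debug wrapper compares collective fingerprints
        # across ranks and raises a mismatch error like the one above.
        dist.all_reduce(x)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()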