Debugging Multi-GPU Torch Runs
To debug a distributed (i.e., multi-GPU) PyTorch run, first run:
export TORCH_DISTRIBUTED_DEBUG=DETAIL
…then run your script.
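If you prefer, the variable can also be set from Python; here's a minimal sketch, assuming the value is read when the process group is created (so it must be set before init_process_group):

import os

# Set before init_process_group, since the debug level is read at process-group creation.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch.distributed as dist
dist.init_process_group("nccl")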
Setting this will cause your script to fail with an explicit error instead of hanging silently. The error will look something like:
RuntimeError:
Detected mismatch between collectives on ranks.
Rank 1 is running collective:
CollectiveFingerPrint(SequenceNumber=102, OpType=ALLREDUCE, TensorShape=[2361600], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))),
but Rank 0 is running collective:
CollectiveFingerPrint(SequenceNumber=0, OpType=REDUCE).
Collectives differ in the following aspects:
  Sequence number: 102 vs 0
  Op type: ALLREDUCE vs REDUCE
  Tensor shapes: 2361600 vs
  Tensor dtypes: BFloat16 vs
  Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)) vs
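To see where an error like this comes from, here's a minimal sketch of the kind of bug that triggers it: rank 0 issues a REDUCE while every other rank issues an ALLREDUCE. The script name and world size below are illustrative, not from the original run.

import os
import torch
import torch.distributed as dist

def main():
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # enable collective-mismatch checking
    dist.init_process_group("nccl")  # torchrun supplies RANK, WORLD_SIZE, etc.
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(4, device="cuda")
    if rank == 0:
        dist.reduce(t, dst=0)  # rank 0 runs a REDUCE...
    else:
        dist.all_reduce(t)     # ...while the other ranks run an ALLREDUCE
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with something like torchrun --nproc_per_node=2 mismatch.py, this fails immediately with a CollectiveFingerPrint mismatch like the one above; without TORCH_DISTRIBUTED_DEBUG=DETAIL, it would typically just hang.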