How to Find Unallocated Nodes on Slurm
Here is a quick two-step process for finding unallocated resources on SLURM.
Quickstart
sinfo # Lists every partition, node state, and node name in the cluster
scontrol show node [NODE_NAME] # Shows allocated vs. total CPUs, GPUs, and memory for the node named [NODE_NAME]
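If you already know which partition you care about, you can narrow the first step down to it (a small sketch; gpu is just one of the partition names from the example output below):
sinfo -p gpu -N -o "%N %T" # One line per node in the gpu partition: node name and state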
Details
First, run sinfo to view all the nodes in your SLURM cluster:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 2-00:00:00 1 down* secure-6
normal* up 2-00:00:00 4 mix secure-[2-3,11,15]
gpu up 2-00:00:00 8 mix secure-gpu-[1-2,15-20]
gpu up 2-00:00:00 5 alloc secure-gpu-[3-7]
nigam-a100 up 2-00:00:00 1 mix secure-gpu-9
boussard-a100 up 2-00:00:00 1 idle secure-gpu-10
ogevaert-a100 up 2-00:00:00 1 idle secure-gpu-11
studdert-compute up 7-00:00:00 1 idle secure-17
nigam-v100 up 2-00:00:00 1 mix secure-gpu-12
mrivas-v100 up 2-00:00:00 1 idle secure-gpu-13
nigam-h100 up 2-00:00:00 1 down* secure-gpu-14
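As a shortcut, sinfo's format options can also surface per-node CPU usage, which helps narrow down which nodes are worth inspecting with scontrol (a sketch; exact columns vary by Slurm version):
$ sinfo -N -o "%N %P %T %C %G"   # node, partition, state, CPUs as allocated/idle/other/total, GRES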
Second, use scontrol show node [NODE_NAME] to go node-by-node and see what’s available:
$ scontrol show node secure-gpu-9
NodeName=secure-gpu-9 Arch=x86_64 CoresPerSocket=28
CPUAlloc=20 CPUTot=56 CPULoad=16.08
AvailableFeatures=CPU_GEN:SKX,CPU_SKU:6330,CPU_FRQ:2.0GHz,GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:A100_PCIE,GPU_MEM:80GB,GPU_CC:7.0,CLOUD
ActiveFeatures=CPU_GEN:SKX,CPU_SKU:6330,CPU_FRQ:2.0GHz,GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:A100_PCIE,GPU_MEM:80GB,GPU_CC:7.0,CLOUD
Gres=gpu:4(S:0-1)
NodeAddr=10.4.34.75 NodeHostName=slurm-gpu-compute-dell-750xa-owners-f7nww Port=0 Version=19.05.5
OS=Linux 5.4.0-187-generic #207-Ubuntu SMP Mon Jun 10 08:16:10 UTC 2024
RealMemory=450000 AllocMem=409600 FreeMem=14785 Sockets=2 Boards=1
State=MIXED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=nigam-a100
BootTime=2024-07-09T00:47:45 SlurmdStartTime=2024-07-09T01:12:57
CfgTRES=cpu=56,mem=450000M,billing=56,gres/gpu=4
AllocTRES=cpu=20,mem=400G,gres/gpu=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In the above example, I only look at these two lines:
CfgTRES=cpu=56,mem=450000M,billing=56,gres/gpu=4
AllocTRES=cpu=20,mem=400G,gres/gpu=4
where:
CfgTRES tells me the max resources configured on the machine (i.e. 56 CPUs, 4 GPUs, and 450 GB of RAM)
AllocTRES tells me what’s already allocated (i.e. 20 CPUs, 4 GPUs, and 400 GB of RAM)
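If you only want those two lines, you can filter the scontrol output directly (a minimal sketch using grep):
$ scontrol show node secure-gpu-9 | grep -E 'CfgTRES|AllocTRES'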
So secure-gpu-9 in the nigam-a100 partition is a bad node to request: all 4 of its GPUs are already allocated, and SLURM won’t schedule our job there until the jobs currently holding them finish.
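To see which jobs are actually holding those GPUs (and how much longer they have), squeue can filter by node; this is a sketch and the format string is my own choice, so adjust the columns to taste:
$ squeue -w secure-gpu-9 -o "%.10i %.12u %.10M %.12L %.15b"   # job id, user, time used, time left, GRES requested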
Let’s retry with the nigam-v100 partition (i.e. secure-gpu-12):
$ scontrol show node secure-gpu-12
NodeName=secure-gpu-12 Arch=x86_64 CoresPerSocket=12
CPUAlloc=18 CPUTot=24 CPULoad=1.18
AvailableFeatures=CPU_GEN:SKX,CPU_SKU:6226,CPU_FRQ:2.7GHz,GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:V100_PCIE,GPU_MEM:40GB,GPU_CC:7.0,CLOUD
ActiveFeatures=CPU_GEN:SKX,CPU_SKU:6226,CPU_FRQ:2.7GHz,GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:V100_PCIE,GPU_MEM:40GB,GPU_CC:7.0,CLOUD
Gres=gpu:8(S:0-1)
NodeAddr=10.4.41.102 NodeHostName=slurm-gpu-compute-gpu2-nigam-owners-bqc6v Port=0 Version=19.05.5
OS=Linux 5.4.0-187-generic #207-Ubuntu SMP Mon Jun 10 08:16:10 UTC 2024
RealMemory=1450000 AllocMem=409600 FreeMem=1483289 Sockets=2 Boards=1
State=MIXED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=nigam-v100
BootTime=2024-07-09T19:07:43 SlurmdStartTime=2024-07-09T19:17:28
CfgTRES=cpu=24,mem=1450000M,billing=24,gres/gpu=8
AllocTRES=cpu=18,mem=400G,gres/gpu=7
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
We can immediately see that there’s 1 GPU (8 - 7), 6 CPUs (24 - 18), and ~1 TB of RAM (1450 GB - 400 GB) free on secure-gpu-12.
So in my srun request, if I set srun --nodelist secure-gpu-12 --gres=gpu:1 --cpus-per-task=3 --mem=100G, then I know that SLURM can fulfill my request immediately, since secure-gpu-12 has at least that many resources free.
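Finally, if you'd rather not step through nodes one at a time, here is a rough one-liner sketch that dumps each node's name, CfgTRES, and AllocTRES side by side (it assumes GNU grep/paste and the field names shown above):
$ scontrol -o show node | grep -oE '(NodeName|CfgTRES|AllocTRES)=[^ ]*' | paste - - -
Each output line then shows one node's configured vs. allocated TRES, so you can quickly scan for nodes with free GPUs.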