How to Run Sbatch in Slurm
Note: This post is a work in progress.
Quickstart
The script below requests 1 node with 8 CPU cores, 2 GB of memory per core, and 1 GPU for 1 minute of run time, and submits it as a job array of 5 tasks.
#!/bin/bash
#SBATCH --job-name=myjob # name of your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # Total number of tasks across all nodes
#SBATCH --cpus-per-task=8 # Use 8 CPU cores on the machine
#SBATCH --mem-per-cpu=2G # memory per cpu-core (4G is default)
#SBATCH --gres=gpu:1 # request a GPU
#SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin # send email when job begins
#SBATCH --mail-type=end # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
#SBATCH --output=slurm_%A_%a.out # print output to slurm_<job_id>_<array_id>.out
#SBATCH --array=1-5 # this creates an array
# Load Python
module load python
# Echo environment vars set by slurm
echo $SLURM_ARRAY_JOB_ID # unique job ID (integer)
echo $SLURM_ARRAY_TASK_ID # unique array ID within job (integer) - only exists if --array is specified
# Print
echo "hi"
Save the script as my_file.sbatch and launch the job by running:
sbatch my_file.sbatch
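The --array directive above is most useful when each array task processes a different input. Here is a minimal sketch of that pattern (the Python script and data filenames are hypothetical) that uses $SLURM_ARRAY_TASK_ID to pick an input per task:

```bash
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:10:00
#SBATCH --output=slurm_%A_%a.out
#SBATCH --array=1-5

module load python

# Each of the 5 array tasks gets a different SLURM_ARRAY_TASK_ID (1..5),
# which selects a different (hypothetical) input shard.
python process.py --input "data/shard_${SLURM_ARRAY_TASK_ID}.csv"
```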
Gotchas
In “99.9% of cases”, the correct way to request N CPU cores is:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=N
If you don’t use srun in your sbatch script, then always set --ntasks=1.
You can also likely ignore all references to MPI (the Message Passing Interface) – it is only needed for coordinating jobs that span multiple nodes. Usually, you’ll parallelize your scripts within a single node, as in the sketch below.
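As a sketch of that single-node pattern (the binary name is hypothetical), request all the cores through --cpus-per-task and pass the count to your program via $SLURM_CPUS_PER_TASK:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00

# One task, many cores: no srun or MPI required.
# Many threaded programs respect an explicit thread count.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program   # hypothetical multi-threaded binary
```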
Sbatch Guide
What does each #SBATCH directive mean?
| #SBATCH | Alias | Format | Example | Definition |
|---|---|---|---|---|
| --job-name | -J | String | my_job | Name of job. Will appear in the output of commands like squeue |
| --nodes | -N | Integer | 1 | Node count. Defaults to allocating enough nodes to fulfill the other resource demands |
| --ntasks | -n | Integer | 1 | Total number of tasks across all nodes. Default is one CPU per task. Only required for MPI workloads; to take advantage of multiple tasks you must use the srun command to launch each individual task |
| --cpus-per-task | -c | Integer | 1 | CPU cores per task. Must be <= the number of cores available on a single compute node |
| --gpus-per-node | | Integer | 1 | Number of GPUs per node |
| --gres | | String | gpu:1 | Generic resources, e.g. number of GPUs per node |
| --mem | | Integer | 2G | Memory per node. Defaults to MB if no unit is given |
| --mem-per-cpu | | Integer | 2G | Memory per CPU core |
| --mem-per-gpu | | Integer | 2G | Memory per GPU |
| --partition | -p | String | nigam-a100 | Name of partition to run the job on |
| --time | -t | DD-HH:MM:SS | 01-00:00:00 (1 day) | Total run time limit |
| --mail-type | | String; one of begin/end/fail/requeue/all | begin | When to send email alerts about the job (e.g. when it begins or ends) |
| --mail-user | | String | email@email.com | Email address to send job alerts to |
| --input | -i | String | stdin_%A.in | Where stdin is read from |
| --output | -o | String | stdout_%A.out | Where stdout is written |
| --error | -e | String | stderr_%A.out | Where stderr is written |
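Putting a few of these directives together, here is a sketch of a single-GPU job (the partition name is taken from the example above; train.py is hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH -p nigam-a100
#SBATCH --time=01-00:00:00
#SBATCH --output=stdout_%j.out   # %j expands to the job ID
#SBATCH --error=stderr_%j.out

module load python
python train.py   # hypothetical training script
```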
Useful Slurm Commands
List status of current jobs
Command
squeue -u mwornow   # only show info for user `mwornow`
squeue -j <JOB_ID>  # only show info for the job with ID <JOB_ID>
squeue -l           # show additional job info
# Flags can be combined, e.g. `squeue -u mwornow -l`
Output
XXXX
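To poll the queue periodically, one simple option is to wrap squeue in watch (a sketch):

```bash
# Re-run `squeue` every 10 seconds for user `mwornow`
watch -n 10 squeue -u mwornow
```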
View currently running job status
Command
scontrol show job <JOB_ID>
Output
XXXX
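scontrol can also modify jobs and inspect the cluster; a couple of commonly used subcommands:

```bash
scontrol hold <JOB_ID>      # keep a pending job from starting
scontrol release <JOB_ID>   # let a held job run again
scontrol show partition     # list partitions and their limits
```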
Cancel job
Command
scancel <JOB_ID>
Output
XXXX
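A few other scancel forms that come in handy (all standard flags):

```bash
scancel -u mwornow             # cancel all jobs belonging to user `mwornow`
scancel --name=myjob           # cancel jobs by job name
scancel -t PENDING -u mwornow  # cancel only your pending jobs
```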
Launch interactive job
salloc                 # request an allocation with default resources, then...
srun --pty /bin/bash   # ...start an interactive shell on the allocated node
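To ask for specific resources interactively, the same flags can be passed to srun directly; a sketch:

```bash
# 1 task, 4 cores, 8 GB of memory, 1 hour, interactive shell on a compute node
srun --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty /bin/bash
```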
References
- https://researchcomputing.princeton.edu/support/knowledge-base/slurm
- https://sites.google.com/nyu.edu/nyu-hpc/training-support/tutorials/slurm-tutorial
- https://hpc.llnl.gov/banks-jobs/running-jobs/slurm
- https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
- https://login.scg.stanford.edu/faqs/cores/ – great, easy-to-read website
- https://slurm.schedmd.com/tutorials.html – official Slurm tutorials; extremely detailed