SLURM Resource Manager
srun - Run a parallel program managed by SLURM
The srun program is used to launch jobs on the cluster. Use this instead of the mpirun command. See: http://www.schedmd.com/slurmdocs/srun.html
The following example schedules the binary "my-program" in the current directory to run on the cluster when resources are available. The program will be given 64 "tasks" (think CPU threads) spread across a minimum of 2 nodes. Note that your math needs to be correct! Each of our cluster nodes can run 32 threads at once; thus, 64 tasks = 32 tasks per node * 2 nodes.
srun --ntasks 64 --nodes 2 ./my-program
You can use srun to target specific "partitions" on the machine.
# Use the 'gradclass' partition
srun --partition gradclass ./my-program

# Use the default "compute" partition
srun ./my-program
sbatch - Run a parallel program via a batch job (submitted for later execution)
While the srun command runs your program interactively, the sbatch command runs your program in the background when resources are available. See: http://www.schedmd.com/slurmdocs/sbatch.html
First, you need to create a "batch file" to accompany your program. Here is an example:
#!/bin/sh
#
# Kick off this job via: sbatch sbatch_script
#
#SBATCH --job-name=MyCustomJob
#SBATCH --output=my_custom_job_out.txt
#SBATCH --error=my_custom_job_out.txt
#SBATCH --partition=compute
#SBATCH --ntasks 8
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 2

module purge
module load mpi/openmpi/1.8.8

echo "--------------"
echo "Starting date:"
date
echo
echo "Elapsed time:"
time srun ./my_program
echo
echo "Finished date:"
date
echo "--------------"
Once your batch script is customized and created, you can submit the job to the scheduler via:
sbatch sbatch_script
scancel - Stop a Running SLURM Job
Did you make a mistake? As its name implies, scancel cancels a currently running or scheduled SLURM job. See: http://www.schedmd.com/slurmdocs/scancel.html
# Run squeue, find the job to cancel, and note its JOBID number
squeue
scancel <JOBID>
sinfo - View Cluster Scheduler Status
The sinfo program shows the status of the cluster, answering questions like the following. See: http://www.schedmd.com/slurmdocs/sinfo.html
- What partitions (groups of nodes) exist?
- What nodes are busy?
- What nodes are idle?
- What nodes are down?
The following node status codes may be shown for any particular node. A (*) shown after a code means that the node is not responding.
- ALLOCATED - The node has been allocated to one or more jobs.
- ALLOCATED+ - The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING.
- COMPLETING - All jobs associated with this node are in the process of COMPLETING.
- DOWN - The node is unavailable for use. SLURM can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state.
- DRAINED - The node is unavailable for use per system administrator request.
- DRAINING - The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes. Nodes enter this state per system administrator request.
- FAIL - The node is expected to fail soon and is unavailable for use per system administrator request.
- FAILING - The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request.
- IDLE - The node is not allocated to any jobs and is available for use.
- MAINT - The node is currently in a reservation with a flag value of "maintenance".
- UNKNOWN - The SLURM controller has just started and the node's state has not yet been determined.
squeue - View Jobs Running / Scheduled on Cluster
The squeue tool shows which jobs are running or waiting to run on the cluster (e.g., are there 10 jobs ahead of yours, or is your job the next to run?). See: http://www.schedmd.com/slurmdocs/squeue.html
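A few common invocations (a sketch; these are standard squeue options):

```shell
# Show every job currently queued or running on the cluster
squeue

# Show only your own jobs
squeue -u $USER

# Show SLURM's estimated start times for pending jobs
squeue --start
```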
smap - Text summary of SLURM status
The smap utility offers similar functionality to sinfo, squeue, and scontrol, but in a convenient interactive text-based form. See: http://www.schedmd.com/slurmdocs/smap.html
sview - GUI summary of SLURM status
The sview utility offers similar functionality to sinfo, squeue, and scontrol, but in a graphical form. X11 forwarding is required to use this tool. See: http://www.schedmd.com/slurmdocs/sview.html
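Because sview needs X11 forwarding, connect to the cluster with ssh -X first. The hostname below is a placeholder; substitute your cluster's login node:

```shell
# Enable X11 forwarding when connecting to the login node
# (hostname is hypothetical -- use your own cluster's address)
ssh -X username@cluster.example.edu

# Then launch the GUI
sview
```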
scontrol - Change SLURM configuration
For administrators only. See: http://www.schedmd.com/slurmdocs/scontrol.html
# Reset status of node 'node002' to idle if down
scontrol update nodename=node002 state=idle

# Restore status of node 'node002' to idle or allocated state if down
scontrol update nodename=node002 state=resume

# Restore status of nodes node002-node008 to idle or allocated state if down
scontrol update nodename=node00[2-8] state=resume