
SLURM Resource Manager

See: http://www.schedmd.com/slurmdocs/slurm.html

 

srun - Run a parallel program managed by SLURM

The srun program is used to launch jobs on the cluster. Use this instead of the mpirun command. See: http://www.schedmd.com/slurmdocs/srun.html

The following example schedules the binary "my-program" in the current directory to run on the cluster when resources are available. The program will be given 64 "tasks" (think CPU threads) spread across a minimum of 2 nodes. Note that your math needs to be correct! Each of our cluster nodes can run 32 threads at once. Thus, 64 tasks = 32 tasks per node * 2 nodes.

srun --ntasks 64 --nodes 2 ./my-program
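
If you prefer to spell out the per-node breakdown, srun also accepts --ntasks-per-node. A minimal sketch of an equivalent request, assuming the 32-threads-per-node figure above:

srun --ntasks 64 --nodes 2 --ntasks-per-node 32 ./my-program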

You can use srun to target specific "partitions" (groups of nodes) on the cluster.

# Use the 'gradclass' partition
srun --partition gradclass ./my-program

# Use the default "compute" partition
srun ./my-program
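
If you are not sure which partitions exist on our cluster, sinfo (covered below) can list them, one summary line per partition:

# One summary line per partition (name, availability, node counts)
sinfo --summarize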

 

sbatch - Run a parallel program via a batch job (submitted for later execution)

While the srun command runs your program interactively, the sbatch command runs your program in the background when resources are available.  See:  http://www.schedmd.com/slurmdocs/sbatch.html

First, you need to create a "batch file" to accompany your program.  Here is an example:

#!/bin/sh
#
# Kick off this job via:  sbatch sbatch_script
#
#SBATCH --job-name=MyCustomJob
#SBATCH --output=my_custom_job_out.txt
#SBATCH --error=my_custom_job_out.txt
#SBATCH --partition=compute
#SBATCH --ntasks=8
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2

module purge
module load mpi/openmpi-2.0.2

echo "--------------"
echo "Starting date:"
date
echo
echo "Elasped time:"
time srun ./my_program
echo
echo "Finished date:"
date
echo "--------------"

Once your batch script is customized and created, you can submit the job to the scheduler via:

sbatch sbatch_script
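
sbatch prints the ID of the job it submitted. To confirm that the job is queued or running, you can check it with squeue (covered below); <username> here is a placeholder for your own login name:

# Show only your jobs
squeue -u <username>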

  

scancel - Stop Running SLURM Job

Did you make a mistake? scancel can, as its name implies, cancel a currently running or scheduled job in SLURM. See: http://www.schedmd.com/slurmdocs/scancel.html

# Run squeue, find the job to cancel, and note its JOBID number
squeue
scancel <JOBID>
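
scancel can also select jobs by owner or by name instead of by job ID. For example (again, <username> is a placeholder for your own login name):

# Cancel ALL of your jobs -- use with care
scancel -u <username>

# Cancel a job by the name given to --job-name (MyCustomJob in the sbatch example above)
scancel --name MyCustomJob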

 

sinfo - View Cluster Scheduler Status

The sinfo program shows the status of the cluster. See: http://www.schedmd.com/slurmdocs/sinfo.html

  • What partitions (groups of nodes) exist?
  • What nodes are busy?
  • What nodes are idle?
  • What nodes are down?
sinfo

The following node status codes may be shown for any particular node. An asterisk (*) shown after a code means that the node is not responding.

  • ALLOCATED - The node has been allocated to one or more jobs.
  • ALLOCATED+ - The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING.
  • COMPLETING - All jobs associated with this node are in the process of COMPLETING.
  • DOWN - The node is unavailable for use. SLURM can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state.
  • DRAINED - The node is unavailable for use per system administrator request.
  • DRAINING - The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes. Nodes enter this state per system administrator request.
  • FAIL - The node is expected to fail soon and is unavailable for use per system administrator request.
  • FAILING - The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request.
  • IDLE - The node is not allocated to any jobs and is available for use.
  • MAINT - The node is currently in a reservation with a flag value of "maintenance".
  • UNKNOWN - The SLURM controller has just started and the node's state has not yet been determined.
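
To see the state of each individual node, rather than the default summary grouped by partition and state, ask sinfo for a node-oriented, long-format listing:

# One line per node, including its current state
sinfo --Node --long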

 

squeue - View Jobs Running / Scheduled on Cluster

The squeue tool shows which jobs are running or waiting to run on the cluster (i.e., are there 10 jobs ahead of you, or is your job next to run?). See: http://www.schedmd.com/slurmdocs/squeue.html

squeue
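
Two useful variations (replace <JOBID> with a real job ID taken from the default listing):

# Show a single job
squeue -j <JOBID>

# Show the scheduler's expected start times for pending jobs
squeue --start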

 

smap - Text summary of SLURM status

The smap utility offers similar functionality to sinfo, squeue, and scontrol, but in an interactive, text-based form. See: http://www.schedmd.com/slurmdocs/smap.html

smap
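
smap also accepts an iteration interval (in seconds) if you want the display to refresh periodically:

# Refresh the display every 2 seconds
smap -i 2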

 

sview - GUI summary of SLURM status

The sview utility offers similar functionality to sinfo, squeue, and scontrol, but in a graphical (GUI) form. X11 forwarding is required to use this tool. See: http://www.schedmd.com/slurmdocs/sview.html

sview &
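
Because sview is an X11 application, connect with X forwarding enabled before launching it. The hostname below is only a placeholder for our cluster's login node:

# -Y (or -X) enables X11 forwarding
ssh -Y <username>@<cluster-login-node>
sview &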

 

scontrol - Change SLURM configuration

For administrators only. See: http://www.schedmd.com/slurmdocs/scontrol.html

# Reset status of node 'node002' to idle if down
scontrol update nodename=node002 state=idle

# Restore status of node 'node002' to idle or allocated state if down
scontrol update nodename=node002 state=resume

# Restore status of nodes node002-node008 to idle or allocated state if down
scontrol update nodename=node00[2-8] state=resume
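
scontrol can also report a node's full record without changing anything, which is handy to check before resuming it:

# Show the full record for node 'node002' (state, reason, CPUs, memory, ...)
scontrol show node node002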