thoughts on Slurm ~ Ted Malliaris ~ from the wiki at tedm.us

Slurm

free and open-source job scheduler for Linux/Unix-like computer clusters

Clusters are large, highly specialized computers. They are generally accessed over SSH, with interaction taking place through a command-line interface (CLI). A single cluster may have hundreds of CPUs/GPUs, allowing multiple users to run jobs simultaneously. Depending on the workload and the resources available, a given job may also run in parallel across several cores or nodes, and its processes may communicate with one another (via MPI, for example). Configuring a computer cluster is not trivial: users have varying hardware needs, software needs, priorities, and skill levels. Job schedulers like Slurm take these factors into account when allocating resources, which helps keep the cluster operating smoothly.

Slurm provides users with commands like sinfo, squeue, srun, and scancel to inspect the cluster configuration, check job queues, and submit, monitor, or cancel jobs. Running cluster jobs can be tricky: input and output files must be coordinated, job run time must be estimated in advance, and so on. Clusters usually offer a testing queue (a "partition" in Slurm's terminology) for working out the kinks, and sometimes the option to run jobs interactively for immediate feedback. Jobs can run for days, and users can arrange to be emailed when a job completes or fails.
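
As a rough illustration of the points above (a run time estimate, an output file, and email notification), here is a minimal Python sketch that writes out the text of a Slurm batch script. The helper names slurm_time and batch_script, the partition name "test", the command, and the email address are all hypothetical, chosen only for this example. The #SBATCH options shown (--job-name, --partition, --time, --output, --mail-type, --mail-user) are standard sbatch options, but which partitions exist and what time limits apply depends on the cluster.

```python
from datetime import timedelta


def slurm_time(limit: timedelta) -> str:
    """Format a run time estimate as D-HH:MM:SS, one of the forms --time accepts."""
    total = int(limit.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def batch_script(command, job_name, limit, partition, email=None):
    """Build the text of a batch script (hypothetical helper, not part of Slurm)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={slurm_time(limit)}",
        # In the output filename, %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
    ]
    if email:
        # Ask Slurm to send mail when the job ends or fails.
        lines.append("#SBATCH --mail-type=END,FAIL")
        lines.append(f"#SBATCH --mail-user={email}")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed values: a partition called "test" and a made-up command and address.
    print(batch_script("./simulate input.dat", "demo",
                       timedelta(hours=2, minutes=30), "test",
                       email="user@example.com"))
```

The printed text could be saved to a file and submitted with sbatch; the job would then appear in squeue output and could be cancelled with scancel followed by its job ID.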