SLURM Workload Manager
SLURM stands for Simple Linux Utility for Resource Management and is a job scheduler used in high performance computing (HPC) environments to process big data. It is free and open-source software that many data centres worldwide use as their workload manager or scheduler. The software can be downloaded here. Development was started by organizations such as Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, HP, and Groupe Bull, but many more have contributed more recently.
An HPC machine is often accessed through a dedicated set of login nodes, for example via an SSH connection. These login nodes should be used to write and compile scientific, engineering, or business applications. They can also be used for any pre- and post-processing of the datasets involved in the large processing run. The large processing run itself, also called a computing job, is handled by the SLURM workload manager. There are different ways to submit a job to the system with this tool. One example is to use the srun command with parameters taken from a batch script, as shown in the script below.
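For a quick interactive test, srun can also launch a program directly from the login node without a batch script. A minimal sketch, with illustrative resource values, that simply runs the hostname command on each allocated task:

srun --nodes=1 --ntasks=4 --time=00:10:00 hostname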
Example Script
The following example of the SLURM srun command shows the submission of a parallel training job using the piSVM parallel code. This code is an implementation of a parallel and scalable Support Vector Machine (SVM); another example covering the submission of a parallel prediction job can be found in our SRUN article. The SVM implementation and example are derived from the article On Understanding Big Data Impacts in Remotely Sensed Image Classification Using Support Vector Machine Methods.
In the example below, the walltime limit is set to 1 hour (#SBATCH --time) and a particular batch partition of the corresponding HPC machine is selected (#SBATCH --partition=batch). The job uses remote sensing training data from the Indian Pines dataset (area_panch_traindata). The srun command then launches the training executable on the resources requested by the #SBATCH directives: 2 nodes with 24 cores each, resulting in 48 parallel tasks in this example.
file content submit-training.sh
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=01:00:00
#SBATCH --partition=batch
#SBATCH --mail-user=service@big-data.tips
#SBATCH --mail-type=ALL
#SBATCH --job-name=train-record-2-48-24

### location executable
PISVM=/home/tools/pisvm-1.2.1/pisvm-train

### location training data
TRAINDATA=/home/bigdata/indianpines/area_panch_traindata

### submit
srun $PISVM -D -o 1024 -q 512 -c 100 -g 8 -t 2 -m 1024 -s 0 $TRAINDATA
The command-line parameters that follow the executable in the srun line belong to the piSVM training program, not to SLURM. They set the cache size, the type of SVM, the kernel function to use, and some other implementation details. The following command submits the job to the SLURM scheduler:
sbatch submit-training.sh
Output:
Submitted batch job 4711255
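If the job id is needed in a follow-up script, the --parsable option of sbatch prints only the id instead of the full message. A small sketch, assuming the same submission script:

JOBID=$(sbatch --parsable submit-training.sh)
echo "Submitted job $JOBID"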
The job id 4711255 is returned and can be used to obtain information about the job status and runtime environment. The following command shows the details:
scontrol show job 4711255
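While the job is pending or running, its state can also be monitored with squeue, either for this single job id or for all jobs of the current user:

squeue -j 4711255
squeue -u $USER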
The same job id can also be used to cancel the job run as follows:
scancel 4711255
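After the job has finished, runtime and exit state can be inspected with sacct, provided job accounting is enabled on the cluster; for example:

sacct -j 4711255 --format=JobID,JobName,Elapsed,State,ExitCode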
More details about SLURM
The following video provides good insights into this topic: