This is a short tutorial on SLURM. It walks users through creating an sbatch file, explains what each part does, and shows how to submit the job and where to find the results. The demo uses a Python file that generates a fractal from the Julia set. Both the Python file and the sbatch file can be found in a shared folder on Hellgate at the following path.
- /mnt/beegfs/projects/resources/SLURM
Step 1: Creating an Sbatch File
This tutorial assumes the sbatch file and the code it runs are in the same directory. If they are not, the location of the code will need to be specified within the sbatch file.
The first line of any sbatch file is called the shebang. It tells the system to interpret the script using the bash shell.
- #!/bin/bash
Next come the resources that will be allocated for the job. When creating the sbatch file there may be some trial and error in deciding how many resources are needed. Each resource request begins with a single "#" (as "#SBATCH"); anything beginning with "##" is read as a comment and ignored.
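As a rough sketch, the resource requests for this job might look like the lines below, placed directly under the shebang. The job name, file names, and resource amounts are placeholders and should be adjusted for the actual job.

    #SBATCH --job-name=julia_fractal    # name shown in the queue
    #SBATCH --output=out.txt            # file that collects normal (stdout) messages
    #SBATCH --error=err.txt             # file that collects error (stderr) messages
    #SBATCH --ntasks=1                  # run a single task
    #SBATCH --cpus-per-task=1           # CPU cores given to that task
    #SBATCH --mem=4G                    # memory for the whole job
    #SBATCH --time=00:30:00             # wall-clock time limit (HH:MM:SS)
    ##SBATCH --partition=example        # begins with ##, so SLURM ignores this line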
The next part of the sbatch file holds the commands that actually run when the job is submitted. These will depend on the job and the desired output. In this example an output directory is specified; it will hold the output when the job is done running. If no directory is specified, the output will default to the same directory as the sbatch file itself.
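For instance, the command section can create the output directory up front so the script has somewhere to write. The directory name fractals_output is taken from the example used later in this tutorial.

    # create the output directory if it does not already exist
    mkdir -p fractals_output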
Next, any required software and modules are loaded. In the example below, Python and two supporting packages, numpy and matplotlib, are loaded; these are necessary for the job to run. Any software or modules the job needs that are not already available in the default environment on Hellgate must be loaded here, and if a specific version is required it must be specified, otherwise the job will fail with an error and not run.
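The exact module names and versions on Hellgate can be checked with module avail; the names below are placeholders. A module-loading section might look like this:

    # start from a clean environment, then load Python
    module purge
    module load python        # substitute the actual module name/version from module avail
    # numpy and matplotlib may be bundled with the Python module or provided as
    # separate modules, depending on how the cluster is set up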
The final part of the sbatch file runs the script and specifies an output location if needed. In the example below the full path to the file is entered and an output directory, "output_dir", is specified. This way the job can be run from anywhere and the output files will still end up in the specified directory.
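A sketch of these final lines is shown below. The Python file name fractal.py and the way the output directory is passed as an argument are assumptions for illustration; check the actual script in the shared folder.

    # "output_dir" points at the fractals_output folder used in this example
    output_dir="fractals_output"
    mkdir -p "$output_dir"

    # run the script by its full path and write the results into the output directory
    python3 /mnt/beegfs/projects/resources/SLURM/fractal.py "$output_dir"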
Though each script will vary depending on the job, the basic outline will be the same. Below is the final sbatch script that will be submitted.
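Putting the pieces together, a complete sbatch script along these lines might look like the sketch below. The job name, resource amounts, module names, and Python file name are placeholders to adapt to the actual files in the shared folder.

    #!/bin/bash
    #SBATCH --job-name=julia_fractal    # name shown in the queue
    #SBATCH --output=out.txt            # normal (stdout) messages
    #SBATCH --error=err.txt             # error (stderr) messages
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --time=00:30:00

    # load the software the job needs (names/versions depend on the cluster)
    module purge
    module load python

    # create the output directory and run the script by its full path
    output_dir="fractals_output"
    mkdir -p "$output_dir"
    python3 /mnt/beegfs/projects/resources/SLURM/fractal.py "$output_dir"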
Step 2: Running the Job
To run the job, the sbatch command is used. Submit the job from the directory that contains the sbatch file so that the output files land in that directory. Once submitted, the job will be assigned a job ID number.
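Assuming the sbatch file is named fractal.sbatch (a placeholder name), submission looks like this; SLURM responds with a line such as "Submitted batch job 12345", where the number is the job ID.
- sbatch fractal.sbatch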
To view the state of jobs running on Hellgate, use the squeue command, which gives some additional information. A job can be found using the ID it was given at submission or by looking for the NetID that submitted it. Run with no arguments, squeue lists every job in the queue.
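The list can be narrowed to a single user or a single job; replace NetID and 12345 with your own values.
- squeue -u NetID
- squeue -j 12345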
After the job finishes it will no longer be visible in the queue. The example in this tutorial creates two extra output files, err.txt and out.txt, both of which were specified in the resource-request part of the sbatch file. This means these two files will be added to the directory containing the sbatch script once the job completes.
This example also specified an output directory, fractals_output, to house the output files. The err.txt will contain any error messages produced while the job ran, and the out.txt will contain some troubleshooting messages specified in the Python file. Neither of these is strictly necessary, but capturing them is considered good practice, especially when starting out.
This particular example has an image file as output, which will be located in the specified fractals_output directory.
Now that the job has run successfully, the output files can be used, the script can be altered, or the next job can be started. The sbatch script can continue to be reused and optimized.
Step 3: Final Steps
Often the output files need to be downloaded or moved off of Hellgate. This can be done using SFTP, as sketched below. There is a comprehensive tutorial that walks users through this process under the linux-basics folder. For more information or questions, please reach out and submit a ticket.
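As a quick sketch, transferring the results from a local machine might look like the following. The hostname shown is a placeholder, so use the address given in the Hellgate login instructions, and replace NetID and the file names with your own.

    # connect from your local machine
    sftp NetID@hellgate.example.edu
    # at the sftp> prompt, download the whole output directory
    get -r fractals_output
    # close the connection when done
    exit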