How to Run Multiple Serial Programs as a Single SLURM Job

Job Arrays are a SLURM feature that automates the process of submitting and managing a collection of similar jobs. Using a Job Array, we can put a whole batch of jobs in the queue with a single sbatch command. For example, if we have two similar jobs that we want to run, we can submit them to the queue using a Job Array, and each of them will appear as its own entry in the SLURM queue, as in the sketch below.
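
As a rough illustration (the job ID 1234, user alice, and partition cpu are made up, and the exact columns depend on how squeue is configured at your site), a two-element Job Array might look like this in the queue:

$ squeue -u alice
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
 1234_0       cpu  demo alice  R  0:05     1 tiger-i26c1n16
 1234_1       cpu  demo alice  R  0:05     1 tiger-i26c1n16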

Sometimes, however, we want to do something slightly different. Instead of submitting a bunch of similar jobs separately to the queue, we might want to collect these jobs into a single large job. Say we have the same two jobs that we want to submit, but we bundle them together and submit them to the queue as one job. We will then have only one job in the SLURM queue, along the lines of the sketch below.
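
With the same made-up names as above, the queue would then show a single entry covering both pieces of work:

$ squeue -u alice
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
   1235       cpu  demo alice  R  0:05     1 tiger-i26c1n16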

This post shows you how to do that for serial programs.

Example Program

To illustrate how a collection of similar jobs can be bundled and submitted as a single SLURM job, we will use a simple Python script called demo.py. All the examples are run on Princeton University’s tigercpu cluster.


#!/usr/bin/env python
# demo.py:
# usage: python demo.py [job-number]

import sys
import socket
from time import sleep


def work(jobnum):
  print("Starting job {} on {}.".format(jobnum, socket.gethostname()))
  sleep(5)
  print("Finished job {}...\n".format(jobnum))


if __name__ == "__main__":
  jobnum = sys.argv[1]
  work(jobnum)

The script is a serial program that takes a single parameter, the job number. It prints the job number and hostname on startup, and the job number again once it finishes. For example:

$ module load anaconda3
$ python demo.py 0
Starting job 0 on tigercpu.princeton.edu.
Finished job 0...

Say we want to run a number of these jobs with different job numbers, but bundled together such that they go into the SLURM queue as one single job. There are two ways to do this: one is to run the jobs sequentially as a single SLURM job, the other is to run them in parallel as a single SLURM job. Both cases are illustrated in the following two sections.

Running Jobs Sequentially as a Single SLURM Job

Say we have 3 jobs that we want to run, and we want to run them sequentially, one after another, as a single SLURM job. We can do this by using the following sbatch script:

#!/bin/bash

#SBATCH -N 1 # 1 node
#SBATCH -n 1 # 1 task
#SBATCH -c 1 # 1 core per task
#SBATCH -t 00:03:00 # time required, here it is 3 min
module load anaconda3
# Execute jobs sequentially
srun -N 1 python demo.py 0
srun -N 1 python demo.py 1
srun -N 1 python demo.py 2

NOTE: since the programs run sequentially, we must request enough time for all of them to finish, i.e. at least the sum of the individual runtimes. Here each job sleeps for about 5 seconds, so 3 jobs in sequence need at least 15 seconds or so; the 3 minutes requested above leaves a comfortable margin.

If we call this sbatch script run.sh, we can submit it to the SLURM queue with the command:

sbatch run.sh

The jobs will then go into the queue, bundled as one SLURM job, and output like the following will appear in the job's output file (by default slurm-<jobid>.out):

Starting job 0 on tiger-i26c1n16.
Finished job 0...

Starting job 1 on tiger-i26c1n16.
Finished job 1...

Starting job 2 on tiger-i26c1n16.
Finished job 2...

We see from the print statements that the jobs were run sequentially, and that they all ran on the same host, tiger-i26c1n16.

Running Jobs in Parallel as a Single SLURM Job

If we have the same 3 jobs and we want to run them in parallel as a single SLURM job, we can use the following sbatch script:

#!/bin/bash

#SBATCH -N 3 # 3 nodes
#SBATCH -n 3 # 3 tasks
#SBATCH -c 1 # 1 core per task
#SBATCH -t 00:01:00 # time required, here it is 1 min
module load anaconda3
# Execute jobs in parallel
srun -N 1 -n 1 python demo.py 0 &
srun -N 1 -n 1 python demo.py 1 &
srun -N 1 -n 1 python demo.py 2 &
wait

Since we want to run the jobs in parallel, we launch each srun in the background with & so that they run concurrently. Because concurrent sruns cannot share nodes by default, we need to request three nodes and three tasks, one for each srun. In the srun commands we then distribute the resources by giving each srun one task on one node. Notice the wait command at the end, which ensures that the SLURM job does not exit until all the backgrounded sruns have finished.

NOTE: here the programs run in parallel, which means we only need to request the time it takes for one program to run (the longest-running program, if their runtimes differ).

If this sbatch script is called run.sh, we submit it to the SLURM queue with the command:

sbatch run.sh

which will put the jobs in the SLURM queue as one job. An example output is:

Starting job 2 on tiger-h21c1n21.
Finished job 2...

Starting job 1 on tiger-h21c1n8.
Finished job 1...

Starting job 0 on tiger-h21c1n20.
Finished job 0...

Notice how the print order is not necessarily sequential, since the jobs are running in parallel. Also notice that the hostnames are all different, unlike what we saw in the example with sequential execution.
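
Each srun in the script launches a separate job step, so after the job completes we can also confirm the layout with sacct. The sketch below is illustrative: the job ID 1236, partition, and account names are made up, and the exact columns and widths depend on your SLURM version and configuration:

$ sacct -j 1236
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1236             run.sh        cpu    default          3  COMPLETED      0:0
1236.batch        batch               default          1  COMPLETED      0:0
1236.0           python               default          1  COMPLETED      0:0
1236.1           python               default          1  COMPLETED      0:0
1236.2           python               default          1  COMPLETED      0:0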
