Priority
The Slurm job scheduler is responsible for queuing jobs and submitting them to the compute nodes. Each job is assigned a priority value: the higher this number, the further up the job sits in the queue and the sooner it will start. The calculation of this priority depends on many factors, however, and can be difficult to follow or to compare between jobs. The following sections therefore explain how it is done:
Formula
The formula used to calculate the priority consists of different factors and weights. On PALMA, we use the following:
Job_priority =
    (PriorityWeightAge) * (age_factor)
  + (PriorityWeightFairshare) * (fair-share_factor)
  + (PriorityWeightJobSize) * (job_size_factor)
There can be more factors - see https://slurm.schedmd.com/priority_multifactor.html for the full multifactor priority documentation.
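To make the weighted sum concrete, here is a minimal Python sketch. The weights match the sprio -w output shown further below; the factor values are invented purely for illustration and do not correspond to a real job.

# Illustrative sketch of the weighted-sum priority formula above.
# Weights match the `sprio -w` output shown below; the factor values
# are invented example numbers between 0.0 and 1.0.
PRIORITY_WEIGHT_AGE = 20000
PRIORITY_WEIGHT_FAIRSHARE = 200000
PRIORITY_WEIGHT_JOBSIZE = 10000

def job_priority(age_factor, fairshare_factor, jobsize_factor):
    # Weighted sum of the individual factors (each between 0.0 and 1.0).
    return int(
        PRIORITY_WEIGHT_AGE * age_factor
        + PRIORITY_WEIGHT_FAIRSHARE * fairshare_factor
        + PRIORITY_WEIGHT_JOBSIZE * jobsize_factor
    )

# A job that has waited the full 14 days (age factor 1.0), whose user has a
# fair-share factor of 0.012, and which requests a small slice of the cluster:
print(job_priority(1.0, 0.012, 0.08))  # -> 23200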
Factors
All factors are determined by Slurm itself and result in a number between 0.0 and 1.0.
- age_factor = 0.0 - 1.0, depending on how long the job has been waiting in the queue. The maximum value of 1.0 is reached after 14 days.
- fair-share_factor = 1.0 - 0.0, depending on how many resources a user has already used in the past. The recorded usage that contributes to the fair-share factor is halved every 7 days, so your fair-share value recovers when you use few resources for a while (see the sketch after this list).
- job_size_factor = 0.0 - 1.0, depending on the amount of resources requested for a job. A job asking for the complete cluster would get a factor of 1.0, so larger jobs are slightly favored.
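The following Python sketch is an approximation, not Slurm's exact implementation; it only restates the two rules above: the age factor grows linearly and saturates after 14 days, and recorded usage is halved for every 7 days that have passed.

def age_factor(days_waiting):
    # 0.0 .. 1.0, reaching the maximum after 14 days in the queue
    return min(days_waiting / 14.0, 1.0)

def decayed_usage(raw_usage, days_ago):
    # past usage counts only half as much for every 7 days that have passed
    return raw_usage * 0.5 ** (days_ago / 7.0)

print(age_factor(7))             # -> 0.5
print(decayed_usage(10000, 14))  # -> 2500.0 (two half-lives later)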
Weights
The weights can be configured by the administrators, making each factor more or less important. They can be displayed with the sprio command:
sprio -w
  JOBID PARTITION  PRIORITY   SITE     AGE  FAIRSHARE  JOBSIZE
Weights                          1   20000     200000    10000
Showing job priorities
In the example below you can see the output of the squeue and sprio commands for four example jobs waiting in the normal partition (note that we are only comparing jobs within a single partition).
$ squeue -P -p normal --sort=-p,i --state=PD | head -n 5
JOBID | PARTITION | STATE | CPUS | MIN_MEMORY | NODELIST(REASON) | TIME_LEFT | PRIORITY
1 | normal | PENDING | 1344 | 2500M | (Resources) | 7-00:00:00 | 23162
2 | normal | PENDING | 36 | 8G | (Resources) | 7-00:00:00 | 21534
3 | normal | PENDING | 36 | 8G | (Priority) | 7-00:00:00 | 21534
4 | normal | PENDING | 36 | 8G | (Priority) | 7-00:00:00 | 21534
$ sprio -lp normal --sort=-y,i | head -n 5
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE
1 normal 23162 0 20000 2375 788
2 normal 21534 0 19310 2205 19
3 normal 21534 0 19310 2205 19
4 normal 21534 0 19310 2205 19
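As a quick sanity check (a sketch, assuming no further factors contribute for these jobs): the per-factor columns reported by sprio simply add up to the PRIORITY column, up to small rounding differences. In Python, for job 1 above:

# Values taken from the sprio output of job 1 above.
site, age, fairshare, jobsize = 0, 20000, 2375, 788
print(site + age + fairshare + jobsize)  # -> 23163, i.e. PRIORITY 23162 up to rounding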
Backfilling
In addition to the normal queuing algorithm, Slurm also uses a so-called backfilling algorithm: it tries to squeeze smaller jobs in between larger ones, as long as doing so does not delay the start of the higher-priority jobs. This allows for a much better overall cluster utilization, as many resources would otherwise sit idle while waiting for the next large job. Whether a job was scheduled via backfilling can be seen by running
scontrol show job <jobid>
...
... Scheduler=Backfill
...
Favoring larger jobs
We have configured the Slurm scheduler so that it slightly favors larger jobs, i.e. the setting in slurm.conf is as follows:
PriorityFavorSmall=No
With this setting, the job size factor is calculated by dividing the requested number of CPUs by the total number of CPUs in the system:
NCPUs / TotalCPUs = JobSizeFactor
For example, requesting all resources of the cluster would lead to
TotalCPUs / TotalCPUs = 1.0
while requesting only part of the cluster leads to
NCPUs / TotalCPUs < 1.0
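A short Python sketch of the resulting contribution, using a made-up TOTAL_CPUS value (not PALMA's real core count) and the JOBSIZE weight from the sprio -w output above:

# Illustration of the job size factor with PriorityFavorSmall=No.
TOTAL_CPUS = 18000                 # made-up placeholder, not PALMA's real core count
PRIORITY_WEIGHT_JOBSIZE = 10000    # weight shown by `sprio -w` above

def job_size_factor(requested_cpus):
    return requested_cpus / TOTAL_CPUS

print(job_size_factor(TOTAL_CPUS))                    # -> 1.0 (whole cluster)
print(PRIORITY_WEIGHT_JOBSIZE * job_size_factor(72))  # -> 40.0 extra priority points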