View on GitHub

SlurmDagman

Run the jobs in a DAG with Slurm.

Short user documentation

Submit a DAG

To submit a DAG to Slurm use the slurm_submit_dag command:

slurm_submit_dag <dag-file>

Check the options of the command with slurm_submit_dag --help.

This will run a slurm_dagman process that will parse the DAG file and submit and monitor the Slurm jobs. The slurm_dagman process will log its activity into an out file which by default has the same name of the dag file plus the extension “.slurm_dagman.out”.

DAG file syntax

The syntax of the DAG file is similar to the one used in HTCondor DAGMan. The following four constructs are recognized by SlurmDagman in a DAG file:

JOB dag-node slurm-submission-file
VARS dag-node var-name=var-value [...]
RETRY dag-node max-retries
PARENT parent-dag-node[,...] CHILD child-dag-node[,...]

Configuration of an individual slurm_dagman process

Each slurm_dagman process writes a process configuration file in the same directory and with the same name as the DAG file with an additional extension “.slurm_dagman.cfg”. The process configuration file is the way of communicating with the slurm_dagman process. This file is read by the slurm_dagman process at the end of each iteration, updating its internal working parameters.

The following parameters are configurable via the slurm_dagman process configuration file. Each parameter has a hard-coded default value in the SlurmDagman package. The default values can be overridden in the package configuration file (except for drain and cancel, where only False makes sense as a default value). They can also be passed specific values for each slurm_dagman process as options in the slurm_submit_dag command.

Cancel a running DAG

A running slurm_dagman process can be cancelled by setting cancel = True in the process configuration file or by running the slurm_cancel_dag command. The slurm_cancel_dag command will only work if there is only one slurm_dagman process running owned by the current user. The slurm_cancel_dag sends a SIGTERM signal to the slurm_dagman process by means of a kill -9 <slurm_dagman-pid>. SIGTERM signals are trapped by slurm_dagman, which then terminates the DAG (cancels queued jobs) and writes a rescue DAG file. Equivalently, a user can run a kill -9 <slurm_dagman-pid> by hand if the PID of the slurm_dagman process is known (each time a slurm_dagman process starts, it will log its PID into the .slurm_dagman.out file).

Translate a HTCondor DAG file and submission files to Slurm

Using the condor_dag_to_slurm_dag command one can translate a DAG file that is intended to be used with HTCondor DAGMan together with a set of HTCondor submission files into a corresponding set of Slurm submission files and a dag file to be used by SlurmDagman:

condor_dag_to_slurm_dag <condor-dag-file>

Check the options of the command with condor_dag_to_slurm_dag --help. For example, one can specify that the job payload should be run with singularity in the Slurm job.

Note: Only some basic variables available in a HTCondor submission file are translated. The rest is ignored and not translated. Thus, the user should better check if the translation is correct.