I am working on a cluster where `squeue` and `sbatch` sometimes fail, for whatever reason.
```
Submitting 1080000 jobs in 1080 chunks using cluster functions 'Slurm' ...
Submitting [========>------------------------------------------] 17% eta: 11m
Error: Fatal error occurred: 101. Command 'sbatch' produced exit code 1. Output: 'sbatch: error: Invalid user for SlurmUser slurm, ignored
sbatch: fatal: Unable to process configuration file'
```
```
Submitting 15000000 jobs in 7500 chunks using cluster functions 'Slurm' ...
Submitting [===========================>-----------------------] 55% eta: 4h
Error: Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG,RS,SI,SO,ST --noheader --format=%i -r'
output:
squeue: error: Invalid user for SlurmUser slurm, ignored
squeue: fatal: Unable to process configuration file
```
These errors are transient; after resubmitting the jobs, everything continues as it should. It would be nice to have an option to retry failed Slurm commands automatically, since then one could let batchtools submit jobs overnight. My suggestion would be an option to retry Slurm commands X times with Y seconds of pause in between (possibly with exponential backoff). A rough sketch of what I mean is below.
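Something along these lines, as a minimal sketch in plain R (not the actual batchtools internals; `run_with_retry`, `max_tries`, and `pause` are hypothetical names):

```r
# Hypothetical helper, not part of batchtools: run a shell command and
# retry on non-zero exit, doubling the pause after each failure.
run_with_retry = function(cmd, args = character(0L),
                          max_tries = 5L, pause = 5) {
  for (i in seq_len(max_tries)) {
    res = suppressWarnings(system2(cmd, args, stdout = TRUE, stderr = TRUE))
    status = attr(res, "status")
    if (is.null(status) || status == 0L)
      return(res)        # success: captured output lines
    if (i < max_tries) {
      Sys.sleep(pause)   # pause before the next attempt
      pause = pause * 2  # exponential backoff
    }
  }
  stop(sprintf("Command '%s' still failing after %d tries:\n%s",
               cmd, max_tries, paste(res, collapse = "\n")))
}

# Example: the squeue call from above, with $USER expanded in R
# run_with_retry("squeue", c(sprintf("--user=%s", Sys.getenv("USER")),
#                            "--states=R,S,CG,RS,SI,SO,ST",
#                            "--noheader", "--format=%i", "-r"))
```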