Retry slurm-commands on error

I am working on a cluster where `squeue` and `sbatch` sometimes fail, for whatever reason.

```
Submitting 1080000 jobs in 1080 chunks using cluster functions 'Slurm' ...
Submitting [========>------------------------------------------]  17% eta: 11mError: Fatal error occurred: 101. Command 'sbatch' produced exit code 1. Output: 'sbatch: error: Invalid user for SlurmUser slurm, ignored
sbatch: fatal: Unable to process configuration file'
```
```
Submitting 15000000 jobs in 7500 chunks using cluster functions 'Slurm' ...
Submitting [===========================>-----------------------]  55% eta:  4hError: Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG,RS,SI,SO,ST --noheader --format=%i -r'
output:
squeue: error: Invalid user for SlurmUser slurm, ignored
squeue: fatal: Unable to process configuration file
```

These errors are transient: after resubmitting the jobs, everything continues as it should. It would be nice to have an option for this to happen automatically, so that one could let batchtools submit jobs overnight. My suggestion would be an option to retry Slurm commands X times with a pause of Y seconds in between (possibly with exponential backoff).
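To make the suggestion concrete, here is a rough sketch of the retry logic in Python. It is not batchtools code and does not reflect any existing batchtools option; the function name `run_with_retry` and its parameters `retries`, `pause`, and `backoff` are made up for illustration (`retries` is X, `pause` is Y). The `run` and `sleep` arguments exist only so the behaviour can be exercised without a real cluster.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, pause=5.0, backoff=2.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run `cmd`, retrying after a non-zero exit code.

    Hypothetical helper: `retries` is the number of extra attempts,
    `pause` the first wait in seconds, and `backoff` the factor by
    which the wait grows after each failed attempt.
    """
    delay = pause
    for attempt in range(retries + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < retries:
            # Transient failure: wait, then try again with a longer pause.
            sleep(delay)
            delay *= backoff
    # All attempts failed: report the last error, as a single call would.
    raise RuntimeError(
        f"Command {cmd[0]!r} failed after {retries + 1} attempts "
        f"(exit code {result.returncode}): {result.stderr.strip()}"
    )
```

With the defaults above, a failing `squeue` or `sbatch` call would be retried after 5, 10, and 20 seconds before the error is raised. With `backoff=1.0` the pause stays fixed at Y seconds.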

Retry slurm-commands on error #303
