Retry slurm-commands on error #303

Open
@mb706

I am working on a cluster where squeue and sbatch sometimes fail, for whatever reason.

Submitting 1080000 jobs in 1080 chunks using cluster functions 'Slurm' ...
Submitting [========>------------------------------------------]  17% eta: 11mError: Fatal error occurred: 101. Command 'sbatch' produced exit code 1. Output: 'sbatch: error: Invalid user for SlurmUser slurm, ignored
sbatch: fatal: Unable to process configuration file'
Submitting 15000000 jobs in 7500 chunks using cluster functions 'Slurm' ...
Submitting [===========================>-----------------------]  55% eta:  4hError: Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG,RS,SI,SO,ST --noheader --format=%i -r'
output:
squeue: error: Invalid user for SlurmUser slurm, ignored
squeue: fatal: Unable to process configuration file

These errors are transient: after resubmitting the jobs, everything continues as it should. It would be nice to have an option for this to happen automatically, since then one could let batchtools submit jobs overnight. My suggestion would be an option to retry Slurm commands X times with a pause of Y seconds in between (possibly with exponential backoff).
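
To make the suggestion concrete, here is a minimal sketch of the retry logic in Python. It is not batchtools code and does not reflect any existing batchtools option; the function name `run_with_retry` and its parameters `retries` (X), `pause` (Y), and `backoff` are hypothetical names chosen for illustration.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, pause=5.0, backoff=2.0, sleep=time.sleep):
    """Run `cmd` (a list of arguments) and retry it on a non-zero exit code.

    Hypothetical helper, not part of batchtools or Slurm.
    retries: number of additional attempts after the first failure (X).
    pause:   seconds to wait before the first retry (Y).
    backoff: factor applied to the wait after each failed retry;
             1.0 gives a constant pause, 2.0 doubles it each time.
    sleep:   injectable so the waiting can be replaced in tests.
    """
    delay = pause
    for attempt in range(retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < retries:
            # Transient failure assumed: wait, then try again with a longer pause.
            sleep(delay)
            delay *= backoff
    # All attempts failed; hand the last result back so the caller can report it.
    return result
```

A caller would wrap each `sbatch` or `squeue` invocation in such a helper and raise the existing error only if the returned exit code is still non-zero after the last attempt.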
