
Provide a configuration option to enable a "fail fast" development mode #1274

@kkersten

Problem: the server can be configured in a way that causes an indefinitely hanging job
The current FLARE controller is designed to allow setting the minimum number of required clients along with a server timeout. When min_clients is set to the total number of available clients with server_timeout=0, a failed client will cause the server workflow to hang.
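To make the failure mode concrete, here is a toy model of the wait loop described above. It is not FLARE's actual controller code: `wait_for_min_clients`, `count_responses`, `poll_interval`, and `max_polls` are hypothetical names, and treating a timeout of 0 as "no deadline" is an assumption taken from the behavior reported in this issue.

```python
import time
from typing import Callable, Optional


def wait_for_min_clients(
    count_responses: Callable[[], int],
    min_clients: int,
    timeout: float,
    poll_interval: float = 0.01,
    max_polls: Optional[int] = None,
) -> bool:
    """Toy model of a server waiting for client responses.

    Assumption: timeout == 0 means "no deadline", as reported in this issue.
    max_polls exists only so the demo can stop instead of hanging for real.
    """
    start = time.monotonic()
    polls = 0
    while True:
        # Enough clients have responded: the round can proceed.
        if count_responses() >= min_clients:
            return True
        # A positive timeout gives the loop a way out.
        if timeout > 0 and time.monotonic() - start >= timeout:
            return False
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise RuntimeError("still waiting: this loop would never exit")
        time.sleep(poll_interval)


if __name__ == "__main__":
    # 3 clients in total, min_clients=3, one client has died: only 2 ever respond.
    try:
        wait_for_min_clients(lambda: 2, min_clients=3, timeout=0, max_polls=5)
    except RuntimeError as err:
        print(err)
```

With `min_clients` equal to the total number of clients, a single permanent failure means the first condition can never be met, and with no deadline the second condition never fires either.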

This behavior is useful in production, where the server workflow should be resilient to temporary interruptions in client communication and allow clients to fail temporarily and reconnect.

But when a client has failed and is unrecoverable, the server workflow should time out, independently of the controller workflow configuration. This would also enable a "development mode" in which any client failure causes the server workflow to terminate.

Potential solution
A separate server timeout configuration could be implemented independent of the controller configuration (for example in the server communication layer). This could be configured as a server job timeout, where

  • a timeout of 0 would trigger immediate failure (development mode)
  • a timeout of -1 (inf) would preserve the current behavior (production mode)
  • a positive timeout would fail the job after that period, depending on your level of patience
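The three cases above could be sketched as a single decision function. This is only an illustration of the proposed semantics, not an existing FLARE API: `Action`, `on_client_failure`, and `job_timeout` are hypothetical names, and measuring the timeout in seconds is an assumption.

```python
from enum import Enum


class Action(Enum):
    KEEP_WAITING = "keep_waiting"
    ABORT_JOB = "abort_job"


def on_client_failure(job_timeout: float, seconds_since_failure: float) -> Action:
    """Decide what the server should do while a client is down.

    job_timeout < 0  -> wait forever (current behavior, "production mode")
    job_timeout == 0 -> abort immediately ("development mode")
    job_timeout > 0  -> abort once the client has been down that long
    Units (seconds) are an assumption made for this sketch.
    """
    if job_timeout < 0:
        return Action.KEEP_WAITING
    if seconds_since_failure >= job_timeout:
        return Action.ABORT_JOB
    return Action.KEEP_WAITING


if __name__ == "__main__":
    print(on_client_failure(0, 0.0))    # development mode
    print(on_client_failure(-1, 3600))  # production mode
    print(on_client_failure(30, 10))    # still within the grace period
```

Keeping this check outside the controller would let it apply regardless of how `min_clients` and the controller timeout are set, which is the independence the proposal asks for.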
