FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one


FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one.

## Expected Behavior

FailoverClient successfully finds an available Sentinel and establishes a connection.

## Current Behavior

FailoverClient.replicaAddrs terminates immediately if it tries to connect to a Sentinel that is network-unavailable.
(see Detailed Description)

## Possible Solution

Need to check outer context expiration instead of checking if the returned error matches context.DeadlineExceeded.

## Steps to Reproduce

1. Run several Redis-server+Sentinel instances locally (6379+26379 <- leader, 6380+26380, 6381+26381)
2. Simulate network timeout for connection to one of them (block in firewall via packet drop, not reject)
   blocking via pfctl (reproducing on mac os x)
   ```
   block drop in proto tcp from any to any port = 6381
   block drop in proto tcp from any to any port = 26381
   ```
3. Create FailoverClient and execute a command (Ping) (check [main_sentinel.go](https://gist.github.com/perekalov/8f710337a3dd6ba8873886c7198c610a))
4. During the first command, discovery occurs, and if [shuffle](https://github.com/redis/go-redis/blob/master/sentinel.go#L407) places the problematic sentinel first, the connection establishment will timeout and the command may fail with an error without trying to connect to other live Sentinels.

Log file containing multiple executions (`main_sentinel.log`) is attached. In certain cases, the connection can still be established (see Detailed Description).

## Detailed Description

Under the hood, the driver creates a network connection via `net.Dialer`.
The `DialContext` method in case of network timeout can return an error that can be interpreted as `context.DeadlineExceeded`: https://github.com/golang/go/issues/51428
For example, this happens when the Timeout expires during connection (this parameter is [set](https://github.com/redis/go-redis/blob/master/sentinel.go#L467) by go-redis).

That is, `Dialer.DialContext` can return an error (comparable to `context.DeadlineExceeded`), not because the context passed from outside has expired, but due to an internal connection timeout.
The go-redis code, as far as I can see, expects that the `context.DeadlineExceeded` error means the expiration of the context passed from outside.

Thus `sentinelFailover.replicaAddrs` fails with `context.DeadlineExceeded` error when encountering an unavailable Sentinel, even if the context passed to the command has no timeout set.
This breaks the failover mechanism's core purpose of finding available sentinels.

Check [main_deal.go](https://gist.github.com/perekalov/bc90689409952dcc7af702415ee5af9e) for reproduction. Again, packet drop needs to be configured on port 26381.
A log file from one of the runs is attached too (`main_deal.log`). Interestingly, _sometimes_ the function doesn't wrap `i/o timeout` in `context.DeadlineExceeded`, and `replicaAddrs` proceeds to check the next Sentinel and ultimately finds a live one.
But according to the mentioned change in the standard library code, we should assume that it should behave exactly this way.

## Possible Implementation

As I mentioned above, comparing the error returned by `net.Dialer` with `context.DeadlineExceeded` seems incorrect.
Perhaps instead, we could check whether the externally passed context has not yet expired.
```
diff --git a/sentinel.go b/sentinel.go
index 04c0f72..61a6f6b 100644
--- a/sentinel.go
+++ b/sentinel.go
@@ -833,7 +833,7 @@ func (c *sentinelFailover) replicaAddrs(ctx context.Context, useDisconnected boo
                replicas, err := sentinel.Replicas(ctx, c.opt.MasterName).Result()
                if err != nil {
                        _ = sentinel.Close()
-                      if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+                     if ctx.Err() != nil {
                                return nil, err
                        }
                        internal.Logger.Printf(ctx, "sentinel: Replicas master=%q failed: %s",
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one #3439

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Detailed Description

Possible Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one #3439

Description

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Detailed Description

Possible Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions