这是indexloc提供的服务,不要输入任何密码
Skip to content

FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one #3439

@perekalov

Description

@perekalov

FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one.

Expected Behavior

FailoverClient successfully finds an available Sentinel and establishes a connection.

Current Behavior

FailoverClient.replicaAddrs terminates immediately if it tries to connect to a Sentinel that is network-unavailable.
(see Detailed Description)

Possible Solution

Need to check outer context expiration instead of checking if the returned error matches context.DeadlineExceeded.

Steps to Reproduce

  1. Run several Redis-server+Sentinel instances locally (6379+26379 <- leader, 6380+26380, 6381+26381)
  2. Simulate network timeout for connection to one of them (block in firewall via packet drop, not reject)
    blocking via pfctl (reproducing on mac os x)
    block drop in proto tcp from any to any port = 6381
    block drop in proto tcp from any to any port = 26381
    
  3. Create FailoverClient and execute a command (Ping) (check main_sentinel.go)
  4. During the first command, discovery occurs, and if shuffle places the problematic sentinel first, the connection establishment will timeout and the command may fail with an error without trying to connect to other live Sentinels.

Log file containing multiple executions (main_sentinel.log) is attached. In certain cases, the connection can still be established (see Detailed Description).

Detailed Description

Under the hood, the driver creates a network connection via net.Dialer.
The DialContext method in case of network timeout can return an error that can be interpreted as context.DeadlineExceeded: golang/go#51428
For example, this happens when the Timeout expires during connection (this parameter is set by go-redis).

That is, Dialer.DialContext can return an error (comparable to context.DeadlineExceeded), not because the context passed from outside has expired, but due to an internal connection timeout.
The go-redis code, as far as I can see, expects that the context.DeadlineExceeded error means the expiration of the context passed from outside.

Thus sentinelFailover.replicaAddrs fails with context.DeadlineExceeded error when encountering an unavailable Sentinel, even if the context passed to the command has no timeout set.
This breaks the failover mechanism's core purpose of finding available sentinels.

Check main_deal.go for reproduction. Again, packet drop needs to be configured on port 26381.
A log file from one of the runs is attached too (main_deal.log). Interestingly, sometimes the function doesn't wrap i/o timeout in context.DeadlineExceeded, and replicaAddrs proceeds to check the next Sentinel and ultimately finds a live one.
But according to the mentioned change in the standard library code, we should assume that it should behave exactly this way.

Possible Implementation

As I mentioned above, comparing the error returned by net.Dialer with context.DeadlineExceeded seems incorrect.
Perhaps instead, we could check whether the externally passed context has not yet expired.

diff --git a/sentinel.go b/sentinel.go
index 04c0f72..61a6f6b 100644
--- a/sentinel.go
+++ b/sentinel.go
@@ -833,7 +833,7 @@ func (c *sentinelFailover) replicaAddrs(ctx context.Context, useDisconnected boo
                replicas, err := sentinel.Replicas(ctx, c.opt.MasterName).Result()
                if err != nil {
                        _ = sentinel.Close()
-                      if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+                     if ctx.Err() != nil {
                                return nil, err
                        }
                        internal.Logger.Printf(ctx, "sentinel: Replicas master=%q failed: %s",

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions