-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
FailoverClient (with ReplicaOnly) times out on unavailable Sentinel and terminates search for a live one.
Expected Behavior
FailoverClient successfully finds an available Sentinel and establishes a connection.
Current Behavior
FailoverClient.replicaAddrs terminates immediately if it tries to connect to a Sentinel that is network-unavailable.
(see Detailed Description)
Possible Solution
Need to check outer context expiration instead of checking if the returned error matches context.DeadlineExceeded.
Steps to Reproduce
- Run several Redis-server+Sentinel instances locally (6379+26379 <- leader, 6380+26380, 6381+26381)
- Simulate network timeout for connection to one of them (block in firewall via packet drop, not reject)
blocking via pfctl (reproducing on mac os x)block drop in proto tcp from any to any port = 6381 block drop in proto tcp from any to any port = 26381
- Create FailoverClient and execute a command (Ping) (check main_sentinel.go)
- During the first command, discovery occurs, and if shuffle places the problematic sentinel first, the connection establishment will timeout and the command may fail with an error without trying to connect to other live Sentinels.
Log file containing multiple executions (main_sentinel.log
) is attached. In certain cases, the connection can still be established (see Detailed Description).
Detailed Description
Under the hood, the driver creates a network connection via net.Dialer
.
The DialContext
method in case of network timeout can return an error that can be interpreted as context.DeadlineExceeded
: golang/go#51428
For example, this happens when the Timeout expires during connection (this parameter is set by go-redis).
That is, Dialer.DialContext
can return an error (comparable to context.DeadlineExceeded
), not because the context passed from outside has expired, but due to an internal connection timeout.
The go-redis code, as far as I can see, expects that the context.DeadlineExceeded
error means the expiration of the context passed from outside.
Thus sentinelFailover.replicaAddrs
fails with context.DeadlineExceeded
error when encountering an unavailable Sentinel, even if the context passed to the command has no timeout set.
This breaks the failover mechanism's core purpose of finding available sentinels.
Check main_deal.go for reproduction. Again, packet drop needs to be configured on port 26381.
A log file from one of the runs is attached too (main_deal.log
). Interestingly, sometimes the function doesn't wrap i/o timeout
in context.DeadlineExceeded
, and replicaAddrs
proceeds to check the next Sentinel and ultimately finds a live one.
But according to the mentioned change in the standard library code, we should assume that it should behave exactly this way.
Possible Implementation
As I mentioned above, comparing the error returned by net.Dialer
with context.DeadlineExceeded
seems incorrect.
Perhaps instead, we could check whether the externally passed context has not yet expired.
diff --git a/sentinel.go b/sentinel.go
index 04c0f72..61a6f6b 100644
--- a/sentinel.go
+++ b/sentinel.go
@@ -833,7 +833,7 @@ func (c *sentinelFailover) replicaAddrs(ctx context.Context, useDisconnected boo
replicas, err := sentinel.Replicas(ctx, c.opt.MasterName).Result()
if err != nil {
_ = sentinel.Close()
- if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+ if ctx.Err() != nil {
return nil, err
}
internal.Logger.Printf(ctx, "sentinel: Replicas master=%q failed: %s",