-
Notifications
You must be signed in to change notification settings - Fork 33
Improve logging of rebalancer and recovery #586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Improve logging of rebalancer and recovery #586
Conversation
6b1057d
to
64cc837
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the comments for the first two commits, more comments are coming later) Thank you for working on this, good logging is crucial and allows us to investigate, what happened during incidents
5a8b3f8
to
f5c25f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, shit. I forgot to send the last message of review, I'm very sorry
04c506f
to
ccff54f
Compare
46add65
to
a1c095b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the final comments I have, the patch is pretty clean now)
Before this patch "Finish bucket recovery step ..." logs were printed at the end of recovery even if no buckets were successfully recovered. It led to unnecessary log records. This patch fixes the issue by adding an additional check for the number of recovered buckets. Part of tarantool#212 NO_DOC=bugfix
871197d
to
489b425
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final nits
489b425
to
fce8f28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but this one is critical, overlooked it somehow on the prev iterations(
This patch introduces logging of buckets' ids which were recovered during recovery stage of storage. Part of tarantool#212 NO_DOC=bugfix
This patch adds rebalancer routes' logging. The log file now includes information about the source storage, the number of buckets, and the destination storage where the buckets will be moved. Since the rebalancer service has changed logging of routes that were sent, we change the `rebalancer/rebalancer.test.lua` and `rebalancer/stress_add_remove_several_rs.test.lua` tests. Part of tarantool#212 NO_DOC=bugfix
Before this patch the function `rebalancer_download_states` didn't return information about replicaset from which the states could not be downloaded. As a result, the log "Some buckets are not active ..." lacks of valuable information about unhealthy replicaset. Now, we return `(replicaset.id, nil)` instead of `nil` in case when rebalancer can't download state from this replicaset. Also we add replicaset.id in "Some buckets are not active ..." log. Also we change `rebalancer/rebalancer.test.lua` test which expected the old "Some buckets are not active" log without replicaset.id. Closes tarantool#212 NO_DOC=bugfix
fce8f28
to
dcfa98a
Compare
local hanged_bucket_id_2 = vtest.storage_first_bucket(g.replica_1_a) | ||
start_bucket_move(g.replica_1_a, g.replica_2_a, hanged_bucket_id_2) | ||
t.helpers.retrying({}, function() | ||
t.assert(g.replica_1_a:grep_log('Error during recovery of bucket 1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it's better to wakeup the recovery
there in the loop. And in other places, please. E.g. in the test_rebalancer_routes_logging
rebalancer should be waken up, when waiting for the Balanced ok
{timeout = consts.REBALANCER_GET_STATE_TIMEOUT}) | ||
if state == nil then | ||
return | ||
return nil, replicaset.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be carefully rebased, now error is returned from the rebalancer_download_states
. I'd propose to introduce the new error code, which will say, that the storage doesn't have all bucket in proper state, and set err
there if it's nil (it'll happen, in case the function really returned nil).
In that err (or any other there) we'll just set the err.replicaset
, smth like that:
Line 1653 in b4adaea
err.replicaset = rs_id |
And we'll print the whole err, not just the replicaset's id
Before this patch "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, it led
to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.
Closes #212
NO_DOC=bugfix