Filter old CrdsValues received via Pull Responses in Gossip #8150
Conversation
@sagar-solana I think that's fine if we allow old values when there is a newer value in the crds already.
core/src/crds_gossip_pull.rs
Outdated
let mut failed = 0;
for r in response {
    if now > r.wallclock() + self.msg_timeout || now + self.msg_timeout < r.wallclock() {
        inc_new_counter_error!("cluster_info-gossip_pull_response_value_timeout", 1);
should we have a different counter for older and newer timeouts?
Hm, a newer timeout is only going to help spot liars. I followed what the push message logic does; there it returns the same error for old and new.
I'm not sure I follow. Do you mean if the value is old, but we have a fresher local_timestamp for an identity, I can bypass the time check? I think that would work. We can just check whether the contact info exists and its local timestamp is not ancient.
@sagar-solana Basically, some new value in the crds should reset the lease. Nodes update their ContactInfo every gossip pull timeout / 2. Maybe that should be a bit faster with this change.
Got it. Thanks. I'll update the PR.
I don't want the gossip network to start dropping root votes during a long partition. So if a new ContactInfo is present, old values should be accepted.
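A minimal sketch of that idea, using simplified stand-ins for the real Crds types (the struct fields and the should_accept helper are hypothetical, not the PR's actual code): the wallclock timeout is only enforced when the local table holds no fresh ContactInfo for the value's owner.

use std::collections::HashMap;

// Simplified stand-ins for the real gossip types; purely illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Pubkey([u8; 32]);

struct CrdsValue {
    owner: Pubkey,
    wallclock: u64, // milliseconds
}

struct Crds {
    // Hypothetical: wallclock of the newest ContactInfo we hold, per owner.
    contact_info_wallclock: HashMap<Pubkey, u64>,
}

impl Crds {
    /// Accept `value` if its owner has a fresh ContactInfo locally (a new
    /// ContactInfo "resets the lease"); otherwise require its wallclock to be
    /// within `msg_timeout` of `now` in either direction.
    fn should_accept(&self, value: &CrdsValue, now: u64, msg_timeout: u64) -> bool {
        let owner_is_fresh = self
            .contact_info_wallclock
            .get(&value.owner)
            .map(|&wc| now.saturating_sub(wc) <= msg_timeout)
            .unwrap_or(false);
        owner_is_fresh
            || (value.wallclock <= now.saturating_add(msg_timeout)
                && now <= value.wallclock.saturating_add(msg_timeout))
    }
}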
core/src/crds_gossip_pull.rs
Outdated
let mut failed = 0;
for r in response {
    let owner = r.label().pubkey();
    if now > r.wallclock() + self.msg_timeout || now + self.msg_timeout < r.wallclock() {
A wallclock that is u64::MAX could cause an overflow here, I think.
Updated to use checked_add when adding to wallclock
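For reference, a hedged sketch of an overflow-safe version of that condition, assuming `now`, `wallclock`, and `timeout` are u64 milliseconds (the helper name is illustrative):

/// Overflow-safe variant of `now > wallclock + timeout || now + timeout < wallclock`.
/// A hostile wallclock of u64::MAX would overflow the naive addition; with
/// checked_add that case is treated as "too far in the future" instead.
fn is_timed_out(now: u64, wallclock: u64, timeout: u64) -> bool {
    let too_old = match wallclock.checked_add(timeout) {
        Some(limit) => now > limit,
        None => false, // wallclock + timeout overflows u64, so it cannot be too old
    };
    let too_new = match now.checked_add(timeout) {
        Some(limit) => limit < wallclock,
        None => false, // now + timeout overflows u64, so nothing can exceed it
    };
    too_old || too_new
}

With wallclock = u64::MAX, too_old is false and too_new is true for any realistic now, so the value is simply rejected rather than triggering an overflow.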
aeyakovenko left a comment:
where is the stake weighted timeout for contact infos?
Still working on it. I don't have the epoch schedule to look up the timeout, so I'm still figuring some of that out.
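A rough sketch of the stake-weighted direction being discussed here; the helper name, the stand-in pubkey type, and the epoch-length window are assumptions, not the final implementation:

use std::collections::HashMap;

/// Hypothetical: build per-node timeouts, giving any staked node a much longer
/// window (on the order of an epoch) and everyone else the default msg timeout.
fn make_timeouts(
    stakes: &HashMap<[u8; 32], u64>, // pubkey bytes -> lamports staked (stand-in type)
    default_timeout_ms: u64,
    epoch_duration_ms: u64,
) -> HashMap<[u8; 32], u64> {
    stakes
        .iter()
        .map(|(pubkey, stake)| {
            let timeout = if *stake > 0 {
                epoch_duration_ms
            } else {
                default_timeout_ms
            };
            (*pubkey, timeout)
        })
        .collect()
}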
Codecov Report

@@            Coverage Diff            @@
##            master    #8150    +/-   ##
=========================================
- Coverage     81.9%    81.8%   -0.1%
=========================================
  Files          248      248
  Lines        53533    53610     +77
=========================================
+ Hits         43851    43896     +45
- Misses        9682     9714     +32
We should probably verify this with a partition test on GCE that blocks packets for longer than the timeout. Maybe let it loop through partitions and recovery.
I've been testing to make sure it stops the issue described in #8141.
I'm running this PR on tds.solana.com now. I no longer see the multiple IP entries in gossip for tds.solana.com or the excessive gossip "I'm talking to myself" messages. RAM is still climbing and it looks like it's still going to OOM, but RAM appears to be climbing more slowly at the moment. This PR certainly appears to improve the situation.
Okay, that's some good news. Also, the "talking to myself" logs should not have been impacted by this: since the duplicate entries (with the same IP) had different pubkeys, it wouldn't detect "myself". The memory problem might be because t.s.c is still in a bunch of gossip tables in the cluster. We need some more "sinkholes" (upgraded gossip nodes) to drain these bogus values from the gossip network. It might also be a lot of incoming repair, although I'm not sure. We can land this PR. I was going to write a little test just to get some coverage, but that can follow if I don't get it done in time. You can hit that green button whenever.
We can wait for the test to land this. :)
* Add CrdsValue timeout checks on Pull Responses
* Allow older values to enter Crds as long as a ContactInfo exists
* Allow staked contact infos to be inserted into crds if they haven't expired
* Try and handle overflows
* Fix test
* Some comments
* Fix compile
* Fix test deadlock
* Add a test for processing timed out values received via pull response

(cherry picked from commit fa00803)
Problem
There are only two ways for a CrdsValue to enter our Gossip Crds Table:
1. Push Messages
2. Pull Responses

For 1, there is a time-based filter that prevents nodes from ingesting old values.
For 2, there is no filter. Pull Messages use a roundtrip request/response, and since all recently "purged" values are included in the request, the assumption is that as long as Crds does not already have a newer version of a given value, it will accept any incoming value, no matter how old.
Now, given enough versions of a value, it's possible for that value to never leave the gossip network, even if the node that generated the value has gone offline.
For example, suppose there are 10 values for some node V (V_1 through V_10). After that node goes offline, if the gossip network is large enough that it is not converging in time, some nodes may receive V_10 and purge it (after 15s, the default timeout), while another node that only has V_5 (and has not seen V_10 yet) sends V_5 back to the nodes that purged V_10. Since V_5 is not in their table or in their purged list, it looks like a "new" value and they accept it, cycling this old value in gossip forever.
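As a toy illustration of that mechanism (the set-based filter below is a simplified stand-in for the real bloom-filter-based pull request): the requester's filter only covers what it currently holds plus what it recently purged, so an older version it never saw, or purged long ago, looks brand new to it.

use std::collections::HashSet;

// Toy stand-in for the pull-request filter: in reality this is a bloom filter
// over value hashes, but a HashSet shows the same gap.
fn build_pull_request_filter(
    current_value_hashes: &[u64],
    recently_purged_hashes: &[u64],
) -> HashSet<u64> {
    current_value_hashes
        .iter()
        .chain(recently_purged_hashes.iter())
        .copied()
        .collect()
}

// The responder returns any value the requester does not appear to have,
// with no wallclock check: an ancient V_5 the requester never held (or
// purged long ago) passes straight through. This is the gap the PR closes.
fn responder_would_send(value_hash: u64, filter: &HashSet<u64>) -> bool {
    !filter.contains(&value_hash)
}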
Summary of Changes
Gossip will not ingest any value that is over 60s old for Pull Responses and over 30s old for Push Messages.
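A minimal sketch of that filtering step, with the window sizes written as literal constants for clarity (the PR itself works in terms of a configured msg_timeout, visible in the diff snippets above, and the function name here is illustrative):

/// Illustrative constants: pull responses get roughly twice the push window.
const PUSH_MSG_TIMEOUT_MS: u64 = 30_000;
const PULL_RESPONSE_TIMEOUT_MS: u64 = 2 * PUSH_MSG_TIMEOUT_MS;

/// Drop any pull-response wallclock that is more than the allowed window away
/// from `now` in either direction; the real code also bumps a metrics counter
/// for each dropped value.
fn filter_pull_response_wallclocks(wallclocks: Vec<u64>, now: u64) -> Vec<u64> {
    wallclocks
        .into_iter()
        .filter(|&wc| {
            wc <= now.saturating_add(PULL_RESPONSE_TIMEOUT_MS)
                && now <= wc.saturating_add(PULL_RESPONSE_TIMEOUT_MS)
        })
        .collect()
}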
@aeyakovenko, this partially undoes the epoch-long timeouts we added to crds values. For example, if a new validator joins a cluster with one inactive validator, everyone else will still have that inactive validator's contact info, but that value is too old to give to the new validator. We should probably get rid of the epoch-long timeouts?
Fixes #8141