Filter old CrdsValues received via Pull Responses in Gossip #8150
Conversation
@sagar-solana I think that's fine if we allow old values when there is a newer value in the crds already.
core/src/crds_gossip_pull.rs
Outdated
let mut failed = 0;
for r in response {
    if now > r.wallclock() + self.msg_timeout || now + self.msg_timeout < r.wallclock() {
        inc_new_counter_error!("cluster_info-gossip_pull_response_value_timeout", 1);
should we have a different counter for older and newer timeouts?
Hm, a newer timeout is only going to help spot liars. I followed what the push message logic does; there it returns the same error for old and new.
I'm not sure I follow. Do you mean if the value is old, but we have a fresher local_timestamp for an identity, I can bypass the time check? I think that would work. We can just check whether the contact info exists and its local timestamp is not ancient.
@sagar-solana Basically, some new value in the crds should reset the lease. Nodes update their ContactInfo every gossip pull timeout / 2. Maybe that should be a bit faster with this change.
Got it. Thanks. I'll update the PR.
I don't want the gossip network to start dropping root votes during a long partition. So if a new ContactInfo is present, old values should be accepted.
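A minimal sketch of that idea, using simplified stand-ins for the real Crds types (the struct fields and the should_accept helper are hypothetical, not the PR's actual code): the wallclock timeout is only enforced when the local table holds no fresh ContactInfo for the value's owner.

use std::collections::HashMap;

// Simplified stand-ins for the real gossip types; purely illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Pubkey([u8; 32]);

struct CrdsValue {
    owner: Pubkey,
    wallclock: u64, // milliseconds
}

struct Crds {
    // Hypothetical: wallclock of the newest ContactInfo we hold, per owner.
    contact_info_wallclock: HashMap<Pubkey, u64>,
}

impl Crds {
    /// Accept `value` if its owner has a fresh ContactInfo locally (a new
    /// ContactInfo "resets the lease"); otherwise require its wallclock to be
    /// within `msg_timeout` of `now` in either direction.
    fn should_accept(&self, value: &CrdsValue, now: u64, msg_timeout: u64) -> bool {
        let owner_is_fresh = self
            .contact_info_wallclock
            .get(&value.owner)
            .map(|&wc| now.saturating_sub(wc) <= msg_timeout)
            .unwrap_or(false);
        owner_is_fresh
            || (value.wallclock <= now.saturating_add(msg_timeout)
                && now <= value.wallclock.saturating_add(msg_timeout))
    }
}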
core/src/crds_gossip_pull.rs
Outdated
let mut failed = 0;
for r in response {
    let owner = r.label().pubkey();
    if now > r.wallclock() + self.msg_timeout || now + self.msg_timeout < r.wallclock() {
A wallclock that is u64::MAX could cause an overflow here, I think.
Updated to use checked_add when adding to wallclock
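For reference, a hedged sketch of an overflow-safe version of that condition, assuming `now`, `wallclock`, and `timeout` are u64 milliseconds (the helper name is illustrative):

/// Overflow-safe variant of `now > wallclock + timeout || now + timeout < wallclock`.
/// A hostile wallclock of u64::MAX would overflow the naive addition; with
/// checked_add that case is treated as "too far in the future" instead.
fn is_timed_out(now: u64, wallclock: u64, timeout: u64) -> bool {
    let too_old = match wallclock.checked_add(timeout) {
        Some(limit) => now > limit,
        None => false, // wallclock + timeout overflows u64, so it cannot be too old
    };
    let too_new = match now.checked_add(timeout) {
        Some(limit) => limit < wallclock,
        None => false, // now + timeout overflows u64, so nothing can exceed it
    };
    too_old || too_new
}

With wallclock = u64::MAX, too_old is false and too_new is true for any realistic now, so the value is simply rejected rather than triggering an overflow.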
aeyakovenko left a comment:
where is the stake weighted timeout for contact infos?
Still working on it. I don't have the epoch schedule to look up the timeout, so I'm still figuring some of that out.
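A rough sketch of the stake-weighted direction being discussed here; the helper name, the stand-in pubkey type, and the epoch-length window are assumptions, not the final implementation:

use std::collections::HashMap;

/// Hypothetical: build per-node timeouts, giving any staked node a much longer
/// window (on the order of an epoch) and everyone else the default msg timeout.
fn make_timeouts(
    stakes: &HashMap<[u8; 32], u64>, // pubkey bytes -> lamports staked (stand-in type)
    default_timeout_ms: u64,
    epoch_duration_ms: u64,
) -> HashMap<[u8; 32], u64> {
    stakes
        .iter()
        .map(|(pubkey, stake)| {
            let timeout = if *stake > 0 {
                epoch_duration_ms
            } else {
                default_timeout_ms
            };
            (*pubkey, timeout)
        })
        .collect()
}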
Codecov Report

@@            Coverage Diff            @@
##            master    #8150    +/-   ##
=========================================
- Coverage     81.9%    81.8%   -0.1%
=========================================
  Files          248      248
  Lines        53533    53610     +77
=========================================
+ Hits         43851    43896     +45
- Misses        9682     9714     +32
We should probably verify this with a partition test on GCE that blocks packets for longer than the timeout. Maybe let it loop through partitions and recovery.
I've been testing to make sure it stops the issue described in #8141.
I'm running this PR on tds.solana.com now. I no longer see the multiple IP entries in gossip for tds.solana.com or the excessive gossip "I'm talking to myself" messages. RAM is still climbing and it looks like it's still going to OOM, but RAM appears to be climbing more slowly at the moment. This PR certainly appears to improve the situation.
Okay, that's some good news. Also, the "talking to myself" logs should not have been impacted by this: since the duplicate entries (with the same IP) had different pubkeys, it wouldn't detect "myself". The memory problem might be because t.s.c is still in a bunch of gossip tables in the cluster. We need some more "sinkholes" (upgraded gossip nodes) to drain these bogus values from the gossip network. It might also be a lot of incoming repair, although I'm not sure. We can land this PR. I was going to write a little test just to get some coverage, but that can follow if I don't get it done in time. You can hit that green button whenever.
We can wait for the test to land this. :)
* Add CrdsValue timeout checks on Pull Responses
* Allow older values to enter Crds as long as a ContactInfo exists
* Allow staked contact infos to be inserted into crds if they haven't expired
* Try and handle overflows
* Fix test
* Some comments
* Fix compile
* Fix test deadlock
* Add a test for processing timed out values received via pull response

(cherry picked from commit fa00803)
Problem
There are only two ways for a CrdsValue to enter our Gossip Crds Table:
1. Push Messages
2. Pull Responses

For 1, there is a time-based filter that prevents nodes from ingesting old values.
For 2, there is no filter. Pull Messages use a roundtrip request/response, and since all recently "purged" values are included in the request, the assumption is that as long as Crds does not already have a newer version of a given value, it will accept any incoming value, no matter how old.
Now, given enough versions of a value, it's possible for that value to never leave the gossip network, even if the node that generated the value has gone offline.
For example, suppose there are 10 values for some node V (V_1 through V_10). After that node goes offline, if the gossip network is large enough that it is not converging in time, some nodes may receive V_10 and purge it (after 15s, the default timeout), while another node that only has V_5 (and has not seen V_10 yet) sends V_5 back to the nodes that purged V_10. Since V_5 is not in their table or in their purged list, it looks like a "new" value and they accept it, cycling this old value in gossip forever.
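As a toy illustration of that mechanism (the set-based filter below is a simplified stand-in for the real bloom-filter-based pull request): the requester's filter only covers what it currently holds plus what it recently purged, so an older version it never saw, or purged long ago, looks brand new to it.

use std::collections::HashSet;

// Toy stand-in for the pull-request filter: in reality this is a bloom filter
// over value hashes, but a HashSet shows the same gap.
fn build_pull_request_filter(
    current_value_hashes: &[u64],
    recently_purged_hashes: &[u64],
) -> HashSet<u64> {
    current_value_hashes
        .iter()
        .chain(recently_purged_hashes.iter())
        .copied()
        .collect()
}

// The responder returns any value the requester does not appear to have,
// with no wallclock check: an ancient V_5 the requester never held (or
// purged long ago) passes straight through. This is the gap the PR closes.
fn responder_would_send(value_hash: u64, filter: &HashSet<u64>) -> bool {
    !filter.contains(&value_hash)
}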
Summary of Changes
Gossip will not ingest any value that is over 60s old for Pull Responses and over 30s old for Push Messages.
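A minimal sketch of that filtering step, with the window sizes written as literal constants for clarity (the PR itself works in terms of a configured msg_timeout, visible in the diff snippets above, and the function name here is illustrative):

/// Illustrative constants: pull responses get roughly twice the push window.
const PUSH_MSG_TIMEOUT_MS: u64 = 30_000;
const PULL_RESPONSE_TIMEOUT_MS: u64 = 2 * PUSH_MSG_TIMEOUT_MS;

/// Drop any pull-response wallclock that is more than the allowed window away
/// from `now` in either direction; the real code also bumps a metrics counter
/// for each dropped value.
fn filter_pull_response_wallclocks(wallclocks: Vec<u64>, now: u64) -> Vec<u64> {
    wallclocks
        .into_iter()
        .filter(|&wc| {
            wc <= now.saturating_add(PULL_RESPONSE_TIMEOUT_MS)
                && now <= wc.saturating_add(PULL_RESPONSE_TIMEOUT_MS)
        })
        .collect()
}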
@aeyakovenko, this partially undoes the epoch-long timeouts we added to crds values. For example, if a new validator joins a cluster with one inactive validator, everyone else will still have that inactive validator's contact info, but that value is too old to give to the new validator. We should probably get rid of the epoch-long timeouts?
Fixes #8141