-
-
Notifications
You must be signed in to change notification settings - Fork 414
feat: track number of peers per data column subnet topic #8442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @twoeths, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the network's gossipsub layer by implementing peer tracking for data column subnets, which is crucial for future network upgrades. It also standardizes how network peers and topics are categorized by transitioning from fork names to more granular fork boundaries, ensuring more accurate and future-proof metric collection. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces tracking for the number of peers per data column subnet topic, a feature needed for the fulu fork. It also refactors the existing peer and topic tracking to use fork boundaries instead of fork names for more precise metrics. The changes look good and are consistent with the stated goals. I've added one suggestion to improve maintainability by reducing code duplication in the metrics collection logic.
| for (const [boundary, peersByBeaconAttSubnet] of peersByBeaconAttSubnetByBoundary.map) { | ||
| for (let subnet = 0; subnet < ATTESTATION_SUBNET_COUNT; subnet++) { | ||
| metricsGossip.peersByBeaconAttestationSubnet.set( | ||
| {fork, subnet: attSubnetLabel(subnet)}, | ||
| {boundary, subnet: attSubnetLabel(subnet)}, | ||
| peersByBeaconAttSubnet[subnet] ?? 0 | ||
| ); | ||
| } | ||
| } | ||
| for (const [fork, peersByBeaconSyncSubnet] of peersByBeaconSyncSubnetByFork.map) { | ||
| for (const [boundary, peersByBeaconSyncSubnet] of peersByBeaconSyncSubnetByBoundary.map) { | ||
| for (let subnet = 0; subnet < SYNC_COMMITTEE_SUBNET_COUNT; subnet++) { | ||
| // SYNC_COMMITTEE_SUBNET_COUNT is < 9, no need to prepend a 0 to the label | ||
| metricsGossip.peersBySyncCommitteeSubnet.set({fork, subnet}, peersByBeaconSyncSubnet[subnet] ?? 0); | ||
| metricsGossip.peersBySyncCommitteeSubnet.set({boundary, subnet}, peersByBeaconSyncSubnet[subnet] ?? 0); | ||
| } | ||
| } | ||
| for (const [boundary, peersByDataColumnSubnet] of peersByDataColumnSubnetByBoundary.map) { | ||
| for (let subnet = 0; subnet < NUMBER_OF_COLUMNS; subnet++) { | ||
| metricsGossip.peersByDataColumnSubnet.set({boundary, subnet}, peersByDataColumnSubnet[subnet] ?? 0); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some code duplication in these loops for setting subnet-based metrics. To improve maintainability, you could extract this logic into a helper function. This function could take the map of peers, the metrics gauge, the subnet count, and an optional subnet label formatter as arguments.
For example:
function setSubnetMetrics(
peersBySubnetByBoundary: Map<ForkBoundaryLabel, number[]>,
metricsGauge: {set: (labels: {boundary: ForkBoundaryLabel; subnet: string | number}, value: number) => void},
subnetCount: number,
subnetLabelFn: (subnet: number) => string | number = (s) => s
) {
for (const [boundary, peersBySubnet] of peersBySubnetByBoundary.map) {
for (let subnet = 0; subnet < subnetCount; subnet++) {
metricsGauge.set(
{boundary, subnet: subnetLabelFn(subnet)},
peersBySubnet[subnet] ?? 0
);
}
}
}Then you could replace the loops with calls to this helper:
setSubnetMetrics(
peersByBeaconAttSubnetByBoundary,
metricsGossip.peersByBeaconAttestationSubnet,
ATTESTATION_SUBNET_COUNT,
attSubnetLabel
);
setSubnetMetrics(
peersByBeaconSyncSubnetByBoundary,
metricsGossip.peersBySyncCommitteeSubnet,
SYNC_COMMITTEE_SUBNET_COUNT
);
setSubnetMetrics(
peersByDataColumnSubnetByBoundary,
metricsGossip.peersByDataColumnSubnet,
NUMBER_OF_COLUMNS
);
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## unstable #8442 +/- ##
============================================
- Coverage 52.24% 52.23% -0.02%
============================================
Files 853 853
Lines 64770 64797 +27
Branches 4766 4767 +1
============================================
+ Hits 33841 33845 +4
- Misses 30859 30882 +23
Partials 70 70 🚀 New features to boost your workflow:
|
Performance Report🚀🚀 Significant benchmark improvement detected
Full benchmark results
|
| peersByDataColumnSubnet: register.gauge<{subnet: SubnetID; boundary: ForkBoundaryLabel}>({ | ||
| name: "lodestar_gossip_mesh_peers_by_data_column_subnet_count", | ||
| help: "Number of connected mesh peers per data column subnet", | ||
| labelNames: ["subnet", "boundary"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this metric can be pretty expensive in testnets if BPOs are close to each other, eg. if we have fulu + 5 BPOs this results in 768 metric entries, not a blocker but just need to keep that in mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to add more on that, if someone come up with a config like BPO at epoch n, n+1, n+2, n+3, n+4 then I think we have bigger issues to deal with than the metric
sample config from fusaka-devnet-5
BLOB_SCHEDULE:
- EPOCH: 512
MAX_BLOBS_PER_BLOCK: 15
- EPOCH: 768
MAX_BLOBS_PER_BLOCK: 21
- EPOCH: 1024
MAX_BLOBS_PER_BLOCK: 33
- EPOCH: 1280
MAX_BLOBS_PER_BLOCK: 48
- EPOCH: 1536
MAX_BLOBS_PER_BLOCK: 72
we only subscribe to +- 2 epochs at fork boundary FORK_EPOCH_LOOKAHEAD = 2 so most of the time we only get metric for 1 fork boundary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the node would still serve them via /metrics until next restart and based on prometheus retention it would be still stored there
I wonder if we even need to fork boundary label or can just assess the metrics based on time, you would probably see a drop at each fork boundary and then it should go up again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only do that if there is a clear performance issue
seeing different peer counts at different fork boundary give us confidence on the implementation of lodestar nodes, and help us investigate issues just in case
we did not even get through any fork boundaries on mainnet, can revisit later if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with @nflaig that we will just see the boundary in the metric dislocation. the forks will never overlap so we will never have two lines at the same time needing to be differentiated by different labels. they will always be mutually exclusive and not sure its worth all the extra buckets/data-points for the label that is not "technically differentiating anything" because there would never be boundary overlap. There will just be a metric dislocation. I guess I also see what @twoeths is saying, that if the metric stays the same across a boundary and there is no color change on the dashboard we will not see the change (if there should be one) but not sure that is compelling enough to offset the hundreds of "0" data points we will collect to support the second label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I'm fine with either but just wanted to bring that up. What do you think @wemeetagain ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a cheaper way to get the information we're looking for? Imo the primary piece of information / questions we're answering are:
- how many subnets have 0 mesh peers?
- what is the distribution of mesh peers for the subnets?
I don't think we really need to know that eg subnet 13 has 3 mesh peers. Its not actionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I did not notice we have a concern before merging the PR
imo the performance concern is common for attnets/syncnets as well, so we should resolve all if someone can prove we have a performance issue with it
I don't think we really need to know that eg subnet 13 has 3 mesh peers. Its not actionable.
if we miss data column of subnet 13 of a block, block is not imported and it's not easy to find which missing DataColumnSidecars just looking into the gossip log. Then this metric should help
also we already have discovery metrics, for example lodestar_discovery_custody_groups_to_connect
these metrics are addition to that metrics when we need investigation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created this issue to track #8454
|
🎉 This PR is included in v1.35.0 🎉 |
Motivation
Description