replay: extend last fec set check for 32+ retransmitter signed shreds#2101
replay: extend last fec set check for 32+ retransmitter signed shreds#2101AshwinSekar merged 11 commits intoanza-xyz:masterfrom
Conversation
9b3728d to
ef2185c
Compare
| "incomplete_final_fec_set", | ||
| ("slot", slot, i64), | ||
| ("hash", hash.to_string(), String) | ||
| ); | ||
| } | ||
| if !self.is_retransmitter_signed { | ||
| datapoint_warn!( | ||
| "invalid_retransmitter_signature_final_fec_set", | ||
| ("slot", slot, i64), | ||
| ("hash", hash.to_string(), String) |
There was a problem hiding this comment.
Apparently these hash things are pretty expensive for metrics server because they do not compress.
And until chained merkle shreds are rolled out every node is going to submit this metric for every slot which is too much.
There was a problem hiding this comment.
Sounds good i've gated this with should_chain_merkle_shreds so that we don't wastefully submit this metric prematurely.
There was a problem hiding this comment.
But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.
I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.
So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.
There was a problem hiding this comment.
i thought the should_chain_merkle_shreds function is the one that controls the chained merkle rollout?
As in, you will slowly update this function to target different cluster types and increasing slot ranges until it is always true.
By gating this metric behind should_chain_merkle_shreds we can verify if it is performing correctly on chained merkle shred slots only, and once should_chain_merkle_shreds is always true and the metric looks correct we can turn on feature_set::vote_only_retransmitter_signed_fec_sets
There was a problem hiding this comment.
I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.
What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead? since I assume we will activate that feature flag first (after chained merkle roots are fully rolled out) but before turning on this one.
If not, I can do as you suggest and remove the metric entirely and add it back before we intend to activate the feature.
There was a problem hiding this comment.
I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.
yeah, that is the issue with using should_chain_merkle_shreds.
What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead?
I think the code is over crowded with feature gate branches already, and I would lean to keep it simple and not to add more branches.
ef2185c to
509cdb9
Compare
| } | ||
| } | ||
|
|
||
| pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool { |
There was a problem hiding this comment.
moved from standard_broadcast_run in order to resolve dependencies.
There was a problem hiding this comment.
just a note to move this back if the metric gate on should_chain_merkle_shreds is removed.
| "incomplete_final_fec_set", | ||
| ("slot", slot, i64), | ||
| ("hash", hash.to_string(), String) | ||
| ); | ||
| } | ||
| if !self.is_retransmitter_signed { | ||
| datapoint_warn!( | ||
| "invalid_retransmitter_signature_final_fec_set", | ||
| ("slot", slot, i64), | ||
| ("hash", hash.to_string(), String) |
There was a problem hiding this comment.
But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.
I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.
So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.
| // These metrics are expensive to send because hash does not compress well. | ||
| // Only send these metrics when we are sure the appropriate shred format is being sent | ||
| if !self.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) { | ||
| datapoint_warn!( | ||
| "invalid_retransmitter_signature_final_fec_set", | ||
| ("slot", slot, i64), | ||
| ("bank_hash", bank_hash.to_string(), String) | ||
| ); | ||
| } |
There was a problem hiding this comment.
As in the other comment, maybe we should remove this metric entirely for now and add it back once chained merkle shreds are fully rolled out.
The replay check that introduced this metric: On testnet however we do observe it quite frequently: It seems that something is wrong with the check. I will hold off on this change and investigate the previous check and get back to you. |
|
False alarm, I checked the block producer for blocks that have been failing the check on testnet today and they are all running firedancer: It seems that FD is not padding the last fec set yet, so this is to be expected. Also I realized that there is no reason to send the bank hash in metrics, as this check is happening before the bank is frozen so the hash is not yet populated 🤦 EDIT: |
7daf78c to
acac298
Compare
behzadnouri
left a comment
There was a problem hiding this comment.
lgtm, aside from minor comments.
| &bank.feature_set, | ||
| ) { | ||
| Ok(block_id) => block_id, | ||
| Err(result_err) => { |
There was a problem hiding this comment.
Instead of match, it might be more readable to use inspect_err:
https://doc.rust-lang.org/std/result/enum.Result.html#method.inspect_err
followed by unwrap_or_default.
There was a problem hiding this comment.
It doesn't work because we can't continue from the inspect_err closure.
| } | ||
| // These metrics are expensive to send because hash does not compress well. | ||
| // Only send these metrics when we are sure the appropriate shred format is being sent | ||
| if !results.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) |
There was a problem hiding this comment.
As mentioned in the other comment, gating on should_chain_merkle_shreds has cluster upgrade issues.
|
|
||
| #[derive(Clone, Copy, PartialEq, Eq, Debug)] | ||
| pub struct LastFECSetCheckResults { | ||
| block_id: Option<Hash>, |
There was a problem hiding this comment.
Is there a place this block_id is defined ?
Would it make sense to call this last_fec_set_merkle_root or something similar to be more expressive?
There was a problem hiding this comment.
agave/sdk/program/src/vote/state/mod.rs
Lines 235 to 238 in a2eaf6d
This is where it is defined. I'll keep it as
last_fec_set_merkle_root here and we can start calling it block_id in the replay pipeline.
| } | ||
| } | ||
|
|
||
| pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool { |
There was a problem hiding this comment.
just a note to move this back if the metric gate on should_chain_merkle_shreds is removed.
…nt, false for legacy
acac298 to
df32e7b
Compare
|
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root (cherry picked from commit 93edb65) # Conflicts: # ledger/src/shred.rs # sdk/src/feature_set.rs
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…shreds (backport of #2101) (#2192) replay: extend last fec set check for 32+ retransmitter signed shreds (#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>

Problem
When the feature flag is turned on, we need to verify that the last fec set uses the new shred format.
Because the last fec set is not known when receiving shreds, and the new shred format is only used for the last set, we cannot perform this check until replay has completed.
Summary of Changes
Extend the check which compares the merkle roots of the last fec set to additionally check that they are all of the retransmitter shred variant.