[qs] batch store bootstrap perf improvements #15491

Merged: ibalajiarun merged 3 commits into main from balaji/qs-batch-store on Dec 4, 2024

ibalajiarun
Contributor

Description

  • [qs] Use expiration buffer to cleanup during bootstrap
    Fixes a bug introduced in #15364 ([consensus] sync improvements to help slow nodes sync better), where batches were not cleaned up with respect to the expiration buffer during bootstrap.
  • [qs] async gc old epoch batches from batch store
    #15364 ([consensus] sync improvements to help slow nodes sync better) doubled the cache duration in the batch store. As a result, bootstrap time during epoch changes also doubled, because all the batches now need to be read from storage. To avoid this, this PR introduces separate paths for epoch changes and restarts. During an epoch change, the old batches are deleted from storage asynchronously. At restart, the batches are still cleaned up synchronously.
  • [qs] monitor! create batch store
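
The split between the epoch-change path and the restart path can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `BatchDb`, `StoredBatch`, `bootstrap`, the `is_new_epoch` flag, and the use of `std::thread::spawn` are all assumptions made for the example, and the expiration check is simplified (it ignores the expiration buffer).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical persisted batch record (not the real quorum store type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct StoredBatch {
    pub epoch: u64,
    pub expiration_usecs: u64,
}

/// Hypothetical stand-in for the on-disk batch database, keyed by digest.
pub type BatchDb = Arc<Mutex<HashMap<u64, StoredBatch>>>;

/// Delete every batch that belongs to an epoch older than `current_epoch`.
pub fn gc_previous_epoch_batches(db: &BatchDb, current_epoch: u64) {
    db.lock().unwrap().retain(|_, b| b.epoch >= current_epoch);
}

/// Read the whole database, drop expired batches, and return the live
/// ones as the initial cache contents.
pub fn populate_cache_and_gc_expired(
    db: &BatchDb,
    last_certified_time: u64,
) -> HashMap<u64, StoredBatch> {
    let mut guard = db.lock().unwrap();
    guard.retain(|_, b| b.expiration_usecs >= last_certified_time);
    (*guard).clone()
}

/// Epoch change: start with an empty cache and delete old batches in the
/// background. Restart: clean up synchronously before returning.
pub fn bootstrap(
    db: BatchDb,
    epoch: u64,
    last_certified_time: u64,
    is_new_epoch: bool,
) -> (HashMap<u64, StoredBatch>, Option<JoinHandle<()>>) {
    if is_new_epoch {
        let db_clone = Arc::clone(&db);
        let handle = thread::spawn(move || gc_previous_epoch_batches(&db_clone, epoch));
        (HashMap::new(), Some(handle))
    } else {
        (populate_cache_and_gc_expired(&db, last_certified_time), None)
    }
}
```

In this sketch the epoch-change path returns immediately without reading storage, which is the property the description is after; the join handle is returned only so a caller or test can wait for the background deletion to finish.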

trunk-io bot commented Dec 4, 2024

⏱️ 1h 41m total CI duration on this PR
| Slowest 15 Jobs | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| forge-compat-test / forge | 36m | 🟥🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-lints | 7m | 🟩 |
| check-dynamic-deps | 7m | 🟩🟩🟩 |
| rust-targeted-unit-tests | 6m | 🟩 |
| rust-cargo-deny | 5m | 🟩🟥🟩 |
| rust-doc-tests | 5m | 🟩 |
| test-target-determinator | 4m | 🟩 |
| execution-performance / test-target-determinator | 4m | 🟩 |
| check | 4m | 🟩 |
| rust-move-tests | 3m | 🟥 |
| rust-move-tests | 3m | |
| general-lints | 2m | 🟩🟥🟩 |
| fetch-last-released-docker-image-tag | 2m | 🟩 |
| semgrep/ci | 1m | 🟩🟩🟩 |

🚨 1 job on the last run was significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
| --- | --- | --- | --- |
| execution-performance / single-node-performance | 12s | 22m | -99% |

@ibalajiarun ibalajiarun force-pushed the balaji/qs-batch-store branch from 3290472 to ab67fd5 Compare December 4, 2024 18:53
        Self::gc_previous_epoch_batches_from_db(db_clone, epoch);
    });
} else {
    Self::populate_cache_and_gc_expired_batches(
Contributor Author

@bchocho @zekun000 I split the QS part of the PR out and addressed your comments here. I renamed the method to be clear.

@ibalajiarun ibalajiarun requested a review from zekun000 December 4, 2024 19:01
@ibalajiarun ibalajiarun enabled auto-merge (squash) December 4, 2024 19:19
Contributor

@bchocho bchocho left a comment

LGTM

@@ -157,7 +157,7 @@ impl BatchStore {
             last_certified_time
         );
         for (digest, value) in db_content {
-            let expiration = value.expiration();
+            let expiration = value.expiration().saturating_sub(expiration_buffer_usecs);
Contributor

Good catch
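
To illustrate the one-line change in the hunk above: `saturating_sub` clamps the result at 0 instead of underflowing when the buffer is larger than the raw expiration. The sketch below is a standalone illustration under stated assumptions. `buffered_expiration` and `split_expired` are hypothetical helpers, and the `<` comparison against `last_certified_time` is an assumption, since the hunk does not show how `expiration` is used afterwards.

```rust
/// Expiration adjusted by the buffer, mirroring the changed line.
/// `saturating_sub` returns 0 rather than underflowing when
/// `expiration_buffer_usecs` exceeds `expiration_usecs`.
pub fn buffered_expiration(expiration_usecs: u64, expiration_buffer_usecs: u64) -> u64 {
    expiration_usecs.saturating_sub(expiration_buffer_usecs)
}

/// Hypothetical helper: split `(id, expiration_usecs)` pairs into
/// (expired ids, live ids) by comparing the adjusted expiration with
/// `last_certified_time`. The comparison direction is an assumption.
pub fn split_expired(
    batches: &[(u64, u64)],
    expiration_buffer_usecs: u64,
    last_certified_time: u64,
) -> (Vec<u64>, Vec<u64>) {
    let mut expired = Vec::new();
    let mut live = Vec::new();
    for &(id, expiration_usecs) in batches {
        if buffered_expiration(expiration_usecs, expiration_buffer_usecs) < last_certified_time {
            expired.push(id);
        } else {
            live.push(id);
        }
    }
    (expired, live)
}
```

A plain `expiration - buffer` on `u64` would panic in debug builds and wrap in release builds whenever the buffer exceeds the expiration, which is why the saturating form is the safer choice for timestamps near zero.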

Contributor

github-actions bot commented Dec 4, 2024

✅ Forge suite realistic_env_max_load success on ab67fd51be09c4055eeaf056af17cd577de19013

two traffics test: inner traffic : committed: 14908.76 txn/s, latency: 2663.89 ms, (p50: 2700 ms, p70: 2700, p90: 2900 ms, p99: 3500 ms), latency samples: 5668640
two traffics test : committed: 100.00 txn/s, latency: 1574.34 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 7800 ms), latency samples: 1740
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.581, avg: 1.440", "ConsensusProposalToOrdered: max: 0.329, avg: 0.291", "ConsensusOrderedToCommit: max: 0.305, avg: 0.294", "ConsensusProposalToCommit: max: 0.592, avg: 0.585"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.61s no progress at version 47854 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.53s no progress at version 2505207 (avg 0.53s) [limit 16].
Test Ok

Contributor

github-actions bot commented Dec 4, 2024

✅ Forge suite framework_upgrade success on 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013

Compatibility test results for 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013 (PR)
Upgrade the nodes to version: ab67fd51be09c4055eeaf056af17cd577de19013
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 880.49 txn/s, submitted: 909.22 txn/s, failed submission: 8.41 txn/s, expired: 28.73 txn/s, latency: 2447.54 ms, (p50: 1800 ms, p70: 2400, p90: 4300 ms, p99: 7900 ms), latency samples: 79583
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1183.76 txn/s, submitted: 1185.83 txn/s, failed submission: 2.07 txn/s, expired: 2.07 txn/s, latency: 2528.18 ms, (p50: 2100 ms, p70: 2600, p90: 4200 ms, p99: 5700 ms), latency samples: 102880
5. check swarm health
Compatibility test for 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013 passed
Upgrade the remaining nodes to version: ab67fd51be09c4055eeaf056af17cd577de19013
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1335.43 txn/s, submitted: 1338.95 txn/s, failed submission: 3.52 txn/s, expired: 3.52 txn/s, latency: 2260.62 ms, (p50: 2400 ms, p70: 2400, p90: 2700 ms, p99: 3600 ms), latency samples: 121440
Test Ok

Contributor

github-actions bot commented Dec 4, 2024

✅ Forge suite compat success on 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013

Compatibility test results for 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013 (PR)
1. Check liveness of validators at old version: 3527aa2e299553b759c515d9843586bad48c802c
compatibility::simple-validator-upgrade::liveness-check : committed: 14519.96 txn/s, latency: 2274.27 ms, (p50: 1900 ms, p70: 2100, p90: 4200 ms, p99: 6900 ms), latency samples: 483500
2. Upgrading first Validator to new version: ab67fd51be09c4055eeaf056af17cd577de19013
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6967.18 txn/s, latency: 4084.56 ms, (p50: 4600 ms, p70: 4900, p90: 5100 ms, p99: 5100 ms), latency samples: 129380
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 261.43 txn/s, submitted: 359.40 txn/s, expired: 97.97 txn/s, latency: 3700.35 ms, (p50: 4300 ms, p70: 4700, p90: 4900 ms, p99: 5100 ms), latency samples: 81411
3. Upgrading rest of first batch to new version: ab67fd51be09c4055eeaf056af17cd577de19013
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7102.99 txn/s, latency: 3967.85 ms, (p50: 4500 ms, p70: 4700, p90: 4900 ms, p99: 5000 ms), latency samples: 130300
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6821.61 txn/s, latency: 4826.45 ms, (p50: 5000 ms, p70: 5000, p90: 6200 ms, p99: 6900 ms), latency samples: 230300
4. upgrading second batch to new version: ab67fd51be09c4055eeaf056af17cd577de19013
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 1516.75 txn/s, submitted: 1742.84 txn/s, expired: 226.10 txn/s, latency: 3603.25 ms, (p50: 2500 ms, p70: 3600, p90: 9100 ms, p99: 10400 ms), latency samples: 108401
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10492.25 txn/s, latency: 2994.81 ms, (p50: 2700 ms, p70: 3500, p90: 4100 ms, p99: 4600 ms), latency samples: 343840
5. check swarm health
Compatibility test for 3527aa2e299553b759c515d9843586bad48c802c ==> ab67fd51be09c4055eeaf056af17cd577de19013 passed
Test Ok

@ibalajiarun ibalajiarun merged commit 63f0df8 into main Dec 4, 2024
82 of 88 checks passed
@ibalajiarun ibalajiarun deleted the balaji/qs-batch-store branch December 4, 2024 21:30
danielxiangzl pushed a commit that referenced this pull request Dec 12, 2024
* [qs] Use expiration buffer to cleanup during bootstrap
* [qs] async gc old epoch batches from batch store
* [qs] monitor! create batch store
georgemitenkov pushed a commit that referenced this pull request Jan 6, 2025
* [qs] Use expiration buffer to cleanup during bootstrap
* [qs] async gc old epoch batches from batch store
* [qs] monitor! create batch store
3 participants