fix deadlock between the mempool and the db #15421

Merged
merged 1 commit into from
Nov 27, 2024

@msmouse msmouse commented Nov 27, 2024

Description

We started to do this par_iter in the global pool instead of the "exe pool" here:
#15184 (comment)

and it triggered a deadlock like the ones we saw repeatedly before:
#15068

To fix it and safeguard against it, this PR does two things:

  1. move the mempool validation to its dedicated pool -- it's not ideal to wait on a lock in the global pool
  2. move the par_iter back to the "exe pool"

It's not ideal that any rayon pool is involved in the "current_state" lock, whose critical section was meant to be small. This won't be an issue anymore once I change the representation of the state delta from hashmaps to layeredmaps.
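
A minimal sketch of the hazard and the fix described above, under stated assumptions. It is not the code from this PR: it uses plain `std` threads and channels as a stand-in for rayon pools so that it runs without dependencies, and `Pool`, `run_demo`, `exe_pool`, and `validation_pool` are hypothetical names. The committer holds the `current_state` lock while fanning work out to the exe pool; validation tasks block on the same lock, but only tie up workers in their own pool. If both kinds of task shared one two-worker pool, the two blocked validation tasks would occupy every worker and the fan-out could never run.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Tiny fixed-size worker pool (hypothetical stand-in for a rayon thread pool).
struct Pool {
    tx: Sender<Job>,
}

impl Pool {
    fn new(workers: usize) -> Self {
        let (tx, rx) = channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..workers {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The receiver guard is released at the end of this statement,
                // so the job itself runs without holding it.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(job) => job(),
                    Err(_) => break, // all senders dropped: shut the worker down
                }
            });
        }
        Pool { tx }
    }

    fn spawn(&self, job: impl FnOnce() + Send + 'static) {
        self.tx.send(Box::new(job)).unwrap();
    }
}

/// Returns the committed state and the values seen by the two validators.
fn run_demo() -> (u64, Vec<u64>) {
    let exe_pool = Pool::new(2);
    let validation_pool = Pool::new(2); // dedicated pool for lock-waiting work
    let current_state = Arc::new(Mutex::new(0u64));

    // The committer takes the lock before any validation request arrives.
    let mut state = current_state.lock().unwrap();

    // Validation tasks block on the lock, occupying validation workers only.
    let (vtx, vrx) = channel();
    for _ in 0..2 {
        let cs = Arc::clone(&current_state);
        let vtx = vtx.clone();
        validation_pool.spawn(move || {
            let seen = *cs.lock().unwrap();
            vtx.send(seen).unwrap();
        });
    }
    drop(vtx);

    // While still holding the lock, fan work out to the exe pool and wait.
    // This only completes because the exe workers are not stuck on the lock.
    let (etx, erx) = channel();
    for i in 1..=4u64 {
        let etx = etx.clone();
        exe_pool.spawn(move || etx.send(i * i).unwrap());
    }
    drop(etx);
    let sum: u64 = erx.iter().sum();

    *state = sum;
    drop(state); // release the lock; the validators can now proceed

    let seen: Vec<u64> = vrx.iter().collect();
    (sum, seen)
}
```

Called from `main`, `run_demo()` returns `(30, vec![30, 30])`: the fan-out sums the squares 1 + 4 + 9 + 16 while the lock is held, and both validators read the state only after it is released.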

How Has This Been Tested?

Key Areas to Review

Type of Change

  • Bug fix

Which Components or Systems Does This Change Impact?

  • Validator Node

trunk-io bot commented Nov 27, 2024

@msmouse msmouse enabled auto-merge (squash) November 27, 2024 20:51

✅ Forge suite realistic_env_max_load success on a623550637e74b2ca0a3f69b367a8d9434057990

two traffics test: inner traffic : committed: 14077.78 txn/s, latency: 2822.52 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 6600 ms), latency samples: 5352860
two traffics test : committed: 99.89 txn/s, latency: 2790.45 ms, (p50: 1500 ms, p70: 2000, p90: 6500 ms, p99: 18400 ms), latency samples: 1860
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.474, avg: 1.429", "ConsensusProposalToOrdered: max: 0.317, avg: 0.293", "ConsensusOrderedToCommit: max: 0.378, avg: 0.365", "ConsensusProposalToCommit: max: 0.675, avg: 0.658"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.08s no progress at version 1789888 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 15.98s no progress at version 6009620 (avg 15.83s) [limit 16].
Test Ok

✅ Forge suite compat success on 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990

Compatibility test results for 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990 (PR)
1. Check liveness of validators at old version: 010570d3b7aa20889fb5ad0e5b23800aa33f5634
compatibility::simple-validator-upgrade::liveness-check : committed: 17443.43 txn/s, latency: 1956.72 ms, (p50: 2100 ms, p70: 2100, p90: 2200 ms, p99: 2400 ms), latency samples: 559120
2. Upgrading first Validator to new version: a623550637e74b2ca0a3f69b367a8d9434057990
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7291.24 txn/s, latency: 3763.49 ms, (p50: 4100 ms, p70: 4600, p90: 5100 ms, p99: 5400 ms), latency samples: 130520
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7309.82 txn/s, latency: 4354.53 ms, (p50: 4500 ms, p70: 4700, p90: 6400 ms, p99: 6700 ms), latency samples: 244120
3. Upgrading rest of first batch to new version: a623550637e74b2ca0a3f69b367a8d9434057990
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7329.79 txn/s, latency: 3859.48 ms, (p50: 4300 ms, p70: 4600, p90: 5100 ms, p99: 5200 ms), latency samples: 131140
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7332.37 txn/s, latency: 4413.12 ms, (p50: 4600 ms, p70: 4800, p90: 6000 ms, p99: 6400 ms), latency samples: 243840
4. upgrading second batch to new version: a623550637e74b2ca0a3f69b367a8d9434057990
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10509.89 txn/s, latency: 2667.80 ms, (p50: 2700 ms, p70: 2900, p90: 4000 ms, p99: 4300 ms), latency samples: 190380
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10657.17 txn/s, latency: 2934.46 ms, (p50: 2700 ms, p70: 2900, p90: 5600 ms, p99: 6600 ms), latency samples: 346060
5. check swarm health
Compatibility test for 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990 passed
Test Ok

✅ Forge suite framework_upgrade success on 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990

Compatibility test results for 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990 (PR)
Upgrade the nodes to version: a623550637e74b2ca0a3f69b367a8d9434057990
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1292.16 txn/s, submitted: 1295.06 txn/s, failed submission: 2.90 txn/s, expired: 2.90 txn/s, latency: 2274.47 ms, (p50: 2100 ms, p70: 2400, p90: 3000 ms, p99: 5000 ms), latency samples: 115800
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1267.61 txn/s, submitted: 1270.24 txn/s, failed submission: 2.63 txn/s, expired: 2.63 txn/s, latency: 2306.20 ms, (p50: 2100 ms, p70: 2400, p90: 3500 ms, p99: 5100 ms), latency samples: 115760
5. check swarm health
Compatibility test for 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> a623550637e74b2ca0a3f69b367a8d9434057990 passed
Upgrade the remaining nodes to version: a623550637e74b2ca0a3f69b367a8d9434057990
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1289.08 txn/s, submitted: 1291.97 txn/s, failed submission: 2.89 txn/s, expired: 2.89 txn/s, latency: 2423.45 ms, (p50: 2300 ms, p70: 2700, p90: 3600 ms, p99: 4500 ms), latency samples: 115960
Test Ok

@msmouse msmouse merged commit de9040d into main Nov 27, 2024
89 of 90 checks passed
@msmouse msmouse deleted the 1127-alden-fix-deadlock branch November 27, 2024 21:22