-
Notifications
You must be signed in to change notification settings - Fork 17
fix(statesync): statesync is unstable and doesn't time out #1059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis pull request implements a series of updates across multiple modules. Trust parameters in state synchronization have been removed and replaced with a retry mechanism. A new Changes
Possibly related PRs
Suggested reviewers
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (1.64.8)Error: you are using a configuration file for golangci-lint v2 with golangci-lint v1: please use golangci-lint v2 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (9)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (12)
libs/math/safemath.go (2)
164-199: Consider using type constraints for better readability.The implementation is good, but readability could be improved by using a more descriptive type constraint name.
-type Integer interface { +type Number interface { ~int | ~int8 | ~int16 | ~int32 | ~int64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 } -func MustConvert[FROM Integer, TO Integer](a FROM) TO { +func MustConvert[FROM Number, TO Number](a FROM) TO {
263-312: Improve panic messages in Max/Min functions.The implementation is good, but the panic messages could be more descriptive.
- panic("unsupported type") + panic(fmt.Sprintf("unsupported type %T", max))libs/math/safemath_test.go (1)
89-207: LGTM! Comprehensive test coverage.The test cases are thorough and well-structured, covering various integer types and edge cases. Consider adding a helper function to reduce the nested switch statements and improve maintainability.
+func convertValue[F Integer, T Integer](from F, want T) (interface{}, error) { + return SafeConvert[F, T](from) +} func TestSafeConvert(t *testing.T) { // ... test cases ... for i, tc := range testCases { testName := fmt.Sprintf("%d:%T(%d)-%T(%d)", i, tc.from, tc.from, tc.want, tc.want) t.Run(testName, func(t *testing.T) { - var result interface{} - var err error - switch from := tc.from.(type) { - case int: - switch tc.want.(type) { - case int64: - result, err = SafeConvert[int, int64](from) + result, err := convertValue(tc.from, tc.want)types/validator_test.go (5)
119-126: Consider parameterizing test vectors.The test uses hardcoded test vectors. Consider parameterizing them into a table-driven test to make it easier to add more test cases in the future.
-func TestValidatorSetHashVectors(t *testing.T) { - thresholdPublicKey, err := base64.RawStdEncoding.DecodeString("gw5F5F5kFNnWFUc8woFOaxccUI+cd+ixaSS3RZT2HJlWpvoWM16YRn6sjYvbdtGH") - require.NoError(t, err) - - quorumHash, err := hex.DecodeString("703ee5bfc78765cc9e151d8dd84e30e196ababa83ac6cbdee31a88a46bba81b9") - require.NoError(t, err) - - expected := "81742F95E99EAE96ABC727FE792CECB4996205DE6BFC88AFEE1F60B96BC648B2" +func TestValidatorSetHashVectors(t *testing.T) { + testCases := []struct { + name string + thresholdPublicKey string + quorumHash string + expectedHash string + }{ + { + name: "valid hash generation", + thresholdPublicKey: "gw5F5F5kFNnWFUc8woFOaxccUI+cd+ixaSS3RZT2HJlWpvoWM16YRn6sjYvbdtGH", + quorumHash: "703ee5bfc78765cc9e151d8dd84e30e196ababa83ac6cbdee31a88a46bba81b9", + expectedHash: "81742F95E99EAE96ABC727FE792CECB4996205DE6BFC88AFEE1F60B96BC648B2", + }, + } + + for _, tc := range testCases { + t.Run(tc.name, func(t *testing.T) { + thresholdPublicKey, err := base64.RawStdEncoding.DecodeString(tc.thresholdPublicKey) + require.NoError(t, err) + + quorumHash, err := hex.DecodeString(tc.quorumHash) + require.NoError(t, err)
128-134: Add test documentation and error cases.The test function lacks documentation and error test cases. Consider:
- Adding a function comment explaining the test's purpose
- Including error cases for invalid inputs
+// TestValidatorSetHashVectors verifies that the ValidatorSet hash is correctly computed +// from a given threshold public key and quorum hash. func TestValidatorSetHashVectors(t *testing.T) { // ... test cases setup ... + // Add error cases + t.Run("invalid public key", func(t *testing.T) { + valset := ValidatorSet{ + ThresholdPublicKey: bls12381.PubKey([]byte("invalid")), + QuorumHash: quorumHash, + } + assert.Panics(t, func() { valset.Hash() }) + })
292-306: Improve error handling in Sync method.The error handling for
errNoSnapshotsis good, but consider adding metrics or telemetry to track fallback occurrences.if errors.Is(err, errNoSnapshots) && r.postSyncHook != nil { r.logger.Warn("no snapshots available; falling back to block sync", "err", err) + if r.metrics != nil { + r.metrics.StateSyncFallbacks.Add(1) + }
1068-1104: LGTM! Enhanced snapshot safety checks.Good addition of safety checks to ensure snapshots are only accepted when the next block is finalized.
This change improves system reliability by preventing potential state inconsistencies during snapshot synchronization.
877-877: Use safe integer conversion.Good use of
tmmath.MustConvertInt64for safe integer conversion, but consider handling the potential panic.-cp, err := r.stateStore.LoadConsensusParams(tmmath.MustConvertInt64(msg.Height)) +height, err := tmmath.SafeConvertInt64(msg.Height) +if err != nil { + r.logger.Error("invalid height in params request", + "height", msg.Height, + "error", err) + return nil +} +cp, err := r.stateStore.LoadConsensusParams(height)internal/libs/sync/mutexguard_test.go (4)
14-16: Consider increasing the timeout duration for CI environments.While 1 second is reasonable for local testing, it might be too short for CI environments or slower machines. Consider making it configurable via an environment variable.
const ( - timeout = 1 * time.Second + timeout = getTestTimeout() ) + +func getTestTimeout() time.Duration { + if t := os.Getenv("TEST_TIMEOUT"); t != "" { + if d, err := time.ParseDuration(t); err == nil { + return d + } + } + return 1 * time.Second +}
18-46: Consider adding edge cases to strengthen the test.The test covers basic unlock scenarios well. Consider adding these edge cases:
- Panic recovery when unlocking an already unlocked mutex
- Concurrent unlock attempts from multiple goroutines
func TestLockGuardMultipleUnlocks(t *testing.T) { + t.Run("panic recovery", func(t *testing.T) { + var mtx deadlock.Mutex + mtx.Lock() + mtx.Unlock() + assert.NotPanics(t, func() { + unlock := tmsync.LockGuard(&mtx) + unlock() + }) + }) + + t.Run("concurrent unlocks", func(t *testing.T) { + var mtx deadlock.Mutex + unlock := tmsync.LockGuard(&mtx) + var wg sync.WaitGroup + for i := 0; i < 10; i++ { + wg.Add(1) + go func() { + defer wg.Done() + assert.False(t, unlock()) + }() + } + wg.Wait() + })
70-105: Verify concurrent read access.The test should verify that multiple reads can occur simultaneously. Consider adding a channel-based synchronization to confirm this behavior.
func TestRLockGuard(t *testing.T) { var mtx deadlock.RWMutex var counter int var wg sync.WaitGroup + reading := make(chan struct{}) + done := make(chan struct{}) read := func() { defer wg.Done() unlock := tmsync.RLockGuard(&mtx) defer unlock() + reading <- struct{}{} // Signal that we're reading + <-done // Wait for verification _ = counter } + + // Verify concurrent reads + go func() { + // Wait for at least 2 readers + <-reading + <-reading + close(done) + }()
107-120: Make the helper more test-friendly.Instead of panicking, consider returning an error to allow for better test control and error handling.
-func waitFor(f func()) { +func waitFor(f func()) error { done := make(chan struct{}) go func() { f() close(done) }() select { case <-time.After(timeout): - panic("Test timed out") + return fmt.Errorf("test timed out after %v", timeout) case <-done: + return nil } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
internal/libs/sync/mutexguard_test.go(1 hunks)internal/statesync/reactor.go(22 hunks)libs/math/safemath.go(5 hunks)libs/math/safemath_test.go(2 hunks)types/validator_test.go(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
libs/math/safemath_test.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#1026
File: libs/math/safemath.go:123-161
Timestamp: 2025-01-20T13:38:10.707Z
Learning: The `SafeConvert` function in `libs/math/safemath.go` correctly handles negative values during integer type conversion through two mechanisms: explicit checks for int64 to uint64 conversion, and the Min[U]() check that catches negative values when converting to unsigned types.
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: test_apps
- GitHub Check: test_abci_cli
- GitHub Check: Super linter
- GitHub Check: e2e-test (rotate)
- GitHub Check: e2e-test (dashcore)
🔇 Additional comments (12)
libs/math/safemath.go (2)
9-9: LGTM! Good error consolidation.Consolidating multiple specific overflow errors into a single
ErrOverflowsimplifies error handling while maintaining descriptive error messages.
117-158: LGTM! Robust generic implementation.The
SafeConvertfunction is well-implemented with:
- Efficient platform-specific size checks using constants
- Proper handling of edge cases for int64/uint64
- Comprehensive overflow checks using Max/Min
types/validator_test.go (3)
72-73: LGTM! Good practice to extract magic number into a constant.The introduction of
minPeersconstant improves code maintainability by centralizing the minimum peer requirement.
141-148: LGTM! Well-documented interface definition.The
ConsensusStateProviderinterface is well-documented and follows the interface segregation principle with minimal required methods.
1000-1000: LGTM! Improved logging granularity.Good change from Info to Trace level for peer updates, which reduces log noise while maintaining observability.
Also applies to: 1009-1009
internal/libs/sync/mutexguard_test.go (2)
1-12: LGTM! Well-organized imports and package declaration.The package is correctly named
sync_testfor external tests, and imports are properly organized with standard library packages first, followed by third-party and internal packages.
48-68:❓ Verification inconclusive
Add race condition verification.
While the test correctly verifies concurrent access, it would be beneficial to explicitly run it with the race detector and verify the absence of data races.
🏁 Script executed:
#!/bin/bash # Description: Check if tests pass with race detection enabled go test -race ./internal/libs/sync/...Length of output: 139
Update test instructions to enable race detection by setting CGO_ENABLED=1
While the test correctly verifies concurrent access, it's recommended to run it with the race detector to confirm the absence of data races. However, note that the race detector requires cgo to be enabled. Please update your test command accordingly:
- Run the tests using:
CGO_ENABLED=1 go test -race ./internal/libs/sync/...internal/statesync/reactor.go (5)
24-24: LGTM! Good improvements to code organization.The changes improve code maintainability by:
- Adding safe integer conversion utilities via
tmmath- Centralizing the minimum peer requirement into a constant instead of using magic numbers
Also applies to: 72-73
138-138: Great use of interface-based design!The introduction of
ConsensusStateProviderinterface improves:
- Testability by enabling mock implementations
- Modularity by decoupling the reactor from concrete consensus state
- Code clarity through focused, well-documented interface
Also applies to: 141-148
292-306: Excellent error handling and fallback mechanism!The changes improve system resilience by:
- Gracefully handling the case when no snapshots are available
- Providing a fallback to block sync
- Using clear, informative error messages
1068-1104: Robust snapshot safety improvements!The changes enhance data consistency by:
- Adding consensus state validation
- Ensuring snapshots are only accepted when the next block is finalized
- Including helpful debug logging
1000-1000: Appropriate adjustment of logging levels!Changed peer update logging from Info to Trace level, which:
- Reduces log noise for frequent events
- Maintains consistency with other reactor implementations
- Still provides detailed debugging capability when needed
Also applies to: 1009-1009
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
libs/math/safemath.go (3)
93-111: Add parameter documentation to conversion functions.The functions are well-implemented, but the parameter documentation is missing. Add descriptions for the generic type parameter
T.-// SafeConvertUint64 takes a int and checks if it overflows. +// SafeConvertUint64 takes a value of type T and converts it to uint64, checking for overflow. +// Type parameter T must satisfy the Integer constraint. func SafeConvertUint64[T Integer](a T) (uint64, error) {
119-158: Simplify the type switch in SafeConvert.The type switch could be simplified by removing redundant comments and empty returns.
func SafeConvert[F Integer, T Integer](from F) (T, error) { const uintIsSmall = math.MaxUint < math.MaxUint64 const intIsSmall = math.MaxInt < math.MaxInt64 && math.MinInt > math.MinInt64 switch any(from).(type) { case int64: if _, ok := any(T(0)).(uint64); ok && from < 0 { return 0, ErrOverflow } - // return T(from), nil case uint64: if _, ok := any(T(0)).(int64); ok && uint64(from) > math.MaxInt64 { return 0, ErrOverflow } - // return T(from), nil case int: if !intIsSmall { return SafeConvert[int64, T](int64(from)) } - // no return here - it's safe to use normal logic case uint: if !uintIsSmall { return SafeConvert[uint64, T](uint64(from)) } - // no return here - it's safe to use normal logic }
263-312: Make panic messages consistent between Max and Min.The error messages for unsupported types should be consistent:
- Max uses:
unsupported type %T- Min uses:
unsupported type- panic("unsupported type") + panic(fmt.Sprintf("unsupported type %T", T(0)))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
libs/math/safemath.go(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: test_apps
- GitHub Check: test_abci_cli
- GitHub Check: e2e-test (rotate)
- GitHub Check: tests (01)
- GitHub Check: e2e-test (dashcore)
- GitHub Check: Super linter
🔇 Additional comments (2)
libs/math/safemath.go (2)
9-9: LGTM! Good error consolidation.Consolidating multiple specific overflow errors into a single
ErrOverflowreduces code duplication and improves maintainability.
26-28: LGTM! Consistent error handling.The error handling is consistently updated across all functions to use the consolidated
ErrOverflow.Also applies to: 36-38, 66-68, 76-78, 86-88, 224-226, 234-236
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
libs/math/safemath_test.go (2)
89-207: LGTM! Comprehensive test coverage for safe integer conversions.The test suite thoroughly validates integer type conversions, including edge cases and boundary conditions. The table-driven approach with descriptive test names enhances maintainability.
Consider these improvements:
- Add test cases for:
- Conversions between uint16/uint8
- Conversions to/from custom integer types
- Extract type conversion logic into helper functions to reduce switch statement complexity:
func convertValue[From, To any](t *testing.T, i int, from From) (To, error) { switch any(from).(type) { case int: return SafeConvert[From, To](from) // ... other cases default: t.Fatalf("unsupported source type %T", from) return *new(To), nil } }
209-213: Add more test cases for panic behavior.While the current test cases cover basic scenarios, consider adding more cases to thoroughly test panic behavior:
func TestMustConvertPanics(t *testing.T) { assert.NotPanics(t, func() { MustConvert[int32, int32](0) }) assert.Panics(t, func() { MustConvert[int32, int16](math.MaxInt16 + 1) }) assert.NotPanics(t, func() { MustConvert[int32, int16](math.MaxInt16) }) // Additional test cases assert.Panics(t, func() { MustConvert[int64, uint64](math.MinInt64) }) assert.Panics(t, func() { MustConvert[uint64, int64](math.MaxUint64) }) assert.NotPanics(t, func() { MustConvert[int16, uint32](math.MaxInt16) }) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
libs/math/safemath_test.go(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
libs/math/safemath_test.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#1026
File: libs/math/safemath.go:123-161
Timestamp: 2025-01-20T13:38:10.707Z
Learning: The `SafeConvert` function in `libs/math/safemath.go` correctly handles negative values during integer type conversion through two mechanisms: explicit checks for int64 to uint64 conversion, and the Min[U]() check that catches negative values when converting to unsigned types.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: tests (01)
- GitHub Check: e2e-test (rotate)
- GitHub Check: golangci-lint
- GitHub Check: Build (amd64, linux)
- GitHub Check: e2e-test (dashcore)
- GitHub Check: Super linter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/statesync/reactor.go (1)
1068-1104: Document the block height requirement.While the code comments explain the need for finalized blocks, consider adding more detailed documentation about:
- Why exactly 2 blocks are needed
- The relationship between snapshot height and block finalization
- The potential edge cases this prevents
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/statesync/reactor.go(22 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test_abci_cli
- GitHub Check: test_apps
- GitHub Check: e2e-test (rotate)
- GitHub Check: tests (01)
- GitHub Check: Super linter
- GitHub Check: e2e-test (dashcore)
- GitHub Check: golangci-lint
🔇 Additional comments (5)
internal/statesync/reactor.go (5)
24-24: LGTM! Good addition of safe math utilities.The addition of
tmmathimport enables safe integer conversions, helping prevent potential overflow issues.
72-73: LGTM! Good constant extraction.Extracting
minPeersinto a constant improves maintainability by centralizing the minimum peer requirement.
141-148: LGTM! Well-designed interface for improved testability.The
ConsensusStateProviderinterface is well-documented and follows interface segregation principle by including only the required methods.
138-138: LGTM! Good dependency inversion.Replacing the concrete type with
ConsensusStateProviderinterface follows the Dependency Inversion Principle, improving testability and flexibility.
292-306:✅ Verification successful
Verify the error variable definition.
The error handling for no snapshots case looks good, but please ensure that
errNoSnapshotsis properly defined.
🏁 Script executed:
#!/bin/bash # Search for errNoSnapshots definition rg "errNoSnapshots.*=.*" internal/statesync/Length of output: 240
Verified:
errNoSnapshotsis properly defined.The error variable is defined in
internal/statesync/syncer.goaserrNoSnapshots = errors.New("no suitable snapshots found"), so the handling ininternal/statesync/reactor.gois correct.
This reverts commit 8eeccd4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/statesync/reactor.go (1)
877-877: Consider using non-panicking integer conversion.While
tmmath.MustConvertInt64prevents overflow issues, it can panic on invalid input. Consider using the non-panicking version that returns an error which can be handled gracefully.Example approach:
-height := tmmath.MustConvertInt64(msg.Height) +height, err := tmmath.SafeConvertInt64(msg.Height) +if err != nil { + return fmt.Errorf("invalid height: %w", err) +}Also applies to: 1101-1101, 1120-1120
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/statesync/reactor.go(22 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: test_abci_cli
- GitHub Check: test_apps
- GitHub Check: build
- GitHub Check: tests (05)
- GitHub Check: tests (03)
- GitHub Check: tests (02)
- GitHub Check: tests (01)
- GitHub Check: e2e-test (rotate)
- GitHub Check: golangci-lint
- GitHub Check: e2e-test (dashcore)
- GitHub Check: Super linter
🔇 Additional comments (5)
internal/statesync/reactor.go (5)
71-74: LGTM! Good refactoring of the minimum peers constant.Extracting the hardcoded value into a named constant improves maintainability and makes the code more self-documenting.
141-148: LGTM! Well-designed interface abstraction.The
ConsensusStateProviderinterface improves testability and follows the Interface Segregation Principle by defining only the required methods.
292-306: LGTM! Improved error handling with graceful fallback.The enhanced error handling now:
- Distinguishes between no snapshots and other errors
- Provides a graceful fallback to block sync when no snapshots are available
- Uses clear and informative error messages
1068-1104: LGTM! Enhanced snapshot safety with finalization check.The addition of the finalization check is a critical safety improvement that ensures:
- Only snapshots with finalized next blocks are accepted
- The node won't sync to unfinalized state
- Appropriate debug logging for troubleshooting
1000-1000: LGTM! Appropriate adjustment of logging levels.Changed peer update logging from Info to Trace level, which is more appropriate for these frequent events and aligns with similar changes in other reactor files.
Also applies to: 1009-1009
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Issue being fixed or feature implemented
We need state sync to be working properly. During testing, several issues arised, like:
What was done?
rotatetesttrust-height,trust-hash,trust-periodfrom[statesync]sectionretriessetting to[statesync]sectionHow Has This Been Tested?
Added statesync to p2p tests, also tested on dash platform mainnet as part of platform state sync effort.
Breaking Changes
Non-breaking changes in config file
Checklist:
For repository code-owners and collaborators only
Summary by CodeRabbit
New Features
ConsensusStateProviderinterface to facilitate testing.Refactor
Chores
Tests
LockGuardandRLockGuardfunctions to ensure robust mutex handling.