feat: live backups evm and ev-node #2758
- Added Backup method to Store interface and DefaultStore implementation to stream a Badger backup of the datastore.
- Introduced BackupRequest and BackupResponse messages in the state_rpc.proto file to handle backup requests and responses.
- Implemented backup streaming logic in StoreServer, including metadata handling for current and target heights.
- Created a backupStreamWriter to manage chunked writing of backup data.
- Updated client tests to validate the Backup functionality.
- Enhanced mock store to support Backup method for testing.
- Added unit tests for Backup functionality in the store package.
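To illustrate the chunked-writing idea behind backupStreamWriter, here is a minimal sketch. It is not the PR's implementation: `chunkSender`, `chunkedWriter`, and `recordingSender` are hypothetical names, and the real stream type and message shape in state_rpc.proto are assumed rather than reproduced.

```go
package main

import "errors"

// chunkSender is a hypothetical stand-in for the server-side RPC stream.
type chunkSender interface {
	Send(chunk []byte) error
}

// chunkedWriter buffers writes and forwards them in fixed-size chunks.
type chunkedWriter struct {
	sender    chunkSender
	chunkSize int
	buf       []byte
}

func newChunkedWriter(s chunkSender, chunkSize int) (*chunkedWriter, error) {
	if chunkSize <= 0 {
		return nil, errors.New("chunk size must be positive")
	}
	return &chunkedWriter{sender: s, chunkSize: chunkSize, buf: make([]byte, 0, chunkSize)}, nil
}

// Write implements io.Writer. A chunk is sent each time the buffer fills up.
func (w *chunkedWriter) Write(p []byte) (int, error) {
	written := 0
	for len(p) > 0 {
		n := w.chunkSize - len(w.buf)
		if n > len(p) {
			n = len(p)
		}
		w.buf = append(w.buf, p[:n]...)
		p = p[n:]
		written += n
		if len(w.buf) == w.chunkSize {
			if err := w.Flush(); err != nil {
				return written, err
			}
		}
	}
	return written, nil
}

// Flush sends any buffered bytes as a final, possibly shorter, chunk.
func (w *chunkedWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	// Copy so the sender may retain the slice after the buffer is reused.
	chunk := make([]byte, len(w.buf))
	copy(chunk, w.buf)
	w.buf = w.buf[:0]
	return w.sender.Send(chunk)
}

// recordingSender is a hypothetical in-memory sender used only for illustration.
type recordingSender struct{ chunks [][]byte }

func (r *recordingSender) Send(chunk []byte) error {
	r.chunks = append(r.chunks, chunk)
	return nil
}
```

The caller is expected to call Flush once the backup source has finished writing, so that the trailing partial chunk is not dropped.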
…d improve error handling
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
Claude finished @randygrok's task
Comprehensive Code Review
I've completed a thorough review of the backup/restore functionality. Here are my findings organized by focus area:
1. Code Quality & Architecture
Strengths:
Issues & Recommendations:
Critical - Magic Number (pkg/store/restore.go:42):
if err := badgerDatastore.DB.Load(reader, 16); err != nil {
The hardcoded
Medium - Unwrapping Logic Complexity (pkg/store/backup.go:25-33, pkg/store/restore.go:27-62):
func unwrapBadgerDatastore(ds ds.Datastore) (*badger4.Datastore, error) {
	// Centralized unwrapping logic
}
This would:
Medium - Restore Cycle Detection (pkg/store/restore.go:21-52):
Low - Package Naming:
2. Error Handling & Resilience
Strengths:
Issues & Recommendations:
Critical - Context Not Checked During Streaming (pkg/rpc/server/server.go:221):
writer := newBackupStreamWriter(stream, defaultBackupChunkSize)
version, err := s.store.Backup(ctx, writer, since)
The backup operation could run for a long time. The
func (w *backupStreamWriter) Write(p []byte) (int, error) {
	// Add at the start:
	if err := w.ctx.Err(); err != nil {
		return 0, err
	}
	// ... rest of implementation
}
Medium - Restore Direct DB Access (pkg/cmd/restore.go:60-65):
if force {
	if err := os.RemoveAll(dbPath); err != nil {
		return fmt.Errorf("failed to remove existing datastore: %w", err)
	}
}
This is dangerous: if the node is running, this could corrupt state. Should:
Medium - No Checksum Validation:
Low - Buffer Flush Error Handling (pkg/cmd/backup.go:107-111):
if err := writer.Flush(); err != nil {
	_ = file.Close()
	_ = os.Remove(outputPath)
	return fmt.Errorf("failed to flush backup data: %w", err)
}
Good cleanup, but
flushErr := writer.Flush()
closeErr := file.Close()
if flushErr != nil || closeErr != nil {
	_ = os.Remove(outputPath)
	return errors.Join(flushErr, closeErr)
}
3. Security
Issues & Recommendations:
Critical - File Permissions (pkg/cmd/backup.go:74):
file, err := os.OpenFile(outputPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
Good use of
High - Unauthenticated RPC Endpoint:
Medium - Path Traversal (pkg/cmd/backup.go:52-55, restore.go:36-38):
absPath, err := filepath.Abs(outputPath)
// Validate path is safe (not in system directories, etc.)
Low - Reth Backup Script Security (scripts/reth-backup/backup.sh:175-187):
$exec_remote "rm -rf '$REMOTE_TMP' && mkdir -p '$REMOTE_TMP/db' '$REMOTE_TMP/static_files'"
Variables are quoted which is good, but consider:
4. Performance & Resource Efficiency
Strengths:
Issues & Recommendations:
Medium - Chunk Size Not Tunable (pkg/rpc/server/server.go:268):
const defaultBackupChunkSize = 128 * 1024
128KB chunks may not be optimal for all network conditions. Consider:
Medium - No Progress Indication:
Low - Counting Writer Allocation (pkg/cmd/backup.go:81-82):
bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)
var bytesCount countingWriter
streamWriter := io.MultiWriter(writer, &bytesCount)
Minor optimization but good practice.
Low - Restore Thread Count (pkg/store/restore.go:42):
5. Testing & Reliability
Strengths:
Critical Issues:
Test Coverage is 17.86% per codecov report. Specifically missing:
pkg/rpc/server/server.go - 110 missing lines:
pkg/cmd/restore.go - 84 missing lines:
pkg/store/restore.go - 37 missing lines:
Medium - No Integration Tests:
Medium - Reth Backup Scripts Untested:
Recommendations:
6. Documentation & Developer Experience
Strengths:
Issues & Recommendations:
High - Missing Operational Documentation:
Medium - CLI Help Text Could Be Better:
Short: "Stream a datastore backup to a local file via RPC"
Should expand to:
Short: "Create a backup of the ev-node datastore",
Long: `Creates a consistent point-in-time backup of the Badger datastore via RPC.
The backup is performed online (hot backup) without stopping the node.
Supports incremental backups via --since-version flag.
WARNING: Backups may be large (multi-GB). Ensure sufficient disk space.
`,
Medium - Error Messages Need Context:
return 0, fmt.Errorf("backup is only supported for badger4 datastore")
Should tell users:
Low - Comment on Restore Safety (pkg/cmd/restore.go:15-20):
// NewRestoreCmd creates a cobra command that restores a datastore from a Badger backup file.
//
// WARNING: This command will OVERWRITE the existing datastore. The node must be stopped
// before running restore. Running restore while the node is active will cause data corruption.
// Always backup your data before attempting a restore operation.
func NewRestoreCmd() *cobra.Command {
Code Reference Examples:
Summary
Must Fix Before Merge:
Should Fix:
Nice to Have:
Overall Assessment:
This is a solid foundation for backup/restore functionality with good architecture and separation of concerns. The main concerns are:
The code quality is good, but production readiness requires addressing the critical issues above.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2758 +/- ##
==========================================
- Coverage 61.42% 59.55% -1.88%
==========================================
Files 81 85 +4
Lines 8622 9007 +385
==========================================
+ Hits 5296 5364 +68
- Misses 2828 3122 +294
- Partials 498 521 +23
small nits. haven't checked the core logic then.
i think this whole folder should be in github.com/evstack/ev-reth instead
// Aggregator is used to determine if the rollback is performed on the aggregator node.
Rollback(ctx context.Context, height uint64, aggregator bool) error

// Backup writes a consistent backup stream to writer. The returned version can be used
It shouldn't be added to the Rollback interface.
can we wire testapp as well with those?
Overview
Closes: #2641