Skip to content

Conversation

randygrok
Copy link
Contributor

Overview

Closes: #2641

- Added Backup method to Store interface and DefaultStore implementation to stream a Badger backup of the datastore.
- Introduced BackupRequest and BackupResponse messages in the state_rpc.proto file to handle backup requests and responses.
- Implemented backup streaming logic in StoreServer, including metadata handling for current and target heights.
- Created a backupStreamWriter to manage chunked writing of backup data.
- Updated client tests to validate the Backup functionality.
- Enhanced mock store to support Backup method for testing.
- Added unit tests for Backup functionality in the store package.
Copy link
Contributor

github-actions bot commented Oct 14, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedOct 15, 2025, 9:42 AM

Copy link
Contributor

claude bot commented Oct 14, 2025

Claude finished @randygrok's task —— View job


Comprehensive Code Review

I've completed a thorough review of the backup/restore functionality. Here are my findings organized by focus area:


1. Code Quality & Architecture

Strengths:

  • Clean separation of concerns: store layer handles Badger operations, RPC layer exposes functionality, CLI provides user interface
  • Well-structured protobuf definitions with streaming support for large backups
  • Good use of Go interfaces (Store interface properly extended with Backup/Restore methods in pkg/store/types.go:80-84)

Issues & Recommendations:

Critical - Magic Number (pkg/store/restore.go:42):

if err := badgerDatastore.DB.Load(reader, 16); err != nil {

The hardcoded 16 is undocumented. This appears to be the number of threads for parallel restore. Should be:

Medium - Unwrapping Logic Complexity (pkg/store/backup.go:25-33, pkg/store/restore.go:27-62):
Both files have similar datastore unwrapping logic that could be extracted to a shared helper:

func unwrapBadgerDatastore(ds ds.Datastore) (*badger4.Datastore, error) {
    // Centralized unwrapping logic
}

This would:

  • Reduce code duplication
  • Make the unwrapping strategy testable
  • Improve maintainability

Medium - Restore Cycle Detection (pkg/store/restore.go:21-52):
The visited map is good for preventing infinite loops, but the implementation:

  • Doesn't communicate why it failed to find a supported datastore
  • Could log which datastores were encountered during traversal for debugging
  • Consider adding a max depth limit as defense-in-depth

Low - Package Naming:
The backup.go and restore.go files are in pkg/store/ but don't have tests alongside them. Consider whether backup/restore should be in a subpackage (e.g., pkg/store/snapshot/) for better organization.


2. Error Handling & Resilience

Strengths:

  • Proper error wrapping with %w throughout
  • Good cleanup on failure (pkg/cmd/backup.go:100-104, 108-110, 114-116)
  • Partial file removal on backup failure prevents corrupt snapshots

Issues & Recommendations:

Critical - Context Not Checked During Streaming (pkg/rpc/server/server.go:221):

writer := newBackupStreamWriter(stream, defaultBackupChunkSize)
version, err := s.store.Backup(ctx, writer, since)

The backup operation could run for a long time. The backupStreamWriter.Write method (line 290-316) doesn't check ctx.Err() between chunks. If the client disconnects, the server continues writing. Recommendation:

func (w *backupStreamWriter) Write(p []byte) (int, error) {
    // Add at the start:
    if err := w.ctx.Err(); err != nil {
        return 0, err
    }
    // ... rest of implementation
}

Medium - Restore Direct DB Access (pkg/cmd/restore.go:60-65):
The restore command forcefully removes the existing datastore:

if force {
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}

This is dangerous - if the node is running, this could corrupt state. Should:

  • Check if a lock file exists or if the database is open
  • Warn users prominently in documentation
  • Consider requiring an explicit --i-know-this-is-dangerous flag

Medium - No Checksum Validation:
Neither backup nor restore validates data integrity. Badger's backup format may have internal checksums, but:

  • Should document this assumption
  • Consider adding SHA256 hash to BackupMetadata for end-to-end verification
  • Restore should verify backup completeness before applying

Low - Buffer Flush Error Handling (pkg/cmd/backup.go:107-111):

if err := writer.Flush(); err != nil {
    _ = file.Close()
    _ = os.Remove(outputPath)
    return fmt.Errorf("failed to flush backup data: %w", err)
}

Good cleanup, but file.Close() error is silently ignored. Consider:

flushErr := writer.Flush()
closeErr := file.Close()
if flushErr != nil || closeErr != nil {
    _ = os.Remove(outputPath)
    return errors.Join(flushErr, closeErr)
}

3. Security

Issues & Recommendations:

Critical - File Permissions (pkg/cmd/backup.go:74):

file, err := os.OpenFile(outputPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)

Good use of 0o600 for the backup file (owner read/write only). However:

  • Directory is created with 0o755 (line 57) - should be 0o700 for sensitive data
  • Restore command creates directories with 0o755 (restore.go:68) - same issue

High - Unauthenticated RPC Endpoint:
The backup RPC endpoint (pkg/rpc/server/server.go:193-266) has no authentication. Anyone with network access can:

  • Stream the entire datastore
  • Potentially DoS by requesting many simultaneous backups
    Recommendations:
  • Add authentication/authorization
  • Implement rate limiting
  • Document that RPC should not be exposed publicly (already in CLAUDE.md, but worth emphasizing)
  • Consider adding a --backup-enabled flag to explicitly opt-in

Medium - Path Traversal (pkg/cmd/backup.go:52-55, restore.go:36-38):
Both commands use filepath.Abs() which is good, but don't validate that the resolved path is in an expected directory. Consider:

absPath, err := filepath.Abs(outputPath)
// Validate path is safe (not in system directories, etc.)

Low - Reth Backup Script Security (scripts/reth-backup/backup.sh:175-187):
The script executes commands in containers with user input:

$exec_remote "rm -rf '$REMOTE_TMP' && mkdir -p '$REMOTE_TMP/db' '$REMOTE_TMP/static_files'"

Variables are quoted which is good, but consider:

  • Input validation on BACKUP_TAG to prevent injection
  • Sanitize container names
  • Current implementation trusts DATADIR completely

4. Performance & Resource Efficiency

Strengths:

  • Buffered I/O with sensible chunk size (128KB for streaming, 1MB for file writes)
  • Streaming architecture prevents loading entire backup into memory
  • Badger's native backup is designed for hot backups (non-blocking)

Issues & Recommendations:

Medium - Chunk Size Not Tunable (pkg/rpc/server/server.go:268):

const defaultBackupChunkSize = 128 * 1024

128KB chunks may not be optimal for all network conditions. Consider:

  • Making it configurable via RPC request
  • Benchmarking different sizes
  • 128KB is reasonable for LAN, but larger chunks (1-4MB) might be better for high-latency links

Medium - No Progress Indication:
The backup command shows no progress during long operations. For multi-GB datastores:

  • Users have no idea if it's working or hung
  • Add periodic progress updates (every N MB or every N seconds)
  • The BackupMetadata stream could include intermediate progress messages

Low - Counting Writer Allocation (pkg/cmd/backup.go:81-82):

bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)

countingWriter is heap-allocated when it could be stack-allocated:

var bytesCount countingWriter
streamWriter := io.MultiWriter(writer, &bytesCount)

Minor optimization but good practice.

Low - Restore Thread Count (pkg/store/restore.go:42):
As mentioned in architecture section, 16 is hardcoded. For resource-constrained environments, this could:

  • Consume too much memory
  • Cause CPU contention
  • Should consider runtime.NumCPU() or make configurable

5. Testing & Reliability

Strengths:

  • Good unit test coverage for happy paths (pkg/cmd/backup_test.go, pkg/rpc/client/client_test.go)
  • Comprehensive store rollback tests with edge cases (pkg/store/store_test.go:662-1195)
  • Mock-based testing allows testing error conditions

Critical Issues:

Test Coverage is 17.86% per codecov report. Specifically missing:

pkg/rpc/server/server.go - 110 missing lines:

  • No tests for Backup RPC endpoint error cases:
    • What happens if Height() fails?
    • Stream send failures?
    • Context cancellation during backup?
    • Concurrent backup requests?

pkg/cmd/restore.go - 84 missing lines:

  • No tests for restore command at all
  • Should test:
    • Missing input file
    • Corrupted backup file
    • Datastore already exists (force vs non-force)
    • Restore failure cleanup

pkg/store/restore.go - 37 missing lines:

  • No unit tests for the Restore method
  • Should test:
    • Unsupported datastore type error
    • Badger Load() failure
    • Different shim wrapper configurations
    • Cycle detection in unwrapping

Medium - No Integration Tests:

  • No end-to-end test of backup → restore → verify
  • Should have a test that:
    1. Creates a store with known data
    2. Backs it up
    3. Restores to a new store
    4. Verifies data integrity

Medium - Reth Backup Scripts Untested:
The bash scripts in scripts/reth-backup/ have:

  • No automated tests
  • No CI validation
  • Should at least have shellcheck validation

Recommendations:

  1. Increase test coverage to >80% before merging
  2. Add integration test: TestBackupRestoreRoundTrip
  3. Add error injection tests for RPC streaming
  4. Test concurrent backup/restore operations
  5. Add shellcheck to CI for bash scripts

6. Documentation & Developer Experience

Strengths:

  • Clear protobuf documentation (proto/evnode/v1/state_rpc.proto:24-25, 70-91)
  • Reth backup README is comprehensive (scripts/reth-backup/README.md - 288 lines!)
  • Good inline comments explaining Badger internals (pkg/store/backup.go:39, pkg/store/restore.go:41-42)

Issues & Recommendations:

High - Missing Operational Documentation:
Need to document:

  • Backup best practices: When to backup? How often? Retention policy?
  • Restore procedures: Can you restore while node is running? (NO!)
  • Incremental backup strategy: How to use since_version effectively?
  • Disaster recovery: Step-by-step guide
  • Performance impact: What's the overhead of backup on a running node?

Medium - CLI Help Text Could Be Better:

Short: "Stream a datastore backup to a local file via RPC"

Should expand to:

Short: "Create a backup of the ev-node datastore",
Long: `Creates a consistent point-in-time backup of the Badger datastore via RPC.

The backup is performed online (hot backup) without stopping the node.
Supports incremental backups via --since-version flag.

WARNING: Backups may be large (multi-GB). Ensure sufficient disk space.
`,

Medium - Error Messages Need Context:
Example from pkg/store/backup.go:35:

return 0, fmt.Errorf("backup is only supported for badger4 datastore")

Should tell users:

  • What datastore type was detected
  • How to check their configuration
  • Link to documentation

Low - Comment on Restore Safety (pkg/cmd/restore.go:15-20):
The command should have a prominent warning:

// NewRestoreCmd creates a cobra command that restores a datastore from a Badger backup file.
// 
// WARNING: This command will OVERWRITE the existing datastore. The node must be stopped
// before running restore. Running restore while the node is active will cause data corruption.
// Always backup your data before attempting a restore operation.
func NewRestoreCmd() *cobra.Command {

Code Reference Examples:

  • Backup implementation: pkg/store/backup.go:15-45
  • Restore implementation: pkg/store/restore.go:16-66
  • RPC server streaming: pkg/rpc/server/server.go:193-266
  • CLI backup command: pkg/cmd/backup.go:26-125
  • CLI restore command: pkg/cmd/restore.go:21-115

Summary

Must Fix Before Merge:

  1. Context cancellation in backup streaming (security/resource leak)
  2. Restore safety check - prevent running on active datastore
  3. Increase test coverage - especially restore.go, RPC server backup endpoint
  4. Document the magic number 16 in restore
  5. Fix directory permissions to 0o700 for sensitive data

Should Fix:

  1. Add checksum validation for backup integrity
  2. Add authentication/rate limiting to backup RPC endpoint
  3. Extract common datastore unwrapping logic
  4. Make chunk size and thread count configurable
  5. Add integration test for backup/restore round-trip

Nice to Have:

  1. Progress indication for long backups
  2. Better error messages with context
  3. Operational documentation (backup strategies, DR procedures)
  4. Shellcheck validation for bash scripts

Overall Assessment:

This is a solid foundation for backup/restore functionality with good architecture and separation of concerns. The main concerns are:

  • Low test coverage (17.86%) needs significant improvement
  • Security gaps (no auth, unsafe restore)
  • Operational safety (need better docs and safeguards)

The code quality is good, but production readiness requires addressing the critical issues above.


Copy link

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 17.66234% with 317 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.55%. Comparing base (5a0a8fa) to head (97d49c6).

Files with missing lines Patch % Lines
pkg/rpc/server/server.go 0.00% 110 Missing ⚠️
pkg/cmd/restore.go 0.00% 84 Missing ⚠️
pkg/cmd/backup.go 48.48% 37 Missing and 14 partials ⚠️
pkg/store/restore.go 0.00% 37 Missing ⚠️
pkg/rpc/client/client.go 35.29% 15 Missing and 7 partials ⚠️
pkg/store/backup.go 38.09% 11 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2758      +/-   ##
==========================================
- Coverage   61.42%   59.55%   -1.88%     
==========================================
  Files          81       85       +4     
  Lines        8622     9007     +385     
==========================================
+ Hits         5296     5364      +68     
- Misses       2828     3122     +294     
- Partials      498      521      +23     
Flag Coverage Δ
combined 59.55% <17.66%> (-1.88%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@randygrok randygrok marked this pull request as ready for review October 15, 2025 09:10
Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nits. haven't checked the core logic then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this whole folder should be in github.com/evstack/ev-reth instead

// Aggregator is used to determine if the rollback is performed on the aggregator node.
Rollback(ctx context.Context, height uint64, aggregator bool) error

// Backup writes a consistent backup stream to writer. The returned version can be used
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be added to the Rollback interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we wire testapp as well with those?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[FEATURE] Copy live dbs

2 participants