
Conversation

@tac0turtle
Contributor

Add comprehensive Prometheus metrics to track DA submission failures and retry behavior in the sequencer.

Closes #2755

Changes

  • Added da_submitter_failures_total{reason} counter
  • Added da_submitter_last_failure_timestamp{reason} gauge
  • Added da_submitter_pending_blobs gauge
  • Added da_submitter_resends_total counter

🤖 Generated with Claude Code

Add comprehensive Prometheus metrics to track DA submission failures
and retry behavior in the sequencer:

- da_submitter_failures_total{reason}: Counter for failures by reason
  (timeout, too_big, already_in_mempool, not_included_in_block,
  context_canceled, unknown)
- da_submitter_last_failure_timestamp{reason}: Timestamp of last failure
- da_submitter_pending_blobs: Current number of pending blobs
- da_submitter_resends_total: Total retry attempts

This addresses issue #2755 by providing better observability into DA
layer submission problems, allowing operators to diagnose and respond
to failures more effectively.
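A minimal sketch of how these four metrics could be declared with the go-kit Prometheus wrappers this codebase appears to use (the struct shape, constructor, and namespace argument are assumptions; per the review below, the merged code actually keys pre-bound per-reason counters in maps):

package common

import (
	"github.com/go-kit/kit/metrics"
	kitprom "github.com/go-kit/kit/metrics/prometheus"
	stdprom "github.com/prometheus/client_golang/prometheus"
)

// DASubmitterMetrics is a hypothetical container for the four metrics above.
type DASubmitterMetrics struct {
	FailuresTotal        metrics.Counter // da_submitter_failures_total{reason}
	LastFailureTimestamp metrics.Gauge   // da_submitter_last_failure_timestamp{reason}
	PendingBlobs         metrics.Gauge   // da_submitter_pending_blobs
	ResendsTotal         metrics.Counter // da_submitter_resends_total
}

func NewDASubmitterMetrics(namespace string) *DASubmitterMetrics {
	return &DASubmitterMetrics{
		FailuresTotal: kitprom.NewCounterFrom(stdprom.CounterOpts{
			Namespace: namespace,
			Name:      "da_submitter_failures_total",
			Help:      "Total DA submission failures, labeled by reason.",
		}, []string{"reason"}),
		LastFailureTimestamp: kitprom.NewGaugeFrom(stdprom.GaugeOpts{
			Namespace: namespace,
			Name:      "da_submitter_last_failure_timestamp",
			Help:      "Unix timestamp of the last failure, labeled by reason.",
		}, []string{"reason"}),
		PendingBlobs: kitprom.NewGaugeFrom(stdprom.GaugeOpts{
			Namespace: namespace,
			Name:      "da_submitter_pending_blobs",
			Help:      "Current number of blobs awaiting DA submission.",
		}, nil),
		ResendsTotal: kitprom.NewCounterFrom(stdprom.CounterOpts{
			Namespace: namespace,
			Name:      "da_submitter_resends_total",
			Help:      "Total DA submission retry attempts.",
		}, nil),
	}
}

Recording would then look like m.FailuresTotal.With("reason", "timeout").Add(1) and m.PendingBlobs.Set(float64(pending)).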

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Marko <[email protected]>
@github-actions
Contributor

github-actions bot commented Oct 13, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 14, 2025, 8:43 AM |

@claude
Contributor

claude bot commented Oct 13, 2025

Claude finished @tac0turtle's task.


Comprehensive Code Review

I've completed a thorough review of this PR, which adds Prometheus metrics for DA submission failures. Here's my detailed feedback:


✅ Strengths

  1. Well-implemented typed failure reasons - The use of DASubmitterFailureReason as a typed constant prevents typos and provides compile-time safety. The centralized AllDASubmitterFailureReasons() function ensures consistency across metrics initialization (see the sketch after this list).

  2. Clean NoOp pattern - The refactoring to automatically use NopMetrics() when nil metrics are passed eliminates defensive nil checks throughout the codebase, significantly improving readability.

  3. Comprehensive metric coverage - The metrics cover all critical failure scenarios (timeout, mempool issues, size limits, context cancellation, etc.) plus useful operational metrics like pending blobs and resends.

  4. Good metric naming and documentation - Prometheus metric names follow conventions, help text is clear, and the use of labels (reason) enables powerful querying.

  5. Proper backlog tracking - The change from tracking batch size to total backlog (via getTotalPendingFn) provides much better operational visibility.
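For concreteness, a hypothetical sketch of the typed reasons behind point 1. The commit message above lists six reasons while the bot's later summary mentions eight constants, so the exact set in the merged code may differ:

type DASubmitterFailureReason string

const (
	DASubmitterFailureReasonTimeout            DASubmitterFailureReason = "timeout"
	DASubmitterFailureReasonTooBig             DASubmitterFailureReason = "too_big"
	DASubmitterFailureReasonAlreadyInMempool   DASubmitterFailureReason = "already_in_mempool"
	DASubmitterFailureReasonNotIncludedInBlock DASubmitterFailureReason = "not_included_in_block"
	DASubmitterFailureReasonContextCanceled    DASubmitterFailureReason = "context_canceled"
	DASubmitterFailureReasonUnknown            DASubmitterFailureReason = "unknown"
)

// AllDASubmitterFailureReasons centralizes the list so metric initialization
// and recording stay consistent.
func AllDASubmitterFailureReasons() []DASubmitterFailureReason {
	return []DASubmitterFailureReason{
		DASubmitterFailureReasonTimeout,
		DASubmitterFailureReasonTooBig,
		DASubmitterFailureReasonAlreadyInMempool,
		DASubmitterFailureReasonNotIncludedInBlock,
		DASubmitterFailureReasonContextCanceled,
		DASubmitterFailureReasonUnknown,
	}
}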


🔴 Critical Issues

1. Missing test coverage for metrics recording (block/internal/submitting/da_submitter.go:164-175)

The recordFailure() function and all metric recording logic lack direct test coverage. This is evident from the Codecov report showing 75.51% patch coverage with 12 lines missing coverage.

Recommendation: Add unit tests that verify:

  • Each failure reason increments the correct counter
  • Timestamp gauges are updated appropriately
  • Unregistered failure reasons trigger warning logs
  • The NoOp metrics don't cause panics

Example test structure:

func TestDASubmitter_RecordFailure(t *testing.T) {
    metrics := common.PrometheusMetrics("test")
    submitter := NewDASubmitter(/* ... */, metrics, /* ... */)

    // Exercise one failure reason.
    submitter.recordFailure(common.DASubmitterFailureReasonTimeout)

    // Verify the per-reason counter increased and the
    // last-failure timestamp gauge was updated.
}

2. Potential race condition in concurrent metrics updates (block/internal/submitting/da_submitter.go:380-448)

The submitToDA function updates the DASubmitterPendingBlobs gauge at multiple points (lines 382, 415, 425, 447) during retry loops. If multiple goroutines call SubmitHeaders() or SubmitData() concurrently, the gauge could show inconsistent values due to a race between reading the total pending count and setting the metric.

Recommendation:

  • Add a comment documenting the expected concurrency model (single-threaded submission or concurrent-safe)
  • If concurrent submissions are expected, consider using atomic operations or mutex protection around the metric updates
  • Add a race detector test: go test -race ./block/internal/submitting/...
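If concurrent submissions turn out to be possible, a small wrapper along these lines would serialize the gauge update (the type and names here are assumptions, not the PR's code):

import (
	"sync"

	"github.com/go-kit/kit/metrics"
)

// pendingBlobsRecorder serializes the read-then-set so that concurrent
// SubmitHeaders/SubmitData calls cannot publish a stale backlog value
// over a newer one.
type pendingBlobsRecorder struct {
	mu    sync.Mutex
	gauge metrics.Gauge // e.g. the DASubmitterPendingBlobs gauge
}

func (r *pendingBlobsRecorder) record(getTotalPending func() uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.gauge.Set(float64(getTotalPending()))
}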

🟡 Moderate Issues

3. Incomplete error context in unregistered failure reason warning (block/internal/submitting/da_submitter.go:167)

The warning log for unregistered failure reasons only includes the reason string. For debugging, it would be helpful to include additional context about where this occurred.

Recommendation:

s.logger.Warn().
    Str("reason", string(reason)).
    Str("component", "da_submitter").
    Msg("unregistered failure reason, metric not recorded")

4. Magic number in concurrent marshaling (block/internal/submitting/da_submitter.go:520)

The semaphore size of 32 for concurrent marshaling appears to be arbitrary with no documentation explaining the choice.

Recommendation: Extract to a named constant with documentation:

const (
    // maxMarshalConcurrency limits the number of concurrent marshaling goroutines
    // to prevent excessive memory usage and contention
    maxMarshalConcurrency = 32
)

5. Incomplete test coverage for failure scenarios (block/internal/submitting/da_submitter_mocks_test.go)

The mock-based tests cover happy paths and basic retry scenarios but don't test:

  • Metrics being recorded on each failure type
  • The pending blobs gauge updates through various retry scenarios
  • Concurrent submission behavior

Recommendation: Add tests for:

  • Each DASubmitterFailureReason path with metric verification
  • Pending blobs gauge accuracy during partial successes
  • Resends counter incrementing correctly
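A table-driven sketch of the first item could look like this (newTestSubmitter is an assumed helper; with real PrometheusMetrics the counter values could additionally be asserted via the prometheus testutil package):

func TestRecordFailure_AllReasons(t *testing.T) {
	for _, reason := range common.AllDASubmitterFailureReasons() {
		reason := reason // capture for pre-Go-1.22 loop semantics
		t.Run(string(reason), func(t *testing.T) {
			s := newTestSubmitter(t, common.NopMetrics())
			// Must not panic and must hit a pre-registered counter.
			s.recordFailure(reason)
		})
	}
}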

🟢 Minor Issues & Suggestions

6. Documentation could be more explicit about thread safety (block/internal/common/metrics.go:44-100)

The Metrics struct contains maps that are initialized once but read by multiple goroutines. While this is safe, it's not explicitly documented.

Recommendation: Add package-level documentation:

// Metrics is safe for concurrent reads after initialization.
// Maps (DASubmitterFailures, DASubmitterLastFailure, etc.) are populated
// during initialization and never modified, making them safe for concurrent access.

7. Potential for improved metric granularity (block/internal/submitting/da_submitter.go:388-390)

The resends counter is incremented for any retry, regardless of reason. This loses information about which failure types cause the most retries.

Recommendation: Consider adding a label to track resend reasons:

// ConstLabels are fixed at registration time and cannot vary per observation,
// so "reason" has to be a variable label instead:
DASubmitterResends: prometheus.NewCounterFrom(stdprometheus.CounterOpts{
    // ...
}, append(labels, "reason")).With(labelsAndValues...),

// and at recording time:
s.metrics.DASubmitterResends.With("reason", string(reason)).Add(1)

8. Missing constant for submission timeout (block/internal/submitting/da_submitter.go:23)

The submissionTimeout constant is hardcoded to 60 seconds. For production systems, this might need tuning based on the DA layer.

Recommendation: Consider making this configurable via the config struct or documenting why 60s is appropriate for all DA layers.
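A configuration sketch under those assumptions (field and method names are hypothetical; only the zero-means-default pattern is the point):

import "time"

// SubmitterConfig exposes the submission timeout instead of hardcoding it.
type SubmitterConfig struct {
	// SubmissionTimeout bounds a single DA submission attempt.
	SubmissionTimeout time.Duration
}

func (c SubmitterConfig) submissionTimeout() time.Duration {
	if c.SubmissionTimeout <= 0 {
		return 60 * time.Second // preserve today's hardcoded default
	}
	return c.SubmissionTimeout
}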

9. Test cleanup not consistent (block/internal/submitting/da_submitter_test.go:88-108)

TestNewDASubmitterSetsVisualizerWhenEnabled includes a deferred cleanup for the visualization server, but other tests don't clean up consistently.

Recommendation: Use t.Cleanup() consistently in all tests that modify global state.
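For example (startVisualizationServer is a hypothetical stand-in for the test's setup): t.Cleanup runs in LIFO order even when an assertion calls t.Fatal, unlike statements placed after a failed assertion.

func TestVisualizerCleanup(t *testing.T) {
	srv := startVisualizationServer(t)
	t.Cleanup(func() { srv.Close() }) // restores global state however the test exits
	// ... assertions against srv ...
}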


🔧 Architecture & Design

10. Good separation of concerns

The metrics logic is cleanly separated from the business logic, with the NoOp pattern allowing the submitter to work without metrics infrastructure.

11. Appropriate use of functional parameters

The getTotalPendingFn func() uint64 parameter in submitToDA() is a good design that decouples the submission logic from the cache implementation.
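A hypothetical call-site sketch of that pattern (the cache accessor is an assumption; NumPendingHeaders is mentioned elsewhere in this PR):

// The closure hands submitToDA a live view of the total backlog without
// coupling it to the cache type.
func (s *DASubmitter) SubmitHeaders(ctx context.Context) error {
	headers := s.cache.PendingHeaders()
	return s.submitToDA(ctx, headers, func() uint64 {
		return s.cache.NumPendingHeaders()
	})
}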

12. Retry policy is well-designed

The retryPolicy and retryState structs provide a clean abstraction for complex retry behavior with gas price adjustments.


📊 Performance Considerations

13. Efficient metric updates

The Prometheus metrics are updated via simple counter/gauge operations, which are atomic and have minimal overhead. No performance concerns here.

14. Concurrent marshaling is appropriate

The semaphore-limited concurrent marshaling (line 520) is a good balance between throughput and resource usage.


🔒 Security Review

15. No security concerns identified

  • No credential logging
  • No user input directly flows into metrics (all values are internal)
  • Failure reasons are typed constants, preventing injection
  • Metrics exposure follows standard Prometheus practices

📝 Documentation & Developer Experience

16. Good inline comments

The code includes helpful comments explaining complex logic, especially around gas price handling and retry behavior.

17. Missing package-level documentation

While individual functions are well-documented, the submitting package could benefit from package-level documentation explaining the overall submission flow and how metrics fit in.

Recommendation: Add to da_submitter.go:

// Package submitting handles submission of headers and data to the DA layer
// with comprehensive retry logic, gas price management, and Prometheus metrics
// for observability.
//
// Metrics tracked:
// - da_submitter_failures_total: Counts failures by reason
// - da_submitter_last_failure_timestamp: Last failure time by reason  
// - da_submitter_pending_blobs: Current backlog size
// - da_submitter_resends_total: Retry attempts

🧪 Testing & Reliability

18. Test coverage is good but incomplete

  • Integration tests verify end-to-end behavior (da_submitter_integration_test.go)
  • Mock tests verify retry logic (da_submitter_mocks_test.go)
  • Unit tests verify basic functionality (da_submitter_test.go)

Missing: Direct tests of metric recording (see Critical Issue #1)

19. No flaky test patterns detected

Tests use appropriate timeouts and don't rely on timing for correctness (except where necessary for eventual consistency checks).


🎯 Summary & Recommendations

Overall Assessment: This is a well-implemented feature that adds valuable observability to DA submission. The code quality is high, the design is sound, and the NoOp pattern refactoring is excellent.

Priority Actions:

  1. Add metric recording tests (Critical) - Essential for ensuring metrics work correctly
  2. Investigate race conditions (Critical) - Run go test -race and add documentation
  3. Add missing test coverage (Moderate) - Get coverage above 85%
  4. Document thread safety (Minor) - Helps future maintainers

Approval Recommendation: ⚠️ Approve with requested changes - The implementation is solid, but the missing test coverage for metrics recording is concerning. I recommend adding tests for the recordFailure() function and verifying metrics updates before merging.


@codecov

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 69.81132% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.42%. Comparing base (0dc7b27) to head (a0b2a90).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| block/internal/submitting/da_submitter.go | 68.62% | 12 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   61.39%   61.42%   +0.03%     
==========================================
  Files          81       81              
  Lines        8573     8622      +49     
==========================================
+ Hits         5263     5296      +33     
- Misses       2816     2828      +12     
- Partials      494      498       +4     
| Flag | Coverage Δ |
| --- | --- |
| combined | 61.42% <69.81%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

github-actions bot and others added 2 commits October 13, 2025 15:06
Changes the DASubmitterPendingBlobs metric to track the total number
of blobs awaiting submission across the entire queue, rather than just
the current batch being submitted. This provides better visibility into
the submission backlog for monitoring and alerting.

- Added getTotalPendingFn parameter to submitToDA()
- Updated metric to call NumPendingHeaders() or NumPendingData()
- Updated metric documentation and help text
- Updated all test calls with new parameter

Co-authored-by: Marko <[email protected]>
@tac0turtle tac0turtle marked this pull request as ready for review October 13, 2025 15:13
@tac0turtle tac0turtle requested a review from auricom October 13, 2025 15:14
Member

@julienrbrt julienrbrt left a comment

lgtm, but commented about some code cleanup

Member


can we remove all those metrics nil checks in the code, and just pass NoOp metrics if metrics is nil in the constructor? it makes the code less readable otherwise

Contributor Author


@claude can you implement this and other feedback from julien

Contributor

claude bot commented Oct 14, 2025

Claude finished @tac0turtle's task.


Tasks

  • Read current implementation files
  • Create typed failure reason constants
  • Refactor metrics to use NoOp pattern in constructor
  • Remove nil checks for metrics throughout code
  • Update all call sites to use typed failure reasons
  • Update tests
  • Run linting and formatting
  • Commit and push changes

Summary

I've successfully implemented all of @julienrbrt's feedback:

1. Created Typed Failure Reason Constants

  • Added DASubmitterFailureReason type with 8 typed constants
  • Created AllDASubmitterFailureReasons() function to centralize the list
  • This prevents typos and makes failure reasons discoverable at compile time

2. Updated Metrics to Use Typed Reasons

  • Changed map types from map[string]metrics.Counter to map[DASubmitterFailureReason]metrics.Counter
  • Updated both PrometheusMetrics() and NopMetrics() to use the centralized list
  • All metric initialization now uses the typed constants

3. Refactored to Use NoOp Metrics Pattern

  • Updated NewDASubmitter() constructor to automatically use NopMetrics() if metrics is nil
  • Removed all if s.metrics == nil checks throughout the code (8 occurrences removed)
  • This makes the code more readable and eliminates defensive programming clutter
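A minimal sketch of that constructor fallback (parameter list abbreviated; the *common.Metrics pointer type is an assumption):

func NewDASubmitter( /* ..., */ m *common.Metrics /* , ... */) *DASubmitter {
	// Fall back to NoOp metrics so callers and internal code never need
	// nil checks before recording.
	if m == nil {
		m = common.NopMetrics()
	}
	return &DASubmitter{ /* ..., */ metrics: m}
}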

4. Updated recordFailure() Function

  • Changed signature to accept DASubmitterFailureReason instead of string
  • Added warning log for unregistered failure reasons (addressing the code review concern)
  • All call sites now use typed constants like common.DASubmitterFailureReasonTimeout

5. Benefits of Changes

  • ✅ Type safety: Compiler catches invalid failure reasons
  • ✅ Better maintainability: Single source of truth for failure reasons
  • ✅ Cleaner code: No nil checks scattered throughout
  • ✅ Better observability: Warning logs for unexpected failure reasons
  • ✅ No breaking changes: Tests still pass with NopMetrics()

All changes have been committed and pushed to the PR branch.


github-actions bot and others added 2 commits October 14, 2025 08:40
- Create DASubmitterFailureReason type with typed constants
- Update metrics maps to use typed failure reasons instead of strings
- Refactor DASubmitter constructor to use NopMetrics if metrics is nil
- Remove all nil checks for metrics throughout the codebase
- Add warning log for unregistered failure reasons
- Centralize failure reason definitions using AllDASubmitterFailureReasons()

This addresses review feedback to make the code more type-safe and
readable by removing defensive nil checks.

Co-authored-by: Marko <[email protected]>
@tac0turtle tac0turtle requested a review from julienrbrt October 14, 2025 08:44
Member

@julienrbrt julienrbrt left a comment


lgtm!

@julienrbrt julienrbrt enabled auto-merge October 14, 2025 09:19
@julienrbrt julienrbrt added this pull request to the merge queue Oct 14, 2025
Merged via the queue into main with commit 5a0a8fa Oct 14, 2025
47 of 48 checks passed
@julienrbrt julienrbrt deleted the claude/issue-2755-20251013-1447 branch October 14, 2025 09:30
@github-project-automation github-project-automation bot moved this to Done in Evolve Oct 14, 2025


Development

Successfully merging this pull request may close these issues.

[FEATURE] Prometheus metrics and proper logs for DA-submission failures in sequencer
