
Conversation

@tac0turtle
Contributor

Add comprehensive Prometheus metrics to track DA submission failures and retry behavior in the sequencer.

Closes #2755

Changes

  • Added da_submitter_failures_total{reason} counter
  • Added da_submitter_last_failure_timestamp{reason} gauge
  • Added da_submitter_pending_blobs gauge
  • Added da_submitter_resends_total counter

🤖 Generated with Claude Code

Add comprehensive Prometheus metrics to track DA submission failures
and retry behavior in the sequencer:

- da_submitter_failures_total{reason}: Counter for failures by reason
  (timeout, too_big, already_in_mempool, not_included_in_block,
  context_canceled, unknown)
- da_submitter_last_failure_timestamp{reason}: Timestamp of last failure
- da_submitter_pending_blobs: Current number of pending blobs
- da_submitter_resends_total: Total retry attempts

This addresses issue #2755 by providing better observability into DA
layer submission problems, allowing operators to diagnose and respond
to failures more effectively.
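A minimal sketch of how these four metrics could be declared with the go-kit Prometheus wrappers this codebase appears to use (the struct shape, constructor, and namespace argument are assumptions; per the review below, the merged code actually keys pre-bound per-reason counters in maps):

package common

import (
	"github.com/go-kit/kit/metrics"
	kitprom "github.com/go-kit/kit/metrics/prometheus"
	stdprom "github.com/prometheus/client_golang/prometheus"
)

// DASubmitterMetrics is a hypothetical container for the four metrics above.
type DASubmitterMetrics struct {
	FailuresTotal        metrics.Counter // da_submitter_failures_total{reason}
	LastFailureTimestamp metrics.Gauge   // da_submitter_last_failure_timestamp{reason}
	PendingBlobs         metrics.Gauge   // da_submitter_pending_blobs
	ResendsTotal         metrics.Counter // da_submitter_resends_total
}

func NewDASubmitterMetrics(namespace string) *DASubmitterMetrics {
	return &DASubmitterMetrics{
		FailuresTotal: kitprom.NewCounterFrom(stdprom.CounterOpts{
			Namespace: namespace,
			Name:      "da_submitter_failures_total",
			Help:      "Total DA submission failures, labeled by reason.",
		}, []string{"reason"}),
		LastFailureTimestamp: kitprom.NewGaugeFrom(stdprom.GaugeOpts{
			Namespace: namespace,
			Name:      "da_submitter_last_failure_timestamp",
			Help:      "Unix timestamp of the last failure, labeled by reason.",
		}, []string{"reason"}),
		PendingBlobs: kitprom.NewGaugeFrom(stdprom.GaugeOpts{
			Namespace: namespace,
			Name:      "da_submitter_pending_blobs",
			Help:      "Current number of blobs awaiting DA submission.",
		}, nil),
		ResendsTotal: kitprom.NewCounterFrom(stdprom.CounterOpts{
			Namespace: namespace,
			Name:      "da_submitter_resends_total",
			Help:      "Total DA submission retry attempts.",
		}, nil),
	}
}

Recording would then look like m.FailuresTotal.With("reason", "timeout").Add(1) and m.PendingBlobs.Set(float64(pending)).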

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Marko <[email protected]>
@github-actions
Contributor

github-actions bot commented Oct 13, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 14, 2025, 8:43 AM |

@claude
Contributor

claude bot commented Oct 13, 2025

Claude finished @tac0turtle's task.


Comprehensive Code Review

I've completed a thorough review of this PR, which adds Prometheus metrics for DA submission failures. Here's my detailed feedback:


✅ Strengths

  1. Well-implemented typed failure reasons - The use of DASubmitterFailureReason as a typed constant prevents typos and provides compile-time safety. The centralized AllDASubmitterFailureReasons() function ensures consistency across metrics initialization (see the sketch after this list).

  2. Clean NoOp pattern - The refactoring to automatically use NopMetrics() when nil metrics are passed eliminates defensive nil checks throughout the codebase, significantly improving readability.

  3. Comprehensive metric coverage - The metrics cover all critical failure scenarios (timeout, mempool issues, size limits, context cancellation, etc.) plus useful operational metrics like pending blobs and resends.

  4. Good metric naming and documentation - Prometheus metric names follow conventions, help text is clear, and the use of labels (reason) enables powerful querying.

  5. Proper backlog tracking - The change from tracking batch size to total backlog (via getTotalPendingFn) provides much better operational visibility.
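For concreteness, a hypothetical sketch of the typed reasons behind point 1. The commit message above lists six reasons while the bot's later summary mentions eight constants, so the exact set in the merged code may differ:

type DASubmitterFailureReason string

const (
	DASubmitterFailureReasonTimeout            DASubmitterFailureReason = "timeout"
	DASubmitterFailureReasonTooBig             DASubmitterFailureReason = "too_big"
	DASubmitterFailureReasonAlreadyInMempool   DASubmitterFailureReason = "already_in_mempool"
	DASubmitterFailureReasonNotIncludedInBlock DASubmitterFailureReason = "not_included_in_block"
	DASubmitterFailureReasonContextCanceled    DASubmitterFailureReason = "context_canceled"
	DASubmitterFailureReasonUnknown            DASubmitterFailureReason = "unknown"
)

// AllDASubmitterFailureReasons centralizes the list so metric initialization
// and recording stay consistent.
func AllDASubmitterFailureReasons() []DASubmitterFailureReason {
	return []DASubmitterFailureReason{
		DASubmitterFailureReasonTimeout,
		DASubmitterFailureReasonTooBig,
		DASubmitterFailureReasonAlreadyInMempool,
		DASubmitterFailureReasonNotIncludedInBlock,
		DASubmitterFailureReasonContextCanceled,
		DASubmitterFailureReasonUnknown,
	}
}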


🔴 Critical Issues

1. Missing test coverage for metrics recording (block/internal/submitting/da_submitter.go:164-175)

The recordFailure() function and all metric recording logic lack direct test coverage. This is evident from the Codecov report showing 75.51% patch coverage with 12 lines missing coverage.

Recommendation: Add unit tests that verify:

  • Each failure reason increments the correct counter
  • Timestamp gauges are updated appropriately
  • Unregistered failure reasons trigger warning logs
  • The NoOp metrics don't cause panics

Example test structure:

func TestDASubmitter_RecordFailure(t *testing.T) {
    metrics := common.PrometheusMetrics("test")
    submitter := NewDASubmitter(/* ... */, metrics, /* ... */)

    // Exercise one failure reason.
    submitter.recordFailure(common.DASubmitterFailureReasonTimeout)

    // Verify the per-reason counter increased and the
    // last-failure timestamp gauge was updated.
}

2. Potential race condition in concurrent metrics updates (block/internal/submitting/da_submitter.go:380-448)

The submitToDA function updates the DASubmitterPendingBlobs gauge at multiple points (lines 382, 415, 425, 447) during retry loops. If multiple goroutines call SubmitHeaders() or SubmitData() concurrently, the gauge could show inconsistent values due to a race between reading the total pending count and setting the metric.

Recommendation:

  • Add a comment documenting the expected concurrency model (single-threaded submission or concurrent-safe)
  • If concurrent submissions are expected, consider using atomic operations or mutex protection around the metric updates
  • Add a race detector test: go test -race ./block/internal/submitting/...
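If concurrent submissions turn out to be possible, a small wrapper along these lines would serialize the gauge update (the type and names here are assumptions, not the PR's code):

import (
	"sync"

	"github.com/go-kit/kit/metrics"
)

// pendingBlobsRecorder serializes the read-then-set so that concurrent
// SubmitHeaders/SubmitData calls cannot publish a stale backlog value
// over a newer one.
type pendingBlobsRecorder struct {
	mu    sync.Mutex
	gauge metrics.Gauge // e.g. the DASubmitterPendingBlobs gauge
}

func (r *pendingBlobsRecorder) record(getTotalPending func() uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.gauge.Set(float64(getTotalPending()))
}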

🟡 Moderate Issues

3. Incomplete error context in unregistered failure reason warning (block/internal/submitting/da_submitter.go:167)

The warning log for unregistered failure reasons only includes the reason string. For debugging, it would be helpful to include additional context about where this occurred.

Recommendation:

s.logger.Warn().
    Str("reason", string(reason)).
    Str("component", "da_submitter").
    Msg("unregistered failure reason, metric not recorded")

4. Magic number in concurrent marshaling (block/internal/submitting/da_submitter.go:520)

The semaphore size of 32 for concurrent marshaling appears to be arbitrary with no documentation explaining the choice.

Recommendation: Extract to a named constant with documentation:

const (
    // maxMarshalConcurrency limits the number of concurrent marshaling goroutines
    // to prevent excessive memory usage and contention
    maxMarshalConcurrency = 32
)

5. Incomplete test coverage for failure scenarios (block/internal/submitting/da_submitter_mocks_test.go)

The mock-based tests cover happy paths and basic retry scenarios but don't test:

  • Metrics being recorded on each failure type
  • The pending blobs gauge updates through various retry scenarios
  • Concurrent submission behavior

Recommendation: Add tests for:

  • Each DASubmitterFailureReason path with metric verification
  • Pending blobs gauge accuracy during partial successes
  • Resends counter incrementing correctly
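A table-driven sketch of the first item could look like this (newTestSubmitter is an assumed helper; with real PrometheusMetrics the counter values could additionally be asserted via the prometheus testutil package):

func TestRecordFailure_AllReasons(t *testing.T) {
	for _, reason := range common.AllDASubmitterFailureReasons() {
		reason := reason // capture for pre-Go-1.22 loop semantics
		t.Run(string(reason), func(t *testing.T) {
			s := newTestSubmitter(t, common.NopMetrics())
			// Must not panic and must hit a pre-registered counter.
			s.recordFailure(reason)
		})
	}
}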

🟢 Minor Issues & Suggestions

6. Documentation could be more explicit about thread safety (block/internal/common/metrics.go:44-100)

The Metrics struct contains maps that are initialized once but read by multiple goroutines. While this is safe, it's not explicitly documented.

Recommendation: Add package-level documentation:

// Metrics is safe for concurrent reads after initialization.
// Maps (DASubmitterFailures, DASubmitterLastFailure, etc.) are populated
// during initialization and never modified, making them safe for concurrent access.

7. Potential for improved metric granularity (block/internal/submitting/da_submitter.go:388-390)

The resends counter is incremented for any retry, regardless of reason. This loses information about which failure types cause the most retries.

Recommendation: Consider adding a label to track resend reasons:

// ConstLabels are fixed at registration time and cannot vary per observation,
// so "reason" has to be a variable label instead:
DASubmitterResends: prometheus.NewCounterFrom(stdprometheus.CounterOpts{
    // ...
}, append(labels, "reason")).With(labelsAndValues...),

// and at recording time:
s.metrics.DASubmitterResends.With("reason", string(reason)).Add(1)

8. Missing constant for submission timeout (block/internal/submitting/da_submitter.go:23)

The submissionTimeout constant is hardcoded to 60 seconds. For production systems, this might need tuning based on the DA layer.

Recommendation: Consider making this configurable via the config struct or documenting why 60s is appropriate for all DA layers.
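A configuration sketch under those assumptions (field and method names are hypothetical; only the zero-means-default pattern is the point):

import "time"

// SubmitterConfig exposes the submission timeout instead of hardcoding it.
type SubmitterConfig struct {
	// SubmissionTimeout bounds a single DA submission attempt.
	SubmissionTimeout time.Duration
}

func (c SubmitterConfig) submissionTimeout() time.Duration {
	if c.SubmissionTimeout <= 0 {
		return 60 * time.Second // preserve today's hardcoded default
	}
	return c.SubmissionTimeout
}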

9. Test cleanup not consistent (block/internal/submitting/da_submitter_test.go:88-108)

TestNewDASubmitterSetsVisualizerWhenEnabled includes a deferred cleanup for the visualization server, but other tests don't clean up consistently.

Recommendation: Use t.Cleanup() consistently in all tests that modify global state.
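For example (startVisualizationServer is a hypothetical stand-in for the test's setup): t.Cleanup runs in LIFO order even when an assertion calls t.Fatal, unlike statements placed after a failed assertion.

func TestVisualizerCleanup(t *testing.T) {
	srv := startVisualizationServer(t)
	t.Cleanup(func() { srv.Close() }) // restores global state however the test exits
	// ... assertions against srv ...
}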


🔧 Architecture & Design

10. Good separation of concerns

The metrics logic is cleanly separated from the business logic, with the NoOp pattern allowing the submitter to work without metrics infrastructure.

11. Appropriate use of functional parameters

The getTotalPendingFn func() uint64 parameter in submitToDA() is a good design that decouples the submission logic from the cache implementation.
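A hypothetical call-site sketch of that pattern (the cache accessor is an assumption; NumPendingHeaders is mentioned elsewhere in this PR):

// The closure hands submitToDA a live view of the total backlog without
// coupling it to the cache type.
func (s *DASubmitter) SubmitHeaders(ctx context.Context) error {
	headers := s.cache.PendingHeaders()
	return s.submitToDA(ctx, headers, func() uint64 {
		return s.cache.NumPendingHeaders()
	})
}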

12. Retry policy is well-designed

The retryPolicy and retryState structs provide a clean abstraction for complex retry behavior with gas price adjustments.


📊 Performance Considerations

13. Efficient metric updates

The Prometheus metrics are updated via simple counter/gauge operations, which are atomic and have minimal overhead. No performance concerns here.

14. Concurrent marshaling is appropriate

The semaphore-limited concurrent marshaling (line 520) is a good balance between throughput and resource usage.


🔒 Security Review

15. No security concerns identified

  • No credential logging
  • No user input directly flows into metrics (all values are internal)
  • Failure reasons are typed constants, preventing injection
  • Metrics exposure follows standard Prometheus practices

📝 Documentation & Developer Experience

16. Good inline comments

The code includes helpful comments explaining complex logic, especially around gas price handling and retry behavior.

17. Missing package-level documentation

While individual functions are well-documented, the submitting package could benefit from package-level documentation explaining the overall submission flow and how metrics fit in.

Recommendation: Add to da_submitter.go:

// Package submitting handles submission of headers and data to the DA layer
// with comprehensive retry logic, gas price management, and Prometheus metrics
// for observability.
//
// Metrics tracked:
// - da_submitter_failures_total: Counts failures by reason
// - da_submitter_last_failure_timestamp: Last failure time by reason  
// - da_submitter_pending_blobs: Current backlog size
// - da_submitter_resends_total: Retry attempts

🧪 Testing & Reliability

18. Test coverage is good but incomplete

  • Integration tests verify end-to-end behavior (da_submitter_integration_test.go)
  • Mock tests verify retry logic (da_submitter_mocks_test.go)
  • Unit tests verify basic functionality (da_submitter_test.go)

Missing: Direct tests of metric recording (see Critical Issue #1)

19. No flaky test patterns detected

Tests use appropriate timeouts and don't rely on timing for correctness (except where necessary for eventual consistency checks).


🎯 Summary & Recommendations

Overall Assessment: This is a well-implemented feature that adds valuable observability to DA submission. The code quality is high, the design is sound, and the NoOp pattern refactoring is excellent.

Priority Actions:

  1. Add metric recording tests (Critical) - Essential for ensuring metrics work correctly
  2. Investigate race conditions (Critical) - Run go test -race and add documentation
  3. Add missing test coverage (Moderate) - Get coverage above 85%
  4. Document thread safety (Minor) - Helps future maintainers

Approval Recommendation: ⚠️ Approve with requested changes - The implementation is solid, but the missing test coverage for metrics recording is concerning. I recommend adding tests for the recordFailure() function and verifying metrics updates before merging.


@codecov

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 69.81132% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.42%. Comparing base (0dc7b27) to head (a0b2a90).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| block/internal/submitting/da_submitter.go | 68.62% | 12 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   61.39%   61.42%   +0.03%     
==========================================
  Files          81       81              
  Lines        8573     8622      +49     
==========================================
+ Hits         5263     5296      +33     
- Misses       2816     2828      +12     
- Partials      494      498       +4     
| Flag | Coverage Δ |
| --- | --- |
| combined | 61.42% <69.81%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

github-actions bot and others added 2 commits October 13, 2025 15:06
Changes the DASubmitterPendingBlobs metric to track the total number
of blobs awaiting submission across the entire queue, rather than just
the current batch being submitted. This provides better visibility into
the submission backlog for monitoring and alerting.

- Added getTotalPendingFn parameter to submitToDA()
- Updated metric to call NumPendingHeaders() or NumPendingData()
- Updated metric documentation and help text
- Updated all test calls with new parameter

Co-authored-by: Marko <[email protected]>
@tac0turtle tac0turtle marked this pull request as ready for review October 13, 2025 15:13
@tac0turtle tac0turtle requested a review from auricom October 13, 2025 15:14
Member

@julienrbrt julienrbrt left a comment

lgtm, but commented about some code cleanup

Member


can we remove all those metrics nil checks in the code, and just pass NoOp metrics if metrics is nil in the constructor? it makes the code less readable otherwise

Contributor Author


@claude can you implement this and other feedback from julien

Contributor

claude bot commented Oct 14, 2025

Claude finished @tac0turtle's task.


Tasks

  • Read current implementation files
  • Create typed failure reason constants
  • Refactor metrics to use NoOp pattern in constructor
  • Remove nil checks for metrics throughout code
  • Update all call sites to use typed failure reasons
  • Update tests
  • Run linting and formatting
  • Commit and push changes

Summary

I've successfully implemented all of @julienrbrt's feedback:

1. Created Typed Failure Reason Constants

  • Added DASubmitterFailureReason type with 8 typed constants
  • Created AllDASubmitterFailureReasons() function to centralize the list
  • This prevents typos and makes failure reasons discoverable at compile time

2. Updated Metrics to Use Typed Reasons

  • Changed map types from map[string]metrics.Counter to map[DASubmitterFailureReason]metrics.Counter
  • Updated both PrometheusMetrics() and NopMetrics() to use the centralized list
  • All metric initialization now uses the typed constants

3. Refactored to Use NoOp Metrics Pattern

  • Updated NewDASubmitter() constructor to automatically use NopMetrics() if metrics is nil
  • Removed all if s.metrics == nil checks throughout the code (8 occurrences removed)
  • This makes the code more readable and eliminates defensive programming clutter
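A minimal sketch of that constructor fallback (parameter list abbreviated; the *common.Metrics pointer type is an assumption):

func NewDASubmitter( /* ..., */ m *common.Metrics /* , ... */) *DASubmitter {
	// Fall back to NoOp metrics so callers and internal code never need
	// nil checks before recording.
	if m == nil {
		m = common.NopMetrics()
	}
	return &DASubmitter{ /* ..., */ metrics: m}
}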

4. Updated recordFailure() Function

  • Changed signature to accept DASubmitterFailureReason instead of string
  • Added warning log for unregistered failure reasons (addressing the code review concern)
  • All call sites now use typed constants like common.DASubmitterFailureReasonTimeout

5. Benefits of Changes

  • ✅ Type safety: Compiler catches invalid failure reasons
  • ✅ Better maintainability: Single source of truth for failure reasons
  • ✅ Cleaner code: No nil checks scattered throughout
  • ✅ Better observability: Warning logs for unexpected failure reasons
  • ✅ No breaking changes: Tests still pass with NopMetrics()

All changes have been committed and pushed to the PR branch.


github-actions bot and others added 2 commits October 14, 2025 08:40
- Create DASubmitterFailureReason type with typed constants
- Update metrics maps to use typed failure reasons instead of strings
- Refactor DASubmitter constructor to use NopMetrics if metrics is nil
- Remove all nil checks for metrics throughout the codebase
- Add warning log for unregistered failure reasons
- Centralize failure reason definitions using AllDASubmitterFailureReasons()

This addresses review feedback to make the code more type-safe and
readable by removing defensive nil checks.

Co-authored-by: Marko <[email protected]>
@tac0turtle tac0turtle requested a review from julienrbrt October 14, 2025 08:44
Member

@julienrbrt julienrbrt left a comment


lgtm!

@julienrbrt julienrbrt enabled auto-merge October 14, 2025 09:19
@julienrbrt julienrbrt added this pull request to the merge queue Oct 14, 2025
Merged via the queue into main with commit 5a0a8fa Oct 14, 2025
47 of 48 checks passed
@julienrbrt julienrbrt deleted the claude/issue-2755-20251013-1447 branch October 14, 2025 09:30
@github-project-automation github-project-automation bot moved this to Done in Evolve Oct 14, 2025


Development

Successfully merging this pull request may close these issues.

[FEATURE] Prometheus metrics and proper logs for DA-submission failures in sequencer
