feat: add Prometheus metrics for DA submission failures #2756
Conversation
Add comprehensive Prometheus metrics to track DA submission failures
and retry behavior in the sequencer:
- da_submitter_failures_total{reason}: Counter for failures by reason
(timeout, too_big, already_in_mempool, not_included_in_block,
context_canceled, unknown)
- da_submitter_last_failure_timestamp{reason}: Timestamp of last failure
- da_submitter_pending_blobs: Current number of pending blobs
- da_submitter_resends_total: Total retry attempts
This addresses issue #2755 by providing better observability into DA
layer submission problems, allowing operators to diagnose and respond
to failures more effectively.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Marko <[email protected]>
Claude finished @tac0turtle's task

## Comprehensive Code Review

I've completed a thorough review of this PR, which adds Prometheus metrics for DA submission failures. Here's my detailed feedback:

### ✅ Strengths
### 🔴 Critical Issues

**1. Missing test coverage for metrics recording** (block/internal/submitting/da_submitter.go:164-175)

The metric-recording path has no direct unit tests.

Recommendation: Add unit tests that verify each failure reason increments its counter and updates the last-failure timestamp gauge.

Example test structure:

```go
func TestDASubmitter_RecordFailure(t *testing.T) {
	metrics := common.PrometheusMetrics("test")
	submitter := NewDASubmitter(/* ... */, metrics, /* ... */)

	// Test each failure reason
	submitter.recordFailure(common.DASubmitterFailureReasonTimeout)
	// Verify counter increased
	// Verify timestamp gauge updated
}
```

**2. Potential race condition in concurrent metrics updates** (block/internal/submitting/da_submitter.go:380-448)

The submission path can update metrics from multiple goroutines concurrently.

Recommendation: Verify the metric updates are race-free, e.g. by running the test suite with the race detector.
### 🟡 Moderate Issues

**3. Incomplete error context in unregistered failure reason warning** (block/internal/submitting/da_submitter.go:167)

The warning log for unregistered failure reasons only includes the reason string. For debugging, it would be helpful to include additional context about where this occurred.

Recommendation:

```go
s.logger.Warn().
	Str("reason", string(reason)).
	Str("component", "da_submitter").
	Msg("unregistered failure reason, metric not recorded")
```

**4. Magic number in concurrent marshaling** (block/internal/submitting/da_submitter.go:520)

The semaphore size of 32 for concurrent marshaling appears to be arbitrary, with no documentation explaining the choice.

Recommendation: Extract it to a named constant with documentation:

```go
const (
	// maxMarshalConcurrency limits the number of concurrent marshaling goroutines
	// to prevent excessive memory usage and contention
	maxMarshalConcurrency = 32
)
```

**5. Incomplete test coverage for failure scenarios** (block/internal/submitting/da_submitter_mocks_test.go)

The mock-based tests cover happy paths and basic retry scenarios but leave several failure scenarios untested.

Recommendation: Add tests for the remaining failure paths.
### 🟢 Minor Issues & Suggestions

**6. Documentation could be more explicit about thread safety** (block/internal/common/metrics.go:44-100)

The concurrency guarantees of the metrics maps are not documented.

Recommendation: Add package-level documentation:

```go
// Metrics is safe for concurrent reads after initialization.
// Maps (DASubmitterFailures, DASubmitterLastFailure, etc.) are populated
// during initialization and never modified, making them safe for concurrent access.
```

**7. Potential for improved metric granularity** (block/internal/submitting/da_submitter.go:388-390)

The resends counter is incremented for any retry, regardless of reason. This loses information about which failure types cause the most retries.

Recommendation: Consider adding a label to track resend reasons:

```go
DASubmitterResends: prometheus.NewCounterFrom(stdprometheus.CounterOpts{
	// ...
	ConstLabels: map[string]string{
		"reason": string(reason), // Add this during recording
	},
}, labels).With(labelsAndValues...)
```

**8. Missing constant for submission timeout** (block/internal/submitting/da_submitter.go:23)

The submission timeout is hard-coded.

Recommendation: Consider making this configurable via the config struct, or documenting why 60s is appropriate for all DA layers.

**9. Test cleanup not consistent** (block/internal/submitting/da_submitter_test.go:88-108)

Recommendation: Use `t.Cleanup` consistently for teardown.

### 🔧 Architecture & Design

**10. Good separation of concerns**
The metrics logic is cleanly separated from the business logic, with the NoOp pattern allowing the submitter to work without metrics infrastructure.

**11. Appropriate use of functional parameters**

**12. Retry policy is well-designed**

### 📊 Performance Considerations

**13. Efficient metric updates**
The Prometheus metrics are updated via simple counter/gauge operations, which are atomic and have minimal overhead. No performance concerns here.

**14. Concurrent marshaling is appropriate**
The semaphore-limited concurrent marshaling (line 520) is a good balance between throughput and resource usage.

### 🔒 Security Review

**15. No security concerns identified**
### 📝 Documentation & Developer Experience

**16. Good inline comments**
The code includes helpful comments explaining complex logic, especially around gas price handling and retry behavior.

**17. Missing package-level documentation**
While individual functions are well-documented, the package itself lacks a doc comment.

Recommendation: Add:

```go
// Package submitting handles submission of headers and data to the DA layer
// with comprehensive retry logic, gas price management, and Prometheus metrics
// for observability.
//
// Metrics tracked:
//   - da_submitter_failures_total: Counts failures by reason
//   - da_submitter_last_failure_timestamp: Last failure time by reason
//   - da_submitter_pending_blobs: Current backlog size
//   - da_submitter_resends_total: Retry attempts
```

### ✅ Testing & Reliability

**18. Test coverage is good but incomplete**
Missing: direct tests of metric recording (see Critical Issue #1).

**19. No flaky test patterns detected**
Tests use appropriate timeouts and don't rely on timing for correctness (except where necessary for eventual consistency checks).

### 🎯 Summary & Recommendations

Overall Assessment: This is a well-implemented feature that adds valuable observability to DA submission. The code quality is high, the design is sound, and the NoOp pattern refactoring is excellent.

Priority Actions: address the critical issues above, starting with direct tests for metric recording.

Approval Recommendation:
## Codecov Report

❌ Patch coverage report.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2756      +/-  ##
==========================================
+ Coverage   61.39%   61.42%   +0.03%
==========================================
  Files          81       81
  Lines        8573     8622      +49
==========================================
+ Hits         5263     5296      +33
- Misses       2816     2828      +12
- Partials      494      498       +4
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Changes the DASubmitterPendingBlobs metric to track the total number of blobs awaiting submission across the entire queue, rather than just the current batch being submitted. This provides better visibility into the submission backlog for monitoring and alerting.

- Added getTotalPendingFn parameter to submitToDA()
- Updated metric to call NumPendingHeaders() or NumPendingData()
- Updated metric documentation and help text
- Updated all test calls with new parameter

Co-authored-by: Marko <[email protected]>
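The `getTotalPendingFn` change described above amounts to sampling queue depth through a callback at submission time instead of counting the current batch. A hedged sketch of the idea; the `atomic.Int64` here is a stand-in for the Prometheus gauge, and the function signatures are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pendingGauge stands in for the da_submitter_pending_blobs gauge.
var pendingGauge atomic.Int64

// submitToDA records the total submission backlog via the callback,
// mirroring the getTotalPendingFn parameter described in the PR.
func submitToDA(batch []string, getTotalPendingFn func() uint64) {
	// The gauge reflects the whole queue, not just len(batch).
	pendingGauge.Store(int64(getTotalPendingFn()))
	// ... actual DA submission of batch would happen here ...
}

func main() {
	queue := []string{"h1", "h2", "h3", "h4", "h5"}
	batch := queue[:2] // only two blobs go out this round

	submitToDA(batch, func() uint64 { return uint64(len(queue)) })

	fmt.Println(pendingGauge.Load()) // 5: the full backlog, not 2
}
```

Passing a callback rather than a snapshot value keeps the gauge current even when the queue grows between the call site and the actual submission.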
julienrbrt left a comment:
lgtm, but commented about some code cleanup
can we remove all those metrics nil check in the code, and just pass a NoOp metrics if metrics is nil in the constructor? makes it less readable otherwise
@claude can you implement this and other feedback from julien
Claude finished @tac0turtle's task
Tasks
- Read current implementation files
- Create typed failure reason constants
- Refactor metrics to use NoOp pattern in constructor
- Remove nil checks for metrics throughout code
- Update all call sites to use typed failure reasons
- Update tests
- Run linting and formatting
- Commit and push changes
Summary
I've successfully implemented all of @julienrbrt's feedback:
**1. Created Typed Failure Reason Constants**
- Added `DASubmitterFailureReason` type with 8 typed constants
- Created `AllDASubmitterFailureReasons()` function to centralize the list
- This prevents typos and makes failure reasons discoverable at compile time

**2. Updated Metrics to Use Typed Reasons**
- Changed map types from `map[string]metrics.Counter` to `map[DASubmitterFailureReason]metrics.Counter`
- Updated both `PrometheusMetrics()` and `NopMetrics()` to use the centralized list
- All metric initialization now uses the typed constants

**3. Refactored to Use NoOp Metrics Pattern**
- Updated the `NewDASubmitter()` constructor to automatically use `NopMetrics()` if `metrics` is nil
- Removed all `if s.metrics == nil` checks throughout the code (8 occurrences removed)
- This makes the code more readable and eliminates defensive programming clutter

**4. Updated recordFailure() Function**
- Changed the signature to accept `DASubmitterFailureReason` instead of `string`
- Added a warning log for unregistered failure reasons (addressing the code review concern)
- All call sites now use typed constants like `common.DASubmitterFailureReasonTimeout`

**5. Benefits of Changes**
- ✅ Type safety: the compiler catches invalid failure reasons
- ✅ Better maintainability: a single source of truth for failure reasons
- ✅ Cleaner code: no nil checks scattered throughout
- ✅ Better observability: warning logs for unexpected failure reasons
- ✅ No breaking changes: tests still pass with `NopMetrics()`
All changes have been committed and pushed to the PR branch.
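The typed-constant and NoOp refactor summarized above can be sketched as follows. The type and constructor names echo the summary, but the `Metrics` struct here is a simplified in-memory stand-in for the real go-kit-based types, and only two of the eight reasons are shown:

```go
package main

import "fmt"

// DASubmitterFailureReason is a typed label value; using a distinct type
// lets the compiler reject raw strings at call sites.
type DASubmitterFailureReason string

const (
	DASubmitterFailureReasonTimeout DASubmitterFailureReason = "timeout"
	DASubmitterFailureReasonTooBig  DASubmitterFailureReason = "too_big"
)

// Metrics is a simplified stand-in for the real metrics struct; counters
// are plain ints here instead of Prometheus counters.
type Metrics struct {
	failures map[DASubmitterFailureReason]int
}

// NopMetrics returns a sink that safely absorbs all recordings, so
// callers never need nil checks.
func NopMetrics() *Metrics {
	return &Metrics{failures: map[DASubmitterFailureReason]int{
		DASubmitterFailureReasonTimeout: 0,
		DASubmitterFailureReasonTooBig:  0,
	}}
}

type DASubmitter struct{ metrics *Metrics }

// NewDASubmitter falls back to NopMetrics when nil is passed, which is
// the pattern that let the PR delete the scattered nil checks.
func NewDASubmitter(m *Metrics) *DASubmitter {
	if m == nil {
		m = NopMetrics()
	}
	return &DASubmitter{metrics: m}
}

func (s *DASubmitter) recordFailure(r DASubmitterFailureReason) {
	s.metrics.failures[r]++ // safe even when constructed with nil metrics
}

func main() {
	s := NewDASubmitter(nil) // no metrics infrastructure configured
	s.recordFailure(DASubmitterFailureReasonTimeout)
	fmt.Println(s.metrics.failures[DASubmitterFailureReasonTimeout]) // 1
}
```

The payoff is at the call sites: `s.recordFailure(...)` is a single unconditional line instead of a guarded block, and a misspelled reason fails to compile.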
- Create DASubmitterFailureReason type with typed constants
- Update metrics maps to use typed failure reasons instead of strings
- Refactor DASubmitter constructor to use NopMetrics if metrics is nil
- Remove all nil checks for metrics throughout the codebase
- Add warning log for unregistered failure reasons
- Centralize failure reason definitions using AllDASubmitterFailureReasons()

This addresses review feedback to make the code more type-safe and readable by removing defensive nil checks.

Co-authored-by: Marko <[email protected]>
julienrbrt left a comment:
lgtm!
Add comprehensive Prometheus metrics to track DA submission failures and retry behavior in the sequencer.
Closes #2755
Changes
- `da_submitter_failures_total{reason}` counter
- `da_submitter_last_failure_timestamp{reason}` gauge
- `da_submitter_pending_blobs` gauge
- `da_submitter_resends_total` counter

🤖 Generated with Claude Code