57 changes: 34 additions & 23 deletions espresso/docs/metrics.md
@@ -1,5 +1,10 @@
# Metrics

+> **Log constants**: Log message strings referenced in this document are defined as Go constants in
+> [`espresso/logmodule/dashboard_keys.go`](../logmodule/dashboard_keys.go) (DataDog dashboard logs) and
+> [`espresso/logmodule/log_keys.go`](../logmodule/log_keys.go) (debugging logs). If you change a dashboard
+> log string, you must update the corresponding DataDog queries and alerts at the same time.
+
This document outlines the monitoring framework for our system components, organized into the following categories:

- **Key Metrics**: Metrics that belong on the dashboard for operational visibility
@@ -16,26 +21,32 @@ Each indicator points to a log event to monitor.
Metrics that belong on the dashboard:

 - Blocks enqueued for batching to L1/AltDA:
-`"Added L2 block to channel manager"`
+`logmodule.AddedL2BlockToChannelManager`
 - Espresso batch submissions
-`"Submitted transaction to Espresso"`
+`logmodule.SubmittedTransactionToEspresso`
+- Espresso transaction confirmations
+`logmodule.TransactionConfirmedOnEspresso`
+- Blocks received from Espresso
+`logmodule.ReceivedBlockFromEspresso`
+- Channel sealed for L1 submission
+`logmodule.ChannelClosed`
 - L1 batch submissions
-`"Transaction confirmed"`
+`logmodule.TransactionSuccessfullyPublished`

### Recoverable Errors

Events that we need to monitor and raise alerts if they're encountered often:

 - State reset (even once is suspicious)
-`"Clearing state"`
+`logmodule.ClearingState`
 - Espresso transaction creation failed
-`"Failed to derive batch from block"`
+`logmodule.FailedToDeriveBatchFromBlock`
 - L1 submission failed
-`"Transaction failed to send"`
+`logmodule.TransactionFailedToSend`
 - AltDA submission failed
-`"DA request failed"`
+`logmodule.DARequestFailed`
 - L2 reorg detected
-`"Found L2 reorg"`
+`logmodule.FoundL2Reorg`
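
The rule stated for this list (alert when a recoverable error occurs often, and treat even one state reset as suspicious) could be sketched as below. This is a minimal sketch, not the project's alerting setup. The message strings are copied from `log_keys.go`, but `alertThresholds`, `firingAlerts`, and the threshold value 5 are hypothetical. Only the threshold of 1 for `ClearingState` follows from the note in the list.

```go
package main

import (
	"fmt"
	"sort"
)

// Message strings copied from espresso/logmodule/log_keys.go.
const (
	ClearingState           = "Clearing state"
	TransactionFailedToSend = "Transaction failed to send"
	DARequestFailed         = "DA request failed"
)

// alertThresholds maps a log message to the number of occurrences per
// evaluation window at which an alert fires. The value 1 for ClearingState
// reflects "even once is suspicious"; the value 5 is a placeholder assumption.
var alertThresholds = map[string]int{
	ClearingState:           1,
	TransactionFailedToSend: 5,
	DARequestFailed:         5,
}

// firingAlerts returns, in sorted order, every message whose count in the
// current window has reached its threshold.
func firingAlerts(counts map[string]int) []string {
	var firing []string
	for msg, threshold := range alertThresholds {
		if counts[msg] >= threshold {
			firing = append(firing, msg)
		}
	}
	sort.Strings(firing)
	return firing
}

func main() {
	window := map[string]int{ClearingState: 1, TransactionFailedToSend: 2}
	fmt.Println(firingAlerts(window))
}
```

In practice the counting would happen in the monitoring backend; the sketch only shows how per-message thresholds separate "alert on first occurrence" from "alert when frequent".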

### Critical Errors

@@ -48,7 +59,7 @@ Events that we need to monitor and raise alerts if they're encountered often:
Non-errors that can indicate preconditions for a problem to occur:

- Gas price too high
-`effectiveGasPrice` field of `"Transaction confirmed"` log
+`effectiveGasPrice` field of `logmodule.TransactionSuccessfullyPublished` log
- Espresso transaction backlog is growing
can be derived from Espresso transaction queue metrics above
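
One way to derive the backlog indicator is to subtract confirmations from submissions over a window of batcher log messages. The sketch below assumes that "Espresso transaction queue metrics" refers to the submission and confirmation events; the function name `backlog` and the sample windows are hypothetical. Because the batcher may retry a submission, the difference is an upper bound rather than an exact queue length.

```go
package main

import "fmt"

// Message strings copied from espresso/logmodule/dashboard_keys.go.
const (
	SubmittedTransactionToEspresso = "Submitted transaction to Espresso"
	TransactionConfirmedOnEspresso = "Transaction confirmed on Espresso"
)

// backlog returns submissions minus confirmations for a window of log
// messages. Messages other than the two tracked events are ignored.
func backlog(messages []string) int {
	pending := 0
	for _, msg := range messages {
		switch msg {
		case SubmittedTransactionToEspresso:
			pending++
		case TransactionConfirmedOnEspresso:
			pending--
		}
	}
	return pending
}

func main() {
	earlier := []string{SubmittedTransactionToEspresso, TransactionConfirmedOnEspresso}
	later := []string{
		SubmittedTransactionToEspresso,
		SubmittedTransactionToEspresso,
		TransactionConfirmedOnEspresso,
		SubmittedTransactionToEspresso,
	}
	// A backlog that increases from one window to the next is the warning sign.
	fmt.Println(backlog(earlier), backlog(later), backlog(later) > backlog(earlier))
}
```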

@@ -57,18 +68,18 @@ Non-errors that can indicate preconditions for a problem to occur:
### Key Metrics

 - New L1 safe blocks
-`"New L1 safe block"`
+`logmodule.NewL1SafeBlock`
 - New L2 unsafe blocks
-`"Inserted new L2 unsafe block"`
+`logmodule.InsertedNewL2UnsafeBlock`
 - New L2 safe blocks
-`"safe head updated"`
+`logmodule.CrossSafeHeadUpdated`

### Recoverable Errors

 - Pipeline errors
-`"Derivation process error"`
+`logmodule.DerivationProcessError`
 - Malformed batch
-`"Dropping batch"`, `"Failed to parse frames"`
+`logmodule.DroppingBatch`, `logmodule.FailedToParseFrames`

### Critical Errors

@@ -83,19 +94,19 @@ Events that need to raise urgent alerts as they indicate full chain stall:
### Key Metrics

 - New L1 safe blocks
-`"New L1 safe block"`
+`logmodule.NewL1SafeBlock`
 - New L2 unsafe blocks
-`"Inserted new L2 unsafe block"`
+`logmodule.InsertedNewL2UnsafeBlock`
 - New L2 safe blocks
-Either `"safe head updated"` or `"Hit finalized L2 head, returning immediately"` with increasing
-L2 safe number. The former is the normal case, and the latter happens after a reset.
+Either `logmodule.CrossSafeHeadUpdated` or `logmodule.HitFinalizedL2Head` with increasing L2 safe
+number. The former is the normal case, and the latter happens after a reset.
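
The "either event, with increasing L2 safe number" rule can be illustrated by tracking the highest safe number seen across both messages. This is a sketch under assumptions: the `safeHeadEvent` type, its `SafeNumber` field, and `countSafeAdvances` are hypothetical, and the real log lines may expose the safe number under a different field name.

```go
package main

import "fmt"

// Message strings copied from the logmodule constants.
const (
	CrossSafeHeadUpdated = "Cross safe head updated"
	HitFinalizedL2Head   = "Hit finalized L2 head, returning immediately"
)

// safeHeadEvent is a hypothetical parsed log line carrying the L2 safe number.
type safeHeadEvent struct {
	Msg        string
	SafeNumber uint64
}

// countSafeAdvances counts how often the L2 safe number increases across
// either of the two events. The first matching event only sets the baseline.
func countSafeAdvances(events []safeHeadEvent) int {
	var highest uint64
	haveBaseline := false
	advances := 0
	for _, ev := range events {
		if ev.Msg != CrossSafeHeadUpdated && ev.Msg != HitFinalizedL2Head {
			continue
		}
		if !haveBaseline {
			highest, haveBaseline = ev.SafeNumber, true
			continue
		}
		if ev.SafeNumber > highest {
			highest = ev.SafeNumber
			advances++
		}
	}
	return advances
}

func main() {
	events := []safeHeadEvent{
		{CrossSafeHeadUpdated, 10},
		{CrossSafeHeadUpdated, 11},
		{HitFinalizedL2Head, 11}, // after a reset: same number, no progress
		{HitFinalizedL2Head, 12}, // after a reset: number increased, progress
	}
	fmt.Println(countSafeAdvances(events))
}
```

Counting only strict increases keeps a repeated `HitFinalizedL2Head` line at the same height from being mistaken for progress.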

### Recoverable Errors

 - Pipeline errors
-`"Derivation process error"`
+`logmodule.DerivationProcessError`
 - Malformed batch
-`"Dropping batch"`, `"Failed to parse frames"`
+`logmodule.DroppingBatch`, `logmodule.FailedToParseFrames`

### Critical Errors

@@ -112,11 +123,11 @@ All events of Decaff Validator Node, and:
### Key Metrics

 - Blocks produced
-`"Sequencer sealed block"`
+`logmodule.SequencerSealedBlock`

### Recoverable Errors

 - Engine failure
-`"Engine failed temporarily, backing off sequencer"`
+`logmodule.EngineFailedTemporarily`
 - Engine reset
-`"Engine reset confirmed, sequencer may continue"`
+`logmodule.EngineResetConfirmed`
37 changes: 37 additions & 0 deletions espresso/logmodule/dashboard_keys.go
@@ -0,0 +1,37 @@
// Package logmodule defines log message string constants for the Espresso integration.
//
// dashboard_keys.go contains constants for events monitored by the DataDog dashboard.
// These must be kept in sync with any dashboard queries, alerts, or deployment investigation
// runbooks that reference them by name. If you change any of these strings, update the
// DataDog dashboard queries and alerts at the same time.
package logmodule

const (
// SequencerSealedBlock is emitted by the sequencer each time it seals a new L2 block.
// Monitored as the primary "blocks produced" metric.
SequencerSealedBlock = "Sequencer sealed block"

// CrossSafeHeadUpdated is emitted by the op-node status tracker each time the cross-safe L2
// head advances. Monitored as "new L2 safe blocks" for both the Caff and non-Caff validator nodes.
CrossSafeHeadUpdated = "Cross safe head updated"

// TransactionConfirmedOnEspresso is emitted by the batcher after it verifies that a transaction
// was included in HotShot consensus.
TransactionConfirmedOnEspresso = "Transaction confirmed on Espresso"

// TransactionSuccessfullyPublished is emitted by the tx manager after a transaction is accepted
// by the L1 RPC. Monitored as "L1 batch submissions".
TransactionSuccessfullyPublished = "Transaction successfully published"

// SubmittedTransactionToEspresso is emitted by the batcher each time it sends a transaction to
// the Espresso sequencer. Monitored as "Espresso batch submissions".
SubmittedTransactionToEspresso = "Submitted transaction to Espresso"

// ChannelClosed is emitted by the batcher channel manager when a channel is closed and ready
// for frame submission.
ChannelClosed = "Channel closed"

// ReceivedBlockFromEspresso is emitted by the batcher each time it reads a confirmed L2 block
// back from the Espresso query service.
ReceivedBlockFromEspresso = "Received block from Espresso"
)
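
Since these strings must stay in sync with the dashboard, one possible guard is a check that compares the current constants against a pinned copy and reports any that drifted. The sketch below is illustrative only: `pinned`, `currentKeys`, and `changedKeys` are hypothetical names, only three of the constants are shown, and in the repository this would more naturally live in a `_test.go` file.

```go
package main

import (
	"fmt"
	"sort"
)

// A subset of the constants from dashboard_keys.go, copied here so the
// example is self-contained.
const (
	SequencerSealedBlock = "Sequencer sealed block"
	CrossSafeHeadUpdated = "Cross safe head updated"
	ChannelClosed        = "Channel closed"
)

// pinned records the strings the dashboard queries are assumed to match on.
var pinned = map[string]string{
	"SequencerSealedBlock": "Sequencer sealed block",
	"CrossSafeHeadUpdated": "Cross safe head updated",
	"ChannelClosed":        "Channel closed",
}

// currentKeys returns the constants as they are defined right now.
func currentKeys() map[string]string {
	return map[string]string{
		"SequencerSealedBlock": SequencerSealedBlock,
		"CrossSafeHeadUpdated": CrossSafeHeadUpdated,
		"ChannelClosed":        ChannelClosed,
	}
}

// changedKeys lists, in sorted order, the constant names whose current value
// differs from the pinned value.
func changedKeys(current, want map[string]string) []string {
	var changed []string
	for name, value := range want {
		if current[name] != value {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	fmt.Println(changedKeys(currentKeys(), pinned))
}
```

A non-empty result would be the prompt to update both the pinned copy and the DataDog queries in the same change.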
56 changes: 56 additions & 0 deletions espresso/logmodule/log_keys.go
@@ -0,0 +1,56 @@
package logmodule

// Non-dashboard log constants for events referenced in metrics.md that are useful for
// debugging but are not directly monitored by the DataDog dashboard.

const (
// Batcher

// AddedL2BlockToChannelManager is emitted each time a new L2 block is enqueued for batching.
AddedL2BlockToChannelManager = "Added L2 block to channel manager"

// ClearingState is emitted on a batcher state reset. Even a single occurrence is suspicious.
ClearingState = "Clearing state"

// FailedToDeriveBatchFromBlock is emitted when the batcher cannot construct an Espresso transaction from an L2 block.
FailedToDeriveBatchFromBlock = "Failed to derive batch from block"

// TransactionFailedToSend is emitted when an L1 submission attempt fails.
TransactionFailedToSend = "Transaction failed to send"

// DARequestFailed is emitted when an AltDA submission fails.
DARequestFailed = "DA request failed"

// FoundL2Reorg is emitted when the batcher detects an L2 reorg.
FoundL2Reorg = "Found L2 reorg"

// Node (Caff, non-Caff, Sequencer)

// NewL1SafeBlock is emitted each time a new L1 safe block is observed.
NewL1SafeBlock = "New L1 safe block"

// InsertedNewL2UnsafeBlock is emitted each time a new L2 unsafe block is inserted.
InsertedNewL2UnsafeBlock = "Inserted new L2 unsafe block"

// HitFinalizedL2Head is emitted during a sync reset when the node reaches the finalized L2 head.
// An increasing L2 safe number here serves as an alternative indicator for "new L2 safe blocks"
// after a pipeline reset (non-Caff validator node only).
HitFinalizedL2Head = "Hit finalized L2 head, returning immediately"

// DerivationProcessError is emitted on a recoverable derivation pipeline error.
DerivationProcessError = "Derivation process error"

// DroppingBatch is emitted when a malformed or invalid batch is discarded.
DroppingBatch = "Dropping batch"

// FailedToParseFrames is emitted when frame parsing fails for a batch.
FailedToParseFrames = "Failed to parse frames"

// Sequencer

// EngineFailedTemporarily is emitted when the execution engine fails and the sequencer backs off.
EngineFailedTemporarily = "Engine failed temporarily, backing off sequencer"

// EngineResetConfirmed is emitted after a successful engine reset, allowing the sequencer to resume.
EngineResetConfirmed = "Engine reset confirmed, sequencer may continue"
)
3 changes: 2 additions & 1 deletion op-batcher/batcher/channel_manager.go
@@ -7,6 +7,7 @@ import (
"math"

"github.com/ethereum-optimism/optimism/espresso/logmodule"
altda "github.com/ethereum-optimism/optimism/op-alt-da"
"github.com/ethereum-optimism/optimism/op-batcher/metrics"
"github.com/ethereum-optimism/optimism/op-node/rollup"
"github.com/ethereum-optimism/optimism/op-node/rollup/derive"
@@ -524,7 +525,7 @@ func (s *channelManager) outputFrames() error {
comprRatio = float64(outBytes) / float64(inBytes)
}

-s.log.Info("Channel closed",
+s.log.Info(logmodule.ChannelClosed,
"id", s.currentChannel.ID(),
"blocks_pending", s.pendingBlocks(),
"block_cursor", s.blockCursor,
9 changes: 5 additions & 4 deletions op-batcher/batcher/driver.go
@@ -26,6 +26,7 @@ import (
espressoClient "github.com/EspressoSystems/espresso-network/sdks/go/client"
espressoLightClient "github.com/EspressoSystems/espresso-network/sdks/go/light-client"
"github.com/ethereum-optimism/optimism/espresso"
"github.com/ethereum-optimism/optimism/espresso/logmodule"
altda "github.com/ethereum-optimism/optimism/op-alt-da"
"github.com/ethereum-optimism/optimism/op-batcher/batcher/throttler"
config "github.com/ethereum-optimism/optimism/op-batcher/config"
@@ -405,7 +406,7 @@ func (l *BatchSubmitter) loadBlocksIntoState(ctx context.Context, start, end uin
for i := start; i <= end; i++ {
block, err := l.loadBlockIntoState(ctx, i)
if errors.Is(err, ErrReorg) {
-l.Log.Warn("Found L2 reorg", "block_number", i)
+l.Log.Warn(logmodule.FoundL2Reorg, "block_number", i)
return err
} else if err != nil {
l.Log.Warn("Failed to load block into state", "err", err)
@@ -948,7 +949,7 @@ func (l *BatchSubmitter) publishStateToL1(ctx context.Context, queue *txmgr.Queu

// clearState clears the state of the channel manager
func (l *BatchSubmitter) clearState(ctx context.Context) {
-l.Log.Info("Clearing state")
+l.Log.Info(logmodule.ClearingState)
defer l.Log.Info("State cleared")

clearStateWithL1Origin := func() bool {
@@ -1218,15 +1219,15 @@ func (l *BatchSubmitter) recordFailedDARequest(id txID, err error) {
defer l.channelMgrMutex.Unlock()
failover := errors.Is(err, altda.ErrAltDADown)
if err != nil {
-l.Log.Warn("DA request failed", append([]interface{}{"failoverToEthDA", failover}, logFields(id, err)...)...)
+l.Log.Warn(logmodule.DARequestFailed, append([]interface{}{"failoverToEthDA", failover}, logFields(id, err)...)...)
}
l.channelMgr.AltDASubmissionFailed(id, failover)
}

func (l *BatchSubmitter) recordFailedTx(id txID, err error) {
l.channelMgrMutex.Lock()
defer l.channelMgrMutex.Unlock()
-l.Log.Warn("Transaction failed to send", logFields(id, err)...)
+l.Log.Warn(logmodule.TransactionFailedToSend, logFields(id, err)...)
l.channelMgr.TxFailed(id)
}

13 changes: 7 additions & 6 deletions op-batcher/batcher/espresso.go
@@ -27,6 +27,7 @@ import (
"github.com/ethereum/go-ethereum/signer/core/apitypes"

"github.com/ethereum-optimism/optimism/espresso/bindings"
"github.com/ethereum-optimism/optimism/espresso/logmodule"
"github.com/ethereum-optimism/optimism/op-node/rollup/derive"
"github.com/ethereum-optimism/optimism/op-service/eth"
"github.com/ethereum-optimism/optimism/op-service/txmgr"
@@ -426,7 +427,7 @@ func (s *espressoTransactionSubmitter) handleVerifyReceiptJobResponse() {
// confirmed that the transaction was submitted to Espresso
commitment := jobResp.job.transaction.transaction.Commit()
hash, _ := tagged_base64.New("TX", commitment[:])
-log.Info("Transaction confirmed on Espresso", "hash", hash.String())
+log.Info(logmodule.TransactionConfirmedOnEspresso, "hash", hash.String())
}
}

@@ -568,7 +569,7 @@ func espressoSubmitTransactionWorker(
// Submit the transaction to Espresso
hash, err := cli.SubmitTransaction(ctx, *jobAttempt.job.transaction)
if err == nil {
-log.Info("submitted transaction to Espresso", "hash", hash)
+log.Info(logmodule.SubmittedTransactionToEspresso, "hash", hash)
}

jobAttempt.job.attempts++
@@ -716,7 +717,7 @@ func (s *espressoTransactionSubmitter) Start() {
func (l *BatchSubmitter) queueBlockToEspresso(ctx context.Context, block *types.Block) error {
espressoBatch, err := derive.BlockToEspressoBatch(l.RollupConfig, block)
if err != nil {
-l.Log.Warn("Failed to derive batch from block", "err", err)
+l.Log.Warn(logmodule.FailedToDeriveBatchFromBlock, "err", err)
return fmt.Errorf("failed to derive batch from block: %w", err)
}

@@ -799,7 +800,7 @@ func (l *BatchSubmitter) espressoBatchLoadingLoop(ctx context.Context, wg *sync.
}

l.Log.Info(
-"Received block from Espresso",
+logmodule.ReceivedBlockFromEspresso,
"blockNr", block.NumberU64(),
"blockHash", block.Hash(),
"parentHash", block.ParentHash(),
@@ -815,7 +816,7 @@ func (l *BatchSubmitter) espressoBatchLoadingLoop(ctx context.Context, wg *sync.
l.EspressoStreamer().Reset()
}

-l.Log.Info("Added L2 block to channel manager")
+l.Log.Info(logmodule.AddedL2BlockToChannelManager)
}

l.tryPublishSignal(publishSignal, pubInfo{})
@@ -859,7 +860,7 @@ func (l *BlockLoader) EnqueueBlocks(ctx context.Context, blocksToQueue inclusive
}

if len(l.queuedBlocks) > 0 && block.ParentHash() != l.queuedBlocks[len(l.queuedBlocks)-1].Hash {
-l.batcher.Log.Warn("Found L2 reorg", "block_number", i)
+l.batcher.Log.Warn(logmodule.FoundL2Reorg, "block_number", i)
l.reset(ctx)
break
}
3 changes: 2 additions & 1 deletion op-node/rollup/derive/attributes_queue.go
@@ -9,6 +9,7 @@ import (

op "github.com/EspressoSystems/espresso-streamers/op"
"github.com/ethereum-optimism/optimism/espresso"
"github.com/ethereum-optimism/optimism/espresso/logmodule"

"github.com/ethereum/go-ethereum/common"
"github.com/ethereum/go-ethereum/core/types"
@@ -162,7 +163,7 @@ func CaffNextBatch(s *op.BatchStreamer[EspressoBatch], ctx context.Context, pare
nextTimestamp := parent.Time + blockTime

if batch.Timestamp != nextTimestamp {
-s.Log.Error("Dropping batch", "batch", espressoBatch.Number(), "timestamp", batch.Timestamp, "expected", nextTimestamp)
+s.Log.Error(logmodule.DroppingBatch, "batch", espressoBatch.Number(), "timestamp", batch.Timestamp, "expected", nextTimestamp)
return nil, false, ErrTemporary
}

3 changes: 2 additions & 1 deletion op-node/rollup/derive/batch_queue.go
@@ -8,6 +8,7 @@ import (

"github.com/ethereum/go-ethereum/log"

"github.com/ethereum-optimism/optimism/espresso/logmodule"
"github.com/ethereum-optimism/optimism/op-node/rollup"
"github.com/ethereum-optimism/optimism/op-service/eth"
)
@@ -175,7 +176,7 @@ batchLoop:
remaining = append(remaining, batch)
continue
case BatchDrop:
-batch.Batch.LogContext(bq.log).Warn("Dropping batch",
+batch.Batch.LogContext(bq.log).Warn(logmodule.DroppingBatch,
"parent", parent.ID(),
"parent_time", parent.Time,
)
3 changes: 2 additions & 1 deletion op-node/rollup/derive/deriver.go
@@ -6,6 +6,7 @@ import (
"fmt"
"io"

"github.com/ethereum-optimism/optimism/espresso/logmodule"
"github.com/ethereum-optimism/optimism/op-node/rollup"
"github.com/ethereum-optimism/optimism/op-service/eth"
"github.com/ethereum-optimism/optimism/op-service/event"
@@ -162,7 +163,7 @@ func (d *PipelineDeriver) OnEvent(ctx context.Context, ev event.Event) bool {
// don't do a backoff for this error
d.emitter.Emit(ctx, DeriverMoreEvent{})
} else if err != nil {
-d.pipeline.log.Error("Derivation process error", "err", err)
+d.pipeline.log.Error(logmodule.DerivationProcessError, "err", err)
d.emitter.Emit(ctx, rollup.EngineTemporaryErrorEvent{Err: err})
} else {
if attrib != nil {