
feat: send state metrics to xatu #3

Closed
weiihann wants to merge 419 commits into samcm:bump-state-size-tracker from weiihann:feat/xatu-state-metrics-integrate

@weiihann

No description provided.

@weiihann weiihann changed the base branch from master to bump-state-size-tracker January 16, 2026 05:36
@weiihann weiihann force-pushed the feat/xatu-state-metrics-integrate branch from 469728a to 137c8c3 Compare January 16, 2026 05:42
@weiihann weiihann force-pushed the feat/xatu-state-metrics-integrate branch from 137c8c3 to 6447d7a Compare February 19, 2026 06:45
s1na and others added 27 commits February 26, 2026 13:55
The `plucky` and `oracular` releases have reached end of life. That's why
launchpad isn't building them anymore:
https://launchpad.net/~ethereum/+archive/ubuntu/ethereum/+packages.
We didn't upgrade to 1.25, so this jumps over one version. I want to
upgrade all builds to Go 1.26 soon, but let's start with the Docker
build to get a sense of any possible issues.
…33900)

The endianness was wrong, which means that the code chunks were stored
in the wrong location in the tree.
fix the flaky test found in
https://ci.appveyor.com/project/ethereum/go-ethereum/builds/53601688/job/af5ccvufpm9usq39

1. increase the timeout from 3+1s to 15s, and use a timer instead of
sleep (in the CI env, it may need more time to sync the 1024 blocks)
2. add `synced.Load()` to ensure the full async chain is finished

Signed-off-by: Delweng <delweng@gmail.com>
With this, we are dropping support for protocol version eth/68. The only supported
version is eth/69 now. The p2p receipt encoding logic can be simplified a lot, and
processing of receipts during sync gets a little faster because we now transform
the network encoding into the database encoding directly, without decoding the
receipts first.

---------

Co-authored-by: Felix Lange <fjl@twurst.com>
I noticed that some autonomous agents have a tendency to commit binaries
if asked to create a PR.
Fixes priceheap comparison in some edge cases.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
This PR introduces a threshold (relative to current market base fees),
below which we suppress the diffusion of low fee transactions. Once base
fees go down, and if the transactions were not evicted in the meantime,
we release these transactions.

The PR also updates the bucketing logic to be more sensitive, removing
the extra logarithm. Blobpool description is also
updated to reflect the new behavior.

EIP-7918 changed the maximum blob fee decrease that can happen in a
slot. The PR also updates fee jump calculation to reflect this.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
…ethereum#33908)

The payload rebuild loop resets the timer with the full Recommit
duration after generateWork returns, making the actual interval
generateWork_elapsed + Recommit instead of Recommit alone.

Since fillTransactions uses Recommit (2s) as its timeout ceiling, the
effective rebuild interval can reach ~4s under heavy blob workloads —
only 1–2 rebuilds in a 6s half-slot window instead of the intended 3.

Fix by subtracting elapsed time from the timer reset.

### Before this fix

```
t=0s  timer fires, generateWork starts
t=2s  fillTransactions times out, timer.Reset(2s)
t=4s  second rebuild starts
t=6s  CL calls getPayload — gets the t=2s result (1 effective rebuild)
```

### After

```
t=0s  timer fires, generateWork starts
t=2s  fillTransactions times out, timer.Reset(2s - 2s = 0)
t=2s  second rebuild starts immediately
t=4s  timer.Reset(0), third rebuild starts
t=6s  CL calls getPayload — gets the t=4s result (3 effective rebuilds)
```
We got a report that after v1.17.0 a geth-teku node starts to time out
on engine_getBlobsV2 after around 3h of operation. The culprit seems to
be our optional http2 service which Teku attempts first. The exact cause
of the timeout is still unclear.

This PR is more of a workaround than a proper fix until we figure out the
underlying issue. But I don't expect http2 to particularly benefit
engine API throughput and latency. Hence it should be fine to disable it
for now.
…reum#33946)

ethereum#33916 + cmd/keeper go mod
tidy

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
In `buildPayload()`, the background goroutine uses a `select` to wait on
the recommit timer, the stop channel, and the end timer. When both
`timer.C` and `payload.stop` are ready simultaneously, Go's `select`
picks a case non-deterministically. This means the loop can enter the
`timer.C` case and perform an unnecessary `generateWork` call even after
the payload has been resolved.

Add a non-blocking check of `payload.stop` at the top of the `timer.C`
case to exit immediately when the payload has already been delivered.
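
A non-blocking check of a stop channel is usually written as a `select` with a `default` arm. The sketch below shows that pattern in isolation; `shouldRebuild` is a hypothetical helper name, and the real loop in `buildPayload()` is not reproduced here.

```go
package main

import "fmt"

// shouldRebuild is a hypothetical helper meant to be called at the top of the
// timer case. It polls stop without blocking: if stop is already closed the
// payload has been delivered and no further work should be generated.
func shouldRebuild(stop <-chan struct{}) bool {
	select {
	case <-stop:
		return false
	default:
		return true
	}
}

func main() {
	stop := make(chan struct{})
	fmt.Println("before stop:", shouldRebuild(stop))
	close(stop)
	fmt.Println("after stop:", shouldRebuild(stop))
}
```

Because a closed channel is always ready to receive, the check keeps returning false on every later timer tick as well.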
Return the Amsterdam instruction set from `LookupInstructionSet` when
`IsAmsterdam` is true, so Amsterdam rules no longer fall through to the
Osaka jump table.

---------

Co-authored-by: rjl493456442 <garyrong0905@gmail.com>
…reum#33869)

For bal-devnet-3 we need to update the EIP-8024 implementation to the
latest spec changes: ethereum/EIPs#11306

> Note: I deleted tests not specified in the EIP bc maintaining them
through EIP changes is too error prone.
Pebble maintains a batch pool to recycle batch objects. Unfortunately,
a batch object must be explicitly returned via the `batch.Close`
function. This PR extends the batch interface by adding the Close
function and also invokes batch.Close in some critical code paths.

Memory allocation must be measured before merging this change. It is
also an open question whether we should call batch.Close in every
invocation.
Mainnet was already overriding --cache to 4096. This PR just makes this
the default.
…thereum#33927)

The BatchSpanProcessor queue size was incorrectly set to
DefaultMaxExportBatchSize (512) instead of DefaultMaxQueueSize (2048).

I noticed the issue on bloatnet when analyzing the block building
traces. During a particular run, the miner was including 1000
transactions in a single block. When telemetry is enabled, the miner
creates a span for each transaction added to the block. With the queue
capped at 512, spans were silently dropped when production outpaced the
span export, resulting in incomplete traces with orphaned spans. While
this doesn't eliminate the possibility of drops under extreme
load, using the correct default restores the 4x buffer between queue
capacity and export batch size that the SDK was designed around.
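
To see why the queue size matters, here is a rough simulation of a bounded queue that drops on overflow. It does not use the OpenTelemetry SDK: `dropped` is a hypothetical function, and it assumes the worst case in which nothing is exported while the burst arrives.

```go
package main

import "fmt"

// dropped is a hypothetical model of a drop-on-overflow queue. It offers
// `burst` items to a buffered channel of capacity queueSize without draining
// it, and returns how many items did not fit.
func dropped(queueSize, burst int) int {
	queue := make(chan int, queueSize)
	lost := 0
	for i := 0; i < burst; i++ {
		select {
		case queue <- i:
			// Item was queued for export.
		default:
			// Queue full: the item is silently discarded.
			lost++
		}
	}
	return lost
}

func main() {
	fmt.Println("queue 512, burst 1000, dropped:", dropped(512, 1000))
	fmt.Println("queue 2048, burst 1000, dropped:", dropped(2048, 1000))
}
```

In practice the exporter drains the queue concurrently, so real drop counts depend on export speed; the model only shows the headroom the larger queue provides.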
cuiweixie and others added 18 commits May 10, 2026 13:03
This fixes a theoretical overflow condition if an account has an impossibly high nonce.
Removes the appveyor.yml since we moved to github runners.

---------

Co-authored-by: Sina Mahmoodi <itz.s1na@gmail.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
Passing `--dev=false` currently still enters the dev-mode startup path
because a couple of branches check whether the flag was set, not its
boolean value.

This switches those branches to use `ctx.Bool`, so explicit false does
not start dev mode or emit a dev genesis, while `--dev` keeps its
existing behavior.
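
The difference between "the flag was set" and "the flag is true" can be shown with the standard library `flag` package. This is only an analogy: geth uses a different CLI library, and `parseDev` and the `demo` flag set are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
)

// parseDev is a hypothetical helper. It parses args and reports both whether
// the dev flag appeared on the command line and what its boolean value is.
func parseDev(args []string) (isSet bool, value bool) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	dev := fs.Bool("dev", false, "enable developer mode")
	if err := fs.Parse(args); err != nil {
		return false, false
	}
	// Visit only walks flags that were explicitly set.
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "dev" {
			isSet = true
		}
	})
	return isSet, *dev
}

func main() {
	for _, args := range [][]string{{}, {"--dev"}, {"--dev=false"}} {
		isSet, value := parseDev(args)
		fmt.Printf("args=%v isSet=%v value=%v\n", args, isSet, value)
	}
}
```

For `--dev=false` the flag is set but its value is false, so branching on "is set" takes the wrong path while branching on the value does not.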
Changes core.Message to use Uint256, which is faster.

---------

Co-authored-by: Gary Rong <garyrong0905@gmail.com>
…ereum#34827)

Every tracer that implements Stop/GetResult held a `reason error` field
that is written by Stop (called from the trace-timeout watchdog
goroutine in api.go) and read by GetResult (called by the RPC handler
main goroutine). These accesses were unsynchronized.
In the --create path, execFunc returns gasLeft as the second return
value, but the rest of the code treats this value as "gas used" (printed
as such, and compared in timedExec). This makes gas reporting incorrect
and can cause benchmark consistency checks to fail.
This is a refactoring PR that wraps all pre/post-execution system calls
in exported functions, eliminating the duplicated system calls across
the codebase.

A few things are left unchanged but are worth highlighting:

- ChainMaker is left unchanged, since a significant rewrite would be required
- BeaconRoot in header should be non-nil if Cancun is enabled

---------

Co-authored-by: jwasinger <j-wasinger@hotmail.com>
…m#34939)

Fixes the regression caught by
https://hive.ethpandaops.io/#/test/generic/1778481210-e59b7465e1d04f7ed1b0200838584b16?testnumber=137.
engine.AssembleBlock explicitly expects withdrawals to be non-nil for
pre-Shanghai blocks as opposed to FinaliseAndAssemble which stripped off
the withdrawal.
In b2843a1, metrics check len(res) == len(hashes) but res is
pre-allocated with make(), so length is always equal. Partial hit metric
never fires. Count non-nil elements instead.
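
A small sketch of the problem and the fix, assuming the results are byte slices (the element type is not stated above, and `countHits` and `classify` are hypothetical names):

```go
package main

import "fmt"

// countHits returns the number of populated (non-nil) entries in res.
func countHits(res [][]byte) int {
	hits := 0
	for _, r := range res {
		if r != nil {
			hits++
		}
	}
	return hits
}

// classify labels a lookup as a full hit, partial hit, or miss based on the
// number of populated entries rather than on len(res).
func classify(res [][]byte) string {
	switch hits := countHits(res); {
	case hits == len(res):
		return "full"
	case hits > 0:
		return "partial"
	default:
		return "miss"
	}
}

func main() {
	hashes := []string{"h1", "h2", "h3"}
	res := make([][]byte, len(hashes)) // pre-allocated: len(res) == len(hashes) always
	res[0] = []byte{0x01}              // only one lookup succeeded
	fmt.Println("len equal:", len(res) == len(hashes))
	fmt.Println("classification:", classify(res))
}
```

Here the length comparison is true even though only one of three entries was found, which is why the partial-hit branch could never be reached.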

---------

Co-authored-by: Bosul Mun <bsbs8645@snu.ac.kr>
This PR introduces a separate transaction pool type for sparse blobpool.

In the sparse blobpool, the PooledTransactions message delivers transactions
without blobs, and partial or full cells are downloaded via the Cells message.
The blobpool no longer stores transactions with complete sidecars; instead, it
stores transactions without blobs, along with the corresponding cells. Because
of this, a dedicated type distinct from types.Transaction is required.

This PR introduces a type called `BlobTxForPool` and stores each sidecar field
independently, in order to bypass the assumption that a sidecar always exists as
a complete unit.

Reintroducing the conversion queue was considered, but was ultimately omitted
because type conversion should be sufficiently fast. With sparse blobpool,
blob -> cell computation would take about 13ms per blob. Not sure whether this
is fast enough, but otherwise we can add the conversion queue later on the
sparse blobpool branch.
1. The check should use `!reflect.DeepEqual`.
2. Use `big.NewInt(0).SetBits([]big.Word{})` as a workaround for DeepEqual when
the big.Int is zero, because unpack returns a `[]big.Word{}`.
Passing `--v2=false` currently still selects the v2 binding generator
because the command checks whether the flag was set.

This switches generation to use the boolean flag value, so explicit
false continues to generate legacy bindings while `--v2` keeps selecting
v2.
@weiihann weiihann force-pushed the feat/xatu-state-metrics-integrate branch from 6447d7a to 5e97196 Compare May 13, 2026 07:59
rjl493456442 and others added 8 commits May 13, 2026 10:53
This PR introduces OnGasChangeV2 tracing hook, as the pre-requisite for landing
EIP-8037.

---------

Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>
This PR extends the journal to track the pre-transaction values of
mutated balances, nonces, and code.

At the end of the transaction, these values are used to filter out no-op
changes, such as balance transitions from a -> b -> a. These changes are
excluded from the block-level access list.

Additionally, there is a dedicated `bal.ConstructionBlockAccessList`
object for gathering the state reads and writes within the current
transaction. These state writes will be keyed by the block access list
index.
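
The no-op filtering can be sketched with a toy journal that remembers the value each balance had before its first mutation in the transaction. All names here (`balanceJournal`, `set`, `netWrites`) are hypothetical, and balances are simplified to `uint64` keyed by a string address.

```go
package main

import "fmt"

// balanceJournal is a hypothetical, simplified journal for one transaction.
type balanceJournal struct {
	pre map[string]uint64 // value before the first mutation in this transaction
	cur map[string]uint64 // current value
}

// set mutates a balance, recording the pre-transaction value on first touch.
func (j *balanceJournal) set(addr string, v uint64) {
	if _, seen := j.pre[addr]; !seen {
		j.pre[addr] = j.cur[addr]
	}
	j.cur[addr] = v
}

// netWrites returns only the balances whose final value differs from the
// pre-transaction value, so a -> b -> a transitions are filtered out.
func (j *balanceJournal) netWrites() map[string]uint64 {
	out := make(map[string]uint64)
	for addr, before := range j.pre {
		if after := j.cur[addr]; after != before {
			out[addr] = after
		}
	}
	return out
}

func main() {
	j := &balanceJournal{
		pre: map[string]uint64{},
		cur: map[string]uint64{"alice": 1, "bob": 5},
	}
	j.set("alice", 2)
	j.set("alice", 1) // back to the original value: a no-op overall
	j.set("bob", 7)
	fmt.Println(j.netWrites())
}
```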

---------

Co-authored-by: jwasinger <j-wasinger@hotmail.com>
## Summary

The `--rpc.telemetry.sample-ratio` flag declares `Value: 1.0` and `geth
--help` advertises `(default: 1)`. In practice, however, omitting the
flag produces a sample ratio of `0`, causing
`sdktrace.TraceIDRatioBased(0)` to drop 100% of spans. Users who enable
`--rpc.telemetry` see the `OpenTelemetry trace export enabled` log line
and a clean startup, but no traces ever leave the process.

The root cause is the interaction between two pieces of code:

1. `cmd/utils/flags.go:setOpenTelemetry` (added in ethereum#34062) only copies
the flag value when `ctx.IsSet(...)` returns true:

   ```go
   if ctx.IsSet(RPCTelemetrySampleRatioFlag.Name) {
       tcfg.SampleRatio = ctx.Float64(RPCTelemetrySampleRatioFlag.Name)
   }
   ```

That is the right pattern for "don't clobber a config-file value with
the CLI default," but it implies that something else must initialise the
field when neither source sets it.

2. `node/defaults.go:DefaultConfig` never initialises
`OpenTelemetry.SampleRatio`, leaving it at the float64 zero value.

The result for the common CLI-only user (no TOML config) is `SampleRatio
= 0` → every span is silently dropped, despite the documented default of
1.

## Change

Seed `OpenTelemetry: OpenTelemetryConfig{SampleRatio: 1.0}` in
`node.DefaultConfig` so the documented default matches runtime behavior
and the `ctx.IsSet` guard in `setOpenTelemetry` continues to do what it
was designed to do.
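
The interaction between the guard and the defaults can be reproduced with a reduced model. `OpenTelemetryConfig` and `SampleRatio` are taken from the description above; the `Config` wrapper and the `resolve` function are hypothetical simplifications of the flag-handling path.

```go
package main

import "fmt"

// OpenTelemetryConfig and Config are reduced stand-ins for the node config.
type OpenTelemetryConfig struct {
	SampleRatio float64
}

type Config struct {
	OpenTelemetry OpenTelemetryConfig
}

// resolve mimics the guard: the CLI value is copied only when the flag was
// explicitly set, so the starting defaults decide the unset case.
func resolve(defaults Config, flagSet bool, flagValue float64) Config {
	cfg := defaults
	if flagSet {
		cfg.OpenTelemetry.SampleRatio = flagValue
	}
	return cfg
}

func main() {
	unseeded := Config{}
	seeded := Config{OpenTelemetry: OpenTelemetryConfig{SampleRatio: 1.0}}

	fmt.Println("unseeded, flag omitted:", resolve(unseeded, false, 1.0).OpenTelemetry.SampleRatio)
	fmt.Println("seeded, flag omitted:  ", resolve(seeded, false, 1.0).OpenTelemetry.SampleRatio)
	fmt.Println("seeded, flag = 0.25:   ", resolve(seeded, true, 0.25).OpenTelemetry.SampleRatio)
}
```

Seeding the default keeps the guard's purpose intact: an explicit flag still wins, and an omitted flag no longer collapses to zero.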
Adds a new CLI flag --state.size-tracking-depth to control how many
recent block state sizes are tracked in memory. The default is set to
10000 blocks (previously hardcoded at 128).

This allows users to tune memory usage vs historical depth based on
their monitoring needs. Setting to 0 uses the default value.

Memory impact per block tracked: ~292 bytes (map entry + heap entry)
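
As a rough estimate using the ~292 bytes per block quoted above, the tracker's memory scales linearly with depth. The snippet below only does that arithmetic; `trackerBytes` is a hypothetical helper and the real footprint will vary with allocator overhead.

```go
package main

import "fmt"

// bytesPerBlock is the approximate per-block cost quoted in the commit message.
const bytesPerBlock = 292

// trackerBytes estimates the memory used to track `depth` recent blocks.
func trackerBytes(depth int) int {
	return depth * bytesPerBlock
}

func main() {
	for _, depth := range []int{128, 10000} {
		b := trackerBytes(depth)
		fmt.Printf("depth %d: %d bytes (%.2f MiB)\n", depth, b, float64(b)/(1024*1024))
	}
}
```

At the new default of 10000 blocks this comes to about 2.9 MB, compared with roughly 37 KB at the old depth of 128.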
…relation

This change updates the state tracking mechanism to include the block hash alongside the block number and state root when calculating and publishing size statistics. This allows for more precise correlation between state size changes and specific blocks, improving observability and debugging capabilities for state growth.
count and bytes

feat: send state metrics to xatu

feat: change module and add network name
The state size tracer previously emitted a single signed delta per
category (account, storage, contract code, account trie nodes, storage
trie nodes). That representation lost information because a block-level
update appeared the same as no activity once writes and deletes cancelled.

The tracer now emits per-block writes and deletes counts and bytes
separately, nested under top-level "writes" and "deletes" keys in the
"State metrics" JSON log. An update is accounted as BOTH a write of the
new value AND a delete of the prev value, so consumers can recover the
net delta as (writes - deletes). The four symmetric categories use this
three-arm switch (create / update / delete). Contract code remains
write-only with hash dedup (reliable ref-counting would be needed for
deletions); its delete counters stay 0.

The mpt-depth side of the log is untouched.

Consumer side: xatu's sentry-logs Vector pipeline normalises the new
shape and ClickHouse derives the original delta fields via MATERIALIZED
columns.
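
The three-arm accounting can be sketched as follows. The type and method names (`counters`, `categoryStats`, `record`, `netBytes`) are hypothetical, and a nil slice is used here to mean "value absent", which is an assumption about how creation and deletion are detected.

```go
package main

import "fmt"

// counters holds a count and a byte total for one direction.
type counters struct {
	Count int64
	Bytes int64
}

func (c *counters) add(size int) {
	c.Count++
	c.Bytes += int64(size)
}

// categoryStats tracks writes and deletes separately for one category.
type categoryStats struct {
	Writes  counters
	Deletes counters
}

// record accounts for one state change. A nil prev means the entry is created,
// a nil next means it is deleted, and both non-nil means it is updated.
func (s *categoryStats) record(prev, next []byte) {
	switch {
	case prev == nil && next != nil: // create
		s.Writes.add(len(next))
	case prev != nil && next != nil: // update: write of new AND delete of prev
		s.Writes.add(len(next))
		s.Deletes.add(len(prev))
	case prev != nil && next == nil: // delete
		s.Deletes.add(len(prev))
	}
}

// netBytes recovers the signed delta as (writes - deletes).
func (s categoryStats) netBytes() int64 {
	return s.Writes.Bytes - s.Deletes.Bytes
}

func main() {
	var storage categoryStats
	storage.record(make([]byte, 32), make([]byte, 32)) // same-size update
	fmt.Printf("writes=%+v deletes=%+v net=%d\n", storage.Writes, storage.Deletes, storage.netBytes())
}
```

A same-size update yields a net delta of 0 but non-zero write and delete counters, which is the activity the single signed delta could not show.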
@weiihann weiihann force-pushed the feat/xatu-state-metrics-integrate branch from 5e97196 to f5efc52 Compare May 14, 2026 05:02
@weiihann
Author

Closing in favour of a fresh PR built on a rebased base. The state-size infra in bump-state-size-tracker conflicted heavily with current ethereum/go-ethereum master (414 commits behind, pre-dates PR ethereum#33490 which introduced the OnStateUpdate hook our tracer depends on). The remaining useful commits from bump-state-size-tracker (configurable depth + block-hash tracking) have been integrated into a new branch on weiihann/go-ethereum and a successor PR will follow.

@weiihann weiihann closed this May 14, 2026