Skip to content

fix(l1): clearer sync diagnostics during catch-up#6733

Merged
edg-l merged 5 commits into
mainfrom
fix/l1/sync-diagnostics-logs
May 28, 2026
Merged

fix(l1): clearer sync diagnostics during catch-up#6733
edg-l merged 5 commits into
mainfrom
fix/l1/sync-diagnostics-logs

Conversation

@edg-l
Copy link
Copy Markdown
Contributor

@edg-l edg-l commented May 27, 2026

Motivation

A home validator reported their node "not finalizing" after 2 days of snap sync. ethrex was healthy and keeping pace; the consensus client was ~2 weeks behind and only feeding forkchoice heads a few blocks ahead of execution. The execution logs made this hard to see, and eth_syncing actively misled them.

Changes

  1. eth_syncing.highestBlock returned garbage. It reported last_fcu_head().to_low_u64_be(); the low 64 bits of the forkchoice head hash, not a block number. Now resolves the head hash to a block number via storage, then the pending-block store, falling back to current_block only when the target isn't downloaded yet (mid snap-sync).

  2. Sync target + staleness hint. On the first header batch of a cycle, log the consensus-provided forkchoice head and how far behind the local head is. If the head is more than 8 blocks ahead and older than 30 min, hint that the consensus client may still be catching up. Gated on distance so a synced node stays quiet. The target is read from the pre-rewind forkchoice head, not the rewound ancestor we request headers from.

  3. "Caught up" log. When a cycle that started behind finishes, log that it reached the consensus-provided head and is waiting for the next forkchoice update, so idle-waiting is distinguishable from a hang.

- eth_syncing: report real highestBlock number instead of the truncated
  forkchoice head hash (was emitting garbage like 0x3abc579fc0fc31b8)
- full sync: log the consensus-provided sync target and how far behind the
  local head is, and warn (when stale > 30m) that the consensus client may
  be catching up to chain head
- full sync: log when a behind cycle reaches the consensus-provided head so
  idle-waiting is distinguishable from a hang
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known
Issues
section the L1 workflow appends to each ef-tests job summary
and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened
test-stateless to extract the entire for_amsterdam/ tree from the
zkevm bundle and run all of it under --features stateless; combined with
this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch /
BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

@github-actions github-actions Bot added the L1 Ethereum client label May 27, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

Lines of code report

Total lines added: 61
Total lines removed: 0
Total lines changed: 61

Detailed view
+--------------------------------------------+-------+------+
| File                                       | Lines | Diff |
+--------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs  | 437   | +53  |
+--------------------------------------------+-------+------+
| ethrex/crates/networking/rpc/eth/client.rs | 166   | +8   |
+--------------------------------------------+-------+------+

@edg-l edg-l marked this pull request as ready for review May 27, 2026 13:43
@edg-l edg-l requested a review from a team as a code owner May 27, 2026 13:43
@edg-l edg-l moved this to In Review in ethrex_l1 May 27, 2026
@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

The changes look solid overall. The RPC fix is a clear bugfix, and the sync logging improvements are well-structured. A few specific points:

crates/networking/p2p/sync/full.rs

Line 143-145 (SystemTime usage):
Using SystemTime::now() for the stale-head warning is acceptable for logging-only purposes, but be aware this compares against the local system clock. If the operator's system time is significantly off, this will produce false warnings or miss real ones. Consider adding a comment that this is heuristic-only, not consensus-critical.

Line 136 & 332 (Redundant storage calls):
You call store.get_latest_block_number().await? twice (once at the start of the batch check, once in the final "caught up" log). Cache this value in a local variable to avoid the extra database hit, e.g.:

let local_head = store.get_latest_block_number().await?;
// ... use local_head for behind calculation ...
if started_behind {
    info!("Reached consensus-provided head at block {local_head}...");
}

Line 18-29 (humanize_secs):
The function is clear, but consider handling the seconds case for completeness (currently rounds down to minutes for sub-minute durations). Also, the constants 86_400 and 3_600 are correct but could use named constants for readability:

const SECS_PER_DAY: u64 = 24 * 60 * 60;
const SECS_PER_HOUR: u64 = 60 * 60;

Line 144 (Timestamp underflow):
Good use of saturating_sub to handle potential future timestamps (avoids underflow if block time > system time).

crates/networking/rpc/eth/client.rs

Line 64-70 (Block hash resolution):
This correctly fixes a bug where highest_block was previously derived from to_low_u64_be() on a 32-byte hash (which would return garbage data). Resolving the hash to a block number via get_block_number is the correct approach.

Line 69 (Fallback logic):
Using unwrap_or(current_block) when the head hash isn't found is a sensible fallback for mid-sync states, though be aware this will show highest_block == current_block (0% progress) during snap-sync until the target header is stored. This is better than reporting a nonsensical number.

General:
Both changes properly propagate errors using ? and avoid panics. The 30-minute (1800s) stale threshold and 8-block follow distance are reasonable defaults for Ethereum mainnet.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have enough context. Let me write up the review.


Code Review: PR #6733fix(l1): clearer sync diagnostics during catch-up

Summary

This PR fixes a real bug in eth_syncing and adds useful operational diagnostics. The motivation is solid and the approach is sensible. A few items worth addressing below.


crates/networking/rpc/eth/client.rs

The highestBlock fix is clearly correct. The original .to_low_u64_be() on a hash was returning the low 64 bits of the raw 32-byte hash value — obviously garbage as a block number. The new code resolves the hash via storage with a sensible fallback to current_block for the mid-snap-sync case.

One minor note: if get_last_fcu_head() ever returns a zero hash (no FCU received yet), get_block_number(zero_hash) will return None and the fallback kicks in — this behavior is correct but it might be worth a brief comment noting that zero hash → fallback is intentional.


crates/networking/p2p/sync/full.rs

humanize_secs — missing sub-minute case

fn humanize_secs(secs: u64) -> String {
    // ...
    } else {
        format!("{mins}m")  // "0m" when secs < 60
    }
}

When secs < 60 this returns "0m". Given that STALE_FORKCHOICE_HEAD_SECS = 1800, the function will never be called with a sub-minute value in practice (the guard is age > 1800), so this won't surface. Still, a defensive else { format!("{secs}s") } branch would make the function correct in isolation without the caller's preconditions.

target is first_header.number, not sync_head

let target = first_header.number;
let target_ts = first_header.timestamp;

Headers are requested with NewToOld order (line 97), so first_header is indeed the newest block in the batch and its block number is a close approximation of the sync target. However, if a peer doesn't have the exact sync_head block and returns headers starting from a slightly lower block, target and target_ts will be slightly off. For a diagnostic this is fine, but the logged "Sync target from consensus forkchoice: block {target}" could mislead an operator into thinking target is the precise consensus head. Consider either:

  • Logging it as "near block {target}" to set expectations, or
  • Using the string "~{target}" to signal approximation.

started_behind / sync_target_logged interaction

The two booleans are correct but their relationship is subtle: sync_target_logged is set to true unconditionally, while started_behind is only set when behind > FOLLOW_DISTANCE. This means a near-head cycle (e.g., just 2 blocks behind) will silently skip the first-batch block entirely and emit no "caught up" log — which is the desired behavior. The logic is correct; the name sync_target_logged is slightly misleading since it gates more than just the log call. A name like first_batch_checked would be clearer.

"Caught up" log uses actual local head — correct

if started_behind {
    let local_head = store.get_latest_block_number().await?;
    info!("Reached consensus-provided head at block {local_head}. ...");
}

This correctly reflects the post-processing local head rather than the target, so it shows where execution actually landed. Good.

Stale-head message in PR description vs. actual diff

The PR description says: "The staleness hint is colorized (bold-yellow) only when stdout is a color-capable TTY..." — this was dropped in the second commit (fix(l1): drop color handling from stale-head warning). The description should be updated to remove that paragraph to avoid confusing future readers of the PR.


Correctness / Security

No consensus-rule logic is touched. Both changes are purely diagnostic (logging + RPC response field fix). No new async tasks, no unsafe code, no state mutation. The unwrap_or_default() on SystemTime::now().duration_since(UNIX_EPOCH) handles the impossible-but-theoretically-possible pre-epoch case cleanly.


Minor nit — long warn! string

The warn! call at line ~148 contains a very long format string split with \. This is fine for Rust (it's a static string), but splitting it into two warn! calls or using \n would make it more readable in the source.


Verdict

The eth_syncing fix is a clear correctness bug fix and should merge regardless. The diagnostic additions are well-gated (FOLLOW_DISTANCE prevents noise on synced nodes), the constants are well-documented, and the overall approach is sound. The items above are all non-blocking; the most actionable one is updating the stale PR description.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. In client.rs, falling back to current_block when get_block_number(head_hash) returns None changes eth_syncing.highestBlock from “unknown target” to “already at target”. That None case is normal while the FCU head has not been indexed yet, and also when the FCU points at a pending block, so during active sync this can report highestBlock == currentBlock even though eth_syncing is still true. Downstream progress monitors will treat that as fully caught up. Prefer resolving the head via pending-block/header state, or preserving the last known target height instead of collapsing to current_block.

  2. In full.rs and full.rs, the “consensus forkchoice head” logging is derived from first_header after peeling off pending_blocks. On resumed sync, that can be older than the actual FCU head by however many pending blocks were stripped, so the new stale-head warning can falsely blame the consensus client for lagging. If you want this warning to be actionable, it should use the original FCU target (or the pending head’s timestamp/number when present), not the rewound ancestor.

I didn’t see EVM/consensus-critical or memory-safety issues in this diff beyond those RPC/observability regressions. I did not run tests.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 27, 2026

Greptile Summary

Improves execution-layer diagnostics during catch-up sync: fixes eth_syncing.highestBlock (was emitting the low 64 bits of a hash instead of a block number), adds a per-cycle log of the consensus-provided target with a staleness hint when the forkchoice head is more than 30 minutes old, and emits a "caught up" message when a behind-cycle completes.

  • eth_syncing now resolves the FCU head hash to a real block number via storage, falling back to current_block when the header isn't stored yet (snap-sync mid-flight). This trades a garbage value for a potentially zero-gap report during snap-sync.
  • humanize_secs is a new local helper for the staleness warning; it works correctly for all inputs ≥ 60 s (the only reachable path) but returns \"0m\" for sub-minute durations, which could mislead if the function is reused.
  • The PR description mentions bold-yellow TTY colorisation for the staleness hint, but no such code appears in the diff — the warning is emitted as a plain warn!() call.

Confidence Score: 4/5

Safe to merge; changes are diagnostic-only with no effect on block execution or consensus logic.

Both changed files touch only logging and the eth_syncing RPC response — no block execution, storage writes, or consensus paths are modified. The current_block fallback in eth_syncing is a deliberate trade-off that improves on the previous garbage value, though it may cause monitoring tools to show a zero sync gap during snap-sync. The humanize_secs edge case for sub-60 s inputs is unreachable today but could surface if the helper is reused.

crates/networking/rpc/eth/client.rs — the highest_block fallback behaviour during snap-sync is worth a second look if any external monitoring relies on the eth_syncing gap.

Important Files Changed

Filename Overview
crates/networking/p2p/sync/full.rs Adds per-cycle sync-target logging and a "caught up" message; timestamp-based staleness warning is correct but humanize_secs has an unreachable "0m" edge case for sub-60 s inputs.
crates/networking/rpc/eth/client.rs Fixes eth_syncing.highestBlock by resolving the FCU head hash to a block number; the current_block fallback during snap-sync silences the previous garbage value but makes progress monitoring report 0 gap while syncing.

Sequence Diagram

sequenceDiagram
    participant CL as Consensus Client
    participant EL as Execution (sync_cycle_full)
    participant Store as Storage
    participant RPC as eth_syncing RPC

    CL->>EL: forkchoiceUpdated(head_hash)
    EL->>Store: get_latest_block_number()
    Store-->>EL: local_head
    EL->>EL: "behind = target - local_head"
    alt "behind > FOLLOW_DISTANCE (8)"
        EL->>EL: log sync target + distance
        EL->>EL: "age = now - first_header.timestamp"
        alt "age > 1800s"
            EL->>EL: warn consensus may be catching up
        end
    end
    EL->>Store: "fetch & execute headers/bodies"
    EL->>Store: get_latest_block_number()
    alt started_behind
        EL->>EL: info Reached consensus-provided head
    end

    Note over RPC: eth_syncing
    RPC->>Store: get_latest_block_number() → current_block
    RPC->>EL: get_last_fcu_head() → head_hash
    RPC->>Store: get_block_number(head_hash)
    alt header in storage
        Store-->>RPC: block number → highest_block
    else header not yet stored (snap-sync)
        Store-->>RPC: None → fallback to current_block
    end
    RPC-->>CL: "{startingBlock, currentBlock, highestBlock}"
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/networking/p2p/sync/full.rs:40-51
**`humanize_secs` returns `"0m"` for sub-minute inputs**

If `secs` is 0–59, `days`, `hours`, and `mins` are all 0, so the function returns `"0m"`. In the current call site this is unreachable (the guard is `age > STALE_FORKCHOICE_HEAD_SECS` = 1 800 s, so `mins ≥ 30`), but if the function is ever called with a smaller value — e.g. for a future diagnostic or unit test — it will silently produce a misleading output instead of e.g. `"< 1m"` or a seconds fallback.

### Issue 2 of 2
crates/networking/rpc/eth/client.rs:68-72
**`highest_block` falls back to `current_block`, hiding sync progress from consumers**

When the FCU head hash isn't in storage yet (e.g. during snap-sync), `get_block_number` returns `None` and `highest_block` is set to `current_block`. The `eth_syncing` response then shows `highestBlock == currentBlock` while the node is still actively syncing. Monitoring tools and dashboards that use the `highestBlock − currentBlock` gap to render a progress bar will show 0 % remaining and may treat the node as caught-up. Returning `0` (or any sentinel that is clearly ≤ `current_block`) would at least not falsely indicate full sync, though the PR description explicitly acknowledges this as an intentional trade-off over the previous garbage value.

Reviews (1): Last reviewed commit: "fix(l1): drop color handling from stale-..." | Re-trigger Greptile

Comment on lines +40 to +51
fn humanize_secs(secs: u64) -> String {
let days = secs / 86_400;
let hours = (secs % 86_400) / 3_600;
let mins = (secs % 3_600) / 60;
if days > 0 {
format!("{days}d {hours}h")
} else if hours > 0 {
format!("{hours}h {mins}m")
} else {
format!("{mins}m")
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 humanize_secs returns "0m" for sub-minute inputs

If secs is 0–59, days, hours, and mins are all 0, so the function returns "0m". In the current call site this is unreachable (the guard is age > STALE_FORKCHOICE_HEAD_SECS = 1 800 s, so mins ≥ 30), but if the function is ever called with a smaller value — e.g. for a future diagnostic or unit test — it will silently produce a misleading output instead of e.g. "< 1m" or a seconds fallback.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/full.rs
Line: 40-51

Comment:
**`humanize_secs` returns `"0m"` for sub-minute inputs**

If `secs` is 0–59, `days`, `hours`, and `mins` are all 0, so the function returns `"0m"`. In the current call site this is unreachable (the guard is `age > STALE_FORKCHOICE_HEAD_SECS` = 1 800 s, so `mins ≥ 30`), but if the function is ever called with a smaller value — e.g. for a future diagnostic or unit test — it will silently produce a misleading output instead of e.g. `"< 1m"` or a seconds fallback.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread crates/networking/rpc/eth/client.rs Outdated
Comment on lines +68 to +72
let highest_block = context
.storage
.get_block_number(head_hash)
.await?
.unwrap_or(current_block);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 highest_block falls back to current_block, hiding sync progress from consumers

When the FCU head hash isn't in storage yet (e.g. during snap-sync), get_block_number returns None and highest_block is set to current_block. The eth_syncing response then shows highestBlock == currentBlock while the node is still actively syncing. Monitoring tools and dashboards that use the highestBlock − currentBlock gap to render a progress bar will show 0 % remaining and may treat the node as caught-up. Returning 0 (or any sentinel that is clearly ≤ current_block) would at least not falsely indicate full sync, though the PR description explicitly acknowledges this as an intentional trade-off over the previous garbage value.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/rpc/eth/client.rs
Line: 68-72

Comment:
**`highest_block` falls back to `current_block`, hiding sync progress from consumers**

When the FCU head hash isn't in storage yet (e.g. during snap-sync), `get_block_number` returns `None` and `highest_block` is set to `current_block`. The `eth_syncing` response then shows `highestBlock == currentBlock` while the node is still actively syncing. Monitoring tools and dashboards that use the `highestBlock − currentBlock` gap to render a progress bar will show 0 % remaining and may treat the node as caught-up. Returning `0` (or any sentinel that is clearly ≤ `current_block`) would at least not falsely indicate full sync, though the PR description explicitly acknowledges this as an intentional trade-off over the previous garbage value.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@ElFantasma ElFantasma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

highestBlock fallback is 2-tier, but the PR description claims 3-tier. The body says it resolves the head hash "via storage, then the pending-block store, falling back to current_block" — but the code is get_block_number(head_hash).unwrap_or(current_block), and get_block_number only reads the BLOCK_NUMBERS table (verified — no pending/fork-choice store lookup). The missing middle tier matters precisely for the case this PR targets: a forkchoice head that's been received but not yet executed (mid snap-sync) won't be in BLOCK_NUMBERS, so highestBlock silently collapses to current_block and eth_syncing reports distance 0 — which is the same class of "looks synced when it isn't" the PR set out to fix. Either (a) add the pending-store lookup the body promises, or (b) correct the description to match the honest 2-tier code + the inline comment (which already says only "if the head's header isn't stored yet … fall back to the current block"). See inline.

Comment thread crates/networking/rpc/eth/client.rs Outdated
.storage
.get_block_number(head_hash)
.await?
.unwrap_or(current_block);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says this resolves the head "via storage, then the pending-block store, falling back to current_block" — three tiers — but this is two: get_block_number(head_hash) (which only reads the BLOCK_NUMBERS table; I checked) then current_block. There's no pending-block-store tier.

That middle tier is exactly the interesting case: a forkchoice head that's been received but not yet executed (the mid-snap-sync scenario this PR is about) isn't in BLOCK_NUMBERS yet, so highest_block falls straight to current_block and eth_syncing reports highestBlock == currentBlock — distance 0, i.e. "looks synced" while still catching up. That's a milder version of the garbage you're fixing, but the same failure mode.

Two options:

  • If the pending/fork-choice store can resolve a received-but-unexecuted head to a number, add that lookup between the two (matches the body).
  • Otherwise, reword the PR body to the honest 2-tier behavior — the inline comment above already describes it correctly, so it's just the description that overstates.

Non-blocking, but worth closing the gap since accurate eth_syncing during catch-up is the whole point.

Comment thread crates/networking/p2p/sync/full.rs Outdated
// warn if that head is stale (a strong signal the consensus client is behind).
if !sync_target_logged {
sync_target_logged = true;
let target = first_header.number;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: target = first_header.number is the first header of the first fetched batch, which is a good-enough proxy for the diagnostic. The PR body says the target is read from "the pre-rewind forkchoice head, not the rewound ancestor we request headers from" — but first_header is the header we request from, so after a rewind to a common ancestor this target (and therefore the behind distance) understates the true forkchoice head the consensus client advertised.

For a log line that's fine — it errs toward under-warning, not false alarms. But if you want the distance to reflect the actual FCU head (matching the body's claim), resolve sync_head to its number once at the top of the cycle and use that. Low priority.

@edg-l edg-l enabled auto-merge May 28, 2026 06:41
@edg-l edg-l added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit c3755c1 May 28, 2026
53 checks passed
@edg-l edg-l deleted the fix/l1/sync-diagnostics-logs branch May 28, 2026 07:59
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants