fix: snapshot compare lotus OOM#5982
Conversation
WalkthroughUpdates the snapshot-parity CI workflow runner to buildjet-8vcpu-ubuntu-2204 and adjusts the snapshot parity docker-compose: enables Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Docker as docker-compose
participant Forest as forest (service)
participant Lotus as lotus (service)
Docker->>Forest: start
note right of Forest #D3E4CD: Run to completion (success)
Forest-->>Docker: completed_successfully
Docker->>Lotus: start (depends_on: forest service_completed_successfully)
note right of Lotus #FCE7C6: Starts only after forest completes successfully
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Assessment against linked issues
Possibly related issues
Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
5f2eb13 to
c6151ba
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/tests/snapshot_parity/docker-compose.yml (1)
21-26: Init step can fail underset -ewhen no snapshot exists (first run).Using
$(ls /data/*.car.zst)inside the test will exit non‑zero when the glob has no matches, and withset -ethat aborts the script before theifexecutes. This would break fresh runs.Use a glob-safe existence check:
- # if there are some files in the data directory, then we don't need to fetch the snapshot - if [ "$$(ls -A /data/*.car.zst)" ]; then - echo "Snapshot already fetched" - else - forest-tool snapshot fetch --chain calibnet -d /data - fi + # If a snapshot is already present, skip fetching; otherwise, fetch it. + if compgen -G "/data/*.car.zst" > /dev/null; then + echo "Snapshot already fetched" + else + forest-tool snapshot fetch --chain ${CHAIN:-calibnet} -d /data + fi
🧹 Nitpick comments (7)
.github/workflows/snapshot-parity.yml (2)
15-15: Re-evaluate timeout after sequentialization.Now that Forest and Lotus run sequentially (longer wall time), 60 minutes might be tight. Consider 90–120 minutes unless you have recent runs confirming sub-60 min completion.
Apply:
- timeout-minutes: 60 + timeout-minutes: 120
6-8: Avoid overlapping manual runs with a small concurrency guard.Weekly schedule won’t overlap, but workflow_dispatch can. Add a simple concurrency group.
Apply at workflow root (top-level, alongside “on” and “jobs”):
name: Snapshot parity test on: workflow_dispatch: schedule: - cron: "0 0 * * 0" # Runs at 00:00, only on Sunday +concurrency: + group: snapshot-parity + cancel-in-progress: false jobs: snapshot-parity:scripts/tests/snapshot_parity/docker-compose.yml (5)
1-4: Fix grammar in heading comment.Tiny nit: “This ensure” → “This ensures”.
-# This ensure that both implementations produce identical snapshots from the same starting point. +# This ensures that both implementations produce identical snapshots from the same starting point.
45-53: Parameterize chain and harden quoting; compute tipset once.
- Keep CHAIN consistent with the snapshot fetched in init; default to calibnet to avoid mismatches.
- Compute the snapshot file and tipset once to avoid repetition and quoting issues.
- forest --chain ${CHAIN} --encrypt-keystore false --no-gc \ + SNAPSHOT="$(ls /data/*.car.zst | tail -n 1)" + TIPSET="$(printf "%s\n" "$SNAPSHOT" | grep -Eo '[0-9]+' | tail -n 1)" + forest --chain ${CHAIN:-calibnet} --encrypt-keystore false --no-gc \ --rpc-address 0.0.0.0:${FOREST_RPC_PORT} \ - --import-snapshot $(ls /data/*.car.zst | tail -n 1) \ + --import-snapshot "$SNAPSHOT" \ --import-mode=symlink & forest-cli wait-api forest-cli sync wait - forest-cli snapshot export -t=$(ls /data/*.car.zst | tail -n 1 | grep -Eo '[0-9]+' | tail -n 1) -d=${EXPORT_EPOCHS} -o /data/exported/forest.car.zst + forest-cli snapshot export -t="$TIPSET" -d="${EXPORT_EPOCHS}" -o /data/exported/forest.car.zst kill -KILL $!
77-81: Prefer graceful shutdown over SIGKILL.
kill -KILL $!is hard kill; considerTERM+waitto allow cleanup. If reliability is a concern, fall back toKILLafter a short timeout.- kill -KILL $! + kill -TERM $! + # Wait up to 10s for graceful shutdown; then force kill if still running + for i in $(seq 1 10); do kill -0 $! 2>/dev/null || break; sleep 1; done + kill -KILL $! 2>/dev/null || true
68-71: Optional: add defaults for RPC ports to reduce env coupling.If callers forget to pass FOREST_RPC_PORT/LOTUS_RPC_PORT, multiaddr envs become invalid. Safe defaults can help.
- - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT}/http + - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT:-2345}/http ... - - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT}/http + - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT:-1234}/http - - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT}/http + - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT:-1234}/http
96-110: Comparison step is solid; consider pinning zstd install to reduce flakiness.Runtime apt-get inside CI can occasionally flake. If this becomes noisy, pre-bake zstd into the Forest image or pin version. Not blocking.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
.github/workflows/snapshot-parity.yml(1 hunks)scripts/tests/snapshot_parity/docker-compose.yml(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.
🪛 actionlint (1.7.7)
.github/workflows/snapshot-parity.yml
9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: Analyze (rust)
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
.github/workflows/snapshot-parity.yml (1)
9-9: Done. I've added a search that includes hidden directories and checks for an existing ActionLint config. Let me know the output so we can confirm how many workflows reference BuildJet and whether you already have an actionlint.yaml file.scripts/tests/snapshot_parity/docker-compose.yml (2)
58-60: Sequentialization viadepends_on: { forest: service_completed_successfully }looks good.This enforces non-overlapping daemon runs, addressing OOM risk on constrained runners.
67-73: Environment cleanup on Lotus looks appropriate.Removing LOTUS_EVENTS_ENABLEACTOREVENTSAPI, LOTUS_FEVM_ENABLEETHRPC, and LOTUS_CHAININDEXER_ENABLEINDEXER aligns with a minimal export-only node. Less surface and lower memory.
c6151ba to
81bc41f
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
.github/workflows/snapshot-parity.yml (4)
9-9: Silence actionlint false positive for BuildJet runner label.actionlint flags the custom
buildjet-*label as “unknown.” Add an actionlint config to whitelist BuildJet labels to keep CI green without weakening checks.You can add an
.actionlint.yamlat repo root:self-hosted-runner: labels: - buildjet-2vcpu-ubuntu-2204 - buildjet-4vcpu-ubuntu-2204 - buildjet-8vcpu-ubuntu-2204 - buildjet-16vcpu-ubuntu-2204 - buildjet-32vcpu-ubuntu-2204Reference: actionlint’s runner-label config. (github.com)
25-30: Set least-privilege token permissions for issue creation.Explicitly grant only what’s needed so the “create an issue on failure” step works even when org-level defaults are restricted.
Apply at the job level:
snapshot-parity: name: Snapshot parity test runs-on: buildjet-8vcpu-ubuntu-2204 + permissions: + contents: read + issues: writeGitHub recommends scoping
GITHUB_TOKENpermissions; issue creation requiresissues: write. (docs.github.com, github.blog)
11-12: Pin third-party actions to a commit SHA.For supply-chain hardening, pin external actions to immutable SHAs (checkout, docker-logs, create-an-issue). Consider running an automated pinning tool across workflows.
Example (conceptual):
-uses: actions/checkout@v4 +uses: actions/checkout@<commit-sha> # v4.x -uses: jwalton/gh-docker-logs@v2 +uses: jwalton/gh-docker-logs@<commit-sha> # v2 -uses: JasonEtco/create-an-issue@v2 +uses: JasonEtco/create-an-issue@<commit-sha> # v2Rationale and tooling: pin to full SHAs; admins can even enforce SHA pinning. (nearform.com, michaelheap.com, github.blog)
Also applies to: 18-19, 25-26
6-8: Optional: add a concurrency group to prevent overlapping manual vs. scheduled runs.Prevents two parity jobs from clashing (e.g., someone triggers workflow_dispatch during the weekly run). Low maintenance and keeps logs cleaner.
Example:
jobs: snapshot-parity: name: Snapshot parity test runs-on: buildjet-8vcpu-ubuntu-2204 + concurrency: + group: snapshot-parity-${{ github.ref }} + cancel-in-progress: falseAlso applies to: 14-15
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
.github/workflows/snapshot-parity.yml(1 hunks)scripts/tests/snapshot_parity/docker-compose.yml(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/tests/snapshot_parity/docker-compose.yml
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.
🪛 actionlint (1.7.7)
.github/workflows/snapshot-parity.yml
9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: Build MacOS
- GitHub Check: All lint checks
- GitHub Check: Analyze (rust)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
.github/workflows/snapshot-parity.yml (1)
9-9: Runner upgrade LGTM; meets the 32 GB RAM target.Switching to
buildjet-8vcpu-ubuntu-2204aligns with the PR goal to get 32 GB RAM on the runner (8 vCPU/32 GB per BuildJet’s matrix). This should materially reduce snapshot OOM risk. (buildjet.com)
Summary of changes
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes #5976
Other information and links
Change checklist
Summary by CodeRabbit
Chores
Tests