fix: snapshot compare lotus OOM by LesnyRumcajs · Pull Request #5982 · ChainSafe/forest

LesnyRumcajs · 2025-08-25T09:39:30Z

Summary of changes

Changes introduced in this pull request:

run snapshot export in sequence - we don't need speed here; it's done once a week on Sunday.
removed useless Lotus env variables - no need for FEVM or event indexing for a snapshot export.
upgraded the buildjet to a 32G RAM runner - it's run seldom, so it's fine.
proof of work https://github.com/ChainSafe/forest/actions/workflows/snapshot-parity.yml?query=branch%3Afix-snapshot-compare - making the export sequential didn't help, hence the one failed run.

Reference issue to close (if applicable)

Closes #5976

Change checklist

I have performed a self-review of my own code,
I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
I have added tests that prove my fix is effective or that my feature works (if possible),
I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

Chores
- CI workflow now uses a higher-capacity runner for the snapshot parity job.
Tests
- Snapshot parity tests now enforce explicit service startup ordering.
- Test environment simplified by removing obsolete environment variables and cleaning header/version declarations.

coderabbitai · 2025-08-25T09:39:38Z

Walkthrough

Updates the snapshot-parity CI workflow runner to buildjet-8vcpu-ubuntu-2204 and adjusts the snapshot parity docker-compose: enables lotus dependency on forest with service_completed_successfully, removes three LOTUS_* env vars, and removes the explicit compose version header. No other control-flow changes.

Changes

Cohort / File(s)	Summary of Changes
CI runner update `.github/workflows/snapshot-parity.yml`	Changed `runs-on` from `buildjet-4vcpu-ubuntu-2204` → `buildjet-8vcpu-ubuntu-2204`.
Snapshot parity compose adjustments `scripts/tests/snapshot_parity/docker-compose.yml`	Removed explicit `version` header; added `depends_on` for `lotus` on `forest` with `service_completed_successfully`; removed env vars `LOTUS_EVENTS_ENABLEACTOREVENTSAPI`, `LOTUS_FEVM_ENABLEETHRPC`, `LOTUS_CHAININDEXER_ENABLEINDEXER`.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Docker as docker-compose
    participant Forest as forest (service)
    participant Lotus as lotus (service)

    Docker->>Forest: start
    note right of Forest #D3E4CD: Run to completion (success)
    Forest-->>Docker: completed_successfully

    Docker->>Lotus: start (depends_on: forest service_completed_successfully)
    note right of Lotus #FCE7C6: Starts only after forest completes successfully

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Resolve snapshot parity test failure (#5976)	❓	Diff shows runner and compose tweaks only; cannot confirm these fix parity failures from code changes alone.

Possibly related issues

RPC parity check fails periodically after updating running environment tobuildjet #5905 — Also changes CI runner to buildjet-8vcpu-ubuntu-2204; relevant if runner change affected parity results.

Possibly related PRs

chore: use buildjet ubuntu in calibnet rpc check ci #5878 — Workflow runner update to buildjet-8vcpu-ubuntu-2204, directly related to the runner change in this PR.

Suggested reviewers

hanabi1224
elmattic
akaladarshi

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix-snapshot-compare

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/tests/snapshot_parity/docker-compose.yml (1)

21-26: Init step can fail under set -e when no snapshot exists (first run).

Using $(ls /data/*.car.zst) inside the test will exit non‑zero when the glob has no matches, and with set -e that aborts the script before the if executes. This would break fresh runs.

Use a glob-safe existence check:

-        # if there are some files in the data directory, then we don't need to fetch the snapshot
-        if [ "$$(ls -A /data/*.car.zst)" ]; then
-          echo "Snapshot already fetched"
-        else
-          forest-tool snapshot fetch --chain calibnet -d /data
-        fi
+        # If a snapshot is already present, skip fetching; otherwise, fetch it.
+        if compgen -G "/data/*.car.zst" > /dev/null; then
+          echo "Snapshot already fetched"
+        else
+          forest-tool snapshot fetch --chain ${CHAIN:-calibnet} -d /data
+        fi

🧹 Nitpick comments (7)

.github/workflows/snapshot-parity.yml (2)
15-15: Re-evaluate timeout after sequentialization.

Now that Forest and Lotus run sequentially (longer wall time), 60 minutes might be tight. Consider 90–120 minutes unless you have recent runs confirming sub-60 min completion.

Apply:
-        timeout-minutes: 60
+        timeout-minutes: 120
6-8: Avoid overlapping manual runs with a small concurrency guard.

Weekly schedule won’t overlap, but workflow_dispatch can. Add a simple concurrency group.

Apply at workflow root (top-level, alongside “on” and “jobs”):
 name: Snapshot parity test
 on:
   workflow_dispatch:
   schedule:
     - cron: "0 0 * * 0" # Runs at 00:00, only on Sunday
+concurrency:
+  group: snapshot-parity
+  cancel-in-progress: false
 jobs:
   snapshot-parity:
scripts/tests/snapshot_parity/docker-compose.yml (5)
1-4: Fix grammar in heading comment.

Tiny nit: “This ensure” → “This ensures”.
-# This ensure that both implementations produce identical snapshots from the same starting point.
+# This ensures that both implementations produce identical snapshots from the same starting point.
45-53: Parameterize chain and harden quoting; compute tipset once.

Keep CHAIN consistent with the snapshot fetched in init; default to calibnet to avoid mismatches.

Compute the snapshot file and tipset once to avoid repetition and quoting issues.
-        forest --chain ${CHAIN} --encrypt-keystore false --no-gc \
+        SNAPSHOT="$(ls /data/*.car.zst | tail -n 1)"
+        TIPSET="$(printf "%s\n" "$SNAPSHOT" | grep -Eo '[0-9]+' | tail -n 1)"
+        forest --chain ${CHAIN:-calibnet} --encrypt-keystore false --no-gc \
           --rpc-address 0.0.0.0:${FOREST_RPC_PORT} \
-          --import-snapshot $(ls /data/*.car.zst | tail -n 1) \
+          --import-snapshot "$SNAPSHOT" \
           --import-mode=symlink &
         forest-cli wait-api
         forest-cli sync wait
-        forest-cli snapshot export -t=$(ls /data/*.car.zst | tail -n 1 | grep -Eo '[0-9]+' | tail -n 1) -d=${EXPORT_EPOCHS} -o /data/exported/forest.car.zst
+        forest-cli snapshot export -t="$TIPSET" -d="${EXPORT_EPOCHS}" -o /data/exported/forest.car.zst
         kill -KILL $!
77-81: Prefer graceful shutdown over SIGKILL.

kill -KILL $! is hard kill; consider TERM + wait to allow cleanup. If reliability is a concern, fall back to KILL after a short timeout.
-        kill -KILL $!
+        kill -TERM $!
+        # Wait up to 10s for graceful shutdown; then force kill if still running
+        for i in $(seq 1 10); do kill -0 $! 2>/dev/null || break; sleep 1; done
+        kill -KILL $! 2>/dev/null || true
68-71: Optional: add defaults for RPC ports to reduce env coupling.

If callers forget to pass FOREST_RPC_PORT/LOTUS_RPC_PORT, multiaddr envs become invalid. Safe defaults can help.
-      - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT}/http
+      - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT:-2345}/http
...
-      - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT}/http
+      - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT:-1234}/http
-      - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT}/http
+      - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT:-1234}/http
96-110: Comparison step is solid; consider pinning zstd install to reduce flakiness.

Runtime apt-get inside CI can occasionally flake. If this becomes noisy, pre-bake zstd into the Forest image or pin version. Not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4b28817 and c6151ba.

📒 Files selected for processing (2)

.github/workflows/snapshot-parity.yml (1 hunks)
scripts/tests/snapshot_parity/docker-compose.yml (2 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.

🪛 actionlint (1.7.7)

.github/workflows/snapshot-parity.yml

9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: Build MacOS
GitHub Check: Build Ubuntu
GitHub Check: cargo-publish-dry-run
GitHub Check: Build forest binaries on Linux AMD64
GitHub Check: All lint checks
GitHub Check: tests-release
GitHub Check: tests
GitHub Check: Analyze (rust)
GitHub Check: Analyze (go)

🔇 Additional comments (3)

.github/workflows/snapshot-parity.yml (1)

9-9: Done. I've added a search that includes hidden directories and checks for an existing ActionLint config. Let me know the output so we can confirm how many workflows reference BuildJet and whether you already have an actionlint.yaml file.

scripts/tests/snapshot_parity/docker-compose.yml (2)

58-60: Sequentialization via depends_on: { forest: service_completed_successfully } looks good.

This enforces non-overlapping daemon runs, addressing OOM risk on constrained runners.

67-73: Environment cleanup on Lotus looks appropriate.

Removing LOTUS_EVENTS_ENABLEACTOREVENTSAPI, LOTUS_FEVM_ENABLEETHRPC, and LOTUS_CHAININDEXER_ENABLEINDEXER aligns with a minimal export-only node. Less surface and lower memory.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

.github/workflows/snapshot-parity.yml (4)
9-9: Silence actionlint false positive for BuildJet runner label.

actionlint flags the custom buildjet-* label as “unknown.” Add an actionlint config to whitelist BuildJet labels to keep CI green without weakening checks.

You can add an .actionlint.yaml at repo root:
self-hosted-runner:
  labels:
    - buildjet-2vcpu-ubuntu-2204
    - buildjet-4vcpu-ubuntu-2204
    - buildjet-8vcpu-ubuntu-2204
    - buildjet-16vcpu-ubuntu-2204
    - buildjet-32vcpu-ubuntu-2204
Reference: actionlint’s runner-label config. (github.com)

25-30: Set least-privilege token permissions for issue creation.

Explicitly grant only what’s needed so the “create an issue on failure” step works even when org-level defaults are restricted.

Apply at the job level:
   snapshot-parity:
     name: Snapshot parity test
     runs-on: buildjet-8vcpu-ubuntu-2204
+    permissions:
+      contents: read
+      issues: write
GitHub recommends scoping GITHUB_TOKEN permissions; issue creation requires issues: write. (docs.github.com, github.blog)

11-12: Pin third-party actions to a commit SHA.

For supply-chain hardening, pin external actions to immutable SHAs (checkout, docker-logs, create-an-issue). Consider running an automated pinning tool across workflows.

Example (conceptual):
-uses: actions/checkout@v4
+uses: actions/checkout@<commit-sha> # v4.x

-uses: jwalton/gh-docker-logs@v2
+uses: jwalton/gh-docker-logs@<commit-sha> # v2

-uses: JasonEtco/create-an-issue@v2
+uses: JasonEtco/create-an-issue@<commit-sha> # v2
Rationale and tooling: pin to full SHAs; admins can even enforce SHA pinning. (nearform.com, michaelheap.com, github.blog)

Also applies to: 18-19, 25-26

6-8: Optional: add a concurrency group to prevent overlapping manual vs. scheduled runs.

Prevents two parity jobs from clashing (e.g., someone triggers workflow_dispatch during the weekly run). Low maintenance and keeps logs cleaner.

Example:
 jobs:
   snapshot-parity:
     name: Snapshot parity test
     runs-on: buildjet-8vcpu-ubuntu-2204
+    concurrency:
+      group: snapshot-parity-${{ github.ref }}
+      cancel-in-progress: false
Also applies to: 14-15

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c6151ba and 81bc41f.

📒 Files selected for processing (2)

.github/workflows/snapshot-parity.yml (1 hunks)
scripts/tests/snapshot_parity/docker-compose.yml (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/tests/snapshot_parity/docker-compose.yml

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.

🪛 actionlint (1.7.7)

.github/workflows/snapshot-parity.yml

9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: Build forest binaries on Linux AMD64
GitHub Check: tests-release
GitHub Check: tests
GitHub Check: cargo-publish-dry-run
GitHub Check: Build Ubuntu
GitHub Check: Build MacOS
GitHub Check: All lint checks
GitHub Check: Analyze (rust)
GitHub Check: Analyze (go)

🔇 Additional comments (1)

.github/workflows/snapshot-parity.yml (1)

9-9: Runner upgrade LGTM; meets the 32 GB RAM target.

Switching to buildjet-8vcpu-ubuntu-2204 aligns with the PR goal to get 32 GB RAM on the runner (8 vCPU/32 GB per BuildJet’s matrix). This should materially reduce snapshot OOM risk. (buildjet.com)

LesnyRumcajs changed the title ~~fix: snapshot compare lotus killed [skip ci]~~ fix: snapshot compare lotus OOM Aug 25, 2025

LesnyRumcajs marked this pull request as ready for review August 25, 2025 10:35

LesnyRumcajs requested a review from a team as a code owner August 25, 2025 10:35

LesnyRumcajs requested review from elmattic and hanabi1224 and removed request for a team August 25, 2025 10:35

LesnyRumcajs force-pushed the fix-snapshot-compare branch 3 times, most recently from 5f2eb13 to c6151ba Compare August 25, 2025 10:37

coderabbitai Bot reviewed Aug 25, 2025

View reviewed changes

fix: snapshot compare lotus killed

81bc41f

LesnyRumcajs force-pushed the fix-snapshot-compare branch from c6151ba to 81bc41f Compare August 25, 2025 10:45

coderabbitai Bot reviewed Aug 25, 2025

View reviewed changes

hanabi1224 approved these changes Aug 25, 2025

View reviewed changes

hanabi1224 enabled auto-merge August 25, 2025 11:19

elmattic approved these changes Aug 25, 2025

View reviewed changes

hanabi1224 added this pull request to the merge queue Aug 25, 2025

Merged via the queue into main with commit d44ab1a Aug 25, 2025
45 checks passed

hanabi1224 deleted the fix-snapshot-compare branch August 25, 2025 12:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: snapshot compare lotus OOM#5982

fix: snapshot compare lotus OOM#5982
hanabi1224 merged 1 commit intomainfrom
fix-snapshot-compare

LesnyRumcajs commented Aug 25, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot commented Aug 25, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

LesnyRumcajs commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of changes

Reference issue to close (if applicable)

Other information and links

Change checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Assessment against linked issues

Possibly related issues

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LesnyRumcajs commented Aug 25, 2025 •

edited

Loading

coderabbitai Bot commented Aug 25, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)