Skip to content

fix: snapshot compare lotus OOM#5982

Merged
hanabi1224 merged 1 commit intomainfrom
fix-snapshot-compare
Aug 25, 2025
Merged

fix: snapshot compare lotus OOM#5982
hanabi1224 merged 1 commit intomainfrom
fix-snapshot-compare

Conversation

@LesnyRumcajs
Copy link
Copy Markdown
Member

@LesnyRumcajs LesnyRumcajs commented Aug 25, 2025

Summary of changes

Changes introduced in this pull request:

Reference issue to close (if applicable)

Closes #5976

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

  • Chores

    • CI workflow now uses a higher-capacity runner for the snapshot parity job.
  • Tests

    • Snapshot parity tests now enforce explicit service startup ordering.
    • Test environment simplified by removing obsolete environment variables and cleaning header/version declarations.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 25, 2025

Walkthrough

Updates the snapshot-parity CI workflow runner to buildjet-8vcpu-ubuntu-2204 and adjusts the snapshot parity docker-compose: enables lotus dependency on forest with service_completed_successfully, removes three LOTUS_* env vars, and removes the explicit compose version header. No other control-flow changes.

Changes

Cohort / File(s) Summary of Changes
CI runner update
.github/workflows/snapshot-parity.yml
Changed runs-on from buildjet-4vcpu-ubuntu-2204buildjet-8vcpu-ubuntu-2204.
Snapshot parity compose adjustments
scripts/tests/snapshot_parity/docker-compose.yml
Removed explicit version header; added depends_on for lotus on forest with service_completed_successfully; removed env vars LOTUS_EVENTS_ENABLEACTOREVENTSAPI, LOTUS_FEVM_ENABLEETHRPC, LOTUS_CHAININDEXER_ENABLEINDEXER.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Docker as docker-compose
    participant Forest as forest (service)
    participant Lotus as lotus (service)

    Docker->>Forest: start
    note right of Forest #D3E4CD: Run to completion (success)
    Forest-->>Docker: completed_successfully

    Docker->>Lotus: start (depends_on: forest service_completed_successfully)
    note right of Lotus #FCE7C6: Starts only after forest completes successfully
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Assessment against linked issues

Objective Addressed Explanation
Resolve snapshot parity test failure (#5976) Diff shows runner and compose tweaks only; cannot confirm these fix parity failures from code changes alone.

Possibly related issues

Possibly related PRs

Suggested reviewers

  • hanabi1224
  • elmattic
  • akaladarshi

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-snapshot-compare

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@LesnyRumcajs LesnyRumcajs changed the title fix: snapshot compare lotus killed [skip ci] fix: snapshot compare lotus OOM Aug 25, 2025
@LesnyRumcajs LesnyRumcajs marked this pull request as ready for review August 25, 2025 10:35
@LesnyRumcajs LesnyRumcajs requested a review from a team as a code owner August 25, 2025 10:35
@LesnyRumcajs LesnyRumcajs requested review from elmattic and hanabi1224 and removed request for a team August 25, 2025 10:35
@LesnyRumcajs LesnyRumcajs force-pushed the fix-snapshot-compare branch 3 times, most recently from 5f2eb13 to c6151ba Compare August 25, 2025 10:37
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/tests/snapshot_parity/docker-compose.yml (1)

21-26: Init step can fail under set -e when no snapshot exists (first run).

Using $(ls /data/*.car.zst) inside the test will exit non‑zero when the glob has no matches, and with set -e that aborts the script before the if executes. This would break fresh runs.

Use a glob-safe existence check:

-        # if there are some files in the data directory, then we don't need to fetch the snapshot
-        if [ "$$(ls -A /data/*.car.zst)" ]; then
-          echo "Snapshot already fetched"
-        else
-          forest-tool snapshot fetch --chain calibnet -d /data
-        fi
+        # If a snapshot is already present, skip fetching; otherwise, fetch it.
+        if compgen -G "/data/*.car.zst" > /dev/null; then
+          echo "Snapshot already fetched"
+        else
+          forest-tool snapshot fetch --chain ${CHAIN:-calibnet} -d /data
+        fi
🧹 Nitpick comments (7)
.github/workflows/snapshot-parity.yml (2)

15-15: Re-evaluate timeout after sequentialization.

Now that Forest and Lotus run sequentially (longer wall time), 60 minutes might be tight. Consider 90–120 minutes unless you have recent runs confirming sub-60 min completion.

Apply:

-        timeout-minutes: 60
+        timeout-minutes: 120

6-8: Avoid overlapping manual runs with a small concurrency guard.

Weekly schedule won’t overlap, but workflow_dispatch can. Add a simple concurrency group.

Apply at workflow root (top-level, alongside “on” and “jobs”):

 name: Snapshot parity test
 on:
   workflow_dispatch:
   schedule:
     - cron: "0 0 * * 0" # Runs at 00:00, only on Sunday
+concurrency:
+  group: snapshot-parity
+  cancel-in-progress: false
 jobs:
   snapshot-parity:
scripts/tests/snapshot_parity/docker-compose.yml (5)

1-4: Fix grammar in heading comment.

Tiny nit: “This ensure” → “This ensures”.

-# This ensure that both implementations produce identical snapshots from the same starting point.
+# This ensures that both implementations produce identical snapshots from the same starting point.

45-53: Parameterize chain and harden quoting; compute tipset once.

  • Keep CHAIN consistent with the snapshot fetched in init; default to calibnet to avoid mismatches.
  • Compute the snapshot file and tipset once to avoid repetition and quoting issues.
-        forest --chain ${CHAIN} --encrypt-keystore false --no-gc \
+        SNAPSHOT="$(ls /data/*.car.zst | tail -n 1)"
+        TIPSET="$(printf "%s\n" "$SNAPSHOT" | grep -Eo '[0-9]+' | tail -n 1)"
+        forest --chain ${CHAIN:-calibnet} --encrypt-keystore false --no-gc \
           --rpc-address 0.0.0.0:${FOREST_RPC_PORT} \
-          --import-snapshot $(ls /data/*.car.zst | tail -n 1) \
+          --import-snapshot "$SNAPSHOT" \
           --import-mode=symlink &
         forest-cli wait-api
         forest-cli sync wait
-        forest-cli snapshot export -t=$(ls /data/*.car.zst | tail -n 1 | grep -Eo '[0-9]+' | tail -n 1) -d=${EXPORT_EPOCHS} -o /data/exported/forest.car.zst
+        forest-cli snapshot export -t="$TIPSET" -d="${EXPORT_EPOCHS}" -o /data/exported/forest.car.zst
         kill -KILL $!

77-81: Prefer graceful shutdown over SIGKILL.

kill -KILL $! is hard kill; consider TERM + wait to allow cleanup. If reliability is a concern, fall back to KILL after a short timeout.

-        kill -KILL $!
+        kill -TERM $!
+        # Wait up to 10s for graceful shutdown; then force kill if still running
+        for i in $(seq 1 10); do kill -0 $! 2>/dev/null || break; sleep 1; done
+        kill -KILL $! 2>/dev/null || true

68-71: Optional: add defaults for RPC ports to reduce env coupling.

If callers forget to pass FOREST_RPC_PORT/LOTUS_RPC_PORT, multiaddr envs become invalid. Safe defaults can help.

-      - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT}/http
+      - FULLNODE_API_INFO=/dns/forest/tcp/${FOREST_RPC_PORT:-2345}/http
...
-      - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT}/http
+      - LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/${LOTUS_RPC_PORT:-1234}/http
-      - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT}/http
+      - FULLNODE_API_INFO=/dns/lotus/tcp/${LOTUS_RPC_PORT:-1234}/http

96-110: Comparison step is solid; consider pinning zstd install to reduce flakiness.

Runtime apt-get inside CI can occasionally flake. If this becomes noisy, pre-bake zstd into the Forest image or pin version. Not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4b28817 and c6151ba.

📒 Files selected for processing (2)
  • .github/workflows/snapshot-parity.yml (1 hunks)
  • scripts/tests/snapshot_parity/docker-compose.yml (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.
🪛 actionlint (1.7.7)
.github/workflows/snapshot-parity.yml

9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
  • GitHub Check: tests-release
  • GitHub Check: tests
  • GitHub Check: Analyze (rust)
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
.github/workflows/snapshot-parity.yml (1)

9-9: Done. I've added a search that includes hidden directories and checks for an existing ActionLint config. Let me know the output so we can confirm how many workflows reference BuildJet and whether you already have an actionlint.yaml file.

scripts/tests/snapshot_parity/docker-compose.yml (2)

58-60: Sequentialization via depends_on: { forest: service_completed_successfully } looks good.

This enforces non-overlapping daemon runs, addressing OOM risk on constrained runners.


67-73: Environment cleanup on Lotus looks appropriate.

Removing LOTUS_EVENTS_ENABLEACTOREVENTSAPI, LOTUS_FEVM_ENABLEETHRPC, and LOTUS_CHAININDEXER_ENABLEINDEXER aligns with a minimal export-only node. Less surface and lower memory.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
.github/workflows/snapshot-parity.yml (4)

9-9: Silence actionlint false positive for BuildJet runner label.

actionlint flags the custom buildjet-* label as “unknown.” Add an actionlint config to whitelist BuildJet labels to keep CI green without weakening checks.

You can add an .actionlint.yaml at repo root:

self-hosted-runner:
  labels:
    - buildjet-2vcpu-ubuntu-2204
    - buildjet-4vcpu-ubuntu-2204
    - buildjet-8vcpu-ubuntu-2204
    - buildjet-16vcpu-ubuntu-2204
    - buildjet-32vcpu-ubuntu-2204

Reference: actionlint’s runner-label config. (github.com)


25-30: Set least-privilege token permissions for issue creation.

Explicitly grant only what’s needed so the “create an issue on failure” step works even when org-level defaults are restricted.

Apply at the job level:

   snapshot-parity:
     name: Snapshot parity test
     runs-on: buildjet-8vcpu-ubuntu-2204
+    permissions:
+      contents: read
+      issues: write

GitHub recommends scoping GITHUB_TOKEN permissions; issue creation requires issues: write. (docs.github.com, github.blog)


11-12: Pin third-party actions to a commit SHA.

For supply-chain hardening, pin external actions to immutable SHAs (checkout, docker-logs, create-an-issue). Consider running an automated pinning tool across workflows.

Example (conceptual):

-uses: actions/checkout@v4
+uses: actions/checkout@<commit-sha> # v4.x

-uses: jwalton/gh-docker-logs@v2
+uses: jwalton/gh-docker-logs@<commit-sha> # v2

-uses: JasonEtco/create-an-issue@v2
+uses: JasonEtco/create-an-issue@<commit-sha> # v2

Rationale and tooling: pin to full SHAs; admins can even enforce SHA pinning. (nearform.com, michaelheap.com, github.blog)

Also applies to: 18-19, 25-26


6-8: Optional: add a concurrency group to prevent overlapping manual vs. scheduled runs.

Prevents two parity jobs from clashing (e.g., someone triggers workflow_dispatch during the weekly run). Low maintenance and keeps logs cleaner.

Example:

 jobs:
   snapshot-parity:
     name: Snapshot parity test
     runs-on: buildjet-8vcpu-ubuntu-2204
+    concurrency:
+      group: snapshot-parity-${{ github.ref }}
+      cancel-in-progress: false

Also applies to: 14-15

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c6151ba and 81bc41f.

📒 Files selected for processing (2)
  • .github/workflows/snapshot-parity.yml (1 hunks)
  • scripts/tests/snapshot_parity/docker-compose.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/tests/snapshot_parity/docker-compose.yml
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: LesnyRumcajs
PR: ChainSafe/forest#5907
File: src/rpc/methods/state.rs:523-570
Timestamp: 2025-08-06T15:44:33.467Z
Learning: LesnyRumcajs prefers to rely on BufWriter's Drop implementation for automatic flushing rather than explicit flush() calls in Forest codebase.
🪛 actionlint (1.7.7)
.github/workflows/snapshot-parity.yml

9-9: label "buildjet-8vcpu-ubuntu-2204" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests-release
  • GitHub Check: tests
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build Ubuntu
  • GitHub Check: Build MacOS
  • GitHub Check: All lint checks
  • GitHub Check: Analyze (rust)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
.github/workflows/snapshot-parity.yml (1)

9-9: Runner upgrade LGTM; meets the 32 GB RAM target.

Switching to buildjet-8vcpu-ubuntu-2204 aligns with the PR goal to get 32 GB RAM on the runner (8 vCPU/32 GB per BuildJet’s matrix). This should materially reduce snapshot OOM risk. (buildjet.com)

@hanabi1224 hanabi1224 enabled auto-merge August 25, 2025 11:19
@hanabi1224 hanabi1224 added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit d44ab1a Aug 25, 2025
45 checks passed
@hanabi1224 hanabi1224 deleted the fix-snapshot-compare branch August 25, 2025 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[automated] Snapshot parity test failure

3 participants