fix(run-engine): prevent stalled SUSPENDED snapshots from endlessly retrying #2448

ericallam · 2025-08-28T12:43:01Z

No description provided.

changeset-bot · 2025-08-28T12:43:06Z

⚠️ No Changeset found

Latest commit: da33d99

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2025-08-28T12:43:10Z

Walkthrough

Adds configurable suspended-heartbeat retry settings via four new environment variables and wires them into RunEngine initialization as suspendedHeartbeatRetriesConfig. Implements exponential backoff retrying for SUSPENDED snapshots in engine/index.ts, passing delayMs and restartAttempt to ExecutionSnapshotSystem.restartHeartbeatForRun (signature updated) and scheduling heartbeats using availableAt. Refactors waitpointSystem to return a structured WaitpointContinuationResult instead of string statuses, expanding fetched waitpoint data. Updates workerCatalog heartbeatSnapshot schema to include optional restartAttempt. Tests add scenarios for manual waitpoint vs run waitpoint behavior and cover the new retry configuration and flow.
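
To make the backoff described in the walkthrough concrete, here is a minimal sketch of how a retry delay could be derived from the four settings. The type name, field names, and the `nextRetryDelayMs` helper are assumptions modeled on the environment variable names, not the engine's actual code. The sketch also assumes that attempts are 1-based and that `maxCount` is an inclusive cap on attempts, which the review below flags as ambiguous.

```typescript
// Hypothetical config shape; field names are assumptions based on the env variable names.
type SuspendedHeartbeatRetriesConfig = {
  maxCount: number;
  maxDelayMs: number;
  initialDelayMs: number;
  factor: number;
};

// Values mirror the env.server.ts defaults quoted in the review.
const defaultRetries: SuspendedHeartbeatRetriesConfig = {
  maxCount: 12,
  maxDelayMs: 60_000 * 60 * 6, // 6 hours
  initialDelayMs: 60_000, // 1 minute
  factor: 2,
};

// Returns the delay before the given 1-based restart attempt,
// or undefined once the attempt number exceeds maxCount (assumed inclusive).
function nextRetryDelayMs(
  config: SuspendedHeartbeatRetriesConfig,
  restartAttempt: number
): number | undefined {
  if (restartAttempt > config.maxCount) {
    return undefined;
  }
  // Exponential growth: initialDelayMs * factor^(attempt - 1), capped at maxDelayMs.
  const uncapped = config.initialDelayMs * Math.pow(config.factor, restartAttempt - 1);
  return Math.min(uncapped, config.maxDelayMs);
}
```

Under these assumptions and the default values, the first retry waits one minute, each later retry doubles the wait, and the delay is capped at six hours from the tenth attempt onward.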

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (7)

internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1)

399-406: Validate inputs: clamp delayMs and log restartAttempt.
Prevent negative scheduling and include attempt in logs for observability.

   public async restartHeartbeatForRun({
     runId,
     delayMs,
     restartAttempt,
     tx,
   }: {
     runId: string;
     delayMs: number;
     restartAttempt: number;
     tx?: PrismaClientOrTransaction;
   }): Promise<ExecutionResult> {
     const prisma = tx ?? this.$.prisma;
 
     const latestSnapshot = await getLatestExecutionSnapshot(prisma, runId);
 
+    const safeDelayMs = Math.max(0, delayMs);
     this.$.logger.debug("restartHeartbeatForRun: enqueuing heartbeat", {
       runId,
       snapshotId: latestSnapshot.id,
-      delayMs,
+      delayMs: safeDelayMs,
+      restartAttempt,
     });
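
A minimal sketch of the clamping idea in isolation, assuming the heartbeat job is scheduled through an availableAt timestamp as the walkthrough describes. `heartbeatAvailableAt` is a hypothetical helper for illustration, not a function in the run engine. The extra guard against non-finite input is also an addition beyond the suggestion above.

```typescript
// Hypothetical helper: converts a requested delay into the time at which
// the heartbeat job should become available to the worker.
function heartbeatAvailableAt(delayMs: number, now: Date = new Date()): Date {
  // Clamp negative delays to zero so the job is never scheduled in the past.
  // Non-finite values (NaN, Infinity) are also treated as zero; this guard is an assumption.
  const safeDelayMs = Number.isFinite(delayMs) ? Math.max(0, delayMs) : 0;
  return new Date(now.getTime() + safeDelayMs);
}
```

Passing `now` explicitly keeps the helper deterministic in tests; a negative `delayMs` then resolves to `now` itself.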

apps/webapp/app/env.server.ts (1)

522-529: Tighten validation to prevent misconfiguration.
Ensure sane ranges to avoid zero/negative or extreme values that could starve or flood retries.

-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().default(12),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
-    .number()
-    .int()
-    .default(60_000 * 60 * 6),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce.number().int().default(60_000),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().default(2),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().min(1).default(12),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
+    .number()
+    .int()
+    .min(1_000)
+    .default(60_000 * 60 * 6),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce
+    .number()
+    .int()
+    .min(1_000)
+    .default(60_000),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().min(1).default(2),

internal-packages/run-engine/src/engine/index.ts (2)

1164-1170: Persisted restartAttempt: clarify semantics and rename local var for readability

The new optional restartAttempt plumbs through correctly. Two asks:

Clarify whether maxCount is “max retries” or “max attempts including first retry” to avoid confusion downstream.

Rename the local $restartAttempt to nextRestartAttempt for clarity and to avoid unconventional $ prefix.

     restartAttempt?: number;
-              const $restartAttempt = (restartAttempt ?? 0) + 1; // Start at 1
+              const nextRestartAttempt = (restartAttempt ?? 0) + 1; // Start at 1

1294-1301: Log volume: avoid dumping full result objects at info level

result can include arrays of waitpoints; consider logging counts/types instead to keep logs lean at scale.
-          this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
-            runId,
-            result,
-            snapshotId: latestSnapshot.id,
-          });
+          this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
+            runId,
+            snapshotId: latestSnapshot.id,
+            status: result.status,
+            waitpointCount: result.status === "blocked" ? result.waitpoints.length : 0,
+            waitpointTypes: result.status === "blocked" ? Array.from(new Set(result.waitpoints.map(w => w.type))) : [],
+          });

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

630-634: PENDING_CANCEL grouped with FINISHED but reason says “run is finished”

Either separate the branches or make the reason reflect the actual status to aid debugging.

-          return {
-            status: "skipped",
-            reason: "run is finished",
-          };
+          return {
+            status: "skipped",
+            reason: `run is ${snapshot.executionStatus.toLowerCase()}`,
+          };

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)

605-608: Unify checkpoint source for clarity

Elsewhere you use blockedExecutionData.snapshot.id; here you use blockedResult.id. They should be the same, but let’s standardize for readability.

-          snapshotId: blockedResult.id,
+          snapshotId: blockedExecutionData!.snapshot.id,

796-799: Fix misleading comment: run transitions to QUEUED due to scheduled restart, not because we “don’t restart”

The expectation is QUEUED; update the comment to reflect that the previously scheduled restart processes the cleared waitpoints and moves the run forward.

-      // We don't restart the heartbeat because there are no run or batch waitpoints
+      // The previously scheduled heartbeat restart runs, sees no blocking waitpoints, and re-queues the run

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e6f6d93 and da33d99.

📒 Files selected for processing (8)

apps/webapp/app/env.server.ts (1 hunks)
apps/webapp/app/v3/runEngine.server.ts (1 hunks)
internal-packages/run-engine/src/engine/index.ts (2 hunks)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1 hunks)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (11 hunks)
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (4 hunks)
internal-packages/run-engine/src/engine/types.ts (1 hunks)
internal-packages/run-engine/src/engine/workerCatalog.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

We use zod a lot in packages/core and in the webapp

Files:

apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/env.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: In the webapp, all environment variables must be accessed through the env export of env.server.ts, instead of directly accessing process.env.
When importing from @trigger.dev/core in the webapp, never import from the root @trigger.dev/core path; always use one of the subpath exports as defined in the package's package.json.

Files:

apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/env.server.ts

**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Our tests are all vitest

Files:

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts

🧬 Code graph analysis (2)

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)

internal-packages/testcontainers/src/index.ts (3)

containerTest (233-241)

prisma (91-112)

redisOptions (132-165)

internal-packages/run-engine/src/engine/tests/setup.ts (2)

setupAuthenticatedEnvironment (21-82)

setupBackgroundWorker (84-293)

internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (2)

packages/core/src/v3/schemas/runEngine.ts (2)

ExecutionResult (123-126)

ExecutionResult (128-128)

internal-packages/run-engine/src/engine/index.ts (1)

runId (1161-1411)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)

GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: typecheck / typecheck
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (10)

internal-packages/run-engine/src/engine/types.ts (1)

74-79: LGTM: new suspendedHeartbeatRetriesConfig option is well-scoped and backward-compatible.

apps/webapp/app/v3/runEngine.server.ts (1)

110-115: LGTM: wiring env → suspendedHeartbeatRetriesConfig is correct and follows env.server.ts.

internal-packages/run-engine/src/engine/index.ts (2)

1302-1334: Good guardrails to prevent endless retries on SUSPENDED

Early exits on missing config, no waitpoints, or no RUN/BATCH waitpoints look correct and align with the PR objective of stopping futile retries. Nice.
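
The early exits listed above can be sketched as a single decision function. This is an illustration under stated assumptions, not the code in engine/index.ts. `decideSuspendedRetry`, `RetryDecision`, and the reason strings are hypothetical, the waitpoint type names other than RUN and BATCH are assumed, and the final check treats maxCount as an inclusive cap on attempts.

```typescript
// Assumed set of waitpoint types; only RUN and BATCH are named in the review.
type WaitpointType = "RUN" | "BATCH" | "MANUAL" | "DATETIME";

// Hypothetical result type for this sketch.
type RetryDecision =
  | { action: "stop"; reason: string }
  | { action: "retry"; nextRestartAttempt: number };

function decideSuspendedRetry(args: {
  config?: { maxCount: number };
  waitpoints: { type: WaitpointType }[];
  restartAttempt?: number;
}): RetryDecision {
  const { config, waitpoints, restartAttempt } = args;

  // Guard 1: retries are opt-in; without config nothing is rescheduled.
  if (!config) {
    return { action: "stop", reason: "no retry config" };
  }
  // Guard 2: nothing is blocking the run, so there is nothing to wait for.
  if (waitpoints.length === 0) {
    return { action: "stop", reason: "no waitpoints" };
  }
  // Guard 3: only RUN and BATCH waitpoints justify another heartbeat.
  const hasRunOrBatch = waitpoints.some((w) => w.type === "RUN" || w.type === "BATCH");
  if (!hasRunOrBatch) {
    return { action: "stop", reason: "no RUN/BATCH waitpoints" };
  }
  // Attempts start at 1, matching the "Start at 1" comment in the diff.
  const nextRestartAttempt = (restartAttempt ?? 0) + 1;
  // Assumption: maxCount is an inclusive cap on attempts.
  if (nextRestartAttempt > config.maxCount) {
    return { action: "stop", reason: "max attempts reached" };
  }
  return { action: "retry", nextRestartAttempt };
}
```

Keeping the guards in one pure function makes each exit path easy to test without a running engine, for example the manual-waitpoint case where the run is expected to stay SUSPENDED.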

1376-1382: restartAttempt correctly round-tripped
restartAttempt is passed through to the heartbeatSnapshot payload in ExecutionSnapshotSystem.restartHeartbeatForRun (see executionSnapshotSystem.ts:421). No further changes needed.

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (4)

24-39: Typed continuation result: good abstraction

The structured WaitpointContinuationResult and narrowed waitpoint payload shape improve clarity and reduce overfetch. Nice.
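
For readers without the diff open, a structured result of this kind might look like the sketch below. The "blocked" and "skipped" statuses, the reason field, and the id/status/type/completedAfter fields come from the comments in this review. The "unblocked" status name, the exact field types, and the `describeContinuation` helper are assumptions.

```typescript
// Narrowed waitpoint payload; field types are assumptions.
type WaitpointSummary = {
  id: string;
  status: string;
  type: string;
  completedAfter: Date | null;
};

// Discriminated union replacing plain string statuses.
type WaitpointContinuationResult =
  | { status: "unblocked"; waitpoints: WaitpointSummary[] }
  | { status: "blocked"; waitpoints: WaitpointSummary[] }
  | { status: "skipped"; reason: string };

// Hypothetical helper: the compiler narrows `result` in each branch,
// so `waitpoints` and `reason` are only accessible where they exist.
function describeContinuation(result: WaitpointContinuationResult): string {
  switch (result.status) {
    case "unblocked":
      return `unblocked by ${result.waitpoints.length} completed waitpoint(s)`;
    case "blocked": {
      const types = result.waitpoints.map((w) => w.type);
      // Deduplicate while preserving first-seen order.
      const unique = types.filter((t, i) => types.indexOf(t) === i);
      return `blocked on ${result.waitpoints.length} waitpoint(s) of type ${unique.join(", ")}`;
    }
    case "skipped":
      return `skipped: ${result.reason}`;
  }
}
```

Compared with returning a bare string, the union lets a caller such as the stalled-snapshot handler inspect the blocking waitpoints without a second query.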

515-516: Right-sized select for waitpoint fields

Selecting only id/status/type/completedAfter is appropriate for the new result type.

527-531: Blocked-path return is correct

Returning the current waitpoints when still blocked makes the caller’s decision-making straightforward.

734-737: Unblocked-path return is consistent and useful

Returning the completed waitpoints provides context to the caller without extra reads. LGTM.

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)

639-642: Manual waitpoint path assertion matches design

As intended, no RUN/BATCH waitpoints means no scheduled restart; run stays SUSPENDED. Good negative test.

684-690: Config-driven timing makes the test stable

Using heartbeatTimeout for initialDelay/maxDelay ensures the retry lands within the test window. Good.

fix(run-engine): prevent stalled SUSPENDED snapshots from endlessly r…

da33d99

…etrying

coderabbitai bot reviewed Aug 28, 2025

matt-aitken approved these changes Aug 28, 2025

nicktrn approved these changes Aug 28, 2025

matt-aitken merged commit b0b0df6 into main Aug 28, 2025
31 checks passed

matt-aitken deleted the fix/stalled-suspended-snapshot-backoff branch August 28, 2025 13:25
