ericallam
Member

No description provided.

changeset-bot bot commented Aug 28, 2025

⚠️ No Changeset found

Latest commit: da33d99

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Contributor

coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds configurable suspended-heartbeat retry settings via four new environment variables and wires them into RunEngine initialization as suspendedHeartbeatRetriesConfig. Implements exponential backoff retrying for SUSPENDED snapshots in engine/index.ts, passing delayMs and restartAttempt to ExecutionSnapshotSystem.restartHeartbeatForRun (signature updated) and scheduling heartbeats using availableAt. Refactors waitpointSystem to return a structured WaitpointContinuationResult instead of string statuses, expanding fetched waitpoint data. Updates workerCatalog heartbeatSnapshot schema to include optional restartAttempt. Tests add scenarios for manual waitpoint vs run waitpoint behavior and cover the new retry configuration and flow.
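The backoff described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the type name, the field names, and the function `computeRestartDelayMs` are assumptions that mirror the four environment variables, and treating `maxCount` as the highest attempt number that still runs is also an assumption. The values in `defaults` are the defaults from the `env.server.ts` hunk in this review.

```typescript
// Hypothetical config shape; field names are assumed to mirror the four env variables.
type SuspendedHeartbeatRetriesConfig = {
  maxCount: number;
  maxDelayMs: number;
  initialDelayMs: number;
  factor: number;
};

// Returns the delay before the given 1-based restart attempt,
// or undefined once the attempt number exceeds maxCount (assumed semantics).
function computeRestartDelayMs(
  config: SuspendedHeartbeatRetriesConfig,
  restartAttempt: number
): number | undefined {
  if (restartAttempt > config.maxCount) {
    return undefined;
  }
  // Exponential growth: initialDelayMs * factor^(attempt - 1), capped at maxDelayMs.
  const uncapped = config.initialDelayMs * Math.pow(config.factor, restartAttempt - 1);
  return Math.min(uncapped, config.maxDelayMs);
}

// Defaults taken from the env.server.ts diff in this review.
const defaults: SuspendedHeartbeatRetriesConfig = {
  maxCount: 12,
  maxDelayMs: 60_000 * 60 * 6,
  initialDelayMs: 60_000,
  factor: 2,
};

// Delay for each attempt from 1 to maxCount.
const schedule = Array.from({ length: defaults.maxCount }, (_, i) =>
  computeRestartDelayMs(defaults, i + 1)
);
console.log(schedule);
```

Under these assumptions the delay doubles from one minute and reaches the six-hour cap at the tenth attempt, so the last few retries are all spaced six hours apart.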

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (7)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1)

399-406: Validate inputs: clamp delayMs and log restartAttempt.
Clamp delayMs at zero so a negative value cannot schedule the heartbeat in the past, and include restartAttempt in the debug log for observability.

   public async restartHeartbeatForRun({
     runId,
     delayMs,
     restartAttempt,
     tx,
   }: {
     runId: string;
     delayMs: number;
     restartAttempt: number;
     tx?: PrismaClientOrTransaction;
   }): Promise<ExecutionResult> {
     const prisma = tx ?? this.$.prisma;
 
     const latestSnapshot = await getLatestExecutionSnapshot(prisma, runId);
 
+    const safeDelayMs = Math.max(0, delayMs);
     this.$.logger.debug("restartHeartbeatForRun: enqueuing heartbeat", {
       runId,
       snapshotId: latestSnapshot.id,
-      delayMs,
+      delayMs: safeDelayMs,
+      restartAttempt,
     });
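
For context, the clamped delay would then feed the availableAt timestamp that the walkthrough says heartbeats are scheduled with. A minimal sketch, assuming a hypothetical helper `heartbeatAvailableAt` that is not part of the PR:

```typescript
// Hypothetical helper: converts a requested delay into the time a heartbeat job becomes available.
function heartbeatAvailableAt(delayMs: number, now: Date = new Date()): Date {
  // Math.max(0, NaN) is NaN, so non-finite input is handled separately.
  // Negative or non-finite delays fall back to "available immediately" (a design choice of this sketch).
  const safeDelayMs = Number.isFinite(delayMs) ? Math.max(0, delayMs) : 0;
  return new Date(now.getTime() + safeDelayMs);
}

const base = new Date(1_000);
console.log(heartbeatAvailableAt(2_500, base).getTime());
console.log(heartbeatAvailableAt(-500, base).getTime());
```

The extra `Number.isFinite` check goes slightly beyond the suggestion above: `Math.max(0, delayMs)` alone would still let `NaN` through and produce an invalid date.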
apps/webapp/app/env.server.ts (1)

522-529: Tighten validation to prevent misconfiguration.
Ensure sane ranges to avoid zero/negative or extreme values that could starve or flood retries.

-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().default(12),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
-    .number()
-    .int()
-    .default(60_000 * 60 * 6),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce.number().int().default(60_000),
-  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().default(2),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().min(1).default(12),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
+    .number()
+    .int()
+    .min(1_000)
+    .default(60_000 * 60 * 6),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce
+    .number()
+    .int()
+    .min(1_000)
+    .default(60_000),
+  RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().min(1).default(2),
internal-packages/run-engine/src/engine/index.ts (2)

1164-1170: Persisted restartAttempt: clarify semantics and rename local var for readability

The new optional restartAttempt plumbs through correctly. Two asks:

  • Clarify whether maxCount is “max retries” or “max attempts including first retry” to avoid confusion downstream.
  • Rename the local $restartAttempt to nextRestartAttempt for clarity and to avoid unconventional $ prefix.
     restartAttempt?: number;
-              const $restartAttempt = (restartAttempt ?? 0) + 1; // Start at 1
+              const nextRestartAttempt = (restartAttempt ?? 0) + 1; // Start at 1
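
To make the maxCount question concrete, here is a small sketch of how the comparison operator changes the number of retries that actually run. Both predicates and `countRetries` are hypothetical and only illustrate the two readings; which one the engine uses is exactly what this comment asks the author to clarify.

```typescript
// Reading A (assumed): maxCount is the number of retries, so attempts 1..maxCount all run.
function allowedAsMaxRetries(nextRestartAttempt: number, maxCount: number): boolean {
  return nextRestartAttempt <= maxCount;
}

// Reading B (assumed): the attempt numbered maxCount is already over the limit,
// so only attempts 1..maxCount - 1 run.
function allowedAsExclusiveLimit(nextRestartAttempt: number, maxCount: number): boolean {
  return nextRestartAttempt < maxCount;
}

// Simulates the round trip of restartAttempt through successive heartbeats
// and counts how many retries are scheduled before the predicate refuses.
function countRetries(
  allowed: (nextRestartAttempt: number, maxCount: number) => boolean,
  maxCount: number
): number {
  let restartAttempt: number | undefined = undefined;
  let retries = 0;
  for (;;) {
    const nextRestartAttempt = (restartAttempt ?? 0) + 1; // Start at 1
    if (!allowed(nextRestartAttempt, maxCount)) {
      return retries;
    }
    retries += 1;
    restartAttempt = nextRestartAttempt;
  }
}

console.log(countRetries(allowedAsMaxRetries, 12));
console.log(countRetries(allowedAsExclusiveLimit, 12));
```

With the default of 12 the two readings differ by one retry, which is why the semantics are worth stating next to the option.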

1294-1301: Log volume: avoid dumping full result objects at info level

result can include arrays of waitpoints; consider logging counts/types instead to keep logs lean at scale.

-          this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
-            runId,
-            result,
-            snapshotId: latestSnapshot.id,
-          });
+          this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
+            runId,
+            snapshotId: latestSnapshot.id,
+            status: result.status,
+            waitpointCount: result.status === "blocked" ? result.waitpoints.length : 0,
+            waitpointTypes: result.status === "blocked" ? Array.from(new Set(result.waitpoints.map(w => w.type))) : [],
+          });
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

630-634: PENDING_CANCEL grouped with FINISHED but reason says “run is finished”

Either separate the branches or make the reason reflect the actual status to aid debugging.

-          return {
-            status: "skipped",
-            reason: "run is finished",
-          };
+          return {
+            status: "skipped",
+            reason: `run is ${snapshot.executionStatus.toLowerCase()}`,
+          };
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)

605-608: Unify checkpoint source for clarity

Elsewhere you use blockedExecutionData.snapshot.id; here you use blockedResult.id. They should be the same, but let’s standardize for readability.

-          snapshotId: blockedResult.id,
+          snapshotId: blockedExecutionData!.snapshot.id,

796-799: Fix misleading comment: run transitions to QUEUED due to scheduled restart, not because we “don’t restart”

The expectation is QUEUED; update the comment to reflect that the previously scheduled restart processes the cleared waitpoints and moves the run forward.

-      // We don't restart the heartbeat because there are no run or batch waitpoints
+      // The previously scheduled heartbeat restart runs, sees no blocking waitpoints, and re-queues the run
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6f6d93 and da33d99.

📒 Files selected for processing (8)
  • apps/webapp/app/env.server.ts (1 hunks)
  • apps/webapp/app/v3/runEngine.server.ts (1 hunks)
  • internal-packages/run-engine/src/engine/index.ts (2 hunks)
  • internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1 hunks)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (11 hunks)
  • internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (4 hunks)
  • internal-packages/run-engine/src/engine/types.ts (1 hunks)
  • internal-packages/run-engine/src/engine/workerCatalog.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • internal-packages/run-engine/src/engine/types.ts
  • internal-packages/run-engine/src/engine/workerCatalog.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
  • internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/engine/index.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

We use zod a lot in packages/core and in the webapp

Files:

  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/env.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: In the webapp, all environment variables must be accessed through the env export of env.server.ts, instead of directly accessing process.env.
When importing from @trigger.dev/core in the webapp, never import from the root @trigger.dev/core path; always use one of the subpath exports as defined in the package's package.json.

Files:

  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/env.server.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Our tests are all vitest

Files:

  • internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
🧬 Code graph analysis (2)
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)
internal-packages/testcontainers/src/index.ts (3)
  • containerTest (233-241)
  • prisma (91-112)
  • redisOptions (132-165)
internal-packages/run-engine/src/engine/tests/setup.ts (2)
  • setupAuthenticatedEnvironment (21-82)
  • setupBackgroundWorker (84-293)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (2)
packages/core/src/v3/schemas/runEngine.ts (2)
  • ExecutionResult (123-126)
  • ExecutionResult (128-128)
internal-packages/run-engine/src/engine/index.ts (1)
  • runId (1161-1411)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (10)
internal-packages/run-engine/src/engine/types.ts (1)

74-79: LGTM: new suspendedHeartbeatRetriesConfig option is well-scoped and backward-compatible.

apps/webapp/app/v3/runEngine.server.ts (1)

110-115: LGTM: wiring env → suspendedHeartbeatRetriesConfig is correct and follows env.server.ts.

internal-packages/run-engine/src/engine/index.ts (2)

1302-1334: Good guardrails to prevent endless retries on SUSPENDED

Early exits on missing config, no waitpoints, or no RUN/BATCH waitpoints look correct and align with the PR objective of stopping futile retries. Nice.
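
The order of those early exits can be sketched as a pure decision function. This is a hedged illustration only: `decideSuspendedRetry`, the `Decision` type, the reason strings, and the final `maxCount` check are assumptions, not code from engine/index.ts.

```typescript
// Hypothetical shapes for illustration.
type RetryConfig = { maxCount: number };
type BlockingWaitpoint = { id: string; type: string };
type Decision =
  | { retry: false; reason: string }
  | { retry: true; nextRestartAttempt: number };

function decideSuspendedRetry(
  config: RetryConfig | undefined,
  waitpoints: BlockingWaitpoint[],
  restartAttempt?: number
): Decision {
  // 1. Without retry config there is nothing to schedule.
  if (!config) {
    return { retry: false, reason: "no retry config" };
  }
  // 2. No blocking waitpoints means there is nothing to wait for.
  if (waitpoints.length === 0) {
    return { retry: false, reason: "no waitpoints" };
  }
  // 3. Only RUN and BATCH waitpoints justify another heartbeat restart.
  const hasRunOrBatch = waitpoints.some((w) => w.type === "RUN" || w.type === "BATCH");
  if (!hasRunOrBatch) {
    return { retry: false, reason: "no RUN or BATCH waitpoints" };
  }
  // 4. Assumed limit check: stop once the next attempt would exceed maxCount.
  const nextRestartAttempt = (restartAttempt ?? 0) + 1;
  if (nextRestartAttempt > config.maxCount) {
    return { retry: false, reason: "max retries exceeded" };
  }
  return { retry: true, nextRestartAttempt };
}

console.log(decideSuspendedRetry({ maxCount: 12 }, [{ id: "wp_1", type: "MANUAL" }]));
console.log(decideSuspendedRetry({ maxCount: 12 }, [{ id: "wp_2", type: "RUN" }], 2));
```

Keeping the decision separate from the scheduling side effect would also make each guard easy to cover in a unit test.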


1376-1382: restartAttempt correctly round-tripped
restartAttempt is passed through to the heartbeatSnapshot payload in ExecutionSnapshotSystem.restartHeartbeatForRun (see executionSnapshotSystem.ts:421). No further changes needed.

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (4)

24-39: Typed continuation result: good abstraction

The structured WaitpointContinuationResult and narrowed waitpoint payload shape improve clarity and reduce overfetch. Nice.
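
For readers who have not opened the diff, a discriminated union of roughly this shape is what makes the structured result easier to consume than string statuses. The exact fields are an assumption pieced together from the comments in this review (statuses "blocked" and "skipped", a `reason` on skips, and waitpoints selected as id/status/type/completedAfter); `describeResult` is a hypothetical consumer.

```typescript
// Assumed payload shape, based on the fields named in the review.
type WaitpointSummary = {
  id: string;
  status: string;
  type: string;
  completedAfter: Date | null;
};

// Assumed union; the real definition lives in waitpointSystem.ts and may differ.
type WaitpointContinuationResult =
  | { status: "unblocked"; waitpoints: WaitpointSummary[] }
  | { status: "blocked"; waitpoints: WaitpointSummary[] }
  | { status: "skipped"; reason: string };

// Switching on the status narrows the type, so each branch only sees its own fields.
function describeResult(result: WaitpointContinuationResult): string {
  switch (result.status) {
    case "skipped":
      return `skipped: ${result.reason}`;
    case "blocked":
      return `blocked by ${result.waitpoints.length} waitpoint(s)`;
    case "unblocked":
      return `unblocked after ${result.waitpoints.length} waitpoint(s)`;
  }
}

console.log(describeResult({ status: "skipped", reason: "run is finished" }));
```

Because every branch returns, the compiler flags a missing case if a new status is added to the union later.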


515-516: Right-sized select for waitpoint fields

Selecting only id/status/type/completedAfter is appropriate for the new result type.


527-531: Blocked-path return is correct

Returning the current waitpoints when still blocked makes the caller’s decision-making straightforward.


734-737: Unblocked-path return is consistent and useful

Returning the completed waitpoints provides context to the caller without extra reads. LGTM.

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)

639-642: Manual waitpoint path assertion matches design

As intended, no RUN/BATCH waitpoints means no scheduled restart; run stays SUSPENDED. Good negative test.


684-690: Config-driven timing makes the test stable

Using heartbeatTimeout for initialDelay/maxDelay ensures the retry lands within the test window. Good.

@matt-aitken matt-aitken merged commit b0b0df6 into main Aug 28, 2025
31 checks passed
@matt-aitken matt-aitken deleted the fix/stalled-suspended-snapshot-backoff branch August 28, 2025 13:25

3 participants