fix(run-engine): prevent stalled SUSPENDED snapshots from endlessly retrying #2448
Walkthrough
Adds configurable suspended-heartbeat retry settings via four new environment variables and wires them into RunEngine initialization as suspendedHeartbeatRetriesConfig. Implements exponential backoff retrying for SUSPENDED snapshots in engine/index.ts, passing delayMs and restartAttempt to ExecutionSnapshotSystem.restartHeartbeatForRun (signature updated) and scheduling heartbeats using availableAt. Refactors waitpointSystem to return a structured WaitpointContinuationResult instead of string statuses, expanding fetched waitpoint data. Updates workerCatalog heartbeatSnapshot schema to include optional restartAttempt. Tests add scenarios for manual waitpoint vs run waitpoint behavior and cover the new retry configuration and flow.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
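The exponential backoff described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the type name, the function name computeRestartDelayMs, and the config field names are assumptions derived from the four environment variable names, and treating maxCount as an inclusive limit on 1-based attempts is also an assumption (the review itself asks for that semantic to be clarified).

```typescript
// Hypothetical config shape; field names mirror the four env vars but are assumptions.
type SuspendedHeartbeatRetriesConfig = {
  maxCount: number;
  maxDelayMs: number;
  initialDelayMs: number;
  factor: number;
};

// Returns the delay before the given restart attempt (1-based),
// or undefined once the attempt number exceeds maxCount (assumed inclusive).
function computeRestartDelayMs(
  config: SuspendedHeartbeatRetriesConfig,
  restartAttempt: number
): number | undefined {
  if (restartAttempt > config.maxCount) {
    return undefined;
  }
  // initialDelayMs * factor^(attempt - 1), clamped to [0, maxDelayMs].
  const uncapped = config.initialDelayMs * Math.pow(config.factor, restartAttempt - 1);
  return Math.min(config.maxDelayMs, Math.max(0, uncapped));
}

// Values taken from the defaults shown in the env.server.ts diff.
const defaults: SuspendedHeartbeatRetriesConfig = {
  maxCount: 12,
  maxDelayMs: 60_000 * 60 * 6,
  initialDelayMs: 60_000,
  factor: 2,
};
```

Under these assumptions the first retry waits one minute, each later retry doubles the wait, and the delay is capped at six hours from the tenth attempt onward.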
Actionable comments posted: 4
🧹 Nitpick comments (7)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1)
399-406: Validate inputs: clamp delayMs and log restartAttempt.
Prevent negative scheduling and include the attempt in logs for observability.

      public async restartHeartbeatForRun({
        runId,
        delayMs,
        restartAttempt,
        tx,
      }: {
        runId: string;
        delayMs: number;
        restartAttempt: number;
        tx?: PrismaClientOrTransaction;
      }): Promise<ExecutionResult> {
        const prisma = tx ?? this.$.prisma;
        const latestSnapshot = await getLatestExecutionSnapshot(prisma, runId);

    +   const safeDelayMs = Math.max(0, delayMs);
        this.$.logger.debug("restartHeartbeatForRun: enqueuing heartbeat", {
          runId,
          snapshotId: latestSnapshot.id,
    -     delayMs,
    +     delayMs: safeDelayMs,
    +     restartAttempt,
        });

apps/webapp/app/env.server.ts (1)
522-529: Tighten validation to prevent misconfiguration.
Ensure sane ranges to avoid zero/negative or extreme values that could starve or flood retries.

    - RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().default(12),
    - RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
    -   .number()
    -   .int()
    -   .default(60_000 * 60 * 6),
    - RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce.number().int().default(60_000),
    - RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().default(2),
    + RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_COUNT: z.coerce.number().int().min(1).default(12),
    + RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_MAX_DELAY_MS: z.coerce
    +   .number()
    +   .int()
    +   .min(1_000)
    +   .default(60_000 * 60 * 6),
    + RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_INITIAL_DELAY_MS: z.coerce
    +   .number()
    +   .int()
    +   .min(1_000)
    +   .default(60_000),
    + RUN_ENGINE_SUSPENDED_HEARTBEAT_RETRIES_FACTOR: z.coerce.number().min(1).default(2),

internal-packages/run-engine/src/engine/index.ts (2)
1164-1170: Persisted restartAttempt: clarify semantics and rename local var for readability
The new optional restartAttempt plumbs through correctly. Two asks:
- Clarify whether maxCount is “max retries” or “max attempts including first retry” to avoid confusion downstream.
- Rename the local $restartAttempt to nextRestartAttempt for clarity and to avoid the unconventional $ prefix.

      restartAttempt?: number;

    - const $restartAttempt = (restartAttempt ?? 0) + 1; // Start at 1
    + const nextRestartAttempt = (restartAttempt ?? 0) + 1; // Start at 1

1294-1301: Log volume: avoid dumping full result objects at info level
result can include arrays of waitpoints; consider logging counts/types instead to keep logs lean at scale.

    - this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
    -   runId,
    -   result,
    -   snapshotId: latestSnapshot.id,
    - });
    + this.logger.info("handleStalledSnapshot SUSPENDED continueRunIfUnblocked", {
    +   runId,
    +   snapshotId: latestSnapshot.id,
    +   status: result.status,
    +   waitpointCount: result.status === "blocked" ? result.waitpoints.length : 0,
    +   waitpointTypes: result.status === "blocked" ? Array.from(new Set(result.waitpoints.map(w => w.type))) : [],
    + });

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)
630-634: PENDING_CANCEL grouped with FINISHED but reason says “run is finished”
Either separate the branches or make the reason reflect the actual status to aid debugging.

    - return {
    -   status: "skipped",
    -   reason: "run is finished",
    - };
    + return {
    +   status: "skipped",
    +   reason: `run is ${snapshot.executionStatus.toLowerCase()}`,
    + };

internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)
605-608: Unify checkpoint source for clarity
Elsewhere you use blockedExecutionData.snapshot.id; here you use blockedResult.id. They should be the same, but let’s standardize for readability.

    - snapshotId: blockedResult.id,
    + snapshotId: blockedExecutionData!.snapshot.id,

796-799: Fix misleading comment: run transitions to QUEUED due to scheduled restart, not because we “don’t restart”
The expectation is QUEUED; update the comment to reflect that the previously scheduled restart processes the cleared waitpoints and moves the run forward.

    - // We don't restart the heartbeat because there are no run or batch waitpoints
    + // The previously scheduled heartbeat restart runs, sees no blocking waitpoints, and re-queues the run

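The log-volume nitpick on index.ts relies on the structured continuation result being a discriminated union, so a status check narrows the type before the waitpoints are read. A minimal sketch of that idea follows; the type shapes, the "unblocked" status name, and the summarizeForLog helper are simplified assumptions for illustration and may not match the PR's actual WaitpointContinuationResult.

```typescript
// Hypothetical, simplified shapes; the real result type in the PR may differ.
type WaitpointSummary = { id: string; type: string };

type ContinuationResult =
  | { status: "blocked"; waitpoints: WaitpointSummary[] }
  | { status: "unblocked"; waitpoints: WaitpointSummary[] }
  | { status: "skipped"; reason: string };

// Builds a compact log payload: counts and distinct types instead of full objects.
function summarizeForLog(result: ContinuationResult) {
  // Checking the status narrows the union, so `waitpoints` is safe to read here.
  const blocked = result.status === "blocked" ? result.waitpoints : [];
  return {
    status: result.status,
    waitpointCount: blocked.length,
    waitpointTypes: Array.from(new Set(blocked.map((w) => w.type))),
  };
}
```

Keeping the summary in one helper would also avoid repeating the status check for each logged field, though whether that is worth a helper in the actual codebase is a judgment call.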
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- apps/webapp/app/env.server.ts (1 hunk)
- apps/webapp/app/v3/runEngine.server.ts (1 hunk)
- internal-packages/run-engine/src/engine/index.ts (2 hunks)
- internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (1 hunk)
- internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (11 hunks)
- internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (4 hunks)
- internal-packages/run-engine/src/engine/types.ts (1 hunk)
- internal-packages/run-engine/src/engine/workerCatalog.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
We use zod a lot in packages/core and in the webapp
Files:
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/env.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: In the webapp, all environment variables must be accessed through the env export of env.server.ts, instead of directly accessing process.env.
When importing from @trigger.dev/core in the webapp, never import from the root @trigger.dev/core path; always use one of the subpath exports as defined in the package's package.json.
Files:
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/env.server.ts
**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Our tests are all vitest
Files:
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts
🧬 Code graph analysis (2)
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)
internal-packages/testcontainers/src/index.ts (3)
- containerTest (233-241)
- prisma (91-112)
- redisOptions (132-165)
internal-packages/run-engine/src/engine/tests/setup.ts (2)
- setupAuthenticatedEnvironment (21-82)
- setupBackgroundWorker (84-293)
internal-packages/run-engine/src/engine/systems/executionSnapshotSystem.ts (2)
packages/core/src/v3/schemas/runEngine.ts (2)
- ExecutionResult (123-126)
- ExecutionResult (128-128)
internal-packages/run-engine/src/engine/index.ts (1)
- runId (1161-1411)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (10)
internal-packages/run-engine/src/engine/types.ts (1)
74-79: LGTM: new suspendedHeartbeatRetriesConfig option is well-scoped and backward-compatible.
apps/webapp/app/v3/runEngine.server.ts (1)
110-115: LGTM: wiring env → suspendedHeartbeatRetriesConfig is correct and follows env.server.ts.
internal-packages/run-engine/src/engine/index.ts (2)
1302-1334: Good guardrails to prevent endless retries on SUSPENDED
Early exits on missing config, no waitpoints, or no RUN/BATCH waitpoints look correct and align with the PR objective of stopping futile retries. Nice.
1376-1382: restartAttempt correctly round-tripped
restartAttempt is passed through to the heartbeatSnapshot payload in ExecutionSnapshotSystem.restartHeartbeatForRun (see executionSnapshotSystem.ts:421). No further changes needed.
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (4)
24-39: Typed continuation result: good abstraction
The structured WaitpointContinuationResult and narrowed waitpoint payload shape improve clarity and reduce overfetch. Nice.
515-516: Right-sized select for waitpoint fields
Selecting only id/status/type/completedAfter is appropriate for the new result type.
527-531: Blocked-path return is correct
Returning the current waitpoints when still blocked makes the caller’s decision-making straightforward.
734-737: Unblocked-path return is consistent and useful
Returning the completed waitpoints provides context to the caller without extra reads. LGTM.
internal-packages/run-engine/src/engine/tests/heartbeats.test.ts (2)
639-642: Manual waitpoint path assertion matches design
As intended, no RUN/BATCH waitpoints means no scheduled restart; run stays SUSPENDED. Good negative test.
684-690: Config-driven timing makes the test stable
Using heartbeatTimeout for initialDelay/maxDelay ensures the retry lands within the test window. Good.
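The guardrails praised in the index.ts comments, and exercised by the manual-waitpoint test, amount to a small decision: only schedule another heartbeat restart when retry config exists and at least one RUN or BATCH waitpoint is still blocking. The sketch below illustrates that decision under stated assumptions; shouldScheduleRestart, the RetryConfig type, and the exact set of waitpoint type names are hypothetical and not taken from the PR's code.

```typescript
// Assumed waitpoint type names; only RUN, BATCH and manual waitpoints are mentioned in the review.
type WaitpointType = "RUN" | "BATCH" | "MANUAL" | "DATETIME";

// Hypothetical placeholder for the retry configuration object.
type RetryConfig = { maxCount: number };

// Returns true only when a restart is worth scheduling.
function shouldScheduleRestart(
  config: RetryConfig | undefined,
  waitpoints: { type: WaitpointType }[]
): boolean {
  // No retry config: never schedule a restart.
  if (!config) return false;
  // Nothing is blocking the run, so there is nothing to wait for.
  if (waitpoints.length === 0) return false;
  // Only RUN/BATCH waitpoints justify polling again; others are resolved elsewhere.
  return waitpoints.some((w) => w.type === "RUN" || w.type === "BATCH");
}
```

With this shape, a run blocked solely on a manual waitpoint yields false and stays SUSPENDED, which is the behavior the negative test asserts.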
No description provided.