fix(core): Ensure tasks timeout even if they don't receive settings #12431

Merged: 3 commits, Jan 3, 2025

@tomi (Contributor) commented Jan 2, 2025

Summary

If the n8n instance happens to crash at a particular moment, a task runner task
might not receive a response to, e.g., a settings or data request. In such cases
the task runner was left hanging forever. This PR makes sure such tasks
get aborted correctly.

Also refactors the task execution lifecycle to make explicit which states a task
can be in and how different events are handled in each state.
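
To illustrate the kind of explicit lifecycle described above, here is a minimal TypeScript sketch of a task state with a `caseOf`-style dispatcher. It is not the code from this PR: `SketchTaskState`, `Handlers` and `onSettingsReceived` are hypothetical names, and the status values are modelled on those visible in the review snippet further down.

```typescript
// Status names modelled on the review snippet; the real set may differ.
type TaskStatus =
  | 'waitingForSettings'
  | 'running'
  | 'aborting:cancelled'
  | 'aborting:timeout';

// One handler per status. The mapped type makes the compiler reject
// a handler map that forgets a state.
type Handlers<R> = { [S in TaskStatus]: () => R };

// Hypothetical state holder: it stores the status and dispatches on it.
class SketchTaskState {
  readonly taskId: string;
  status: TaskStatus = 'waitingForSettings';

  constructor(taskId: string) {
    this.taskId = taskId;
  }

  // Run the handler registered for the current status.
  caseOf<R>(handlers: Handlers<R>): R {
    return handlers[this.status]();
  }
}

// Hypothetical event handler: what happens when settings arrive?
function onSettingsReceived(task: SketchTaskState): string {
  return task.caseOf({
    waitingForSettings: () => {
      task.status = 'running';
      return 'started';
    },
    // In every other state the settings are unexpected or arrive too late.
    running: () => 'ignored',
    'aborting:cancelled': () => 'ignored',
    'aborting:timeout': () => 'ignored',
  });
}
```

Compared with several independent boolean flags, a single status field plus an exhaustive handler map forces each event handler to say what it does in every state.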

Related Linear tickets, GitHub issues, and Community forum posts

https://linear.app/n8n/issue/CAT-459/community-issue-code-node-stopped-working

https://community.n8n.io/t/code-nodes-stopped-working/67141

fixes #12354

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with release/backport (if the PR is an urgent fix that needs to be backported)

@tomi force-pushed the cat-459-timeout-task-when-no-settings-are-received branch from 527a146 to efe3980 on January 2, 2025 12:49
@tomi added the release/backport label Jan 2, 2025

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 77.00000% with 23 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/@n8n/task-runner/src/task-runner.ts | 76.05% | 17 Missing ⚠️ |
| packages/@n8n/task-runner/src/task-state.ts | 72.22% | 5 Missing ⚠️ |
| ...k-runner/src/js-task-runner/__tests__/test-data.ts | 83.33% | 1 Missing ⚠️ |


@n8n-assistant (bot) added the n8n team label Jan 2, 2025
@tomi force-pushed the cat-459-timeout-task-when-no-settings-are-received branch from efe3980 to 8d5fbe8 on January 2, 2025 14:11

@ivov (Contributor) left a comment

We were slowly moving to a state machine with so many booleans :)

Comment on lines +310 to +325
+ await taskState.caseOf({
+   // If the cancelled task hasn't received settings yet, we can finish it
+   waitingForSettings: () => this.finishTask(taskState),

- for (const [requestId, request] of this.nodeTypesRequests.entries()) {
-   if (request.taskId === taskId) {
-     request.reject(new TaskCancelledError(reason));
-     this.nodeTypesRequests.delete(requestId);
-   }
- }
+   // If the task has already timed out or is already cancelled, we can
+   // ignore the cancellation
+   'aborting:timeout': noOp,
+   'aborting:cancelled': noOp,

- const controller = this.taskCancellations.get(taskId);
- if (controller) {
-   controller.abort();
-   this.taskCancellations.delete(taskId);
+   running: () => {
+     taskState.status = 'aborting:cancelled';
+     taskState.abortController.abort('cancelled');
+     this.cancelTaskRequests(taskId, reason);
+   },
+ });
  }

If the broker cancels a task and the task was waiting for settings, then we clean up in finishTask without cancelling the task requests, but if the broker cancels a task and the task was already running, then we cancel the task requests without cleanup.

Why is this? I'd expect we'd do cleanup and cancel task requests in both these transitions.

@tomi (Contributor Author), Jan 2, 2025

That is because of the state the task is in:

  • waitingForSettings: the task hasn't been executed yet, so it can't have any in-flight requests.
  • running: the task is currently executing, so we have to wait until control comes back from the task (i.e. the task execution promise resolves or rejects). We don't want to release a "slot" for the next task until then. This happens concurrently (but not in parallel!). Hopefully this image clarifies it:

image
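
The distinction between the two states can also be sketched in code. The following is a simplified illustration, not the runner's actual implementation: `SketchRunner`, `accept`, `onSettings` and `onExecutionSettled` are hypothetical names, and the `active` map stands in for whatever slot bookkeeping the real runner uses.

```typescript
type TaskStatus =
  | 'waitingForSettings'
  | 'running'
  | 'aborting:cancelled'
  | 'aborting:timeout';

interface SketchTask {
  taskId: string;
  status: TaskStatus;
  abortController: AbortController;
}

class SketchRunner {
  // Tasks that currently occupy a slot (hypothetical bookkeeping).
  readonly active = new Map<string, SketchTask>();

  accept(taskId: string): SketchTask {
    const task: SketchTask = {
      taskId,
      status: 'waitingForSettings',
      abortController: new AbortController(),
    };
    this.active.set(taskId, task);
    return task;
  }

  // Settings arrived, so execution begins.
  onSettings(task: SketchTask): void {
    task.status = 'running';
  }

  // Cleanup: frees the slot so that another task can be accepted.
  finishTask(task: SketchTask): void {
    this.active.delete(task.taskId);
  }

  // The broker asked us to cancel the task.
  cancel(task: SketchTask): void {
    switch (task.status) {
      case 'waitingForSettings':
        // Nothing has executed yet, so there are no in-flight requests.
        this.finishTask(task);
        break;
      case 'running':
        // Signal the executing code, but keep the slot until it returns.
        task.status = 'aborting:cancelled';
        task.abortController.abort('cancelled');
        break;
      case 'aborting:cancelled':
      case 'aborting:timeout':
        // Already aborting, so the cancellation is ignored.
        break;
    }
  }

  // Called once the task's execution promise has resolved or rejected.
  onExecutionSettled(task: SketchTask): void {
    this.finishTask(task);
  }
}
```

In this sketch a cancelled running task still occupies its slot until `onExecutionSettled` is called, which mirrors the point that the slot is only released once control comes back from the task.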

Thank you, I misremembered and thought we requested node types as preparation rather than as part of the task.


constructor(opts: TaskStateOpts) {
this.taskId = opts.taskId;
this.timeoutTimer = setTimeout(opts.onTimeout, opts.timeoutInS * 1000);

In master we currently give task execution its full time budget of taskTimeout, but now we're allowing the broker to consume that budget. If a main is overloaded and the broker is slow, can this cause cascading failures (timeouts) of tasks because they're all receiving too little time to actually execute?

@tomi (Contributor Author)

That is correct. This was the bug here: if we never received the task settings, the task would never time out and would wait forever. The other option would be to add a separate timeout for this case, but that would add more complexity. It shouldn't take seconds to receive the settings from the main, so it should be fine to take that time out of the timeout budget. You can always increase the timeout if that is a concern.
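
A rough sketch of this trade-off, under stated assumptions: the timer is started in the constructor, as in the snippet under review, so time spent waiting for settings counts against the same budget. The `schedule` option and the `Schedule` and `CancelTimer` types are hypothetical additions so that the sketch can be exercised without real timers; they are not part of the PR.

```typescript
type CancelTimer = () => void;
// Injected so the sketch can be driven without waiting on real time.
type Schedule = (fn: () => void, ms: number) => CancelTimer;

// Default scheduler backed by the real timer functions.
const realSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

interface SketchTaskStateOpts {
  taskId: string;
  timeoutInS: number;
  onTimeout: () => void;
  schedule?: Schedule; // hypothetical, for illustration only
}

class SketchTaskState {
  readonly taskId: string;
  status: 'waitingForSettings' | 'running' = 'waitingForSettings';
  private readonly cancelTimer: CancelTimer;

  constructor(opts: SketchTaskStateOpts) {
    this.taskId = opts.taskId;
    const schedule = opts.schedule ?? realSchedule;
    // The clock starts here, before any settings have arrived, so a task
    // that never receives settings still times out.
    this.cancelTimer = schedule(opts.onTimeout, opts.timeoutInS * 1000);
  }

  // Stop the timer once the task has finished.
  cleanup(): void {
    this.cancelTimer();
  }
}
```

With a single timer, a slow settings response shortens the time left for execution, which is the concern raised above. A separate settings timeout would avoid that at the cost of a second timer and more states to handle.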

*
* The class only holds the state, and does not have any logic.
*
* The task has the following lifecycle:

Thanks for diagramming this!

@tomi requested a review from ivov on January 2, 2025 16:02

@tomi (Contributor Author) commented Jan 2, 2025

Thank you for the comments @ivov 🙇 Addressed them

@ivov (Contributor) left a comment

Tested manually and working well, thanks for the fix!

github-actions (bot) commented Jan 3, 2025

⚠️ Some Cypress E2E specs are failing, please fix them before merging

cypress (bot) commented Jan 3, 2025

n8n Run #8552

Project: n8n
Branch Review: cat-459-timeout-task-when-no-settings-are-received
Run status: Passed #8552
Run duration: 04m 39s
Commit: 773ad6faee: 🌳 🖥️ browsers:node18.12.0-chrome107 🤖 tomi 🗃️ e2e/*
Committer: Tomi Turtiainen

Test results
Failed: 0
Flaky: 3
Pending (test annotated with .skip): 0
Skipped (failure in a mocha hook): 0
Passing: 484

github-actions (bot) commented Jan 3, 2025

✅ All Cypress E2E specs passed

@tomi requested a review from ivov on January 3, 2025 09:53

@tomi (Contributor Author) commented Jan 3, 2025

@ivov had to merge master to fix an e2e test, could you reapprove 🙏

github-actions (bot) commented Jan 3, 2025

✅ All Cypress E2E specs passed

@tomi merged commit b194026 into master on Jan 3, 2025
37 checks passed
@tomi deleted the cat-459-timeout-task-when-no-settings-are-received branch on January 3, 2025 10:27
@github-actions (bot) mentioned this pull request on Jan 8, 2025

@janober (Member) commented Jan 9, 2025

Got released with [email protected]

@janober (Member) commented Jan 9, 2025

Got released with [email protected]

Labels: n8n team (Authored by the n8n team), release/backport (Changes that need to be backported to older releases), Released
Projects: None yet
Linked issues that merging this pull request may close: Code Node Stopped Working
3 participants