fix(core): Ensure tasks timeout even if they don't receive settings #12431
Conversation
Force-pushed from 527a146 to efe3980
Force-pushed from efe3980 to 8d5fbe8
We were slowly moving to a state machine with so many booleans :)
```ts
await taskState.caseOf({
	// If the cancelled task hasn't received settings yet, we can finish it
	waitingForSettings: () => this.finishTask(taskState),

	// If the task has already timed out or is already cancelled, we can
	// ignore the cancellation
	'aborting:timeout': noOp,
	'aborting:cancelled': noOp,

	running: () => {
		taskState.status = 'aborting:cancelled';
		taskState.abortController.abort('cancelled');
		this.cancelTaskRequests(taskId, reason);
	},
});
```

```ts
for (const [requestId, request] of this.nodeTypesRequests.entries()) {
	if (request.taskId === taskId) {
		request.reject(new TaskCancelledError(reason));
		this.nodeTypesRequests.delete(requestId);
	}
}
```

```diff
-const controller = this.taskCancellations.get(taskId);
-if (controller) {
-	controller.abort();
-	this.taskCancellations.delete(taskId);
-}
```
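The diff excerpt doesn't include `caseOf` itself. A minimal sketch of how such a status-keyed dispatcher could work — the generic `StateHolder` class and its throw-on-missing-handler behavior are assumptions for illustration, not the PR's actual implementation:

```ts
// Sketch of a status-keyed dispatcher like caseOf. Handlers not provided for
// the current status make the call throw, which surfaces forgotten
// state/event combinations instead of silently ignoring them.
type Handlers<S extends string> = Partial<Record<S, () => void | Promise<void>>>;

class StateHolder<S extends string> {
	constructor(public status: S) {}

	async caseOf(handlers: Handlers<S>) {
		const handler = handlers[this.status];
		if (!handler) throw new Error(`Unhandled status: ${this.status}`);
		await handler();
	}
}
```

The appeal over a pile of booleans is that each event handler must say explicitly what happens in every state it expects, and anything unexpected fails loudly.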
If the broker cancels a task and the task was waiting for settings, then we clean up in `finishTask` without cancelling the task requests; but if the broker cancels a task and the task was already running, then we cancel the task requests without cleanup. Why is this? I'd expect we'd do cleanup and cancel task requests in both these transitions.
That is because when the task is in:

- `waitingForSettings` state: it hasn't been executed yet, so it can't have any in-flight requests
- `running` state: the task is currently executing, so we have to wait until control comes back from the task (i.e. the task execution promise resolves or rejects). We don't want to release a new "slot" for the next task until then. This happens concurrently (but not in parallel!). Hopefully this image clarifies it:
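The image itself didn't survive extraction; a rough sketch of the ordering described above (illustrative only — `execute` and `releaseSlot` are assumed names, not the PR's API):

```ts
// Aborting a running task only signals its AbortController; the worker slot
// is released after the task's execution promise settles, so cancellation and
// task teardown run concurrently but the slot is never double-booked.
async function runTask(
	signal: AbortSignal,
	execute: (signal: AbortSignal) => Promise<unknown>,
	releaseSlot: () => void,
) {
	try {
		// Control stays inside the task until this promise settles, even if
		// abort() fires concurrently on the cancellation path.
		await execute(signal);
	} finally {
		// Only now is a slot freed for the next task.
		releaseSlot();
	}
}
```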
Thank you, I misremembered and thought we requested node types as preparation rather than as part of the task.
```ts
constructor(opts: TaskStateOpts) {
	this.taskId = opts.taskId;
	this.timeoutTimer = setTimeout(opts.onTimeout, opts.timeoutInS * 1000);
```
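The excerpt shows only the timer being armed at construction. Presumably it is cleared once the task reaches a terminal state; a minimal self-contained sketch under that assumption (the class shape and the `cleanup` name are assumed, not necessarily the PR's exact code):

```ts
class TaskStateTimerSketch {
	private timeoutTimer: ReturnType<typeof setTimeout>;

	constructor(onTimeout: () => void, timeoutInS: number) {
		// Armed at creation: the timeout budget starts counting down before
		// the task has even received its settings.
		this.timeoutTimer = setTimeout(onTimeout, timeoutInS * 1000);
	}

	// Called once the task reaches a terminal state, so a stale timer
	// cannot fire onTimeout for a task that is already gone.
	cleanup() {
		clearTimeout(this.timeoutTimer);
	}
}
```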
In `master` we currently give task execution its full time budget of `taskTimeout`, but now we're allowing the broker to consume that budget. If a main is overloaded and the broker is slow, can this cause cascading failures (timeouts) of tasks because they're all receiving too little time to actually execute?
That is correct, and this was the bug here: if we never receive the task settings, the task would never time out and would wait forever. The other option would be to add a separate timeout for this case, but that would add more complexity. It shouldn't take seconds to receive the settings from the main, so it should be fine to take that out of the timeout budget. You can always increase the timeout if that is a concern.
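For comparison, the rejected "separate timeout" alternative would look roughly like this (purely illustrative — all names and the 10 s value are assumptions):

```ts
declare function abortTask(reason: string): void; // assumed helper

const SETTINGS_TIMEOUT_MS = 10_000; // assumed value

// Short timer covering only the settings phase...
let timer: ReturnType<typeof setTimeout> = setTimeout(
	() => abortTask('no settings received'),
	SETTINGS_TIMEOUT_MS,
);

// ...swapped for the full task timer once settings arrive. Two timers and an
// extra transition to keep correct — the complexity the PR avoids by letting
// a single timer cover both phases.
function onSettingsReceived(taskTimeoutInS: number) {
	clearTimeout(timer);
	timer = setTimeout(() => abortTask('timeout'), taskTimeoutInS * 1000);
}
```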
```ts
 *
 * The class only holds the state, and does not have any logic.
 *
 * The task has the following lifecycle:
```
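The lifecycle diagram referenced here didn't survive extraction. Based on the states used elsewhere in this diff, it is presumably along these lines (an inferred sketch, not the PR's documentation):

```ts
type TaskStatus =
	| 'waitingForSettings' // created, timeout timer armed, settings not yet received
	| 'running' // settings received, task code executing
	| 'aborting:timeout' // timeout fired before the task finished
	| 'aborting:cancelled'; // broker cancelled the task mid-run

// Inferred transitions: a task stuck in waitingForSettings can now reach
// aborting:timeout (the fix in this PR), while cancellation in that state
// finishes the task directly since it has no in-flight requests yet.
const nextStates: Record<TaskStatus, TaskStatus[]> = {
	waitingForSettings: ['running', 'aborting:timeout'],
	running: ['aborting:timeout', 'aborting:cancelled'],
	'aborting:timeout': [],
	'aborting:cancelled': [],
};
```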
Thanks for diagramming this!
Thank you for the comments @ivov 🙇 Addressed them
Tested manually and working well, thanks for the fix!
n8n Run #8552

Run Properties:

| Property | Value |
|---|---|
| Project | n8n |
| Branch Review | cat-459-timeout-task-when-no-settings-are-received |
| Run status | Passed #8552 |
| Run duration | 04m 39s |
| Commit | 773ad6faee: 🌳 🖥️ browsers:node18.12.0-chrome107 🤖 tomi 🗃️ e2e/* |
| Committer | Tomi Turtiainen |

Test results:

| Result | Count |
|---|---|
| Failures | 0 |
| Flaky | 3 |
| Pending | 0 |
| Skipped | 0 |
| Passing | 484 |

✅ All Cypress E2E specs passed
@ivov had to merge master to fix an e2e test, could you reapprove 🙏
✅ All Cypress E2E specs passed
Got released with
Summary
If the n8n instance happens to crash at a specific time, a task runner task might not receive a response to e.g. a settings or data request. In cases like this the task runner was left hanging forever. This PR makes sure such tasks get aborted correctly.

It also refactors the task execution lifecycle to make explicit which states a task can be in and how different events are handled in each state.
Related Linear tickets, Github issues, and Community forum posts
https://linear.app/n8n/issue/CAT-459/community-issue-code-node-stopped-working
https://community.n8n.io/t/code-nodes-stopped-working/67141
fixes #12354
Review / Merge checklist
- release/backport (if the PR is an urgent fix that needs to be backported)