-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#2207 - Investigate Root Cause of Camunda Workers Getting Stuck #2219
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Great job investigating the bug!
Thanks for doing the testing @ann-aot to find out that global error handler wasn't working and we had a dev discussion this morning and agreed on the point that we must fix the global error handler instead of enforcing developer to explicitly add the try catch everywhere. Will continue the review after the global error handler fix. |
The global error handler seems to be failing because an unhandled expectation inside the Zeebe worker method would be handled by the rpc-exceptions-handler.ts and this will return an Observable that is not expected currently. To test the theory, if we change the global handler as below it will report the Still, if the unhandled exception is not of the type IMO we need to fix the global error handler and also talk about the possibility to ensure the controllers will always provide a better error handling. |
}); | ||
} catch (error: unknown) { | ||
return job.fail(`Unexpected error while seting the PIR status. ${error}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo on "seting".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great research and findings. Please let's talk about having the global error handler fixed and also I would like to check if the default timeouts are actually woking.
sources/packages/backend/apps/workers/src/zeebe/zeebe-transport-strategy.ts
Show resolved
Hide resolved
}); | ||
} catch (error: unknown) { | ||
const errorMessage = `Unexpected error while verifying the application exceptions. ${error}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we would be repeating this along all controller and jobs, I was expecting that a basic helper would be created, as shown below.
} catch (error: unknown) {
return createUnexpectedJobFail(error, job);
}
The helper method can be something like below.
createUnexpectedJobFail(
error: unknown,
job: ZeebeJob,
logger?: Logger,
): MustReturnJobActionAcknowledgement {
const jobLogger = logger ?? new Logger(job.type);
const errorMessage = `Unexpected error while processing job. ${parseException(error)}`;
jobLogger.error(errorMessage);
return job.fail(errorMessage);
}
As an out-of-PR-scope suggestion, I believe that we can create the below error formatter to avoid logging as jobLogger.log(
${error.name} ${error.message});
.
parseException(error: unknown): string {
return JSON.stringify(error, null, 2);
}
* Parse the error in a prettier format for better | ||
* readability. | ||
*/ | ||
function parseException(error: unknown): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would a generic method to be used every where, not only by workers. It can be moved to the generic lib.
sources/packages/backend/apps/workers/src/utilities/error-handler.ts
Outdated
Show resolved
Hide resolved
return job.complete(outputVariables); | ||
} catch (error: unknown) { | ||
const errorMessage = `Unexpected error while loading supporting user data. ${error}`; | ||
return createUnexpectedJobFail(errorMessage, job, jobLogger); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting that we no longer need line 103.
The helper would be called as return createUnexpectedJobFail(error, job, jobLogger);
@@ -63,7 +64,21 @@ export class ZeebeTransportStrategy | |||
`Starting job for processInstanceKey ${job.processInstanceKey}. Retries left: ${job.retries}.`, | |||
); | |||
try { | |||
return await jobHandler(job); | |||
// The jobHandler can potentially return a Promise or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments. 👍
.../packages/backend/apps/workers/src/controllers/supporting-user/supporting-user.controller.ts
Show resolved
Hide resolved
sources/packages/backend/apps/workers/src/controllers/assessment/assessment.controller.ts
Show resolved
Hide resolved
sources/packages/backend/apps/workers/src/utilities/error-handler.ts
Outdated
Show resolved
Hide resolved
Great work @ann-aot and great team effort putting our heads together to resolve this issue. Added few minor comments. |
Kudos, SonarCloud Quality Gate passed!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the changes, looks good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The issue was able to replicate in one of the workers when we submitted an application with a long income, which was an error in the worker logic during the assessment. However, the worker is not getting any acknowledgment from the code and it is waiting
So, the next time a new assessment comes, it gets stuck in the same worker for an infinite time.
checked our worker global error handle in https://vscode.dev/github/bcgov/SIMS/blob/fix/sims-%232207/sources/packages/backend/apps/workers/src/zeebe/zeebe-transport-strategy.ts#L57
![image](https://private-user-images.githubusercontent.com/77353155/263403375-0e091620-683d-4a4a-a0a8-8ee31e42baba.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MzQ5NDQsIm5iZiI6MTczODkzNDY0NCwicGF0aCI6Ii83NzM1MzE1NS8yNjM0MDMzNzUtMGUwOTE2MjAtNjgzZC00YTRhLWEwYTgtOGVlMzFlNDJiYWJhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDEzMjQwNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY4NzY5ZTU2NTMzZmM3ZDg4ZjM4MGUzNjEwMTNkZjdmZWE0NjY3MWZlMzBkZDdlMDdmNDk4ZTY5NmY1YTNlMGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.b7hu61ugRWPoStGiEsllxBJ291OFwPCFIcyOAsPh9fI)
and found that, the
return await jobHandler(job);
is not throwing any error, when there is an error, as a result, thecatch
is not invoked and the worker is not getting the acknowledgment.fixed the global handler, the jobhandler is returning an observable, so, added a logic to await for it when the return type is observable,
![image](https://private-user-images.githubusercontent.com/77353155/264455683-fd834315-b22e-495c-a92c-a79123740b7e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MzQ5NDQsIm5iZiI6MTczODkzNDY0NCwicGF0aCI6Ii83NzM1MzE1NS8yNjQ0NTU2ODMtZmQ4MzQzMTUtYjIyZS00OTVjLWE5MmMtYTc5MTIzNzQwYjdlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDEzMjQwNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBkYmNmOGZhNjA5MzgxNTQzNjMyMjNmM2E2NjczNjc2ZjE3NzZhMGVlN2UxNDJhNzdjMzc2MDQ4MGZlOWY1MjYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0._xMy4txUsWHxQWL5IwKD6FvShsc2xstiTBVBhVvlLp0)
And added try-catch, and if there is any error in the controller level, the code acknowledges the worker as a failed job, so that the worker is free and can focus on the next jobs.
This causes an incident in the Camunda, Which developers can take a look at (Before this fix, Camunda never caused an incident for these scenarios).
-Replicate
maxJobsToActivate:
to1
and decreased thetimeout
to10
.create income verification
worker stuck in both the caseStep 2, after the fix
Step 4, after the fix,
Ref: https://docs.camunda.io/docs/components/best-practices/development/writing-good-workers/#nodejs-client