Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#2207 - Investigate Root Cause of Camunda Workers Getting Stuck #2219

Merged
merged 10 commits into from
Aug 31, 2023

Conversation

ann-aot
Copy link
Contributor

@ann-aot ann-aot commented Aug 25, 2023

  • The issue was able to replicate in one of the workers when we submitted an application with a long income, which was an error in the worker logic during the assessment. However, the worker is not getting any acknowledgment from the code and it is waiting

  • So, the next time a new assessment comes, it gets stuck in the same worker for an infinite time.

  • checked our worker global error handle in https://vscode.dev/github/bcgov/SIMS/blob/fix/sims-%232207/sources/packages/backend/apps/workers/src/zeebe/zeebe-transport-strategy.ts#L57
    image
    and found that, the return await jobHandler(job); is not throwing any error, when there is an error, as a result, the catch is not invoked and the worker is not getting the acknowledgment.

  • fixed the global handler, the jobhandler is returning an observable, so, added a logic to await for it when the return type is observable,
    image

  • And added try-catch, and if there is any error in the controller level, the code acknowledges the worker as a failed job, so that the worker is free and can focus on the next jobs.

  • This causes an incident in the Camunda, Which developers can take a look at (Before this fix, Camunda never caused an incident for these scenarios).

-Replicate

  1. To replicate the issue, I decreased the maxJobsToActivate: to 1 and decreased the timeout to 10.
    image
  2. And created/edited 2 applications with a lengthy income,
    image
  3. which made the create income verification worker stuck in both the case
    image
  4. And then, when you submit an application with a proper income, the new assessment is also stuck in the same worker.
  5. After the fix, step 2 will create incidents in Camunda, and step 4, will work as expected.
    Step 2, after the fix
    image
    image
    image
    Step 4, after the fix,
    image

Ref: https://docs.camunda.io/docs/components/best-practices/development/writing-good-workers/#nodejs-client

@ann-aot ann-aot self-assigned this Aug 25, 2023
@ann-aot ann-aot marked this pull request as ready for review August 25, 2023 19:43
Copy link
Collaborator

@andrepestana-aot andrepestana-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Great job investigating the bug!

@dheepak-aot
Copy link
Collaborator

dheepak-aot commented Aug 28, 2023

Thanks for doing the testing @ann-aot to find out that global error handler wasn't working and we had a dev discussion this morning and agreed on the point that we must fix the global error handler instead of enforcing developer to explicitly add the try catch everywhere.

Will continue the review after the global error handler fix.

@andrewsignori-aot
Copy link
Collaborator

Thanks for doing the testing @ann-aot to find out that the global error handler wasn't working we had a dev discussion this morning and agreed on the point that we must fix the global error handler instead of enforcing the developer to explicitly add the try-catch everywhere.

Will continue the review after the global error handler fix.

The global error handler seems to be failing because an unhandled expectation inside the Zeebe worker method would be handled by the rpc-exceptions-handler.ts and this will return an Observable that is not expected currently.

To test the theory, if we change the global handler as below it will report the job.fail properly and we will able to see the retries counter going down and the Camunda incident generated after 3 attempts.

image

Still, if the unhandled exception is not of the type RpcExecption Nest will generate a generic "Internal Server Error" exception message that will be sent to Camunda. The log with the proper exception will be logged in the worker's POD but Camunda will receive only the generic message.

IMO we need to fix the global error handler and also talk about the possibility to ensure the controllers will always provide a better error handling.
@dheepak-aot @guru-aot @andrepestana-aot @sh16011993 FYI

});
} catch (error: unknown) {
return job.fail(`Unexpected error while seting the PIR status. ${error}`);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo on "seting".

Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great research and findings. Please let's talk about having the global error handler fixed and also I would like to check if the default timeouts are actually woking.

@andrewsignori-aot andrewsignori-aot changed the title #2207 - Investigate Root Cause of Camunda' 'Workers Getting Stuck#2207 #2207 - Investigate Root Cause of Camunda' 'Workers Getting Stuck Aug 29, 2023
@andrewsignori-aot andrewsignori-aot changed the title #2207 - Investigate Root Cause of Camunda' 'Workers Getting Stuck #2207 - Investigate Root Cause of Camunda Workers Getting Stuck Aug 30, 2023
});
} catch (error: unknown) {
const errorMessage = `Unexpected error while verifying the application exceptions. ${error}`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we would be repeating this along all controller and jobs, I was expecting that a basic helper would be created, as shown below.

} catch (error: unknown) {
  return createUnexpectedJobFail(error, job);
}

The helper method can be something like below.

createUnexpectedJobFail(
    error: unknown,
    job: ZeebeJob,
    logger?: Logger,
  ): MustReturnJobActionAcknowledgement {
    const jobLogger = logger ?? new Logger(job.type);
    const errorMessage = `Unexpected error while processing job. ${parseException(error)}`;
    jobLogger.error(errorMessage);
    return job.fail(errorMessage);
  }

As an out-of-PR-scope suggestion, I believe that we can create the below error formatter to avoid logging as jobLogger.log(${error.name} ${error.message});.

parseException(error: unknown): string {
  return JSON.stringify(error, null, 2);
}

* Parse the error in a prettier format for better
* readability.
*/
function parseException(error: unknown): string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would a generic method to be used every where, not only by workers. It can be moved to the generic lib.

return job.complete(outputVariables);
} catch (error: unknown) {
const errorMessage = `Unexpected error while loading supporting user data. ${error}`;
return createUnexpectedJobFail(errorMessage, job, jobLogger);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting that we no longer need line 103.
The helper would be called as return createUnexpectedJobFail(error, job, jobLogger);

@@ -63,7 +64,21 @@ export class ZeebeTransportStrategy
`Starting job for processInstanceKey ${job.processInstanceKey}. Retries left: ${job.retries}.`,
);
try {
return await jobHandler(job);
// The jobHandler can potentially return a Promise or
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments. 👍

@dheepak-aot
Copy link
Collaborator

Great work @ann-aot and great team effort putting our heads together to resolve this issue. Added few minor comments.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@github-actions
Copy link

Backend Unit Tests Coverage Report

Totals Coverage
Statements: 17.94% ( 2170 / 12094 )
Methods: 8.2% ( 126 / 1537 )
Lines: 20.8% ( 1906 / 9163 )
Branches: 9.9% ( 138 / 1394 )

@github-actions
Copy link

E2E Workflow Workers Coverage Report

Totals Coverage
Statements: 47.47% ( 300 / 632 )
Methods: 40% ( 32 / 80 )
Lines: 51.86% ( 251 / 484 )
Branches: 25% ( 17 / 68 )

@github-actions
Copy link

E2E Queue Consumers Coverage Report

Totals Coverage
Statements: 72.5% ( 406 / 560 )
Methods: 63.38% ( 45 / 71 )
Lines: 74.53% ( 357 / 479 )
Branches: 40% ( 4 / 10 )

Copy link
Collaborator

@dheepak-aot dheepak-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes 👍

@github-actions
Copy link

E2E SIMS API Coverage Report

Totals Coverage
Statements: 53.44% ( 3900 / 7298 )
Methods: 49.89% ( 470 / 942 )
Lines: 58.41% ( 3181 / 5446 )
Branches: 27.36% ( 249 / 910 )

Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the changes, looks good 👍

Copy link
Collaborator

@guru-aot guru-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ann-aot ann-aot merged commit be59d3b into main Aug 31, 2023
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 18:55 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 18:57 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 18:57 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 18:57 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 18:57 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 19:00 — with GitHub Actions Inactive
@ann-aot ann-aot temporarily deployed to DEV August 31, 2023 19:00 — with GitHub Actions Inactive
@ann-aot ann-aot deleted the fix/sims-#2207 branch September 7, 2023 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants