Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix task service shutdown, errors, and task handling #736

Open
wants to merge 1 commit into
base: darin/improve-client-authentication-and-error-handling
Choose a base branch
from

Conversation

darinkrauss
Copy link
Contributor

  • Use context deadline to forcibly cancel tasks at extended timeout
  • Allow task queue shutdown to fully process any outstanding tasks
  • Separate task queue shutdown into worker and manager groups
  • Ensure critical database updates are not affected by context cancel
  • Fix task service termination
  • Fix pending iterator usage and error handling
  • Use platform errors
  • Minor logging updates
  • Update test runner

The existing code that detected a long running tasks (at 2 times the runner maximum duration) would simple update the task in the database to pending and available. This inadvertently caused the same task to be run again while the long running task was still running. This yielded a number of errors. The code was updated to cancel the context on long running tasks, which would force long running task to actually stop cleanly.

When the current task queue implementation was signaled to shutdown it would cancel the common context for the worker routines and the manager routine. The manager routine is responsible for taking finished tasks and storing the task in the database. Upon shutdown, however, many times the manager routine would exit before one or more of the work routines and so those tasks would not be updated in the database (and, thus, left permanently in the running state - this was why the "unstick tasks" code was written). The code was updated to separate out two cancel functions and two wait groups (one for workers, one for manager) and allow the manager to fully handle all of the worker tasks (saving them to the database) before itself shutdown. Unless there is a crash, there should no longer be forever running tasks.

The task manager did not protect critical database operations (such as saving the final state of a canceled task) from a context cancel. The code was updated to protect those critical database operations from a cancel.

Also, some smaller changes to the task-related code.

NOTE: This is a part of the overall Dexcom work which will be collected into a final PR for final approval. However, since the work covers a number of different issues, I broke up the development into multiple smaller and more focused PRs. Once all of the smaller PRs are approved, I'll create a final overall PR for approval that will eventually be tested and deployed. (The smaller PRs will not be tested nor deployed.)

- Use context deadline to forcibly cancel tasks at extended timeout
- Allow task queue shutdown to fully process any outstanding tasks
- Separate task queue shutdown into worker and manager groups
- Ensure critical database updates are not affected by context cancel
- Fix task service termination
- Fix pending iterator usage and error handling
- Use platform errors
- Minor logging updates
- Update test runner
@@ -226,8 +256,15 @@ func (q *queue) startManager(ctx context.Context) {
}

select {
case <-ctx.Done():
return
case <-ctx.Done(): // Drain and complete any interrupted tasks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing a non-blocking select until there are no new messages, consider closing the channel after the workers waitgroup is done and drain the channel with for tsk := range completeTask to make the block below easier to understand.

@@ -269,6 +269,13 @@ func (s *Service) initializeClinicsClient() error {
return nil
}

func (s *Service) terminateClinicsClient() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this accomplish? Removing the reference doesn't terminate the client. I'm having trouble understanding why this code exists. Yes, it's consistent, but adds complexity and doesn't do anything useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants