Fix task service shutdown, errors, and task handling #736

darinkrauss · 2024-06-13T05:31:16Z

Use context deadline to forcibly cancel tasks at extended timeout
Allow task queue shutdown to fully process any outstanding tasks
Separate task queue shutdown into worker and manager groups
Ensure critical database updates are not affected by context cancel
Fix task service termination
Fix pending iterator usage and error handling
Use platform errors
Minor logging updates
Update test runner

The existing code that detected a long-running task (at 2 times the runner maximum duration) would simply update the task in the database to pending and available. This inadvertently caused the same task to be run again while the long-running task was still running, which yielded a number of errors. The code was updated to cancel the context on long-running tasks, which forces the long-running task to actually stop cleanly.

When the current task queue implementation was signaled to shut down, it would cancel the common context for the worker routines and the manager routine. The manager routine is responsible for taking finished tasks and storing them in the database. Upon shutdown, however, the manager routine would often exit before one or more of the worker routines, so those tasks would not be updated in the database (and were thus left permanently in the running state; this was why the "unstick tasks" code was written). The code was updated to use two separate cancel functions and two wait groups (one for the workers, one for the manager), allowing the manager to fully handle all of the worker tasks (saving them to the database) before shutting down itself. Unless there is a crash, there should no longer be forever-running tasks.
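The shutdown ordering described above can be sketched roughly as follows. This is a simplified stand-in, not the PR's queue: the type, its fields, and the functions startQueue and stop are all hypothetical, and an in-memory slice takes the place of the database.

```go
package main

import (
	"context"
	"sync"
)

// queue is a simplified, hypothetical stand-in for the real task queue.
type queue struct {
	completed     chan int // finished task IDs, sent from workers to the manager
	saved         []int    // stands in for database writes; touched only by the manager
	workerCancel  context.CancelFunc
	managerCancel context.CancelFunc
	workers       sync.WaitGroup
	manager       sync.WaitGroup
}

func startQueue(taskIDs []int) *queue {
	q := &queue{completed: make(chan int)}
	workerCtx, workerCancel := context.WithCancel(context.Background())
	managerCtx, managerCancel := context.WithCancel(context.Background())
	q.workerCancel, q.managerCancel = workerCancel, managerCancel

	q.manager.Add(1)
	go func() {
		defer q.manager.Done()
		for {
			select {
			case id := <-q.completed:
				q.saved = append(q.saved, id) // "save" the finished task
			case <-managerCtx.Done():
				return
			}
		}
	}()

	for _, id := range taskIDs {
		q.workers.Add(1)
		go func(id int) {
			defer q.workers.Done()
			<-workerCtx.Done() // pretend to work until told to stop
			q.completed <- id  // always hand the result to the manager
		}(id)
	}
	return q
}

// stop cancels the workers first and cancels the manager only after every
// worker has handed off its task, so no result is dropped.
func (q *queue) stop() {
	q.workerCancel()
	q.workers.Wait()
	q.managerCancel()
	q.manager.Wait()
}
```

With a single shared context, the manager could exit while a worker was still blocked on its send. Canceling in two stages removes that race in this sketch.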

The task manager did not protect critical database operations (such as saving the final state of a canceled task) from a context cancel. The code was updated to protect those critical database operations from a cancel.

Also, some smaller changes to the task-related code.

NOTE: This is a part of the overall Dexcom work which will be collected into a final PR for final approval. However, since the work covers a number of different issues, I broke up the development into multiple smaller and more focused PRs. Once all of the smaller PRs are approved, I'll create a final overall PR for approval that will eventually be tested and deployed. (The smaller PRs will be neither tested nor deployed.)


toddkazakov · 2024-06-13T08:35:48Z

task/queue/queue.go

@@ -226,8 +256,15 @@ func (q *queue) startManager(ctx context.Context) {
 			}

 			select {
-			case <-ctx.Done():
-				return
+			case <-ctx.Done(): // Drain and complete any interrupted tasks


Instead of doing a non-blocking select until there are no new messages, consider closing the channel after the workers' wait group is done and draining it with for tsk := range completeTask, which would make the block below easier to understand.
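A minimal sketch of that suggestion, assuming each worker sends exactly one result. Only the channel name completeTask comes from the comment; the function drainCompleted and its integer task IDs are hypothetical.

```go
package main

import "sync"

// drainCompleted is a hypothetical illustration of the close-then-range
// pattern. Each worker sends one result; the channel is closed once all
// workers are done, so the range loop ends exactly when nothing is left.
func drainCompleted(taskIDs []int) []int {
	completeTask := make(chan int)
	var workers sync.WaitGroup

	for _, id := range taskIDs {
		workers.Add(1)
		go func(id int) {
			defer workers.Done()
			completeTask <- id
		}(id)
	}

	// Close only after every sender has finished; closing any earlier
	// would make a late send panic.
	go func() {
		workers.Wait()
		close(completeTask)
	}()

	var saved []int
	for tsk := range completeTask {
		saved = append(saved, tsk) // stands in for saving the task
	}
	return saved
}
```

The range loop needs no default case or polling: it blocks until a result arrives and exits on its own when the channel is closed.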

toddkazakov · 2024-06-13T08:41:56Z

task/service/service/service.go

@@ -269,6 +269,13 @@ func (s *Service) initializeClinicsClient() error {
 	return nil
 }

+func (s *Service) terminateClinicsClient() {


What does this accomplish? Removing the reference doesn't terminate the client. I'm having trouble understanding why this code exists. Yes, it's consistent, but adds complexity and doesn't do anything useful.

darinkrauss requested review from jh-bate and toddkazakov June 13, 2024 05:31
