Continuous Reconciliation on a 5 second basis #2122

Closed
relistan opened this issue Jul 23, 2020 · 1 comment · Fixed by #2123

relistan (Contributor) commented Jul 23, 2020

Hey folks, I have spent a fair bit of time debugging this and I believe there is a problem with the code introduced in fb05da3 related to stuck launching tasks. It appears to continually request reconciliation for all tasks in the system after they have been running for a short while.
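
(For context, an explicit reconcile asks the Mesos master to re-send the current status of a specific task ID, so each of these log lines corresponds to one call to the master per task per poller run. Below is a minimal illustration using the classic Mesos Java scheduler driver; this is not Singularity's actual code path, and the class and method names are only for the example.)

import java.util.Collections;

import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Illustration only, not Singularity's code: an explicit reconciliation sends
// the master a TaskStatus containing the task ID and asks it to re-publish
// that task's authoritative state.
class ReconcileIllustration {

  static void requestExplicitReconcile(SchedulerDriver driver, String taskId) {
    Protos.TaskStatus status = Protos.TaskStatus
      .newBuilder()
      .setTaskId(Protos.TaskID.newBuilder().setValue(taskId))
      // state is a required proto field; the master replies with the real state
      .setState(Protos.TaskState.TASK_STAGING)
      .build();
    driver.reconcileTasks(Collections.singletonList(status));
  }
}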

Here's the setup I validated this on:

  • Local Docker stack
  • One agent
  • One master
  • One ZK node
  • Singularity 1.2.0 and current master - same behavior

Steps:

  1. Deploy nginx with OPTIMISTIC placement strategy
  2. Wait a couple of minutes
  3. See it start to log Requested explicit reconcile of task ... every 5 seconds when the SchedulerPoller runs. It's calling the Mesos master for every one of these.

Logs

singularity_1   | INFO  [2020-07-23 13:57:03,437] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1   | INFO  [2020-07-23 13:57:03,437] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1   | INFO  [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1   | INFO  [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.007), 0 accepted, 0 declined/cached
singularity_1   | INFO  [2020-07-23 13:57:08,447] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1   | INFO  [2020-07-23 13:57:08,447] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1   | INFO  [2020-07-23 13:57:08,455] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1   | INFO  [2020-07-23 13:57:08,458] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.011), 0 accepted, 0 declined/cached

Possible Cause

Looking at the code, it appears to me that

public List<SingularityTaskId> getLaunchingTasks() {
  return getActiveTaskIds()
    .stream()
    .filter(t -> !exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)))
    .collect(Collectors.toList());
}

is actually going to return all tasks that are not starting, rather than all of the tasks that are still starting. To confirm, if I look in /api/state I see the following:

"activeTasks":1,
"launchingTasks":0,
"activeRequests":1

And yet it continues to run the reconciliation. I checked in ZooKeeper what is in the update history for the nginx task above. It has only two entries:

[zk: localhost:2181(CONNECTED) 3] ls /singularity/tasks/history/dev_nginx/dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT/updates
[TASK_LAUNCHED, TASK_RUNNING]

So I am pretty sure the code above is the culprit. If I rebuild the current master branch without the ! in front of exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)), I no longer see the issue.
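
For reference, this is roughly what I rebuilt with, i.e. the same method with the negation dropped (the actual fix in #2123 may well do something different):

public List<SingularityTaskId> getLaunchingTasks() {
  return getActiveTaskIds()
    .stream()
    // was: !exists(...), which also matched tasks like the nginx one above,
    // whose updates are [TASK_LAUNCHED, TASK_RUNNING] with no TASK_STARTING entry,
    // so they were reconciled on every poller run
    .filter(t -> exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)))
    .collect(Collectors.toList());
}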

relistan (Contributor, Author) commented

We're now running this live and it seems to resolve the issue.
