Hey folks, I have spent a fair bit of time debugging this and I believe there is a problem with the code introduced in fb05da3 related to stuck launching tasks. It appears to continually request reconciliation for all tasks in the system after they have been running for a short while.
Here's the setup I validated this on:
Local Docker stack
One agent
One master
One ZK node
Singularity 1.2.0 and current master - same behavior
Steps:
Deploy nginx with OPTIMISTIC placement strategy
Wait a couple of minutes
See it start to log Requested explicit reconcile of task ... every 5 seconds when the SchedulerPoller runs. It's calling the Mesos master for every one of these.
Logs
singularity_1 | INFO [2020-07-23 13:57:03,437] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1 | INFO [2020-07-23 13:57:03,437] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1 | INFO [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1 | INFO [2020-07-23 13:57:03,444] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.007), 0 accepted, 0 declined/cached
singularity_1 | INFO [2020-07-23 13:57:08,447] com.hubspot.singularity.scheduler.SingularityScheduler: Requested explicit reconcile of task dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT
singularity_1 | INFO [2020-07-23 13:57:08,447] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Received 0 offer(s)
singularity_1 | INFO [2020-07-23 13:57:08,455] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: 0 remaining offers not accounted for in offer check
singularity_1 | INFO [2020-07-23 13:57:08,458] com.hubspot.singularity.mesos.SingularityMesosOfferScheduler: Finished handling 0 new offer(s) 0 from cache (00:00.011), 0 accepted, 0 declined/cached
Possible Cause
Looking at the code, it appears to me that the check in SingularityService/src/main/java/com/hubspot/singularity/data/TaskManager.java (lines 892 to 897 in 4f7a41d) is actually going to return all tasks that are not starting, rather than the ones that are still starting.
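What follows is a rough, self-contained sketch of what I believe those lines are doing. Only the !exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)) condition is taken from the actual code; the enum, the helper stubs, and the task list are stand-ins I wrote purely to illustrate the problem.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StuckLaunchingCheckSketch {

  // Stand-in for Singularity's ExtendedTaskState; only the states relevant here.
  enum ExtendedTaskState { TASK_LAUNCHED, TASK_STARTING, TASK_RUNNING }

  // The two updates that actually exist in ZK for my nginx task (see the listing below).
  static final Set<ExtendedTaskState> RECORDED_UPDATES =
      Set.of(ExtendedTaskState.TASK_LAUNCHED, ExtendedTaskState.TASK_RUNNING);

  // Stand-ins for the TaskManager helpers, reduced to "is this update recorded in ZK?".
  static String getUpdatePath(String taskId, ExtendedTaskState state) {
    return "/singularity/tasks/history/dev_nginx/" + taskId + "/updates/" + state;
  }

  static boolean exists(String updatePath) {
    String state = updatePath.substring(updatePath.lastIndexOf('/') + 1);
    return RECORDED_UPDATES.stream().anyMatch(s -> s.name().equals(state));
  }

  public static void main(String[] args) {
    List<String> activeTaskIds =
        List.of("dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT");

    // The condition as on current master: "no TASK_STARTING update recorded".
    // The healthy RUNNING task never wrote TASK_STARTING, so it is selected on every
    // poll and an explicit reconcile is requested for it every 5 seconds.
    List<String> selectedForReconcile = activeTaskIds.stream()
        .filter(t -> !exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)))
        .collect(Collectors.toList());

    System.out.println(selectedForReconcile); // prints the healthy nginx task id
  }
}
```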
To confirm, if I look in /api/state I see the following:
And yet it continues to run the reconciliation. I validated in Zookeeper what is in the history for my nginx task above. It only has two entries:
[zk: localhost:2181(CONNECTED) 3] ls /singularity/tasks/history/dev_nginx/dev_nginx-latest_6dc0b72-1595512077034-1-docker1-DEFAULT/updates
[TASK_LAUNCHED, TASK_RUNNING]
Since this task never recorded a TASK_STARTING update at all (it went straight from TASK_LAUNCHED to TASK_RUNNING), the negated check matches it on every poll. So I am pretty sure it's the code above that is the culprit. If I rebuild current master branch without the ! in front of exists(getUpdatePath(t, ExtendedTaskState.TASK_STARTING)), I no longer see the issue.
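To make the effect of that change concrete, here is the same condition evaluated with and without the negation against the two updates that actually exist in ZK for this task. Again, this is just a toy illustration with a plain Set, not the real TaskManager code.

```java
import java.util.Set;

public class NegationEffect {

  enum ExtendedTaskState { TASK_LAUNCHED, TASK_STARTING, TASK_RUNNING }

  public static void main(String[] args) {
    // The only two updates present in ZK for the nginx task.
    Set<ExtendedTaskState> updates =
        Set.of(ExtendedTaskState.TASK_LAUNCHED, ExtendedTaskState.TASK_RUNNING);

    boolean startingUpdateExists = updates.contains(ExtendedTaskState.TASK_STARTING);

    // Current master (with the !): evaluates to true for a healthy RUNNING task,
    // so it is reconciled on every SchedulerPoller run.
    System.out.println(!startingUpdateExists); // true

    // Rebuilt without the !: evaluates to false, and the repeated reconciles stop.
    System.out.println(startingUpdateExists); // false
  }
}
```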