Scaling not working after scaling in to 1 #120
Comments
After my manual scale-out to fix the issue, I don't see this log anymore:
@commarla thanks for the report; I'll take a look this afternoon.
I have tried to reproduce this locally without success, so I will keep trying over the next day or so. Using the Nomad meta policy engine, the example job was scaled down to its minimum allowed count of 1:
Under minimal load Sherpa wanted to scale in, but was prevented by the scaling policy:
I then loaded the service and triggered a successful scale-out event:
From the description you gave, it is almost as if the meta engine incorrectly removed the policy during the last scaling event. Does the job contain multiple groups or just a single one? Are there any more debug logs from the times you were having issues that I could look at?
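If it helps while debugging the same symptom, one way to confirm whether the policy keys are still attached to the scaled-down group is to read the job back through the official Nomad Go API client and print each group's count and meta. This is only a sketch, not Sherpa's code: "example" is a placeholder job ID, and the exact meta key names depend on your own policy configuration.

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example" is a placeholder; use the ID of the affected job.
	job, _, err := client.Jobs().Info("example", nil)
	if err != nil {
		log.Fatal(err)
	}

	// If the meta policy engine is to keep scaling this job, each group
	// should still carry its sherpa policy keys in Meta after a scale-in.
	for _, tg := range job.TaskGroups {
		fmt.Printf("group=%s count=%d meta=%v\n", *tg.Name, *tg.Count, tg.Meta)
	}
}
```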
Thanks @jrasell. I have also hit this issue with other jobs but didn't take the time to file an issue. Is there a way to check whether Sherpa needs more threads or some tuning?
Hi, it is live. My task group scales in during the night, and this morning every alloc is at 100%+ CPU and nothing happens. There are 3 allocs from one task group without sherpa meta (suffix and Sherpa keeps logging that
The CPU never goes under 1.3GHz, which is 80% of the allocated CPU (1600MHz). If you want, I can send you the whole log file privately without redacting it.
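For context on the numbers above (1300 of 1600 MHz is roughly 81%), the same per-allocation utilisation figure can be pulled from the Nomad API. This is not Sherpa's code, just a rough sketch: "ALLOC_ID" is a placeholder and the 1600 MHz allocation size comes from the report above.

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "ALLOC_ID" is a placeholder; use one of the stuck allocation IDs.
	alloc, _, err := client.Allocations().Info("ALLOC_ID", nil)
	if err != nil {
		log.Fatal(err)
	}

	stats, err := client.Allocations().Stats(alloc, nil)
	if err != nil {
		log.Fatal(err)
	}

	// TotalTicks is the allocation's CPU usage in MHz. Against the 1600 MHz
	// allocated in this report, 1300 MHz works out to roughly 81%.
	usedMHz := stats.ResourceUsage.CpuStats.TotalTicks
	allocatedMHz := 1600.0
	fmt.Printf("cpu: %.0f of %.0f MHz (%.1f%%)\n", usedMHz, allocatedMHz, usedMHz/allocatedMHz*100)
}
```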
@commarla if you can use [email protected], that would be grand. I'll do more testing today and try to track this down. Would you also be able to send me what the Nomad API for
Fix incorrect searching of allocs causing missed allocs in jobs.
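The fix title above points at how allocations were being looked up per job. For background only (this is not the actual patch), listing a job's allocations and counting the running ones per task group through the Nomad Go API looks roughly like the sketch below; "example" is again a placeholder job ID. If that lookup misses allocations, the autoscaler has nothing to base a decision on, which matches the "stuck at count 1" symptom.

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List allocations for the job; "example" is a placeholder job ID.
	allocs, _, err := client.Jobs().Allocations("example", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Count running allocations per task group.
	running := map[string]int{}
	for _, stub := range allocs {
		if stub.ClientStatus == nomad.AllocClientStatusRunning {
			running[stub.TaskGroup]++
		}
	}
	fmt.Println(running)
}
```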
Describe the bug
Sometimes when one of my task groups is scaled in to 1 allocation, Sherpa seems to forget the policies and doesn't scale anymore.
To reproduce
I am using the Nomad meta policy engine.
The job scales in to 1 and then never scales out.
I just need to manually increase the count of the task group from 1 to 2, and Sherpa continues to scale out normally.
In this example the service was stuck for 15+ days.
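For the record, the manual workaround described above (bumping the group count from 1 to 2 so scaling resumes) can also be done programmatically by re-registering the job with a new count. A rough sketch with the Nomad Go API client, where the job ID "example" and the group name "web" are placeholders:

```go
package main

import (
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example" is a placeholder job ID.
	job, _, err := client.Jobs().Info("example", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Bump the stuck group ("web" is a placeholder) from 1 to 2 and
	// re-register the job so Nomad applies the new count.
	for _, tg := range job.TaskGroups {
		if *tg.Name == "web" {
			count := 2
			tg.Count = &count
		}
	}
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}
}
```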
In 0.3.0 I have the following logs:
After upgrading to 0.4.0:
Expected behavior
A scale-out event.
Environment:
Sherpa server information (sherpa system info):
Sherpa version (sherpa --version):