This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Scaling not working after scaling in to 1 #120

Closed
commarla opened this issue Jan 2, 2020 · 6 comments · Fixed by #121
Assignees: jrasell
Labels: area/autoscaling (Issues related to internal autoscaling), kind/bug (Issues relating to bugs), status/needs-investigation (Issues that need additional investigation)

Comments


commarla commented Jan 2, 2020

Describe the bug
Sometimes, when one of my task groups is scaled in to 1 allocation, Sherpa seems to forget the policies and doesn't scale any more.

To reproduce
I am using Nomad meta:

sherpa_cooldown:"180"
sherpa_enabled:"1"
sherpa_max_count:"5"
sherpa_min_count:"1"
sherpa_scale_in_count:"1"
sherpa_scale_in_cpu_percentage_threshold:"30"
sherpa_scale_in_memory_percentage_threshold:"30"
sherpa_scale_out_count:"1"
sherpa_scale_out_cpu_percentage_threshold:"70"
sherpa_scale_out_memory_percentage_threshold:"70"
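
For context, here is a minimal sketch of how thresholds like these are usually interpreted (hypothetical Go with made-up type and field names, not Sherpa's actual implementation): scale out when utilisation crosses the sherpa_scale_out_* percentages, scale in when it drops below the sherpa_scale_in_* percentages, clamped to the min/max counts.

package main

import "fmt"

// policy mirrors the meta keys above using hypothetical names.
type policy struct {
	minCount, maxCount            int
	scaleInCount, scaleOutCount   int
	scaleInCPUPct, scaleOutCPUPct float64
}

// desiredCount returns the new group count for an observed CPU percentage.
func desiredCount(current int, cpuPct float64, p policy) int {
	switch {
	case cpuPct >= p.scaleOutCPUPct && current+p.scaleOutCount <= p.maxCount:
		return current + p.scaleOutCount
	case cpuPct <= p.scaleInCPUPct && current-p.scaleInCount >= p.minCount:
		return current - p.scaleInCount
	default:
		return current
	}
}

func main() {
	p := policy{minCount: 1, maxCount: 5, scaleInCount: 1, scaleOutCount: 1, scaleInCPUPct: 30, scaleOutCPUPct: 70}
	fmt.Println(desiredCount(1, 81, p)) // 2: a group of 1 at 81% CPU should scale out
}

With these meta values, a group sitting at count 1 should scale back out as soon as utilisation passes 70%.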

The job scales in to 1 and then never scales out.

Screenshot 2020-01-02 15 09 53

I just need to manually increase the task group count from 1 to 2 and Sherpa continues to scale out normally.

Screenshot 2020-01-02 15 13 14

In this example the service was stuck for 15+ days.
On 0.3.0 I had the following logs:

Jan 02 14:37:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:37:56.421144110Z","message":"no task groups in job have scaling policies enabled"}
Jan 02 14:38:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:38:56.422893887Z","message":"no task groups in job have scaling policies enabled"}

After upgrading to 0.4.0:

Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-02T13:57:21.298110900Z","message":"triggering autoscaling job evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-02T13:57:21.306032109Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-02T13:57:21.306043091Z","message":"triggering autoscaling job group evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-02T13:57:21.306154810Z","message":"scaling evaluation completed and no scaling required"}

Expected behavior
A scaling out event.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
/usr/bin # sherpa system info
Nomad Address                http://nomad.eu-central-1.elb.amazonaws.com:4646
Policy Engine                Nomad Job Group Meta
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
/usr/bin # sherpa --version
sherpa version v0.4.0
        Date:   2020-01-02 14:26:03.176156495 +0000 UTC
        Commit: 212ca6e
        Branch: v0.4.0
        State:  v0.4.0
  • Server Operating System/Architecture: Debian 9.11
  • Sherpa server configuration parameters:
SHERPA_AUTOSCALER_ENABLED=true
SHERPA_AUTOSCALER_EVALUATION_INTERVAL=60
SHERPA_AUTOSCALER_NUM_THREADS=20
SHERPA_BIND_ADDR=0.0.0.0
SHERPA_BIND_PORT=8000
SHERPA_CLUSTER_ADVERTISE_ADDR=http://127.0.0.1:8000
SHERPA_CLUSTER_NAME=prod-main-admin
SHERPA_LOG_FORMAT=auto
SHERPA_LOG_LEVEL=debug
SHERPA_LOG_USE_COLOR=true
SHERPA_POLICY_ENGINE_API_ENABLED=false
SHERPA_POLICY_ENGINE_NOMAD_META_ENABLED=true
SHERPA_POLICY_ENGINE_STRICT_CHECKING_ENABLED=true
SHERPA_STORAGE_CONSUL_ENABLED=true
SHERPA_STORAGE_CONSUL_PATH=sherpa/
SHERPA_TELEMETRY_PROMETHEUS=true
SHERPA_UI=true

commarla commented Jan 2, 2020

After manually scaling out to fix the issue, I no longer see this log line: failed to collect Nomad metrics, skipping Nomad based checks

jrasell added the area/autoscaling and kind/bug labels on Jan 2, 2020
jrasell self-assigned this on Jan 2, 2020

jrasell commented Jan 2, 2020

@commarla thanks for the report; I'll take a look this afternoon.


jrasell commented Jan 2, 2020

I have tried to reproduce this locally without success, so I will keep trying over the next day or so. Using the Nomad meta policy engine, the example job was scaled down to its minimum allowed count of 1:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
43cd970f  3d79e0f6  cache       3        run      running   5m8s ago    2m19s ago
75fd797a  3d79e0f6  cache       2        stop     complete  6m20s ago   3m19s ago
c41c66d8  3d79e0f6  cache       1        stop     complete  16m33s ago  5m7s ago
e9d80206  3d79e0f6  cache       0        stop     complete  16m33s ago  14m46s ago

Under minimal load Sherpa wanted to scale in, but was prevented by the scaling policy:

5:17PM INF Nomad resource utilisation calculation cpu-value-percentage=10.466054549222797 group=cache job=example2 mem-value-percentage=0
5:17PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:17PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"in","group":"cache"}
5:17PM DBG scaling action will break job group minimum threshold group=cache job=example2
5:17PM ERR failed to trigger scaling of job error="scaling action will break job group minimum threshold" job=example2
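
A minimal sketch of the guard that last log line describes (hypothetical Go, not Sherpa's actual code): a scale-in request is rejected when it would take the group below its configured minimum.

package main

import "fmt"

// breaksMinimum reports whether removing scaleInCount allocations would
// drop the group below its minimum count.
func breaksMinimum(current, scaleInCount, minCount int) bool {
	return current-scaleInCount < minCount
}

func main() {
	// A group already at its minimum of 1 cannot scale in by 1,
	// matching the "minimum threshold" error above.
	fmt.Println(breaksMinimum(1, 1, 1)) // true
}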

I then loaded the service and triggered a successful scaling out event:

5:23PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:23PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"out","group":"cache"}
5:23PM DBG meta watcher last index has changed new=83 old=72
5:23PM DBG job modify index has changed is greater than last recorded job=example2 new=83 old=72
5:23PM DBG received job list update message to handle
5:23PM DBG reading job group meta stanzas job=example2
5:23PM DBG meta watcher last index has changed new=85 old=83
5:23PM INF successfully triggered autoscaling of job evaluation-id=e0155fc9-7414-11c9-67ac-39232bcd8992 id=dc7e8685-3bd0-410d-ac61-9a4b2f1765cc job=example2
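
The "meta watcher" lines suggest a blocking query against the Nomad job list, re-reading group meta whenever a job's modify index moves past the last recorded value. A rough sketch of that general Nomad blocking-query pattern using the official github.com/hashicorp/nomad/api client (an illustration of the idiom, not Sherpa's exact watcher code):

package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query: returns once the job list index moves past lastIndex.
		jobs, meta, err := client.Jobs().List(&api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Println(err)
			time.Sleep(10 * time.Second)
			continue
		}
		for _, job := range jobs {
			if job.ModifyIndex > lastIndex {
				log.Printf("job %s changed, re-reading group meta", job.ID)
			}
		}
		lastIndex = meta.LastIndex
	}
}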

From the description you gave, it is almost as if the meta engine incorrectly removed the policy during the last scaling event. Does the job contain multiple groups or just a single one? Are there any more debug logs from the times you were having issues that I could look at?

jrasell added the status/needs-investigation label on Jan 2, 2020

commarla commented Jan 2, 2020

Thanks @jrasell
There are 2 groups and I don't see anything else in the logs.

I have also hit this issue with other jobs but didn't take the time to file an issue.
All my jobs with Sherpa meta have at least 2 task groups.

Is there a way to check whether Sherpa needs more threads or some other tuning?


commarla commented Jan 3, 2020

Hi,

The issue is happening live right now. My task group scaled in during the night, this morning every alloc is at 100%+ CPU, and nothing happens.

Screenshot 2020-01-03 08 55 17

There are 3 allocs from the task group without Sherpa meta (-app suffix) and one alloc from the task group with Sherpa meta (-spot suffix).

Sherpa keeps logging this:

Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-03T07:57:21.373291703Z","message":"triggering autoscaling job evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-03T07:57:21.384312224Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-03T07:57:21.384323537Z","message":"triggering autoscaling job group evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-03T07:57:21.384358656Z","message":"scaling evaluation completed and no scaling required"}

The CPU usage never goes below 1.3 GHz, which is over 80% of the allocated CPU (1600 MHz).
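
A quick check of those numbers (values taken from this comment):

package main

import "fmt"

func main() {
	usedMHz, allocatedMHz := 1300.0, 1600.0
	pct := usedMHz / allocatedMHz * 100
	fmt.Printf("%.2f%% utilised\n", pct) // 81.25% — well above the 70% scale-out threshold
}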

[image: CPU utilisation graph]

If you want, I can privately send you the whole log file without redacting it.


jrasell commented Jan 3, 2020

@commarla if you can use [email protected] that would be grand. I'll do more testing today and try and track this down.

Would you also be able to send me what the Nomad API endpoint /v1/job/:job_id/allocations shows for the job that is currently going through this problem?
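
For reference, the same allocation list can be pulled with the official Go client instead of raw HTTP (a small sketch; the job ID "my-service" is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to GET /v1/job/my-service/allocations.
	allocs, _, err := client.Jobs().Allocations("my-service", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range allocs {
		fmt.Println(a.ID, a.TaskGroup, a.JobVersion, a.DesiredStatus, a.ClientStatus)
	}
}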

jrasell added a commit that referenced this issue Jan 3, 2020
Fix incorrect searching of allocs causing missed allocs in jobs.