This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Scaling not working after scaling in to 1 #120

Closed
commarla opened this issue Jan 2, 2020 · 6 comments · Fixed by #121
Assignees: jrasell
Labels: area/autoscaling (Issues related to internal autoscaling), kind/bug (Issues relating to bugs), status/needs-investigation (Issues that need additional investigation)

Comments


commarla commented Jan 2, 2020

Describe the bug
Sometimes, when one of my task groups is scaled in to 1 allocation, Sherpa seems to forget the policies and doesn't scale any more.

To reproduce
I am using Nomad meta:

sherpa_cooldown:"180"
sherpa_enabled:"1"
sherpa_max_count:"5"
sherpa_min_count:"1"
sherpa_scale_in_count:"1"
sherpa_scale_in_cpu_percentage_threshold:"30"
sherpa_scale_in_memory_percentage_threshold:"30"
sherpa_scale_out_count:"1"
sherpa_scale_out_cpu_percentage_threshold:"70"
sherpa_scale_out_memory_percentage_threshold:"70"
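
For context, here is a minimal sketch of how thresholds like these are usually interpreted (hypothetical Go with made-up type and field names, not Sherpa's actual implementation): scale out when utilisation crosses the sherpa_scale_out_* percentages, scale in when it drops below the sherpa_scale_in_* percentages, clamped to the min/max counts.

package main

import "fmt"

// policy mirrors the meta keys above using hypothetical names.
type policy struct {
	minCount, maxCount            int
	scaleInCount, scaleOutCount   int
	scaleInCPUPct, scaleOutCPUPct float64
}

// desiredCount returns the new group count for an observed CPU percentage.
func desiredCount(current int, cpuPct float64, p policy) int {
	switch {
	case cpuPct >= p.scaleOutCPUPct && current+p.scaleOutCount <= p.maxCount:
		return current + p.scaleOutCount
	case cpuPct <= p.scaleInCPUPct && current-p.scaleInCount >= p.minCount:
		return current - p.scaleInCount
	default:
		return current
	}
}

func main() {
	p := policy{minCount: 1, maxCount: 5, scaleInCount: 1, scaleOutCount: 1, scaleInCPUPct: 30, scaleOutCPUPct: 70}
	fmt.Println(desiredCount(1, 81, p)) // 2: a group of 1 at 81% CPU should scale out
}

With these meta values, a group sitting at count 1 should scale back out as soon as utilisation passes 70%.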

The job scales in to 1 and then never scales out.

Screenshot 2020-01-02 15 09 53

I just need to manually increase the task group count from 1 to 2 and Sherpa continues to scale out normally.

Screenshot 2020-01-02 15 13 14

In this example the service was stuck for 15+ days.
On 0.3.0 I had the following logs:

Jan 02 14:37:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:37:56.421144110Z","message":"no task groups in job have scaling policies enabled"}
Jan 02 14:38:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:38:56.422893887Z","message":"no task groups in job have scaling policies enabled"}

After upgrading to 0.4.0:

Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-02T13:57:21.298110900Z","message":"triggering autoscaling job evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-02T13:57:21.306032109Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-02T13:57:21.306043091Z","message":"triggering autoscaling job group evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-02T13:57:21.306154810Z","message":"scaling evaluation completed and no scaling required"}

Expected behavior
A scaling out event.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
/usr/bin # sherpa system info
Nomad Address                http://nomad.eu-central-1.elb.amazonaws.com:4646
Policy Engine                Nomad Job Group Meta
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
/usr/bin # sherpa --version
sherpa version v0.4.0
        Date:   2020-01-02 14:26:03.176156495 +0000 UTC
        Commit: 212ca6e
        Branch: v0.4.0
        State:  v0.4.0
  • Server Operating System/Architecture: Debian 9.11
  • Sherpa server configuration parameters:
SHERPA_AUTOSCALER_ENABLED=true
SHERPA_AUTOSCALER_EVALUATION_INTERVAL=60
SHERPA_AUTOSCALER_NUM_THREADS=20
SHERPA_BIND_ADDR=0.0.0.0
SHERPA_BIND_PORT=8000
SHERPA_CLUSTER_ADVERTISE_ADDR=http://127.0.0.1:8000
SHERPA_CLUSTER_NAME=prod-main-admin
SHERPA_LOG_FORMAT=auto
SHERPA_LOG_LEVEL=debug
SHERPA_LOG_USE_COLOR=true
SHERPA_POLICY_ENGINE_API_ENABLED=false
SHERPA_POLICY_ENGINE_NOMAD_META_ENABLED=true
SHERPA_POLICY_ENGINE_STRICT_CHECKING_ENABLED=true
SHERPA_STORAGE_CONSUL_ENABLED=true
SHERPA_STORAGE_CONSUL_PATH=sherpa/
SHERPA_TELEMETRY_PROMETHEUS=true
SHERPA_UI=true

commarla commented Jan 2, 2020

After manually scaling out to fix the issue, I no longer see this log line: failed to collect Nomad metrics, skipping Nomad based checks

jrasell added the area/autoscaling and kind/bug labels on Jan 2, 2020
jrasell self-assigned this on Jan 2, 2020

jrasell commented Jan 2, 2020

@commarla thanks for the report; I'll take a look this afternoon.


jrasell commented Jan 2, 2020

I have tried to reproduce this locally without success, so I will keep trying over the next day or so. Using the Nomad meta policy engine, the example job was scaled down to its minimum allowed count of 1:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
43cd970f  3d79e0f6  cache       3        run      running   5m8s ago    2m19s ago
75fd797a  3d79e0f6  cache       2        stop     complete  6m20s ago   3m19s ago
c41c66d8  3d79e0f6  cache       1        stop     complete  16m33s ago  5m7s ago
e9d80206  3d79e0f6  cache       0        stop     complete  16m33s ago  14m46s ago

Under minimal load Sherpa wanted to scale in, but was prevented by the scaling policy:

5:17PM INF Nomad resource utilisation calculation cpu-value-percentage=10.466054549222797 group=cache job=example2 mem-value-percentage=0
5:17PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:17PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"in","group":"cache"}
5:17PM DBG scaling action will break job group minimum threshold group=cache job=example2
5:17PM ERR failed to trigger scaling of job error="scaling action will break job group minimum threshold" job=example2
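
A minimal sketch of the guard that last log line describes (hypothetical Go, not Sherpa's actual code): a scale-in request is rejected when it would take the group below its configured minimum.

package main

import "fmt"

// breaksMinimum reports whether removing scaleInCount allocations would
// drop the group below its minimum count.
func breaksMinimum(current, scaleInCount, minCount int) bool {
	return current-scaleInCount < minCount
}

func main() {
	// A group already at its minimum of 1 cannot scale in by 1,
	// matching the "minimum threshold" error above.
	fmt.Println(breaksMinimum(1, 1, 1)) // true
}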

I then loaded the service and triggered a successful scaling out event:

5:23PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:23PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"out","group":"cache"}
5:23PM DBG meta watcher last index has changed new=83 old=72
5:23PM DBG job modify index has changed is greater than last recorded job=example2 new=83 old=72
5:23PM DBG received job list update message to handle
5:23PM DBG reading job group meta stanzas job=example2
5:23PM DBG meta watcher last index has changed new=85 old=83
5:23PM INF successfully triggered autoscaling of job evaluation-id=e0155fc9-7414-11c9-67ac-39232bcd8992 id=dc7e8685-3bd0-410d-ac61-9a4b2f1765cc job=example2
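
The "meta watcher" lines suggest a blocking query against the Nomad job list, re-reading group meta whenever a job's modify index moves past the last recorded value. A rough sketch of that general Nomad blocking-query pattern using the official github.com/hashicorp/nomad/api client (an illustration of the idiom, not Sherpa's exact watcher code):

package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query: returns once the job list index moves past lastIndex.
		jobs, meta, err := client.Jobs().List(&api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Println(err)
			time.Sleep(10 * time.Second)
			continue
		}
		for _, job := range jobs {
			if job.ModifyIndex > lastIndex {
				log.Printf("job %s changed, re-reading group meta", job.ID)
			}
		}
		lastIndex = meta.LastIndex
	}
}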

From the description you gave, it is almost as if the meta engine incorrectly removed the policy during the last scaling event. Does the job contain multiple groups or just a single one? Are there any more debug logs from the times you were having issues that I could look at?

jrasell added the status/needs-investigation label on Jan 2, 2020

commarla commented Jan 2, 2020

Thanks @jrasell
There are 2 groups and I don't see anything else in the logs.

I have also hit this issue with other jobs but didn't take the time to file an issue.
All my jobs with Sherpa meta have at least 2 task groups.

Is there a way to check whether Sherpa needs more threads or some other tuning?


commarla commented Jan 3, 2020

Hi,

The issue is happening live right now. My task group scaled in during the night, this morning every alloc is at 100%+ CPU, and nothing happens.

Screenshot 2020-01-03 08 55 17

There are 3 allocs from the task group without Sherpa meta (-app suffix) and one alloc from the task group with Sherpa meta (-spot suffix).

Sherpa keeps logging this:

Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-03T07:57:21.373291703Z","message":"triggering autoscaling job evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-03T07:57:21.384312224Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-03T07:57:21.384323537Z","message":"triggering autoscaling job group evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-03T07:57:21.384358656Z","message":"scaling evaluation completed and no scaling required"}

The CPU usage never goes below 1.3 GHz, which is over 80% of the allocated CPU (1600 MHz).
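
A quick check of those numbers (values taken from this comment):

package main

import "fmt"

func main() {
	usedMHz, allocatedMHz := 1300.0, 1600.0
	pct := usedMHz / allocatedMHz * 100
	fmt.Printf("%.2f%% utilised\n", pct) // 81.25% — well above the 70% scale-out threshold
}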

[image: CPU utilisation graph]

If you want, I can privately send you the whole log file without redacting it.


jrasell commented Jan 3, 2020

@commarla if you can use [email protected] that would be grand. I'll do more testing today and try and track this down.

Would you also be able to send me what the Nomad API endpoint /v1/job/:job_id/allocations shows for the job that is currently going through this problem?
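
For reference, the same allocation list can be pulled with the official Go client instead of raw HTTP (a small sketch; the job ID "my-service" is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to GET /v1/job/my-service/allocations.
	allocs, _, err := client.Jobs().Allocations("my-service", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range allocs {
		fmt.Println(a.ID, a.TaskGroup, a.JobVersion, a.DesiredStatus, a.ClientStatus)
	}
}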

jrasell added a commit that referenced this issue Jan 3, 2020
Fix incorrect searching of allocs causing missed allocs in jobs.