
scaling job trigger should still be "active" if trigger value cannot be obtained temporarily #2561

Closed
taylorchu opened this issue Jan 26, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@taylorchu

taylorchu commented Jan 26, 2022

Report

We sometimes see the job is under-provisioned for a brief period because the metrics-api endpoint times out.

This report uses the metrics-api scaler, but the same issue should apply to other scalers as well.

Expected Behavior

httpLog.Error(err, fmt.Sprintf("Error when checking metric value: %s", err))

It is common in KEDA that a trigger is reported as inactive when its value cannot be obtained, but in this case the failure is only temporary.

Actual Behavior

The scaler ignores this trigger and under-provisions the job.

Steps to Reproduce the Problem

  1. Specify a metrics-api endpoint (a minimal stand-in endpoint is sketched below).
  2. Shut down the metrics-api endpoint after some successful requests.
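
A minimal stand-in for the metrics-api endpoint in step 1, assuming the trigger is pointed at it with a JSON value path such as "value" (the path, port, and JSON shape here are assumptions for illustration, not anything KEDA mandates). Stopping this process after a few successful scrapes reproduces step 2.

```go
// fake_metrics.go: tiny HTTP endpoint that serves a constant metric value.
// Point a metrics-api trigger at http://<host>:8080/metric, then stop the
// process after a few successful requests to simulate the outage.
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/metric", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"value": 42}`))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```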

Logs from KEDA operator

│ {"level":"error","ts":1643138171.2917557,"logger":"metrics_api_scaler","msg":"Error when checking metric value: Get \"http://xxxxx\": dial tcp xxxxxxxx:80: connec │

KEDA Version

2.5.0

Kubernetes Version

1.21

Platform

Google Cloud

Scaler Details

metrics-api

Anything else?

No response

@taylorchu taylorchu added the bug Something isn't working label Jan 26, 2022
@taylorchu
Author

  1. Can we keep using the last trigger value for a configurable error duration (sketched below)? Similar to ScaledObject, we could add a fallback duration to the fallback config, and then expose that fallback config in ScaledJob.
  2. Alternatively, expose the raw HPA config as in ScaledObject; we could make the HPA scale down slowly to wait for recovery.
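
A rough sketch of the first idea in plain Go, deliberately not tied to KEDA's internals: remember the last successfully fetched value and keep reporting it while errors persist, but only for a configurable fallback duration. All names here are illustrative.

```go
package fallback

import (
	"fmt"
	"time"
)

// LastKnownValue reports the most recent successful metric reading while
// fetch errors last no longer than FallbackFor; once that window is
// exceeded the error is propagated and the trigger can be treated as inactive.
type LastKnownValue struct {
	FallbackFor time.Duration
	value       float64
	lastSuccess time.Time
}

func (l *LastKnownValue) Get(fetch func() (float64, error)) (float64, error) {
	v, err := fetch()
	if err == nil {
		l.value, l.lastSuccess = v, time.Now()
		return v, nil
	}
	if !l.lastSuccess.IsZero() && time.Since(l.lastSuccess) < l.FallbackFor {
		return l.value, nil // temporary outage: reuse the last good value
	}
	return 0, fmt.Errorf("fallback window exceeded: %w", err)
}
```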

@JorTurFer
Member

Hi @taylorchu
Did you try increasing the HTTP timeout? That could improve the behavior when the upstream responds slowly. Regarding the fallback config for ScaledJob, I don't have any strong opinion, @kedacore/keda-core-contributors?

Regarding your second point, I don't follow: what do you mean by raw HPA config? A ScaledJob doesn't use an HPA; it's the operator itself that manages the jobs, not the HPA Controller.

@taylorchu
Author

Increasing the timeout won't help because the endpoint is restarting or has been autoscaled down. It really needs a retry.

Ignore my second point; I thought it used an HPA.
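
A minimal sketch of the retry idea from the comment above, again independent of KEDA internals; the attempt count and backoff are placeholders.

```go
package retry

import (
	"fmt"
	"time"
)

// WithRetry retries fetch a few times with exponential backoff so that a
// brief restart of the metrics endpoint does not immediately make the
// trigger look inactive.
func WithRetry(attempts int, backoff time.Duration, fetch func() (float64, error)) (float64, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := fetch()
		if err == nil {
			return v, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(backoff)
			backoff *= 2 // back off a little longer before the next attempt
		}
	}
	return 0, fmt.Errorf("metric fetch failed after %d attempts: %w", attempts, lastErr)
}
```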

@JorTurFer
Member

JorTurFer commented Feb 10, 2022

I think this can be closed after #2604.
There was an unexpected behavior that scaled to zero on an upstream error.
WDYT @taylorchu?

@taylorchu
Author

I just updated KEDA to the latest version 3 hours ago; I will reply later if it is still a problem. It happens roughly once every 2 weeks.

Repository owner moved this from Backlog to Ready To Ship in Roadmap - KEDA Core Feb 11, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 3, 2022