
Scaling is slow with many scalers #1449

Closed
coderanger opened this issue Dec 24, 2020 · 18 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@coderanger
Contributor

coderanger commented Dec 24, 2020

I finally got around to doing the 2.0 upgrade, and I'm a little confused by some performance issues.

I have around 700 RabbitMQ ScaledObjects. I wrote a script to convert all of them from the v1 to v2 format, and then did the upgrade. Just to see it reconcile every object once (to add the finalizers) took ~3 hours. And the logs are full of Throttling request took 29.426205883s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/[...]/deployments/[...] so I'm guessing it's making an excessive number of API requests, getting throttled, and taking a long time. The part I'm confused about is that a Get request like that should be in the controller-runtime cache already and so shouldn't even hit the apiserver. It wasn't clear to me where in the controller they are even coming from.

Expected Behavior

The 2.0 upgrade should take very little time.

Actual Behavior

The 2.0 upgrade took many hours, during which scaling was unavailable.

Steps to Reproduce the Problem

  1. Have way too many ScaledObjects in a just-applied state.
  2. Launch the manager pod.
  3. Wait.
  4. Profit.

Logs from KEDA operator

2020-12-24T06:21:53.539Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "...", "ScaledObject.Name": "..."}
I1224 06:22:03.061997 1 request.go:621] Throttling request took 29.430920668s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
I1224 06:22:13.062046 1 request.go:621] Throttling request took 29.431011286s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
2020-12-24T06:22:23.040Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "...", "ScaledObject.Name": "..."}
I1224 06:22:23.111974 1 request.go:621] Throttling request took 29.473182423s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
I1224 06:22:33.112340 1 request.go:621] Throttling request took 29.432306204s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
I1224 06:22:43.162005 1 request.go:621] Throttling request took 29.431112037s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
2020-12-24T06:22:52.538Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "...", "ScaledObject.Name": "..."}
I1224 06:22:53.162007 1 request.go:621] Throttling request took 29.475063103s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
I1224 06:23:03.162022 1 request.go:621] Throttling request took 29.430427868s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...
I1224 06:23:13.212043 1 request.go:621] Throttling request took 29.430636995s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/.../deployments/...

Note that every one of those ...s is a different deployment with no obvious order.

Specifications

  • KEDA Version: 2.0.0
  • Platform & Version: GKE
  • Kubernetes Version: v1.17.13-gke.2001
  • Scaler(s): RabbitMQ (but I don't think that directly matters)
@coderanger
Contributor Author

Aside from the /scale requests handled in #1453, I think the main issue is

if err := h.client.Get(context.TODO(), client.ObjectKey{Namespace: obj.Namespace, Name: obj.Spec.ScaleTargetRef.Name}, unstruct); err != nil {

I think that Get() call is not hitting the informer cache somehow and so is going all the way to the apiserver. I'll have to dig a bit into how the unstructured cache works.
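
To illustrate the distinction, here is a minimal sketch assuming the manager's default cache-backed client (the helper name and package are hypothetical): a typed Get for a Deployment is served from the shared informer cache, while an Unstructured Get like the one above is not cached by default and goes to the apiserver.

package scaling

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getDeploymentFromCache is a hypothetical helper: with the manager's
// cache-backed client, a typed Get like this is served from the shared
// informer cache rather than hitting the apiserver on every reconcile.
func getDeploymentFromCache(ctx context.Context, c client.Client, namespace, name string) (*appsv1.Deployment, error) {
	var deploy appsv1.Deployment
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &deploy); err != nil {
		return nil, err
	}
	return &deploy, nil
}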

@coderanger
Contributor Author

[Screenshot (2020-12-25): graph of the operator's API request rate]

The rate has been constant since the 2.0 upgrade, so it's not a cold-start issue; it would have to be completely skipping the cache somehow.

@coderanger coderanger changed the title Extremely slow initial resync Scaling is slow with many scalers Dec 28, 2020
@coderanger
Contributor Author

So from muddling around in the code, I've confirmed there are three places that trigger requests that are getting throttled and generally gumming up the works. The first is the isScalable get, which is fixed in #1453. The second is the unstructured object get mentioned above, which should be hitting the controller-runtime informer cache; I'll dig into that more later. The third is the scaling get in scaling/executor/scale_scaledobjects.go. That one I think can be improved in two ways: first, by straight up bumping the QPS on the rest client passed in to the scale client; second, by adding some better lazy evaluation in that RequestScale method, since the majority of the time the current scale isn't actually needed.

More updates as I have them.
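
A minimal sketch of the QPS bump mentioned above, assuming the rest config comes from controller-runtime (the function name and the values are placeholders, not recommendations). client-go's default client-side limits (QPS 5, burst 10) are what produce the "Throttling request took ..." lines in the logs.

package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

// restConfigWithHigherLimits raises the client-side rate limit on the
// rest.Config that would be handed to the scale client.
func restConfigWithHigherLimits() *rest.Config {
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // assumed value; client-go defaults to 5
	cfg.Burst = 200 // assumed value; client-go defaults to 10
	return cfg
}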

@coderanger
Contributor Author

Hmm, actually I think lazy eval isn't going to help there. How would you all feel about a special-case implementation for the common core types (basically just Deployment, I think?) that uses the plain typed object for the Get so it can use the informer cache and keep API overhead much lower? We would still need to fall back to /scale for custom resources and for setting things, but a little abstraction over the scale client to short-circuit getting the current number of replicas seems like it would be okay. It's annoying to have special cases, but given how often Deployments are used, a small improvement there has a big impact.
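
A rough sketch of the proposed special case, assuming access to both the cached controller-runtime client and a client-go ScalesGetter (the helper name is hypothetical): Deployments are read through the typed, cached client, and everything else falls back to the /scale subresource.

package scaling

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/scale"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// currentReplicas short-circuits the common Deployment case through the
// informer cache and falls back to the /scale subresource otherwise.
func currentReplicas(ctx context.Context, c client.Client, scales scale.ScalesGetter,
	gr schema.GroupResource, namespace, name string) (int32, error) {
	if gr.Group == "apps" && gr.Resource == "deployments" {
		var deploy appsv1.Deployment
		if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &deploy); err != nil {
			return 0, err
		}
		if deploy.Spec.Replicas != nil {
			return *deploy.Spec.Replicas, nil
		}
		return 1, nil // Kubernetes defaults spec.replicas to 1 when unset
	}
	s, err := scales.Scales(namespace).Get(ctx, gr, name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	return s.Spec.Replicas, nil
}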

@coderanger
Contributor Author

Updates on my performance adventures!

For the calls in scale_handler: I just hallucinated that Unstructureds support caching at all; they do not. So I've reworked that code to use a typed get when possible, which does hit the cache.

For the executor: I put in the Deployment special case and it works very smoothly. It's slightly annoying to have the special case, but it's well contained and has a strong fallback.

I'll get this all polished into PRs tomorrow morning, but with all these changes the time from start to reconciling all ~700 ScaledObjects at least once goes from 3 hours to 90 seconds, so I think that should help performance a lot.

@coderanger
Contributor Author

Okay, all three issues have PRs open. I split them up for easier review, but all three are important for different aspects of performance :)

@mariusgrigoriu

Running a build with the 3 PRs listed here greatly improves scaling using 1000 ScaledObjects with the Kafka scaler. The creation of HPA objects still takes a long time once there are around 500+ ScaledObjects, but that would probably be another PR.

Thanks to @coderanger for creating these patches. I hope we can see them merged soon.

@tomkerkhove
Member

Thanks for the thorough analysis! I'll leave it up to @zroubalik

@zroubalik
Member

@coderanger that's really awesome. Thanks for the deep research and investigation. Unfortunately, we still don't have a way to test KEDA behavior with that many scalers to spot this kind of problem, so this will help a lot!

When implementing the relevant parts of the code, I first thought about implementing special cases for some resources such as Deployments when doing /scale requests, for the part that manipulates the Unstructured, and so on. But then I thought, nah, it should hit the cache anyway, so it would be better to use a unified approach in the code and not introduce additional complexity... how wrong I was 💩 😄

I'll take a look at your PRs (and thanks for splitting them up; it definitely helps with the review) 🎉

@zroubalik
Member

@coderanger would you be open to testing more tweaks in your large setup? For a long time I've wanted to get some data on setting MaxConcurrentReconciles > 1 in the controller, to see whether there are any benefits.

https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/#controller-configurations
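
For reference, a minimal sketch of what that tweak looks like when wiring the controller; the reconciler parameter, the worker count, and the KEDA API import path are assumptions for illustration.

package main

import (
	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1" // import path assumed; it differs between KEDA versions
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupScaledObjectController registers the ScaledObject controller with
// more than one concurrent reconcile worker.
func setupScaledObjectController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kedav1alpha1.ScaledObject{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}). // assumed value
		Complete(r)
}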

@coderanger
Contributor Author

@zroubalik I can try that, but I don't think it will help. The limiting factor at this point isn't the reconcile time but the client-side API call rate limit. Doing more parallel reconciles will just leave more things blocked waiting in the workqueue :)

@zroubalik
Member

Are you sure the bottleneck is the client-side API call rate limit and not the calls to external services to get the metrics (in your case, RabbitMQ)? Just to clarify, I'm not trying to say you are wrong; I just wanted to double-check ;-)

@mariusgrigoriu

We tested the PRs listed in this issue with 1000 Kafka ScaledObjects. Running all three patches improved the response time for scaling from unusable to roughly equivalent to running 10 operators that each handle 100 Kafka ScaledObjects, which is good.

There are still performance issues with the initial setup of HPAs as the number of ScaledObjects increases; however, that does not affect scaling and shouldn't block the patches. For example, it takes about 2 minutes for the 1001st ScaledObject to set up its HPA.

@zroubalik
Member

@mariusgrigoriu thanks for the info!

There are still performance issues with the initial setup of HPAs as the number of ScaledObjects increases; however, that does not affect scaling and shouldn't block the patches. For example, it takes about 2 minutes for the 1001st ScaledObject to set up its HPA.

You mean that you had 1000 ScaledObjects deployed and ready (HPAs were created), then you deployed ScaledObject number 1001, and the operation for this SO took ~2 minutes?

@mariusgrigoriu

The patches mentioned in this issue address scaling responsiveness. However, they don't address how long it takes for the HPAs to be created. In fairness, that should probably be logged as a separate issue.

The creation of HPAs starts slowing down massively after around 500 ScaledObjects. We had 1000 SOs, added one more, and waited a couple of minutes for it to become ready.

Actual scaling of pods works well with the patches above.

@tomkerkhove
Member

Can you give some more information on this, or has this been resolved in a newer version?

@stale

stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label on Oct 13, 2021
@stale

stale bot commented Oct 20, 2021

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Oct 20, 2021