Performance Degradation while scaling out a large number of Deployments, 700 < N < 1250 #6063
Comments
@jeevantpant check if this helps; we were having a similar issue with scale as well.
Hello,
For the parallel topic, I'd suggest increasing KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES from its current value of 5 to, say, 20 (and check whether it improves or solves the issue; if it only improves, increase it further) -> https://keda.sh/docs/2.15/operate/cluster/#configure-maxconcurrentreconciles-for-controllers. This allows more parallel reconciliation of ScaledObjects (if that is the bottleneck). For the Kubernetes client throttling, you can increase these other parameters -> https://keda.sh/docs/2.15/operate/cluster/#kubernetes-client-parameters
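For reference, here is a minimal sketch of what those settings could look like on the keda-operator Deployment. The flag and environment-variable names come from the KEDA docs linked above; the container name and the concrete values (20/60/90) are illustrative assumptions to adapt to your cluster:

```yaml
# Sketch only: relevant parts of the keda-operator Deployment spec after raising
# controller concurrency and the Kubernetes client rate limits.
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          args:                       # keep any existing args alongside these
            - --kube-api-qps=60       # client QPS to the API server (default 20)
            - --kube-api-burst=90     # client burst to the API server (default 30)
          env:
            - name: KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES
              value: "20"             # parallel ScaledObject reconciles (default 5)
```

If you install KEDA via Helm, the same flags and env var can usually be passed through the chart's values instead of editing the Deployment directly.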
There have also been some improvements related to status handling, so upgrading to v2.15 could improve performance, as it significantly reduces the calls to the API server in some cases (if that is the root cause in your case).
Thanks so much @JorTurFer for such insightful suggestions and options to try out. They appear to have solved the scale-out issue we were facing with our deployments. I wanted to post our observations and findings from trying all of the suggestions above.
One final question on the above configuration, @JorTurFer, if you could please help us with it:
a) Do you think the following values for the kube client parameters [ kube-api-qps: 60 / kube-api-burst: 90 ] pose any risk on a busier cluster, where there is significantly more traffic to the kube-api-server? b) Have you ever used values this high in a live setup, or seen numbers this high for these parameters cause any issues or stress in your experience?
Hello
About using values that high: I know of clusters configured with 600/900 (and even more in one case). It depends on the scaler topology, the amount of failures, etc. I think that in a cluster that already has 1k ScaledObjects, the control plane should be big enough to handle those requests (but monitoring is always a good idea).
Hello @JorTurFer, can you describe in more detail what specific keyword or string we need to search for when looking at the throttling messages in the KEDA operator? Also, is there a way to control the log level for the KEDA pods so that we only get error logs? The lower log volume would help us watch for errors more accurately.
Are you using Helm to deploy KEDA? If yes, you can configure the log level via the chart's logging values. About the message, you can just look for the client-side throttling messages in the operator logs.
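As an illustration, a minimal values sketch for the kedacore/keda Helm chart; the `logging.operator.level` key is taken from the chart's documented values, so double-check it against your chart version:

```yaml
# values.yaml sketch: reduce KEDA operator log volume to errors only.
logging:
  operator:
    level: error   # accepted levels are typically debug, info, error
```

Applied with something like `helm upgrade keda kedacore/keda -n keda -f values.yaml`.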
Report
We observe performance degradation while scaling out a large number of deployments, say N, together via KEDA. We tested scaling behavior for N = 100, 200, 500, 1000, 1500, and 2000 ScaledObjects.
We expect KEDA to scale deployment replicas from 0 to 2 during the activation window.
- In the testing below, only the CRON scaler is used, to observe scaling performance from 0 to a constant desired replica count and back.
- We notice that when the number of ScaledObjects N is 700 < N < 1250, it takes a significant amount of time, approximately 2.5 hours, for all target deployments to reach the desired replica count (the 1->2 scaling step alone).
- We see that KEDA takes ~5 minutes to activate all ScaledObjects and bring all deployments from 0->1 replicas, but KEDA/HPA takes a long time to scale the replicas from 1->2.
NOTE:
- We have ensured that we have enough compute and all resource quotas in surplus, so this is not a resource crunch.
- We have validated that when N = 1500 or even 2000, all deployments are able to scale up within ~14-15 minutes, which is expected considering node scale-up and pods going to the Running state.
- We only see this anomaly when the number of ScaledObjects and deployments is between 700 and 1250.
Expected Behavior
Actual Behavior
Steps to Reproduce the Problem
1. Create the ScaledObject (`Scaleobject.yaml`) and the target `Deployment.yaml` for each workload; a hedged example ScaledObject is sketched after this list.
2. Create N ScaledObjects/Deployments, in this case N = 1050. (We saw that any value between 700 and 1250 showed this behavior and can be used to reproduce this bug.)
3. Make sure there is no resource crunch while scaling: ensure enough compute for all 1050 deployments to scale up to 2 replicas each (worker nodes and surplus namespace resource quota).
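For reference, a minimal sketch of the kind of ScaledObject used in this test. Resource names are illustrative assumptions; the trigger fields follow the KEDA cron scaler and the activation window described under the logs section below:

```yaml
# Sketch: one of the N ScaledObjects, scaling its Deployment 0 -> 2 inside the window.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-scaledobject-1          # hypothetical name
spec:
  scaleTargetRef:
    name: test-deployment-1          # one of the N target Deployments (hypothetical)
  minReplicaCount: 0                 # scale to zero outside the window
  maxReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Kolkata       # matches the +05:30 timestamps in the report
        start: 0 14 * * *            # window start 14:00
        end: 0 19 * * *              # window end 19:00
        desiredReplicas: "2"
```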
Logs from KEDA operator
CRON window timing:
- Start: 2024-08-05T14:00:00.000+05:30
- End: 2024-08-05T19:00:00.000+05:30
KEDA Version
2.13.1
Kubernetes Version
1.28
Platform
Amazon Web Services
Scaler Details
CRON
Anything else?
No response