
Adding Scaling Strategy for ScaledJob #1227

Merged 2 commits into v2 on Oct 8, 2020

Conversation

@TsuyoshiUshio (Contributor) commented on Oct 7, 2020

I talked with several customers using ScaledJob. Their use cases are quite different, and so are their requirements for the scaling logic. I also found limitations in many scalers. Taking all of this into account, I'd like to introduce new scaling logic.

We can probably also discuss the custom scaling logic, which I describe later.

What do we need to know?

1. ScaledJob scaling expectation

HPA-based scaling is about scaling in and out; ScaledJob, however, has no scale-in/out. If new messages arrive in the queue, it should create new jobs until it reaches maxReplicaCount.

2. Limitation of the queueLength implementation

Scalers use various SDKs, and their behavior differs. When we fetch the queueLength, it usually includes locked messages, and there is no way to fetch only the number of messages that are not locked.
The ideal metric is the number of messages that are not locked. Why? The ideal scale target is the number of newly arrived messages. Several jobs may already be running, but we can ignore them until we reach maxReplicaCount; that is the ideal behavior. To count newly arrived messages, the metric should exclude locked messages, i.e. messages already consumed by running jobs. However, many SDK implementations include locked messages in the queueLength, and that causes an issue for ScaledJob.

The kinds of scalers

a. Some scalers can fetch the queueLength as the number of all messages minus locked messages, i.e. the ideal metric (e.g. Azure Storage Queue).
b. Others cannot (e.g. Azure Service Bus); they return queueLength as the number of all messages.

We need to accommodate both use cases. Case a is easy: the metric is already ideal, so we can have one firm scaling logic for ScaledJob.
In case b, however, we need to compromise on the scaling logic, and it may differ considerably between use cases.

I introduce three scaling strategies: default, custom, and accurate. For case a, I provide accurate. For case b, I provide default and custom. If you don't specify anything, default is used.

The logic of three strategies

  • queueLength: the length of messages on a queue
  • runningJobCount: the number of running jobs

default

The original logic of ScaledJob. It is the default so that existing deployments see no breaking change.

queueLength - runningJobCount
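
A minimal Go sketch of this calculation, for illustration only (the function name and the clamping to [0, maxReplicaCount] are assumptions, not a quote of the actual scale_jobs.go code):

// defaultScaleTo sketches the default strategy: scale to the number of
// queued messages minus the jobs already running, clamped so it never
// goes negative or exceeds maxReplicaCount. Illustrative only.
func defaultScaleTo(queueLength, runningJobCount, maxReplicaCount int64) int64 {
	target := queueLength - runningJobCount
	if target < 0 {
		target = 0
	}
	if target > maxReplicaCount {
		target = maxReplicaCount
	}
	return target
}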

custom

You can configure customScalingQueueLengthDeduction and customScalingRunningJobPercentage on the ScaledJob definition. Users can tune these parameters to adjust the default logic.

queueLength - customScalingQueueLengthDeduction - (runningJobCount * customScalingRunningJobPercentage)

However, the result doesn't exceed maxReplicaCount.
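
A hedged Go sketch of the custom calculation, under the same assumptions as the default sketch above (customScalingRunningJobPercentage is a string in the example configuration below, so it would be parsed to a float first):

// customScaleTo sketches the custom strategy: subtract a fixed deduction and
// a percentage of the running jobs from the queue length, then clamp the
// result to [0, maxReplicaCount]. Illustrative only.
func customScaleTo(queueLength, runningJobCount, maxReplicaCount, deduction int64, runningJobPercentage float64) int64 {
	target := queueLength - deduction - int64(float64(runningJobCount)*runningJobPercentage)
	if target < 0 {
		target = 0
	}
	if target > maxReplicaCount {
		target = maxReplicaCount
	}
	return target
}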

accurate

If the scaler returns the ideal metric, or your job deletes each message as soon as it is consumed, the queue length is already the ideal number and you can use this strategy.

	if (maxScale + runningJobCount) > maxReplicaCount {
		return maxReplicaCount - runningJobCount
	}
	return queueLength
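
Read as a self-contained sketch, with maxScale taken to be the queue-derived value (a later comment in this thread notes that maxScale and queueLength are meant to be the same thing), this is roughly:

// accurateScaleTo sketches the accurate strategy: create one job per queued
// message, but never let running plus new jobs exceed maxReplicaCount.
// Illustrative only, not the actual scale_jobs.go code.
func accurateScaleTo(queueLength, runningJobCount, maxReplicaCount int64) int64 {
	maxScale := queueLength
	if maxScale+runningJobCount > maxReplicaCount {
		return maxReplicaCount - runningJobCount
	}
	return maxScale
}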

Example Configuration

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azure-servicebus-queue-consumer
  namespace: default
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: azure-servicebus-queue-receive
          image: tsuyoshiushio/azure-servicebus-queue-client:dev
          imagePullPolicy: Always
          command: ["receive"]
          envFrom:
            - secretRef:
                name: azure-servicebus-queue-secret
        restartPolicy: Never
    backoffLimit: 4  
  pollingInterval: 5   # Optional. Default: 30 seconds
  cooldownPeriod: 30   # Optional. Default: 300 seconds
  maxReplicaCount: 100  # Optional. Default: 100
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  scalingStrategy: "custom"
  customScalingQueueLengthDeduction: 1
  customScalingRunningJobPercentage: "0.5"

Checklist

Fixes #1186, #1207

Documentation

kedacore/keda-docs#279

Signed-off-by: Tsuyoshi Ushio <[email protected]>
@zroubalik (Member) left a comment:

Looking good, minor comments

Review threads on pkg/scaling/executor/scale_jobs.go and api/v1alpha1/scaledjob_types.go (outdated, resolved)
@TsuyoshiUshio (Contributor, Author) replied:

Hi @zroubalik,
I've accepted all of your suggestions.

@zroubalik (Member) left a comment:

LGTM, thanks @TsuyoshiUshio!

@zroubalik merged commit 9433bc1 into v2 on Oct 8, 2020
@zroubalik deleted the tsushi/customlogic branch on October 8, 2020 at 07:31
silenceper pushed a commit to silenceper/keda that referenced this pull request Oct 10, 2020
@orsher commented on Oct 28, 2020

Hey,

I tried using the "accurate" strategy, but because it takes pods a long time to actually get started and consume a message, every polling interval the trigger creates new pods for messages that pods were already created for.
In our case we use the cluster autoscaler and start with 0 nodes, which makes the first few pods wait a few minutes before even starting and consuming a message, but I think there are lots of other situations where pods won't necessarily get started between polls. A cap on the number of running pods, for example, would cause the same issue.

Is that the expected behaviour? Am I missing something here?

If that's the way it works, distinguishing somehow between runningJobs and waitingJobs could help calculate the right number of new pods to schedule.

@zroubalik (Member):

@TsuyoshiUshio PTAL

@TsuyoshiUshio (Contributor, Author) replied:
Hi @orsher,
The accurate strategy assumes that the queue length does not include running jobs.
Whether that holds depends on the scaler you choose and on how the job is implemented.

For example, some customers use this strategy with Service Bus, whose scaler reports a queue length that includes running jobs.
However, their jobs delete each message from the queue as soon as it is consumed, so the expected scale equals the queue length, and that queue length does not include running jobs.

Which scaler do you use? If your scaler's queue length includes the running jobs, and your job consumes a message, does its work, and only deletes the message from the queue after finishing, then you can use the default strategy.
If it is neither case, you can use the custom strategy.

However, we need to learn about more use cases. If your use case is not what I described, could you share which scaler you use and how your job consumes messages? I'd be happy to hear about it and reflect it in KEDA.

@orsher commented on Nov 8, 2020

Hey @TsuyoshiUshio,
Sure.
We're using RabbitMQ, and we delete the message right after it is consumed, and only after processing the task itself.
So, according to the docs and what you just described, at least if I got it right, the accurate strategy should be a good fit here.
I'll try to describe our use case and what we are experiencing using an example:
time 0 - a producer sends x messages to RabbitMQ
time 1 - the KEDA RabbitMQ trigger fires for the first time and schedules x jobs
time 2 - the jobs wait to be scheduled by Kubernetes, which might take some time depending on the currently available resources
time 3 - the KEDA RabbitMQ trigger fires for the second time and schedules another x jobs, because none of the previously submitted jobs have been scheduled and consumed a message

We end up with many more jobs than we need, and KEDA keeps creating more until there are enough resources for the jobs to actually start running, consume, and ack the messages.

I hope this flow is clear.

Am I missing something here?

@fjmacagno commented on Nov 13, 2020

I think there is a typo in this documentation that has been confusing me:

in

if (maxScale + runningJobCount) > maxReplicaCount {
	return maxReplicaCount - runningJobCount
}
return queueLength

maxScale and queueLength should be the same thing.

@orsher my plan to deal with that is to increase the poll interval and write my consumers to shut themselves down if there is no message to consume. But I agree, it should take pending jobs into account when calculating the new scale, probably something like

if ((queueLength - scheduledJobCount) + runningJobCount) > maxReplicaCount {
	return maxReplicaCount - runningJobCount
}
return (queueLength - scheduledJobCount)

which could be simplified to

return min(queueLength - scheduledJobCount, maxReplicaCount - runningJobCount)
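
For concreteness, a small Go sketch of that simplification; scheduledJobCount is hypothetical here, i.e. a count of jobs created but not yet running that KEDA would have to start tracking:

// proposedScaleTo sketches the suggestion above: only create jobs for messages
// that don't already have a pending (scheduled-but-not-running) job, and never
// exceed the remaining replica capacity. scheduledJobCount is a hypothetical
// input, not a value KEDA exposed at the time of this discussion.
func proposedScaleTo(queueLength, scheduledJobCount, runningJobCount, maxReplicaCount int64) int64 {
	pending := queueLength - scheduledJobCount
	capacity := maxReplicaCount - runningJobCount
	if pending < capacity {
		return pending
	}
	return capacity
}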

@audunsol commented on Nov 17, 2020

I see the exact same problem as @orsher using Azure Storage Queues. It is kind of independent of the queue type and strategy used right now, as it is a problem with an inner fast control loop scaling up the number of jobs a lot faster than our outer control loops making resources available (scaling out virtual machines/nodes in our case).

We have already made our jobs do some queue message polling 10 times, then shut down if there are no messages, so the extra jobs are not a big problem from that standpoint.

But since some of our jobs are very resource intensive and long running (which is kinda the point of using Jobs instead of Deployments for us), we have to put some high resource.request values for memory and CPU on our jobs, which in turn makes our nodes exhausted quickly and then requires the cluster autoscaler to respond by starting new machines (may take a minute or two). This in turn makes KEDA scale up more jobs (since the queue is still not drained within our poll interval), which in turn schedules more high-demand jobs, which in turn requests more nodes. At the end, in the worst cases, we have spun up a lot of new nodes that end up running some very simple "poll 10 times then shut down" job (since there are a lot more jobs than queue messages). So we are starting and stopping a lot of machines as it is now. Also, we then get the problem of scaling down nodes when some of them are doing long running jobs.

There are of course a lot of variables we can tweak, like the polling interval, but we would like to start our long running jobs as fast as possible (i.e. having a short poll interval), and have a design that allows some slower startup only when we have bursty requests.

Taking scheduledJobCount into account would probably solve most of this for us.


Edit: I think this issue already covers the problem: #1323

@TsuyoshiUshio (Contributor, Author) commented on Nov 19, 2020

Sorry for the inconvenience. I'll have a look; however, we should discuss this in #1323 rather than on a merged PR.
In this case, which strategy (or parameters) would be useful for you?
I created the accurate strategy based on a customer's issue, and it fits their case (I thought it might be needed by others as well).
In your case, the containers take time to start. Several people have already reported this, so I need to do something about it. What @fjmacagno describes is very interesting; my own workaround is the same as you mentioned (a longer polling interval, and having the container quit once the message count reaches 0). Let me investigate the scheduledJobCount idea. Thank you for the feedback!

thomas-lamure added a commit to thomas-lamure/keda-charts that referenced this pull request Dec 2, 2020
tomkerkhove pushed a commit to kedacore/charts that referenced this pull request Dec 2, 2020
@ChayanBansal commented:
(quoting @TsuyoshiUshio's reply to @orsher above)

Hey @TsuyoshiUshio,

I am using Azure Queue Storage, and from what I understand it gives the count of all messages (invisible/running and visible/non-running). You mentioned that in such cases we should use the 'default' strategy, but the official docs recommend the 'accurate' strategy for Azure Queue Storage.
Could you please clarify?

[screenshot of the docs omitted]
