
Error handling scheduling request #594

Open
brainbug95 opened this issue Jan 10, 2025 · 2 comments

@brainbug95

Describe the bug

I have to schedule ~150 EC2 instances up and down every day in one VPC in one AWS Account. About every 30 days we get an error

ERROR : Error handling scheduling request <...> botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the PutItem operation: Item size has exceeded the maximum allowed size

see scheduler-error.json for a complete log entry.

This stops the scheduler from working for this AWS account, with the effect that the whole environment that should have been started is not, resulting in an outage.

To fix this I need to manually delete the DynamoDB item for this AWS account ID from the solution's state table. Then the scheduler starts working again.
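For anyone hitting the same outage, the manual deletion can be scripted. Everything below is a hypothetical sketch: the table name and key attribute are my assumptions, not the solution's documented schema, so verify both against the deployed StateTable before deleting anything. The actual boto3 call is left commented out so the sketch stays side-effect free.

```python
# Hypothetical sketch of the manual workaround: delete the oversized
# state-table item for the affected account so the scheduler starts fresh.
# TABLE NAME AND KEY ATTRIBUTE ARE ASSUMPTIONS -- check your deployed stack.

STATE_TABLE = "instance-scheduler-StateTable"  # assumed table name
ACCOUNT_ID = "111122223333"                    # the affected AWS account ID


def build_delete_request(table, account_id):
    # Shape of a DynamoDB DeleteItem request for a string partition key.
    return {
        "TableName": table,
        "Key": {"account": {"S": account_id}},  # key attribute is assumed
    }


request = build_delete_request(STATE_TABLE, ACCOUNT_ID)
# To actually run it (after verifying the schema):
# import boto3
# boto3.client("dynamodb").delete_item(**request)
print(request)
```

After the item is gone, the next scheduler run recreates it with only the currently existing instances.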

To Reproduce

I have seen this behavior only in two of our AWS accounts where Kubernetes Clusters with a highly volatile number of EC2 instances are deployed. So I am guessing it might have something to do with the constantly changing number of instances and states the scheduler has to track.

Please complete the following information about the solution:

  • Version: v3.0.6
  • Region: eu-central-1
  • Was the solution modified from the version published on this repository? No
  • Have you checked your service quotas for
    the services this solution uses? Yes, and I know there is a limit on the PutItem size for DynamoDB, but the solution should account for that
  • Were there any errors in the CloudWatch Logs?
    Yes, errors occur in the solution's CloudWatch Logs (see attached file)

Appreciate any help on this. We saw this issue with versions < 3 of the scheduler solution as well, and I had hoped it would be solved in v3, but clearly it is not.

@brainbug95 brainbug95 added the bug label Jan 10, 2025
@CrypticCabub
Member

Hi @brainbug95

Thanks for reporting this issue. I believe the root cause is that the Kubernetes clusters are constantly creating new EC2 instances rather than starting/stopping existing ones. The state table logic does not currently attempt to detect when a previously scheduled instance no longer exists, which results in the constantly changing instance IDs piling up in the state table item until the maximum size is exceeded.
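The pile-up described above can be made concrete with a toy simulation (illustrative only, not the solution's actual code): if a cluster replaces roughly 150 nodes a day with fresh instance IDs and nothing is ever pruned, a single JSON-serialized state map crosses DynamoDB's hard 400 KB per-item limit within weeks to months, depending on how large each stored entry is.

```python
# Illustrative simulation of how churned instance IDs can grow a single
# DynamoDB item past the 400 KB limit. Entry format and sizes are made up;
# real state-table entries are larger, so the limit is hit sooner.
import json
import uuid

DYNAMODB_MAX_ITEM_BYTES = 400 * 1024  # hard DynamoDB per-item limit


def item_size(state):
    # Rough proxy for stored item size: bytes of the JSON-serialized map.
    return len(json.dumps(state).encode("utf-8"))


state = {}           # instance_id -> last recorded state, never pruned
nodes_per_day = 150  # roughly the reporter's fleet size
day = 0
while item_size(state) <= DYNAMODB_MAX_ITEM_BYTES:
    day += 1
    # A Kubernetes cluster replaces nodes rather than restarting them,
    # so each day contributes ~150 brand-new instance IDs to the item.
    for _ in range(nodes_per_day):
        fake_id = "i-" + uuid.uuid4().hex[:17]
        state[fake_id] = "stopped"

print(f"item exceeds 400 KB after {day} simulated days "
      f"({len(state)} tracked instance IDs)")
```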

So you would be correct that the Kubernetes clusters are what's causing this issue. Could I ask for a little more detail on how these clusters are set up for scheduling and what you are trying to achieve? This state table scenario is on our backlog to fix, but I am also wondering if the "actual" issue here is that we don't support scheduling Kubernetes clusters correctly.
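A minimal sketch of the kind of fix described as backlogged here (my own illustration, not the project's code): before writing the state item back, drop entries whose instance IDs no longer exist, e.g. by reconciling against ec2:DescribeInstances. In this sketch the list of live IDs is plain test data rather than a real API call.

```python
# Sketch: prune state-table entries for instances that no longer exist.
# In practice `existing_instance_ids` would come from ec2:DescribeInstances;
# here it is hard-coded test data.


def prune_stale_entries(state, existing_instance_ids):
    """Keep only entries whose instance ID still exists."""
    existing = set(existing_instance_ids)
    return {iid: s for iid, s in state.items() if iid in existing}


state = {
    "i-0aaa111": "stopped",  # still exists
    "i-0bbb222": "running",  # terminated by the cluster autoscaler
    "i-0ccc333": "stopped",  # terminated by the cluster autoscaler
}
state = prune_stale_entries(state, ["i-0aaa111", "i-0ddd444"])
print(state)  # only the surviving instance remains
```

With pruning on every scheduling cycle, the item size tracks the live fleet instead of growing without bound.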

@brainbug95
Author

Hi @CrypticCabub

Thanks for your insight on this. The AWS accounts having this problem host a mix of self-hosted OpenShift Kubernetes clusters and other supporting EC2 instances. We are using the scheduler to stop and start all OCP workload nodes so they are available only during business hours instead of running 24/7. In addition, the OCP clusters themselves are also capable of creating and terminating EC2 instances.

EKS clusters are also involved but those nodes are not tagged so the scheduler should not "see" them.

I'm happy to provide further info if required.
