
Error handling scheduling request #594

Open
brainbug95 opened this issue Jan 10, 2025 · 2 comments

@brainbug95

Describe the bug

I have to schedule ~150 EC2 instances up and down every day in one VPC in one AWS Account. About every 30 days we get an error

ERROR : Error handling scheduling request <...> botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the PutItem operation: Item size has exceeded the maximum allowed size

see scheduler-error.json for a complete log entry.

This stops the scheduler from working for this AWS account, with the effect that the whole environment that should have been started is not, resulting in an outage.

To fix this I need to manually delete the DynamoDB item for this AWS account ID from the solution's state table. Then the scheduler starts working again.
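For anyone hitting the same outage, the manual deletion can be scripted. Everything below is a hypothetical sketch: the table name and key attribute are my assumptions, not the solution's documented schema, so verify both against the deployed StateTable before deleting anything. The actual boto3 call is left commented out so the sketch stays side-effect free.

```python
# Hypothetical sketch of the manual workaround: delete the oversized
# state-table item for the affected account so the scheduler starts fresh.
# TABLE NAME AND KEY ATTRIBUTE ARE ASSUMPTIONS -- check your deployed stack.

STATE_TABLE = "instance-scheduler-StateTable"  # assumed table name
ACCOUNT_ID = "111122223333"                    # the affected AWS account ID


def build_delete_request(table, account_id):
    # Shape of a DynamoDB DeleteItem request for a string partition key.
    return {
        "TableName": table,
        "Key": {"account": {"S": account_id}},  # key attribute is assumed
    }


request = build_delete_request(STATE_TABLE, ACCOUNT_ID)
# To actually run it (after verifying the schema):
# import boto3
# boto3.client("dynamodb").delete_item(**request)
print(request)
```

After the item is gone, the next scheduler run recreates it with only the currently existing instances.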

To Reproduce

I have seen this behavior only in two of our AWS accounts where Kubernetes Clusters with a highly volatile number of EC2 instances are deployed. So I am guessing it might have something to do with the constantly changing number of instances and states the scheduler has to track.

Please complete the following information about the solution:

  • Version: v3.0.6
  • Region: eu-central-1
  • Was the solution modified from the version published on this repository? No
  • Have you checked your service quotas for
    the services this solution uses? Yes, and I know there is a limit on the PutItem size for DynamoDB, but the solution should account for that
  • Were there any errors in the CloudWatch Logs?
    Yes, errors occur in the solution's CloudWatch Logs (see attached file)

Appreciate any help on this. We saw this issue with versions < 3 of the scheduler solution as well, and I had hoped it would be solved in v3, but clearly it is not.

@brainbug95 brainbug95 added the bug label Jan 10, 2025
@CrypticCabub
Member

Hi @brainbug95

Thanks for reporting this issue. I believe the root cause is that the Kubernetes clusters are constantly creating new EC2 instances rather than starting/stopping existing ones. The state table logic does not currently attempt to detect when a previously scheduled instance no longer exists, which results in the constantly changing instance IDs piling up in the state table item until the maximum size is exceeded.
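The pile-up described above can be made concrete with a toy simulation (illustrative only, not the solution's actual code): if a cluster replaces roughly 150 nodes a day with fresh instance IDs and nothing is ever pruned, a single JSON-serialized state map crosses DynamoDB's hard 400 KB per-item limit within weeks to months, depending on how large each stored entry is.

```python
# Illustrative simulation of how churned instance IDs can grow a single
# DynamoDB item past the 400 KB limit. Entry format and sizes are made up;
# real state-table entries are larger, so the limit is hit sooner.
import json
import uuid

DYNAMODB_MAX_ITEM_BYTES = 400 * 1024  # hard DynamoDB per-item limit


def item_size(state):
    # Rough proxy for stored item size: bytes of the JSON-serialized map.
    return len(json.dumps(state).encode("utf-8"))


state = {}           # instance_id -> last recorded state, never pruned
nodes_per_day = 150  # roughly the reporter's fleet size
day = 0
while item_size(state) <= DYNAMODB_MAX_ITEM_BYTES:
    day += 1
    # A Kubernetes cluster replaces nodes rather than restarting them,
    # so each day contributes ~150 brand-new instance IDs to the item.
    for _ in range(nodes_per_day):
        fake_id = "i-" + uuid.uuid4().hex[:17]
        state[fake_id] = "stopped"

print(f"item exceeds 400 KB after {day} simulated days "
      f"({len(state)} tracked instance IDs)")
```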

So you would be correct that the Kubernetes clusters are what's causing this issue. Could I ask for a little more detail on how these clusters are set up for scheduling and what you are trying to achieve? This state table scenario is on our backlog to fix, but I am also wondering if the "actual" issue here is that we don't support scheduling Kubernetes clusters correctly.
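A minimal sketch of the kind of fix described as backlogged here (my own illustration, not the project's code): before writing the state item back, drop entries whose instance IDs no longer exist, e.g. by reconciling against ec2:DescribeInstances. In this sketch the list of live IDs is plain test data rather than a real API call.

```python
# Sketch: prune state-table entries for instances that no longer exist.
# In practice `existing_instance_ids` would come from ec2:DescribeInstances;
# here it is hard-coded test data.


def prune_stale_entries(state, existing_instance_ids):
    """Keep only entries whose instance ID still exists."""
    existing = set(existing_instance_ids)
    return {iid: s for iid, s in state.items() if iid in existing}


state = {
    "i-0aaa111": "stopped",  # still exists
    "i-0bbb222": "running",  # terminated by the cluster autoscaler
    "i-0ccc333": "stopped",  # terminated by the cluster autoscaler
}
state = prune_stale_entries(state, ["i-0aaa111", "i-0ddd444"])
print(state)  # only the surviving instance remains
```

With pruning on every scheduling cycle, the item size tracks the live fleet instead of growing without bound.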

@brainbug95
Author

Hi @CrypticCabub

Thanks for your insight on this. The AWS accounts having this problem host a mix of self-hosted OpenShift Kubernetes clusters and other supporting EC2 instances. We are using the scheduler to stop and start all OCP workload nodes so they are available only during business hours instead of running 24/7. In addition, the OCP clusters themselves are also capable of creating and terminating EC2 instances.

EKS clusters are also involved but those nodes are not tagged so the scheduler should not "see" them.

I'm happy to provide further info if required.
