
Increase operator resiliency to unexpected scaler failures #5622

Open
cyrilico opened this issue Mar 25, 2024 · 7 comments
Labels: feature-request, stale-bot-ignore

Comments

@cyrilico (Contributor)

Proposal

As an (independent) follow-up to #5619, I think it would be interesting to start a discussion on potential improvements to the KEDA operator's resiliency, specifically in the case of unexpected/catastrophic scaler failures. In the linked issue, the problematic query caused an outage in our KEDA operators that prevented all ScaledObjects in the cluster from operating correctly until my team was able to pinpoint the issue and remove that particular scaler configuration. While a quick fix for that specific scaler has been proposed, we worry that similar issues may arise in the future.

While we are not familiar enough with the codebase to immediately suggest potential paths, if you are able to provide some pointers and initial thoughts, we'd love to keep engaging and, if the opportunity arises, provide a contribution down the line 🙏

Use-Case

Higher operator resiliency to scaler failures

Is this a feature you are interested in implementing yourself?

Maybe

Anything else?

No response

@cyrilico added the feature-request and needs-discussion labels on Mar 25, 2024
@JorTurFer (Member)

This is an interesting point. We aim to prevent all panics through code rather than just recovering from them, but maybe we could recover from panics on scaler metric requests:

func (c *ScalersCache) GetMetricsAndActivityForScaler(ctx context.Context, index int, metricName string) ([]external_metrics.ExternalMetricValue, bool, int64, error) {
    if index < 0 || index >= len(c.Scalers) {
        return nil, false, -1, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    startTime := time.Now()
    metric, activity, err := c.Scalers[index].Scaler.GetMetricsAndActivity(ctx, metricName)
    if err == nil {
        return metric, activity, time.Since(startTime).Milliseconds(), nil
    }
    ns, err := c.refreshScaler(ctx, index)
    if err != nil {
        return nil, false, -1, err
    }
    startTime = time.Now()
    metric, activity, err = ns.GetMetricsAndActivity(ctx, metricName)
    return metric, activity, time.Since(startTime).Milliseconds(), err
}

As scalers are the place where most contributions are made, they're also the place with the most unexpected problems, and although I think we should avoid panics, maybe in this case a recover can make sense.
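For illustration, a minimal sketch of how such a recover could be wrapped around the scaler call. The types below are simplified stand-ins for the real KEDA interfaces, and getMetricsSafely is a hypothetical helper name, not existing project code:

package scalerscache

import (
    "context"
    "fmt"
)

// Scaler is a simplified stand-in for KEDA's scaler interface, reduced to the
// single call that the cache invokes when collecting metrics.
type Scaler interface {
    GetMetricsAndActivity(ctx context.Context, metricName string) (value float64, active bool, err error)
}

// getMetricsSafely is a hypothetical wrapper: if a buggy scaler panics while
// computing its metric, the deferred recover converts the panic into an
// ordinary error, so the operator keeps serving the remaining ScaledObjects.
func getMetricsSafely(ctx context.Context, s Scaler, metricName string) (value float64, active bool, err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("recovered panic in scaler while getting metric %q: %v", metricName, r)
        }
    }()
    return s.GetMetricsAndActivity(ctx, metricName)
}

In the real code, the same defer/recover pattern could sit at the top of GetMetricsAndActivityForScaler so that both the initial call and the post-refresh retry are covered.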

On the other hand, we have very few panics because we try to cover all the cases, and we almost achieve it.
WDYT @zroubalik @dttung2905?

@zroubalik (Member)

Yeah, we should avoid panics, I agree.

@JorTurFer (Member)

Are you willing to open a PR with this recover, @cyrilico?

@cyrilico (Contributor, Author)

I'll gladly take a shot at it whenever I get some time 🙏


stale bot commented Jun 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Jun 10, 2024
@zroubalik removed the stale label on Jun 10, 2024

stale bot commented Aug 11, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Aug 11, 2024
@cyrilico (Contributor, Author)

not stale, just haven't had the time yet, go away bot

The stale bot removed the stale label on Aug 11, 2024
@JorTurFer added the stale-bot-ignore label and removed the needs-discussion label on Aug 12, 2024