-
Notifications
You must be signed in to change notification settings - Fork 708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: flyteadmin doesn't shutdown servers gracefully #6289
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ryan Lo <[email protected]>
Signed-off-by: Ryan Lo <[email protected]>
Signed-off-by: Ryan Lo <[email protected]>
Code Review Agent Run Status
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6289 +/- ##
==========================================
+ Coverage 50.37% 58.47% +8.09%
==========================================
Files 1170 937 -233
Lines 92851 71141 -21710
==========================================
- Hits 46774 41599 -5175
+ Misses 41931 26389 -15542
+ Partials 4146 3153 -993
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@lowc1012 , is this ready for review? |
Code Review Agent Run #55c786Actionable Suggestions - 3
Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
|
Changelist by BitoThis pull request implements the following key changes.
|
go func() { | ||
err = srv.Serve(tls.NewListener(conn, srv.TLSConfig)) | ||
if err != nil && err != http.ErrServerClosed { | ||
logger.Errorf(ctx, "Failed to start HTTP/2 Server: %v", err) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTTP/2 server is now started in a goroutine, but the function immediately proceeds to wait for shutdown signals. This could lead to a race condition where the shutdown sequence begins before the server is fully initialized. Consider adding a small delay or a readiness check before proceeding to the shutdown logic.
Code suggestion
Check the AI-generated fix before applying
@@ -558,6 +558,13 @@
go func() {
err = srv.Serve(tls.NewListener(conn, srv.TLSConfig))
if err != nil && err != http.ErrServerClosed {
logger.Errorf(ctx, "Failed to start HTTP/2 Server: %v", err)
}
}()
+
+ // Give the server a moment to start before proceeding to shutdown logic
+ time.Sleep(100 * time.Millisecond)
+
+ // Log that the server has started
+ logger.Infof(ctx, "HTTP/2 Server started successfully on %s", cfg.GetHostAddress())
Code Review Run #55c786
Should Bito avoid suggestions like this for future reviews? (Manage Rules)
- Yes, avoid them
Signed-off-by: Ryan Lo <[email protected]>
Signed-off-by: Ryan Lo <[email protected]>
Code Review Agent Run #e9c2cdActionable Suggestions - 0Review Details
|
@eapolinario please help review it, thank you |
I was the original one that filed this issue for myself to take a look at later so I can review it a bit. |
}() | ||
|
||
// Gracefully shutdown the servers | ||
sigCh := make(chan os.Signal, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have two signal listeners? Should just be able to use one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean just SIGINT
or SIGTERM
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that in this diff you have two signal channels (one for grpc and one for http). You only need one to know whether or not the app is shutting down.
flyteadmin/pkg/server/service.go
Outdated
grpcServer.GracefulStop() | ||
grpcCancel() | ||
}() | ||
<-grpcShutdownCtx.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for graceful shutdown doesn't necessarily mean that it completed successful graceful shutdown. We should follow up with a more forceful stop per the documentation/examples: https://github.com/grpc/grpc-go/blob/master/examples/features/gracefulstop/server/main.go#L93-L99
Part of my thinking when handling graceful shutdown was not just the servers but also draining any buffers, specifically for cloud events so that we minimize notifications that are dropped because they are still buffered in memory. |
I think the buffers might have been in my company's custom fork but it would still be good to look for anything else that can be drained or shutdown via context after the servers are shutdown. For example, there are a number of things that get kicked off in goroutines here liked the scheduled execution executor: https://github.com/lowc1012/flyte/blob/master/flyteadmin/pkg/rpc/adminservice/base.go |
Signed-off-by: Ryan Lo <[email protected]>
2e4252e
to
cdb0383
Compare
Code Review Agent Run #138eb1Actionable Suggestions - 0Review Details
|
flyteadmin/pkg/server/service.go
Outdated
time.Sleep(1 * time.Second) | ||
|
||
// force to shut down servers after 10 seconds | ||
timer := time.AfterFunc(10 * time.Second, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to make this timeout configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like configure the timeout in flyteadmin_config.yaml?
server:
forceShutdownTimeoutSec: 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Something along those lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Signed-off-by: Ryan Lo <[email protected]>
Signed-off-by: Ryan Lo <[email protected]>
Code Review Agent Run #4c1ddaActionable Suggestions - 0Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
|
Tracking issue
#6274
Why are the changes needed?
To handle SIGTERM properly, ensure cleanup of resources and provide smooth user experience
What changes were proposed in this pull request?
Implement graceful shutdown for flyteadmin servers
How was this patch tested?
Labels
Please add one or more of the following labels to categorize your PR:
This is important to improve the readability of release notes.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link
Summary by Bito
This PR implements graceful shutdown mechanisms for flyteadmin servers by introducing configurable timeout settings (replacing hardcoded 10-second values), adding signal handling and timeout-based contexts. It enhances error handling in gRPC server startup, updates configuration files with command-line flag support, and ensures proper termination of both HTTP and gRPC servers to prevent resource leaks and improve service reliability.Unit tests added: False
Estimated effort to review (1-5, lower is better): 1