-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: plugin grpc server retry method #1096
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: googs1025 <[email protected]>
friendly ping @elezar /PTAL thanks |
@cdesiniotis can you help this ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're changing this anyway. Let's use the built-in types.
@@ -183,11 +183,13 @@ func (plugin *nvidiaDevicePlugin) Serve() error { | |||
go func() { | |||
lastCrashTime := time.Now() | |||
restartCount := 0 | |||
maxRestarts := 5 | |||
crashTimeoutSeconds := float64(3600) // 1 hour in seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crashTimeoutSeconds := float64(3600) // 1 hour in seconds | |
crashTimeout := 1 * time.Hour |
@@ -198,17 +200,29 @@ func (plugin *nvidiaDevicePlugin) Serve() error { | |||
break | |||
} | |||
|
|||
klog.Infof("GRPC server for '%s' crashed with error: %v", plugin.rm.Resource(), err) | |||
|
|||
timeSinceLastCrash := time.Since(lastCrashTime).Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeSinceLastCrash := time.Since(lastCrashTime).Seconds() | |
timeSinceLastCrash := time.Since(lastCrashTime) |
timeSinceLastCrash, | ||
) | ||
|
||
if timeSinceLastCrash > crashTimeoutSeconds { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if timeSinceLastCrash > crashTimeoutSeconds { | |
if timeSinceLastCrash > crashTimeout { |
plugin.rm.Resource(), | ||
err, | ||
restartCount+1, | ||
timeSinceLastCrash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need:
timeSinceLastCrash, | |
timeSinceLastCrash.Seconds(), |
or need to adjust the format string.
// add a small delay before restarting to prevent tight loops | ||
retryDelay := 5 * time.Second | ||
klog.Infof("Waiting for %v before attempting to restart GRPC server for '%s'", retryDelay, plugin.rm.Resource()) | ||
time.Sleep(retryDelay) // Wait for 5 seconds before attempting to restart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time.Sleep(retryDelay) // Wait for 5 seconds before attempting to restart | |
time.Sleep(retryDelay) |
// it has been one hour since the last crash.. reset the count | ||
// to reflect on the frequency | ||
restartCount = 0 | ||
} else { | ||
restartCount++ | ||
} | ||
|
||
// add a small delay before restarting to prevent tight loops |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that you chose 5 seconds specifically? This seems quite long.
// it has been one hour since the last crash.. reset the count | ||
// to reflect on the frequency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rather just log the crashTimeout
instead of a comment that is now further from the place where the one hour is specified.
// quite if it has been restarted too often | ||
// i.e. if server has crashed more than 5 times and it didn't last more than one hour each time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be moved to the docstring or at the point where we declare the variables.
To improve observability and facilitate debugging, this PR enhances logging after the GRPC server crashes