-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async nginx config reload #70
Conversation
… go routine and return the result of the config reply in the Dataplane status.
Maybe rename the PR to async reload |
This makes it sound like the existing synchronous-with-timeout logic is replaced with an async operation. But I've also heard that this is not a breaking change. Please clarify if the previous behaviour is preserved? |
yeah the old behaviour/workflow is upheld, it is only the failure scenarios are more "comprehensive" beyond the timeout period of the DPM on NMS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't look at the test. Can you ensure each config apply action only has one response.
src/plugins/nginx.go
Outdated
succeeded: succeeded, | ||
correlationId: cmd.Meta.MessageId, | ||
timestamp: types.TimestampNow(), | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, use a named function with passed in params for maintainability.
src/plugins/nginx.go
Outdated
} | ||
} | ||
|
||
func (n *Nginx) completeConfigApply(response *NginxConfigValidationResponse) *proto.AgentActivityStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kind of hard to follow because of async, but i believe we could end up with two responses due to the race between: 1) the non-async/inline response 2) the message handling.
So testing this build, i'm seeing failed to create response message: unknown action: UNKNOWN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yeah I can see where the completeConfigApply function could be called twice alright. I'll put a fix in for that, but you still should just get back one NginxConfigResponse. I'll double check that there in the morning and get back to you on it
case core.NginxConfigValidationFailed: | ||
switch response := message.Data().(type) { | ||
case *NginxConfigValidationResponse: | ||
n.rollbackConfigApply(response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same potential race between rollback message handling, and inline.
…ng executed twice
FeatureActivityEvents = FeaturesKey + KeyDelimiter + "activity-events" | ||
) | ||
|
||
func GetDefaultFeatures() []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the server needs to know the configured set of features that the agent supports. I think we have the following options:
- agent sends to NMS in
AgentMeta
- agent sends to NMS in separate AgentConfigRequest
@@ -131,6 +131,12 @@ func (r *OneTimeRegistration) registerAgent() { | |||
InstanceGroup: r.config.InstanceGroup, | |||
Updated: updated, | |||
SystemUid: r.env.GetSystemUUID(), | |||
AgentDetails: &proto.AgentDetails{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickchen added the AgentDetails here. For backwards compatibility, if the AgentDetails is missing, assume default config minus the nginx-conifig-async? (basically what you have right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
select { | ||
case result := <-n.configApplyStatusChannel: | ||
return result | ||
case <-time.After(validationTimeout): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are comments that describe the intended actions for the config apply and subsequent config response:
• As part of the AgentConnectRequest include the feature flag in the AgentMeta message as part of the AgentConnectRequest message.
• Always issue NginxConfigResponse without waiting for a timeout (eliminate the timeout here) following the config apply.
• The periodic DataplaneStatus should include the AgentActivityStatus message with the last known deployment status.
• Send an on demand DataplaneStatus when terminal states are achieved like success or fail. This should be sent once every 60 seconds by default.
so this timeout would be un-necessary. Adding note here for reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Nick, so the reason for the timeout is that we wanted to preserve backgrounds compatibly as best we could. We added the timeout so that old versions of the controller continued to work with the newer version of the agent. We do this by waiting to see if the config apply can be done in a reasonable time and if it can return the response as before. I have added a task for a future release to remove this logic and go fully async (i.e. remove the timeout).
Proposed changes
Updated apply config logic to validate the nginx config in a separate go routine and return the result of the config reply in the Dataplane status.
If a nginx config apply takes a long time, the Dataplane status will indicate that the config apply is pending. Once the nginx config apply completes, the Dataplane status will indicate whether the config apply was either successful or failed.
Checklist
Before creating a PR, run through this checklist and mark each as complete.
CONTRIBUTING
documentREADME.md
)