Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async nginx config reload #70

Merged
merged 20 commits into from
Nov 3, 2022
Merged

Async nginx config reload #70

merged 20 commits into from
Nov 3, 2022

Conversation

dhurley
Copy link
Collaborator

@dhurley dhurley commented Oct 11, 2022

Proposed changes

Updated apply config logic to validate the nginx config in a separate go routine and return the result of the config reply in the Dataplane status.
If a nginx config apply takes a long time, the Dataplane status will indicate that the config apply is pending. Once the nginx config apply completes, the Dataplane status will indicate whether the config apply was either successful or failed.

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING document
  • If applicable, I have added tests that prove my fix is effective or that my feature works
  • If applicable, I have checked that any relevant tests pass after adding my changes
  • I have updated any relevant documentation (README.md)

@dhurley dhurley self-assigned this Oct 11, 2022
@dhurley dhurley added enhancement New feature or request and removed dependencies labels Oct 11, 2022
@oliveromahony
Copy link
Contributor

Maybe rename the PR to async reload

@lcrilly
Copy link

lcrilly commented Oct 11, 2022

Updated apply config logic

This makes it sound like the existing synchronous-with-timeout logic is replaced with an async operation. But I've also heard that this is not a breaking change.

Please clarify if the previous behaviour is preserved?

@dhurley dhurley changed the title Timed reload Async nginx config reload Oct 12, 2022
@oliveromahony
Copy link
Contributor

Updated apply config logic

This makes it sound like the existing synchronous-with-timeout logic is replaced with an async operation. But I've also heard that this is not a breaking change.

Please clarify if the previous behaviour is preserved?

yeah the old behaviour/workflow is upheld, it is only the failure scenarios are more "comprehensive" beyond the timeout period of the DPM on NMS

@dhurley dhurley requested a review from nickchen October 17, 2022 15:04
Copy link
Contributor

@nickchen nickchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't look at the test. Can you ensure each config apply action only has one response.

succeeded: succeeded,
correlationId: cmd.Meta.MessageId,
timestamp: types.TimestampNow(),
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, use a named function with passed in params for maintainability.

}
}

func (n *Nginx) completeConfigApply(response *NginxConfigValidationResponse) *proto.AgentActivityStatus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of hard to follow because of async, but i believe we could end up with two responses due to the race between: 1) the non-async/inline response 2) the message handling.

So testing this build, i'm seeing failed to create response message: unknown action: UNKNOWN

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah I can see where the completeConfigApply function could be called twice alright. I'll put a fix in for that, but you still should just get back one NginxConfigResponse. I'll double check that there in the morning and get back to you on it

case core.NginxConfigValidationFailed:
switch response := message.Data().(type) {
case *NginxConfigValidationResponse:
n.rollbackConfigApply(response)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same potential race between rollback message handling, and inline.

src/plugins/nginx.go Outdated Show resolved Hide resolved
FeatureActivityEvents = FeaturesKey + KeyDelimiter + "activity-events"
)

func GetDefaultFeatures() []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the server needs to know the configured set of features that the agent supports. I think we have the following options:

  1. agent sends to NMS in AgentMeta
  2. agent sends to NMS in separate AgentConfigRequest

@@ -131,6 +131,12 @@ func (r *OneTimeRegistration) registerAgent() {
InstanceGroup: r.config.InstanceGroup,
Updated: updated,
SystemUid: r.env.GetSystemUUID(),
AgentDetails: &proto.AgentDetails{
Copy link
Contributor

@oliveromahony oliveromahony Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickchen added the AgentDetails here. For backwards compatibility, if the AgentDetails is missing, assume default config minus the nginx-conifig-async? (basically what you have right now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

select {
case result := <-n.configApplyStatusChannel:
return result
case <-time.After(validationTimeout):
Copy link
Contributor

@nickchen nickchen Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are comments that describe the intended actions for the config apply and subsequent config response:
• As part of the AgentConnectRequest include the feature flag in the AgentMeta message as part of the AgentConnectRequest message.
• Always issue NginxConfigResponse without waiting for a timeout (eliminate the timeout here) following the config apply.
• The periodic DataplaneStatus should include the AgentActivityStatus message with the last known deployment status.
• Send an on demand DataplaneStatus when terminal states are achieved like success or fail. This should be sent once every 60 seconds by default.

so this timeout would be un-necessary. Adding note here for reference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Nick, so the reason for the timeout is that we wanted to preserve backgrounds compatibly as best we could. We added the timeout so that old versions of the controller continued to work with the newer version of the agent. We do this by waiting to see if the config apply can be done in a reasonable time and if it can return the response as before. I have added a task for a future release to remove this logic and go fully async (i.e. remove the timeout).

@dhurley dhurley merged commit 5261845 into main Nov 3, 2022
@dhurley dhurley deleted the timed_reload branch November 3, 2022 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants