
Ensure instance data is synced with pod state before reporting DS success #617

Merged: 7 commits into master from fix_ds_race_condition, Nov 11, 2019

Conversation

@timothysmith0609 (Contributor) commented Nov 7, 2019:

What are you trying to accomplish with this PR?
#580 added more detailed handling of daemon set rollouts. In particular, the addition of new nodes during a deployment should not interfere with convergence logic. However, this has resulted in a flaky test where the DS deploy is considered successful while the DS is still reporting:
1 updatedNumberScheduled, 1 desiredNumberScheduled, 0 numberReady

After some digging, it appears that we fail to ensure that the state of the target DS pods is reflected in the instance data of the DS itself when checking deploy_succeeded?. This leads to a situation where the deploy is reported successful (i.e. the pod(s) are up) but the DS status itself hasn't converged to that reality.

How is this accomplished?
Add an extra check in `deploy_succeeded?` to ensure `considered_pods.length == rollout_data["numberReady"]`.
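In outline (a sketch only; the actual diff appears later in this thread and puts the check in `relevant_pods_ready?`, whose identifiers `rollout_data` and `considered_pods` are reused here), the success condition gains one more clause:

    considered_pods.present? &&
      considered_pods.all?(&:deploy_succeeded?) &&
      considered_pods.length == rollout_data["numberReady"].to_i # new clause: DS status must agree with the pods we observed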

What could go wrong?
In the worst case, this adds additional sync loops while we wait for the DS to converge, but convergence is the real definition of success, so it was an error to ignore it in the first place.

It's also impossible to test this, since it relies on timing of sync loops between pods and the DS :(

@dalehamel (Member) commented:

> It's also impossible to test this, since it relies on timing of sync loops between pods and the DS :(

In our test can't we mock the daemonset being out of sync with the pods?

@dalehamel (Member) commented:

> Add an extra check in `deploy_succeeded?` to ensure `considered_pods.length == rollout_data["numberReady"]`

As discussed in Slack (and now changed in your PR), I think this should be >=: numberReady should always be equal to or greater than the number of pods considered, since we are only considering pods for potentially a subset of nodes.
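For illustration, the relaxed comparison reads something like this (a sketch, reusing the identifiers from the diff below):

    # numberReady may count ready pods on nodes we deliberately excluded,
    # so it can legitimately exceed the number of considered pods.
    rollout_data["numberReady"].to_i >= considered_pods.length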

Comment on lines 110 to 121
def test_deploy_waits_for_daemonset_status_to_converge_to_pod_states
  status = {
    "desiredNumberScheduled": 1,
    "updatedNumberScheduled": 1,
    "numberReady": 0,
  }
  ds_template = build_ds_template(filename: 'daemon_set.yml', status: status)
  ready_pod_template = load_fixtures(filenames: ['daemon_set_pods.yml']).first # should be a pod in `Ready` state
  node_templates = load_fixtures(filenames: ['nodes.yml'])
  ds = build_synced_ds(ds_template: ds_template, pod_templates: [ready_pod_template], node_templates: node_templates)
  refute_predicate(ds, :deploy_succeeded?)
end
@timothysmith0609 (Contributor, Author) commented:

Testing this against master yields a failure, confirming my hypothesis that the divergent state between the pods and the DS was the root of the error.

@dturn (Contributor) left a comment:

The race condition here is essentially that (due to controller update cycles) a pod becomes ready between fetching the DS from the server and fetching the pods?

@@ -58,8 +58,11 @@ def relevant_pods_ready?
  return true if rollout_data["desiredNumberScheduled"].to_i == rollout_data["numberReady"].to_i # all pods ready
  relevant_node_names = @nodes.map(&:name)
  considered_pods = @pods.select { |p| relevant_node_names.include?(p.node_name) }
  @logger.debug("Considered #{considered_pods.size} pods out of #{@pods.size} for #{@nodes.size} nodes")
  considered_pods.present? && considered_pods.all?(&:deploy_succeeded?)
  @logger.debug("DaemonSet is reporting #{rollout_data['numberReady']} pods ready")
@dturn (Contributor) commented:

Should we consolidate the two debugging statements? Also, do we need to name the DS in case we have a deployment with multiple DSs?

@timothysmith0609 (Contributor, Author) commented:

Yeah, originally I didn't want to do that to avoid an overly long log line, but since it's a debug statement that's not really a big concern. Fixed to emit a single statement.
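For example, the consolidated line could look roughly like this (a sketch only; the exact wording in the final commit may differ, and including the DS name assumes the resource exposes a name method, per the suggestion above):

    @logger.debug("DaemonSet #{name} is reporting #{rollout_data['numberReady']} pods ready. " \
      "Considered #{considered_pods.size} pods out of #{@pods.size} for #{@nodes.size} nodes.")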

@dalehamel (Member) left a comment:

thanks!

@timothysmith0609 merged commit d470158 into master on Nov 11, 2019
@timothysmith0609 deleted the fix_ds_race_condition branch on November 11, 2019 at 16:30