pkg/daemon: rewrite the MCD as a controller #548
Conversation
/hold The first commit just added the backbone of the controller; I'm going to work on the handling of errors and syncs now.
Force-pushed 466fd20 to b1c8c80
aws issue: /retest
/retest
oh cool, this is passing all e2e already 😄 guess I'll work on the rollback functionality for partial writes tomorrow then, and test it by fuzzing the calls to the apiserver and creating bad MCs to "degrade" (with a different meaning now) the nodes.
pkg/daemon/update.go (Outdated)
That's...going to be nontrivial.
I may be completely stupid, but isn't this just a matter of inverting the arguments to updateFiles?

// update files on disk that need updating
if err := dn.updateFiles(newConfig, oldConfig); err != nil {
	return err
}

I'm sure I'm missing something obvious, as I haven't thought about this much yet.
ooo I have the solution actually: #401
I'll modify the atomic write PR to split the writing into a real transaction and only commit at the very end! I'll try that.
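For illustration, here's a minimal sketch of that transaction idea, assuming a plain map of target paths to contents rather than the MCD's actual types: stage every file first, and only rename the staged copies into place once all writes have succeeded.

package txwrite

import "os"

type stagedFile struct{ tmp, dst string }

func writeAllOrNothing(files map[string][]byte) error {
	var staged []stagedFile
	// Phase 1: write everything to temp files next to the targets.
	for dst, data := range files {
		tmp := dst + ".tmp"
		if err := os.WriteFile(tmp, data, 0644); err != nil {
			// Abort: discard anything staged so far; the real files
			// were never touched.
			for _, s := range staged {
				os.Remove(s.tmp)
			}
			return err
		}
		staged = append(staged, stagedFile{tmp, dst})
	}
	// Phase 2: commit. rename(2) is atomic per file, so a crash here
	// can leave a mix of old and new files, but never a partially
	// written one.
	for _, s := range staged {
		if err := os.Rename(s.tmp, s.dst); err != nil {
			return err
		}
	}
	return nil
}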
Yeah, that could work. I think where it gets tricky is trying to handle an error when we're in the middle of either writing files or doing a revert. Then we're in an "indeterminate middle state", much like what happens in the middle of a yum/apt update.
But anyways for now we can probably ignore that and yeah, just delete any files added by the new config and restore all the files from the old one.
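A hedged sketch of that fallback, assuming a config is just a map of file paths to contents (the real MachineConfig carries much more): delete anything only the new config introduced, then rewrite every file from the old config.

package rollback

import "os"

func rollbackFiles(oldFiles, newFiles map[string][]byte) error {
	// Delete files the new config introduced.
	for path := range newFiles {
		if _, ok := oldFiles[path]; !ok {
			if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
				return err
			}
		}
	}
	// Restore every file from the old config.
	for path, data := range oldFiles {
		if err := os.WriteFile(path, data, 0644); err != nil {
			return err
		}
	}
	return nil
}

This is the behavior the logs later in this thread demonstrate: /etc/chrony.conf disappears once the daemon rewrites from the old config.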
I'm ignorant about yum/apt update - how do those handle such a failure in the middle of an update?
Only skimmed this, but offhand it looks good to me!
So yeah I am hacking on #545 since I messed up the release payload (now fixed to use …)
@runcom I think that the degraded logic should stay the same in this PR, and the unreconcilable/degraded changes should be done in another PR, both for testing and visibility reasons.
this is what I wanted to see: retry after network failures :)
tested this on both upgrade and installation and it's amazing. With this we can also recover from a failed reconcile by deleting the bad machineconfig + creating a new one.
/retest
This is awesome - really nice work! From my PoV good to merge now.
/approve
/assign kikisdeliveryservice
pkg/daemon/daemon.go (Outdated)
Don't need to do it in this PR, but let's start trying to add some doc comments to functions like this if we're rewriting them anyway.
added a comment, thanks!
pkg/daemon/update.go (Outdated)
Oh that's an interesting pattern, didn't realize it was that easy to do "do this if we're returning an error". That's quite hard in other languages.
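For reference, a minimal sketch of the Go idiom in question (function names here are illustrative, not the MCD's actual code): with a named return value, a deferred closure can check whether the function is about to return an error and run cleanup only in that case.

package main

import "fmt"

func writeFiles() error { return fmt.Errorf("simulated failure") }

func applyUpdate() (retErr error) {
	defer func() {
		if retErr != nil {
			// Runs only when applyUpdate is returning an error.
			fmt.Println("rolling back because:", retErr)
		}
	}()
	if err := writeFiles(); err != nil {
		return err
	}
	return nil
}

func main() {
	_ = applyUpdate()
}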
pkg/daemon/update.go (Outdated)
Very cool!
This functionality looks like it's working fine; I added an error in the MCD simulating a... error, and it goes back and rewrites files from oldConfig just fine (I expect it to work the same for SSH keys). In the logs below, notice how the file "/etc/chrony.conf" is no longer there once we write from the old config:
I0314 16:51:27.336603 6697 update.go:187] Checking reconcilable for config worker-2231f5479a086534c246ec862625a3f1 to worker-41c30245368d268544394d45a2e70598
I0314 16:51:27.336628 6697 update.go:245] Checking if configs are reconcilable
I0314 16:51:27.338097 6697 update.go:385] Updating files
I0314 16:51:27.338118 6697 update.go:586] Writing file "/etc/sysconfig/crio-network"
I0314 16:51:27.344087 6697 update.go:586] Writing file "/var/lib/kubelet/config.json"
I0314 16:51:27.347939 6697 update.go:586] Writing file "/etc/kubernetes/ca.crt"
I0314 16:51:27.356594 6697 update.go:586] Writing file "/etc/sysctl.d/forward.conf"
I0314 16:51:27.359890 6697 update.go:586] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0314 16:51:27.362261 6697 update.go:586] Writing file "/etc/containers/registries.conf"
I0314 16:51:27.365901 6697 update.go:586] Writing file "/etc/containers/storage.conf"
I0314 16:51:27.369021 6697 update.go:586] Writing file "/etc/crio/crio.conf"
I0314 16:51:27.372757 6697 update.go:586] Writing file "/etc/kubernetes/kubelet.conf"
I0314 16:51:27.376479 6697 update.go:586] Writing file "/etc/chrony.conf"
I0314 16:51:27.380001 6697 update.go:523] Writing systemd unit "kubelet.service"
I0314 16:51:27.383001 6697 update.go:558] Enabling systemd unit "kubelet.service"
I0314 16:51:27.383052 6697 update.go:476] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0314 16:51:27.383063 6697 update.go:404] Deleting stale data
I0314 16:51:27.383084 6697 update.go:648] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0314 16:51:27.385047 6697 daemon.go:1009] Current and target osImageURL have matching digest "sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670"
I0314 16:51:27.385071 6697 update.go:648] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0314 16:51:27.386892 6697 update.go:385] Updating files
I0314 16:51:27.386910 6697 update.go:586] Writing file "/etc/sysconfig/crio-network"
I0314 16:51:27.388912 6697 update.go:586] Writing file "/var/lib/kubelet/config.json"
I0314 16:51:27.393087 6697 update.go:586] Writing file "/etc/kubernetes/ca.crt"
I0314 16:51:27.396004 6697 update.go:586] Writing file "/etc/sysctl.d/forward.conf"
I0314 16:51:27.398847 6697 update.go:586] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0314 16:51:27.400039 6697 update.go:586] Writing file "/etc/containers/registries.conf"
I0314 16:51:27.402039 6697 update.go:586] Writing file "/etc/containers/storage.conf"
I0314 16:51:27.407845 6697 update.go:586] Writing file "/etc/crio/crio.conf"
I0314 16:51:27.413978 6697 update.go:586] Writing file "/etc/kubernetes/kubelet.conf"
I0314 16:51:27.417781 6697 update.go:523] Writing systemd unit "kubelet.service"
I0314 16:51:27.420848 6697 update.go:558] Enabling systemd unit "kubelet.service"
I0314 16:51:27.420897 6697 update.go:476] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0314 16:51:27.420908 6697 update.go:404] Deleting stale data
I0314 16:51:27.421027 6697 daemon.go:396] Unable to apply update: test
E0314 16:51:27.421053 6697 writer.go:98] marking degraded due to: test
so to understand: this tried to apply the first config; for whatever reason it can't, so it goes back to writing the previous config that didn't have the chrony.conf & annotates the node as degraded?
yep, and it keeps trying to apply the new config, as expected of a "controller" which keeps retrying (even if in this case the obvious fix would be to remove the MC, create a new correct one, and reconcile - this was just a smoke test)
so going forward, to recover from a bad MC, is it now just: make a new MC, or make a new MC & remove the degraded annotation? (i think we should probably write this down for users?)
recovering from a bad MC is just: 1) delete the bad MC first, 2) create a new MC with something "good" that applies
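To make step 1 concrete, here's a hedged sketch using the generic Kubernetes dynamic client (the MCO also ships a generated clientset; the names and flow here are illustrative only). Step 2 would create the corrected MachineConfig the same way.

package mcrecover

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func deleteBadMC(cfg *rest.Config, name string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigs",
	}
	// MachineConfigs are cluster-scoped, hence no namespace.
	return client.Resource(gvr).Delete(context.TODO(), name, metav1.DeleteOptions{})
}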
I'm wondering if we should, in the future, react to deletions and just undo the whole generated machineconfigs altogether
I do actually believe that deleting an MC triggers a regeneration of the generated MC for the pool (according to the render_controller code)
Signed-off-by: Antonio Murdaca <[email protected]>
Signed-off-by: Antonio Murdaca <[email protected]>
Effectively avoiding any error like this:
E0314 14:49:12.734384 5690 event.go:259] Could not construct reference to: '&v1.Node{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ip-10-0-147-217.us-west-2.compute.internal", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.NodeSpec{PodCIDR:"", ProviderID:"", Unschedulable:false, Taints:[]v1.Taint(nil), ConfigSource:(*v1.NodeConfigSource)(nil), DoNotUse_ExternalID:""}, Status:v1.NodeStatus{Capacity:v1.ResourceList(nil), Allocatable:v1.ResourceList(nil), Phase:"", Conditions:[]v1.NodeCondition(nil), Addresses:[]v1.NodeAddress(nil), DaemonEndpoints:v1.NodeDaemonEndpoints{KubeletEndpoint:v1.DaemonEndpoint{Port:0}}, NodeInfo:v1.NodeSystemInfo{MachineID:"", SystemUUID:"", BootID:"", KernelVersion:"", OSImage:"", ContainerRuntimeVersion:"", KubeletVersion:"", KubeProxyVersion:"", OperatingSystem:"", Architecture:""}, Images:[]v1.ContainerImage(nil), VolumesInUse:[]v1.UniqueVolumeName(nil), VolumesAttached:[]v1.AttachedVolume(nil), Config:(*v1.NodeConfigStatus)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'Reboot' 'Node will reboot into config worker-47d52401530ce05855ba991fbb59df12'
Signed-off-by: Antonio Murdaca <[email protected]>
awesome, I'll look at this today!
Signed-off-by: Antonio Murdaca <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom. The full list of commands accepted by this bot can be found here; the pull request process is described here.
aws timeout /retest
/retest Please review the full test history for this PR and help us cut down flakes.

1 similar comment
merged! let the fun begin! 🎉
This might have broken upgrade (we had a run of green upgrades, then this edge went red). We'll monitor and let you know if it did, and I'll also open a PR for you in the morning to add the e2e-aws-upgrade job to the repo so you can test yourself (it's in a slow roll).
A cautious no - things started greening again afterward.
Alright, I've already identified some regressions from this PR, but I'm working on fixing them and adding test coverage as well (this change was needed anyway).
Signed-off-by: Antonio Murdaca [email protected]
- What I did
rewrite the daemon as a controller with a sync handler, mainly to deal with transient vs permanent errors (a minimal sketch of the pattern follows after this list).
effectively close #395
also close #401
close #299
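As referenced above, a minimal sketch of the controller shape, assuming the standard client-go workqueue pattern (the real sync handler does far more): a failed sync requeues the key with rate-limited backoff, so transient failures such as network errors are retried automatically.

package worker

import "k8s.io/client-go/util/workqueue"

func runWorker(queue workqueue.RateLimitingInterface, sync func(key string) error) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		if err := sync(key.(string)); err != nil {
			// Requeue with backoff; a permanently bad config keeps
			// the node marked degraded while the retries continue.
			queue.AddRateLimited(key)
		} else {
			// Success: reset this key's backoff counter.
			queue.Forget(key)
		}
		queue.Done(key)
	}
}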
- How to verify it
CI should tell us if this is generally good (hopefully)
- Description for the changelog
/cc @derekwaynecarr