Fix caching issues #2847
Merged
Changes from 2 commits (3 commits total)
Commits
f69d152  Addressed race condition caused by async image caching with no wait w… (mlgibbons)
4b256ed  Update to catch any errors in loading of cached images into Docker ca… (mlgibbons)
e0cb9ce  Made caching async with block for completion before copy of images in… (mlgibbons)
I appreciate the race condition here, but I'm not quite sure that this is the right fix. We implemented it this way to avoid slowing things down at all for non-offline mode users.
The cache was basically designed to be "best effort", and making it synchronous like this could introduce flakiness and delays for the standard use-case.
Any ideas here?
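For context, a minimal sketch (in Go, which the project is written in) of the fire-and-forget, "best effort" pattern described above. The function names here are hypothetical placeholders, not the project's actual API:

```go
// Minimal sketch only: cacheImagesToHost and startVMAndLoadImages are
// hypothetical placeholders, not minikube's actual functions.
package main

import "log"

// cacheImagesToHost stands in for pulling image tarballs into the host cache.
func cacheImagesToHost(images []string) error {
	// ... pull and save each image to the cache directory ...
	return nil
}

// startVMAndLoadImages stands in for VM creation and the "docker load" step.
func startVMAndLoadImages() {}

func main() {
	images := []string{"k8s.gcr.io/pause:3.0"}

	// Fire-and-forget: caching runs in the background and failures are only
	// logged, so the normal start path is never slowed down. Nothing waits
	// for this goroutine, which is the race being discussed: "docker load"
	// can run against a cache that is still being populated.
	go func() {
		if err := cacheImagesToHost(images); err != nil {
			log.Printf("best-effort image caching failed: %v", err)
		}
	}()

	startVMAndLoadImages()
}
```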
I take your point about the standard use case. I've actually found that it makes the start operation more reliable (particularly if a previous one had been terminated), as well as making debugging much easier, since the log is now predictable and deterministic. If the cache loading has not completed before the "docker load", we often end up with incomplete loads into the cache and, in online mode, re-downloading of the images, which I would expect to reduce performance due to the duplication.
One improvement would be to allow the image download to be async but then block on it completing before the call to UpdateCluster() (see the sketch below). This would allow it to run in parallel with the VM creation / startup while preventing the incomplete-load problem.
Another change would be to alter the semantics so that the load is not best effort, i.e. if enabled then it either succeeds or fails (as per the patch); if this behaviour is not desired then the cache-images flag should be set to false, so the standard use case would not be impacted. This seems more logical to me, i.e. if caching is enabled then it should complete successfully rather than failing silently.
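A minimal sketch of the first option, with hypothetical placeholder functions rather than the project's real API: the download runs concurrently with VM creation, and the start path blocks on its result before the UpdateCluster-style image load.

```go
// Sketch of "async download, block before UpdateCluster". All names here
// (cacheImagesToHost, createAndStartVM, updateCluster) are placeholders.
package main

import "log"

func cacheImagesToHost(images []string) error { return nil } // placeholder for the image download
func createAndStartVM()                       {}             // placeholder for VM creation / startup
func updateCluster()                          {}             // placeholder for loading images into the VM

func main() {
	images := []string{"k8s.gcr.io/pause:3.0"}

	// Report the caching result on a buffered channel so the goroutine can
	// finish even if nothing is receiving yet.
	cacheDone := make(chan error, 1)
	go func() {
		cacheDone <- cacheImagesToHost(images)
	}()

	// VM creation runs in parallel with the download.
	createAndStartVM()

	// Block here until the cache is fully populated, eliminating the
	// incomplete-load race before anything tries to load from the cache.
	if err := <-cacheDone; err != nil {
		log.Fatalf("image caching failed: %v", err)
	}
	updateCluster()
}
```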
Another alternative would be to add a flag such as "mustCacheImages" or similar, but I'd prefer to avoid adding more flags if possible, and I think the other two options are a better approach. Let me know what you think. Cheers.
A further thought, and I think the better option, is to make False the default for --cache-images and keep the behaviour as coded in the PR. This makes things very clear and consistent, i.e. don't cache images by default, but if caching is requested then exit if it fails. Possibly also add the async wait before UpdateCluster() which I mentioned earlier.
Thanks for the work and thoughtful discussion here! I think I agree with your last proposal. Then we can easily change the default to true if we feel it does make things more stable.
A change in behavior from --cache-images default true, to --cache-images default false?
"Feels" like the wrong direction.
I believe the binaries for iso/kubeadm/kubelet are all downloaded to the host, cached & imported/used later. Is it too much of a departure to do the same for images? Caching & blocking at UpdateCluster by default for k8s images seems the "best" option (immediate stability risks ignored).
Sounds like "delay UpdateCluster until cached images are ready" would be preferred but is too risky a default right now?
With switches and defaults, setting the tone for a future where caching is optimal, reliable and the default seems best, even if it takes a little while to get there.
My use case includes many lower-bandwidth customers, and "frequent" destroy/starts.
I'd appreciate a config value akin to "kubernetes-version" for "cacheWaitOnKubernetesImages=true", and then in the future, once reliable image caching is established, make blocking the default.
Hi. I can understand where you're coming from. I think we can separate this into two distinct items: 1) when caching is used it needs to be reliable, so the race condition needs to be addressed; 2) whether image caching should default to true or false.
Item one needs to be fixed, and I think everyone is in agreement on that. The commit I just made addresses that issue. Item two depends on your perspective, and I could argue for both cases. The key difference in my mind is that the ISO, kubeadm and kubectl MUST be downloaded to be able to spin up the cluster. The images MAY be downloaded, cached and loaded, but don't need to be, as the caching is an optimisation. Given that, treating them differently to the other items makes sense.
I would personally go with having image caching disabled by default for a number of reasons, including that it's one less thing to worry about if you don't need it, and that I've spent too much time deleting the incomplete images left in the cache dir after hitting Ctrl-C during cache population. The commit I made has the default set to False (I read your comment after the commit was made), but it could be either depending on what you value most and your use case, i.e. there is no "right" default setting.
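For illustration, a minimal sketch of the strict opt-in semantics described here, with --cache-images defaulting to false. Apart from the flag name taken from the discussion, the function names are hypothetical placeholders:

```go
// Sketch of strict, opt-in caching: if --cache-images is set, a caching
// failure aborts the start instead of failing silently.
package main

import (
	"flag"
	"log"
)

func cacheImagesToHost() error { return nil } // placeholder for populating the host cache
func startCluster()            {}             // placeholder for the rest of cluster start-up

func main() {
	cacheImages := flag.Bool("cache-images", false,
		"cache Kubernetes images on the host before starting the cluster")
	flag.Parse()

	if *cacheImages {
		// Caching is opt-in; once requested it must succeed rather than
		// being treated as a best-effort optimisation.
		if err := cacheImagesToHost(); err != nil {
			log.Fatalf("--cache-images was requested but caching failed: %v", err)
		}
	}
	startCluster()
}
```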