Allow parallel image stream importing #6407
Conversation
time.Sleep(100 * time.Millisecond)
}
glog.V(5).Infof("requeuing %s to the worklist", workingSetKey(staleImageStream))
c.work <- staleImageStream
In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need a fallback policy.
> In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need a fallback policy.

I think we should tolerate this. You can end up in pathological states where only a single thread can make progress at a time, defeating the intent of all this code, but they should all wait nicely in line while log-jamming the channel.

[test]
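A minimal sketch of the fallback policy being weighed here, assuming the same buffered c.work channel and glog logging as the snippet above; the select/default fallback is only an illustration of one option, not what this PR does:

	select {
	case c.work <- staleImageStream:
		// requeued; a worker will pick it up again
	default:
		// channel is full and every sender would block here (the "wedge");
		// drop the requeue and rely on the periodic resync to retry it
		glog.V(5).Infof("work channel full, deferring %s to the next sync", workingSetKey(staleImageStream))
	}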
-	Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(osclient, osclient, 10, 2*time.Minute)
cast osclient to the interfaces the function is expecting for sanity
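A hedged sketch of what that cast would look like, mirroring the updated constructor call that appears later in this thread (the ImageStreamsNamespacer and ImageStreamMappingsNamespacer interface names are the ones discussed there, not something confirmed by this hunk):

	controller := imagecontroller.NewImportController(
		client.ImageStreamsNamespacer(osclient),
		client.ImageStreamMappingsNamespacer(osclient),
		10, 2*time.Minute,
	)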
I'll review it later today when I'm back online.
I wouldn't worry about that.
re[test]
err := kclient.RetryOnConflict(kclient.DefaultBackoff, func() error {
	liveImageStream, err := c.streams.ImageStreams(staleImageStream.Namespace).Get(staleImageStream.Name)
	if err != nil {
if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?
> if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?

RetryOnConflict retries on things other than conflicts? If so, I hereby declare that my weekend starts now.
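For context, a minimal sketch of the contract being debated, using the same kclient.RetryOnConflict call quoted above: the closure is re-run only when it returns a Conflict error, so a NotFound (or any other error) ends the loop immediately. The comments are illustrative, not part of the PR:

	err := kclient.RetryOnConflict(kclient.DefaultBackoff, func() error {
		liveImageStream, err := c.streams.ImageStreams(staleImageStream.Namespace).Get(staleImageStream.Name)
		if err != nil {
			// a NotFound returned here is NOT retried; only conflicts re-run the closure
			return err
		}
		// a Conflict returned by Next triggers another attempt against the live object
		return c.Next(liveImageStream)
	})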
Should the description for this PR mention that it fixes #6259?
Also fixes #6381
It improves those, but I think that @liggitt's future pull to reduce the timeout is more directly related.
// if we're already in the workingset, that means that some thread is already trying to do an import for this.
// This does NOT mean that we shouldn't attempt to do this work, only that we shouldn't attempt to do it now.
if c.isInWorkingSet(staleImageStream) {
isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time
I'd also like the removeFromWorkingSet deferred immediately, which probably means splitting the body of this case into a function
👍
> isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time

how embarrassing.
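A hedged sketch of the combined call being suggested, assuming a mutex-guarded map inside the controller; the field names workingSet and workingSetLock, the handleImport split, and the api.ImageStream parameter type are assumptions, not this PR's code:

	// addToWorkingSet claims the stream's key and reports whether it was newly added,
	// so two workers can never both start an import for the same stream.
	func (c *ImportController) addToWorkingSet(stream *api.ImageStream) (added bool) {
		c.workingSetLock.Lock()
		defer c.workingSetLock.Unlock()
		key := workingSetKey(stream)
		if _, exists := c.workingSet[key]; exists {
			return false
		}
		c.workingSet[key] = struct{}{}
		return true
	}

	// splitting the body out lets removeFromWorkingSet be deferred immediately
	func (c *ImportController) handleImport(stream *api.ImageStream) {
		if !c.addToWorkingSet(stream) {
			return // another worker owns this stream right now; retry on a later sync
		}
		defer c.removeFromWorkingSet(stream)
		// ... perform the import ...
	}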
I was just recently (when fixing a FIFO-related issue in the build controller) thinking about moving our controllers to the patterns from upstream, wdyt?
force-pushed from b3eb2e7 to ab44513
	return c.Next(liveImageStream)
})

if err != nil {
do we really want to log NotFound errors as big unexpected errors?
> do we really want to log NotFound errors as big unexpected errors?

I'll explicitly deal with the common error case, but if the image stream disappears while we're working on it, you'll still get a big error. More specific handling requires more plumbing into almost-dead code.
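A small sketch of "explicitly deal with the common case", assuming kapierrors is the Kubernetes API errors package with IsNotFound; the surrounding names come from the quoted code:

	if err != nil && !kapierrors.IsNotFound(err) {
		// the stream genuinely failed to import; surface it loudly
		util.HandleError(err)
	}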
-	Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(client.ImageStreamsNamespacer(osclient), client.ImageStreamMappingsNamespace(osclient), 10, 2*time.Minute)
ImageStreamMappingsNamespacer
@liggitt comments addressed.
re[test]
re[test]
re[test]
flake here:
re[test]
flake here:
re[test]
flake in #6065
flake on
	util.HandleError(err)
	return retries.Count < 5
},
kutil.NewTokenBucketRateLimiter(1, 10),
should probably add back rate limiting to the individual workers
> should probably add back rate limiting to the individual workers

The concern being that someone creates image streams over and over again, tricking us into pounding a docker registry looking for metadata?

I thought this limiter controlled the rate at which retries are done. We tight-loop on conflicts, but don't retry on any other conditions any more.
all failed imports retry on the sync period, which means if you accumulate a lot of failing image streams, you get a thundering herd every two minutes
The sync period may already jitter; I would rate limit the workers. Are you ok with the retry being sync-period driven instead of re-queue driven?
Note that I'm not against termination in a large number of deterministic stop conditions - 404, access denied, unrecognized non-connection errors. I'm just highlighting that certain errors beyond conflict should be retried because transient failures are inevitable. The new import endpoint will allow that distinction to be drawn for partial completion, but we still need to make an effort.
You said you're refactoring Next. I'd expect that you're handling those. If not, would you like me to do the refactor here? If we're doing a large import, it makes sense to have the dockerclient retry near the point of failure to avoid rework.

On the OpenShift resource mutation side, the update problem exists even without this pull, so you'll have to have patch conflicts, compatibility, and coverage evaluation to handle update conflicts at the point of failure anyway. Again, is that in your refactor, or do you want it for this one?

The only thing this is changing is when the retry happens. Now it happens on the sync period instead of immediately at the time of failure.
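A hedged sketch of "rate limit the individual workers", reusing the kutil.NewTokenBucketRateLimiter construction from the removed factory code quoted earlier; the worker-loop shape, the shared limiter, and the numWorkers field are assumptions:

	limiter := kutil.NewTokenBucketRateLimiter(1, 10) // roughly one import per second with a burst of 10
	for i := 0; i < c.numWorkers; i++ {
		go func() {
			for stream := range c.work {
				limiter.Accept() // blocks until a token is available, smoothing the sync-period herd
				if err := c.Next(stream); err != nil {
					util.HandleError(err)
				}
			}
		}()
	}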
force-pushed from 517ad26 to e9232d3
This pull leaves us in this state until clayton's larger refactor:
@liggitt squashed down. Any other comments?
My PR should resolve the issues you mentioned cleanly - the new design is
LGTM, [test]
re[test]
flake in e2e, let's try [merge]
force-pushed from e9232d3 to a33813f
Evaluated for origin test up to a33813f
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8186/)
two UI failures here: @spadgett https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4508/consoleText
etcd flake for the other re[merge]
UI https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4521/consoleText
re[merge]
failed to propose re[merge]
etcd flake again re[merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4543/) (Image: devenv-rhel7_3094)
etcd flake re[merge]
Evaluated for origin merge up to a33813f
Allows multiple import image jobs to run at the same time and operates against live data to decide if the import really needs to happen.
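A hedged sketch of that overall shape, under the same assumptions as the earlier snippets (a buffered c.work channel, a handleImport that re-reads the live object before importing, and a numWorkers count; none of these names are confirmed by the diff here), complementing the rate-limited worker sketch above:

	func (c *ImportController) RunUntil(stopCh <-chan struct{}) {
		for i := 0; i < c.numWorkers; i++ {
			go func() {
				for {
					select {
					case stream := <-c.work:
						c.handleImport(stream) // checks live data and the working set before importing
					case <-stopCh:
						return
					}
				}
			}()
		}
	}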
@stevekuznetsov @pweil- ptal
@bparees fyi