Allow parallel image stream importing #6407

Merged
merged 1 commit into from
Jan 7, 2016

@deads2k deads2k commented Dec 18, 2015

Allows multiple import image jobs to run at the same time and operates against live data to decide if the import really needs to happen.

@stevekuznetsov @pweil- ptal
@bparees fyi

time.Sleep(100 * time.Millisecond)
}
glog.V(5).Infof("requeuing %s to the worklist", workingSetKey(staleImageStream))
c.work <- staleImageStream
Contributor Author

In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need to have a fallback policy.

Contributor Author

> In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need to have a fallback policy.

I think we should tolerate this. You can end up in pathological states where only a single thread can make progress at a time, destroying the intent of all this code, but they should all nicely wait in line while log-jamming the channel.
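One possible fallback policy for the full-channel case is a non-blocking send. The following is a minimal sketch, not code from this pull: `tryRequeue` and the plain string keys are hypothetical stand-ins for the controller's work channel and image stream keys.

```go
package main

import "fmt"

// tryRequeue attempts to put key back on the work channel without blocking.
// It reports false when the channel's buffer is full, so the caller can fall
// back to another policy (for example, dropping the key and relying on the
// next periodic resync to pick it up again).
func tryRequeue(work chan<- string, key string) bool {
	select {
	case work <- key:
		return true
	default:
		return false
	}
}

func main() {
	work := make(chan string, 1)          // tiny buffer to force the full case
	fmt.Println(tryRequeue(work, "ns/a")) // true: the buffer had room
	fmt.Println(tryRequeue(work, "ns/b")) // false: the buffer is full, no blocking
}
```

The trade-off is that a dropped key is only retried when something else re-enqueues it, whereas the blocking send discussed above never loses work but can leave every worker waiting in line.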

pweil- commented Dec 18, 2015

@soltysh

deads2k commented Dec 18, 2015

[test]

-		Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(osclient, osclient, 10, 2*time.Minute)
Contributor

cast osclient to the interfaces the function is expecting for sanity

Contributor

I'll review it later today when I'm back online.
On Dec 18, 2015 5:54 PM, "Jordan Liggitt" wrote:

In pkg/cmd/server/origin/run_components.go
#6407 (comment):

@@ -345,10 +345,7 @@ func (c *MasterConfig) RunSDNController() {
 // RunImageImportController starts the image import trigger controller process.
 func (c *MasterConfig) RunImageImportController() {
 	osclient := c.ImageImportControllerClient()
-	factory := imagecontroller.ImportControllerFactory{
-		Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(osclient, osclient, 10, 2*time.Minute)

cast osclient to the interfaces the function is expecting for sanity



Contributor

I wouldn't worry about that.

deads2k commented Dec 18, 2015

re[test]


err := kclient.RetryOnConflict(kclient.DefaultBackoff, func() error {
liveImageStream, err := c.streams.ImageStreams(staleImageStream.Namespace).Get(staleImageStream.Name)
if err != nil {
Contributor

if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?

Contributor Author

> if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?

RetryOnConflict retries on things other than conflicts? If so, I hereby declare that my weekend starts now.

@ncdc ncdc changed the title from "allow parallel image streams" to "Allow parallel image stream importing" Dec 18, 2015

ncdc commented Dec 18, 2015

Should the description for this PR mention that it fixes #6259?

@stevekuznetsov

Also fixes #6381

deads2k commented Dec 18, 2015

It improves those, but I think that @liggitt's future pull to reduce the timeout is more directly related.


// if we're already in the workingset, that means that some thread is already trying to do an import for this.
// This does NOT mean that we shouldn't attempt to do this work, only that we shouldn't attempt to do it now.
if c.isInWorkingSet(staleImageStream) {
Contributor

isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time

Contributor

I'd also like the removeFromWorkingSet deferred immediately, which probably means splitting the body of this case into a function

Contributor

👍

Contributor Author

> isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time

how embarrassing.
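The two review suggestions (a single check-and-add call, and a deferred removal) can be sketched as follows. This is an illustration under assumptions, not the code merged in this pull: the `workingSet` type, its string keys, and the `process` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// workingSet tracks keys that some worker is currently processing.
type workingSet struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func newWorkingSet() *workingSet {
	return &workingSet{keys: map[string]struct{}{}}
}

// add inserts key and reports whether it was newly added. The check and the
// insert happen under one lock, so two workers cannot both claim the same key.
func (w *workingSet) add(key string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, ok := w.keys[key]; ok {
		return false
	}
	w.keys[key] = struct{}{}
	return true
}

// remove releases the claim on key.
func (w *workingSet) remove(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.keys, key)
}

// process claims key, runs work, and always releases the claim afterwards.
func process(w *workingSet, key string, work func()) bool {
	if !w.add(key) {
		return false // another worker has it; try again later
	}
	defer w.remove(key)
	work()
	return true
}

func main() {
	ws := newWorkingSet()
	ok := process(ws, "ns/stream", func() {
		// While the work runs, a second claim on the same key is refused.
		fmt.Println("second claim:", ws.add("ns/stream"))
	})
	fmt.Println("processed:", ok, "reclaimable:", ws.add("ns/stream"))
}
```

With separate `isInWorkingSet` and `addToWorkingSet` calls, two workers could both pass the check before either one adds the key; folding them into one locked operation closes that window.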

soltysh commented Dec 18, 2015

While recently fixing a FIFO-related issue in the build controller, I was thinking about moving our controllers to the patterns used upstream. WDYT?

@deads2k deads2k force-pushed the parallel-image-import branch from b3eb2e7 to ab44513 Compare December 18, 2015 23:20
return c.Next(liveImageStream)
})

if err != nil {
Contributor

do we really want to log NotFound errors as big unexpected errors?

Contributor Author

> do we really want to log NotFound errors as big unexpected errors?

I'll explicitly handle the common form of this error, but if the image stream disappears while we're working on it, you'll still get a big error. More specific handling requires more plumbing into almost-dead code.

-		Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(client.ImageStreamsNamespacer(osclient), client.ImageStreamMappingsNamespace(osclient), 10, 2*time.Minute)
Contributor

ImageStreamMappingsNamespacer

deads2k commented Dec 21, 2015

@liggitt comments addressed.

deads2k commented Dec 21, 2015

#6435

re[test]

deads2k commented Dec 21, 2015

#6447

re[test]

deads2k commented Dec 21, 2015

#6447

re[test]

@stevekuznetsov

flake here:

FAILURE after 63.617s: hack/../test/cmd/builds.sh:108: executing 'oc process -f examples/sample-app/application-template-dockerbuild.json -l build=docker | oc create -f -' expecting success: the command returned the wrong error code
Standard output from the command:
imagestream "origin-ruby-sample" created
deploymentconfig "frontend" created
service "database" created
deploymentconfig "database" created
Standard error from the command:
Error from server: Timeout: request did not complete within allowed duration
Error from server: 501: All the given peers are not reachable (failed to propose on members [https://127.0.0.1:24001] twice [last error: Unexpected HTTP status code]) [0]
Error from server: imageStream "ruby-22-centos7" already exists
Error from server: buildconfig "ruby-sample-build" already exists
[FAIL] !!!!! Test Failed !!!!

#6447

re[test]

@stevekuznetsov

flake here:

FAILURE after 7.946s: hack/../test/cmd/templates.sh:56: executing 'oc delete template/template-type-precision' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
Error from server: templates "template-type-precision" not found
[FAIL] !!!!! Test Failed !!!!

#6453

re[test]

@stevekuznetsov

flake in TestAuthorizationSubjectAccessReview here:

I1221 21:18:36.111554   32431 etcd.go:37] Deleting &etcd.Node{Key:"/kubernetes.io", Value:"", Dir:true, Expiration:(*time.Time)(nil), TTL:0, Nodes:etcd.Nodes(nil), ModifiedIndex:0x30f, CreatedIndex:0x30f} (child of &etcd.Node{Key:"", Value:"", Dir:true, Expiration:(*time.Time)(nil), TTL:0, Nodes:etcd.Nodes{(*etcd.Node)(0xc208856060), (*etcd.Node)(0xc2088560c0)}, ModifiedIndex:0x0, CreatedIndex:0x0})
F1221 21:18:43.869295   32431 etcd.go:39] Unable to delete key: 100: Key not found (/kubernetes.io) [1066]

#6065
re[test]

@stevekuznetsov

flake on test/cmd/images here:

FAILURE after 10.875s: hack/../test/cmd/images.sh:125: executing 'oc tag mysql:latest tagtest3:latest tagtest4:latest --alias' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
Error from server: imageStream "tagtest3" already exists
[FAIL] !!!!! Test Failed !!!!

#6461

util.HandleError(err)
return retries.Count < 5
},
kutil.NewTokenBucketRateLimiter(1, 10),
Contributor

should probably add back rate limiting to the individual workers

Contributor Author

> should probably add back rate limiting to the individual workers

The concern being that someone creates image streams over and over again, tricking us into pounding a docker registry looking for metadata?

I thought this limiter controlled the rate at which retries are done. We tight loop on conflicts, but don't retry on any other conditions any more.

Contributor

all failed imports retry on the sync period, which means if you accumulate a lot of failing image streams, you get a thundering herd every two minutes

Contributor

sync period may already jitter, I would rate limit workers. are you ok with the retry being sync-period driven instead of re-queue driven?

Contributor Author

Note that I'm not against termination in a large number of
deterministic stop conditions - 404, access denied, unrecognized
non-connection errors. I'm just highlighting that certain errors
beyond conflict should be retried because transient failures are
inevitable. The new import endpoint will allow that distinction to be
drawn for partial completion, but we still need to make an effort.

You said you're refactoring Next. I'd expect that you're handling those. If not, would you like me to do the refactor here? If we're doing a large import, it makes sense to have the dockerclient retry near the point of failure to avoid rework.

On the openshift resource mutation side, the update problem exists even without this pull, so you'll have to have patch conflicts, compatibility, and coverage evaluation to handle update conflicts at the point of failure anyway. Again, is that in your refactor or do you want it in this one?

The only thing this is changing is when the retry happens. Now it happens on a sync-period instead of immediately at time of failure.

@deads2k deads2k force-pushed the parallel-image-import branch from 517ad26 to e9232d3 Compare January 4, 2016 20:49

deads2k commented Jan 4, 2016

This pull leaves us in this state until clayton's larger refactor:

  1. Missing resources are skipped.
  2. Resource conflicts are retried immediately.
  3. Retry-able errors are retried on the re-list interval.
  4. Non-retryable errors are never retried.

@liggitt squashed down. Any other comments?

@smarterclayton

My PR should resolve the issues you mentioned cleanly - the new design is
much easier to reason about as a client.


liggitt commented Jan 5, 2016

LGTM, [test]

@liggitt liggitt added the lgtm label Jan 5, 2016

deads2k commented Jan 5, 2016

re[test]

deads2k commented Jan 5, 2016

flake in e2e, let's try [merge]

@deads2k deads2k force-pushed the parallel-image-import branch from e9232d3 to a33813f Compare January 5, 2016 18:01
@openshift-bot

Evaluated for origin test up to a33813f

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8186/)

deads2k commented Jan 6, 2016

two UI failures here: @spadgett https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4508/consoleText

1)  unauthenticated user should be able to log in
  - Error: Angular could not be found on the page https://localhost:9000/ : angular never provided resumeBootstrap
      at /data/src/github.com/openshift/origin/assets/node_modules/protractor/lib/protractor.js:478:17
    at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/goog/base.js:1582:15
    at [object Object].webdriver.promise.ControlFlow.runInNewFrame_ (/data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1654:20)

2)  authenticated e2e-user with test project should be able to list the test project
  - UnknownError: javascript error: document unloaded while waiting for result
  (Session info: chrome=47.0.2526.106)
  (Driver info: chromedriver=2.14.313457 (3d645c400edf2e2c500566c9aa096063e707c9cf),platform=Linux 3.10.0-229.7.2.el7.x86_64 x86_64) (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 117 milliseconds

deads2k commented Jan 6, 2016

etcd flake for the other re[merge]

@stevekuznetsov

@deads2k that flake is #6533

deads2k commented Jan 7, 2016

UI https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4521/consoleText

1)  authenticated e2e-user new project when creating a new project should browse deployments
  - Error: Wait timed out after 3131ms
      at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1425:29
    at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/goog/base.js:1582:15

deads2k commented Jan 7, 2016

re[merge]

deads2k commented Jan 7, 2016

failed to propose

re[merge]

deads2k commented Jan 7, 2016

etcd flake again re[merge]

@openshift-bot

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4543/) (Image: devenv-rhel7_3094)

deads2k commented Jan 7, 2016

etcd flake re[merge]

@openshift-bot

Evaluated for origin merge up to a33813f

openshift-bot pushed a commit that referenced this pull request Jan 7, 2016
@openshift-bot openshift-bot merged commit 315002e into openshift:master Jan 7, 2016
@deads2k deads2k deleted the parallel-image-import branch February 26, 2016 18:42
8 participants