Allow parallel image stream importing #6407
Conversation
time.Sleep(100 * time.Millisecond)
}
glog.V(5).Infof("requeuing %s to the worklist", workingSetKey(staleImageStream))
c.work <- staleImageStream
In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need a fallback policy.
> In theory, this can wedge if you've filled up the channel. I'm trying to look up fairness guarantees to see if I need a fallback policy.

I think we should tolerate this. You can end up in pathological states where only a single thread can make progress at a time, defeating the intent of all this code, but they should all wait nicely in line while log-jamming the channel.

[test]
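A minimal sketch of the fallback policy being weighed here, assuming the same buffered c.work channel and glog logging as the snippet above; the select/default fallback is only an illustration of one option, not what this PR does:

	select {
	case c.work <- staleImageStream:
		// requeued; a worker will pick it up again
	default:
		// channel is full and every sender would block here (the "wedge");
		// drop the requeue and rely on the periodic resync to retry it
		glog.V(5).Infof("work channel full, deferring %s to the next sync", workingSetKey(staleImageStream))
	}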
-	Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(osclient, osclient, 10, 2*time.Minute)
cast osclient to the interfaces the function is expecting for sanity
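A hedged sketch of what that cast would look like, mirroring the updated constructor call that appears later in this thread (the ImageStreamsNamespacer and ImageStreamMappingsNamespacer interface names are the ones discussed there, not something confirmed by this hunk):

	controller := imagecontroller.NewImportController(
		client.ImageStreamsNamespacer(osclient),
		client.ImageStreamMappingsNamespacer(osclient),
		10, 2*time.Minute,
	)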
I'll review it later today when I'm back online.
I wouldn't worry about that.
re[test]
err := kclient.RetryOnConflict(kclient.DefaultBackoff, func() error {
	liveImageStream, err := c.streams.ImageStreams(staleImageStream.Namespace).Get(staleImageStream.Name)
	if err != nil {
if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?
> if a NotFound is encountered here, should we still be returning an error (which I think causes a retry)?

RetryOnConflict retries on things other than conflicts? If so, I hereby declare that my weekend starts now.
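For context, a minimal sketch of the contract being debated, using the same kclient.RetryOnConflict call quoted above: the closure is re-run only when it returns a Conflict error, so a NotFound (or any other error) ends the loop immediately. The comments are illustrative, not part of the PR:

	err := kclient.RetryOnConflict(kclient.DefaultBackoff, func() error {
		liveImageStream, err := c.streams.ImageStreams(staleImageStream.Namespace).Get(staleImageStream.Name)
		if err != nil {
			// a NotFound returned here is NOT retried; only conflicts re-run the closure
			return err
		}
		// a Conflict returned by Next triggers another attempt against the live object
		return c.Next(liveImageStream)
	})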
Should the description for this PR mention that it fixes #6259?
Also fixes #6381
It improves those, but I think that @liggitt's future pull to reduce the timeout is more directly related.
// if we're already in the workingset, that means that some thread is already trying to do an import for this.
// This does NOT mean that we shouldn't attempt to do this work, only that we shouldn't attempt to do it now.
if c.isInWorkingSet(staleImageStream) {
isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time
I'd also like the removeFromWorkingSet deferred immediately, which probably means splitting the body of this case into a function
👍
> isInWorkingSet and addToWorkingSet should be a single call addToWorkingSet() (added bool), otherwise two threads can both work on the same one at the same time

how embarrassing.
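A hedged sketch of the combined call being suggested, assuming a mutex-guarded map inside the controller; the field names workingSet and workingSetLock, the handleImport split, and the api.ImageStream parameter type are assumptions, not this PR's code:

	// addToWorkingSet claims the stream's key and reports whether it was newly added,
	// so two workers can never both start an import for the same stream.
	func (c *ImportController) addToWorkingSet(stream *api.ImageStream) (added bool) {
		c.workingSetLock.Lock()
		defer c.workingSetLock.Unlock()
		key := workingSetKey(stream)
		if _, exists := c.workingSet[key]; exists {
			return false
		}
		c.workingSet[key] = struct{}{}
		return true
	}

	// splitting the body out lets removeFromWorkingSet be deferred immediately
	func (c *ImportController) handleImport(stream *api.ImageStream) {
		if !c.addToWorkingSet(stream) {
			return // another worker owns this stream right now; retry on a later sync
		}
		defer c.removeFromWorkingSet(stream)
		// ... perform the import ...
	}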
I was just recently (when fixing a FIFO-related issue in the build controller) thinking about moving our controllers to the patterns from upstream, wdyt?
force-pushed from b3eb2e7 to ab44513
	return c.Next(liveImageStream)
})

if err != nil {
do we really want to log NotFound errors as big unexpected errors?
> do we really want to log NotFound errors as big unexpected errors?

I'll explicitly deal with the common error case, but if the image stream disappears while we're working on it, you'll still get a big error. More specific handling requires more plumbing into almost-dead code.
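A small sketch of "explicitly deal with the common case", assuming kapierrors is the Kubernetes API errors package with IsNotFound; the surrounding names come from the quoted code:

	if err != nil && !kapierrors.IsNotFound(err) {
		// the stream genuinely failed to import; surface it loudly
		util.HandleError(err)
	}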
-	Client: osclient,
-	}
-	controller := factory.Create()
+	controller := imagecontroller.NewImportController(client.ImageStreamsNamespacer(osclient), client.ImageStreamMappingsNamespace(osclient), 10, 2*time.Minute)
ImageStreamMappingsNamespacer
@liggitt comments addressed.
re[test]
re[test]
re[test]
flake here:
re[test]
flake here:
re[test]
flake in #6065
flake on
	util.HandleError(err)
	return retries.Count < 5
},
kutil.NewTokenBucketRateLimiter(1, 10),
should probably add back rate limiting to the individual workers
> should probably add back rate limiting to the individual workers

The concern being that someone creates image streams over and over again, tricking us into pounding a docker registry looking for metadata?

I thought this limiter controlled the rate at which retries are done. We tight-loop on conflicts, but don't retry on any other conditions any more.
all failed imports retry on the sync period, which means if you accumulate a lot of failing image streams, you get a thundering herd every two minutes
The sync period may already jitter; I would rate limit the workers. Are you ok with the retry being sync-period driven instead of re-queue driven?
Note that I'm not against termination in a large number of deterministic stop conditions - 404, access denied, unrecognized non-connection errors. I'm just highlighting that certain errors beyond conflict should be retried because transient failures are inevitable. The new import endpoint will allow that distinction to be drawn for partial completion, but we still need to make an effort.
You said you're refactoring Next. I'd expect that you're handling those. If not, would you like me to do the refactor here? If we're doing a large import, it makes sense to have the dockerclient retry near the point of failure to avoid rework.

On the OpenShift resource mutation side, the update problem exists even without this pull, so you'll have to have patch conflicts, compatibility, and coverage evaluation to handle update conflicts at the point of failure anyway. Again, is that in your refactor, or do you want it for this one?

The only thing this is changing is when the retry happens. Now it happens on the sync period instead of immediately at the time of failure.
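A hedged sketch of "rate limit the individual workers", reusing the kutil.NewTokenBucketRateLimiter construction from the removed factory code quoted earlier; the worker-loop shape, the shared limiter, and the numWorkers field are assumptions:

	limiter := kutil.NewTokenBucketRateLimiter(1, 10) // roughly one import per second with a burst of 10
	for i := 0; i < c.numWorkers; i++ {
		go func() {
			for stream := range c.work {
				limiter.Accept() // blocks until a token is available, smoothing the sync-period herd
				if err := c.Next(stream); err != nil {
					util.HandleError(err)
				}
			}
		}()
	}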
force-pushed from 517ad26 to e9232d3
This pull leaves us in this state until clayton's larger refactor:
@liggitt squashed down. Any other comments?
My PR should resolve the issues you mentioned cleanly - the new design is
LGTM, [test]
re[test]
flake in e2e, let's try [merge]
force-pushed from e9232d3 to a33813f
Evaluated for origin test up to a33813f
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8186/)
two UI failures here: @spadgett https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4508/consoleText
etcd flake for the other re[merge]
UI https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4521/consoleText
re[merge]
failed to propose re[merge]
etcd flake again re[merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4543/) (Image: devenv-rhel7_3094)
etcd flake re[merge]
Evaluated for origin merge up to a33813f
Allows multiple import image jobs to run at the same time and operates against live data to decide if the import really needs to happen.
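A hedged sketch of that overall shape, under the same assumptions as the earlier snippets (a buffered c.work channel, a handleImport that re-reads the live object before importing, and a numWorkers count; none of these names are confirmed by the diff here), complementing the rate-limited worker sketch above:

	func (c *ImportController) RunUntil(stopCh <-chan struct{}) {
		for i := 0; i < c.numWorkers; i++ {
			go func() {
				for {
					select {
					case stream := <-c.work:
						c.handleImport(stream) // checks live data and the working set before importing
					case <-stopCh:
						return
					}
				}
			}()
		}
	}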
@stevekuznetsov @pweil- ptal
@bparees fyi