[Feature:Builds][Conformance] s2i build with a root user image should create a root build and pass with a privileged SCC #17883

Closed
sosiouxme opened this issue Dec 19, 2017 · 13 comments
Labels: kind/test-flake, lifecycle/rotten, priority/P1, sig/pod

Comments

@sosiouxme
Member

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17773/test_pull_request_origin_extended_conformance_gce/13246/

/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:75
Expected error:
    <*errors.errorString | 0xc42115c500>: {
        s: "The build \"nodejspass-1\" status is \"Failed\"",
    }
    The build "nodejspass-1" status is "Failed"
not to have occurred
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:91
sosiouxme added the kind/test-flake label Dec 19, 2017
@bparees
Contributor

bparees commented Dec 22, 2017

The failure is in retrieving the pod logs:

Dec 19 12:36:26.607: INFO: Running 'oc logs --config=/tmp/extended-test-s2i-build-root-xqdq2-dnlfm-user.kubeconfig --namespace=extended-test-s2i-build-root-xqdq2-dnlfm pod/nodejspass-1-build'
Dec 19 12:36:26.980: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc logs --config=/tmp/extended-test-s2i-build-root-xqdq2-dnlfm-user.kubeconfig --namespace=extended-test-s2i-build-root-xqdq2-dnlfm pod/nodejspass-1-build] []   Error from server (NotFound): the server could not find the requested resource ( pods/log nodejspass-1-build)
 Error from server (NotFound): the server could not find the requested resource ( pods/log nodejspass-1-build)
 [] <nil> 0xc421c9ee40 exit status 1 <nil> <nil> true [0xc4218881f8 0xc421888270 0xc421888270] [0xc4218881f8 0xc421888270] [0xc421888208 0xc421888260] [0x9895f0 0x9896f0] 0xc421aba540 <nil>}:
Error from server (NotFound): the server could not find the requested resource ( pods/log nodejspass-1-build)
Dec 19 12:36:26.980: INFO: Error retrieving logs for pod "nodejspass-1-build": exit status 1

but describing the pod worked:

Dec 19 12:36:26.298: INFO: Running 'oc describe --config=/tmp/extended-test-s2i-build-root-xqdq2-dnlfm-user.kubeconfig --namespace=extended-test-s2i-build-root-xqdq2-dnlfm pod/nodejspass-1-build'
Dec 19 12:36:26.607: INFO: Describing pod "nodejspass-1-build"
Name:           nodejspass-1-build
Namespace:      extended-test-s2i-build-root-xqdq2-dnlfm

@sjenning
Contributor

@frobware PTAL

@frobware
Contributor

frobware commented Feb 1, 2018

I have been trying to reproduce this error for quite a few days but without success.

I did run into the following yesterday:

Expected error:
    <*errors.errorString | 0xc420299f70>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

but that took some effort:

$ ./ginkgo -p -nodes=4 -focus=".*s2i build with a root user image" -untilItFails=true extended.test
Tests failed on attempt #214
Ginkgo ran 1 suite in 9h38m12.151562892s

@frobware
Contributor

frobware commented Feb 5, 2018

This is the first time I have seen this error. Running via:

./ginkgo -p -nodes=4 -focus=".*Builds" -untilItFails=true extended.test
• Failure in Spec Setup (JustBeforeEach) [693.919 seconds]
[Feature:Builds][Conformance] s2i build with a root user image
/home/aim/go-projects/origin/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:16
  
  /home/aim/go-projects/origin/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:23
    should create a root build and fail without a privileged SCC [Suite:openshift/conformance/parallel] [JustBeforeEach]
    /home/aim/go-projects/origin/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:44

    Expected error:
        <*errors.errorString | 0xc4201490f0>: {
            s: "The build \"nodejsroot-1\" status is \"Error\"",
        }
        The build "nodejsroot-1" status is "Error"
    not to have occurred

    /home/aim/go-projects/origin/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:34

Feb  2 08:12:20.589: INFO: Running AfterSuite actions on all node
Feb  2 08:12:20.590: INFO: Running AfterSuite actions on node 1

I have been trying to reproduce this bug for a while now, but maybe my
invocations have been too short. This run, the only time I have seen
the failure, failed after 25 hours.

[Fail] [Feature:Builds][Conformance] s2i build with a root user image  [JustBeforeEach] should create a root build and fail without a privileged SCC [Suite:openshift/conformance/parallel] 
/home/aim/go-projects/origin/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/s2i_root.go:34

Ran 2 of 440 Specs in 694.479 seconds
FAIL! -- 0 Passed | 2 Failed | 0 Pending | 438 Skipped 
Tests failed on attempt #545

Ginkgo ran 1 suite in 25h20m35.279131559s
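
For a rough sense of how rare the flake is, the two -untilItFails runs reported so far (214 attempts in 9h38m12s, 545 attempts in 25h20m35s) can be turned into an average time per attempt. This is only a minimal sketch, assuming bash; `secs_per_attempt` is a hypothetical helper, not part of ginkgo, and it drops fractional seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ginkgo): average wall-clock seconds
# per attempt, given the total run time (hours, minutes, whole seconds)
# and the number of attempts. Integer division truncates the result.
secs_per_attempt() {
  local h=$1 m=$2 s=$3 attempts=$4
  echo $(( (h * 3600 + m * 60 + s) / attempts ))
}

secs_per_attempt 9 38 12 214    # first run: prints 162
secs_per_attempt 25 20 35 545   # second run: prints 167
```

Both runs come out at a little under three minutes per attempt, so the observed failure rate is on the order of one failure per several hundred attempts.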

@sjenning
Contributor

sjenning commented Feb 5, 2018

@frobware this looks like yet another "quick container GC" case, where the container and its logs are GCed before the test can read them
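
If this is the quick-GC pattern, what shows up in the test output earlier in this issue is a NotFound on pods/log while the pod itself can still be described. A minimal sketch of how a triage script might flag that signature, assuming bash and grep; `is_pod_log_notfound` is a hypothetical helper, not part of oc or the extended test suite:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the captured output contains the
# NotFound error for the pods/log subresource reported in this issue.
is_pod_log_notfound() {
  # $1: captured stdout/stderr of an `oc logs` invocation
  printf '%s\n' "$1" \
    | grep -F 'Error from server (NotFound)' \
    | grep -qF 'pods/log'
}

# Example input copied from the failure reported above.
out='Error from server (NotFound): the server could not find the requested resource ( pods/log nodejspass-1-build)'
if is_pod_log_notfound "$out"; then
  echo "pod logs missing (possible early container GC)"
fi
```

This only classifies the error text. It does not confirm that GC was the cause, and it does not recover the logs.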

@frobware
Contributor

frobware commented Feb 6, 2018

I switched my testing to use overlay2 (similar to CI) and was able to reproduce it a little more quickly. The quickest has been 36 minutes in 4 iterations; the other occurrence happened after 50 minutes.

@frobware
Contributor

I'm still investigating this, but it is very hard to reproduce. This week, on a whim, I switched from docker 1.12.6 to docker 1.13 and got 1 error in 25 hours. Last night/today I have had one occurrence in 13 hours. Looking through the logs, I believe this "context canceled" error happens each time:

Feb 15 08:26:13 rhel-74-vm-1 etcd[1287]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: str
Feb 15 08:26:13 rhel-74-vm-1 etcd[1287]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")

which appears to be similar to grpc/grpc-go#1134
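
To check whether the etcd message really lines up with each failure, the matching log lines can be counted per run. A minimal sketch, assuming bash and grep; `count_context_canceled` is a hypothetical helper, and where the etcd log lines come from (for example the system journal) is an assumption about the test host:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines on stdin that report a canceled
# gRPC watch stream, in the form quoted in the comment above.
count_context_canceled() {
  # grep -c prints the count; "|| true" keeps a zero count from being
  # treated as a failure by the caller.
  grep -c 'failed to receive watch request from gRPC stream.*context canceled' || true
}

# Example input: the complete log line quoted above.
count_context_canceled <<'EOF'
Feb 15 08:26:13 rhel-74-vm-1 etcd[1287]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
EOF
```

A non-zero count only shows that the message appeared. Whether it is related to the test failure would still need to be confirmed by comparing timestamps.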

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label Jun 5, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 5, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

8 participants