@runcom
Member

@runcom runcom commented Feb 19, 2019

Signed-off-by: Antonio Murdaca [email protected]

- What I did

The indexer in the informer uses a store that manages objects under a read-write lock. The flakes come from the fact that, if we don't sync the store before adding objects, listing objects can race against an add. I verified this by adding a bunch of debug logs during the investigation: lister.List can return an empty result even when we added objects to the indexer before starting the informers themselves.

Fixes #417
Fixes #444
Fixes #451
Fixes #449

I can no longer reproduce the flakes with this patch (the tests have been running in a loop for 3 hours now). Without the patch, I can consistently reproduce them by running go test -race.

- How to verify it

true && while [ $? -eq 0 ]; do GOCACHE=off go test -race -v ./cmd/... ./pkg/... ./lib/...; done

Without this patch, the command above stops as soon as a flake makes a test fail. With the patch, the loop keeps running indefinitely.

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2019
@runcom
Member Author

runcom commented Feb 19, 2019

CI glitch...

could not wait for build: the build machine-config-server failed after 6m6s with reason PullBuilderImageFailed: Failed pulling builder image.

Pulling image docker-registry.default.svc:5000/ci-op-v6y05...8fe8db228e015b2bd5ec62da99e0505f8fd6dbc7ce842e2153680dc ...

Pulling image registry.svc.ci.openshift.org/openshift/release:golang-1.10 ...
error: build error: failed to pull image: Get https://regi...t canceled (Client.Timeout exceeded while awaiting headers)

/retest

@runcom
Member Author

runcom commented Feb 19, 2019

still glitch #457 (comment)

@runcom
Member Author

runcom commented Feb 19, 2019

/retest

@runcom
Member Author

runcom commented Feb 19, 2019

CI errors aside (they are not unit test failures), this fixes the flakes: my tests have not failed once in more than 5 hours.

@runcom
Member Author

runcom commented Feb 19, 2019

the glitch is a CI issue

@runcom
Member Author

runcom commented Feb 19, 2019

/retest

@runcom
Member Author

runcom commented Feb 19, 2019

aws limit

/retest

@cgwalters
Member

LGTM, assigning for one more final review/merge:

/assign kikisdeliveryservice

@rphillips
Contributor

lgtm (deferring to @kikisdeliveryservice for review). Nice find

@LorbusChris
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LorbusChris, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LorbusChris
Contributor

/test e2e-aws

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@kikisdeliveryservice
Contributor

HAProxy and other known flakes.

/test e2e-aws

@abhinavdahiya
Contributor

known flake.

fail [github.com/openshift/origin/test/extended/bootstrap_user/bootstrap_user_login.go:52]: Expected error:
    <*util.ExitError | 0xc422561740>: {
        Cmd: "oc login --config=/tmp/configfile916071508 --namespace=e2e-test-bootstrap-login-cpctm -u kubeadmin -p 5LpcuRoMeuNntNkAusTRa269NuAbQoU9ptn4aloHbp4",
        StdErr: "Error from server (InternalError): Internal error occurred: unexpected response: 400",
        ExitError: {
            ProcessState: {
                pid: 11753,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 268025},
                    Stime: {Sec: 0, Usec: 66749},
                    Maxrss: 94440,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15028,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 1198,
                    Nivcsw: 27,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1
not to have occurred

openshift/origin#22088 is moving it to flake

@kikisdeliveryservice
Contributor

Thanks for the update on that test, @abhinavdahiya !

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Member Author

runcom commented Feb 20, 2019

/retest

@runcom
Member Author

runcom commented Feb 20, 2019

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Member Author

runcom commented Feb 20, 2019

CI errors tracked in slack

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Member Author

runcom commented Feb 20, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 53923bc into openshift:master Feb 20, 2019
@runcom runcom deleted the flakes-fixes branch February 20, 2019 16:14
runcom added a commit to runcom/machine-config-operator that referenced this pull request Mar 17, 2019
This patch does various things, all related:

- start the informers for controllers before running them to follow what
https://github.com/kubernetes/sample-controller/blob/master/main.go#L64-L71
does and to avoid races already spotted in unit tests
(openshift#457)
- move the clients builder code from cmd/common to a new package just for that
under the new internal/ folder so nobody but us can use that
- move common controllers code under pkg/controller/common to be reused
- create an e2e-only clientset, so we don't have to type the client version
every time (e.g. client.CoreV1()..). This is needed because, if the API
changes in the future, we don't play grep for a week replacing old APIs...

Signed-off-by: Antonio Murdaca <[email protected]>

Development

Successfully merging this pull request may close these issues.

TestKubeletConfigUpdates flake; TestContainerRuntimeConfigCreate flake; Flake TestShouldMakeProgress; TestKubeletConfigCreate flake?