Skip to content

Conversation

@mheon
Copy link
Member

@mheon mheon commented May 14, 2019

StartAndAttach() runs start() in a goroutine, which can allow it to fire after the caller returns - and thus, after the defer to unlock the container lock has fired.

The start() call must occur while the container is locked, or else state inconsistencies may occur.

Fixes #3114

StartAndAttach() runs start() in a goroutine, which can allow it
to fire after the caller returns - and thus, after the defer to
unlock the container lock has fired.

The start() call _must_ occur while the container is locked, or
else state inconsistencies may occur.

Fixes containers#3114

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M labels May 14, 2019
@giuseppe
Copy link
Member

thanks, tested locally and fixes the issue we've seen.

LGTM

@cevich
Copy link
Member

cevich commented May 14, 2019

running this through #3115

@cevich
Copy link
Member

cevich commented May 14, 2019

initial poking looks good. Next I'll find my bigger stick...

@cevich cevich closed this May 14, 2019
@cevich
Copy link
Member

cevich commented May 14, 2019

woops

@cevich
Copy link
Member

cevich commented May 14, 2019

LGTM (do not fully understand actual code changes though)

@mheon
Copy link
Member Author

mheon commented May 14, 2019

While I'm confident this did fix the issue in question, it didn't fix our CI timeouts

@TomSweeneyRedHat
Copy link
Member

LGTM, assuming happy tests

Makefile Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not look like this should be merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vrothberg this option is needed with -nodes 3 to make ginkgo log output from each node. Otherwise, if any node hangs for any reason, it's incredibly difficult to debug which test caused the problem.

Copy link
Member

@vrothberg vrothberg May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, that sounds important. The commit message mentions not to merge it, so we might need to change the message to avoid confusion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially we just wanted it for debug, so I added the quick "HACK" message. Changed now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we use a Mutex instead of a WaitGroup? The WaitGroup implies that we are waiting for multiple tasks to finish which does not seem to be the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On looking into this more: I'd prefer to stick with a WG - it's a lot more clear what's going on than with a mutex (Why am I unlocking this mutex inside of attachContainerSocket()? What locked it in the first place?). The control flow here is already complicated enough (took four hours to zero in on this bug) and I'd prefer not to make it any more so.

@cevich
Copy link
Member

cevich commented May 15, 2019

@mheon This will cause the ginkgo -debug logs to be collected inline with the CI task:

diff --git a/.cirrus.yml b/.cirrus.yml
index 51488996..95c33219 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -292,12 +292,14 @@ testing_task:
     setup_environment_script: '$SCRIPT_BASE/setup_environment.sh |& ${TIMESTAMP}'
     unit_test_script: '$SCRIPT_BASE/unit_test.sh |& ${TIMESTAMP}'
     integration_test_script: '$SCRIPT_BASE/integration_test.sh |& ${TIMESTAMP}'
+    ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
     audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log'
     journalctl_b_script: 'journalctl -b'
 
     on_failure:
         failed_master_script: '$CIRRUS_WORKING_DIR/$SCRIPT_BASE/notice_master_failure.sh'
         # Job has already failed, don't fail again and miss collecting data
+        failed_ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
         failed_audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log || echo "Uh oh, cat audit.log failed"'
         failed_journalctl_b_script: 'journalctl -b || echo "Uh oh, journalctl -b failed"'
 

@mheon mheon force-pushed the fix_start_race branch from 72105b6 to 4087847 Compare May 15, 2019 14:26
@mheon
Copy link
Member Author

mheon commented May 15, 2019

Updated to include @cevich changes - Ginkgo is now in debug mode, with cirrus collecting debug logs, to aid in chasing flakes.

I'll hit the mutex comment from @vrothberg later today

.cirrus.yml Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's a bad path: should be $CIRRUS_WORKING_DIR/test/e2e/ginkgo-node-*.log

.cirrus.yml Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too. Sorry 😞

Makefile Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we might want to consider adding -debug here as well.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon mheon force-pushed the fix_start_race branch from 4087847 to d1f8223 Compare May 15, 2019 16:07
Need this to re-trigger CI

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon mheon mentioned this pull request May 15, 2019
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon mheon force-pushed the fix_start_race branch from 2a8c368 to 5b3f3c4 Compare May 15, 2019 20:47
@mheon
Copy link
Member Author

mheon commented May 15, 2019

@baude The cp tests aren't failing - I'm seeing a lot more failures from the rootless tests everywhere.

I'd say we merge this and #3091 and that should put CI right.

@baude
Copy link
Member

baude commented May 15, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2019
@openshift-merge-robot openshift-merge-robot merged commit 95d90c1 into containers:master May 15, 2019
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ginkgo timed out waiting for all parallel nodes to report back!

8 participants