Ensure that start() in StartAndAttach() is locked #3127

mheon · 2019-05-14T18:56:44Z

StartAndAttach() runs start() in a goroutine, which can allow it to fire after the caller returns, and thus after the deferred unlock of the container lock has run.

The start() call must occur while the container is locked, or else state inconsistencies may occur.

Fixes #3114
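The ordering problem described above can be illustrated with a minimal sketch. This is not the actual patch: the `container` type, its `held` and `startLocked` fields, and the lowercase `startAndAttach` are hypothetical stand-ins, assuming only that start() runs in a goroutine and that the lock is released by a defer. The sketch makes the function wait until start() has run before it returns.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in for a libpod container; only the
// lock and the fields needed for this illustration are modelled.
type container struct {
	lock        sync.Mutex
	held        bool   // true while startAndAttach holds the lock
	state       string // set by start()
	startLocked bool   // whether start() saw the lock as held
}

// start records whether it ran while the container lock was held.
func (c *container) start() {
	c.startLocked = c.held
	c.state = "running"
}

// startAndAttach runs start() in a goroutine, but does not return (and so
// does not run its deferred unlock) until start() has completed.
func (c *container) startAndAttach() {
	c.lock.Lock()
	c.held = true
	defer func() {
		c.held = false
		c.lock.Unlock()
	}()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		c.start()
		wg.Done() // signal as soon as start() has run
		// A real attach would keep streaming here after the signal.
	}()

	// Without this Wait, the function could return and unlock first.
	wg.Wait()
}

func main() {
	c := &container{}
	c.startAndAttach()
	fmt.Println(c.state, c.startLocked) // prints "running true"
}
```

Dropping the `wg.Wait()` call would let the deferred unlock run before the goroutine is scheduled, which is the window this PR describes.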

openshift-ci-robot · 2019-05-14T18:56:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

Needs approval from an approver in each of these files:

~~OWNERS~~ [mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

giuseppe · 2019-05-14T19:14:13Z

thanks, tested locally and fixes the issue we've seen.

LGTM

cevich · 2019-05-14T19:17:16Z

running this through #3115

cevich · 2019-05-14T19:25:54Z

initial poking looks good. Next I'll find my bigger stick...

cevich · 2019-05-14T19:32:48Z

woops

cevich · 2019-05-14T19:37:43Z

LGTM (do not fully understand actual code changes though)

mheon · 2019-05-14T19:40:10Z

While I'm confident this did fix the issue in question, it didn't fix our CI timeouts.

TomSweeneyRedHat · 2019-05-14T20:22:50Z

LGTM, assuming happy tests

vrothberg · 2019-05-15T06:55:53Z

Makefile

Does not look like this should be merged.

@vrothberg this option is needed with -nodes 3 to make ginkgo log output from each node. Otherwise, if any node hangs for any reason, it's incredibly difficult to debug which test caused the problem.

Alright, that sounds important. The commit message mentions not to merge it, so we might need to change the message to avoid confusion.

Initially we just wanted it for debug, so I added the quick "HACK" message. Changed now.

vrothberg · 2019-05-15T06:59:33Z

libpod/container_attach_linux.go

Nit: Can we use a Mutex instead of a WaitGroup? The WaitGroup implies that we are waiting for multiple tasks to finish which does not seem to be the case.

On looking into this more: I'd prefer to stick with a WG - it's a lot more clear what's going on than with a mutex (Why am I unlocking this mutex inside of attachContainerSocket()? What locked it in the first place?). The control flow here is already complicated enough (took four hours to zero in on this bug) and I'd prefer not to make it any more so.
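For readers weighing the two options in this thread, here is a hedged side-by-side sketch of both signalling styles. The function names `waitWithWaitGroup` and `waitWithMutex` are invented for illustration and do not appear in libpod; both functions block until the goroutine's work has run.

```go
package main

import (
	"fmt"
	"sync"
)

// waitWithWaitGroup blocks until work has run in its goroutine.
// Add(1) and Done() read as "one task outstanding" and "task finished".
func waitWithWaitGroup(work func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		work()
		wg.Done()
	}()
	wg.Wait()
}

// waitWithMutex achieves the same ordering with a mutex used as a signal.
// The lock is taken in one place and released in another goroutine, which
// is legal for sync.Mutex but harder to follow when reading the code.
func waitWithMutex(work func()) {
	var mu sync.Mutex
	mu.Lock() // taken by the caller
	go func() {
		work()
		mu.Unlock() // released by the goroutine
	}()
	mu.Lock() // blocks until the goroutine has unlocked
	mu.Unlock()
}

func main() {
	n := 0
	waitWithWaitGroup(func() { n++ })
	waitWithMutex(func() { n++ })
	fmt.Println(n) // prints 2
}
```

Both variants give the same ordering guarantee. The difference is readability: in the mutex version the unlock sits far from the lock that it pairs with, which is the concern raised in the reply above.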

cevich · 2019-05-15T14:21:53Z

@mheon This will cause the ginkgo -debug logs to be collected inline with the CI task:

diff --git a/.cirrus.yml b/.cirrus.yml
index 51488996..95c33219 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -292,12 +292,14 @@ testing_task:
     setup_environment_script: '$SCRIPT_BASE/setup_environment.sh |& ${TIMESTAMP}'
     unit_test_script: '$SCRIPT_BASE/unit_test.sh |& ${TIMESTAMP}'
     integration_test_script: '$SCRIPT_BASE/integration_test.sh |& ${TIMESTAMP}'
+    ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
     audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log'
     journalctl_b_script: 'journalctl -b'
 
     on_failure:
         failed_master_script: '$CIRRUS_WORKING_DIR/$SCRIPT_BASE/notice_master_failure.sh'
         # Job has already failed, don't fail again and miss collecting data
+        failed_ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
         failed_audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log || echo "Uh oh, cat audit.log failed"'
         failed_journalctl_b_script: 'journalctl -b || echo "Uh oh, journalctl -b failed"'

mheon · 2019-05-15T14:27:10Z

Updated to include @cevich changes - Ginkgo is now in debug mode, with cirrus collecting debug logs, to aid in chasing flakes.

I'll hit the mutex comment from @vrothberg later today

cevich · 2019-05-15T15:19:43Z

.cirrus.yml

Oops, that's a bad path: should be $CIRRUS_WORKING_DIR/test/e2e/ginkgo-node-*.log

cevich · 2019-05-15T15:20:01Z

.cirrus.yml

Here too. Sorry 😞

cevich · 2019-05-15T15:30:12Z

Makefile

In fact, we might want to consider adding -debug here as well.

mheon · 2019-05-15T21:57:52Z

@baude The cp tests aren't failing - I'm seeing a lot more failures from the rootless tests everywhere.

I'd say we merge this and #3091 and that should put CI right.

baude · 2019-05-15T23:07:47Z

/lgtm

openshift-ci-robot requested review from mrunalp and vrothberg May 14, 2019 18:56

openshift-ci-robot added the approved and size/M labels May 14, 2019

mheon mentioned this pull request May 14, 2019

oci: fix race when the container exits before start returns #3125

Closed

cevich closed this May 14, 2019

cevich reopened this May 14, 2019

cevich mentioned this pull request May 14, 2019

Ginkgo timed out waiting for all parallel nodes to report back! #3114

Closed

mheon added the Release Notes 1.3.1 label May 14, 2019

vrothberg reviewed May 15, 2019

mheon force-pushed the fix_start_race branch from 72105b6 to 4087847 Compare May 15, 2019 14:26

cevich mentioned this pull request May 15, 2019

WIP - DO NOT MERGE - TESTING PR #3130

Closed

Add debug mode to Ginkgo, collect debug logs in Cirrus

d1f8223

Signed-off-by: Matthew Heon <matthew.heon@pm.me>

mheon force-pushed the fix_start_race branch from 4087847 to d1f8223 Compare May 15, 2019 16:07

Minor capitalization fix in Readme

29e4271

Need this to re-trigger CI Signed-off-by: Matthew Heon <matthew.heon@pm.me>

mheon mentioned this pull request May 15, 2019

Fix CI #3131

Closed

Kill os.Exit() in tests, replace with asserts

5b3f3c4

Signed-off-by: Matthew Heon <matthew.heon@pm.me>

mheon force-pushed the fix_start_race branch from 2a8c368 to 5b3f3c4 Compare May 15, 2019 20:47

openshift-ci-robot assigned baude May 15, 2019

openshift-ci-robot added the lgtm label May 15, 2019

openshift-merge-robot merged commit 95d90c1 into containers:master May 15, 2019

rh-atomic-bot mentioned this pull request May 15, 2019

Cirrus: Automate releasing of tested binaries #3106

Merged

github-actions bot added the locked - please file new issue/PR label Sep 26, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023

8 participants
