Ensure that start() in StartAndAttach() is locked #3127
StartAndAttach() runs start() in a goroutine, which can allow it to fire after the caller returns - and thus, after the defer to unlock the container lock has fired. The start() call _must_ occur while the container is locked, or else state inconsistencies may occur.
Fixes containers#3114
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
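The race described above can be illustrated with a minimal sketch. This is not the libpod code: the `container` type, its `held` flag, and the `startedUnderLock` field are hypothetical stand-ins invented for illustration. It assumes the fix takes the shape discussed in this PR, where the caller waits on a `sync.WaitGroup` until start() has run before it returns and the deferred unlock fires.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// container is a hypothetical stand-in for a libpod container.
// Only the lock and the bits needed to observe the race are modelled.
type container struct {
	mu               sync.Mutex
	held             atomic.Bool // true while mu is held (illustration only)
	startedUnderLock bool        // records whether start() saw the lock held
}

func (c *container) lock()   { c.mu.Lock(); c.held.Store(true) }
func (c *container) unlock() { c.held.Store(false); c.mu.Unlock() }

// start pretends to start the container and records whether the
// container lock was held at the moment it ran.
func (c *container) start() {
	c.startedUnderLock = c.held.Load()
}

// startAndAttach runs start() in a goroutine, but does not return
// (and therefore does not unlock) until start() has completed.
func (c *container) startAndAttach() <-chan error {
	c.lock()
	defer c.unlock()

	attachDone := make(chan error, 1)

	var started sync.WaitGroup
	started.Add(1)
	go func() {
		// Attach setup would happen here, before the container starts.
		c.start()
		started.Done()
		// Attach streaming would continue here after the caller returns.
		attachDone <- nil
	}()

	// Without this Wait, the deferred unlock could fire before start().
	started.Wait()
	return attachDone
}

func main() {
	c := &container{}
	if err := <-c.startAndAttach(); err != nil {
		fmt.Println("attach failed:", err)
		return
	}
	fmt.Println("start ran under lock:", c.startedUnderLock) // prints "start ran under lock: true"
}
```

Removing the `started.Wait()` call in this sketch reintroduces the window in which start() may observe an unlocked container.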
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mheon. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
thanks, tested locally and fixes the issue we've seen. LGTM |
|
running this through #3115 |
|
initial poking looks good. Next I'll find my bigger stick... |
|
woops |
|
LGTM (do not fully understand actual code changes though) |
|
While I'm confident this did fix the issue in question, it didn't fix our CI timeouts |
|
LGTM, assuming happy tests |
Makefile
Outdated
Does not look like this should be merged.
@vrothberg this option is needed with -nodes 3 to make ginkgo log output from each node. Otherwise, if any node hangs for any reason, it's incredibly difficult to debug which test caused the problem.
Alright, that sounds important. The commit message mentions not to merge it, so we might need to change the message to avoid confusion.
Initially we just wanted it for debug, so I added the quick "HACK" message. Changed now.
libpod/container_attach_linux.go
Outdated
Nit: Can we use a Mutex instead of a WaitGroup? The WaitGroup implies that we are waiting for multiple tasks to finish which does not seem to be the case.
On looking into this more: I'd prefer to stick with a WG - it's a lot more clear what's going on than with a mutex (Why am I unlocking this mutex inside of attachContainerSocket()? What locked it in the first place?). The control flow here is already complicated enough (took four hours to zero in on this bug) and I'd prefer not to make it any more so.
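The trade-off in this exchange can be sketched side by side. The two helpers below, `signalWithWaitGroup` and `signalWithMutex`, are hypothetical names for illustration and do not appear in libpod. Both block the caller until a single background task finishes. The mutex variant works because Go permits one goroutine to unlock a `sync.Mutex` that another goroutine locked, but the lock and unlock end up in different places, which is the readability concern raised above.

```go
package main

import (
	"fmt"
	"sync"
)

// signalWithWaitGroup blocks until work has run in a goroutine.
// Add, Done, and Wait make the hand-off explicit.
func signalWithWaitGroup(work func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		work()
	}()
	wg.Wait()
}

// signalWithMutex does the same using a mutex as a one-shot latch.
// The worker unlocks a mutex it never locked, which is legal in Go
// but leaves the reader asking who locked it and why.
func signalWithMutex(work func()) {
	var mu sync.Mutex
	mu.Lock() // "arm" the latch
	go func() {
		defer mu.Unlock() // release the caller
		work()
	}()
	mu.Lock() // blocks until the worker unlocks
	mu.Unlock()
}

func main() {
	var order []string
	signalWithWaitGroup(func() { order = append(order, "waitgroup") })
	signalWithMutex(func() { order = append(order, "mutex") })
	fmt.Println(order) // prints "[waitgroup mutex]"
}
```

Both variants provide the same ordering guarantee here, so the choice comes down to which one makes the control flow easier to follow.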
|
@mheon This will cause the ginkgo
diff --git a/.cirrus.yml b/.cirrus.yml
index 51488996..95c33219 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -292,12 +292,14 @@ testing_task:
setup_environment_script: '$SCRIPT_BASE/setup_environment.sh |& ${TIMESTAMP}'
unit_test_script: '$SCRIPT_BASE/unit_test.sh |& ${TIMESTAMP}'
integration_test_script: '$SCRIPT_BASE/integration_test.sh |& ${TIMESTAMP}'
+ ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log'
journalctl_b_script: 'journalctl -b'
on_failure:
failed_master_script: '$CIRRUS_WORKING_DIR/$SCRIPT_BASE/notice_master_failure.sh'
# Job has already failed, don't fail again and miss collecting data
+ failed_ginkgo_node_logs_script: 'cat $SCRIPT_BASE/test/e2e/ginkgo-node-*.log || echo "Ginkgo node logs not found"'
failed_audit_log_script: 'cat /var/log/audit/audit.log || cat /var/log/kern.log || echo "Uh oh, cat audit.log failed"'
failed_journalctl_b_script: 'journalctl -b || echo "Uh oh, journalctl -b failed"'
|
|
Updated to include @cevich's changes: Ginkgo is now in debug mode, with Cirrus collecting the debug logs, to aid in chasing flakes. I'll hit the mutex comment from @vrothberg later today |
.cirrus.yml
Outdated
Oops, that's a bad path: should be $CIRRUS_WORKING_DIR/test/e2e/ginkgo-node-*.log
.cirrus.yml
Outdated
Here too. Sorry 😞
Makefile
Outdated
In fact, we might want to consider adding -debug here as well.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Need this to re-trigger CI Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
|
/lgtm |