
reboot module hangs with OpenVZ hosts #58868

Closed
mvorisek opened this issue Jul 9, 2019 · 19 comments · Fixed by #62680
Labels: affects_2.8, bug, has_pr, module, support:core, system

Comments

mvorisek commented Jul 9, 2019

SUMMARY

Currently there is an issue with the reboot module on OpenVZ hosts. The reboot command is executed correctly and all hosts reboot, but the Ansible reboot task hangs.

If exactly the same playbook is run on a KVM or native host, it finishes without any issues.

It should be hotfixed in the 2.7+ branches.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

reboot module

ANSIBLE VERSION

2.7

CONFIGURATION

Debian 9.9 / Ansible 2.8

OS / ENVIRONMENT

Debian 9.9

STEPS TO REPRODUCE

Trigger this reboot handler on OpenVZ hosts

- name: Reboot system
  reboot:
    reboot_timeout: 1200
    post_reboot_delay: 5
    connect_timeout: 2
  listen: handler_reboot
EXPECTED RESULTS

Reboot is executed and, once the host is back online, the task is marked as changed and execution of the playbook continues.

ACTUAL RESULTS
...
RUNNING HANDLER [shared_handlers : Reboot system] ********

(and no more output, the Ansible playbook hangs)


ansibot commented Jul 9, 2019

Files identified in the description:

If these files are inaccurate, please update the component name section of the description or use the !component bot command.


ansibot added the affects_2.8, bug, module, needs_triage, support:core, and system labels on Jul 9, 2019
samdoran commented Jul 9, 2019

I believe this is a duplicate of #57768. I am not familiar with OpenVZ, but it seems to run containers, not VMs. My best guess is that /proc/sys/kernel/random/boot_id does not change when the "VM" "reboots". Can you provide the output of ANSIBLE_DEBUG=1 ansible [testhost] -b -m reboot | grep 'reboot:'? This should show exactly what's going on.

needs_info

ansibot added the needs_info label and removed the needs_triage label on Jul 9, 2019
sivel changed the title from "Urgent: reboot module hangs with OpenVZ hosts" to "reboot module hangs with OpenVZ hosts" on Jul 9, 2019

mvorisek commented Jul 10, 2019

@samdoran Yes, that is exactly the issue:

root@debian:~# ANSIBLE_DEBUG=1 ansible 11.11.11.11 -b -m reboot | grep 'reboot:'
 17212 1562731855.10831: reboot: running setup module to get distribution
 17212 1562731858.97001: reboot: distribution: {'version': u'9', 'name': u'debian', 'family': u'debian'}
 17212 1562731858.97007: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731859.02842: reboot: last boot time: 224ac2c6-bb38-4ccc-862e-948c370e8b6b
 17212 1562731859.02861: reboot: connect_timeout connection option has not been set
 17212 1562731859.02868: reboot: running find module looking in ['/sbin', '/usr/sbin', '/usr/local/sbin'] to get path for "shutdown"
 17212 1562731859.53833: reboot: rebooting server with command '/sbin/shutdown -r 0 "Reboot initiated by Ansible"'
 17212 1562731859.59278: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731859.62833: reboot: last boot time check fail 'ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 1.424 seconds...
 17212 1562731861.05391: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731861.07659: reboot: last boot time check fail 'Failed to connect to the host via ssh: ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 2.824 seconds...
 17212 1562731863.90356: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731863.92551: reboot: last boot time check fail 'Failed to connect to the host via ssh: ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 4.236 seconds...
 17212 1562731868.16594: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731868.18807: reboot: last boot time check fail 'Failed to connect to the host via ssh: ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 8.491 seconds...
 17212 1562731876.68416: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731876.70661: reboot: last boot time check fail 'Failed to connect to the host via ssh: ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 12.9 seconds...
 17212 1562731889.61159: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731889.63611: reboot: last boot time check fail 'Failed to connect to the host via ssh: ssh: connect to host 11.11.11.11 port 22: Connection refused', retrying in 12.14 seconds...
 17212 1562731901.77407: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731902.11154: reboot: last boot time: 224ac2c6-bb38-4ccc-862e-948c370e8b6b
 17212 1562731902.11161: reboot: last boot time check fail 'boot time has not changed', retrying in 12.49 seconds...
 17212 1562731914.61625: reboot: getting boot time with command: 'cat /proc/sys/kernel/random/boot_id'
 17212 1562731914.66989: reboot: last boot time: 224ac2c6-bb38-4ccc-862e-948c370e8b6b
 17212 1562731914.67002: reboot: last boot time check fail 'boot time has not changed', retrying in 12.28 seconds...
^C [ERROR]: User interrupted execution

What about changing the boot time command to:

uptime --since ; cat /proc/sys/kernel/random/boot_id

Maybe uptime --since is also available on other systems, so it could be prepended to the non-default commands here:

'freebsd': '/sbin/sysctl kern.boottime',

It could even come before who -b, since that command can be problematic: it does not report seconds.
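The combined check suggested above could be sketched like this (a sketch only: `uptime --since` is GNU procps and is not available everywhere, and the fallback strings are hypothetical placeholders, not anything the module uses):

```shell
#!/bin/sh
# Sketch of the combined boot-time fingerprint proposed above.
# 'uptime --since' prints the boot timestamp (GNU procps; not universal),
# while boot_id is a per-boot UUID that some container platforms (e.g. OpenVZ)
# do not regenerate on reboot. The fallbacks keep the sketch from failing on
# systems where either source is missing.
since=$(uptime --since 2>/dev/null || echo since-unavailable)
boot_id=$(cat /proc/sys/kernel/random/boot_id 2>/dev/null || echo boot-id-unavailable)
echo "$since $boot_id"
```

The idea is that the concatenated string changes on reboot if either source changes, so container platforms that keep boot_id stable would still be detected via the timestamp.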

ansibot removed the needs_info label on Jul 10, 2019
@samdoran

That presents a tricky problem to solve. It's not as simple as prepending the boot time command — we have to see what is available on all those platforms. Further complicating the issue is we have to change the command based on the virtualization platform. I am not sure how to reliably discover that information.


mvorisek commented Jul 11, 2019

uptime is present on most systems and seems like an elegant solution. Used as a second source, it preserves the existing behaviour (backward compatibility) and should cover 99% of use cases, since it works under virtualization.

Can you please hotfix it for 2.7+ branches?

For the next major release it can be analysed further, or a new option to specify the boot time command manually could be introduced.

@samdoran

The problem with uptime and who is that the available parameters vary widely across Linux, Unix, BSD, and macOS, and across versions of each. I found during development that uptime was unreliable as a source of last boot time because its default output is a counter, not a unique value for the last boot, making it an unreliable test of whether the system has rebooted.

who was problematic on systems that do not have a real time clock, such as the Raspberry Pi.

I am not saying this cannot be solved, just that it is more involved than it seems.

@mvorisek

mvorisek commented Jul 11, 2019

Yes, the source for uptime is the /proc/uptime counter. uptime --since provides a start date, but it can be unreliable: the seconds value can shift due to the delay between reading /proc/uptime and reading the current date, or if the system time or timezone is changed.

What about reading the last reboot time directly from the /var/log/wtmp file? This is what the who and last commands do.

who was problematic on systems that do not have a real time clock, such as the Raspberry Pi.

/proc/sys/kernel/random/boot_id is a great source. But if we can find a stable (i.e. not counter-based) source of start time, it can be used as a second source as outlined above. It will help detect reboots on container-hosted systems without affecting any existing detection sources.
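A quick sketch of reading the wtmp-based boot record mentioned above (both commands parse /var/log/wtmp; the output format varies by distribution, the resolution is only one minute, and the file may be absent on minimal container images, hence the fallbacks):

```shell
#!/bin/sh
# Read the last boot record the way 'who' and 'last' do, from /var/log/wtmp.
# who -b prints e.g. "system boot  2019-07-10 03:11" (minute resolution).
# last reboot lists reboot entries from wtmp; we take the most recent.
record=$(who -b 2>/dev/null)
[ -n "$record" ] || record=$(last -x reboot 2>/dev/null | head -n 1)
[ -n "$record" ] || record="wtmp record unavailable"
echo "last boot: $record"
```

As noted earlier in the thread, the minute-only resolution and the dependence on a real-time clock are exactly why this source alone is not a reliable reboot test.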

@mvorisek

@samdoran What fix would you propose, and can we fix it locally before the fix is released?

@samdoran

@mvorisek I don't have a specific fix in mind. I will have to set up OpenVZ in order to investigate further.

@mvorisek

When is this issue expected to be fixed?

@samdoran

I do not have an ETA on fixing this since restarting containers was outside the original scope of this module. I will look into it, but am currently focused on the 2.9 release.

@mvorisek

For OpenVZ, Docker, and similar containers, all standard restart commands work out of the box.

The only issue is the command to detect that the system has rebooted.

Would it be possible to solve this issue by adding an Ansible boot_time_command option which can be freely specified by the user? If a non-empty value is specified, it overrides the default boot time detection command; if it is unset or empty, it has no effect.

What do you think about this solution and can you add it to the 2.9 release?
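A sketch of how such an option might look in the handler from the report above (the parameter name boot_time_command and the example value follow the proposal in this thread; the syntax actually shipped by the linked PR may differ):

```yaml
- name: Reboot system
  reboot:
    reboot_timeout: 1200
    post_reboot_delay: 5
    connect_timeout: 2
    # Hypothetical override per the proposal: a timestamp source that
    # changes inside OpenVZ/Docker containers, unlike the default
    # boot_id check
    boot_time_command: uptime --since
  listen: handler_reboot
```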

@samdoran

I thought long and hard about this as well as evaluated some other solutions. In the end, your suggestion of adding a new parameter was the simplest and most reliable solution.

Please test with #62680 and let me know if that solves your problem.

@mvorisek

Seems good to me. Can you make it into the Ansible 2.9 release?

@samdoran

Since this is adding a new feature and 2.9 is already well past feature freeze, this will be in 2.10. I was trying to find a way to handle this internally and thus be able to add it to 2.9 as a bugfix, but I did not come up with a reliable way to fix it without tons of complexity.

@mvorisek

I understand, but this switch is a bugfix for me. Nothing new, only the possibility for the user to specify their own reboot detection command for the edge cases not covered by the module itself (like this bug report with OpenVZ).

ansibot added the has_pr label on Sep 23, 2019
dgabrysch commented Jan 4, 2020

I am running a normal VM which I connect to over a normal Internet connection; the reboot always gets stuck. It is an Ubuntu box: 18.04.3 LTS (GNU/Linux 4.15.0-72-generic x86_64).

It hangs forever after the reboot :/

This is what I have inside my playbook:

- name: Reboot after upgrade
  reboot:

become is set to "true"; this is what I get when running verbose:

<machine.host.name> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/Users/xxx/.ansible/cp/4406263006 -tt machine.host.name '/bin/sh -c '"'"'sudo -H -S -p "[sudo via ansible, key=somekey] password:" -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-blablabla ; cat /proc/sys/kernel/random/boot_id'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<machine.host.name> (255, b'', b'OpenSSH_7.9p1, LibreSSL 2.7.3\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 48: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: Control socket "/Users/xxx/.ansible/cp/4406263006" does not exist\r\ndebug2: resolving "machine.host.name" port 22\r\ndebug2: ssh_connect_direct\r\ndebug1: Connecting to machine.host.name [xx.xx.xx.xx] port 22.\r\ndebug2: fd 5 setting O_NONBLOCK\r\ndebug1: connect to address xx.xx.xx.xx communication timed out\r\nssh: connect to host machine.host.name port 22: Operation timed out\r\n')
reboot: attempting to get system boot time
<machine.host.name> ESTABLISH SSH CONNECTION FOR USER: None
<machine.host.name> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/Users/xxx/.ansible/cp/4406263006 -tt machine.host.name '/bin/sh -c '"'"'sudo -H -S -p "[sudo via ansible, key=xxx] password:" -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-xxxx ; cat /proc/sys/kernel/random/boot_id'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded

It stays there forever, I experimented with the timeouts but this did not help in any way :(

@dgabrysch

OK, never mind. I had been using Visual Studio Code when testing; it seems the integrated terminal there does not work properly with this module in combination with macOS... :D

@ansible ansible locked and limited conversation to collaborators Feb 19, 2020