intermittent winrm connection failures with large hosts count #597

Closed
Yannik opened this issue Apr 2, 2024 · 8 comments

@Yannik

Yannik commented Apr 2, 2024

SUMMARY

As our number of Ansible-managed Windows hosts grows over time (currently 80 Windows Server 2022 machines), I am running into intermittent winrm connection issues more and more often. They happen roughly once every two thousand winrm task executions, but with this number of hosts that is starting to be problematic.

These issues are intermittent and not reproducible. They do not occur at the beginning of a play, but only after previous winrm tasks for the same host have executed successfully, and running the play again always fixes the issue (although it has happened that a different host then fails on the next run).
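
To put that failure rate in perspective, here is a back-of-the-envelope sketch. It assumes failures are independent and equally likely for every task execution, which may not hold in practice, and the tasks-per-host figure is hypothetical rather than taken from the report.

```python
def prob_at_least_one_failure(per_task_failure_rate: float, executions: int) -> float:
    """Chance that at least one of `executions` independent task runs fails."""
    return 1.0 - (1.0 - per_task_failure_rate) ** executions


# Figures from the report: roughly 1 failure per 2000 executions, 80 hosts.
rate = 1 / 2000
hosts = 80
tasks_per_host = 25  # hypothetical play size, not from the report

p = prob_at_least_one_failure(rate, hosts * tasks_per_host)
print(f"{hosts * tasks_per_host} executions -> {p:.1%} chance of at least one failure")
```

Under these assumptions, a play that adds up to 2000 task executions would hit at least one failure in roughly 63% of runs, which is why a rare per-task error becomes a frequent per-run error as the host count grows.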

Here are some sample errors:

[WARNING]: ERROR DURING WINRM SEND INPUT - attempting to recover:
ConnectionError ('Connection aborted.', ConnectionResetError(104, 'Connection
reset by peer'))
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: requests.exceptions.ConnectionError: HTTPConnectionPool(host='debfe01sc003.XXX', port=5985): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffb718f1880>: Failed to establish a new connection: [Errno 111] Connection refused'))
fatal: [debfe01sc003]: FAILED! => 
  msg: 'Unexpected failure during module execution: HTTPConnectionPool(host=''debfe01sc003.XXX'', port=5985): Max retries exceeded with url: /wsman (Caused by NewConnectionError(''<urllib3.connection.HTTPConnection object at 0x7ffb718f1880>: Failed to establish a new connection: [Errno 111] Connection refused''))'
  stdout: ''
fatal: [debfe01sc003]: UNREACHABLE! => changed=false 
  msg: 'kerberos: HTTPConnectionPool(host=''debfe01sc003.XXX'', port=5985): Max retries exceeded with url: /wsman (Caused by NewConnectionError(''<urllib3.connection.HTTPConnection object at 0x7ff6d690e690>: Failed to establish a new connection: [Errno 111] Connection refused''))'
  unreachable: true
fatal: [degoe10fs014]: UNREACHABLE! => changed=false 
  msg: 'kerberos: HTTPConnectionPool(host=''degoe10fs014.XXX'', port=5985): Max retries exceeded with url: /wsman (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f16162334d0>: Failed to resolve ''degoe10fs014.XXX'' ([Errno -3] Try again)"))'
  unreachable: true

I found this issue discussing the same problem, but it appears it was closed with no real solution other than modifying the winrm python files by hand or switching to ssh, neither of which is really a viable option for us. Would love your opinion on this @jborean93 :)

ISSUE TYPE
  • Bug Report
COMPONENT NAME

winrm

ANSIBLE VERSION
ansible [core 2.16.3]
  config file = /builds/ansible/deployments/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /app/lib/python3.12/site-packages/ansible
  ansible collection location = /builds/ansible/deployments/vendor_collections
  executable location = /app/bin//ansible
  python version = 3.12.2 (main, Feb  7 2024, 22:13:24) [GCC 13.2.1 20231014] (/usr/local/bin/python)
  jinja version = 3.1.3
  libyaml = True
COLLECTION VERSION

Collection      Version
--------------- -------
ansible.windows 1.14.0 
CONFIGURATION
CACHE_PLUGIN_CONNECTION(/home/yannik/projects/xxx/ansible/ansible.cfg) = .ansible_facts
CACHE_PLUGIN_TIMEOUT(/home/yannik/projects/xxx/ansible/ansible.cfg) = 60
CALLBACKS_ENABLED(/home/yannik/projects/xxx/ansible/ansible.cfg) = ['ansible.posix.profile_tasks']
COLLECTIONS_PATHS(/home/yannik/projects/xxx/ansible/ansible.cfg) = ['/home/yannik/projects/xxx/ansible/vendor_collections']
CONFIG_FILE() = /home/yannik/projects/xxx/ansible/ansible.cfg
DEFAULT_FORKS(/home/yannik/projects/xxx/ansible/ansible.cfg) = 25
DEFAULT_HOST_LIST(/home/yannik/projects/xxx/ansible/ansible.cfg) = ['/home/yannik/projects/xxx/ansible/inventory']
DEFAULT_MANAGED_STR(/home/yannik/projects/xxx/ansible/ansible.cfg) = This file is managed by ansible and will be overwritten! Do not change it manually!
DEFAULT_ROLES_PATH(/home/yannik/projects/xxx/ansible/ansible.cfg) = ['/home/yannik/projects/xxx/ansible/vendor_roles']
DEFAULT_STDOUT_CALLBACK(/home/yannik/projects/xxx/ansible/ansible.cfg) = yaml
DEFAULT_TIMEOUT(/home/yannik/projects/xxx/ansible/ansible.cfg) = 120
DIFF_ALWAYS(/home/yannik/projects/xxx/ansible/ansible.cfg) = True
EDITOR(env: EDITOR) = vim
HOST_KEY_CHECKING(/home/yannik/projects/xxx/ansible/ansible.cfg) = False
INTERPRETER_PYTHON(/home/yannik/projects/xxx/ansible/ansible.cfg) = auto_silent
PAGER(env: PAGER) = less
RETRY_FILES_ENABLED(/home/yannik/projects/xxx/ansible/ansible.cfg) = False
OS / ENVIRONMENT

Target OS: windows server 2022

pywinrm-0.4.3
pykerberos-1.2.4

@jborean93
Collaborator

There's unfortunately no real resolution for these problems, as they are symptoms of another problem. This could be something like an unreliable network, or the host being under high load and breaking the WinRM service in some way. Fixing this is not simple and I cannot really give you a one-shot solution. I'll try to explain the errors you are getting a bit more, though.

Failed to resolve ''degoe10fs014.XXX'' ([Errno -3] Try again

This happens when the client tries to resolve the specified hostname into an IP address. It's purely a DNS task and happens before any WinRM operations occur. I'm not sure why it might fail, but it's a mandatory first step in figuring out how to communicate with the target.
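
As an illustration of that resolution step, here is a minimal Python sketch that retries only on the temporary "Try again" failure (`EAI_AGAIN`, which is errno -3 on Linux). The helper name, the retry count and the delay are assumptions made for illustration; this is not something Ansible or pywinrm does.

```python
import socket
import time


def resolve_with_retry(host, port=5985, attempts=3, delay=0.5,
                       resolver=socket.getaddrinfo):
    """Resolve `host`, retrying only on temporary DNS failures (EAI_AGAIN)."""
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            # EAI_AGAIN is the "Try again" error; other codes are not transient,
            # so they are re-raised immediately, as is the final failed attempt.
            if exc.errno != socket.EAI_AGAIN or attempt == attempts:
                raise
            time.sleep(delay)
```

Running something like `resolve_with_retry("degoe10fs014.XXX")` from the controller while a play is failing can help tell a flaky resolver apart from a problem on the Windows host itself.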

Failed to establish a new connection: [Errno 111] Connection refused

This one is a bit trickier as it could be the result of one of two things: a firewall is explicitly rejecting the connection, or nothing is listening on that port. As it is an intermittent problem I'm not sure what the cause could be, but this is the error the TCP stack returns when it fails to open the socket connection because the server rejected it.
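
A small probe can show which of these TCP-level outcomes the controller actually sees. The sketch below uses only the Python standard library; the function name and the returned labels are made up for illustration.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a single TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The connection was actively rejected: nothing is listening on the
        # port, or a firewall is rejecting (rather than dropping) the attempt.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else (DNS failure, unreachable network, ...).
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Calling `probe_tcp("debfe01sc003.XXX", 5985)` in a loop during a play would show whether the WinRM listener port briefly stops accepting connections, which would point at the service or host rather than at Ansible.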

You can try the psrp connection plugin which also operates over WinRM but with a newer protocol on top. This has a few configurable knobs for connection retries but it's not guaranteed to solve the problem. Ultimately these problems sit in a few layers below where Ansible sits so there is little we can really do to solve these problems.
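
For context, the retry knobs in question appear in the psrp plugin documentation as `ansible_psrp_reconnection_retries` and `ansible_psrp_reconnection_backoff`; please verify the names against the docs for your version. The sketch below is not the plugin's implementation. It only illustrates the general retry-with-exponential-backoff idea, and the function and parameter names are hypothetical.

```python
import time


def call_with_retries(operation, retries=3, backoff=2.0, sleep=time.sleep):
    """Run `operation`, retrying on built-in ConnectionError with backoff.

    Waits backoff * 2**attempt seconds between attempts and re-raises the
    last error once `retries` extra attempts have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            # Covers ConnectionRefusedError and ConnectionResetError.
            if attempt == retries:
                raise
            sleep(backoff * (2 ** attempt))
```

A retry like this only helps when the failure is short-lived. If the WinRM service on the host is in a bad state, every attempt fails the same way and the play simply takes longer to report it.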

@Yannik
Author

Yannik commented Apr 15, 2024

Two questions:

  1. Is there any good comparison between winrm and psrp connection plugins somewhere?
    For instance, I cannot find a knob similar to ansible_winrm_kerberos_delegation for psrp.
    Which one would you personally recommend?

  2. Would you be open to merging an updated version of diyan/pywinrm#174 (Overcome "Connection Refused" on some operations)?
    I saw you merged the functionally identical commit from the same author on psrp (jborean93/pypsrp#10, Implement connection retries).

@jborean93
Collaborator

Is there any good comparison between winrm and psrp connection plugins somewhere? For instance, I cannot find a knob similar to ansible_winrm_kerberos_delegation for psrp. Which one would you personally recommend?

I'm not aware of any migration guide between the two, but in this case the option is now ansible_psrp_negotiate_delegate. The psrp doc tries to document all the features, unlike winrm, which has a few hidden/passthru options. A migration guide would be a good idea, though, to help people who want to try out psrp. I personally would recommend psrp as it's the same fundamental transport but offers a few benefits over winrm.
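
In the absence of a migration guide, a small helper can make the renaming explicit. The sketch below is hypothetical: only the kerberos delegation rename is confirmed in this thread, so the mapping table contains just that entry, and any other `ansible_winrm_*` variable is reported for manual review rather than guessed at.

```python
# Only this rename is confirmed in the thread; extend the table after
# checking the psrp plugin documentation for the remaining options.
WINRM_TO_PSRP = {
    "ansible_winrm_kerberos_delegation": "ansible_psrp_negotiate_delegate",
}


def migrate_hostvars(hostvars):
    """Return (migrated_vars, unmapped_winrm_keys) for a dict of host vars."""
    migrated = {}
    for key, value in hostvars.items():
        migrated[WINRM_TO_PSRP.get(key, key)] = value
    # Switch the connection plugin itself.
    if migrated.get("ansible_connection") == "winrm":
        migrated["ansible_connection"] = "psrp"
    # Anything still prefixed ansible_winrm_ has no known psrp equivalent here.
    unmapped = sorted(k for k in migrated if k.startswith("ansible_winrm_"))
    return migrated, unmapped
```

For example, `migrate_hostvars({"ansible_connection": "winrm", "ansible_winrm_kerberos_delegation": True})` yields vars with `ansible_connection` set to `psrp` and the delegation flag under its psrp name, plus an empty list of unmapped keys.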

Would you be open to merge an updated version of diyan/pywinrm#174 ?

In all honesty, probably not. I've not really seen much benefit from the connection retry mechanism, as usually when a retry is needed the underlying service is in a bad state where a retry won't help. The bar for touching pywinrm is pretty high, especially when it comes to new features, sorry.

@Yannik
Author

Yannik commented Apr 18, 2024

Hi Jordan,

thanks!

I will create a POC for migrating to psrp.

I personally would recommend psrp as it's the same fundamental transport but offers a few benefits over winrm.

Would you mind elaborating these benefits? I'd love to learn.
Tbh, it would be great to have an explanation like this in the ansible docs :)

@jborean93
Collaborator

Would you mind elaborating these benefits?

The main benefit would be speed improvements. It's nothing substantial but you should see some really great improvements when running a looped task. General tasks should still be a tiny bit quicker but you'll only really see the improvements when the connection is reused (a loop reuses the connection). There are a few improvements in the authentication process but honestly probably nothing you would notice.

Tbh, it would be great to have an explanation like this in the ansible docs :)

While I can't give a definitive date it is part of my current documentation goals so hopefully I'll have something soon.

@Yannik
Author

Yannik commented Apr 19, 2024

Thanks! We have migrated to psrp now and will evaluate the results over the coming weeks. Will close this for now.

Yannik closed this as completed Apr 19, 2024
@agibson2
Contributor

Any followup feedback on the result of that transition (for others like myself) would be appreciated.

@Yannik
Author

Yannik commented Jun 24, 2024

@agibson2 We are very happy: switching to psrp has fixed the connection issues and made our CI/CD execution much more reliable.
