Implement connection retries #10

Merged: 3 commits, Oct 21, 2018

Conversation

@dagwieers (Contributor)

This implements the same solution for retrying connections as in pywinrm diyan/pywinrm#174

@coveralls commented Oct 7, 2018

Pull Request Test Coverage Report for Build 115

  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • 3 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.09%) to 98.358%

Files with Coverage Reduction    New Missed Lines    %
pypsrp/negotiate.py              1                   78.38%
pypsrp/serializer.py             1                   99.76%
pypsrp/spgnego.py                1                   87.75%
Totals Coverage Status
Change from base Build 112: -0.09%
Covered Lines: 3354
Relevant Lines: 3410

💛 - Coveralls

@jborean93 (Owner)

@dagwieers once merged #12 will fix the Python 2.6 issues.

@jborean93 (Owner) left a comment

Thanks for this @dagwieers, do you have an easy-to-reproduce script/playbook that I can use to test this out?

pypsrp/wsman.py Outdated
connect=self.reconnection_retries,
read=self.reconnection_retries,
backoff_factor=self.reconnection_backoff,
status_forcelist=[425, 429, 503],
@jborean93 (Owner)

Why these status codes?

@dagwieers (Contributor, Author)

All three of these are HTTP status codes that allow for retrying.

@dagwieers (Contributor, Author)

In fact, HTTP 413, HTTP 429 and HTTP 503 can return a Retry-After header that influences the backoff value. This way the server can tell the client to come back at some later interval (because it hasn't finished starting, or is handling too many connections).

Before, our implementation would ignore those. (TBH I don't know whether Microsoft has implemented any of this, but it doesn't hurt to support it anyway, because a proxy or gateway could be in use.)
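For reference, a minimal sketch of how such a Retry policy is typically wired into a requests session via urllib3; the values simply mirror the diff above, this is not the exact PR code:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=4,                           # assumed overall cap, analogous to reconnection_retries
    connect=4,                         # connect=self.reconnection_retries in the diff
    read=4,                            # read=self.reconnection_retries in the diff
    backoff_factor=2.0,                # backoff_factor=self.reconnection_backoff
    status_forcelist=[425, 429, 503],  # the status codes discussed above
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))
# urllib3 honours a Retry-After header for 413/429/503 when retrying on status codes.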

pypsrp/wsman.py (resolved)
pypsrp/wsman.py (resolved)
pypsrp/wsman.py Outdated
@@ -126,6 +126,7 @@ def __init__(self, server, max_envelope_size=153600, operation_timeout=20,
path="wsman", auth="negotiate", cert_validation=True,
connection_timeout=30, encryption='auto', proxy=None,
no_proxy=False, locale='en-US', data_locale=None,
reconnection_retries=4, reconnection_backoff=2.0,
@jborean93 (Owner)

What's your reasoning for making this the default? I'm leaning towards keeping the existing behaviour rather than defaulting to retry on newer versions.

@dagwieers (Contributor, Author)

I think reliability is more important than keeping the existing behavior, especially since the new behavior is identical to the existing behavior when everything works fine. Only when things don't work immediately (Connection Refused, Service Unavailable, Too Early, Too Many Requests) do we retry, which is exactly what we should be doing (knowing what we know now).

The reason for choosing 4 retries with a backoff factor of 2 seconds is that the 4 retries take no longer than roughly 30 seconds in total, which is exactly what we have specified for the connection timeout. So I think this is acceptable, if you accept that an unresponsive system will also take 30 seconds by default before it fails.

Not retrying in the above cases is the wrong behavior IMO.
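A quick back-of-the-envelope check of that claim, assuming urllib3 1.x's documented backoff behaviour (no sleep before the first retry, then backoff_factor * 2 ** (n - 1) seconds before retry n):

backoff_factor = 2.0
retries = 4

# Sleep before retry 1 is 0, then backoff_factor * 2 ** (n - 1) for retries 2..4.
sleeps = [0.0] + [backoff_factor * 2 ** (n - 1) for n in range(2, retries + 1)]
print(sleeps)       # [0.0, 4.0, 8.0, 16.0]
print(sum(sleeps))  # 28.0 seconds of backoff, roughly the 30 s connection timeout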

@jborean93 (Owner)

So with this, when do connection_timeout and the retries each apply? It sounds like if there's no response after 30 seconds the connection timeout will be reached? Is it going to try to reconnect again, or are the reconnects only for issues that happen post connection?

@dagwieers (Contributor, Author)

The documentation is not clear on this, but normally the connection timeout is per connection (not the total time). This should be apparent when you have slow DNS lookups or DNS problems; in that case the DNS latency is added on top of the connection timeout as well.

If you use retry=4, each connection has the same timeout. For the successful case there's no difference in behavior; for the case where the connection was established (but faulty) there's also no difference in behavior. But for the case where the server does not respond and retries are attempted (not sure if it retries in this case, I would assume not), the connection timeout could be up to 4x as long. It's easy to test this behavior, in fact.
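A minimal sketch of such a timing test (the target address and port are just assumptions; point it at whatever reliably fails to respond in your environment):

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=Retry(connect=4, backoff_factor=2.0)))

start = time.time()
try:
    # 10.255.255.1 is assumed to be a non-routable address, so the connect attempt hangs.
    session.get("http://10.255.255.1:5985/wsman", timeout=30)
except requests.exceptions.ConnectionError as exc:
    print("failed after %.1f seconds: %s" % (time.time() - start, exc))
# Roughly 30 s elapsed suggests the timeout covers all attempts;
# closer to 4 x 30 s suggests it applies per attempt.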

@jborean93 (Owner)

Let's try it out; I'm wondering whether this will affect the logic in win_reboot and whether we should override some values in there.

@dagwieers (Contributor, Author)

Yeah, we may need to look at win_reboot, although the default 4 retries and backoff factor of 2 seconds work fine in our tests for win_reboot too. But it's probably not really needed.

@dagwieers (Contributor, Author) commented Oct 10, 2018

So in one of my tests connecting with WinRM to a non-existent server, it timed out after 30 seconds. This means it doesn't time out per retry; the timeout appears to cover the total time (including retries). Not sure whether this is because of urllib3, requests, or a lower-level mechanism in this case.

@dagwieers (Contributor, Author) commented Oct 8, 2018

@jborean93 This is an example where we need to recover gracefully:

- name: Mount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: present
  register: iso

# NOTE: Debug logs are available from: C:\ProgramData\VMMLogs\SetupWizard.log
# NOTE: The SCVMM installer does not support a config file in UTF-8 (with BOM).
# FIXME: This action requires that pywinrm can survive 30+ seconds of WinRM service unavailability
#        And needs a patch for winrm/transport.py, see: https://github.com/ansible/ansible/issues/25532
- name: Run System-Center installer
  win_command: '{{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini" /SqlDBAdminDomain "{{ windows_domain }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}" /VmmServiceDomain "{{ windows_domain }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}" /IACCEPTSCEULA'
  args:
    creates: C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe
  vars:
    ansible_user: '{{ windows_domain }}\{{ windows_admin_user }}'
  async: 1000
  poll: 15

- name: Unmount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: absent

But it is not the easiest example.

BTW, async is needed here to avoid: WSManFaultError: Received a WSManFault message. (Code: 995, Machine: 38.38.51.3, Reason: The I/O operation has been aborted because of either a thread exit or an application request.)

@jborean93 (Owner)

No worries, I'll try it out and see how I go.

@dagwieers (Contributor, Author)

I added HTTP 413, which can also return a Retry-After header. And I forced read=0 as well.

@dagwieers (Contributor, Author)

I did some tests:

  • It takes exactly 30 seconds (with a ConnectTimeoutError) when the system is unreachable (the system was powered off). So here the connect timeout clearly won out over the retries.
  • It takes about 5 seconds (with a NewConnectionError) when the system has "No route to host" (the system was not routable).
  • When I use another HTTP server for the WinRM communication (httpbin.org port 80), I was expecting the transport to fail early and fast, but it takes 4 minutes. Which is peculiar, because I wouldn't have expected more than 2 minutes (4 x 30 seconds). This needs some investigation.

@jborean93 (Owner) commented Oct 13, 2018

Just did my first test; I'll add any other tests/issues I come across to this post:

  • When using SSL, if the certificates aren't valid then the retries still occur; instead of failing straight away it will retry a total of 4 seconds. We shouldn't retry on these types of errors.

@dagwieers (Contributor, Author) commented Oct 13, 2018

it will retry a total of 4 seconds

Did you look at the logs? Because with our defaults that seems impossible.

@jborean93 (Owner) commented Oct 13, 2018

Sorry, 4 times not seconds; a total of 30 seconds.

@dagwieers (Contributor, Author) commented Oct 13, 2018

@jborean93 Right, I guess SSL errors are also connection errors, just like "Connection refused". We cannot distinguish one from the other within the requests Retry framework, and the intention is clearly that we do want to retry on "Connection refused" and other connection errors.

I expect other HTTP clients using requests with Retry will have the same behavior. Not sure if we can/want to influence this behavior internally.

Note that the following link actually recommends using the Retry framework over building your own, especially for SSL-related issues: https://www.peterbe.com/plog/best-practice-with-retries-with-requests
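One way to see why SSL failures get swept up: in requests, a certificate validation failure is itself a kind of connection error, so any retry machinery keyed off connection failures retries it too. A tiny illustration (not PR code):

import requests

# SSLError (raised on certificate validation failures) subclasses ConnectionError,
# so logic that retries connection errors also retries certificate problems.
print(issubclass(requests.exceptions.SSLError, requests.exceptions.ConnectionError))  # True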

@jborean93 (Owner)

Hmm, all these edge cases, and the compounding of the connection timeout and backoff_factor, are really pushing me against making retries the default and towards just keeping it configurable by an end user if they really want this feature.

I also question the validity of the status-code retries; I'm trying to come up with a scenario where we would want to retry on a non-200 response. For example, a 413 error is due to us not setting the max envelope size properly and sending a packet that's too big for the server to handle. No amount of retries will change that and would just be a waste of time. Have you come across a situation before that requires these retries?

@dagwieers (Contributor, Author)

@jborean93 https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413

It can return a Retry-After header, which we wouldn't support if we don't implement Retry. This is one of the use cases of the mechanism. But I'm not married to HTTP 413.

@dagwieers (Contributor, Author)

Retries are now off by default; please re-review.
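For anyone reading along, enabling retries is now an explicit opt-in, roughly like this sketch (the host name and credentials are made up; only reconnection_retries and reconnection_backoff are the arguments added by this PR):

from pypsrp.wsman import WSMan

wsman = WSMan(
    "dc01.example.com",            # hypothetical host
    username="user",               # placeholder credentials
    password="pass",
    auth="negotiate",
    connection_timeout=30,
    reconnection_retries=4,        # added by this PR; no retries unless set
    reconnection_backoff=2.0,      # backoff factor handed to urllib3's Retry
)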

@jborean93 (Owner)

Looks like requests has issued an update that is not compatible with the tests; I will fix that in a separate commit.
