Make tests compatible with windows by ocaballeror · Pull Request #252 · opensistemas-hub/osbrain

ocaballeror · 2018-02-20T14:50:34Z

Continuing the discussion from #147 .

Also, just to be clear, I don't intend to merge this yet, so there may be some weird comments and print statements around the code.

ocaballeror · 2018-02-20T17:25:08Z

Also, it turns out many tests were failing due to timeouts because we were trying to run 8 parallel threads when the appveyor machine has only 2 cores. Using pytest -n2 works much better 👍

Peque · 2018-02-20T17:29:00Z

@ocaballeror Yeah, we could even avoid using pytest-xdist in AppVeyor. Historically it has given some problems in Travis, but at least it sped things up. If we are not going to notice the speed-up much, then we might as well avoid using it.

I'd say we could even, for AppVeyor, run tests in 1 thread (i.e.: no pytest-xdist) and only against the latest Python version.

ocaballeror · 2018-02-20T17:36:24Z

@Peque I've run it a few times with 2 threads and it seems to work fine. We could leave it with 2 and revert to single-threaded mode if we run into any problems in the future. After all, if we have extra cores, why not use them?

About the python versions, though, why test only the latest one? There could be some implementation differences from version to version that suddenly break compatibility, so I'd leave it as it is right now just to be safe.

ocaballeror · 2018-02-22T07:13:16Z

Update: Still working on the linger tests. There must be something that Windows is doing differently in the internal implementation of sockets and the SO_LINGER option, but I'm having a hard time figuring out what it is.

I'll keep updating when I find more stuff.

ocaballeror · 2018-02-23T15:59:26Z

Took me a long while to figure it out. In the end it turns out the problem with the linger tests had nothing to do with sockets, zmq or anything related to that. The problem was actually being caused by the way we changed the linger value in every test. We were updating the global config dictionary, and then expecting the .close_all() method to read the updated value, but that won't work on Windows because of the way it creates new processes. On Unix the fork() system called is used to create the agent processes, but since that is not available on Windows, and the new processes are created using spawn(), we can't expect the global variable config to be copied to child processes.

In the end, the call to get_linger() inside close_all() was returning the default value of 1000 for linger instead of the updated one. I changed this so ,close_all() accepts a linger parameter.

ocaballeror · 2018-02-23T16:02:40Z

By the way, there are 2 other tests currently failing and they are both related to Windows allowing more than one socket to listen on one address.

I will try to implement some kind of check to stop this from happening and manually raise an error if the user tries to do that. @Peque or should I just skip these tests on Windows and kindly advise people against this sort of practice?

codecov · 2018-02-23T16:04:47Z

Codecov Report

Merging #252 into master will decrease coverage by 0.05%.
The diff coverage is 96.22%.

@@            Coverage Diff             @@
##           master     #252      +/-   ##
==========================================
- Coverage   99.08%   99.03%   -0.06%     
==========================================
  Files          26       26              
  Lines        3503     3528      +25     
  Branches      250      251       +1     
==========================================
+ Hits         3471     3494      +23     
- Misses         20       21       +1     
- Partials       12       13       +1

Impacted Files	Coverage Δ
osbrain/tests/test_proxy.py	`100% <100%> (ø)`	⬆️
osbrain/tests/test_agent_ipc_sockets.py	`100% <100%> (ø)`	⬆️
osbrain/__init__.py	`100% <100%> (ø)`	⬆️
osbrain/nameserver.py	`99.21% <100%> (+0.01%)`	⬆️
osbrain/agent.py	`97.7% <100%> (ø)`	⬆️
osbrain/tests/common.py	`92.85% <100%> (+2.38%)`	⬆️
osbrain/tests/test_agent.py	`100% <100%> (ø)`	⬆️
osbrain/tests/test_nameserver.py	`98.11% <100%> (+0.01%)`	⬆️
osbrain/tests/test_agent_transport.py	`93.97% <88.88%> (-2.13%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0d96cc...f6482ec. Read the comment docs.

Peque · 2018-02-26T08:46:51Z

osbrain/nameserver.py

        self.addr = SocketAddress(self.host, self.port)
+        print("self.host:", self.host)
+        print("self.port:", self.port)
+        print("self.addr:", self.addr)


I guess this prints should be removed.

Peque · 2018-02-26T08:50:44Z

osbrain/tests/common.py

+            return True
+        time.sleep(.1)
+
+    return False


This function is declared but it seems it is not used.

If you want to use it maybe we could merge #251 first (your review is pending), which implements a wait_condition() function with tests and an option to negate the condition.

I completely forgot about this. Yes, it should be removed.

No need to merge the other PR, I'm not even using this function.

Peque · 2018-02-26T08:57:20Z

osbrain/tests/common.py

+    return False
+
+skip_on_windows = pytest.mark.skipif(sys.platform == 'win32',
+                                     reason='Not supported on windows')


I'd rather put this between the imports and the first function, if you agree.

Peque · 2018-02-26T08:58:09Z

osbrain/tests/test_agent.py

    All agent identifiers should be unique strings.
    """
-    N = 1000
+    N = 100


Why this change? Is it too slow on Windows?

Yes. I don't understand why, but for some reason, it takes 12-13 seconds to complete on Windows and about 0.2 seconds on Linux. I thought N=100 would be enough for this test anyway

I knew Windows processes were slow, but did not know they were that slow. The main difference is the way they are created (spawn vs. fork). Forking is definitely much faster, specially with the copy-on-write implemented in Linux.

Leave it with 100 then. 👍

Peque · 2018-02-26T09:18:37Z

osbrain/tests/test_agent_transport.py

+        osbrain.config['TRANSPORT'] = 'ipc'
+        agent = run_agent('a2')
+        address = agent.bind('PUSH')
+        assert address.transport == 'ipc'


I think there should probably be a test for global transport configuration, in example:

def test_agent_bind_transport_platform_default(nsproxy): """ Default transport is platform-dependent. """ agent = run_agent('a0') address = agent.bind('PUSH') if os.name == 'posix': assert address.transport == 'ipc' else: assert address.transport == 'tcp'

And then another for the global setting:

def test_agent_bind_transport_global(nsproxy): """ Test global default transport change. """ # Default transport is not `inproc` agent = run_agent('a0') address = agent.bind('PUSH') assert address.transport != 'inproc' # Changing default global transport to `inproc` osbrain.config['TRANSPORT'] = 'inproc' agent = run_agent('a1') address = agent.bind('PUSH') assert address.transport == 'inproc'

That reminds me that changing values in the global config does not work on windows, for the same reason I explained earlier when I was talking about the linger thing.
When calling run_agent() a new process will be created, but because of the way Windows creates child processes, the values in config won't be the ones we just set, but rather the default ones.

I'll add these two tests and try to find a way to make the second one work on Windows.

As I mention bellow, I'd leave that until we merge this PR. Just having AppVeyor integrated, even if we are skipping some tests in Windows, is worth it to merge. Then we can see how to gradually reduce the number of skipped tests.

You forgot to change this (i.e.: create 2 separate tests). 😉

Peque · 2018-02-26T09:26:26Z

osbrain/agent.py

        for sock in self._get_unique_external_zmq_sockets():
            sockets_to_delete.append(sock)
-            sock.close(linger=get_linger())
+            sock.close(linger)


This is not the expected behavior.

Do you know when that get_linger() gets executed?

In my opinion we should leave it as it was before and skip if Windows fails, then create a new issue to investigate this problem and/or a workaround (i.e.: serializing the global configuration to child processes?). But for now it is easier if we simply skip.

Why is it not the expected behavior? I don't understand what you mean. The value of linger can be passed as an optional parameter, and if it isn't, get_linger() is called, as it was before. How does this differ from the expected behavior?

conf = dict(linger=1) def get_linger(): return conf['linger'] def default_conf(linger=get_linger()): return linger if __name__ == '__main__': print(default_conf()) conf['linger'] = 2 print(default_conf())

Oh. I didn't know that happened. What about something like this:

conf = dict(linger=1) def get_linger(): return conf['linger'] def default_conf(linger=None): if linger is None: linger = get_linger() return linger if __name__ == '__main__': print(default_conf()) conf['linger'] = 2 print(default_conf())

Yeah, although that should be in a separate PR. Basically you'd be adding support for passing a linger parameter to .close_all(). That would require new tests. If you want to implement this we should also probably add the same parameter to .close() (with tests again).

As it is a non-trivial change, I would not add it here. Let us for now keep this PR only for making AppVeyor succeed (even if skipping some tests for now).

Ok. I will skip the linger tests as well, then.

Peque · 2018-02-26T09:29:03Z

osbrain/tests/test_proxy.py

    with pytest.raises(TimeoutError):
        locate_ns('127.0.0.1:1234', timeout=timeout)
-    assert timeout <= time.time() - t0 <= timeout + 1.
+    assert timeout <= time.time() - t0 <= timeout + 2.


Was this failing with the previous timeout?

At least on my machine, yes. 😄

Turns out 1.2 is enough, but with just 1, the test will fail everytime, both on my machine and appveyor

Peque · 2018-02-26T09:29:07Z

osbrain/tests/test_proxy.py

    assert not wayne._next_oneway

-    assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.2)
+    assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.5)


Was this failing with the previous timeout?

Maybe at some point, I wouldn't have changed it otherwise. I'll run it through appveyor again, maybe it needed longer timeouts before, since we were using 8 threads instead of 2.

It still fails if set back to 1.2?

Same as above, I ran it in a loop to see if it would eventually fail, and after 100-200 tests I've gotten it to fail a couple of times. 1.5 looks like a safer value.

Peque · 2018-02-26T09:29:48Z

osbrain/tests/test_nameserver.py

+    if sys.platform != 'win32':
+        with pytest.raises(RuntimeError):
+            random_nameserver_process(port_start=22,
+                                      port_stop=22, timeout=0.5)


I'd say we simply skip this test on Windows for now and revert the changes.

What about random_nameserver_process() then? Shouldn't we test it works?

For now let us just skip those failing tests. We will fix them later, as separate PRs, analyzing the situation for each case.

Peque · 2018-02-26T09:31:30Z

osbrain/tests/test_agent_transport.py

+        address = agent.bind('PUSH', addr=ipc_addr, transport='ipc')
+        assert address.transport == 'ipc'
+        assert address.address.name == ipc_addr
+


I'd say we split this test in two: one for IPC, one for TCP. Add then a skip-on-windows mark for the IPC test.

I still think we should split this in 2.

Peque · 2018-02-26T09:36:19Z

@ocaballeror I would say, for now, we just skip failing tests for Windows. Then we can create specific issues for those fails and decide whether we should fix them or just assume they should fail in Windows.

Also, that way we keep this PR a bit simpler and merge it faster (even if it is not perfect, it is definitely a step forward that we want to include in our code base). 😊

Peque · 2018-02-26T11:46:16Z

osbrain/agent.py

+            reuse = Pyro4.config.SOCK_REUSE
+            Pyro4.config.SOCK_REUSE = False
            self._daemon = Pyro4.Daemon(self._host, self.port)
+            Pyro4.config.SOCK_REUSE = reuse


I think Pyro4.config.SOCK_REUSE defaults to False. Anyway maybe we could fix this later, as a different issue, and try to merge what you have got now, which is already an improvement over what we had before. 😊

It defaults to True. And for some reason (I haven't investigated much), everything fails spectacularly if the default value is changed to False in __init__.py.

But OK, I'll revert these two changes and skip the tests that rely on OS errors being raised from binding to addresses in use.

You are right, the documentation is wrong then... Can you open a PR in Pyro4 to fix the documentation?

ocaballeror · 2018-02-26T13:07:27Z

@Peque Maybe I should create another flag? Something like skip_for_now so we know that specific test still needs reviewing?

Peque · 2018-02-26T13:25:00Z

@ocaballeror I fear the skip_for_now name... 😂 (you know, sometimes "temporary" fixes end up being permanent...)

I like the idea of using more specific flags though. Maybe skip_windows_ipc and a generic skip_windows_broken for (currently) unknown problems.

Peque · 2018-02-26T15:42:54Z

osbrain/tests/common.py

+skip_on_windows = pytest.mark.skipif(sys.platform == 'win32',
+                                     reason='Not supported on windows')
+skip_windows_addr = pytest.mark.skipif(sys.platform == 'win32',
+                                       reason='Windows allows port reuse')


Just for consistency I'd name both skip_on_windows... or skip_windows... (it is easier to grep).

Peque · 2018-02-26T15:44:45Z

osbrain/tests/test_agent.py



+@pytest.mark.skipif(sys.platform == 'win32',
+                    reason='Windows allows binding to any port')


Even if only used once, I'd declare all Windows skips the same way (with a @skip... decorator). Will make finding all Windows skips easier.

Peque · 2018-02-26T15:45:43Z

osbrain/tests/test_agent_transport.py

+        osbrain.config['TRANSPORT'] = 'ipc'
+        agent = run_agent('a2')
+        address = agent.bind('PUSH')
+        assert address.transport == 'ipc'


You forgot to change this (i.e.: create 2 separate tests). 😉

Peque · 2018-02-26T15:48:54Z

osbrain/tests/test_agent_transport.py

+        address = agent.bind('PUSH', addr=ipc_addr, transport='ipc')
+        assert address.transport == 'ipc'
+        assert address.address.name == ipc_addr
+


I still think we should split this in 2.

Peque · 2018-02-26T15:49:42Z

osbrain/tests/test_agent_transport.py

    assert address.address == tcp_addr


+@skip_on_windows


We could use skip_on_windows_ipc to be more clear.

Peque · 2018-02-26T15:50:14Z

osbrain/tests/test_nameserver.py



+@pytest.mark.skipif(sys.platform == 'win32',
+                    reason='Windows allows binding to any port')


Rather use @skip... everywhere.

Peque · 2018-02-26T15:51:31Z

osbrain/tests/test_agent_ipc_sockets.py

 from common import nsproxy  # pragma: no flakes

+pytestmark = pytest.mark.skipif(sys.platform == 'win32',
+                                reason='IPC not available on Windows')


Does pytestmark = skip_windows_ipc work? (importing the corresponding skip_windows_ipc definition).

This way we put all pytest.mark.skipif declarations together.

It does. Changing it now.

ocaballeror · 2018-02-26T17:21:04Z

Added a bunch more @skip_windows indicators, to be able to differentiate between all the roadblocks we're running into.

Peque

Just a couple of comments. Would you mind rebasing and squashing before next (and probably last) review?

Peque · 2018-02-27T08:25:40Z

osbrain/tests/test_nameserver.py

    with pytest.raises(RuntimeError) as error:
        run_nameserver(nsproxy.addr())
    assert 'OSError' in str(error.value)
-    assert 'Address already in use' in str(error.value)


You removed this line but now that we are skipping for Windows, it should be there.

Peque · 2018-02-27T08:27:17Z

osbrain/tests/test_agent.py

                addr=address.address, transport='tcp')

-    assert should_receive == wait_agent_attr(puller, data='foo', timeout=.2)
+    assert should_receive == wait_agent_attr(puller, data='foo', timeout=1)


I guess this timeout does not need to be changed for now as this test is skipped for Windows. When/if we remove the skip we will change the timeout if necessary.

But this one does fail on Linux from time to time. I ran it in a loop yesterday, and it took me anywhere between 50 and 500 attempts to make it fail, but eventually it does.

It has failed a few times in travis:
https://travis-ci.org/ocaballeror/osbrain/jobs/346269407

I think 1 is a safer value.

@ocaballeror Then ok. It seems changing the timeout does not actually change the test, will only make it a bit slower for the cases where should_receive == False, so it is fine for me. 😊

Should we put it as a separate commit? i.e.: "More lenient timeout in linger test". Simply because it is not really related to Windows tests.

Makes sense 👍

Rebasing this is going to be fun....

Peque · 2018-02-27T08:39:32Z

osbrain/tests/test_agent.py

    with pytest.raises(RuntimeError) as error:
        run_agent('a0', nsaddr=nsproxy.addr(), addr=nsproxy.addr())
    assert 'OSError' in str(error.value)
-    assert 'Address already in use' in str(error.value)


You removed this line but now that we are skipping for Windows, it should be there.

Peque · 2018-02-27T08:44:50Z

osbrain/tests/test_proxy.py

    assert not wayne._next_oneway

-    assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.2)
+    assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.5)


It still fails if set back to 1.2?

Peque

Looks good to me. 😊 👍

ocaballeror · 2018-02-27T13:31:41Z

Finally 😄

This took way longer than I expected

ocaballeror requested a review from Peque February 20, 2018 14:50

ocaballeror mentioned this pull request Feb 20, 2018

Fix IPC tests and documentation #147

Closed

Peque added this to the 0.7.0 milestone Feb 20, 2018

Peque requested changes Feb 26, 2018

View reviewed changes

Peque reviewed Feb 26, 2018

View reviewed changes

ocaballeror requested a review from Peque February 26, 2018 14:47

Peque requested changes Feb 26, 2018

View reviewed changes

ocaballeror requested a review from Peque February 26, 2018 17:21

Peque requested changes Feb 27, 2018

View reviewed changes

ocaballeror force-pushed the windows_ipc branch from 4d4cf05 to ae1781b Compare February 27, 2018 12:38

ocaballeror requested a review from Peque February 27, 2018 12:55

Peque approved these changes Feb 27, 2018

View reviewed changes

This was referenced Feb 27, 2018

Do not use IPC by default for Windows systems #146

Closed

test_linger[2-1-True] #250

Closed

ocaballeror mentioned this pull request Feb 27, 2018

Add linger parameter to .close() and .close_all() #255

Open

Peque and others added 3 commits February 27, 2018 14:41

Add AppVeyor YAML configuration file

d99a0a3

Update docs

b1f4f28

Make tests compatible with windows

b2c603f

Peque and others added 3 commits February 27, 2018 14:41

Make cloudpickle default internal serializer

f799236

Internally serialize base class to avoid pickling errors

e5e3c4d

More lenient timeouts on specific tests

f6482ec

ocaballeror force-pushed the windows_ipc branch from ae1781b to f6482ec Compare February 27, 2018 13:42

ocaballeror merged commit f6482ec into opensistemas-hub:master Feb 27, 2018

ocaballeror deleted the windows_ipc branch March 23, 2018 08:42



		@pytest.mark.skipif(sys.platform == 'win32',
		reason='Windows allows binding to any port')

Conversation

ocaballeror commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ocaballeror commented Feb 20, 2018

Uh oh!

Peque commented Feb 20, 2018

Uh oh!

ocaballeror commented Feb 20, 2018

Uh oh!

ocaballeror commented Feb 22, 2018

Uh oh!

ocaballeror commented Feb 23, 2018

Uh oh!

ocaballeror commented Feb 23, 2018

Uh oh!

codecov bot commented Feb 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Peque Feb 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Peque Feb 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Peque Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ocaballeror commented Feb 20, 2018 •

edited

Loading

codecov bot commented Feb 23, 2018 •

edited

Loading

Peque Feb 26, 2018 •

edited

Loading

Peque Feb 26, 2018 •

edited

Loading

Peque Feb 27, 2018 •

edited

Loading

ocaballeror commented Feb 26, 2018 •

edited

Loading