Make tests compatible with windows #252
|
Also, it turns out many tests were failing due to timeouts because we were trying to run 8 parallel threads when the appveyor machine has only 2 cores. Using |
|
@ocaballeror Yeah, we could even avoid using I'd say we could even, for AppVeyor, run tests in 1 thread (i.e.: no |
|
@Peque I've run it a few times with 2 threads and it seems to work fine. We could leave it with 2 and revert to single-threaded mode if we run into any problems in the future. After all, if we have extra cores, why not use them? About the Python versions, though, why test only the latest one? There could be some implementation differences from version to version that suddenly break compatibility, so I'd leave it as it is right now just to be safe. |
|
Update: Still working on the linger tests. There must be something that Windows is doing differently in the internal implementation of sockets and the SO_LINGER option, but I'm having a hard time figuring out what it is. I'll keep updating when I find more stuff. |
|
Took me a long while to figure it out. In the end it turns out the problem with the linger tests had nothing to do with sockets, zmq or anything related to that. The problem was actually being caused by the way we changed the linger value in every test. We were updating the global In the end, the call to |
|
By the way, there are 2 other tests currently failing and they are both related to Windows allowing more than one socket to listen on one address. I will try to implement some kind of check to stop this from happening and manually raise an error if the user tries to do that. @Peque or should I just skip these tests on Windows and kindly advise people against this sort of practice? |
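For reference, a minimal sketch of the kind of check mentioned above: probe the address before binding and raise manually if something already answers there. The names `address_in_use` and `bind_exclusive` are hypothetical (they are not part of osbrain), the sketch covers plain TCP only, and it is inherently racy, since another process may bind between the probe and the `bind()` call.

```python
import socket


def address_in_use(host, port, timeout=0.5):
    """Return True if something already accepts TCP connections there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex() returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


def bind_exclusive(host, port):
    """Bind a listening socket, refusing addresses that are already served."""
    if address_in_use(host, port):
        raise OSError('Address already in use: %s:%s' % (host, port))
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((host, port))
        server.listen()
    except OSError:
        server.close()
        raise
    return server
```

A check like this would make the behaviour the same on every platform, at the cost of one extra connection attempt per bind.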
Codecov Report
@@            Coverage Diff             @@
##           master     #252      +/-   ##
==========================================
- Coverage   99.08%   99.03%   -0.06%
==========================================
  Files          26       26
  Lines        3503     3528      +25
  Branches      250      251       +1
==========================================
+ Hits         3471     3494      +23
- Misses         20       21       +1
- Partials       12       13       +1
|
osbrain/nameserver.py
Outdated
| self.addr = SocketAddress(self.host, self.port) | ||
| print("self.host:", self.host) | ||
| print("self.port:", self.port) | ||
| print("self.addr:", self.addr) |
There was a problem hiding this comment.
I guess these prints should be removed.
osbrain/tests/common.py
Outdated
| return True | ||
| time.sleep(.1) | ||
|
|
||
| return False |
There was a problem hiding this comment.
This function is declared but it seems it is not used.
If you want to use it maybe we could merge #251 first (your review is pending), which implements a wait_condition() function with tests and an option to negate the condition.
There was a problem hiding this comment.
I completely forgot about this. Yes, it should be removed.
No need to merge the other PR, I'm not even using this function.
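For context, a polling helper of the kind discussed here could look roughly like the sketch below. The signature (`negate`, `timeout`, `interval`) is an assumption made for illustration and is not necessarily what #251 implements.

```python
import time


def wait_condition(condition, *args, negate=False, timeout=3.0, interval=0.1):
    """
    Poll `condition(*args)` until it holds (or, with `negate=True`, until it
    stops holding). Return True on success, False if `timeout` expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if bool(condition(*args)) != negate:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The condition is always checked at least once, so a condition that already holds returns immediately even with a very small timeout.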
osbrain/tests/common.py
Outdated
| return False | ||
|
|
||
| skip_on_windows = pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='Not supported on windows') |
There was a problem hiding this comment.
I'd rather put this between the imports and the first function, if you agree.
| All agent identifiers should be unique strings. | ||
| """ | ||
| - N = 1000 | ||
| + N = 100 |
There was a problem hiding this comment.
Why this change? Is it too slow on Windows?
There was a problem hiding this comment.
Yes. I don't understand why, but for some reason, it takes 12-13 seconds to complete on Windows and about 0.2 seconds on Linux. I thought N=100 would be enough for this test anyway
There was a problem hiding this comment.
I knew Windows processes were slow, but did not know they were that slow. The main difference is the way they are created (spawn vs. fork). Forking is definitely much faster, especially with the copy-on-write implemented in Linux.
Leave it with 100 then. 👍
| osbrain.config['TRANSPORT'] = 'ipc' | ||
| agent = run_agent('a2') | ||
| address = agent.bind('PUSH') | ||
| assert address.transport == 'ipc' |
There was a problem hiding this comment.
I think there should probably be a test for global transport configuration, for example:
def test_agent_bind_transport_platform_default(nsproxy):
    """
    Default transport is platform-dependent.
    """
    agent = run_agent('a0')
    address = agent.bind('PUSH')
    if os.name == 'posix':
        assert address.transport == 'ipc'
    else:
        assert address.transport == 'tcp'
And then another for the global setting:
def test_agent_bind_transport_global(nsproxy):
    """
    Test global default transport change.
    """
    # Default transport is not `inproc`
    agent = run_agent('a0')
    address = agent.bind('PUSH')
    assert address.transport != 'inproc'
    # Changing default global transport to `inproc`
    osbrain.config['TRANSPORT'] = 'inproc'
    agent = run_agent('a1')
    address = agent.bind('PUSH')
    assert address.transport == 'inproc'
There was a problem hiding this comment.
That reminds me that changing values in the global config does not work on windows, for the same reason I explained earlier when I was talking about the linger thing.
When calling run_agent() a new process will be created, but because of the way Windows creates child processes, the values in config won't be the ones we just set, but rather the default ones.
I'll add these two tests and try to find a way to make the second one work on Windows.
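To illustrate the effect described above, here is a minimal sketch. It does not use osbrain or `multiprocessing`; instead it starts a fresh interpreter with `subprocess` as a stand-in for a spawned child, which re-imports modules instead of inheriting the parent's memory. The module name `demo_settings` and its contents are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical settings module, standing in for a module-level config dict.
SETTINGS_SOURCE = "config = {'TRANSPORT': 'ipc'}\n"
CHILD_CODE = "import demo_settings; print(demo_settings.config['TRANSPORT'])"


def transport_seen_by_new_interpreter():
    """Change the setting in this process, then ask a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, 'demo_settings.py'), 'w') as handle:
            handle.write(SETTINGS_SOURCE)
        sys.path.insert(0, tmp)
        try:
            import demo_settings
            # This change only exists in the memory of the parent process.
            demo_settings.config['TRANSPORT'] = 'inproc'
            parent_value = demo_settings.config['TRANSPORT']
        finally:
            sys.path.remove(tmp)
            sys.modules.pop('demo_settings', None)
        # The child imports the module again and gets the default value.
        env = dict(os.environ, PYTHONPATH=tmp)
        child_value = subprocess.check_output(
            [sys.executable, '-c', CHILD_CODE], env=env, text=True).strip()
    return parent_value, child_value
```

With a forked child the modified dictionary would be inherited, which is presumably why the same tests pass on Linux.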
There was a problem hiding this comment.
As I mention below, I'd leave that until we merge this PR. Just having AppVeyor integrated, even if we are skipping some tests on Windows, is worth merging. Then we can see how to gradually reduce the number of skipped tests.
There was a problem hiding this comment.
You forgot to change this (i.e.: create 2 separate tests). 😉
osbrain/agent.py
Outdated
| for sock in self._get_unique_external_zmq_sockets(): | ||
| sockets_to_delete.append(sock) | ||
| - sock.close(linger=get_linger()) | ||
| + sock.close(linger) |
There was a problem hiding this comment.
This is not the expected behavior.
Do you know when that get_linger() gets executed?
In my opinion we should leave it as it was before and skip the test if it fails on Windows, then create a new issue to investigate this problem and/or a workaround (e.g.: serializing the global configuration to child processes?). But for now it is easier if we simply skip.
There was a problem hiding this comment.
Why is it not the expected behavior? I don't understand what you mean. The value of linger can be passed as an optional parameter, and if it isn't, get_linger() is called, as it was before. How does this differ from the expected behavior?
There was a problem hiding this comment.
conf = dict(linger=1)

def get_linger():
    return conf['linger']

# The default value is evaluated once, when `def` runs, not on every call
def default_conf(linger=get_linger()):
    return linger

if __name__ == '__main__':
    print(default_conf())  # prints 1
    conf['linger'] = 2
    print(default_conf())  # still prints 1
There was a problem hiding this comment.
Oh. I didn't know that happened. What about something like this:
conf = dict(linger=1)

def get_linger():
    return conf['linger']

# With a `None` sentinel, the current value is looked up on every call
def default_conf(linger=None):
    if linger is None:
        linger = get_linger()
    return linger

if __name__ == '__main__':
    print(default_conf())  # prints 1
    conf['linger'] = 2
    print(default_conf())  # prints 2
There was a problem hiding this comment.
Yeah, although that should be in a separate PR. Basically you'd be adding support for passing a linger parameter to .close_all(). That would require new tests. If you want to implement this we should also probably add the same parameter to .close() (with tests again).
As it is a non-trivial change, I would not add it here. Let us for now keep this PR only for making AppVeyor succeed (even if skipping some tests for now).
There was a problem hiding this comment.
Ok. I will skip the linger tests as well, then.
osbrain/tests/test_proxy.py
Outdated
| with pytest.raises(TimeoutError): | ||
| locate_ns('127.0.0.1:1234', timeout=timeout) | ||
| - assert timeout <= time.time() - t0 <= timeout + 1. | ||
| + assert timeout <= time.time() - t0 <= timeout + 2. |
There was a problem hiding this comment.
Was this failing with the previous timeout?
There was a problem hiding this comment.
At least on my machine, yes. 😄
There was a problem hiding this comment.
Turns out 1.2 is enough, but with just 1 the test fails every time, both on my machine and on AppVeyor.
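The assertion under discussion checks that the call gives up no earlier than `timeout` and no later than `timeout` plus some margin for scheduling overhead. A minimal sketch of that pattern with the margin as an explicit parameter follows; `assert_times_out` is a hypothetical helper, not part of the test suite, and it uses `time.monotonic()` rather than `time.time()` so that clock adjustments do not affect the measurement.

```python
import time


def assert_times_out(func, timeout, margin):
    """Check that `func(timeout)` raises TimeoutError within the window."""
    start = time.monotonic()
    try:
        func(timeout)
    except TimeoutError:
        elapsed = time.monotonic() - start
    else:
        raise AssertionError('TimeoutError was not raised')
    # Lower bound: gave up too early. Upper bound: took too long to give up.
    assert timeout <= elapsed <= timeout + margin, elapsed
    return elapsed
```

Keeping the margin in one place would make it easier to relax it for slower platforms later.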
| assert not wayne._next_oneway | ||
|
|
||
| - assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.2) | ||
| + assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.5) |
There was a problem hiding this comment.
Was this failing with the previous timeout?
There was a problem hiding this comment.
Maybe at some point; I wouldn't have changed it otherwise. I'll run it through AppVeyor again. Maybe it needed longer timeouts before, since we were using 8 threads instead of 2.
There was a problem hiding this comment.
Does it still fail if set back to 1.2?
There was a problem hiding this comment.
Same as above, I ran it in a loop to see if it would eventually fail, and after 100-200 tests I've gotten it to fail a couple of times. 1.5 looks like a safer value.
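Running a test in a loop to expose an intermittent failure, as described above, can be sketched as follows. `count_failures` is a hypothetical helper and `flaky` is a deterministic stand-in for a timing-sensitive test.

```python
def count_failures(test, runs=200):
    """Run `test` repeatedly and count how many runs raise AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    return failures


# Stand-in for a flaky test: fails on every 50th call.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    assert calls['n'] % 50 != 0


failures = count_failures(flaky, runs=200)
```

A non-zero count after a few hundred runs is a reasonable hint that a timeout is too tight, even if single runs usually pass.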
osbrain/tests/test_nameserver.py
Outdated
| if sys.platform != 'win32': | ||
| with pytest.raises(RuntimeError): | ||
| random_nameserver_process(port_start=22, | ||
| port_stop=22, timeout=0.5) |
There was a problem hiding this comment.
I'd say we simply skip this test on Windows for now and revert the changes.
There was a problem hiding this comment.
What about random_nameserver_process() then? Shouldn't we test it works?
There was a problem hiding this comment.
For now let us just skip those failing tests. We will fix them later, as separate PRs, analyzing the situation for each case.
| address = agent.bind('PUSH', addr=ipc_addr, transport='ipc') | ||
| assert address.transport == 'ipc' | ||
| assert address.address.name == ipc_addr | ||
|
|
There was a problem hiding this comment.
I'd say we split this test in two: one for IPC, one for TCP. Then add a skip-on-Windows mark for the IPC test.
There was a problem hiding this comment.
I still think we should split this in 2.
|
@ocaballeror I would say, for now, we just skip failing tests for Windows. Then we can create specific issues for those fails and decide whether we should fix them or just assume they should fail in Windows. Also, that way we keep this PR a bit simpler and merge it faster (even if it is not perfect, it is definitely a step forward that we want to include in our code base). 😊 |
osbrain/agent.py
Outdated
| reuse = Pyro4.config.SOCK_REUSE | ||
| Pyro4.config.SOCK_REUSE = False | ||
| self._daemon = Pyro4.Daemon(self._host, self.port) | ||
| Pyro4.config.SOCK_REUSE = reuse |
There was a problem hiding this comment.
I think Pyro4.config.SOCK_REUSE defaults to False. Anyway maybe we could fix this later, as a different issue, and try to merge what you have got now, which is already an improvement over what we had before. 😊
There was a problem hiding this comment.
It defaults to True. And for some reason (I haven't investigated much), everything fails spectacularly if the default value is changed to False in __init__.py.
But OK, I'll revert these two changes and skip the tests that rely on OS errors being raised from binding to addresses in use.
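If a temporary override like the one in this hunk is ever reintroduced, a context manager would guarantee that the previous value is restored even when creating the daemon raises. A minimal sketch, assuming the setting is a plain attribute; `temporary_setting` is a hypothetical helper and `fake_config` stands in for an object such as `Pyro4.config`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set `settings.<name>` to `value` inside the block, then restore it."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        setattr(settings, name, previous)


# Hypothetical stand-in for a configuration object.
fake_config = SimpleNamespace(SOCK_REUSE=True)

with temporary_setting(fake_config, 'SOCK_REUSE', False):
    inside = fake_config.SOCK_REUSE  # value seen inside the block
after = fake_config.SOCK_REUSE  # previous value is back
```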
There was a problem hiding this comment.
You are right, the documentation is wrong then... Can you open a PR in Pyro4 to fix the documentation?
|
@Peque Maybe I should create another flag? Something like |
|
@ocaballeror I fear the I like the idea of using more specific flags though. Maybe |
osbrain/tests/common.py
Outdated
| skip_on_windows = pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='Not supported on windows') | ||
| skip_windows_addr = pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='Windows allows port reuse') |
There was a problem hiding this comment.
Just for consistency I'd name both skip_on_windows... or skip_windows... (it is easier to grep).
osbrain/tests/test_agent.py
Outdated
|
|
||
|
|
||
| @pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='Windows allows binding to any port') |
There was a problem hiding this comment.
Even if only used once, I'd declare all Windows skips the same way (with a @skip... decorator). It will make finding all Windows skips easier.
| assert address.address == tcp_addr | ||
|
|
||
|
|
||
| @skip_on_windows |
There was a problem hiding this comment.
We could use skip_on_windows_ipc to be clearer.
osbrain/tests/test_nameserver.py
Outdated
|
|
||
|
|
||
| @pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='Windows allows binding to any port') |
| from common import nsproxy # pragma: no flakes | ||
|
|
||
| pytestmark = pytest.mark.skipif(sys.platform == 'win32', | ||
| reason='IPC not available on Windows') |
There was a problem hiding this comment.
Does pytestmark = skip_windows_ipc work? (importing the corresponding skip_windows_ipc definition).
This way we put all pytest.mark.skipif declarations together.
There was a problem hiding this comment.
It does. Changing it now.
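A sketch of what keeping all the Windows skips together could look like. `skip_windows_ipc` is the name used in this thread; `skip_windows_port_reuse`, `ON_WINDOWS` and the test function are hypothetical names chosen for illustration.

```python
import sys

import pytest

ON_WINDOWS = sys.platform == 'win32'

# Shared marks, meant to be declared once (for example in tests/common.py).
skip_windows_ipc = pytest.mark.skipif(
    ON_WINDOWS, reason='IPC not available on Windows')
skip_windows_port_reuse = pytest.mark.skipif(
    ON_WINDOWS, reason='Windows allows port reuse')

# In a module where every test needs IPC, a module-level `pytestmark`
# applies the mark to all tests collected from that module.
pytestmark = skip_windows_ipc


# For a single test, the same kind of object works as a decorator.
@skip_windows_port_reuse
def test_bind_address_in_use():
    pass  # the real test body would go here
```

With a common `skip_windows` prefix, a single grep finds every test that is currently skipped on Windows.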
|
Added a bunch more |
Peque left a comment
There was a problem hiding this comment.
Just a couple of comments. Would you mind rebasing and squashing before the next (and probably last) review?
| with pytest.raises(RuntimeError) as error: | ||
| run_nameserver(nsproxy.addr()) | ||
| assert 'OSError' in str(error.value) | ||
| assert 'Address already in use' in str(error.value) |
There was a problem hiding this comment.
You removed this line but now that we are skipping for Windows, it should be there.
| addr=address.address, transport='tcp') | ||
|
|
||
| - assert should_receive == wait_agent_attr(puller, data='foo', timeout=.2) | ||
| + assert should_receive == wait_agent_attr(puller, data='foo', timeout=1) |
There was a problem hiding this comment.
I guess this timeout does not need to be changed for now as this test is skipped for Windows. When/if we remove the skip we will change the timeout if necessary.
There was a problem hiding this comment.
But this one does fail on Linux from time to time. I ran it in a loop yesterday, and it took me anywhere between 50 and 500 attempts to make it fail, but eventually it does.
It has failed a few times on Travis:
https://travis-ci.org/ocaballeror/osbrain/jobs/346269407
I think 1 is a safer value.
There was a problem hiding this comment.
@ocaballeror Then ok. It seems changing the timeout does not actually change the test, will only make it a bit slower for the cases where should_receive == False, so it is fine for me. 😊
Should we put it as a separate commit? i.e.: "More lenient timeout in linger test". Simply because it is not really related to Windows tests.
There was a problem hiding this comment.
Makes sense 👍
Rebasing this is going to be fun....
| with pytest.raises(RuntimeError) as error: | ||
| run_agent('a0', nsaddr=nsproxy.addr(), addr=nsproxy.addr()) | ||
| assert 'OSError' in str(error.value) | ||
| assert 'Address already in use' in str(error.value) |
There was a problem hiding this comment.
You removed this line but now that we are skipping for Windows, it should be there.
4d4cf05 to ae1781b
|
Finally 😄 This took way longer than I expected |
ae1781b to f6482ec
Continuing the discussion from #147.
Also, just to be clear, I don't intend to merge this yet, so there may be some weird comments and print statements around the code.