Skip to content

Make tests compatible with windows#252

Merged
ocaballeror merged 6 commits intoopensistemas-hub:masterfrom
ocaballeror:windows_ipc
Feb 27, 2018
Merged

Make tests compatible with windows#252
ocaballeror merged 6 commits intoopensistemas-hub:masterfrom
ocaballeror:windows_ipc

Conversation

@ocaballeror
Copy link
Contributor

@ocaballeror ocaballeror commented Feb 20, 2018

Continuing the discussion from #147 .

Also, just to be clear, I don't intend to merge this yet, so there may be some weird comments and print statements around the code.

@ocaballeror ocaballeror requested a review from Peque February 20, 2018 14:50
@Peque Peque added this to the 0.7.0 milestone Feb 20, 2018
@ocaballeror
Copy link
Contributor Author

Also, it turns out many tests were failing due to timeouts because we were trying to run 8 parallel threads when the appveyor machine has only 2 cores. Using pytest -n2 works much better 👍

@Peque
Copy link
Member

Peque commented Feb 20, 2018

@ocaballeror Yeah, we could even avoid using pytest-xdist in AppVeyor. Historically it has given some problems in Travis, but at least it sped things up. If we are not going to notice the speed-up much, then we might as well avoid using it.

I'd say we could even, for AppVeyor, run tests in 1 thread (i.e.: no pytest-xdist) and only against the latest Python version.

@ocaballeror
Copy link
Contributor Author

@Peque I've run it a few times with 2 threads and it seems to work fine. We could leave it with 2 and revert to single-threaded mode if we run into any problems in the future. After all, if we have extra cores, why not use them?

About the python versions, though, why test only the latest one? There could be some implementation differences from version to version that suddenly break compatibility, so I'd leave it as it is right now just to be safe.

@ocaballeror
Copy link
Contributor Author

Update: Still working on the linger tests. There must be something that Windows is doing differently in the internal implementation of sockets and the SO_LINGER option, but I'm having a hard time figuring out what it is.

I'll keep updating when I find more stuff.

@ocaballeror
Copy link
Contributor Author

Took me a long while to figure it out. In the end it turns out the problem with the linger tests had nothing to do with sockets, zmq or anything related to that. The problem was actually being caused by the way we changed the linger value in every test. We were updating the global config dictionary, and then expecting the .close_all() method to read the updated value, but that won't work on Windows because of the way it creates new processes. On Unix the fork() system called is used to create the agent processes, but since that is not available on Windows, and the new processes are created using spawn(), we can't expect the global variable config to be copied to child processes.

In the end, the call to get_linger() inside close_all() was returning the default value of 1000 for linger instead of the updated one. I changed this so ,close_all() accepts a linger parameter.

@ocaballeror
Copy link
Contributor Author

By the way, there are 2 other tests currently failing and they are both related to Windows allowing more than one socket to listen on one address.

I will try to implement some kind of check to stop this from happening and manually raise an error if the user tries to do that. @Peque or should I just skip these tests on Windows and kindly advise people against this sort of practice?

@codecov
Copy link

codecov bot commented Feb 23, 2018

Codecov Report

Merging #252 into master will decrease coverage by 0.05%.
The diff coverage is 96.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #252      +/-   ##
==========================================
- Coverage   99.08%   99.03%   -0.06%     
==========================================
  Files          26       26              
  Lines        3503     3528      +25     
  Branches      250      251       +1     
==========================================
+ Hits         3471     3494      +23     
- Misses         20       21       +1     
- Partials       12       13       +1
Impacted Files Coverage Δ
osbrain/tests/test_proxy.py 100% <100%> (ø) ⬆️
osbrain/tests/test_agent_ipc_sockets.py 100% <100%> (ø) ⬆️
osbrain/__init__.py 100% <100%> (ø) ⬆️
osbrain/nameserver.py 99.21% <100%> (+0.01%) ⬆️
osbrain/agent.py 97.7% <100%> (ø) ⬆️
osbrain/tests/common.py 92.85% <100%> (+2.38%) ⬆️
osbrain/tests/test_agent.py 100% <100%> (ø) ⬆️
osbrain/tests/test_nameserver.py 98.11% <100%> (+0.01%) ⬆️
osbrain/tests/test_agent_transport.py 93.97% <88.88%> (-2.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0d96cc...f6482ec. Read the comment docs.

self.addr = SocketAddress(self.host, self.port)
print("self.host:", self.host)
print("self.port:", self.port)
print("self.addr:", self.addr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this prints should be removed.

return True
time.sleep(.1)

return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is declared but it seems it is not used.

If you want to use it maybe we could merge #251 first (your review is pending), which implements a wait_condition() function with tests and an option to negate the condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely forgot about this. Yes, it should be removed.

No need to merge the other PR, I'm not even using this function.

return False

skip_on_windows = pytest.mark.skipif(sys.platform == 'win32',
reason='Not supported on windows')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather put this between the imports and the first function, if you agree.

All agent identifiers should be unique strings.
"""
N = 1000
N = 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Is it too slow on Windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I don't understand why, but for some reason, it takes 12-13 seconds to complete on Windows and about 0.2 seconds on Linux. I thought N=100 would be enough for this test anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew Windows processes were slow, but did not know they were that slow. The main difference is the way they are created (spawn vs. fork). Forking is definitely much faster, specially with the copy-on-write implemented in Linux.

Leave it with 100 then. 👍

osbrain.config['TRANSPORT'] = 'ipc'
agent = run_agent('a2')
address = agent.bind('PUSH')
assert address.transport == 'ipc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should probably be a test for global transport configuration, in example:

def test_agent_bind_transport_platform_default(nsproxy):
    """
    Default transport is platform-dependent.
    """
    agent = run_agent('a0')
    address = agent.bind('PUSH')
    if os.name == 'posix':
        assert address.transport == 'ipc'
    else:
        assert address.transport == 'tcp'

And then another for the global setting:

def test_agent_bind_transport_global(nsproxy):
    """
    Test global default transport change.
    """
    # Default transport is not `inproc`
    agent = run_agent('a0')
    address = agent.bind('PUSH')
    assert address.transport != 'inproc'

    # Changing default global transport to `inproc`
    osbrain.config['TRANSPORT'] = 'inproc'
    agent = run_agent('a1')
    address = agent.bind('PUSH')
    assert address.transport == 'inproc'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reminds me that changing values in the global config does not work on windows, for the same reason I explained earlier when I was talking about the linger thing.
When calling run_agent() a new process will be created, but because of the way Windows creates child processes, the values in config won't be the ones we just set, but rather the default ones.

I'll add these two tests and try to find a way to make the second one work on Windows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mention bellow, I'd leave that until we merge this PR. Just having AppVeyor integrated, even if we are skipping some tests in Windows, is worth it to merge. Then we can see how to gradually reduce the number of skipped tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to change this (i.e.: create 2 separate tests). 😉

osbrain/agent.py Outdated
for sock in self._get_unique_external_zmq_sockets():
sockets_to_delete.append(sock)
sock.close(linger=get_linger())
sock.close(linger)
Copy link
Member

@Peque Peque Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the expected behavior.

Do you know when that get_linger() gets executed?

In my opinion we should leave it as it was before and skip if Windows fails, then create a new issue to investigate this problem and/or a workaround (i.e.: serializing the global configuration to child processes?). But for now it is easier if we simply skip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not the expected behavior? I don't understand what you mean. The value of linger can be passed as an optional parameter, and if it isn't, get_linger() is called, as it was before. How does this differ from the expected behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conf = dict(linger=1)


def get_linger():
    return conf['linger']


def default_conf(linger=get_linger()):
    return linger


if __name__ == '__main__':
    print(default_conf())
    conf['linger'] = 2
    print(default_conf())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I didn't know that happened. What about something like this:

conf = dict(linger=1)


def get_linger():
    return conf['linger']


def default_conf(linger=None):
    if linger is None:
        linger = get_linger()
    return linger


if __name__ == '__main__':
    print(default_conf())
    conf['linger'] = 2
    print(default_conf())

Copy link
Member

@Peque Peque Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, although that should be in a separate PR. Basically you'd be adding support for passing a linger parameter to .close_all(). That would require new tests. If you want to implement this we should also probably add the same parameter to .close() (with tests again).

As it is a non-trivial change, I would not add it here. Let us for now keep this PR only for making AppVeyor succeed (even if skipping some tests for now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will skip the linger tests as well, then.

with pytest.raises(TimeoutError):
locate_ns('127.0.0.1:1234', timeout=timeout)
assert timeout <= time.time() - t0 <= timeout + 1.
assert timeout <= time.time() - t0 <= timeout + 2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this failing with the previous timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least on my machine, yes. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out 1.2 is enough, but with just 1, the test will fail everytime, both on my machine and appveyor

assert not wayne._next_oneway

assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.2)
assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this failing with the previous timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe at some point, I wouldn't have changed it otherwise. I'll run it through appveyor again, maybe it needed longer timeouts before, since we were using 8 threads instead of 2.

Copy link
Member

@Peque Peque Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still fails if set back to 1.2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I ran it in a loop to see if it would eventually fail, and after 100-200 tests I've gotten it to fail a couple of times. 1.5 looks like a safer value.

if sys.platform != 'win32':
with pytest.raises(RuntimeError):
random_nameserver_process(port_start=22,
port_stop=22, timeout=0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we simply skip this test on Windows for now and revert the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about random_nameserver_process() then? Shouldn't we test it works?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now let us just skip those failing tests. We will fix them later, as separate PRs, analyzing the situation for each case.

address = agent.bind('PUSH', addr=ipc_addr, transport='ipc')
assert address.transport == 'ipc'
assert address.address.name == ipc_addr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we split this test in two: one for IPC, one for TCP. Add then a skip-on-windows mark for the IPC test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should split this in 2.

@Peque
Copy link
Member

Peque commented Feb 26, 2018

@ocaballeror I would say, for now, we just skip failing tests for Windows. Then we can create specific issues for those fails and decide whether we should fix them or just assume they should fail in Windows.

Also, that way we keep this PR a bit simpler and merge it faster (even if it is not perfect, it is definitely a step forward that we want to include in our code base). 😊

osbrain/agent.py Outdated
reuse = Pyro4.config.SOCK_REUSE
Pyro4.config.SOCK_REUSE = False
self._daemon = Pyro4.Daemon(self._host, self.port)
Pyro4.config.SOCK_REUSE = reuse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Pyro4.config.SOCK_REUSE defaults to False. Anyway maybe we could fix this later, as a different issue, and try to merge what you have got now, which is already an improvement over what we had before. 😊

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It defaults to True. And for some reason (I haven't investigated much), everything fails spectacularly if the default value is changed to False in __init__.py.

But OK, I'll revert these two changes and skip the tests that rely on OS errors being raised from binding to addresses in use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the documentation is wrong then... Can you open a PR in Pyro4 to fix the documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@ocaballeror
Copy link
Contributor Author

ocaballeror commented Feb 26, 2018

@Peque Maybe I should create another flag? Something like skip_for_now so we know that specific test still needs reviewing?

@Peque
Copy link
Member

Peque commented Feb 26, 2018

@ocaballeror I fear the skip_for_now name... 😂 (you know, sometimes "temporary" fixes end up being permanent...)

I like the idea of using more specific flags though. Maybe skip_windows_ipc and a generic skip_windows_broken for (currently) unknown problems.

@ocaballeror ocaballeror requested a review from Peque February 26, 2018 14:47
skip_on_windows = pytest.mark.skipif(sys.platform == 'win32',
reason='Not supported on windows')
skip_windows_addr = pytest.mark.skipif(sys.platform == 'win32',
reason='Windows allows port reuse')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for consistency I'd name both skip_on_windows... or skip_windows... (it is easier to grep).



@pytest.mark.skipif(sys.platform == 'win32',
reason='Windows allows binding to any port')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if only used once, I'd declare all Windows skips the same way (with a @skip... decorator). Will make finding all Windows skips easier.

osbrain.config['TRANSPORT'] = 'ipc'
agent = run_agent('a2')
address = agent.bind('PUSH')
assert address.transport == 'ipc'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to change this (i.e.: create 2 separate tests). 😉

address = agent.bind('PUSH', addr=ipc_addr, transport='ipc')
assert address.transport == 'ipc'
assert address.address.name == ipc_addr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should split this in 2.

assert address.address == tcp_addr


@skip_on_windows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use skip_on_windows_ipc to be more clear.



@pytest.mark.skipif(sys.platform == 'win32',
reason='Windows allows binding to any port')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather use @skip... everywhere.

from common import nsproxy # pragma: no flakes

pytestmark = pytest.mark.skipif(sys.platform == 'win32',
reason='IPC not available on Windows')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does pytestmark = skip_windows_ipc work? (importing the corresponding skip_windows_ipc definition).

This way we put all pytest.mark.skipif declarations together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. Changing it now.

@ocaballeror
Copy link
Contributor Author

Added a bunch more @skip_windows indicators, to be able to differentiate between all the roadblocks we're running into.

@ocaballeror ocaballeror requested a review from Peque February 26, 2018 17:21
Copy link
Member

@Peque Peque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of comments. Would you mind rebasing and squashing before next (and probably last) review?

with pytest.raises(RuntimeError) as error:
run_nameserver(nsproxy.addr())
assert 'OSError' in str(error.value)
assert 'Address already in use' in str(error.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You removed this line but now that we are skipping for Windows, it should be there.

addr=address.address, transport='tcp')

assert should_receive == wait_agent_attr(puller, data='foo', timeout=.2)
assert should_receive == wait_agent_attr(puller, data='foo', timeout=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this timeout does not need to be changed for now as this test is skipped for Windows. When/if we remove the skip we will change the timeout if necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this one does fail on Linux from time to time. I ran it in a loop yesterday, and it took me anywhere between 50 and 500 attempts to make it fail, but eventually it does.

It has failed a few times in travis:
https://travis-ci.org/ocaballeror/osbrain/jobs/346269407

I think 1 is a safer value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocaballeror Then ok. It seems changing the timeout does not actually change the test, will only make it a bit slower for the cases where should_receive == False, so it is fine for me. 😊

Should we put it as a separate commit? i.e.: "More lenient timeout in linger test". Simply because it is not really related to Windows tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍

Rebasing this is going to be fun....

with pytest.raises(RuntimeError) as error:
run_agent('a0', nsaddr=nsproxy.addr(), addr=nsproxy.addr())
assert 'OSError' in str(error.value)
assert 'Address already in use' in str(error.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You removed this line but now that we are skipping for Windows, it should be there.

assert not wayne._next_oneway

assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.2)
assert wait_agent_attr(wayne, value=20*['bang!'], timeout=1.5)
Copy link
Member

@Peque Peque Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still fails if set back to 1.2?

Copy link
Member

@Peque Peque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. 😊 👍

@ocaballeror
Copy link
Contributor Author

Finally 😄

This took way longer than I expected

@ocaballeror ocaballeror merged commit f6482ec into opensistemas-hub:master Feb 27, 2018
@ocaballeror ocaballeror deleted the windows_ipc branch March 23, 2018 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants