-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DATA RACE: Disconnect() #550
Comments
I've closed off a previous issue (#536) that has the same cause. As per the comments on that issue its something I was aware of (there are a few other similar issues) - reusing the client after calling Disconnect is not fully threadsafe (for example if a connection attempt is in progress Note: The above makes some assumptions in regards to how you are using the library (calling Happy to accept a pull request fixing this. However I don't think it will be a simple fix; due to the way this library has evolved its difficult to maintain thread safety (really easy to introduce unintended deadlocks) and the way client.status is used is somewhat problematic (for historical reasons). Matt |
It seems this can happen without incorrect library usage if there is a reconnection. Specifically, I try to call Disconnect() when behind the scenes a reconnection happened.
|
Ok i had a look that the code and it is very tricky. We have a mutex(connMu) protecting concurrent writing I believe the issue is that we use close(c.commsStopped) to unblock both |
You are right - the code is quite tricky (which is why I have been avoiding making changes - its very easy to break something and addressing this properly will not be a quick job). A lot of this is due to the way this library has evolved and attempting to maintain backwards compatibility. I don't use I think that your analysis is correct. In
Should a
Note that there is a check in place in
So the situation you are encountering should be a rare occurrence (unless I'm missing something). I think the following would need to happen:
I implemented the I guess one solution would be to remove the check of Ideally the code handling |
Thank you for your fast and thorough answer
Would you be open to accept refactoring PRs of this code done so it would become easier to reason? I found quite a lot of redundant code that makes certain races and execution flows less evident. Hopefully, I would then submit a reduction of concurrent code interaction. |
@ptsneves thanks for the offer - I'll happily accept refactoring PRs that improve readability/resilience so long as they don't break existing usages. Note that it may take me a while to accept such PR's (review/testing often takes longer than implementation). |
The error is always reproducible if running |
@ptsneves let me know if you need some help with refactoring |
After understanding the code better and specifically the Disconnect() method I find that just removing the following line fixes the issue without any real adverse effects:
The only downside is that if the comms have stopped we will not know it and Disconnect will take at most the quiesce time. As this is a library user caller argument, the caller may just set the quiesce to 0 if it wants. Otherwise, that line just waits and gives a print. Would you accept a pull request to remove that line? Do you even so want the check to happen if AutoReconnect is false? |
I will, with reservations, accept a pull request that comments this out (with the comment that it results in a data race when autoreconnect is enabled). At some point we will need to fix the way status is handled and at that point this warning (and skipping the delay) should be re-introduced. Being able to tell if the disconnect packet was sent may be important in some use-cases (if the broker receives this it will discard any Will Message for the current session). |
@MattBrittan Thank you. Honestly, I have a big refactoring and fat-cutting patch privately, but this issue does not seem solvable without ugly synchronization primitives for the commsStopped channel. |
the data race seems tobe happen in |
@bokunodev If setting an |
I've written a low level implementation without use of the Go runtime (no mutexes, channels, allocations, interface casting/asserting) at natiu-mqtt. Hopefully it can inspire a future v2 for this library? |
@soypat - see the v5 client for the effective v2 of this client. Currently that only supports v5 but the reality is that most brokers support v5 (and a v3 implementation in the same form would be relatively simple). I think there is room for range of approaches (a lot of users want something that handles all of the work for them, i..e auto reconnection etc, whereas others want full control over memory usage and maximum efficiency) and it's great to see another Go option! I believe that this particular issue is probably fixed in the latest release (which significantly changes the way the connection state is managed). I don't expect there to be major changes to this library going forward (once persistence is implemented in the V5 client I'll probably move my systems over to that). |
My take on this is that I want both! When I started using this library last week I was frustrated with the bugs it had. When I tried to dive deep into the code to see what the problem was I was bewildered by how tightly coupled everything was. There was really no way of navigating the project in a reasonable time. So I built my own! Natiu as it stands is low level, but that's fine- anyone can build an extension on top of it to add, for example, a websocket implementation with autoreconnect features and all the bells and whistles. This way it is easier to maintain the original implementation, as it's simple and lightweight. This in turn makes it easy to implement extensions for it by third parties- it's a win win! There's other succesful projects that follow this methodology, like goldmark- by doing so you really cut down on cost of maintenance. Goldmark today has 180 closed issues, 1 open issue. Suggest you try it out in one of your next projects! 😄 |
I'm going to close this off as the largish change to status handling should have resolved the issue (and a few similar problems) and this issue has been quiet for a while. If anyone does come across a similar problem them please raise a new issue (there have been significant changes to the code so comments in this issue are no longer relevant). |
Version: v1.3.5
The text was updated successfully, but these errors were encountered: