Rearchitect status handling #605
Comments
This is a major change that moves handling of the status (e.g. disconnected, connecting) out to status.go. It also introduces the new status disconnecting. All existing tests (including 10000 runs through Test_DisconnectWhileProcessingIncomingPublish) pass and I have added new ones to specifically test the connectionStatus functionality. However this is a major change so bugs may have been introduced! The major aim of this change is to simplify future work. It has been difficult to implement changes/refactor because the status handling was fragile (and not fully thread safe in some instances).
Changes have been merged but I'll leave this issue open for any issues arising.
I've been running this live in four systems (some with very poor connections) for a couple of weeks without experiencing any issues. I will leave the issue open until after the next release, but it appears that the change might be mostly bug free (though I would not be surprised if there is an edge case I have missed).
Hi! We currently receive sporadic errors of the following kind when subscribing to topics: The issue here is that, if you tcpdump, there is never an interruption, a reconnect, or anything else happening while this error comes up. ResumeSubs is not set, as the error already correctly says, and in our case the application does this a lot:
Do you know any good ways to debug this further within the library? This is very sporadic and happens mostly under production workload, and very infrequently, but we would still like to see if we can track it down. I'm assuming right now that this might relate to the merged PR from August, since its description fits our symptoms. Thanks!
@peterhoneder looking at the code this can only happen if This would probably be better logged as a separate issue but would require a lot more information to be actionable (ideally code extracts and debug logs). This kind of issue can be very hard to track down (and is next to impossible without debug logs when I cannot duplicate the problem).
Closing this issue off as the changes to status handling appear to be working OK (without more details I'm unable to follow up on the one comment above).
Currently this package stores the connection status in a uint32 protected by a RWMutex:
Unfortunately the status is often accessed without locking the Mutex, e.g. here. While these accesses are usually via atomic.LoadUint32 they are not particularly threadsafe and the code has been complicated to work around the deficiencies (e.g. here). This is especially apparent with auto reconnection (what happens if Disconnect is called while the system is attempting to reconnect?).
Currently the available statuses are:
This is fairly limiting. For instance the addition of a disconnecting status would make it obvious that the disconnection process was underway, avoiding the need for workarounds.
Whilst I believe that workarounds are in place for all known issues (i.e. it works), the code can be difficult to follow, which discourages new contributors (e.g. this PR) and makes checking/testing contributions more difficult than it needs to be.
I have been putting off changing this for some time, due to the potential to introduce subtle bugs, but feel that the time has come where a change is needed, so I will be submitting a PR shortly. I would expect this change to require a few iterations; the initial change will introduce new status handling code and follow-up PRs (not necessarily by me) will refactor code to remove workarounds introduced due to the limitations of the old mechanism.
Note: The new code will focus on being readable over performance (for example I'm aiming to remove uses of atomic until benchmarking provides a solid rationale for using them). This change will not change the API and should make no difference to users (unless bugs are introduced!).