Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queue Fails to Delete with Orphaned Binding #1376

Closed
LeeWhite187 opened this issue Aug 26, 2023 · 12 comments
Closed

Queue Fails to Delete with Orphaned Binding #1376

LeeWhite187 opened this issue Aug 26, 2023 · 12 comments
Assignees
Labels
Milestone

Comments

@LeeWhite187
Copy link

LeeWhite187 commented Aug 26, 2023

Describe the bug

This bug is for v6.5.0 of the RabbitMQ.Client nuget package.
It's a queue binding registration problem that creates an exception on deletion of a queue.

The channel.QueueDelete method call in the .NET client library throws an exception with a message containing: 'NOT_FOUND - no exchange '.
This occurs if a previous add queue binding was attempted with an unknown exchange name.
The exception is erroneous, since the call is to delete a queue, and not an exchange.
So, that implies an orphaned reference, somewhere.... and the problem is not actually the queue delete call, directly.

It's actually that a previously executed channel.QueueBind method correctly fails with an exception for the missing exchange.
But is still allowed to register the binding (to a nonexistant exchange), before the throw.

And subsequently, when later, deleting a queue, the QueueDelete method call also throws an exception for the same missing exchange (and same exchange name).... because it discovers a binding for the missing exchange (that was allowed to be registered), even though no valid exchange existed (and the QueueBind call threw because of it).

This exception, on queue deletion, prevents the QueueDelete method logic from actually deleting the queue.
And, the queue has to be deleted via a REST call or through the management interface.

Reproduction steps

  1. Connect client to a RMQ cluster with a user context that has admin privileges.
  2. Create a Guid string that will be the name of a queue to be created.
  3. Create a Guid string that will represent an unknown exchange.
  4. Create a Guid string that will be the binding routing key.
  5. From the .NET client, create a queue with the Guid string from above.
  6. From the management interface, verify the queue was created.
  7. From the .NET client, create a queue binding using the queue name, exchange name, and routing key, created above.
  8. This will throw an exception because the exchange is NOT_Found.
  9. From the management interface, verify the queue binding is NOT present (though, it actually got registered before the throw).
  10. Query the REST API for the queue binding, and verify that it is NOT present (though, it actually got registered before the throw).
  11. From the .NET client, delete the queue by name.
  12. Verify it throws an exception for a NOT_Found exchange, with the Guid string of the unknown exchange in the exception message.
  13. From the management interface, verify the queue is still present, failing to be deleted.
  14. From the REST API, verify the queue is still present, failing to be deleted.

Expected behavior

The expected behavior is that:

The QueueBind method of the channel class must NOT register a binding or leave any dangling binding record when the given exchange name is not known, or the queue name is not known.

As well, the QueueDelete method of the channel must be intelligent enough to delete all bindings of the queue, regardless of exchange presence.
Also. The QueueDelete method should throw an exception for a not found queue, which it correctly does.
But, it should NOT throw an exception for a not found exchange, which is conceptually incorrect for a queue deletion.
It should gracefully cleanup any bindings without error.

Additional context

No response

@LeeWhite187
Copy link
Author

LeeWhite187 commented Aug 26, 2023

I went through the effort, as a necessary workaround for the above issue, to create a REST client that performs the add/remove/exists for queue, exchange, and bindings.
I am able to correctly delete a queue via REST API, after an attempt to bind the queue to a missing exchange.
As well. I worked up unit tests (33+ now) for the REST client calls, covering happy-path and edge cases, like the above scenario, to ensure the RMQ service properly handles cleanup of dangling resources.

So, it is safe to assume that the RMQ service is behaving correctly in the above issue.
Which means, there must be a bug in the .NET client (v6.5.0) that is causing the queue delete call to throw an exception (with a NOT_Found exchange message)... when deleting a queue.

This issue is not a work stoppage for me. So, no rush.
I've unit tested and merged in the developed REST calls (as a workaround) to perform the queue, binding, and exchange management activities (present on the channel class).

But, I expect this edge case scenario could affect others.
Bug me if you need help isolating the problem.
Or, want a copy of my rest client code and unit tests.

Have a good day.

@LeeWhite187
Copy link
Author

To note, I did try slowing down logic execution to determine if the errant exception is just an eventual consistency issue for an in-progress, asynchronous binding attempt.
But, pausing logic flow with a healthy sleep delay between test steps, did not prevent the error.

@lukebakken lukebakken self-assigned this Aug 29, 2023
@lukebakken
Copy link
Contributor

lukebakken commented Aug 29, 2023

Thank you for the report.

What would expedite analysis is if you could provide a git repository with a complete set of code that reproduces this issue, and demonstrates your workaround.

Right now you're asking us to follow some prose to guess what you're doing when a runnable project or projects (as you offered) would help tremendously.

Thank you! I've been on PTO and am catching up today.

@lukebakken lukebakken added this to the 6.8.0 milestone Nov 18, 2023
@lukebakken
Copy link
Contributor

@LeeWhite187 would you be able to provide code to run with your reproduction steps?

@LeeWhite187
Copy link
Author

Sure.
I've attached a VS project that demonstrates the problem:
RMQ_QueueDeleteFailure_Test.zip

I thinned out the classes enough to readily follow execution, while preserving enough abstraction, in the unit test block, to visibly align with the "prose", above.
The unit test will assert failure at the Delete_Queue step, as this is the step that's expected to function properly, but does not.

No rush on fixing it. I built a REST client as workaround.

Bug me if questions, or if you need something else.
Hope this helps.

@lukebakken
Copy link
Contributor

Thank you very much! I'm working on the version 7 release and one goal is to close as many outstanding issues as possible pre-release.

@LeeWhite187
Copy link
Author

Anytime. Sorry for the delayed response, to getting you a test project to recreate the problem.

Bug me if you need anything.

@lukebakken
Copy link
Contributor

@LeeWhite187 - the problem is in your RMQ_Client.cs code:

https://github.com/lukebakken/rabbitmq-dotnet-client-1376/blob/main/ClassesUnderTest/RMQ_Client.cs#L1285-L1291

Once there is an exception on a channel, you must not use it anymore, and must open a new channel.

Since you continue to use the same channel in your test scenario, you see the "stale" data.

Please feel free to reply if this does not seem to be the correct analysis.

@lukebakken lukebakken closed this as not planned Won't fix, can't repro, duplicate, stale Dec 8, 2023
@LeeWhite187
Copy link
Author

@lukebakken Thanks for your response.
Have a question, though:

To clarify: Are you saying that if, for some reason, a client (of your library) requests to create a queue binding, to an exchange that doesn't exist (as is certainly possible in a large, multi-service, multi-team cluster), that the client logic MUST close the channel it used, and create a new one?

What if that channel (the queue binding failed on) is associated with an existing consumer, or some other resource type?
This could definitely be a scenario, when resources are dynamically added and removed at runtime.
And that could likely occur, because the recommended channel isolation (from the web) is "one channel per thread/process/coroutine".

An even worse scenario, for this recovery choice, would be using the channel shared by a consumer of an exclusive queue (for a queue binding request that may fail in this way).
Of which, closing the channel (because a "throwing" queue binding request), also deletes any exclusive queue.
And, any destined messages to that queue, will be lost or dead-lettered, depending on timing.

The recovery choice you're proposing seems a little extreme in its impact, for what could be solved with logic ordering in the binding method call.
And this seems to be even more the case, when there is no actual "channel" resource required or referenced, during the REST call that binds a queue.

With this recovery choice, I'll either have to choose to create "admin" channels for such tasking, or stick with a REST client approach (like my current workaround).
Doing either of those, removes any risk of the library "corrupting" its own channel information, because of a poor client request, or state change of the cluster by another service.

Unless, I'm really using it wrong...?

@lukebakken
Copy link
Contributor

Yes, a channel-level exception from RabbitMQ requires that you close the channel. RabbitMQ will close the channel on its end. In your test scenario, you can see that IsOpen is false on the IModel instance after the error.

Creating an "admin" channel is a good solution.

@LeeWhite187
Copy link
Author

That’s interesting to know. My tests did not include any post checks, to confirm resources remained viable.
I’ll add that for my own sanity.

Thanks for the insight.
Good library.

@lukebakken
Copy link
Contributor

lukebakken commented Dec 9, 2023

Thank you. I really appreciate the effort you put into your issue description and for providing code.

This feels like something that could be added to the RabbitMQ tutorials and documentation - what to do in the case of an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants