v1.6.0 causing rabbit connection errors #160
If I had to guess, it is #142 that caused this change in behavior. @rlk833 if you could provide more information that would be great. Could you please quantify "normal heavy load":
Ideally, providing code to reproduce the issue would help us out the most; otherwise, reproducing it in a test environment and sharing the details would also help.
cc @fadams - if you have a second to look at the symptoms and chime in I would appreciate it.
Sorry, but unfortunately we don't have any usage metrics on Rabbit calls. On average:
Most of the errors showing in the Rabbit logs are frame_too_large; we also got some invalid_frame_end_marker.
Thanks. I'm asking these questions so I can try to reproduce this issue, so please be as specific as possible.
I understand, but I just don't have the metrics. Message bodies can range from a few bytes up to maybe a megabyte; we gzip the body if it is larger than 2K. We use multiple connections, usually one or two per queue that we publish to. We never use the same channel/connection reentrantly: it is one channel per connection, and we only ever do one transaction at a time on a connection/channel. I didn't want a failure in one transaction to cause a failure in another transaction simultaneously using the same connection/channel. If there are simultaneous requests, we have a pool of a few connections that we hand out to each request, and then the connection is returned to the pool. Connections: 32
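A minimal sketch of the pooling pattern described above (a few connections handed out per request, one channel per connection); the names and sizes here are assumptions, not the reporter's actual code:

```go
package rabbitpool

import (
	amqp "github.com/rabbitmq/amqp091-go"
)

// pooledConn pairs one connection with its single channel.
type pooledConn struct {
	conn *amqp.Connection
	ch   *amqp.Channel
}

// connPool hands out connections via a buffered channel acting as a free list.
type connPool struct {
	free chan *pooledConn
}

// New dials `size` connections up front, opening one channel on each.
func New(url string, size int) (*connPool, error) {
	p := &connPool{free: make(chan *pooledConn, size)}
	for i := 0; i < size; i++ {
		conn, err := amqp.Dial(url)
		if err != nil {
			return nil, err
		}
		ch, err := conn.Channel()
		if err != nil {
			return nil, err
		}
		p.free <- &pooledConn{conn: conn, ch: ch}
	}
	return p, nil
}

// Get blocks until a connection is free; Put returns it to the pool once the
// request's single "transaction" is done.
func (p *connPool) Get() *pooledConn  { return <-p.free }
func (p *connPool) Put(c *pooledConn) { p.free <- c }
```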
Just took a closer look. Publishing is ALWAYS one channel and one connection, no sharing. Consuming: we have a few consumers that share the channel when consuming.
Hi @lukebakken, I'm afraid I won't be able to have a proper look/think until the weekend 'cause work. The only place I can think of where I've seen anything resembling what has been described here was where my main goroutine terminated (too) early, but the publisher goroutine didn't have cancellation in place. So:

1. If I were to bet, I suspect there is a bug somewhere in the (user's) application client code where a connection/channel has been closed and pulled the rug from under a publish.
2. What you've done with the defer to force a flush on an error looks like a good call.

TBH I can't recall the exact error I saw; it was intermittent IIRC and I put it down to my (rather hacky at the time) application code terminating pretty uncleanly. When I have a little spare time I'll see if I can "unfix" what I did to my application and try to reproduce. But your "flush on error" change looks a good call.

The only other thing I can think of off the top of my head is the bit around chunking the body into frame-sized pieces. In my original change I had:
and I did wonder whether doing a flush per chunk might be the thing to do. But with buffered IO, when the underlying buffer length is exceeded the buffer flushes implicitly; the reason for explicit flushes is when you have writes smaller than the buffer length, if you see what I mean. I suspect that's a red herring, but worth mentioning. If there were only ever observed issues for large messages I might wonder about that more, but I suspect latent client application "lifecycle" issues exposing an error-handling edge case. That's all I can think of for now I'm afraid. I'll try to dig more deeply when I have a little more time.
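The chunking idea being described might look roughly like this (a sketch, not the actual #142 code): the body is split into frame-sized pieces and written through a bufio.Writer, which flushes to the underlying writer on its own whenever a write no longer fits in its buffer, so the explicit Flush mostly matters for whatever tail is smaller than the buffer.

```go
package sketch

import "bufio"

// writeBodyChunks splits body into chunks no larger than frameMax and writes
// each one through the buffered writer. Any chunk that overflows the buffer
// triggers an implicit flush inside bufio; the final explicit Flush pushes out
// whatever is still buffered.
func writeBodyChunks(w *bufio.Writer, body []byte, frameMax int) error {
	for len(body) > 0 {
		n := frameMax
		if n > len(body) {
			n = len(body)
		}
		if _, err := w.Write(body[:n]); err != nil {
			return err
		}
		body = body[n:]
	}
	return w.Flush()
}
```

With, say, a 4 KiB buffer and 128 KiB frames, every chunk write overflows the buffer and flushes implicitly; only a tail smaller than the buffer needs the explicit Flush.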
I really appreciate it @fadams @rlk833 - if you can test the changes I have made in #161 that would be great. I'm still working on reproducing this issue using this project. Of course everything works fine in my environment at the moment.
Take a look at this Stack Overflow question about "short write": https://stackoverflow.com/questions/66174520/how-to-solve-short-write-error-when-writing-to-csv - are your sync locks correct? The last answer mentions using bufio and writing and flushing at the same time. Though on the Rabbit side we are also getting a lot of:
Ahh interesting, so on https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L490 it probably needs to read
That's definitely possible/plausible, and if so, my bad. In all honesty I actively avoid publishing from multiple goroutines concurrently; I can't think of a good reason to do that (other than convenience). Certainly from a performance/throughput perspective, multiple goroutines publishing to a given channel, or even to multiple channels on the same connection, has never seemed to make any positive performance difference, so for throughput I always end up creating a pool of connections, each with an (AMQP) channel, and I have a handler goroutine for each connection in the pool receiving data from a (go) channel.
I also try to single-thread use of a connection, except that for some reason my consume queues share a connection. But I am worried there may be some weird spot where it is not single-threaded on a connection. And is there a lock on the write too? It is possible during writing that a flush occurs under the covers if the buffer becomes full.
OK, now I think I see the issue. If all of these writes aren't serialized, you can get interleaved frame writes on the connection when multiple goroutines use the same connection.
Busy at the moment - a day of meetings, yuck. But I will get to it; tomorrow is free so far. I think I can ask for a specific commit id in go.mod. I'll need to look at that.
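For reference, pinning a module to a specific commit is done by handing the commit hash to go get, which then records a pseudo-version in go.mod (the hash below is a placeholder, not a real commit):

```sh
go get github.com/rabbitmq/amqp091-go@<commit-sha>
# go.mod then records a pseudo-version of the form:
#   require github.com/rabbitmq/amqp091-go vX.Y.Z-0.<yyyymmddhhmmss>-<short-sha>
```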
@lukebakken minor thing, but do you need startSendUnflushed()/endSendUnflushed() as explicit methods, given they are simply locking and unlocking the mutex respectively? Wouldn't it actually be clearer to directly lock and (defer) unlock the mutex at the call site? There are arguments both ways I guess, and it is really minor, but it'd make it more explicit that all the writes and the flush for the message are being protected by that mutex.
I find it easier to understand what is going on with the code as I modified it. I just moved the flush into endSendUnflushed(). I thought about adding an assertion to ensure that the mutex is held.
Haha, NP, we all visualise things differently, right :-) TBF the startSendUnflushed/endSendUnflushed names imply a sort of semantic "transaction"; as I said, it's minor. On the plus side, what you've done is likely to give a (tiny) performance improvement in the non-concurrent case, as the lock is taken at the start of the "transaction" and released at the end rather than for each frame write. TBF, for the case of no lock contention I think it'll only save a couple of atomic CAS, but hey, it's a bonus :-)
@rlk833 I have tagged version |
@lukebakken sorry, I've just had another thought on this. On your recent change I follow the rationale. But thinking back to the original 1.5.0 code (which just used send()): there, the mutex just protected WriteFrame() (https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L429), so before the flush code was separated out you could have had interleaved frames if the connection was written to by multiple goroutines.

I think there's a subtlety. My guess is that most people are likely to stand up multiple channels if they were planning on writing from multiple goroutines, and in that case frames interleaved on the connection from different channels should be fine - I think that was the original behaviour - and I think simply putting the mutex around the flush would have been good enough.

With your recent change the mutex protects against concurrent access with a wider critical section, so it now protects the case of, say, multiple concurrent writes to the same channel, as it prevents interleaving. But the consequence is that it now (I think) serialises access from multiple channels. If you have, say, two goroutines each writing to a different channel, I think it would have been OK for those frames to be interleaved (though the Flush should be protected to prevent an accidental flush mid-write). By serialising the writing of the entire message it could (I think) increase latency in the case of two channels with two concurrent writers where a large message is being written; previously each "chunked" body frame from each channel could have been interleaved.
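To make the two granularities concrete, here is an illustrative sketch (simplified, with made-up names; not amqp091-go's actual code). In the per-frame form, whole frames from different channels may interleave on the wire, which is legal AMQP, while the flush still has to run under the lock so it can't land mid-frame. In the per-message form, the lock is held from the first frame through the final flush, which also serialises messages across channels sharing the connection.

```go
package sketch

import (
	"bufio"
	"sync"
)

type conn struct {
	mu  sync.Mutex
	buf *bufio.Writer
}

// Per-frame locking: each frame write, and the flush, takes the lock on its
// own. Frames from different channels may interleave, but never mid-frame.
func (c *conn) sendPerFrame(frames [][]byte) error {
	for _, f := range frames {
		c.mu.Lock()
		_, err := c.buf.Write(f)
		c.mu.Unlock()
		if err != nil {
			return err
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.buf.Flush()
}

// Per-message locking: the lock is held across all frames of the message plus
// the flush, so nothing from another goroutine can interleave, at the cost of
// serialising concurrent publishers on other channels of the same connection.
func (c *conn) sendPerMessage(frames [][]byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, f := range frames {
		if _, err := c.buf.Write(f); err != nil {
			return err
		}
	}
	return c.buf.Flush()
}
```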
I see what you're saying, let me think about it, read about how the underlying writer flushes data when it's a socket, and see what @rlk833 reports back. |
Aha, yep, this is the root of the issue introduced by #142. I'll open a new PR. |
@fadams thanks again for continuing to think about this issue. @rlk833 - I just tagged |
I've written a benchmark test to publish 1,000,000 messages, split between 10 goroutines, with a body between 10-15 bytes. Publishers await confirmations, with up to 100 confirmations in flight. RabbitMQ 3.11.8 in Docker on my Mac laptop. The results confirm what has been discussed here 🙂 In summary, #142 provided a huge boost in performance. The results:
I ran each benchmark 3 times; they all yield similar results (a difference of 100-200 msg/s between runs).
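A rough sketch of that kind of harness, under assumed details (one shared connection, one channel per publisher, a bounded confirmation window per publisher; not the actual benchmark code):

```go
package main

import (
	"log"
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	const (
		publishers = 10
		total      = 1_000_000
		window     = 100 // max confirmations in flight (assumed per publisher)
	)

	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Declare the target queue once up front.
	setup, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := setup.QueueDeclare("bench", false, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for p := 0; p < publishers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ch, err := conn.Channel()
			if err != nil {
				log.Fatal(err)
			}
			if err := ch.Confirm(false); err != nil { // put the channel into confirm mode
				log.Fatal(err)
			}
			inFlight := make([]*amqp.DeferredConfirmation, 0, window)
			for i := 0; i < total/publishers; i++ {
				dc, err := ch.PublishWithDeferredConfirm("", "bench", false, false,
					amqp.Publishing{Body: []byte("hello, world!")}) // ~13-byte body
				if err != nil {
					log.Fatal(err)
				}
				inFlight = append(inFlight, dc)
				if len(inFlight) == window {
					for _, c := range inFlight {
						c.Wait() // block until the broker confirms
					}
					inFlight = inFlight[:0]
				}
			}
			for _, c := range inFlight {
				c.Wait()
			}
		}()
	}
	wg.Wait()
}
```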
I re-opened this issue because we're waiting on confirmation from @rlk833 that |
We just pushed to our staging/test system. After two hours no errors yet. Hopefully we have a good load like we had when we had the errors. We are going to let it run for a day to see if anything develops. |
Thanks for the update. |
All right. It has been over 24 hours and there have been zero errors with the Rabbit Client. I think this patch is OK to go live. |
Roger that, thanks again! |
That's great news! |
I can also confirm that 1.6.0 was broken for our top producer, which was updated to 1.6.0 this morning - unfortunately via our internal library that had 1.6.0 pinned earlier, so 1.6.1 hadn't made it there yet. The producer started showing issues immediately with 1.6.0. Luckily 1.6.1 is ready; I tested it and all works fine. Thank you all for reporting and fixing 😉
By way of completeness, I put together something roughly based on the send.go tutorial (https://github.com/rabbitmq/rabbitmq-tutorials/blob/main/go/send.go) that reproduces this issue: https://gist.github.com/fadams/d427e1faba74942d429a446027751689. It barfs with "panic: Failed to publish a message: short write" with 1.6.0 and seems happy when the flush is protected by the connection Mutex.

When I was playing with that, it was interesting to note that multiple goroutines writing to the same AMQP channel didn't seem to barf, which is kind of surprising TBH, but having multiple AMQP channels as per the gist barfs pretty much immediately.

It's also kind of amusing to note that the throughput is higher for one goroutine than for multiple, which is probably counterintuitive. My suspicion is that with one goroutine and no lock contention the Mutexes are basically just doing atomic CAS, whereas with multiple goroutines there is lots of contention, hitting the Mutex slow path and causing context switches. Using multiple connections seems to be a more reliable way to improve throughput.
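A rough sketch of the shape just described (one shared connection, one AMQP channel per goroutine, all publishing concurrently); the URL and queue name are assumptions, and the actual reproducer is the linked gist:

```go
package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalf("Failed to connect: %v", err)
	}
	defer conn.Close()

	for i := 0; i < 4; i++ {
		go func() {
			// Each goroutine gets its own AMQP channel, but they all share the
			// same underlying connection (and therefore its buffered writer).
			ch, err := conn.Channel()
			if err != nil {
				log.Fatalf("Failed to open a channel: %v", err)
			}
			for {
				err := ch.PublishWithContext(context.Background(), "", "hello", false, false,
					amqp.Publishing{ContentType: "text/plain", Body: []byte("hello")})
				if err != nil {
					log.Panicf("Failed to publish a message: %v", err) // "short write" on v1.6.0
				}
			}
		}()
	}
	select {} // keep publishing until interrupted
}
```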
We just pulled in v1.6.0 on Friday, and within 5 minutes of normal heavy load we started to see connection errors. Sometimes it got so bad we couldn't get a solid connection for an hour. We backed down to v1.5.0 and making connections worked again.
We are using:
In our client logs we see this error:
In the Rabbit logs we are seeing: