v1.6.0 causing rabbit connection errors #160
If I had to guess, it is #142 that caused this change in behavior. @rlk833 if you could provide more information that would be great. Could you please quantify "normal heavy load":
Ideally, providing code to reproduce the issue would help us out the most; otherwise, reproducing it in a test environment and sharing the details would also help.
cc @fadams - if you have a second to look at the symptoms and chime in I would appreciate it.
Sorry, but unfortunately we don't have any usage metrics on Rabbit calls. On average:
Most of the errors showing in the Rabbit logs are frame_too_large; we also got some invalid_frame_end_marker.
Thanks. I'm asking these questions so I can try to reproduce this issue, so please be as specific as possible.
I understand, but I just don't have the metrics. Message bodies can range from a few bytes up to maybe a megabyte; we gzip the body if it is larger than 2K. We use multiple connections, usually one or two per queue that we publish to. We never use the same channel/connection reentrantly: it is one channel per connection, and we only ever do one transaction at a time on a connection/channel. I didn't want a failure in one transaction to cause a failure in another transaction simultaneously using the same connection/channel. If there are simultaneous requests, we have a pool of a few connections that we hand out to each request, and then the connection is returned to the pool. Connections: 32
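A minimal sketch of the pooling pattern described above (a few connections handed out per request, one channel per connection); the names and sizes here are assumptions, not the reporter's actual code:

```go
package rabbitpool

import (
	amqp "github.com/rabbitmq/amqp091-go"
)

// pooledConn pairs one connection with its single channel.
type pooledConn struct {
	conn *amqp.Connection
	ch   *amqp.Channel
}

// connPool hands out connections via a buffered channel acting as a free list.
type connPool struct {
	free chan *pooledConn
}

// New dials `size` connections up front, opening one channel on each.
func New(url string, size int) (*connPool, error) {
	p := &connPool{free: make(chan *pooledConn, size)}
	for i := 0; i < size; i++ {
		conn, err := amqp.Dial(url)
		if err != nil {
			return nil, err
		}
		ch, err := conn.Channel()
		if err != nil {
			return nil, err
		}
		p.free <- &pooledConn{conn: conn, ch: ch}
	}
	return p, nil
}

// Get blocks until a connection is free; Put returns it to the pool once the
// request's single "transaction" is done.
func (p *connPool) Get() *pooledConn  { return <-p.free }
func (p *connPool) Put(c *pooledConn) { p.free <- c }
```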
Just took a closer look. Publishing is ALWAYS one channel and one connection, no sharing. Consuming: we have a few consumers that share the channel when consuming.
Hi @lukebakken, I'm afraid I won't be able to have a proper look/think until the weekend 'cause work. The only place I can think of where I've seen anything resembling what has been described here was where my main goroutine terminated (too) early, but the publisher goroutine didn't have cancellation in place. So:

1. If I were to bet, I suspect there is a bug somewhere in the (user's) application client code where a connection/channel has been closed and pulled the rug from under a publish.
2. What you've done with the defer to force a flush on an error looks like a good call.

TBH I can't recall the exact error I saw; it was intermittent IIRC and I put it down to my (rather hacky at the time) application code terminating pretty uncleanly. When I have a little spare time I'll see if I can "unfix" what I did to my application and try to reproduce. But your "flush on error" change looks a good call.

The only other thing I can think of off the top of my head is the bit around chunking the body into frame-sized pieces. In my original change I had:
and I did wonder whether doing a flush per chunk might be the thing to do. But with buffered IO, when the underlying buffer length is exceeded the buffer flushes implicitly; the reason for explicit flushes is when you have writes smaller than the buffer length, if you see what I mean. I suspect that's a red herring, but worth mentioning. If there were only ever observed issues for large messages I might wonder about that more, but I suspect latent client application "lifecycle" issues exposing an error-handling edge case. That's all I can think of for now I'm afraid. I'll try to dig more deeply when I have a little more time.
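The chunking idea being described might look roughly like this (a sketch, not the actual #142 code): the body is split into frame-sized pieces and written through a bufio.Writer, which flushes to the underlying writer on its own whenever a write no longer fits in its buffer, so the explicit Flush mostly matters for whatever tail is smaller than the buffer.

```go
package sketch

import "bufio"

// writeBodyChunks splits body into chunks no larger than frameMax and writes
// each one through the buffered writer. Any chunk that overflows the buffer
// triggers an implicit flush inside bufio; the final explicit Flush pushes out
// whatever is still buffered.
func writeBodyChunks(w *bufio.Writer, body []byte, frameMax int) error {
	for len(body) > 0 {
		n := frameMax
		if n > len(body) {
			n = len(body)
		}
		if _, err := w.Write(body[:n]); err != nil {
			return err
		}
		body = body[n:]
	}
	return w.Flush()
}
```

With, say, a 4 KiB buffer and 128 KiB frames, every chunk write overflows the buffer and flushes implicitly; only a tail smaller than the buffer needs the explicit Flush.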
I really appreciate it @fadams @rlk833 - if you can test the changes I have made in #161 that would be great. I'm still working on reproducing this issue using this project. Of course everything works fine in my environment at the moment.
Take a look at this Stack Overflow question about "short write": https://stackoverflow.com/questions/66174520/how-to-solve-short-write-error-when-writing-to-csv - are your sync locks correct? The last answer mentions using bufio and writing and flushing at the same time. Though on the Rabbit side we are also getting a lot of:
Ahh interesting, so on https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L490 it probably needs to read
That's definitely possible/plausible, and if so, my bad. In all honesty I actively avoid publishing from multiple goroutines concurrently; I can't think of a good reason to do that (other than convenience). Certainly from a performance/throughput perspective, multiple goroutines publishing to a given channel, or even to multiple channels on the same connection, has never seemed to make any positive performance difference, so for throughput I always end up creating a pool of connections, each with an (AMQP) channel, and I have a handler goroutine for each connection in the pool receiving data from a (go) channel.
I also try to single-thread use of a connection, except that for some reason my consume queues share a connection. But I am worried there may be some weird spot where it is not single-threaded on a connection. And is there a lock on the write too? It is possible during writing that a flush occurs under the covers if the buffer becomes full.
OK, now I think I see the issue. If all of these writes aren't serialized, you can get interleaved frame writes on the connection when multiple goroutines use the same connection.
Busy at the moment - a day of meetings, yuck. But I will get to it; tomorrow is free so far. I think I can ask for a specific commit id in go.mod. I'll need to look at that.
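For reference, pinning a module to a specific commit is done by handing the commit hash to go get, which then records a pseudo-version in go.mod (the hash below is a placeholder, not a real commit):

```sh
go get github.com/rabbitmq/amqp091-go@<commit-sha>
# go.mod then records a pseudo-version of the form:
#   require github.com/rabbitmq/amqp091-go vX.Y.Z-0.<yyyymmddhhmmss>-<short-sha>
```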
@lukebakken minor thing, but do you need startSendUnflushed()/endSendUnflushed() as explicit methods, given they are simply locking and unlocking the mutex respectively? Wouldn't it actually be clearer to directly lock and (defer) unlock the mutex at the call site? There are arguments both ways I guess, and it is really minor, but it'd make it more explicit that all the writes and the flush for the message are being protected by that mutex.
I find it easier to understand what is going on with the code as I modified it. I just moved the flush into endSendUnflushed(). I thought about adding an assertion to ensure that the mutex is held.
Haha, NP, we all visualise things differently, right :-) TBF the startSendUnflushed/endSendUnflushed names imply a sort of semantic "transaction"; as I said, it's minor. On the plus side, what you've done is likely to give a (tiny) performance improvement in the non-concurrent case, as the lock is taken at the start of the "transaction" and released at the end rather than for each frame write. TBF, for the case of no lock contention I think it'll only save a couple of atomic CAS, but hey, it's a bonus :-)
@rlk833 I have tagged version |
@lukebakken sorry, I've just had another thought on this. On your recent change I follow the rationale. But thinking back to the original 1.5.0 code (which just used send()): there, the mutex just protected WriteFrame() (https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L429), so before the flush code was separated out you could have had interleaved frames if the connection was written to by multiple goroutines.

I think there's a subtlety. My guess is that most people are likely to stand up multiple channels if they were planning on writing from multiple goroutines, and in that case frames interleaved on the connection from different channels should be fine - I think that was the original behaviour - and I think simply putting the mutex around the flush would have been good enough.

With your recent change the mutex protects against concurrent access with a wider critical section, so it now protects the case of, say, multiple concurrent writes to the same channel, as it prevents interleaving. But the consequence is that it now (I think) serialises access from multiple channels. If you have, say, two goroutines each writing to a different channel, I think it would have been OK for those frames to be interleaved (though the Flush should be protected to prevent an accidental flush mid-write). By serialising the writing of the entire message it could (I think) increase latency in the case of two channels with two concurrent writers where a large message is being written; previously each "chunked" body frame from each channel could have been interleaved.
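To make the two granularities concrete, here is an illustrative sketch (simplified, with made-up names; not amqp091-go's actual code). In the per-frame form, whole frames from different channels may interleave on the wire, which is legal AMQP, while the flush still has to run under the lock so it can't land mid-frame. In the per-message form, the lock is held from the first frame through the final flush, which also serialises messages across channels sharing the connection.

```go
package sketch

import (
	"bufio"
	"sync"
)

type conn struct {
	mu  sync.Mutex
	buf *bufio.Writer
}

// Per-frame locking: each frame write, and the flush, takes the lock on its
// own. Frames from different channels may interleave, but never mid-frame.
func (c *conn) sendPerFrame(frames [][]byte) error {
	for _, f := range frames {
		c.mu.Lock()
		_, err := c.buf.Write(f)
		c.mu.Unlock()
		if err != nil {
			return err
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.buf.Flush()
}

// Per-message locking: the lock is held across all frames of the message plus
// the flush, so nothing from another goroutine can interleave, at the cost of
// serialising concurrent publishers on other channels of the same connection.
func (c *conn) sendPerMessage(frames [][]byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, f := range frames {
		if _, err := c.buf.Write(f); err != nil {
			return err
		}
	}
	return c.buf.Flush()
}
```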
I see what you're saying, let me think about it, read about how the underlying writer flushes data when it's a socket, and see what @rlk833 reports back. |
Aha, yep, this is the root of the issue introduced by #142. I'll open a new PR. |
@fadams thanks again for continuing to think about this issue. @rlk833 - I just tagged |
I've written a benchmark test to publish 1,000,000 messages, split between 10 goroutines, with a body between 10-15 bytes. Publishers await confirmations, with up to 100 confirmations in flight. RabbitMQ 3.11.8 in Docker on my Mac laptop. The results confirm what has been discussed here 🙂 In summary, #142 provided a huge boost in performance. The results:
I ran each benchmark 3 times; they all yield similar results (a difference of 100-200 msg/s between runs).
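A rough sketch of that kind of harness, under assumed details (one shared connection, one channel per publisher, a bounded confirmation window per publisher; not the actual benchmark code):

```go
package main

import (
	"log"
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	const (
		publishers = 10
		total      = 1_000_000
		window     = 100 // max confirmations in flight (assumed per publisher)
	)

	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Declare the target queue once up front.
	setup, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := setup.QueueDeclare("bench", false, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for p := 0; p < publishers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ch, err := conn.Channel()
			if err != nil {
				log.Fatal(err)
			}
			if err := ch.Confirm(false); err != nil { // put the channel into confirm mode
				log.Fatal(err)
			}
			inFlight := make([]*amqp.DeferredConfirmation, 0, window)
			for i := 0; i < total/publishers; i++ {
				dc, err := ch.PublishWithDeferredConfirm("", "bench", false, false,
					amqp.Publishing{Body: []byte("hello, world!")}) // ~13-byte body
				if err != nil {
					log.Fatal(err)
				}
				inFlight = append(inFlight, dc)
				if len(inFlight) == window {
					for _, c := range inFlight {
						c.Wait() // block until the broker confirms
					}
					inFlight = inFlight[:0]
				}
			}
			for _, c := range inFlight {
				c.Wait()
			}
		}()
	}
	wg.Wait()
}
```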
I re-opened this issue because we're waiting on confirmation from @rlk833 that |
We just pushed to our staging/test system. After two hours no errors yet. Hopefully we have a good load like we had when we had the errors. We are going to let it run for a day to see if anything develops. |
Thanks for the update. |
All right. It has been over 24 hours and there have been zero errors with the Rabbit Client. I think this patch is OK to go live. |
Roger that, thanks again! |
That's great news! |
I can also confirm that 1.6.0 was broken for our top producer, which was updated to 1.6.0 this morning - unfortunately via our internal library that had 1.6.0 pinned earlier, so 1.6.1 hadn't made it there yet. The producer started showing issues immediately with 1.6.0. Luckily 1.6.1 is ready; I tested it and all works fine. Thank you all for reporting and fixing 😉
By way of completeness, I put together something roughly based on the send.go tutorial (https://github.com/rabbitmq/rabbitmq-tutorials/blob/main/go/send.go) that reproduces this issue: https://gist.github.com/fadams/d427e1faba74942d429a446027751689. It barfs with "panic: Failed to publish a message: short write" with 1.6.0 and seems happy when the flush is protected by the connection Mutex.

When I was playing with that, it was interesting to note that multiple goroutines writing to the same AMQP channel didn't seem to barf, which is kind of surprising TBH, but having multiple AMQP channels as per the gist barfs pretty much immediately.

It's also kind of amusing to note that the throughput is higher for one goroutine than for multiple, which is probably counterintuitive. My suspicion is that with one goroutine and no lock contention the Mutexes are basically just doing atomic CAS, whereas with multiple goroutines there is lots of contention, hitting the Mutex slow path and causing context switches. Using multiple connections seems to be a more reliable way to improve throughput.
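A rough sketch of the shape just described (one shared connection, one AMQP channel per goroutine, all publishing concurrently); the URL and queue name are assumptions, and the actual reproducer is the linked gist:

```go
package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalf("Failed to connect: %v", err)
	}
	defer conn.Close()

	for i := 0; i < 4; i++ {
		go func() {
			// Each goroutine gets its own AMQP channel, but they all share the
			// same underlying connection (and therefore its buffered writer).
			ch, err := conn.Channel()
			if err != nil {
				log.Fatalf("Failed to open a channel: %v", err)
			}
			for {
				err := ch.PublishWithContext(context.Background(), "", "hello", false, false,
					amqp.Publishing{ContentType: "text/plain", Body: []byte("hello")})
				if err != nil {
					log.Panicf("Failed to publish a message: %v", err) // "short write" on v1.6.0
				}
			}
		}()
	}
	select {} // keep publishing until interrupted
}
```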
We just pulled in v1.6.0 on Friday, and within 5 minutes of normal heavy load we started to see connection errors. Sometimes it got so bad we couldn't get a solid connection for an hour. We backed down to v1.5.0 and making connections worked again.
We are using:
In our client logs we see this error:
In the Rabbit logs we are seeing: