JRuby and TLS causing odd timeouts/reconnections? #142

Open
mlartz opened this issue Jul 7, 2017 · 23 comments


@mlartz

mlartz commented Jul 7, 2017

Is it possible/recommended to ack a message from a worker thread? I've been using a STOMP server (corporate ActiveMQ) and client acks ({ack: 'client-individual'}) to provide reliable receipt and storage of messages. So I'm receiving a STOMP message, storing it to an external service, and then acking the message. I'd like to not block the receive thread and increase the concurrency of the storage requests. As such, I created a thread pool to do the storage of the message and then ack the message, which should provide the guarantee that all messages that are acked to STOMP are also stored.

Unfortunately, I'm developing on a closed network, so the pseudo-code below is all constructed/simplified for this issue:

require 'stomp'
require 'concurrent' 

@thread_pool = Concurrent::FixedThreadPool.new(5) 
@client = Stomp::Client.new
@client.subscribe('/topic/mytopic', :id => 'mytopic', :ack => 'client-individual') do |msg|
  @thread_pool.post do 
    begin
      store_to_external_service(msg)
    ensure
      @client.ack(msg)
    end
  end
end

loop do 
  sleep(10)
end

This seems to work fine in MRI. After a while, I needed to do a bit more work in store_to_external_service, and MRI's parallelism was really lacking, so I attempted to switch to JRuby. Under JRuby, everything seems to chug along just fine to start; however, message receipt deadlocks after a few seconds, right after a @client.ack. When I eliminate the thread pool (and just do the storing/acking inline), no deadlocks are observed. If I use even one storage thread (@thread_pool = Concurrent::FixedThreadPool.new(1)), however, I get a deadlock after a few seconds of message receipt.

So, ultimate question ... is message acking threadsafe? Is it OK to ack a message from a thread other than the receive thread? Or am I doing something else boneheaded wrt the worker thread(s)?

@gmallard

gmallard commented Jul 8, 2017

Apologies if much of this reply gets duplicated.

Given the multi-threading capabilities of Ruby, this gem surely does not see much multi-thread use.

And it also does not get very much testing with JRuby.

In theory I see nothing wrong with what you describe - including ACK from a thread other than the receiver. I know of no reason that should not work.

I will assume you have checked AMQ logs for possible problems, yes?

You use the term 'deadlock'. Do you mean just a 'hang' ?

Other Questions:

What version of jruby?
What version of stomp gem?
What version of concurrent-ruby gem?
What version of AMQ?
What STOMP Protocol level? (If you are using protocol level 1.0, please don't, especially if this is 'new' code.)

I will help if I can.

However, I do not think this is a stomp gem problem. Here is why:

I just used Ruby 2.4.2p131 (2017-05-10 revision 58637) [x86_64-linux] (I build my own Ruby versions) and your sample code. I pumped 5000 messages through AMQ at 10 messages / second, message size about 100 bytes. Zero problems. So I think the gem is solid.

JRuby is a different story. Using jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-x86_64] I cannot get that code to run at all. I get different results depending on if the connection is marked reliable or not. I will investigate this further, because I am now curious ....... JRuby support in the stomp gem has always been painful.

Have you considered trying to use Ruby fibers for this effort? But frankly, if I wanted rich multi-threading/concurrency I would have gone with a client in Java, or go, or maybe even C++.

@gmallard

gmallard commented Jul 8, 2017

I repeated the above 5000 message experiment with:

jruby 1.7.25 (1.9.3p551) 2016-04-13 867cb81 on OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-amd64]

Zero problems.

The latest JRuby 9.x version:

jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-x86_64]

has the same erratic behavior as the 9.1.2.0 version.

@gmallard

gmallard commented Jul 8, 2017

I now have your example code running fine with JRuby 9.x versions.

Not your problem, but JRuby support is a pain. JRuby socket I/O does not behave like normal Ruby socket I/O (or even like C socket I/O). This gem has been changed multiple times to work around that.

In any case, I see nothing wrong with the stomp gem at this point. And cannot recreate what you describe.

If you will document further, I will explore further. But right now I am thinking there is something ..... unique about your particular environment that triggers this behavior.

@mlartz
Author

mlartz commented Jul 10, 2017

Thanks for checking into this. I'll see if I can't answer all the questions. tldr; I think my issue might be due to using TLS to connect to the broker.

I will assume you have checked AMQ logs for possible problems, yes?
Unfortunately, no. The server is corporately supported, meaning that I have no access to the logs.

You use the term 'deadlock'. Do you mean just a 'hang' ?
Yes, sorry for being confusing. Given the threading issues, I assumed that it was a deadlock, but I don't have proof.

What version of jruby?
jruby 9.0.3.0 (2.2.2) 2015-10-21 633c9aa Java HotSpot(TM) 64-Bit Server VM 25.45-b02 on 1.8.0_45-b14 +jit [linux-amd64]

What version of stomp gem?
stomp (1.4.3)

What version of concurrent-ruby gem?
concurrent-ruby (1.0.5 java)

What version of AMQ?
ActiveMQ/5.14.5 (via the STOMP connection headers)

What STOMP Protocol level?
1.1, due to some historical eventmachine code that was not 1.2 compliant. I can most likely update to 1.2 with little difficulty.

Also, I failed to mention that I am using TLS1.2 to connect to the broker with client certificates for auth.

I wrote a little consumer similar to that posted above, and it is not behaving in the same way as the entire application (full hang), however it does have some issues which might or might not be related.

First of all, I use the activemq.prefetchSize header to increase the number of unacked messages "in flight", to make sure that I don't spend (too much) time waiting on the server. I don't quite understand the official AMQ STOMP docs (and more) on prefetch, though: they indicate that dynamic languages should use a prefetch of 1, yet suggest that other clients use a value > 1. When I attempt to set the prefetch header to 1, my client doesn't seem to receive any messages in either MRI or JRuby. That said, I set drdbg = true (https://github.com/stompgem/stomp/blob/dev/lib/connection/netio.rb#L29) and added a bunch of other print statements to see where things were getting hung up. The last thing that I see in the log is a ["wiredatain_02A", "MESSAGE\n", 2017-07-10 18:01:12 +0000] (https://github.com/stompgem/stomp/blob/dev/lib/connection/netio.rb#L64). It does not get to the wiredatain_02B message after the following:

unless connread || @ssl
  raise Stomp::Error::ReceiveTimeout unless IO.select([read_socket], nil, nil, @iosto)
end

From my rudimentary understanding of the code it looks like the client is getting a message (or at least the start of one) and hanging on the IO.select for too long. At this point the receive eventually times out, the client reconnects, and the process repeats.

The second issue with the simple client is similar to the above, but in a separate spot in the code. When I increase the prefetch size to 2 (or more), the client starts receiving (and acking) messages. Eventually it stops, however, and the last message that I see is ["wiredatain_05_prep", "isr", true]. After a few seconds the timeout happens and the client reconnects. A couple more puts calls indicate that the code gets hung up at https://github.com/stompgem/stomp/blob/dev/lib/connection/netio.rb#L117 , which is identical to the above code:

unless connread || @ssl
  raise Stomp::Error::ReceiveTimeout unless IO.select([read_socket], nil, nil, @iosto)
end

The Ruby core docs indicate some differences with IO.select and IO-like objects, and specifically reference OpenSSL sockets with a bunch of caveats. I have no idea how JRuby implements this under the hood.
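
For illustration, the pattern the docs describe looks roughly like this (ssl_socket and timeout are placeholder names, not the gem's variables): decrypted data can already be buffered inside OpenSSL, so a plain IO.select on the raw socket can block even though a frame is readable, and read_nonblock is supposed to drain that buffer first.

require 'openssl'

# Rough sketch of the documented caveat pattern; placeholder names, not stomp gem code.
def read_ssl_frame_chunk(ssl_socket, timeout)
  ssl_socket.read_nonblock(4096)      # returns already-decrypted buffered bytes if any
rescue IO::WaitReadable
  retry if IO.select([ssl_socket.to_io], nil, nil, timeout)
  nil                                 # timed out with nothing readable
rescue IO::WaitWritable
  retry if IO.select(nil, [ssl_socket.to_io], nil, timeout)
  nil                                 # TLS renegotiation wanted a write
end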

In the full application (albeit a bit stripped down to help diagnose the STOMP issue), it seems to be hanging in a similar place (i.e. the while block at https://github.com/stompgem/stomp/blob/dev/lib/connection/netio.rb#L115-L126), however further diagnosis seems to indicate that it's having trouble when it calls the _is_ready method, specifically when calling s.ready? (https://github.com/stompgem/stomp/blob/dev/lib/connection/netio.rb#L148) on the socket.

Because all of the issues seem to be centered around the low-level socket stuffs, and I'm not using a "real" socket (OpenSSL), I'm inclined to believe that is the core problem. Let me know if I can do anything more to help debug.

@mlartz
Author

mlartz commented Jul 10, 2017

After using jstack and googling, I figured out that the full app is hanging due to a deadlock caused by the logging gem (https://github.com/TwP/logging). logging uses thread-local variables to manage the mapped diagnostic contexts, and that seems to be what is causing the problem. I'll follow up with them about this issue.
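
For illustration only (this is just what "thread-local" means here, not a claim about the root cause), the logging gem's mapped diagnostic context is a per-thread key/value store, used roughly like this:

require 'logging'

# Each thread gets its own MDC entries; pattern layouts can interpolate them into log lines.
Logging.mdc['worker'] = Thread.current.object_id.to_s
logger = Logging.logger(STDOUT)
logger.info 'stored message'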

As for the issue with the simple client (i.e. various "socket" select timeouts), I don't have any more information from jstack.

@gmallard

gmallard commented Jul 10, 2017

Great reports above. Many reports do not have detail like you provide.

Question regarding TLS:

There are four basic use cases. See the examples subdirectory. Which use case are you using? (Does not sound like dual-auth).

The gem is full of JRuby hacks added over the years. I admit that. You can see that in the code. Need some refactoring efforts in that area I think. My problem.

The real difficulty with basic socket I/O is not JRuby per se.

It is Java's implementation of sockets, which does not match normal C behavior. You can see all of the gem logic changes attempting to support JRuby if you look at the code.

This issue is complicated by SSL/TLS implementations. Even in C, SSL is a hack upon basic sockets. Hence the 'ssl' checks in the IO routines.

And all is even more complicated by the fairly recent C libs' use of epoll rather than select.

The _is_ready kind of logic is a reflection of all that.

If this turns out to be a real problem with your logging package .... please let me know.

An aside: I had to laugh at your lack of access to AMQ logs. Management loves to say: build me a swimming pool deck. Only use wood and screws. And oh, by the way, you can only use a hammer to do this.

I will continue to look at and think about this and experiment as time permits. Because something seems not quite right.

I will not support broker-specific headers like AMQ prefetch; I cannot. There are too many RabbitMQ users, and these days Artemis (which has its own thorns). And IMO it is regrettable that Apollo pretty much died. It was the best of the open source brokers.

@gmallard

Your above analysis of the code in the _transmit method is excellent.

Knowing that you are using SSL was crucial for me to understand (and be able to reproduce) some of what you are reporting.

As a consequence I do not think all you are seeing is a result of problems with the logger you are using. Because I can certainly recreate some of it when SSL sockets are used.

Thanks for all this info. I will keep you posted on any upcoming changes.

And please let me know further information about the logger. Have you opened an issue with them?

@mlartz
Author

mlartz commented Jul 11, 2017

Sorry I left out the SSL part in my initial report. I thought that it was the separate acking thread, but apparently there was more to it. Perhaps I should change the title of the issue to reflect our current understanding (i.e. Problems with JRuby, threading, logging, and SSL)?

And I'm using SSL similar to what is in ssl_uc3.rb. My STOMP options look like the following:

options = {
  :hosts => [{
    :login => 'me',
    :host => server,
    :port => 7230,
    :ssl => ::Stomp::SSLParams.new({ :key_file => "private_key.pem", :cert_file => "cert.pem" }),
    :parse_timeout => 10,
    :sslctx_newparm => :TLSv1_2,
    :connect_headers => { "accept-version" => "1.1", "host" => server }
  }],
}

client = Stomp::Client.new(options)

The "simple" client that I referenced above was stripped of the my entire app and limited to just requiring "concurrent" and "stomp". It is very similar to the code that I originally typed, with the inclusion of the above options hash.

I have not yet opened an issue with the logging gem as I'm trying to get a small (retypeable) script to demonstrate the issue and narrow it down to the logging gem and not my stupidity.

Thanks for your help, and I totally understand not supporting broker-specific extensions.

@gmallard

Updating the title of this issue would be good I think. Please do so at your convenience. But - Use the term TLS, not SSL. Old names for old functionality are difficult to unlearn. And also ..... I would say not to mention logging unless you cross-ref this issue with one opened for the logging package.

At this point, I do not think the SSL use case matters, just the fact of SSL yes/no.

@mlartz changed the title from "Acking from a worker thread in JRuby?" to "JRuby and TLS causing odd timeouts/reconnections?" on Jul 11, 2017
@mlartz
Author

mlartz commented Jul 11, 2017

Unfortunately, I might be coming back around to my initial hypothesis (acks and receives in different threads causing problems). If you take the simple client (above) with TLS and require 'logging' at the top (after installing the logging gem, of course), I can get the client to reliably deadlock in a couple of seconds. That said, I'm not sure that the logging gem is the culprit.

I started analyzing the jstack trace after the deadlock occurs, and it says that there are two offending threads that are deadlocked: pool-1-thread-3 (apparently one of the thread pool processing/acking threads) and Ruby-0-Thread-3 (which looks like the STOMP receive thread).

pool-1-thread-3 (process/ack thread) is deadlocked at sock.puts, when it is writing the ack message. It is waiting on a java.util.concurrent.locks.ReentrantLock that is held by Ruby-0-Thread-3. The Java call stack mentions org.jruby.RubyIO.removeBlockingThread and org.jruby.ext.openssl.SSLSocket.waitSelect. These are the offending lines from the call stack in JRuby-land (most recent first):

Ruby-0-Thread-3 (STOMP receive thread) deadlocks at a call to s.ready?, specifically in the ssl.ready? singleton method. In ssl.ready? there is a call to @io.ready?, which I believe to be the source of the issue, as the Java call stack descends into org.jruby.util.io.OpenFile.ready. Callstack:

All that said, since just require 'logging' causes the issue to occur, I'm not convinced that logging is not doing something tricksy at load time. As such, I opened an issue with the maintainer: TwP/logging#174

@gmallard

A couple of notes.

There is no question that the gem has one or more problems in the areas you are attempting to use. Those problems may or may not be related to threading. I am not comfortable with that conclusion yet (but do not rule it out). The problems most certainly are related to low level socket I/O and/or ssl and/or JRuby.

I have a plan of attack here, but have not started writing code on that line yet. It is time for me to do some major rework in the receive routines to isolate IO#select, ssl, and jruby logic. At present there are so many small hacks in that area (good intentions over the years), that it makes the code very difficult to reason about.

That effort will take a bit of time, and require extensive testing here. Changes like that have been known to have negative side effects in areas like heart beats and fail over. I cannot break fail over. The gem has a couple of corporate users with literally hundreds of brokers that use fail over extensively.

You may see this triggered by a logger require, but that actually may not be anything but the trigger. I suppose they could be doing something fancy during load but will repeat: I can recreate some (not all) of what you describe with no logger. I can recreate the fast timeout failures with ease. I cannot recreate a hang/block (and your recent analysis would seem to indicate a true deadlock).

However, I will attempt to run some experiments with that logger required, and observe the effects.

For a 'real' Ruby, as best I recall, the check of @io.ready? is a check of ..... an ssl instance variable (@io), that instance variable being defined in Ruby's C code. I will go back and look at that a bit more / again. To see if it gives me any ideas. I have no idea how JRuby would 'emulate' this in Java. I would almost bet that checks of that instance variable are not thread safe. The Ruby developers have traditionally not had to worry much about multi-threading. Hmmmm ..... just looking at some C and Ruby code in the openssl extension. @io is the underlying 'IO Like Object' for the socket:

class SSLSocket
...
...
    # The underlying IO object.
    attr_reader :io
    alias :to_io :io
...
...
end

and I am thinking the #ready? method is provided by class IO. Notice that SSLSocket is a wrapper; it does not inherit from IO.

An interesting experiment you could run is something like this. In netio.rb:

    @ssl_ready_lock = Mutex.new
    #
    # .....
    #

    def ssl.ready?
        @ssl_ready_lock.synchronize do
            ! @rbuffer.empty? || @io.ready?
        end
    end

One other odd question: your original example shows the use of a 'topic'. Do you really expect and want message semantics for topics (as opposed to queues)? (This is a curiosity question. I have observed that many developers ...... use topics when ..... they really want a queue).

In any case, more to come on this. I will report back later with any news. I hope you did not need your code in production tomorrow.

@mlartz
Author

mlartz commented Jul 12, 2017

I ran the following experiment with my deadlock case (i.e. require 'logging'):

def ssl.ready?
  STDOUT.puts "ssl.ready?"
  @ssl_ready_lock ||= Mutex.new
  @ssl_ready_lock.synchronize do
    STDOUT.puts "ssl.ready? sync"
    ! @rbuffer.empty? || @io.ready?
  end
end

The deadlock happens right after the STDOUT.puts "ssl.ready? sync", which (coupled with the original stack trace) seems to indicate that @io.ready? is indeed involved.

It took me a while to figure out why a bare puts wasn't working (I kept getting protocol exceptions), until I realized that the ssl IO-like object probably implements or inherits puts, so I was injecting random data into the STOMP stream.

Given JRuby's marketing about "true" threading parallelism, I'd hope that their implementations would consider multi-threading use cases, but perhaps something was missed.

Thanks for all the great work, and let me know if I can help any further. I'm fine running as we currently are (on MRI with limited parallelism), but would love to experiment more with JRuby when given the chance. We've also got some workarounds to limit the work done in our processing/acking thread which have been working decently. So yeah, no hurry.

As for my use of topics vs queues, I have both. Production has a series of queues that are bridged to various topics (usually a 1:1 topic:queue relationship) to allow multiple consumers and all the other good things that come along with persistent queues (buffering when the app is down being the most useful). Development configuration just uses the topics, however, since we are just exercising the app consumption and don't require the benefits of queues. Also, the corporate ActiveMQ team gets really mad when the queues back up (even with queue limits), as they say that it stresses their hubs. We can deal with this in production, but I really don't want to have to deal with it every day in development as the queues would almost always be backed up.

@gmallard

Thanks.

Is it possible to show me your latest deadlock case code?

I'd like to make sure I really understand what you are doing, that I am missing nothing. And code tells me that.

@mlartz
Author

mlartz commented Jul 12, 2017

Sure, although it's re-typed from our internal network and a couple of things (server/login/key/topic names) are obviously changed:

require 'concurrent'
require 'logging'
require 'stomp'

options = {
  :hosts => [{
    :login => 'me',
    :host => 'myamqserver',
    :port => 7230,
    :ssl => ::Stomp::SSLParams.new({ :key_file => 'private_key.pem', :cert_file => 'cert.pem' }),
    :parse_timeout => 10,
    :sslctx_newparm => :TLSv1_2,
    :connect_headers => { "accept-version" => "1.1", "host" => 'myamqserver' }
  }],
}

thread_pool = Concurrent::FixedThreadPool.new(5) 
client = Stomp::Client.new(options)

client.subscribe('/topic/mytopic', :id => '1', :ack => 'client-individual') do |msg|
  thread_pool.post do 
    begin
      sleep(0.01) # simulate "work"
    ensure
      client.ack(msg)
    end
  end
end

puts 'subscribed'

loop do 
  sleep 10 
  puts '.'
end

I'm using RVM to install JRuby, and I execute the code like

jruby -J-Xmx2048m -S bundle exec ruby bin/deadlock_test

@gmallard

gmallard commented Jul 13, 2017

Thanks. I am current with your code now.

I will do a few things different in testing. I do not use RVM. And will not.

I cannot imagine who would care about your internal topic names.

I will start to work on this over the weekend.

If at some point I wanted to give you a gem build for smoke testing, would you be able to receive and test that?

That code does not show AMQ 'prefetch'. Are you using that or not? It does change broker behavior.

@gmallard

I don't think this makes any difference, but I would like to know:

is your broker configured with needClientAuth=true or needClientAuth=false?

@gmallard

I am somewhat at a loss as to how to proceed with your exact problem.

I cannot recreate deadlock/hang/block of the environment. Even with the logger required. And many other odd experiments.

I have to believe that something in your environment is triggering the behavior you see.

If you have any suggestions, I am all ears.

@gmallard

I have finally been able to recreate this on a sporadic basis. Recreation is not consistent. Maybe 20% of my tests will hang. All of the following seem to need to be present for me to recreate it at all:

  • logger is required (which is bizarre).
  • number of threads is high. 5 is not enough here. I need at least 10 before I start to see hangs at all.
  • broker is on a 'slow' network connection (not always necessary, but helps recreate).
  • broker is on a 'slow' machine (not always necessary, but helps recreate).
  • jruby.

When I can recreate, if I repeatedly press ^C, I see a number of messages that contain:

ConcurrencyError: interrupted waiting for mutex

Which is an interesting message. What mutex?

I am pretty much back to thinking that this is not a stomp gem problem.

I will continue probing as time permits.

@mlartz
Author

mlartz commented Jul 17, 2017

Thanks for all your research into this. I can probably import and test a pre-release gem, but it won't be a very fast process due to corporate restrictions sigh. Due to the same restrictions, I am not positive how the broker is configured, but I am pretty sure that they have needClientAuth=true, as we don't use passwords anymore (only client certs).

And I get a similar message when I ^C a deadlocked app. I was able to confirm the deadlock with jstack: running jstack -l <pid> against the JRuby process after the deadlock occurs reports a detected deadlock at the end of the stack traces.

I am running my app on a decently provisioned "cloud" (OpenStack) VM, and the brokers seem to be adequately provisioned, however it is virtual, so I can't be sure what's actually going on underneath the VM. That said, I am currently processing over 250 messages/second with the MRI implementation and 10 threads on a similarly provisioned machine, and am hoping that I can increase that a bit with JRuby due to better parallelism (i.e. no GIL). In my test setup, I usually latch onto a topic that gives me about 5-10 messages/sec, and that seems to be enough to replicate the deadlock within a few (<20) seconds.

Also, as another data point, I was able to get into the deadlock situation without logging present, although it took over an hour to occur. Perhaps the logging overhead slows things down just enough to increase the likelihood of a deadlock?

I don't disagree that this sounds like a non-STOMP problem, although I'm not sure where to look further. JRuby? Unfortunately, the latest JRuby that I have internally is 9.0.3.0, but I'll see if I can't request a newer one in the next couple of days.

@gmallard

Odds and ends here ........

Wow. Cool that you were able to reproduce it without the logger. Even if it took some time. That is good information.

I do not think a newer jruby will help. I can recreate the hangs with:

  • jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 +jit [linux-x86_64]

However, I do not see that 'waiting for mutex' message with 9.1.12.0. No information is produced with the use of ^C for that version.

In the interests of full disclosure of what I have looked at:

Thinking about all of those issue reports at a high level, I more-or-less conclude:

  • probably not a logger specific problem
  • probably not a concurrency gem problem (I did consider that)
  • probably not a stomp gem problem
  • is very likely a jruby problem - issue status a little unclear, but likely has not been fixed even in the latest version

Fortunately I do have access to my test broker. And to the logs produced. One interesting note (not surprising): when the hang/block occurs, not all of the ACKs have made it to the broker.

I very much wish that Ruby had a 'deadlock detection' mechanism as robust as the implementation in go (golang). This would be much easier to analyze.

On average, how large are your messages (bytes)?

@gmallard

I will add some comments before I get side tracked and/or just forget.

After additional research and experiments I conclude that this deadlock/hang behavior does not exist for:

  • Standard Ruby using TCP sockets
  • Standard Ruby using SSL sockets
  • JRuby using TCP sockets

To experience this behavior the required environment is:

  • JRuby using SSL sockets

I can recreate the behavior without using either the discussed logger gem or the concurrency gem. That code is here:

https://gist.github.com/gmallard/18108cac927b0d0704ca1c4cb95cc762

I had hoped that study and analysis would reveal a way to 'code around' this apparent JRuby problem. Unfortunately, given what I have observed during testing, I do not see a way the stomp gem can provide a 'code around'. (I remain open to suggestions of course).

What follows is a high level description.

  1. The multithread ACKer program starts.
  2. During processing, the various threads call stomp gem's ACK method
  3. The ACK method performs sanity checks, and calls the 'transmit' method.
  4. The 'transmit' method calls the low level '_transmit' method.

The '_transmit' method wraps a critical section using the 'transmit semaphore mutex'.

The basic job of this critical section is to put a single STOMP message onto the outbound wire. This is performed in phases, basically:

  • write the message COMMAND
  • write each of the message headers (if any)
  • write the message body (if any)

As time proceeds, the threads created by the ACKer program compete for this critical section.

At some point during processing an arbitrary thread will:

  • Obtain the transmit mutex
  • Enter the critical section
  • Proceed to write data to the outbound wire. This is done in the '_wire_write' method.
  • That method typically calls 'sock.puts' to write the data
  • The thread will execute that call. And that call will never return

This eventually and naturally leads to all other threads waiting for the transmit mutex, i.e. the deadlock.

I hope the above description is reasonably clear.
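
A stripped-down sketch of the shape of that transmit path (illustrative names and structure only, not the gem's exact code):

# Illustrative sketch: every outbound frame (including each ACK) is written
# inside one mutex-guarded critical section.
class MiniConnection
  def initialize(sock)
    @sock = sock
    @transmit_semaphore = Mutex.new
  end

  def _transmit(command, headers = {}, body = '')
    @transmit_semaphore.synchronize do
      _wire_write(command)                              # frame COMMAND
      headers.each { |k, v| _wire_write("#{k}:#{v}") }  # frame headers, if any
      _wire_write('')                                   # blank line before the body
      @sock.write(body + "\0")                          # body, then NULL terminator
    end
  end

  def _wire_write(data)
    @sock.puts(data)  # if this call never returns, the mutex is never released
  end
end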

To reiterate: I do not see a way to code around this at present.

@mlartz - my advice at this point is that you reconsider the use of SSL or JRuby or both. Also, I think this needs to be reported to the JRuby dev team, but my example(s) are too complex. Help me think about that please.

@gmallard

gmallard commented Jul 20, 2019

I do not know why but this issue popped into my head recently, and I thought I would take another look.

General observations:

The deadlock/hang problem persists with: jruby 9.2.7.0 (2.5.3) 2019-04-09 8a269e3 OpenJDK 64-Bit Server VM 25.212-b03 on 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03 +jit [linux-x86_64]

I can now recreate the problem with a normal Ruby environment. Ubuntu build: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]

The problem appears specific to "topics". I can not recreate it using a 'queue'.

I tried setting:

    OpenSSL::debug = true

in hopes of getting more information. Unfortunately, that did not tell me anything new.

One other note at this time: I can not recreate this problem with the Apollo broker or with Apache Artemis. And my local RabbitMQ is not set up for SSL.

So the relevant environment seems to be:

  • SSL
  • Use of a topic
  • Apache MQ

@gmallard

I have updated the program I am using for testing and recreating this issue. That can be found here:

https://gist.github.com/gmallard/0b1b6cc27d18c4c71b3649f5a3db409d

Testing Procedure:

  • set environment variables as appropriate for the test (see the code for what is supported)
  • start the program linked to above
  • in a different terminal session, push at least 10,000 messages to the topic being used

I have been testing with 5 receiver threads and 10,000 messages. That recreates the problem reliably.
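
For step 3, any simple publisher will do; roughly something like this (destination and options are placeholders that should match the consumer being tested):

require 'stomp'

# Push a batch of test messages to the topic the consumer is subscribed to.
publisher = Stomp::Client.new(options)   # same options hash as the consumer
10_000.times { |i| publisher.publish('/topic/mytopic', "test message #{i}") }
publisher.close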

I discovered something during recent tests. Specifying an appropriate activemq.prefetchSize makes this problem totally disappear !!

I just had success (using JRuby) with:

  • 5 receiver threads
  • activemq.prefetchSize = 25
  • 250,000 messages

I can only conclude that this issue is triggered by what is fundamentally an AMQ configuration problem.
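
For reference, with this gem that header is just an extra entry in the subscribe headers hash, roughly as sketched below. activemq.prefetchSize is an ActiveMQ extension rather than part of the STOMP spec, and 25 is only the value that worked in my tests; tune it for your own setup.

# Sketch: pass the broker-specific prefetch header on SUBSCRIBE.
client.subscribe('/topic/mytopic',
                 :id => 'mytopic',
                 :ack => 'client-individual',
                 'activemq.prefetchSize' => '25') do |msg|
  # process and ack as before
end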
