Logstash Crash with ConcurrencyError: interrupted waiting for mutex: null #26

purbon opened this issue Aug 4, 2016 · 12 comments
@purbon

purbon commented Aug 4, 2016

Under a specific set of circumstances there can be a concurrency problem while processing the /etc/hosts file to set up the necessary resolvers. This situation seems to happen only when one filter instance is stuck setting up the hosts and another instance gets killed while waiting on the same mutex, which makes it tricky to reproduce with a live LS instance.

The condition can be reproduced independently of Logstash using the code at https://gist.github.com/purbon/106e80a5d85b3a6fbf1e5f10b5d0d643.
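For context, a rough sketch of what such a reproduction might look like (an approximation, not the gist itself, and it assumes JRuby, where ConcurrencyError is defined):

    # Rough approximation of the race (assumes JRuby): many threads share one
    # Resolv instance; Timeout.timeout interrupts a thread while it is blocked
    # on the resolver's lazy_initialize mutex, which JRuby reports as
    # "ConcurrencyError: interrupted waiting for mutex: null".
    require 'resolv'
    require 'timeout'

    resolver = Resolv.new   # shared resolver, initialized lazily under a Mutex

    threads = 20.times.map do
      Thread.new do
        begin
          Timeout.timeout(0.01) { resolver.getname("8.8.8.8") }
        rescue Timeout::Error
          # expected: the lookup simply took longer than the tiny timeout
        rescue ConcurrencyError => e   # JRuby-only exception class
          puts "reproduced: #{e.message}"
        end
      end
    end
    threads.each(&:join)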

The exception reported is:

"exception"=>#<ConcurrencyError: interruptedwaitingformutex: null>,
    "backtrace"=>["org/jruby/ext/thread/Mutex.java:94:in `lock'",
    "org/jruby/ext/thread/Mutex.java:147:in `synchronize'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190:in `lazy_initialize'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:268:in `each_name'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:151:in `each_name'",
    "org/jruby/RubyArray.java:1613:in `each'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:150:in `each_name'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:132:in `getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:244:in `getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:231:in `retriable_getname'",
    "org/jruby/RubyProc.java:281:in `call'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:216:in `retriable_request'",
    "org/jruby/ext/timeout/Timeout.java:115:in `timeout'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:215:in `retriable_request'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:230:in `retriable_getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:178:in `reverse'",
    "org/jruby/RubyArray.java:1613:in `each'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:156:in `reverse'",

This problem is independent of the LS code; it has been reported in several different environments, see [1] and [2] as examples.

Plan of action:

@jordansissel
Contributor

I don't yet agree fully with the problem description.

Your sample code is clear, thank you for that! However, my main concern is that the dns filter is interrupted at all -- what causes this?

Some notes from reading the code: it seems like Mutex#synchronize is being interrupted. Interrupting a thread in a critical section seems like bad practice. Anyway, the backtrace frame "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190:in 'lazy_initialize'" appears to be the resolv library waiting for ownership of the mutex. The frame "org/jruby/ext/timeout/Timeout.java:115:in 'timeout'" tells me we are using the timeout library to interrupt the dns lookup after an expiration time. Indeed, the Timeout library is used in the dns filter.

The problem I see is that we are using the Timeout library to abort lookups that take too long. This timeout interrupts the lookup thread and, I believe, results in this ConcurrencyError: interrupted waiting for mutex: null error.

Some possible solutions I can think of:

  1. Avoid using Timeout library and signal timeouts with another mechanism
  2. or, rescue ConcurrencyError and have the code treat the exception as a timeout if the exception message includes "interrupted waiting for mutex" (see the sketch after this list)
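
To make option 2 concrete, a hedged sketch (not actual plugin code; the helper name and the message check are assumptions):

    # Sketch of option 2: keep using Timeout, but translate the JRuby-specific
    # ConcurrencyError raised while waiting on resolv's mutex into a timeout.
    require 'timeout'

    def retriable_getname_with_timeout(resolver, address, timeout_secs)
      Timeout.timeout(timeout_secs) { resolver.getname(address) }
    rescue ConcurrencyError => e
      # Only treat the known mutex-wait interruption as a timeout;
      # re-raise anything else so real concurrency bugs still surface.
      raise unless e.message.include?("interrupted waiting for mutex")
      raise Timeout::Error, "DNS lookup timed out for #{address}"
    end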

@purbon
Author

purbon commented Aug 4, 2016

That is why I love discussion 👍. I agree the timeout could be the source of the thread getting killed, but the problem is trickier to debug, I guess. Even with a very small timeout value we did not manage to reproduce this issue with a Logstash pipeline.

I have an open question:

  • Why does resolv's lazy_initialize get interrupted during the lazy initialization? My expectation would be that this resolver is thread safe.

I agree your points are fair solutions, but I also have a few open questions:

re: avoiding the Timeout library:

  • we still need a timeout mechanism somehow, and this usually means either killing a thread or having a control variable; but for timing out a task we pretty much need to kill the thread if we don't control what runs inside the execution block.

re: rescuing ConcurrencyError:

  • I wonder about the pros and cons of this solution. If we rescue this error in this situation we are clearly working around the problem by treating it the same as a timeout. But I wonder whether we would be masking other kinds of problems that could be dangerous and would actually be valid reasons to blow up.

This is basically why I went for the mutex, to "protect" the resolve operation and keep it thread safe.

In case we decide to remove the mutex and go for one of your proposals, I would lean towards 2.; however, I would like to rule out possible masking of other errors first.

What do you think?

@jordansissel
Contributor

we still need a timeout mechanism somehow

The Resolv library appears to support timeouts for DNS lookups: http://ruby-doc.org/stdlib-2.1.0/libdoc/resolv/rdoc/Resolv/DNS.html#method-i-timeouts-3D
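
For reference, a minimal sketch of what a resolver-level timeout could look like (assuming the Ruby 2.x stdlib, where Resolv::DNS#timeouts= is available; not the filter's actual code):

    require 'resolv'

    # Resolver-level timeout: the lookup fails with Resolv::ResolvError after
    # the configured time instead of the Timeout library interrupting the thread.
    resolver = Resolv::DNS.new
    resolver.timeouts = 2   # seconds; an array like [1, 2, 4] sets per-retry timeouts

    begin
      puts resolver.getname("8.8.8.8")
    rescue Resolv::ResolvError => e
      puts "lookup failed or timed out: #{e.message}"
    end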

@purbon
Author

purbon commented Aug 9, 2016

@jordansissel this is a very good idea, I will update my proposal with it.

@purbon
Author

purbon commented Aug 9, 2016

I reviewed the code and I wonder where we stand on the move to the 2.0 codebase. I remember some discussion with @andrewvc about this, but I am not sure a direction was decided. I ask because this feature is only available in 2.0; however, we could patch 1.9 to change this value and see how it works.

My preference would be not to change to 2.0, as we would need time to test, but I am open to proposals for sure. I will update the PR with the 1.9 patch solution and wait for your input.

What do you think?

@purbon
Author

purbon commented Aug 9, 2016

@jordansissel I updated the PR with a proposal to remove the Timeout module, let me know what you think.

@jordansissel
Contributor

I wonder where we stand on the move to the 2.0 codebase

We won't be moving to 2.0 (and also jruby 9k) until after 5.0.0

@purbon
Author

purbon commented Aug 10, 2016

@jordansissel then my current proposal at #27 should be good, as I patched resolv in 1.9 to introduce these timeouts. Let me know how it works for you. Does it make sense?

@headius

headius commented Aug 24, 2016

FYI I looked at this a bit in jruby/jruby#4048. The example there does indeed produce the same error, but I'd expect it to produce that error since the native thread is being interrupted while it waits on a mutex. The real problem here is finding why that happens at runtime in Logstash. We need to trap the interrupt somehow.

@fedelemantuano

Do you have any updates about this issue?

@VuokkoVuorinnen

VuokkoVuorinnen commented Nov 16, 2016

Same issue happening here, or at least something similar

Logstash 2.4.0
CentOS 6.8 x86
Puppet writes a whole lot of hosts to /etc/hosts; there is no DNS server because this runs in an environment not connected to the big bad internet.

Error message

ConcurrencyError: interrupted waiting for mutex: null
               lock at org/jruby/ext/thread/Mutex.java:94
        synchronize at org/jruby/ext/thread/Mutex.java:147
    lazy_initialize at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:268
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:151
               each at org/jruby/RubyArray.java:1613
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:150
            getname at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:132
            getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:244
  retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:231
               call at org/jruby/RubyProc.java:281
  retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:216
            timeout at org/jruby/ext/timeout/Timeout.java:115
  retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:215
  retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:230
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/cache.rb:34
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:6
    mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:211
    mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:210
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:5
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
               each at org/jruby/RubyArray.java:1613
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:156
             filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:95
       multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:151
               each at org/jruby/RubyArray.java:1613
       multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:148
         initialize at (eval):41679
               each at org/jruby/RubyArray.java:1613
         initialize at (eval):41675
               call at org/jruby/RubyProc.java:281
        filter_func at (eval):5309
       filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:267
               each at org/jruby/RubyArray.java:1613
             inject at org/jruby/RubyEnumerable.java:852
       filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:265
        worker_loop at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:223
      start_workers at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:201

@sumitkgaur

Has this issue been resolved in any version?
