Skip to content

Conversation

@original-brownbear
Copy link
Contributor

@original-brownbear original-brownbear commented Feb 19, 2019

The way this dead-locks is:

Blocked size call:

"elasticsearch[node_t0][transport_worker][T#1]@11742" daemon prio=5 tid=0x193 nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
	 waiting for TEST-HotThreadsIT.testHotThreadsDontFail-seed#[E2559D1759E37C60]@3175 to release lock on <0x2e37> (a org.elasticsearch.test.InternalTestCluster)
	  at org.elasticsearch.test.InternalTestCluster.size(InternalTestCluster.java:1595)
	  at org.elasticsearch.action.admin.HotThreadsIT$1.onResponse(HotThreadsIT.java:96)
	  at org.elasticsearch.action.admin.HotThreadsIT$1.onResponse(HotThreadsIT.java:89)
	  at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68)
	  at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
	  at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:238)
	  at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:215)
	  at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$000(TransportNodesAction.java:142)
	  at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:193)
	  at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:183)
	  at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108)
	  at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975)
	  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	  at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
	  at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:967)
	  at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:942)
	  at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753)
	  at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:780)
	  at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:241)
	  at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:213)
	  at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:47)
	  at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:119)
	  at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:398)
	  at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246)
	  at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
	  at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
	  at org.elasticsearch.nio.NioSelectorGroup$$Lambda$1618.1423413979.run(Unknown Source:-1)
	  at java.lang.Thread.run(Thread.java:834)

Blocked connect call (will only unblock if the above finishes since the above blocks the select loop):

"TEST-HotThreadsIT.testHotThreadsDontFail-seed#[E2559D1759E37C60]@3175" prio=5 tid=0x1f nid=NA waiting
  java.lang.Thread.State: WAITING
	 blocks elasticsearch[transport_client_node_t0][generic][T#1]@11829
	 blocks elasticsearch[node_t0][transport_worker][T#1]@11742
	  at jdk.internal.misc.Unsafe.park(Unsafe.java:-1)
	  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
	  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
	  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
	  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
	  at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
	  at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	  at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57)
	  at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34)
	  at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:226)
	  at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:85)
	  at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:367)
	  at org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler.doSample(TransportClientNodesService.java:411)
	  at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.sample(TransportClientNodesService.java:362)
	  - locked <0x2e36> (a java.lang.Object)
	  at org.elasticsearch.client.transport.TransportClientNodesService.addTransportAddresses(TransportClientNodesService.java:201)
	  at org.elasticsearch.client.transport.TransportClient.addTransportAddress(TransportClient.java:341)
	  at org.elasticsearch.test.InternalTestCluster$TransportClientFactory.client(InternalTestCluster.java:1084)
	  at org.elasticsearch.test.InternalTestCluster$NodeAndClient.getOrBuildTransportClient(InternalTestCluster.java:943)
	  at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:909)
	  at org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:749)
	  - locked <0x2e37> (a org.elasticsearch.test.InternalTestCluster)
	  at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:648)
	  at org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:641)
	  at org.elasticsearch.action.admin.HotThreadsIT.testHotThreadsDontFail(HotThreadsIT.java:126)
	  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
	  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	  at java.lang.reflect.Method.invoke(Method.java:566)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	  at java.lang.Thread.run(Thread.java:834)


The way this dead-locks is:
* The `size` method on the internal cluster needs a monitor on the internal cluster object in the select thread on node_1
* `client()` call on the test thread connects to node_1 at the same time, holding the monitor on the internal test cluster during the connect call
   * That connect can not finish because the select loop that would dispatch it is blocked waiting for the `size()` call to complete => Deadlock
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Network Http and internode communication implementations v8.0.0 v7.2.0 labels Feb 19, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an issue in InternalTestCluster and we should try and fix it at source rather than here. If there are not already other places we call cluster().size() in a way that might lead to deadlock, there will be in future.

@original-brownbear
Copy link
Contributor Author

original-brownbear commented Feb 20, 2019

@DaveCTurner fair point, a more global fix is WIP in #39168 :) (sorry for not noting this here) I'll close here, the other PR should be ready in a few minutes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Network Http and internode communication implementations >test Issues or PRs that are addressing/adding tests v7.2.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] HotThreadsIT.testHotThreadsDontFail failure on master

4 participants