
Fix race condition in getXOpaqueId in tests #45447

Closed

pgomulka wants to merge 11 commits into elastic:master from pgomulka:fix/catch-exceptions-deprecation-logger

Conversation

@pgomulka (Contributor)

Due to a race condition between isClosed and getHeader, it is possible for
getXOpaqueId to throw an IllegalStateException.
This only occurs when multiple tests run on the same JVM. The value is not
meant to be asserted in these tests, only when a test runs in isolation in
its own JVM.
This won't happen in production, as we only expect one ThreadContext per
node (JVM).

closes #45443
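The race is a classic check-then-act problem. A minimal sketch (illustrative only, not Elasticsearch's actual ThreadContext; the class and header names here are stand-ins) of the interleaving described above:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RaceSketch {
    static class Context {
        final AtomicBoolean closed = new AtomicBoolean(false);

        boolean isClosed() { return closed.get(); }

        String getHeader(String key) {
            if (closed.get()) {
                // mirrors the IllegalStateException seen in the test failure
                throw new IllegalStateException("threadcontext is already closed");
            }
            return "some-value";
        }

        void close() { closed.set(true); }
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        // thread1 checks isClosed() -> false ...
        boolean open = !ctx.isClosed();
        // ... thread2 closes the context here, between check and use ...
        ctx.close();
        // ... and thread1's getHeader() now throws despite the earlier check
        try {
            if (open) ctx.getHeader("X-Opaque-Id");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Because isClosed and getHeader are two separate operations, no check on the caller's side can rule the exception out without external synchronization.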

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your branch locally prior to submission with gradle check?
  • If submitting code, is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@pgomulka pgomulka added >test Issues or PRs that are addressing/adding tests :Core/Infra/Core Core issues without another label labels Aug 12, 2019
@pgomulka pgomulka self-assigned this Aug 12, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

}
}
} catch (IllegalStateException e) {
// ignore exception as this is due to a race condition between isClosed and getHeader. Only in test env.
Member

Can we rethrow this if we aren't in a test environment? Could we fix the race condition? Blanket hiding of exceptions like this is dangerous because other exception sources can sneak in over time and we would be masking them unknowingly.

@pgomulka (Contributor Author), Aug 13, 2019

Good point. Fixing this in ThreadContext would require synchronisation and would impose a performance penalty.
The minimum would be a ReadWrite lock between close (write) and get (read).

I wonder if we should fix this in CloseableThreadLocal, where the NullPointerException is thrown from:
the close() method nullifies the threadLocal variable with t = null;

@Override
public void close() {
  ...
  hardRefs = null;
  ...
  if (t != null) {
    t.remove();
  }
  t = null;
}

but then the first thing the get() method does is access it without checking if it is null:

public T get() {
    WeakReference<T> weakRef = t.get();
...

If we modify t and hardRefs to be volatile variables and assign them "null objects" (a new ThreadLocal and a new empty map), or if we still nullify them but also guard t and hardRefs against nulls in get() and set(), it should be fixed.

I wonder if this was done on purpose to keep the design of the class simple, or if I am missing something. CloseableThreadLocal warns about this in its javadoc:

You should not call {@link #close} until all
threads are done using the instance.

I am not sure it is not too expensive to introduce synchronisation (either a lock or volatile) just for testability. Again, we cannot even assert on the value of XOpaqueId when tests run in parallel. Only when there is one node per JVM (and therefore one ThreadContext per JVM) can we reliably assert on these values.
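The null-guard idea above can be sketched as follows. This is a hypothetical, simplified variant, not Lucene's actual CloseableThreadLocal (which also maintains a hardRefs map, omitted here): t is volatile, and get()/set() read the field once and tolerate null, so a concurrent close() yields a null result instead of a NullPointerException.

```java
import java.lang.ref.WeakReference;

public class GuardedThreadLocal<T> {
    // volatile so a close() on one thread is visible to get() on another
    private volatile ThreadLocal<WeakReference<T>> t = new ThreadLocal<>();

    public T get() {
        ThreadLocal<WeakReference<T>> local = t; // single read; close() may null the field
        if (local == null) {
            return null; // already closed: degrade to a default instead of throwing
        }
        WeakReference<T> weakRef = local.get();
        return weakRef == null ? null : weakRef.get();
    }

    public void set(T object) {
        ThreadLocal<WeakReference<T>> local = t;
        if (local != null) {
            local.set(new WeakReference<>(object));
        }
    }

    public void close() {
        ThreadLocal<WeakReference<T>> local = t;
        t = null; // publish the closed state first
        if (local != null) {
            local.remove();
        }
    }
}
```

Note the cost this trades in: every get() and set() now pays a volatile read, which is exactly the synchronisation overhead being weighed above.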

@pgomulka (Contributor Author)

For now I would be tempted to throw the exception in prod, but I am also happy to create a new issue to possibly fix this.
WDYT?

Member

Yes I would prefer to not mask the exception in non-test environments since it is not expected there.

@pgomulka (Contributor Author)

@s1monw what are your views on fixing this in lucene?

Contributor

I don't think we should change any of this code. IMHO the problem is that we pass in a threadlocal / threadcontext that is closed concurrently? How is this possible? This entire usage is designed not to be accessed after it's closed, and we should stick to that. Can you explain why this fails, and maybe show an exception? I think the test must be fixed instead. This will just hide exceptions and cause trouble.

@pgomulka (Contributor Author)

@s1monw I think the concurrent closure can only happen in Test environment. See here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L61
where we have a Set of ThreadContext and some of them can be closed. In prod I expect only one ThreadContext, but I guess close() can still happen on a different thread.

The stacktrace can be found here #45443

The reason it ends up with this race condition is that it iterates over the set of ThreadContexts, checks whether each is closed, and then attempts to retrieve a header value. However, isClosed and getHeader are not synchronized together, so it can happen that thread1 calls isClosed -> false, then thread2 calls close, and then thread1 calls getHeader, which fails because the ThreadContext was closed by thread2 in the meantime.

In my use case, it would be fine to modify ThreadContext to return some default or null when get is called after it was already closed, but I am not sure about other scenarios.

I suppose this problem has been seen earlier as per comment here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L241

} catch (final IllegalStateException e) {
    // ignored; it should be removed shortly
}

Contributor

Alright, I understand now. I think the design of this is flawed, but it's very hard to fix. Ideally we should refactor DeprecationLogger to only use one ThreadContext; using SetOnce for that I think would be wise. But that might be tricky if we allow starting multiple nodes in the same environment.

What I think we should do instead is use a read-write lock and acquire the read lock every time we access the THREAD_CONTEXT set. Once we access it for write we get exclusive access, and for reading we can be concurrent. The write lock will only be acquired once in a production env, which is good. All we need to make sure now is that we close the ThreadContext after we have removed it from the set. Does this make sense? Then you can also remove https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L241
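The proposal above can be sketched as follows. This is illustrative only, not the actual DeprecationLogger code: ThreadContext is stood in for by a plain Map<String, String>, and the names (THREAD_CONTEXTS, getXOpaqueId) are assumptions. Readers take the read lock concurrently; add/remove take the write lock; close() happens only after removal, so no reader can observe a closed context.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeprecationLoggerSketch {
    private static final ReadWriteLock LOCK = new ReentrantReadWriteLock();
    private static final Set<Map<String, String>> THREAD_CONTEXTS = new HashSet<>();

    public static void addThreadContext(Map<String, String> context) {
        LOCK.writeLock().lock(); // exclusive; happens once per node in production
        try {
            THREAD_CONTEXTS.add(context);
        } finally {
            LOCK.writeLock().unlock();
        }
    }

    public static void removeThreadContext(Map<String, String> context) {
        LOCK.writeLock().lock();
        try {
            THREAD_CONTEXTS.remove(context);
        } finally {
            LOCK.writeLock().unlock();
        }
        // only close the context after this returns: no reader can still see it
    }

    public static String getXOpaqueId() {
        LOCK.readLock().lock(); // log calls proceed concurrently under the read lock
        try {
            for (Map<String, String> context : THREAD_CONTEXTS) {
                String id = context.get("X-Opaque-Id");
                if (id != null) {
                    return id;
                }
            }
            return "";
        } finally {
            LOCK.readLock().unlock();
        }
    }
}
```

The key invariant is the remove-then-close ordering: once removeThreadContext returns, the write lock guarantees no in-flight reader holds a reference to the removed context, so the subsequent close cannot race with getHeader.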

@s1monw (Contributor) left a comment

looks good. I left some suggestions.

@s1monw (Contributor) left a comment

LGTM, I left one comment.


threadContext = new ThreadContext(settings);
// adds the context to the DeprecationLogger so that it does not need to be injected everywhere
DeprecationLogger.setThreadContext(threadContext);
Contributor

Can we do this as the last thing in the ctor? Otherwise we need to clean up after ourselves in the case of an exception.
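The concern is exception safety during construction: if the constructor registers the context and a later initialisation step throws, the half-constructed instance leaves a stale registration behind. A minimal sketch, with all names (NodeSketch, Registry) being illustrative rather than the actual code:

```java
public class NodeSketch {
    private final Object threadContext = new Object();

    public NodeSketch(boolean failLater) {
        // ... other initialisation that may throw ...
        if (failLater) {
            throw new IllegalStateException("init failed");
        }
        // register only once construction can no longer fail, so a ctor
        // exception never leaves a stale context in the registry
        Registry.register(threadContext);
    }

    static class Registry {
        static final java.util.Set<Object> CONTEXTS = new java.util.HashSet<>();

        static void register(Object ctx) {
            CONTEXTS.add(ctx);
        }
    }
}
```

With registration last, a throwing constructor needs no try/catch cleanup at all.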

}
}

public void testIgnoresClosedThreadContext() throws IOException {
@pgomulka (Contributor Author)

Removing this, as we don't expect closed contexts to be present in DeprecationLogger. A context should be removed before closing.

public static boolean removeThreadContext(ThreadContext threadContext) {
assert threadContext != null;

// remove returning false means it did not have it already
@pgomulka (Contributor Author)

This exception should not be thrown; closing a ThreadPool now also removes its ThreadContext from the DeprecationLogger.
It is possible to close a ThreadPool multiple times, as per NodesTests.testAwaitCloseTimeoutsOnNonInterruptibleTask.
This could happen when someone calls an API multiple times.

@@ -171,7 +171,6 @@ private void setupClient(ThreadPool threadPool) {
@After
public void tearDownAndVerifyCommonStuff() throws Exception {
client.close();
@pgomulka (Contributor Author)

The client was created with the same threadPool that is terminated a line below, and the client terminates the threadPool as well in NoOpClient.close(), so this is possibly not needed.

It might as well stay, but I think it is unnecessary.

@jaymode (Member)

jaymode commented Nov 22, 2019

@pgomulka I don't think we need this PR anymore. #43249 changed the ThreadContext to no longer use a CloseableThreadLocal, so we should not hit this failure on master and 7.x. What do you think?

@pgomulka (Contributor Author)

fixed in #43249


Labels

:Core/Infra/Core Core issues without another label >test Issues or PRs that are addressing/adding tests


Development

Successfully merging this pull request may close these issues.

Zen1IT testMultipleNodeMigrationFromZen1ToZen2WithThreeNodes failes on DeprecationLogging

5 participants