Fix race condition in getXOpaqueId in tests #45447
pgomulka wants to merge 11 commits into elastic:master
Conversation
Due to a race condition between isClosed and getHeader, getXOpaqueId can throw an IllegalStateException. This only occurs when multiple tests run on the same JVM, so these tests must not assert on this value; it can only be asserted when a test runs in isolation in its own JVM. This won't happen in production, as we only expect one ThreadContext per Node (JVM). Closes elastic#45443
Pinging @elastic/es-core-infra
```java
        }
    }
} catch (IllegalStateException e) {
    // ignore exception as this is due to a race condition between isClosed and getHeader. Only in test env.
```
Can we rethrow this if we aren't in a test environment? Could we fix the race condition? Blanket hiding of exceptions like this is dangerous because other exception sources can sneak in over time and we would be masking them unknowingly.
Good point. Fixing this in ThreadContext would require synchronisation and would impose a performance penalty. The minimum would be a read-write lock between close (write) and get (read).
I wonder if we should fix this in CloseableThreadLocal, where the NullPointerException is thrown from.
The close() method nullifies the threadLocal variable (t = null):
```java
@Override
public void close() {
    ...
    hardRefs = null;
    ...
    if (t != null) {
        t.remove();
    }
    t = null;
}
```
but then the first thing the get() method does is access it without checking whether it is null:
```java
public T get() {
    WeakReference<T> weakRef = t.get();
    ...
```
If we either modify t and hardRefs to be volatile variables and assign them "null objects" (a new ThreadLocal and a new empty map), or keep nullifying them but also guard t and hardRefs against null in get() and set(), it should be fixed.
I wonder if this was done on purpose to keep the design of the class simple, or if I am missing something. The CloseableThreadLocal javadoc warns about this:
> You should not call {@link #close} until all threads are done using the instance.
I am not sure it is worth introducing synchronisation (either a lock or volatile) just for testability. Again, we cannot even assert on the values of XOpaqueId when tests run in parallel; only when there is one Node per JVM (and therefore one `ThreadContext` per JVM) can we reliably assert on these values.
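The volatile / null-guard idea discussed above can be sketched as follows. This is a hypothetical illustration, not Lucene's actual CloseableThreadLocal; the class name GuardedThreadLocal and the choice to return null after close() are assumptions made for the sketch:

```java
// Hypothetical sketch of the null-guarding idea: a volatile field plus null
// checks let get()/set() tolerate a concurrent close() instead of throwing NPE.
class GuardedThreadLocal<T> {
    private volatile ThreadLocal<T> t = new ThreadLocal<>();

    public T get() {
        ThreadLocal<T> local = t;                  // read the volatile field once
        return local == null ? null : local.get(); // tolerate a concurrent close()
    }

    public void set(T value) {
        ThreadLocal<T> local = t;
        if (local != null) {                       // guard against a concurrent close()
            local.set(value);
        }
    }

    public void close() {
        ThreadLocal<T> local = t;
        t = null;                                  // publish the closed state first
        if (local != null) {
            local.remove();
        }
    }
}
```

A caller that races with close() then sees null instead of a NullPointerException, which matches the "return some default or null when already closed" behavior mentioned later in this thread.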
For now I would be tempted to rethrow the exception in production, but I am also happy to create a new issue to fix this properly.
WDYT?
Yes I would prefer to not mask the exception in non-test environments since it is not expected there.
@s1monw what are your views on fixing this in lucene?
I don't think we should change any of this code. IMHO the problem is that we pass in a threadlocal / threadcontext that is closed concurrently? How is this possible? This entire usage is designed to not be accessed after it's closed; we should stick to that. Can you explain why this fails, and maybe can we see an exception? I think the test must be fixed instead. This will just hide exceptions and cause trouble.
@s1monw I think the concurrent closure can only happen in a test environment. See here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L61
where we have a Set of ThreadContexts, some of which can be closed. In prod I expect only one ThreadContext, but I guess close() can still happen on a different thread.
The stacktrace can be found here #45443
The race condition arises because the code iterates over the set of ThreadContexts, checks whether each is closed, and then attempts to retrieve a header value. However, isClosed and getHeader are not synchronized together, so thread1 can call isClosed -> false, thread2 can then call close, and when thread1 comes back and calls getHeader it fails because the ThreadContext was closed by thread2 in the meantime.
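The check-then-act window described above can be reproduced deterministically by simulating thread2's close() between thread1's isClosed check and its getHeader call. ContextHolder is an illustrative stand-in for this sketch, not the actual ThreadContext class:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for ThreadContext, just enough to show the race shape.
class ContextHolder {
    private volatile Map<String, String> headers =
        new HashMap<>(Map.of("X-Opaque-Id", "abc"));

    boolean isClosed() {
        return headers == null;
    }

    void close() {
        headers = null;
    }

    String getHeader(String key) {
        Map<String, String> h = headers;
        if (h == null) {
            // this is the IllegalStateException the tests trip over
            throw new IllegalStateException("ThreadContext is already closed");
        }
        return h.get(key);
    }
}
```

Because isClosed and getHeader are two separate, unsynchronized calls, nothing stops close() from landing between them.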
In my use case, it would be fine to modify ThreadContext to return some default or null when get is called after it was closed, but I am not sure about other scenarios.
I suppose this problem has been seen before, as per the comment here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L241
```java
} catch (final IllegalStateException e) {
    // ignored; it should be removed shortly
}
```
Alright, I understand now. I think the design of this is flawed, but it's very hard to fix. Ideally we should refactor DeprecationLogger to use only one ThreadContext via SetOnce; I think that would be wise. That might be tricky, though, if we allow starting multiple nodes in the same environment.
What I think we should do instead is use a read-write lock and acquire the read lock every time we access that THREAD_CONTEXT set. Once we access it for writing we get exclusive access, and for reading we can be concurrent. The write lock will only be acquired once in a production env, which is good. All we need to make sure is that we close the ThreadContext after we have removed it. Does this make sense? Then you can also remove https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java#L241
s1monw left a comment:
looks good. I left some suggestions.
server/src/main/java/org/elasticsearch/common/logging/DeprecationLogger.java — 3 review comments (outdated, resolved)
```java
threadContext = new ThreadContext(settings);
// adds the context to the DeprecationLogger so that it does not need to be injected everywhere
DeprecationLogger.setThreadContext(threadContext);
```
Can we do this as the last thing in the ctor? Otherwise we need to clean up after ourselves in the case of an exception.
```java
    }
}

public void testIgnoresClosedThreadContext() throws IOException {
```
Removing this, as we don't expect closed contexts to be present in DeprecationLogger. A context should be removed before it is closed.
```java
public static boolean removeThreadContext(ThreadContext threadContext) {
    assert threadContext != null;

    // remove returning false means it did not have it already
```
This exception should not be thrown; closing a ThreadPool now also removes its ThreadContext from the DeprecationLogger.
It is possible to close a ThreadPool multiple times, as per NodesTests.testAwaitCloseTimeoutsOnNonInterruptibleTask.
This could happen when someone calls an API multiple times.
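One common way to make repeated close() calls safe is a compare-and-set guard, sketched here with illustrative names (ClosablePool is not the actual ThreadPool class; the counter exists only to make the behavior observable):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of an idempotent close(): the second and later calls are
// no-ops instead of errors, so closing a ThreadPool twice (as the test named
// above does) cannot trip over an already-removed ThreadContext.
class ClosablePool {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    int effectiveCloses = 0;  // counts how many calls actually did the work

    void close() {
        // compareAndSet succeeds exactly once; later callers see true and return
        if (closed.compareAndSet(false, true)) {
            effectiveCloses++;
            // ... deregister the ThreadContext, then release resources ...
        }
    }
}
```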
```diff
@@ -171,7 +171,6 @@ private void setupClient(ThreadPool threadPool) {
     @After
     public void tearDownAndVerifyCommonStuff() throws Exception {
         client.close();
```
The client was created with the same threadPool that is terminated a line below, and a client terminates its threadPool as well in NoOpClient.close(), so this is possibly not needed.
It might as well stay, but I think it is unnecessary.
fixed in #43249
gradle check?