use ConcurrentHashMap to avoid synchronized block #27350
tinder-xli wants to merge 4 commits into elastic:master
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
@elasticmachine please test this.
This comment is on the following hunk of the diff, where the new `computeIfAbsent` call is introduced:

```java
} else {
    throw new IllegalArgumentException("cache type not supported [" + cacheType + "] for field [" + fieldName + "]");
...
cache = fieldDataCaches.computeIfAbsent(fieldName,
    s -> {
```
It might be cleaner and create fewer new Function objects if you extract this compute block into a new method, say newIndexFieldDataCache(fieldName), and then just do fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache) here.
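The refactor suggested above can be sketched roughly like this. This is a simplified stand-in, not the actual IndexFieldDataService code: the IndexFieldDataCache stub and the factory body are illustrative only.

```java
import java.util.concurrent.ConcurrentHashMap;

class FieldDataCaches {
    // Stand-in for the real IndexFieldDataCache type in Elasticsearch.
    interface IndexFieldDataCache {}

    private final ConcurrentHashMap<String, IndexFieldDataCache> fieldDataCaches =
            new ConcurrentHashMap<>();

    // Extracted factory method, as suggested, so the call site can pass a
    // method reference instead of spelling out the whole compute block inline.
    private IndexFieldDataCache newIndexFieldDataCache(String fieldName) {
        return new IndexFieldDataCache() {};
    }

    IndexFieldDataCache getForField(String fieldName) {
        // computeIfAbsent creates the cache at most once per key and is
        // lock-free on the common hit path.
        return fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache);
    }
}
```

Repeated lookups for the same field return the same cache instance, which is the invariant the map is there to provide.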
@jpountz if you are ever worried about the correctness of cache.clear() etc., we can at least switch to double-checked locking, which would be performant as well: https://en.m.wikipedia.org/wiki/Double-checked_locking#Usage_in_Java
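For reference, a minimal sketch of the double-checked-locking idiom from the linked page. The class and field names here are illustrative, not Elasticsearch code; the `volatile` modifier is what makes the idiom safe under the Java memory model.

```java
class CacheHolder {
    private volatile Object cache;

    Object get() {
        Object result = cache;
        if (result == null) {                 // first check, lock-free fast path
            synchronized (this) {
                result = cache;
                if (result == null) {         // second check, under the lock
                    cache = result = new Object();
                }
            }
        }
        return result;
    }
}
```

After the first call, every subsequent call takes the lock-free path, which is why the pattern is attractive when the synchronized block shows up hot in profiles.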
jasontedor left a comment
Sorry, but this approach is not safe.
This comment is on the change that drops `synchronized` from `clear()`:

```diff
- public synchronized void clear() {
+ public void clear() {
```
This is not thread-safe. The problem is that the underlying cache implementation that backs IndexFieldCache says this:
```java
/**
 * An LRU sequencing of the keys in the cache that supports removal. This sequence is not protected from mutations
 * to the cache (except for {@link Iterator#remove()}. The result of iteration under any other mutation is
 * undefined.
 *
 * @return an LRU-ordered {@link Iterable} over the keys in the cache
 */
```
and this method is invoked behind the scenes to invalidate all the keys.
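The hazard that javadoc describes can be reproduced in miniature with a plain HashMap, whose fail-fast iterator at least detects the problem. This toy sketch is not the Elasticsearch Cache; it only illustrates why iterating the keys to invalidate them must not race with writers.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class IterationUnderMutation {
    // Iterates the key set while putting a new key; HashMap's fail-fast
    // iterator throws, illustrating "iteration under mutation is undefined".
    static boolean mutationDuringIterationFails() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        try {
            for (Iterator<String> it = map.keySet().iterator(); it.hasNext(); ) {
                it.next();
                map.put("c", 3); // structural modification mid-iteration
            }
            return false; // not reached with HashMap's fail-fast iterator
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

The cache the javadoc refers to makes no such fail-fast promise, so a racing mutation would not even throw: the iteration result is simply undefined.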
added synchronized back
Additionally, I don't understand what you mean by synchronized blocks causing GC issues for you; can you please explain that? Also, how long are your sample queries taking? Can you please show, for example, the took time (as the code exists today)?
my query is fetching a lot of docvalues, so …
@jasontedor so just to confirm, you are saying it is unsafe because when …
Look for example at …
@jasontedor ok so you are saying any operation on an instance of IndexFieldDataCache should be synchronized, correct? I'm wondering how the existing code base handles that. What if one thread calls …
@jasontedor just to confirm -- if I add …
|
@tinder-xli the original code has … @jasontedor can you help us understand:
The original reason why we have these sync blocks on the clear methods is point-in-time semantics. With a concurrent hashmap you always get a consistent view of the values in the cache, which might include additions that are happening after we …
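The point-in-time concern above can be illustrated single-threaded. This hypothetical sketch simulates what an unsynchronized clear() could observe: it works from a snapshot of the keys, a writer adds an entry after the snapshot, and that entry survives the clear (the field names are illustrative, not real cache contents).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotClear {
    // Simulates an unsynchronized clear() racing with a writer: the entry
    // added after the key snapshot is never invalidated.
    static int entriesLeftAfterClear() {
        ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
        cache.put("field1", new Object());
        cache.put("field2", new Object());

        // clear() takes its point-in-time view of the keys...
        Set<String> snapshot = new HashSet<>(cache.keySet());

        // ...then a concurrent writer sneaks in a new entry...
        cache.put("field3", new Object());

        // ...and removing only the snapshotted keys leaks the new one.
        snapshot.forEach(cache::remove);
        return cache.size();
    }
}
```

With the synchronized block, the writer cannot interleave between the snapshot and the removals, so the leaked entry cannot occur.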
@tinder-xli I'm not convinced this is the cause of the slowdown (I am not saying that you're wrong, I am saying that I need more convincing). The JFR output that you show shows threads waiting on the lock for < 7ms on average. That is not great, but I think it's a leap from threads being blocked for < 7 ms to query performance degrading by 330ms. I understand you think it is related to GC, but there is not sufficient evidence here to conclude that (I would need more evidence to be convinced that threads holding onto references for an extra < 7ms on average would explain such a harsh degradation). Are you sure that there is not something else going on here (cache thrashing)? As you can see, getting this right is hard, so we really need to make sure we are targeting the right thing here.
Can you clarify one thing here? How are you measuring this? Is this an average latency (:cry:) or a percentile latency and, if so, which percentile?
@jasontedor that <6ms block is per method invocation; the total blocked time is > 30s and my total recording time is 60s, so it is more than 50% of the time blocked.
@s1monw thanks for explaining. Clearing while mutating the hashmap causing resource leaks is a valid concern. @tinder-xli let's close this PR and open a new one with double-checked locking; that should address all the concerns raised above with minimal impact on the logic.
created a new PR #27365
@tinder-xli As I mentioned, I agree that threads being blocked for 7ms is not great, but I'm still unconvinced it explains the change in latency here. That is, you don't need to convince me that threads waiting here for 7ms is bad; what I still need to be convinced of is that it explains a change in latency from 70ms to 400ms.
@jasontedor let me try to explain it another way: that is 7ms per hit, not per request. The average hit size for my query is around 25k, so even if only a small fraction of them get blocked, it can easily make the latency jump 5x.
@tinder-xli Okay, with hit sizes that large, now you've convinced me. Thank you. I think the double-checked locking solution is a much better one.
During our perf test against Elasticsearch, we noticed two synchronized blocks in the call stack of fetching doc values, which I think are unnecessary and cause a serious GC issue for us, especially when the "size" param is large and docvalue fields are fetched.
getForField method in IndexFieldDataService.java: there is a synchronized block around getting from the fieldDataCaches map, which is unnecessary when cache != null. We suggest changing fieldDataCaches from HashMap to ConcurrentHashMap so we don't need any explicit synchronization logic at all.
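A before/after sketch of that change (heavily simplified; the real getForField does much more than populate the map, and the value type here is a placeholder):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class GetForFieldSketch {
    private final Map<String, Object> syncCaches = new HashMap<>();
    private final ConcurrentHashMap<String, Object> concurrentCaches = new ConcurrentHashMap<>();

    // Before: every lookup, including cache hits, contends on the monitor.
    synchronized Object getSynchronized(String fieldName) {
        return syncCaches.computeIfAbsent(fieldName, k -> new Object());
    }

    // After: cache hits are lock-free; only concurrent misses for the same
    // key briefly block inside computeIfAbsent.
    Object getConcurrent(String fieldName) {
        return concurrentCaches.computeIfAbsent(fieldName, k -> new Object());
    }
}
```

Both versions return a stable instance per field; the difference is only in how much cross-thread contention a cache hit pays.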
We see threads waiting on the getForField method in our JFR recording.
gradle test all passed locally. For reference, below is a sample query we used for testing: