
use ConcurrentHashMap to avoid synchronized block #27350

Closed
tinder-xli wants to merge 4 commits into elastic:master from tinder-xli:fix-cache

Conversation

@tinder-xli
Contributor

During our perf tests against Elasticsearch, we noticed two synchronized blocks in the call stack of fetching doc values, which I think are unnecessary and cause a serious GC issue for us, especially when the "size" param is large and docvalue fields are fetched.

The getForField method in IndexFieldDataService.java has a synchronized block around the lookup in the fieldDataCaches map, which is unnecessary when cache != null. We suggest changing fieldDataCaches from a HashMap to a ConcurrentHashMap, so that no explicit synchronization logic is needed at all; a sketch follows below.
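Roughly, this is the shape of the change we have in mind (a simplified sketch only; the stand-in types and the newCache placeholder are illustrative, not the actual Elasticsearch source, where getForField also builds the IndexFieldData):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FieldDataCacheSketch {
    // Stand-in for the real IndexFieldDataCache type.
    interface IndexFieldDataCache { void clear(); }

    // Before: a plain HashMap forces every lookup through the service
    // monitor, even when the cache entry already exists.
    private final Map<String, IndexFieldDataCache> before = new HashMap<>();

    synchronized IndexFieldDataCache getForFieldBefore(String fieldName) {
        IndexFieldDataCache cache = before.get(fieldName);
        if (cache == null) {
            cache = newCache(fieldName);
            before.put(fieldName, cache);
        }
        return cache;
    }

    // After: ConcurrentHashMap reads are lock-free; computeIfAbsent only
    // locks internally while a missing entry is being created.
    private final ConcurrentHashMap<String, IndexFieldDataCache> after = new ConcurrentHashMap<>();

    IndexFieldDataCache getForFieldAfter(String fieldName) {
        return after.computeIfAbsent(fieldName, this::newCache);
    }

    // Placeholder for the real cache construction logic.
    private IndexFieldDataCache newCache(String fieldName) {
        return () -> { };
    }
}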
We see threads waiting on getForField method in our JFR recording.
[Screenshot: JFR recording (2017-11-10) showing threads blocked in getForField]

All Gradle tests passed locally.

For reference, below is a sample query we used for testing:

{
  "stored_fields": "_none_",
  "docvalue_fields": ["user_number"],
  "size": 33000,
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ]
    }
  }
}

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@tinder-xli
Contributor Author

@jasontedor

@dnhatn
Member

dnhatn commented Nov 11, 2017

@elasticmachine please test this.

@dnhatn dnhatn requested a review from jpountz November 11, 2017 00:59
        } else {
            throw new IllegalArgumentException("cache type not supported [" + cacheType + "] for field [" + fieldName + "]");
        }
        cache = fieldDataCaches.computeIfAbsent(fieldName,
            s -> {
@tinder-fren commented Nov 11, 2017

It might be cleaner, and allocate fewer Function objects, if you extract this compute block into a new method, say "newIndexFieldDataCache(fieldName)", and then just do fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache) here; see the sketch below.
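Roughly (a fragment, not the actual PR code; newIndexFieldDataCache is the hypothetical name suggested above, and its body is a placeholder for the original lambda's construction logic):

// Hypothetical extracted factory method; the body would contain whatever
// the original compute lambda did.
private IndexFieldDataCache newIndexFieldDataCache(String fieldName) {
    return buildCache(fieldName); // buildCache is a placeholder name
}

// The call site then reads as a single line:
cache = fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache);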

@tinder-fren

@jpountz if you are worried about the correctness of cache.clear() etc., we can at least switch to double-checked locking, which would be performant as well (sketch below): https://en.m.wikipedia.org/wiki/Double-checked_locking#Usage_in_Java
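For reference, a minimal sketch of that shape (illustrative names; note the map must be a ConcurrentHashMap or otherwise safely published for the unlocked first read to be correct):

private final Map<String, IndexFieldDataCache> fieldDataCaches = new ConcurrentHashMap<>();

public IndexFieldDataCache getForField(String fieldName) {
    IndexFieldDataCache cache = fieldDataCaches.get(fieldName); // fast path: no lock
    if (cache == null) {
        synchronized (this) {
            cache = fieldDataCaches.get(fieldName); // re-check under the lock
            if (cache == null) {
                cache = newIndexFieldDataCache(fieldName);
                fieldDataCaches.put(fieldName, cache);
            }
        }
    }
    return cache;
}

Since clear() and clearField() would keep taking the same monitor, creation and clearing stay mutually exclusive, while cache hits skip the lock entirely.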

@jasontedor (Member) left a comment

Sorry, but this approach is not safe.

     }

-    public synchronized void clear() {
+    public void clear() {
@jasontedor (Member)

This is not thread-safe. The problem is that the underlying cache implementation that backs IndexFieldCache says this:

    /**
     * An LRU sequencing of the keys in the cache that supports removal. This sequence is not protected from mutations
     * to the cache (except for {@link Iterator#remove()}. The result of iteration under any other mutation is
     * undefined.
     *
     * @return an LRU-ordered {@link Iterable} over the keys in the cache
     */

and this method is invoked behind the scenes to invalidate all the keys.

@tinder-xli (Contributor, Author)

added synchronized back

@jasontedor
Member

Additionally, I don't understand what you mean by synchronized blocks causing GC issues for you; can you please explain that? Also, how long are your sample queries taking? Can you please show, for example, the took time (as the code exists today)?

@jasontedor (Member) left a comment

Sorry, but adding synchronized back on clear is not sufficient, as it still leaves the cache open to modification while a clear is taking place, which is exactly what is unsafe.

@tinder-xli
Contributor Author

my query is fetching a lot of docvalues, so getForField is invoked very often. However, since all my queries are fetching the same field from docvalues, theoretically only the very first query has to create a cache for that field, and all following queries don't have to create it again.
In the current impl, there is a synchronized block whether the cache is null or not, so all the search threads are blocked by it, resulting in lots of data held in memory, and GC gets triggered much more often. And since the threads are still holding references to the data (they just get blocked by this synchronized block), it is hard for GC to find anything available to clean, so the whole GC process can be really, really long.
Apparently the existing impl works fine when my QPS is low, but as soon as I increase it a little, the GC issue starts to surface; e.g., when I increase my query QPS from 40 -> 60, the latency jumps from 70ms to 400ms, mostly due to super long GC.

@tinder-xli
Contributor Author

@jasontedor so just to confirm, you are saying it is unsafe because when clear() is going on, some other thread may be calling getForField() to create it again?

@jasontedor
Member

Look for example at clearField. It too invokes clear on an index field data cache element. If this occurs concurrently with a call to IndexFieldDataService#clear, that can be a problem because both could call IndexFieldDataCache#clear concurrently on the same instance. In general, any concurrent modification to an instance of IndexFieldDataCache while IndexFieldDataCache#clear is running is unsafe.
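To make the race concrete, a simplified sketch of the two entry points with the synchronized keyword removed (illustrative shapes, not the actual source):

public void clearField(String fieldName) {             // thread A
    IndexFieldDataCache cache = fieldDataCaches.remove(fieldName);
    if (cache != null) {
        cache.clear();  // may run while thread B clears the same instance
    }
}

public void clear() {                                  // thread B
    for (IndexFieldDataCache cache : fieldDataCaches.values()) {
        cache.clear();  // the iteration itself also races with concurrent puts
    }
    fieldDataCaches.clear();
}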

@tinder-xli
Contributor Author

@jasontedor ok so you are saying any op on an instance of IndexFieldDataCache should be synchronized, correct? I'm wondering how the existing codebase handles that? What if one thread calls IndexFieldDataCache.clear() and another thread calls IndexFieldDataCache.clearField()?

@tinder-xli
Contributor Author

@jasontedor just to confirm -- if I add synchronized back for clearField(), it should be fine?

@tinder-fren

tinder-fren commented Nov 11, 2017

@tinder-xli the original code has the synchronized keyword on both clear and clearField, so they won't be executing at the same time.

@jasontedor can you help us understand:

  • At the end of the getForField method, it calls builder.build; some implementations of Builder (DocValuesIndexFieldData, for example) will fill the cache, which essentially modifies the cache and is not synchronized. Would it be a problem if one thread is doing clear while another is in getForField executing builder.build?

  • If the concern is cache modification, then in the current PR, adding synchronized to both clearField and clear would work, right? Or, at a lower granularity, synchronizing on the cache object when doing clear() seems good too; see the sketch after this list.
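The two shapes in that second point would look roughly like this (illustrative fragments; the per-cache lock in option 2 only helps if every mutation of a cache instance takes the same lock):

// Option 1: restore the method-level locks from the original code, so
// clear and clearField can never overlap.
public synchronized void clear() { /* ... */ }
public synchronized void clearField(String fieldName) { /* ... */ }

// Option 2: lower granularity -- serialize clears per cache instance
// without blocking lookups of unrelated fields.
public void clear() {
    for (IndexFieldDataCache cache : fieldDataCaches.values()) {
        synchronized (cache) {
            cache.clear();
        }
    }
}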

@s1monw
Contributor

s1monw commented Nov 13, 2017

The original reason why we have these sync blocks on the clear methods is point-in-time semantics. With a concurrent hashmap you always get a consistent view of the values in the cache, which might include additions that are happening after we fetched the values. We somehow need a way to synchronize the clearing of the map, and while I agree we might be able to do this, the addition of this concurrent map alone is not enough. The reason is that we have to guarantee that clear() is called on the IndexFieldDataCache, and if that is not happening we are leaking resources. This means we have to make sure that we see all IndexFieldDataCache instances when we call clear() at the IndexFieldDataService level. I mean, we can redesign this, but I would like to make sure this is actually the reason you are seeing slowdowns.
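A sketch of the leak being described (illustrative): with no lock shared between clearing and creation, an entry created during a concurrent clear() may never have its own clear() invoked.

// Thread A: clearing at the IndexFieldDataService level.
for (IndexFieldDataCache cache : fieldDataCaches.values()) {
    cache.clear();  // clears only the caches this iteration can see
}

// Thread B: meanwhile, a search creates a brand-new entry ...
fieldDataCaches.computeIfAbsent("user_number", s -> newIndexFieldDataCache(s));
// ... which thread A's weakly consistent iteration may never observe, so
// that cache's clear() is never called and its resources leak.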

@jasontedor
Member

jasontedor commented Nov 13, 2017

Apparently the existing impl works fine when my QPS is low, but as soon as I increase it a little, the GC issue starts to surface; e.g., when I increase my query QPS from 40 -> 60, the latency jumps from 70ms to 400ms, mostly due to super long GC.

@tinder-xli I'm not convinced this is the cause of the slowdown (I am not saying that you're wrong, I am saying that I need more convincing). The JFR output that you shared shows threads waiting on the lock for < 7ms on average. That is not great, but I think it's a leap from "threads are blocked for < 7ms" to "query performance degrades by 330ms". I understand you think it is related to GC, but there is not sufficient evidence here to conclude that (I would need more evidence to be convinced that threads holding onto references for an extra < 7ms on average would explain such a harsh degradation). Are you sure that there is not something else going on here (cache thrashing)? As you can see, getting this right is hard, so we really need to make sure we are targeting the right thing here.

@jasontedor
Member

the latency jumps from 70ms to 400ms

Can you clarify one thing here? How are you measuring this? Is this an average latency (:cry:) or a percentile latency and, if so, which percentile?

@tinder-xli
Contributor Author

tinder-xli commented Nov 13, 2017

@jasontedor that < 6ms block is per method invocation, and the total block time is > 30s while my total recording time is 60s, so threads are blocked more than 50% of the time.
Also, for each query, getForField is called for each hit, so if I have 1000 hits and each getForField gets blocked for 6ms, that's a big deal. And the latency I mentioned is avg latency.

@tinder-fren

tinder-fren commented Nov 13, 2017

@s1monw thanks for explaining. Clearing while mutating the hashmap causing resource leaks is a valid concern.

@tinder-xli Let's close this PR and open a new one with double-checked locking; that should address all the concerns raised above with minimal impact on the logic.

@tinder-xli
Contributor Author

created a new PR #27365

@tinder-xli tinder-xli closed this Nov 13, 2017
@jasontedor
Member

@tinder-xli As I mentioned, I agree that threads being blocked for 7ms is not great, but I'm still unconvinced it explains the change in latency here. That is, you don't need to convince me that threads waiting here for 7ms is bad; what I still need to be convinced of is that it explains a change in latency from 70ms to 400ms.

@tinder-xli
Contributor Author

@jasontedor let me try to explain it another way: that is 7ms per hit, not per request. The avg number of hits for my query is around 25k, thus even if only a small fraction of them get blocked, it can easily make the latency jump 5x.

@jasontedor
Member

@tinder-xli Okay, with hit sizes that large now you've convinced me. Thank you. I think the double-checked locking solution is a much better one.
