
use ConcurrentHashMap to avoid synchronized block #27350

Closed
tinder-xli wants to merge 4 commits into elastic:master from tinder-xli:fix-cache

Conversation

@tinder-xli
Contributor

During our perf tests against Elasticsearch, we noticed two synchronized blocks in the call stack of fetching doc values, which I think are unnecessary and cause a serious GC issue for us, especially when the "size" param is large and docvalue fields are fetched.

The getForField method in IndexFieldDataService.java has a synchronized block around the lookup in the fieldDataCaches map, which is unnecessary when cache != null. We suggest changing fieldDataCaches from a HashMap to a ConcurrentHashMap, so that no explicit synchronization logic is needed at all; a sketch follows below.
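Roughly, this is the shape of the change we have in mind (a simplified sketch only; the stand-in types and the newCache placeholder are illustrative, not the actual Elasticsearch source, where getForField also builds the IndexFieldData):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FieldDataCacheSketch {
    // Stand-in for the real IndexFieldDataCache type.
    interface IndexFieldDataCache { void clear(); }

    // Before: a plain HashMap forces every lookup through the service
    // monitor, even when the cache entry already exists.
    private final Map<String, IndexFieldDataCache> before = new HashMap<>();

    synchronized IndexFieldDataCache getForFieldBefore(String fieldName) {
        IndexFieldDataCache cache = before.get(fieldName);
        if (cache == null) {
            cache = newCache(fieldName);
            before.put(fieldName, cache);
        }
        return cache;
    }

    // After: ConcurrentHashMap reads are lock-free; computeIfAbsent only
    // locks internally while a missing entry is being created.
    private final ConcurrentHashMap<String, IndexFieldDataCache> after = new ConcurrentHashMap<>();

    IndexFieldDataCache getForFieldAfter(String fieldName) {
        return after.computeIfAbsent(fieldName, this::newCache);
    }

    // Placeholder for the real cache construction logic.
    private IndexFieldDataCache newCache(String fieldName) {
        return () -> { };
    }
}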
We see threads waiting on getForField method in our JFR recording.
[Screenshot: JFR recording (2017-11-10) showing threads blocked in getForField]

All Gradle tests passed locally.

For reference, below is a sample query we used for testing:

{
  "stored_fields": "_none_",
  "docvalue_fields": ["user_number"],
  "size": 33000,
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ]
    }
  }
}

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@tinder-xli
Contributor Author

@jasontedor

@dnhatn
Member

dnhatn commented Nov 11, 2017

@elasticmachine please test this.

@dnhatn dnhatn requested a review from jpountz November 11, 2017 00:59
        } else {
            throw new IllegalArgumentException("cache type not supported [" + cacheType + "] for field [" + fieldName + "]");
        }
        cache = fieldDataCaches.computeIfAbsent(fieldName,
            s -> {
@tinder-fren commented Nov 11, 2017

It might be cleaner, and allocate fewer Function objects, if you extract this compute block into a new method, say "newIndexFieldDataCache(fieldName)", and then just do fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache) here; see the sketch below.
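Roughly (a fragment, not the actual PR code; newIndexFieldDataCache is the hypothetical name suggested above, and its body is a placeholder for the original lambda's construction logic):

// Hypothetical extracted factory method; the body would contain whatever
// the original compute lambda did.
private IndexFieldDataCache newIndexFieldDataCache(String fieldName) {
    return buildCache(fieldName); // buildCache is a placeholder name
}

// The call site then reads as a single line:
cache = fieldDataCaches.computeIfAbsent(fieldName, this::newIndexFieldDataCache);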

@tinder-fren

@jpountz if you are worried about the correctness of cache.clear() etc., we can at least switch to double-checked locking, which would be performant as well (sketch below): https://en.m.wikipedia.org/wiki/Double-checked_locking#Usage_in_Java
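For reference, a minimal sketch of that shape (illustrative names; note the map must be a ConcurrentHashMap or otherwise safely published for the unlocked first read to be correct):

private final Map<String, IndexFieldDataCache> fieldDataCaches = new ConcurrentHashMap<>();

public IndexFieldDataCache getForField(String fieldName) {
    IndexFieldDataCache cache = fieldDataCaches.get(fieldName); // fast path: no lock
    if (cache == null) {
        synchronized (this) {
            cache = fieldDataCaches.get(fieldName); // re-check under the lock
            if (cache == null) {
                cache = newIndexFieldDataCache(fieldName);
                fieldDataCaches.put(fieldName, cache);
            }
        }
    }
    return cache;
}

Since clear() and clearField() would keep taking the same monitor, creation and clearing stay mutually exclusive, while cache hits skip the lock entirely.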

@jasontedor (Member) left a comment

Sorry, but this approach is not safe.

     }

-    public synchronized void clear() {
+    public void clear() {
@jasontedor (Member)

This is not thread-safe. The problem is that the underlying cache implementation that backs IndexFieldCache says this:

    /**
     * An LRU sequencing of the keys in the cache that supports removal. This sequence is not protected from mutations
     * to the cache (except for {@link Iterator#remove()}. The result of iteration under any other mutation is
     * undefined.
     *
     * @return an LRU-ordered {@link Iterable} over the keys in the cache
     */

and this method is invoked behind the scenes to invalidate all the keys.

@tinder-xli (Contributor, Author)

added synchronized back

@jasontedor
Member

Additionally, I don't understand what you mean by synchronized blocks causing GC issues for you; can you please explain that? Also, how long are your sample queries taking? Can you please show, for example, the took time (as the code exists today)?

@jasontedor (Member) left a comment

Sorry, but adding synchronized back on clear is not sufficient, as it still leaves the cache open to modification while a clear is taking place, which is exactly what is unsafe.

@tinder-xli
Contributor Author

my query is fetching a lot of docvalues, so getForField is invoked very often. However, since all my queries are fetching the same field from docvalues, theoretically only the very first query has to create a cache for that field, and all following queries don't have to create it again.
In the current impl, there is a synchronized block whether the cache is null or not, so all the search threads are blocked by it, resulting in lots of data held in memory, and GC gets triggered much more often. And since the threads are still holding references to the data (they just get blocked by this synchronized block), it is hard for GC to find anything available to clean, so the whole GC process can be really, really long.
Apparently the existing impl works fine when my QPS is low, but as soon as I increase it a little, the GC issue starts to surface; e.g., when I increase my query QPS from 40 -> 60, the latency jumps from 70ms to 400ms, mostly due to super long GC.

@tinder-xli
Contributor Author

@jasontedor so just to confirm, you are saying it is unsafe because when clear() is going on, some other thread may be calling getForField() to create it again?

@jasontedor
Member

Look for example at clearField. It too invokes clear on an index field data cache element. If this occurs concurrently with a call to IndexFieldDataService#clear, that can be a problem because both could call IndexFieldDataCache#clear concurrently on the same instance. In general, any concurrent modification to an instance of IndexFieldDataCache while IndexFieldDataCache#clear is running is unsafe.
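To make the race concrete, a simplified sketch of the two entry points with the synchronized keyword removed (illustrative shapes, not the actual source):

public void clearField(String fieldName) {             // thread A
    IndexFieldDataCache cache = fieldDataCaches.remove(fieldName);
    if (cache != null) {
        cache.clear();  // may run while thread B clears the same instance
    }
}

public void clear() {                                  // thread B
    for (IndexFieldDataCache cache : fieldDataCaches.values()) {
        cache.clear();  // the iteration itself also races with concurrent puts
    }
    fieldDataCaches.clear();
}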

@tinder-xli
Contributor Author

@jasontedor ok so you are saying any op on an instance of IndexFieldDataCache should be synchronized, correct? I'm wondering how the existing codebase handles that? What if one thread calls IndexFieldDataCache.clear() and another thread calls IndexFieldDataCache.clearField()?

@tinder-xli
Contributor Author

@jasontedor just to confirm -- if I add synchronized back for clearField(), it should be fine?

@tinder-fren

tinder-fren commented Nov 11, 2017

@tinder-xli the original code has the synchronized keyword on both clear and clearField, so they won't be executing at the same time.

@jasontedor can you help us understand:

  • At the end of the getForField method, it calls builder.build; some implementations of Builder (DocValuesIndexFieldData, for example) will fill the cache, which essentially modifies the cache and is not synchronized. Would it be a problem if one thread is doing clear while another is in getForField executing builder.build?

  • If the concern is cache modification, then in the current PR, adding synchronized to both clearField and clear would work, right? Or, at a lower granularity, synchronizing on the cache object when doing clear() seems good too; see the sketch after this list.
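The two shapes in that second point would look roughly like this (illustrative fragments; the per-cache lock in option 2 only helps if every mutation of a cache instance takes the same lock):

// Option 1: restore the method-level locks from the original code, so
// clear and clearField can never overlap.
public synchronized void clear() { /* ... */ }
public synchronized void clearField(String fieldName) { /* ... */ }

// Option 2: lower granularity -- serialize clears per cache instance
// without blocking lookups of unrelated fields.
public void clear() {
    for (IndexFieldDataCache cache : fieldDataCaches.values()) {
        synchronized (cache) {
            cache.clear();
        }
    }
}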

@s1monw
Contributor

s1monw commented Nov 13, 2017

The original reason why we have these sync blocks on the clear methods is point-in-time semantics. With a concurrent hashmap you always get a consistent view of the values in the cache, which might include additions that are happening after we fetched the values. We somehow need a way to synchronize the clearing of the map, and while I agree we might be able to do this, the addition of this concurrent map alone is not enough. The reason is that we have to guarantee that clear() is called on the IndexFieldDataCache, and if that is not happening we are leaking resources. This means we have to make sure that we see all IndexFieldDataCache instances when we call clear() at the IndexFieldDataService level. I mean, we can redesign this, but I would like to make sure this is actually the reason you are seeing slowdowns.
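A sketch of the leak being described (illustrative): with no lock shared between clearing and creation, an entry created during a concurrent clear() may never have its own clear() invoked.

// Thread A: clearing at the IndexFieldDataService level.
for (IndexFieldDataCache cache : fieldDataCaches.values()) {
    cache.clear();  // clears only the caches this iteration can see
}

// Thread B: meanwhile, a search creates a brand-new entry ...
fieldDataCaches.computeIfAbsent("user_number", s -> newIndexFieldDataCache(s));
// ... which thread A's weakly consistent iteration may never observe, so
// that cache's clear() is never called and its resources leak.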

@jasontedor
Member

jasontedor commented Nov 13, 2017

Apparently the existing impl works fine when my QPS is low, but as soon as I increase it a little, the GC issue starts to surface; e.g., when I increase my query QPS from 40 -> 60, the latency jumps from 70ms to 400ms, mostly due to super long GC.

@tinder-xli I'm not convinced this is the cause of the slowdown (I am not saying that you're wrong, I am saying that I need more convincing). The JFR output that you shared shows threads waiting on the lock for < 7ms on average. That is not great, but I think it's a leap from "threads are blocked for < 7ms" to "query performance degrades by 330ms". I understand you think it is related to GC, but there is not sufficient evidence here to conclude that (I would need more evidence to be convinced that threads holding onto references for an extra < 7ms on average would explain such a harsh degradation). Are you sure that there is not something else going on here (cache thrashing)? As you can see, getting this right is hard, so we really need to make sure we are targeting the right thing here.

@jasontedor
Member

the latency jumps from 70ms to 400ms

Can you clarify one thing here? How are you measuring this? Is this an average latency (:cry:) or a percentile latency and, if so, which percentile?

@tinder-xli
Contributor Author

tinder-xli commented Nov 13, 2017

@jasontedor that < 6ms block is per method invocation, and the total block time is > 30s while my total recording time is 60s, so threads are blocked more than 50% of the time.
Also, for each query, getForField is called for each hit, so if I have 1000 hits and each getForField gets blocked for 6ms, that's a big deal. And the latency I mentioned is avg latency.

@tinder-fren

tinder-fren commented Nov 13, 2017

@s1monw thanks for explaining. Clearing while mutating the hashmap causing resource leaks is a valid concern.

@tinder-xli Let's close this PR and open a new one with double-checked locking; that should address all the concerns raised above with minimal impact on the logic.

@tinder-xli
Contributor Author

created a new PR #27365

@tinder-xli tinder-xli closed this Nov 13, 2017
@jasontedor
Member

@tinder-xli As I mentioned, I agree that threads being blocked for 7ms is not great, but I'm still unconvinced it explains the change in latency here. That is, you don't need to convince me that threads waiting here for 7ms is bad; what I still need to be convinced of is that it explains a change in latency from 70ms to 400ms.

@tinder-xli
Contributor Author

@jasontedor let me try to explain it another way: that is 7ms per hit, not per request. The avg number of hits for my query is around 25k, thus even if only a small fraction of them get blocked, it can easily make the latency jump 5x.

@jasontedor
Member

@tinder-xli Okay, with hit sizes that large now you've convinced me. Thank you. I think the double-checked locking solution is a much better one.
