Make Decompressor release memory buffer #11987
Conversation
Too many shards. Need to make sure this doesn't cause a performance regression for normal use-cases.
FWIW, assigning the 0-length array just makes even more waste. Keeping the logic that uses ArrayUtil.grow to oversize the arrays, even when they won't be reused, just adds more waste. Better to assign null and create an array of the correct size if it won't be reused.
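For illustration only, a minimal, hypothetical sketch contrasting the two buffer policies described above (not the actual patch; the class and method names are made up):

```java
import org.apache.lucene.util.ArrayUtil;

// Hypothetical sketch: keep and oversize a buffer only when it will actually be
// reused; otherwise hold nothing and allocate the exact size per call, so the
// memory is immediately collectible.
class BufferPolicySketch {
  private byte[] reusable = new byte[0]; // only worth keeping on the reuse path

  byte[] forReuse(int needed) {
    if (reusable.length < needed) {
      // ArrayUtil.grow oversizes on purpose, amortizing repeated growth
      reusable = ArrayUtil.grow(reusable, needed);
    }
    return reusable;
  }

  static byte[] oneShot(int needed) {
    // no field involved: exact-size allocation, garbage right after use
    return new byte[needed];
  }
}
```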
@rmuir Thanks for replying to this issue. I did some benchmarks:
LGTM, I assigned null in commit 0dfb7c
I ran the benchmark; it shows little regression, as follows: runStoredFieldsBenchmark.py
localrun.py -source wikimedium1m:
Thanks for running the stored fields benchmark: are you able to report the retrieval time as well? That's my first concern. Maybe StoredFieldsBenchmark.java needs to be run standalone to report it; here is the relevant code: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/StoredFieldsBenchmark.java#L89-L101 My other concern would be whether we create too much pressure on GC for unoptimized merges. The StoredFieldsBenchmark uses geonames and does not delete/update documents, so it would never exercise this path much. In all cases when running the benchmark, we may want to explicitly supply a smaller heap (-Xmx), since the dataset is not very big and otherwise the JVM may allocate a huge heap, dodging any GC impacts that we want to see. Thank you again for benchmarking; if you run into trouble I can try to help run these benchmarks too.
runStoredFieldsBenchmark.py
I will do this benchmark.
We have run into a situation: an ES node with max_shards_per_node at its default of 1000, and normally 40 segments per shard, with each buffer retaining about 100KB of heap, so this would use 4G of resident heap memory.
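For reference, the arithmetic behind that figure, using the numbers from the comment above:

```java
// 1000 shards/node * 40 segments/shard * ~100 KB retained per decompression buffer
long retained = 1000L * 40 * 100 * 1024; // = 4,096,000,000 bytes ≈ 3.8 GiB, i.e. the "4G" above
```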
Thanks, yeah, my remaining concern is the non-optimized merge... especially for those that delete and update documents (as it prevents them from getting optimized merges). An alternative solution to this issue might be, instead of removing the reuse from Decompressor, to try removing the stored fields/term vectors CloseableThreadLocals from SegmentCoreReaders... this is more difficult, as we'd have to change APIs around IndexReader to no longer call …. It might alleviate the pressure, while still allowing merge to reuse stuff efficiently and queries to reuse stuff efficiently when pulling their top N, but it would require bigger changes. cc @jpountz
Yes, before we work around all that stuff here, I'd also suggest removing those ThreadLocals.
@rmuir I just modified https://github.com/mikemccand/luceneutil/blob/master/src/python/runStoredFieldsBenchmark.py#L43 and did 4 different runStoredFieldsBenchmark runs; the following tables show little performance regression: runStoredFieldsBenchmark.py enableBulkMerge=false
runStoredFieldsBenchmark.py enableBulkMerge=false -Xmx1g
runStoredFieldsBenchmark.py enableBulkMerge=false -Xmx512m
runStoredFieldsBenchmark.py enableBulkMerge=false -Xmx256m
Thanks for running. Somehow I think bulk merge didn't get disabled. Without the bulk merge optimization, indexing time should be significantly higher; the benchmark should be very, very slow.
@rmuir I am also curious about it. I manually modified lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#BULK_MERGE_ENABLED to false; with enableBulkMerge=false -Xmx256m:
I checked that the thread runs into …. BTW, here is my runStoredFieldsBenchmark:
@uschindler I think this issue is just about the GC path of the ThreadLocals. BUT, for instance in ES, when there is a 1000-shard node with normally 40 segments per shard, each opened segment would allocate one buffer with a retained heap of ~100KB, so this would use 4G of resident heap memory, even though some segments are rarely used.
You can use fewer shards as a workaround for now.
In fact there is no situation where thousands of shards make sense on a single node. That's bad design.
I will investigate the document API and try to make a proposal so that the threadlocal is no longer needed. I'm really concerned about the merge case here causing regressions for folks that delete/update documents, so I'll also try to run the benchmark myself for the unoptimized merge case. It requires a bit of time, as the document API will impact 100% of Lucene users, so we need to get it right.
@rmuir I have another proposal: every time we call decompress and wrap the decompressed BytesRef as a ByteArrayDataInput, decompress can build a list of ByteBuffers and just copy the dictionary each time; simple code like the following:
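The snippet from that comment was not captured above; a hypothetical sketch of the idea might look like the following (the real LZ4 decoding is replaced by a plain copy and all names are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decompress into fresh per-call ByteBuffers, copying only
// the small preset dictionary each time, instead of retaining a reusable byte[]
// in the Decompressor for the lifetime of the reader.
class PerCallDecompressSketch {
  static List<ByteBuffer> decompress(byte[] compressed, byte[] presetDict, int blockSize) {
    List<ByteBuffer> blocks = new ArrayList<>();
    for (int offset = 0; offset < compressed.length; offset += blockSize) {
      int len = Math.min(blockSize, compressed.length - offset);
      ByteBuffer block = ByteBuffer.allocate(presetDict.length + len);
      block.put(presetDict);              // dictionary copied per block
      block.put(compressed, offset, len); // stand-in for the real LZ4 decode
      block.flip();
      blocks.add(block);                  // nothing here outlives the caller
    }
    return blocks;
  }
}
```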
I had opened a very similar PR to this one at #137 which handled the merge case. |
I think I had not merged it because the follow-up discussion about removing thread locals had triggered naming/API concerns, but it should be a good incremental step, and we could figure out a way to remove threadlocals in Lucene 10 since it will require API changes?
I don't think so; I tend to lean towards Uwe's thoughts here. I feel like this is "rush in another fix for 10,000 shards". We can investigate removing the threadlocals in Lucene 10. Very possibly the change could be backported with deprecations for …
Sounds good.
It's just gonna take me some time; I can't get something out there like today. For example, nearly 100% of tests would be impacted :) It is fair, I will feel the same pain the users will. But it is some motivation to take care with the API and spend some time to do it well.
ThreadLocals just scale up the StoredFieldsReader's heap usage, BUT even one instance per segment with only 10K segments would use 1G of heap memory in …
The idea is not to have one instance per segment. There would be zero instances. When you want to retrieve docs from an IndexReader, the user would call .getFieldsReader() or similar to create one. It would be eligible for GC afterwards. But it means we can reuse buffers per-search and per-merge rather than creating per-document garbage. Basically it would work like every other part of Lucene.
And one idea I have is to try to prototype with the term vectors first (since both stored fields and term vectors have per-segment threadlocals that I'd like to remove). It is just fewer tests to fix, but still gives us a chance to look at the idea.
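Purely as an illustration of that pattern, a hypothetical sketch (getFieldsReader() does not exist today; it stands in for the proposed accessor):

```java
import java.io.IOException;
import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.StoredFieldVisitor;

// Hypothetical caller-managed pattern: create a short-lived reader, reuse it
// (and its internal buffers) across the docs being retrieved, then drop it.
class CallerManagedReadSketch {
  static void fetchTopDocs(LeafReader leaf, int[] topDocIDs, StoredFieldVisitor visitor)
      throws IOException {
    StoredFieldsReader fields = getFieldsReader(leaf); // assumed accessor, not a real API
    for (int docID : topDocIDs) {
      fields.visitDocument(docID, visitor); // buffers reused across these calls
    }
    // once this method returns, the reader and its buffers are eligible for GC
  }

  static StoredFieldsReader getFieldsReader(LeafReader leaf) {
    throw new UnsupportedOperationException("placeholder for the proposed API");
  }
}
```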
Hi, of course if one has thread pools with 10,000 threads like Solr, this may still be a resource problem, but that's not Lucene's fault. So my proposal would be: use a single ThreadLocal (non-closeable) for decompression buffers in a static final field of the Decompressor class.
The threadlocals (closeable or not, I want them out of here) only exist because the API is dumb and stateless: hence a threadlocal is used to work around a problematic API. I propose we fix the API; then no threadlocal is needed anymore.
I think you are confused. There is much more than just this buffer underneath the threadlocal today.
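A minimal sketch of that proposal, with illustrative names (not the actual Decompressor code):

```java
// Hypothetical sketch: one static, non-closeable ThreadLocal holding the
// decompression buffer, shared by all readers, instead of one CloseableThreadLocal
// per open segment. At most one buffer is retained per live thread.
final class SharedDecompressionBuffer {
  private static final ThreadLocal<byte[]> BUFFER = ThreadLocal.withInitial(() -> new byte[0]);

  static byte[] get(int minLength) {
    byte[] buf = BUFFER.get();
    if (buf.length < minLength) {
      buf = new byte[minLength]; // grow lazily; callers keep no reference of their own
      BUFFER.set(buf);
    }
    return buf;
  }

  private SharedDecompressionBuffer() {}
}
```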
* Move the buffer lifecycle into reader block state
@rmuir That is absolutely right!! How about decoupling the buffer from the Decompressor, so that whichever instance wants to reuse the buffer per-merge or for retrieving docs can hold it? In commit 2f676e6, I moved the buffer into BytesRef and let the BytesRef decide the buffer's lifecycle. So when there is a merge, the byte buffer would be reused, like this code: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java#L527-L532 and benchmarks show almost no performance regression:
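For context, a rough, hypothetical sketch of the "buffer carried by the BytesRef" idea described in that comment (with a plain copy standing in for the real LZ4 decoding; this is the approach questioned in the next reply):

```java
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: the reader holds no buffer field; the caller-owned BytesRef
// carries the bytes, so a merge can keep reusing one BytesRef across documents
// while a one-off retrieval lets its BytesRef be collected immediately.
class BytesRefOwnedBufferSketch {
  static void decompressInto(BytesRef target, byte[] compressed) {
    if (target.bytes.length < compressed.length) {
      target.bytes = new byte[compressed.length]; // grow only the caller-owned buffer
    }
    System.arraycopy(compressed, 0, target.bytes, 0, compressed.length); // stand-in for LZ4
    target.offset = 0;
    target.length = compressed.length;
  }
}
```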
Sorry, we definitely should not be adding arrays to BytesRef here. Like I said, we can just remove the threadlocal. The issue to me has nothing to do with buffers. It has to do with allowing the codec to maintain state across multiple calls to …. In the case of this codec, some of that state happens to be a buffer which annoys you. But that isn't the only state which is possible (e.g. file pointer locations etc). Basically the …
@rmuir Thanks for your reply! I am looking forward to your no-threadlocal design. I have no questions about it; I'll close this issue.
Description
We have an ES cluster (31G heap, 96G RAM, 30 instance nodes) with many shards per node (4000 per node). When the nodes handle many bulk and search requests concurrently, we see the JVM reach high memory usage and fail to release the memory, even with frequent GC and after stopping all write/search requests. We have to restart the node to recover the heap, as the following GC metrics show:
We dumped the heap, which shows that CompressingStoredFieldsReader occupied 70% of the heap. The path to GC roots for all these readers shows the following (maybe in a search or write thread):
Root cause
I think the root cause is that these threadlocals hold the referent, because SegmentReader#getFieldsReader calls the following code, and Elasticsearch always uses fixed thread pools and never calls CloseableThreadLocal#purge.
We have searched related issues like LUCENE-9959 and LUCENE-10419, but there is no answer for this problem.
I compared different JVM heaps and different Lucene versions, and I think the root cause is that LZ4WithPresetDictDecompressor allocates and initializes a buffer in the class. When the Elasticsearch instance does Stored-Fields-Read operations, it allocates JVM heap but never releases it, because the ES currentEngineReference keeps the reference.
Proposal
I think we can release this buffer memory when the decompression is done. It shows that the JVM can hold more segment readers in the heap.
When this buffer memory can be released, the heap metrics show the following: