When searching across multiple segments, one doesn't need to wait until the first segment is done collecting before starting the I/O for terms dictionary lookups in the next segment. However, doing so introduces a risk: the search on the first segment may need to visit so much data that it in turn evicts the data we had prefetched for the second segment before we start searching it. So we need some way to control the amount of inter-segment I/O concurrency that we allow.

I went for a threshold on the sum of the `maxDoc` of the segments for which we do I/O concurrently, the reasoning being that you can search many small segments concurrently since they won't load much into the page cache anyway, but you need to be more careful with larger segments. This heuristic is not perfect, as it only looks at what happens in a single thread and only considers `maxDoc` rather than e.g. the on-disk size of the data, but I would still expect it to work well enough in practice. I opted for a conservative default value of 1,000,000. Said otherwise, Lucene will do (part of the) I/O concurrently for as many segments as possible whose sum of `maxDoc` doesn't exceed 1,000,000.

We should do the same for collectors, but we cannot do it at the moment because we have a number of implementations that expect a segment to be fully collected before `Collector#getLeafCollector` is called on the next segment. So I am leaving it for a follow-up change.
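To make the heuristic concrete, here is a minimal sketch of the budgeting logic described above, assuming the decision boils down to "how long a prefix of the remaining segments can we start I/O for concurrently". The class and method names are hypothetical, not the actual Lucene implementation, and I'm assuming the first segment is always allowed so that search can make progress even when a single segment exceeds the threshold on its own:

```java
import java.util.List;

public class PrefetchBudget {

    /**
     * Hypothetical sketch of the maxDoc-sum heuristic: returns the number of
     * leading segments whose I/O may be started concurrently, i.e. the longest
     * prefix whose summed maxDoc stays within the threshold. The first segment
     * is always counted (an assumption), since it must be searched regardless.
     */
    static int segmentsToPrefetch(List<Integer> maxDocs, long threshold) {
        long sum = 0;
        int count = 0;
        for (int maxDoc : maxDocs) {
            sum += maxDoc;
            // Stop once the running total exceeds the budget, but never
            // refuse the very first segment.
            if (count > 0 && sum > threshold) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Many small segments: all three fit under the 1,000,000 budget.
        System.out.println(segmentsToPrefetch(List.of(100_000, 200_000, 300_000), 1_000_000)); // 3
        // A large second segment pushes the sum past the budget, so only the
        // first segment's I/O is started eagerly.
        System.out.println(segmentsToPrefetch(List.of(900_000, 500_000), 1_000_000)); // 1
    }
}
```

This matches the intuition in the description: many small segments can have their terms dictionary I/O overlapped cheaply, while a large segment consumes the whole budget by itself.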