
Inter-segment I/O concurrency. #13509

Open
wants to merge 1 commit into
base: main

jpountz
Contributor

@jpountz jpountz commented Jun 20, 2024

When searching across multiple segments, there is no need to wait until the first segment is done collecting before starting the I/O for terms dictionary lookups in the next segment. However, doing so introduces a risk: the search on the first segment may visit so much data that it in turn evicts the data that we had prefetched for the second segment before we start searching that second segment. So we need some way to control the amount of inter-segment I/O concurrency that we allow. I went for a threshold on the sum of `maxDoc` of the segments for which we do I/O concurrently, the reasoning being that you can search many small segments concurrently since they won't load much into the page cache anyway, but you need to be more careful with larger segments. This heuristic is not perfect, as it only looks at what happens in a single thread and only looks at `maxDoc` rather than e.g. the on-disk size of the data, but I would still expect it to work well enough in practice. I opted for a conservative default value of 1,000,000. In other words, Lucene will do (part of) the I/O concurrently for as many segments as possible whose sum of `maxDoc` doesn't exceed 1,000,000.
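
To make the heuristic concrete, here is a minimal sketch of one way to read it: walk the segments in order and greedily group consecutive ones while the running sum of `maxDoc` stays at or below the threshold. This is not the code from this PR. The class name `InterSegmentIoBatches`, the method `group`, and the choice to let a segment that is larger than the threshold form a group of its own are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not the implementation in this PR. */
final class InterSegmentIoBatches {

  /** Default threshold quoted in the description above. */
  static final long DEFAULT_MAX_DOC_SUM = 1_000_000L;

  /**
   * Greedily groups consecutive segments (identified by their index in the input array)
   * so that the sum of maxDoc within each group does not exceed the threshold.
   * Assumption: a single segment larger than the threshold still gets a group of its own.
   */
  static List<List<Integer>> group(int[] segmentMaxDocs, long maxDocSumThreshold) {
    List<List<Integer>> batches = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long sum = 0; // long, so that summing many large segments cannot overflow
    for (int i = 0; i < segmentMaxDocs.length; i++) {
      int maxDoc = segmentMaxDocs[i];
      // Close the current group if adding this segment would cross the threshold.
      if (!current.isEmpty() && sum + maxDoc > maxDocSumThreshold) {
        batches.add(current);
        current = new ArrayList<>();
        sum = 0;
      }
      current.add(i);
      sum += maxDoc;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // One large segment followed by several small ones.
    int[] maxDocs = {5_000_000, 400_000, 300_000, 200_000, 200_000, 50_000};
    System.out.println(group(maxDocs, DEFAULT_MAX_DOC_SUM));
    // prints [[0], [1, 2, 3], [4, 5]]
  }
}
```

With these made-up sizes, the large segment is handled alone, while the small segments are grouped so that their I/O could overlap. A real implementation could just as well use a sliding window over segments instead of fixed groups; the description does not say which.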

We should do the same for collectors, but we cannot do it at the moment because a number of implementations expect a segment to be fully collected before `Collector#getLeafCollector` is called on the next segment. So I am leaving it for a follow-up change.

@jpountz jpountz added this to the 10.0.0 milestone Jun 20, 2024

github-actions bot commented Jul 5, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 5, 2024