Inter-segment I/O concurrency. #13509

jpountz · 2024-06-20T10:57:13Z

When searching across multiple segments, there is no need to wait until the first segment is done collecting before starting the I/O for terms dictionary lookups in the next segment. However, doing so introduces a risk: the search on the first segment may visit so much data that it in turn evicts the data we had prefetched for the second segment before we start searching that segment. So we need some way to control how much inter-segment I/O concurrency we allow. I went for a threshold on the sum of `maxDoc` across the segments for which we do I/O concurrently, the reasoning being that you can search many small segments concurrently since they won't load much into the page cache anyway, but you need to be more careful with larger segments. This heuristic is not perfect, as it only looks at what happens in a single thread and only looks at `maxDoc` rather than, e.g., the on-disk size of the data, but I would still expect it to work well enough in practice. I opted for a conservative default value of 1,000,000. In other words, Lucene will do (part of) the I/O concurrently for as many segments as possible whose `maxDoc` values sum to at most 1,000,000.
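To make the heuristic concrete, here is a minimal sketch of how segments could be grouped under a `maxDoc` budget. It is not the code from this PR: the class `PrefetchWindowSketch` and its methods are hypothetical names, it uses plain arrays instead of Lucene's leaf contexts, and it forms fixed batches where the actual change may well advance a window incrementally. Letting a segment larger than the budget form a group of its own is also an assumption, made so that the loop always makes progress.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: groups segments under a maxDoc budget. */
final class PrefetchWindowSketch {

  /** Default budget, taken from the value quoted in the description. */
  static final long DEFAULT_MAX_DOC_BUDGET = 1_000_000L;

  /**
   * Returns the exclusive end index of the group of consecutive segments starting at
   * {@code from} whose maxDoc values sum to at most {@code budget}.
   */
  static int windowEnd(int[] maxDocs, int from, long budget) {
    long sum = 0;
    int end = from;
    while (end < maxDocs.length && sum + maxDocs[end] <= budget) {
      sum += maxDocs[end];
      end++;
    }
    // Assumption: a single segment larger than the budget is handled on its own,
    // otherwise the caller would never advance past it.
    if (end == from && from < maxDocs.length) {
      end = from + 1;
    }
    return end;
  }

  /** Splits all segments into consecutive [from, end) groups that respect the budget. */
  static List<int[]> windows(int[] maxDocs, long budget) {
    List<int[]> result = new ArrayList<>();
    int from = 0;
    while (from < maxDocs.length) {
      int end = windowEnd(maxDocs, from, budget);
      result.add(new int[] {from, end});
      from = end;
    }
    return result;
  }
}
```

With segments of 200,000, 300,000, 400,000, 600,000, 2,000,000 and 50,000 documents and the default budget, the first three segments (900,000 documents in total) form one group, and each of the remaining three segments ends up in a group of its own.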

We should do the same for collectors, but we cannot do it at the moment because we have a number of implementations that expect a segment to be fully collected before `Collector#getLeafCollector` is called on the next segment. So I am leaving it for a follow-up change.


github-actions · 2024-07-05T00:18:54Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

jpountz added this to the 10.0.0 milestone Jun 20, 2024

github-actions bot added the Stale label Jul 5, 2024
