@Swiddis Swiddis commented Sep 22, 2025

Description

In local benchmarking of merge operations, I saw that we were spending a lot of time waiting on synchronous batch fetches across both indices.

Because of the PIT-based design we can't parallelize page fetches directly, but one low-hanging improvement is to start fetching the next batch as soon as we receive the current one, so that by the time we need the next batch it's already partially loaded. This cuts enumerated merge times by ~40%.
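The pipelining idea can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code; the batch supplier and driver loop are hypothetical stand-ins for the real PIT page fetch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class PrefetchDemo {
    // Drain all batches, launching the next fetch before processing the
    // current batch. fetchBatch returns null once the source is exhausted.
    static <T> List<T> consumeAll(Supplier<T> fetchBatch, ExecutorService pool) {
        List<T> results = new ArrayList<>();
        CompletableFuture<T> next = CompletableFuture.supplyAsync(fetchBatch, pool);
        while (true) {
            T batch = next.join();                                  // wait for the in-flight fetch
            if (batch == null) break;                               // no more batches
            next = CompletableFuture.supplyAsync(fetchBatch, pool); // prefetch the next one
            results.add(batch);                                     // "process" the current batch
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        int[] counter = {0};
        Supplier<Integer> fakeFetch = () -> counter[0] < 3 ? ++counter[0] : null;
        System.out.println(consumeAll(fakeFetch, pool)); // prints [1, 2, 3]
        pool.shutdown();
    }
}
```

Because PIT pages must be requested in order, only one fetch is ever in flight; the win comes from overlapping that fetch with processing of the batch already in hand.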

To implement this safely, this PR needs to do a few things:

  • Register a new thread pool that has authentication context (we can't run background threads without this)
    • See the SQLPlugin.java changes. I also fixed our thread configuration settings.
    • We need a separate pool because we'd hang the worker pool if it has only one thread.
  • Safely handle whether or not we have a NodeClient within the Calcite enumeration inner loop
    • This is the interface change in OpenSearchClient.java; I made several plumbing changes around that update.
  • Actually implement the background scanner, with a fallback to synchronous scanning if we're missing node context (BackgroundSearchScanner.java)
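A rough sketch of the fallback behavior from the last bullet. The class shape and names here are hypothetical, not the actual BackgroundSearchScanner.java; the point is just that when no dedicated pool (i.e. no node context) is available, the scanner degrades to a plain synchronous fetch:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical sketch: prefetch pages on a dedicated pool when one is
// available, otherwise fetch synchronously on the calling thread.
public class FallbackScanner<T> {
    private final Supplier<T> fetch;       // one synchronous page fetch
    private final ExecutorService pool;    // null when we lack node context
    private CompletableFuture<T> inFlight; // the prefetched page, if any

    public FallbackScanner(Supplier<T> fetch, ExecutorService pool) {
        this.fetch = fetch;
        this.pool = pool;
    }

    public T nextBatch() {
        if (pool == null) {
            return fetch.get(); // synchronous fallback
        }
        if (inFlight == null) { // first call: nothing prefetched yet
            inFlight = CompletableFuture.supplyAsync(fetch, pool);
        }
        T batch = inFlight.join();
        inFlight = CompletableFuture.supplyAsync(fetch, pool); // prefetch next page
        return batch;
    }
}
```

Note that this sketch leaves one speculative fetch in flight after the consumer stops; a real implementation would need to cancel or drain it on close.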

Some alternatives for the long term:

In draft pending testing.

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Swiddis Swiddis added the enhancement (New feature or request) label Sep 22, 2025

Swiddis commented Sep 23, 2025

Security IT failures are confusing me here -- they're all failing consistently, but the changed code doesn't show up in any of the stack traces.


Swiddis commented Sep 25, 2025

Some additional testing info:

I took 5 million records from the big5 benchmarking dataset and compared current mainline against this change.

First, as sanity, the results are the same for one of the queries requiring a full index enumeration:

source = big5
| eval range_bucket = case(
   `metrics.size` < -10, 'range_1',
   `metrics.size` >= -10 and `metrics.size` < 10, 'range_2',
   `metrics.size` >= 10 and `metrics.size` < 100, 'range_3',
   `metrics.size` >= 100 and `metrics.size` < 1000, 'range_4',
   `metrics.size` >= 1000 and `metrics.size` < 2000, 'range_5',
   `metrics.size` >= 2000, 'range_6')
| stats count() by range_bucket, span(`@timestamp`, 1h) as auto_span
| sort + range_bucket, + auto_span

Current mainline:

fetched rows / total rows = 48/48
+---------+---------------------+--------------+
| count() | auto_span           | range_bucket |
|---------+---------------------+--------------|
| 122464  | 2022-12-31 16:00:00 | range_5      |
| 121585  | 2022-12-31 17:00:00 | range_5      |
| 122052  | 2022-12-31 18:00:00 | range_5      |
| 122220  | 2022-12-31 19:00:00 | range_5      |
| 122163  | 2022-12-31 20:00:00 | range_5      |
| 121840  | 2022-12-31 21:00:00 | range_5      |
| 121606  | 2022-12-31 22:00:00 | range_5      |
| 121889  | 2022-12-31 23:00:00 | range_5      |
| 121088  | 2023-01-01 00:00:00 | range_5      |
| 121943  | 2023-01-01 01:00:00 | range_5      |

After update:

fetched rows / total rows = 48/48
+---------+---------------------+--------------+
| count() | auto_span           | range_bucket |
|---------+---------------------+--------------|
| 122464  | 2022-12-31 16:00:00 | range_5      |
| 121585  | 2022-12-31 17:00:00 | range_5      |
| 122052  | 2022-12-31 18:00:00 | range_5      |
| 122220  | 2022-12-31 19:00:00 | range_5      |
| 122163  | 2022-12-31 20:00:00 | range_5      |
| 121840  | 2022-12-31 21:00:00 | range_5      |
| 121606  | 2022-12-31 22:00:00 | range_5      |
| 121889  | 2022-12-31 23:00:00 | range_5      |
| 121088  | 2023-01-01 00:00:00 | range_5      |
| 121943  | 2023-01-01 01:00:00 | range_5      |

Second, I wanted to benchmark and check for impact. I had already tested with joins (~40% faster), but for non-joins we potentially pay overhead for nothing.

For the slowest big5 queries (BG fetches on the left, sync fetches on the right), we see slight perf gains:
[image: benchmark comparison for the slowest big5 queries]

For the fastest ones, performance is approximately the same (some minor latency and throughput diffs, but I'm not confident these aren't just random variation):
[image: benchmark comparison for the fastest big5 queries]

Signed-off-by: Simeon Widdis <[email protected]>
@Swiddis Swiddis removed the v3.3.0 label Sep 30, 2025
@Swiddis Swiddis added the calcite (calcite migration related) label Oct 3, 2025

Swiddis commented Oct 7, 2025

It turns out I flipped the benchmark in my head, so this is an overall regression -- going to put it back in draft and figure out a better approach.

@Swiddis Swiddis marked this pull request as draft October 7, 2025 23:42
@Swiddis Swiddis marked this pull request as ready for review November 14, 2025 21:58
@opensearch-trigger-bot

This PR is stalled because it has been open for 2 weeks with no activity.

@Swiddis Swiddis merged commit d28c226 into opensearch-project:main Dec 1, 2025
38 checks passed
@Swiddis Swiddis deleted the feature/pre-fetch-batches-in-enumeration branch December 1, 2025 18:46
@opensearch-trigger-bot

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4345-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d28c226140f1b98db4ed8e8d76d6451f2072273f
# Push it to GitHub
git push --set-upstream origin backport/backport-4345-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4345-to-2.19-dev.

@LantaoJin

Turns out I flipped the benchmark in my head, so this is overall a regression -- going to put back in draft and figure out a better approach

@Swiddis, do you mean the current implementation has a performance regression? If so, why was the PR merged? If there is no regression, please backport it to 2.19-dev, since the backport of #4884 is blocked on this one.


Swiddis commented Dec 8, 2025

I couldn't get the regression to reproduce reliably, and in further tests the regression was smaller than the gains, so I wanted to see what the diff was in the OSB benchmarks. I don't see any benchmark diff since merging.

I wasn't planning to backport this originally since it's still largely experimental, but I can open the PR at least.

Swiddis added a commit to Swiddis/sql that referenced this pull request Dec 8, 2025
@LantaoJin LantaoJin added backport-manually Filed a PR to backport manually. and removed stalled labels Dec 10, 2025