Use _doc + _shard_doc as sort tiebreaker to get better performance
#4569
Signed-off-by: Lantao Jin <[email protected]>
// Workaround to preserve sort location more exactly,
// see https://github.com/opensearch-project/sql/pull/3061
this.sourceBuilder.sort(METADATA_FIELD_ID, ASC);
this.sourceBuilder.sort(SortBuilders.shardDocSort());
Does it matter if we duplicate fields in the sorting list? We could simplify or remove the else logic below by always appending this. I would expect Lucene to optimize it in the background, but I haven't measured it.
I'm not sure whether Lucene optimizes duplicated fields or `_doc` in sorting, but a duplicated `_shard_doc` is definitely not allowed in OpenSearch core. Keeping a strict check here does no harm.
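The constraint discussed in this thread (append the tiebreaker, but never add `_shard_doc` twice) can be sketched in plain Java. This is a minimal illustration, not the PR's actual code: the class `PitSortKeys`, the method `withTiebreaker`, and the use of strings in place of real sort builders are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sort keys are modeled as plain strings, not OpenSearch SortBuilder objects.
final class PitSortKeys {
    static final String DOC = "_doc";
    static final String SHARD_DOC = "_shard_doc";

    /**
     * Returns the user's sort keys followed by the tiebreaker keys.
     * If _shard_doc is already present, the list is returned unchanged,
     * so the tiebreaker can never be duplicated.
     */
    static List<String> withTiebreaker(List<String> userSortKeys) {
        List<String> keys = new ArrayList<>(userSortKeys);
        if (keys.contains(SHARD_DOC)) {
            return keys; // ordering is already total; nothing to append
        }
        if (!keys.contains(DOC)) {
            keys.add(DOC); // cheap tiebreaker first
        }
        keys.add(SHARD_DOC); // expensive tiebreaker last
        return keys;
    }
}
```

Whether the guard on `_doc` is needed is exactly the open question in the comment above; the sketch keeps it only to stay on the safe side.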
Can you share the performance benchmark for the 3 approaches?
I haven't run the benchmark; the RCA was made by reading the code of Lucene and OS. The performance of [...] For case 2, the current [...] I will rerun some benchmarks to double-check.
commit 3388dc7
Author: Lantao Jin <[email protected]>
Date: Thu Oct 16 01:45:29 2025 +0800
Use `_doc` + `_shard_doc` as sort tiebreaker to get better performance (opensearch-project#4569)
* Use _shard_doc as sort tiebreaker
* _doc as a part of tie-breaker have better performance
Signed-off-by: Lantao Jin <[email protected]>
Description
Before #4378, the sort in PIT search was:
case 1: if no sort field is specified, sort by `_doc` + `_id` (+ means "then"). (❎ could cause high memory usage)
case 2: if sort fields are specified, sort by `fields`. (❎ paged results could miss or duplicate hits)
case 3: if sort fields are specified and the query contains a filter, sort by `_doc`. (❎ paged results could miss or duplicate hits)

#4378 added `_shard_doc` as the sort tiebreaker:
case 1: if no sort field is specified, sort by `_shard_doc`. (❎ performance regression)
case 2: if sort fields are specified, sort by `fields` + `_shard_doc`. (❎ lower performance on low-cardinality fields)

#4435 found the performance regression in case 1 and partially reverted the changes to:
case 1: if no sort field is specified, sort by `_doc` + `_id`. (❎ could cause high memory usage)
case 2: if sort fields are specified, sort by `fields`. (❎ paged results could miss or duplicate hits)

After this PR, the sort in PIT search becomes:
case 1: if no sort field is specified, sort by `_doc` + `_shard_doc`. ✅
case 2: if sort fields are specified, sort by `fields` + `_doc` + `_shard_doc`. ✅

RCA of performance regression:
`_shard_doc` is not a stored field in the index; it is generated at runtime during comparison. Computing `_shard_doc` per document is a high-cost operation. Sorting by `_doc` then `_shard_doc`, however, generates `_shard_doc` only when the `_doc` values tie.

Even when the user specifies sort fields, we should sort by `fields` then `_doc` then `_shard_doc` to reduce how often `_shard_doc` is computed. For example, if the sort field is a low-cardinality field, e.g. `gender`, sorting by `gender` then `_doc` then `_shard_doc` generates `_shard_doc` for comparison only if the values of both `gender` and `_doc` tie.

This PR does not need to be backported to 2.19-dev since the `_shard_doc` feature is only available since OS 3.3.0.

Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
`--signoff` or `-s`.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
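
The RCA in the Description rests on one idea: a chained comparison only evaluates the next sort key when all earlier keys tie. The toy model below illustrates that with plain Java comparators. It is a sketch under stated assumptions, not Lucene or OpenSearch code: the `Hit` class, the counter, and the way `shardDoc` packs a shard number and a doc id into one `long` are all hypothetical and exist only for this example.

```java
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: none of these types exist in OpenSearch or Lucene.
final class LazyTiebreakerDemo {
    /** Hypothetical stand-in for a search hit. */
    static final class Hit {
        final String gender;
        final int shard;
        final int doc;

        Hit(String gender, int shard, int doc) {
            this.gender = gender;
            this.shard = shard;
            this.doc = doc;
        }
    }

    /** Counts how often the simulated _shard_doc value is computed. */
    static final AtomicInteger SHARD_DOC_COMPUTATIONS = new AtomicInteger();

    /** Simulated _shard_doc; the bit layout is an assumption made for illustration. */
    static long shardDoc(Hit h) {
        SHARD_DOC_COMPUTATIONS.incrementAndGet();
        return ((long) h.shard << 32) | (h.doc & 0xFFFFFFFFL);
    }

    /** fields + _shard_doc: the tiebreaker runs whenever the field values tie. */
    static final Comparator<Hit> FIELDS_THEN_SHARD_DOC =
        Comparator.comparing((Hit h) -> h.gender)
            .thenComparingLong(LazyTiebreakerDemo::shardDoc);

    /** fields + _doc + _shard_doc: the tiebreaker runs only if field and _doc both tie. */
    static final Comparator<Hit> FIELDS_THEN_DOC_THEN_SHARD_DOC =
        Comparator.comparing((Hit h) -> h.gender)
            .thenComparingInt(h -> h.doc)
            .thenComparingLong(LazyTiebreakerDemo::shardDoc);

    /** Returns how many times shardDoc was called for a single comparison of a and b. */
    static int computations(Comparator<Hit> order, Hit a, Hit b) {
        SHARD_DOC_COMPUTATIONS.set(0);
        order.compare(a, b);
        return SHARD_DOC_COMPUTATIONS.get();
    }
}
```

In this model, comparing two hits with the same `gender` but different `doc` values calls `shardDoc` twice under `FIELDS_THEN_SHARD_DOC` and not at all under `FIELDS_THEN_DOC_THEN_SHARD_DOC`. The expensive key is computed under the second ordering only when both `gender` and `doc` tie, which mirrors the reasoning in the RCA.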