
Conversation

@prudhvigodithi
Member

@prudhvigodithi prudhvigodithi commented Jul 15, 2025

Description

Disable approximation when multiple sorts are used, because the secondary sort field is only applied after collection, not during the collection phase. With early termination at 10k documents, the query may miss documents with better secondary sort values that appear later in the index. Therefore, when multiple sorts are present, we must examine all documents to ensure correct results.
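As an illustration, here is a minimal standalone sketch (hypothetical data and names, not the OpenSearch implementation; collection order is simplified to index order) of how early termination can drop a document that ties on the primary value but wins on the secondary value:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: early termination keeps the first K collected docs, which
// can drop a doc that ties on the primary value but has a better secondary value.
public class EarlyTerminationSketch {
    record Doc(int id, long primary, long secondary) {}

    static final Comparator<Doc> BY_SORT = Comparator.comparingLong(Doc::primary).reversed()
        .thenComparing(Comparator.comparingLong(Doc::secondary).reversed());

    // Correct top-K: rank every document by (primary desc, secondary desc).
    static List<Integer> exactTopK(List<Doc> docs, int k) {
        return docs.stream().sorted(BY_SORT).limit(k).map(Doc::id).toList();
    }

    // Approximate top-K: stop collecting after K docs, then rank only those.
    static List<Integer> approxTopK(List<Doc> docs, int k) {
        return docs.stream().limit(k).sorted(BY_SORT).map(Doc::id).toList();
    }

    // Doc 2 ties doc 0 on the primary value but has the better secondary value,
    // and it appears after the early-termination cutoff.
    static final List<Doc> SAMPLE = List.of(
        new Doc(0, 100, 1), new Doc(1, 50, 5), new Doc(2, 100, 9));

    public static void main(String[] args) {
        System.out.println(exactTopK(SAMPLE, 2));  // [2, 0]
        System.out.println(approxTopK(SAMPLE, 2)); // [0, 1] -- doc 2 was never collected
    }
}
```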

Following the suggestion in #18763 (comment), it is worth adding support to the approximation framework for queries that use multiple sorts.

NOTE: Here are more details on the attempt #18763 (comment) to support multiple sorts.

curl -X POST "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "@timestamp": "desc" },
    { "agent.name": "desc" }
  ]
}
'

Related Issues

Part of #18619

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@prudhvigodithi
Member Author

prudhvigodithi commented Jul 15, 2025

Adding @kkewwei @msfroh @harshavamsi @andrross to take a look and share your thoughts (while I try to add some tests). I identified this while trying to implement approximation for search_after queries in #18546. Essentially, there have to be two search_after values when there are two sorts. This PR attempts to disable approximation when there are two sorts.

Signed-off-by: Prudhvi Godithi <[email protected]>
@github-actions
Contributor

❌ Gradle check result for 649efd6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@msfroh
Contributor

msfroh commented Jul 21, 2025

because the secondary sort field is only applied after collection

I don't think that's true, unless you're talking about the collection that's done during the approximation phase. (Lucene's collectors will just use a priority queue that applies the secondary sort as a tie-breaker after the first sort.) But if you are referring to the approximation collection, the other option is to keep collecting documents for the primary sort as long as the primary value remains the same as the last value.

That is, if I'm sorting by timestamp descending and my 10,000th hit corresponds to a document with time t, then (if there's a composite sort) I should continue collecting hits until I find a document with timestamp less than t. So, maybe instead of collecting 10,000 hits, I potentially collect 11,000. The good news is that it's probably still cheaper than running the full query that may match millions of hits.
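The tie-extension rule above can be sketched as follows (a hypothetical helper with assumed names, operating on primary values already in descending collection order, not the actual BKD walk):

```java
// Sketch of the tie-extension idea (hypothetical helper, not the OpenSearch code):
// after `size` hits, keep collecting while the primary value still equals the
// value of the size-th hit, so the secondary sort can break ties correctly.
public class TieExtensionSketch {
    static int hitsToCollect(long[] descendingPrimary, int size) {
        if (descendingPrimary.length <= size) {
            return descendingPrimary.length;
        }
        long boundary = descendingPrimary[size - 1]; // value of the size-th hit
        int n = size;
        while (n < descendingPrimary.length && descendingPrimary[n] == boundary) {
            n++; // extend collection through the tied run
        }
        return n;
    }

    public static void main(String[] args) {
        long[] timestamps = {9, 8, 8, 8, 7, 6};
        // The cutoff lands on a tied value (8), so we collect 4 hits instead of 2.
        System.out.println(hitsToCollect(timestamps, 2)); // 4
        // No tie at the cutoff: collect exactly `size` hits.
        System.out.println(hitsToCollect(new long[] {9, 8, 7}, 2)); // 2
    }
}
```

This is why collecting might grow from 10,000 to roughly 11,000 hits in the example above, which is still far cheaper than scoring millions of matches.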

You should be able to confirm with a unit test that fails without that fix.

From a sequencing standpoint, I think it's easier to address the composite sort problem first, and then address the search_after problem, since (I think) most search_after use-cases use a composite sort to ensure a unique document ordering.

@prudhvigodithi
Member Author

I don't think that's true, unless you're talking about the collection that's done during the approximation phase.

Yes @msfroh, I was referring to the approximation collection.

That is, if I'm sorting by timestamp descending and my 10,000th hit corresponds to a document with time t, then (if there's a composite sort) I should continue collecting hits until I find a document with timestamp less than t. So, maybe instead of collecting 10,000 hits, I potentially collect 11,000. The good news is that it's probably still cheaper than running the full query that may match millions of hits.

So collect normally until we hit our target, say 10,000. Once we reach 10,000 documents, compare the next doc: if it has the same primary value, continue collecting until the value changes. So even if the cell is CELL_INSIDE_QUERY, we should continue to use visitDocValues in that case when the query has a tie-breaker sort.

@prudhvigodithi
Member Author

prudhvigodithi commented Jul 23, 2025

From a sequencing standpoint, I think it's easier to address the composite sort problem first, and then address the search_after problem, since (I think) most search_after use-cases use a composite sort to ensure a unique document ordering.

Moving this to draft, as I will change this PR to ensure the approximation framework supports multi (tie-breaker) sorts (https://docs.opensearch.org/docs/latest/search-plugins/searching-data/sort/) and then take up the search_after support.
Adding @getsaurabh02

@prudhvigodithi prudhvigodithi changed the title Disable Approximation when dealing with multiple sorts Support Approximation with multiple sorts Jul 23, 2025
@prudhvigodithi prudhvigodithi marked this pull request as draft July 23, 2025 16:58
@prudhvigodithi prudhvigodithi changed the title Support Approximation with multiple sorts Support Approximation framework with multiple sorts Jul 23, 2025
@prudhvigodithi prudhvigodithi moved this from Todo to In Progress in Performance Roadmap Jul 28, 2025
@prudhvigodithi
Member Author

prudhvigodithi commented Jul 29, 2025

  • I was able to reproduce the bug with multiple sorts: the current code does not honor the secondary sort, and the docID assertion in the following test fails.
 public void testApproximationBreaksWithMultiSort() throws Exception {
        try (Directory dir = newDirectory();
             RandomIndexWriter w = new RandomIndexWriter(random(), dir, new WhitespaceAnalyzer())) {
            String field = numericType.fieldName;          // primary numeric field
            String secondary = "secondary";                // secondary sort key
            /* 1. Unique docs 0-999 */
            for (int i = 0; i < 1_000; i++) {
                Document d = new Document();
                numericType.addField(d, field, i);
                numericType.addDocValuesField(d, field, i);
                d.add(new NumericDocValuesField(secondary, i % 100));
                w.addDocument(d);
            }

            /* 2. 500 ties, worst-first:
             * insert ties in descending secondary order (499 ... 0)
             * so the first 50 the walker meets are the "bad" ones.
             */
            for (int i = 0; i < 500; i++) {
                int sec = 499 - i;                         // 499, 498, ..., 0
                Document d = new Document();
                numericType.addField(d, field, 1000);
                numericType.addDocValuesField(d, field, 1000);
                d.add(new NumericDocValuesField(secondary, sec));
                w.addDocument(d);
            }
            w.flush();
            for (int i = 1001; i <= 1100; i++) {
                Document d = new Document();
                numericType.addField(d, field, i);
                numericType.addDocValuesField(d, field, i);
                d.add(new NumericDocValuesField(secondary, i % 100));
                w.addDocument(d);
            }
            w.forceMerge(1);
            try (IndexReader r = DirectoryReader.open(w.w)) {
                IndexSearcher s = new IndexSearcher(r);
                long lower = 900, upper = 1100;
                int size   = 150;
                Sort sort = new Sort(
                    new SortField(numericType.getSortFieldName(), numericType.getSortFieldType(), true), // DESC primary
                    new SortField(secondary, SortField.Type.LONG, true)                                  // DESC secondary
                );
                Query exact = numericType.rangeQuery(field, lower, upper);
                ApproximatePointRangeQuery approx = new ApproximatePointRangeQuery(
                    field,
                    numericType.encode(lower),
                    numericType.encode(upper),
                    /* dims */ 1,
                    /* size */ size,
                    SortOrder.DESC,
                    numericType.format
                );
                SearchContext mockContext = mock(SearchContext.class);
                ShardSearchRequest mockRequest = mock(ShardSearchRequest.class);
                SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
                sourceBuilder.sort(new FieldSortBuilder(field).order(SortOrder.DESC));
                sourceBuilder.sort(new FieldSortBuilder(secondary).order(SortOrder.DESC));
                when(mockContext.request()).thenReturn(mockRequest);
                when(mockRequest.source()).thenReturn(sourceBuilder);
                approx.setContext(mockContext);
                TopDocs exactHits  = s.search(exact, size, sort);
                TopDocs approxHits = s.search(approx, size, sort);
                for (int i = 0; i < size; i++) {
                    assertEquals("doc at pos=" + i + " differs!",
                        exactHits.scoreDocs[i].doc,
                        approxHits.scoreDocs[i].doc);
                }
            }
        }
    }
  • Now considering the following idea:

That is, if I'm sorting by timestamp descending and my 10,000th hit corresponds to a document with time t, then (if there's a composite sort) I should continue collecting hits until I find a document with timestamp less than t. So, maybe instead of collecting 10,000 hits, I potentially collect 11,000. The good news is that it's probably still cheaper than running the full query that may match millions of hits.

I was able to add the logic and a test for multiple sorts, but I can see a performance impact. I updated the visit method that takes an iterator input to compare the value and, if it differs, return CELL_OUTSIDE_QUERY:

                    @Override
                    public void visit(DocIdSetIterator iterator, byte[] packedValue) throws IOException {
                        if (matches(packedValue)) {
                            if (hasMultipleSorts && docCount[0] >= size) {
                                if (tieBreakValue == null) {
                                    tieBreakValue = packedValue.clone();
                                } else {
                                    if (comparator.compare(packedValue, 0, tieBreakValue, 0) != 0) {
                                        foundDifferentValue = true;
                                        return;
                                    }
                                }
                            }
                            adder.add(iterator);
                        }
                    }

                    @Override
                    public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                        if (foundDifferentValue) {
                            return PointValues.Relation.CELL_OUTSIDE_QUERY;
                        }
                        return relate(minPackedValue, maxPackedValue);
                    }

Updated intersectRight to test performance with desc sorts:

            // custom intersect visitor to walk the right of tree (from rightmost leaf going left)
            public void intersectRight(PointValues.IntersectVisitor visitor, PointValues.PointTree pointTree, long[] docCount)
                throws IOException {
                if (docCount[0] >= size && hasMultipleSorts) {
                    PointValues.Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
                    if (r == PointValues.Relation.CELL_OUTSIDE_QUERY) {
                        return;
                    }
                } else if (docCount[0] >= size) {
                    return;
                }
                PointValues.Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
                if (r == PointValues.Relation.CELL_OUTSIDE_QUERY) {
                    return;
                }
                // Handle leaf nodes
                if (pointTree.moveToChild() == false) {
                    if (r == PointValues.Relation.CELL_INSIDE_QUERY) {
                        pointTree.visitDocIDs(visitor);
                    } else {
                        // CELL_CROSSES_QUERY
                        pointTree.visitDocValues(visitor);
                    }
                    return;
                }
                // Internal node - get left child reference (we're at left child initially)
                PointValues.PointTree leftChild = pointTree.clone();
                // Move to right child if it exists
                boolean hasRightChild = pointTree.moveToSibling();
                // For CELL_INSIDE_QUERY, check if we can skip left child
                if (!hasMultipleSorts && r == PointValues.Relation.CELL_INSIDE_QUERY && hasRightChild) {
                    long rightSize = pointTree.size();
                    long needed = size - docCount[0];
                    if (rightSize >= needed) {
                        // Right child has all we need - only process right
                        intersectRight(visitor, pointTree, docCount);
                        pointTree.moveToParent();
                        return;
                    }
                }
                if (hasMultipleSorts && docCount[0] >= size) {
                    if (hasRightChild) {
                        PointValues.Relation rightRelation = visitor.compare(
                            pointTree.getMinPackedValue(),
                            pointTree.getMaxPackedValue()
                        );
                        if (rightRelation != PointValues.Relation.CELL_OUTSIDE_QUERY) {
                            intersectRight(visitor, pointTree, docCount);
                        }
                    }
                    // Then check left child
                    PointValues.Relation leftRelation = visitor.compare(
                        leftChild.getMinPackedValue(),
                        leftChild.getMaxPackedValue()
                    );
                    if (leftRelation != PointValues.Relation.CELL_OUTSIDE_QUERY) {
                        intersectRight(visitor, leftChild, docCount);
                    }
                } else {
                    // Normal processing when we haven't reached size yet
                    if (hasRightChild) {
                        intersectRight(visitor, pointTree, docCount);
                    }
                    // Process left child
                    if (hasMultipleSorts || docCount[0] < size) {
                        intersectRight(visitor, leftChild, docCount);
                    }
                }
                pointTree.moveToParent();
            }
  • Tested with the nyc_taxis dataset, and I can see the following curl command took 959ms (too much tree traversal was done) vs around 350ms without approximation:
curl -X POST "localhost:9200/nyc_taxis/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "pickup_datetime": "desc" },
    { "passenger_count": "desc" }
  ]
}
'

@msfroh, am I missing anything here? Seeing this, I feel there is a performance impact when dealing with multiple sorts. In the flame graph I see too many tree traversals.

Signed-off-by: Prudhvi Godithi <[email protected]>
@github-actions
Contributor

✅ Gradle check result for 25f5b8b: SUCCESS

@prudhvigodithi
Member Author

I have tested a few approaches, and it looks like there is some complexity in comparing the packedValue against the collected docs. The main issues are:

  • When handling multi-sort queries, the approximation needs to collect all documents with the same primary sort value as the last document in the result set. Document counting happens everywhere: docCount[0] is incremented in multiple visitor methods:

    • visit(int docID) - increments by 1
    • visit(IntsRef ref) - increments by ref.length
    • visit(DocIdSetIterator iterator) - doesn't increment (for a desc sort, even though we traverse from right to left, the docs inside a leaf are in asc order; to retain the highest docs (not just higher docs), and to avoid flipping and re-adding them backward, we don't increment and instead add the iterator directly)
  • Value access is limited: the actual document values (packedValue) needed to compare and set the tie-break value are only available in the following methods, which should only get called when we are collecting a full leaf:

    • visit(int docID, byte[] packedValue)
    • visit(DocIdSetIterator iterator, byte[] packedValue)
  • When docCount[0] reaches size through the basic visit methods (without packedValue), the traversal logic stops because:

    • docCount[0] >= size (we have "enough" docs)
    • tieBreakValue == null (we haven't seen a value to set it)
    • Now the condition docCount[0] < size || (tieBreakValue != null && !foundDifferentValue) evaluates to false and tree traversal stops before we can set the tie-break value.
  • Without a tie-break value we can't properly collect all tied documents, so the approximation either returns incorrect results for multi-sort queries (if we terminate early) or is forced to traverse the entire tree, which adds query latency and loses the early-termination advantage.
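The accounting gap described above can be reduced to a toy visitor (overload shapes borrowed from Lucene's IntersectVisitor; everything else here is illustrative, not the OpenSearch code): if the size threshold is crossed via the value-less overload, no tie-break value is ever observed.

```java
// Toy mirror of the visitor overloads (shape only; names are illustrative).
public class VisitorAccountingSketch {
    int docCount = 0;
    byte[] tieBreakValue = null;

    // CELL_INSIDE_QUERY path: counts the doc but never sees its sort value.
    void visit(int docID) {
        docCount++;
    }

    // CELL_CROSSES_QUERY path: the only place a tie-break value can be captured.
    void visit(int docID, byte[] packedValue) {
        docCount++;
        if (tieBreakValue == null) {
            tieBreakValue = packedValue.clone();
        }
    }

    public static void main(String[] args) {
        VisitorAccountingSketch v = new VisitorAccountingSketch();
        int size = 3;
        for (int d = 0; d < size; d++) {
            v.visit(d); // whole leaves matched: value-less overload only
        }
        // Threshold reached, yet no tie-break value exists to compare against.
        System.out.println(v.docCount >= size && v.tieBreakValue == null); // true
    }
}
```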

@msfroh, I will create a separate issue with some proposals to support multiple sorts. Since we are close to the 3.2 release, let's get the bug fix merged to disable approximation for multiple sorts. WDYT?

Thanks

@prudhvigodithi prudhvigodithi added the bug Something isn't working label Jul 30, 2025
@prudhvigodithi prudhvigodithi changed the title Support Approximation framework with multiple sorts Disable Approximation when dealing with multiple sorts Jul 30, 2025
@prudhvigodithi prudhvigodithi marked this pull request as ready for review July 30, 2025 00:11
@prudhvigodithi
Member Author

@msfroh can I get your approval to move forward with this PR? Thanks

Signed-off-by: Prudhvi Godithi <[email protected]>
@github-actions
Contributor

❌ Gradle check result for 22e4ae4: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 22e4ae4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for 22e4ae4: SUCCESS

@cwperks cwperks merged commit 4b90da1 into opensearch-project:main Jul 31, 2025
32 of 37 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Jul 31, 2025
sunqijun1 pushed a commit to sunqijun1/OpenSearch that referenced this pull request Aug 4, 2025
…oject#18763)

* Approximation Framework bug fix

Signed-off-by: Prudhvi Godithi <[email protected]>

* Add tests

Signed-off-by: Prudhvi Godithi <[email protected]>

* Fix spacing

Signed-off-by: Prudhvi Godithi <[email protected]>

---------

Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: sunqijun.jun <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025

Labels

bug Something isn't working v3.2.0

Projects

Status: Done

5 participants