@LantaoJin
Member

@LantaoJin LantaoJin commented Dec 10, 2025

Description

Push down joins that use the max=n option to a TopHits aggregation:

  • The right-side subsearch with max=n is converted to a TopHits aggregation.
  • For inner joins, the SortMergeJoin may be converted to a HashJoin by reordering the sides of the join.
  • For non-inner joins, the right side is fully pushed down to DSL rather than executing a WindowFunction in memory.
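
To make the max=n semantics concrete, here is a minimal in-memory sketch of what the option asks for: at most n right-side rows per join key. It is an illustration only, not code from this PR; the class `JoinMaxSketch`, the method `capPerKey`, and the row layout are hypothetical. The running per-key counter plays the role of ROW_NUMBER() OVER (PARTITION BY key), which is the work the pushdown is meant to move out of memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of max=n semantics; not taken from the PR. */
public class JoinMaxSketch {

  /**
   * Keeps at most maxPerKey rows for each join key, in encounter order.
   * Each row is a String[] whose element at keyIndex is the join key.
   */
  public static List<String[]> capPerKey(List<String[]> rows, int keyIndex, int maxPerKey) {
    Map<String, Integer> seen = new HashMap<>();
    List<String[]> kept = new ArrayList<>();
    for (String[] row : rows) {
      // merge returns the updated count, i.e. this row's number within its key partition.
      int rowNumber = seen.merge(row[keyIndex], 1, Integer::sum);
      if (rowNumber <= maxPerKey) {
        kept.add(row);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String[]> right = List.of(
        new String[] {"Smith", "NY"},
        new String[] {"Smith", "CA"},
        new String[] {"Jones", "TX"});
    // With max=1 only one row per lastname survives: "Smith NY", then "Jones TX".
    for (String[] row : capPerKey(right, 0, 1)) {
      System.out.println(row[0] + " " + row[1]);
    }
  }
}
```

Which row survives for a key depends on the order in which rows arrive, which is why the determinism of the pushed-down plan comes up later in the review.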

Related Issues

Resolves #4927

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai
Contributor

coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Summary by CodeRabbit

  • Changes

    • Applied join subsearch limiting at a unified point in planning to reduce excessive inner processing and improve plan consistency.
    • Simplified dedup semantics by removing explicit ORDER BY from ROW_NUMBER windows, which may change which row is selected in ties.
    • Adjusted join planning to favor more efficient execution shapes and pushdown-friendly arrangements.
  • Tests

    • Updated integration and unit test expectations to match revised plans and sorting/dedup behavior.

Walkthrough

Adds a helper to apply a join-subsearch system limit in CalciteRelNodeVisitor, invokes it after dedup steps in join paths, removes ORDER BY clauses from ROW_NUMBER() dedup windows, updates PlanUtils row-number detection, and adjusts integration/PPL tests and expected-output YAMLs accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Calcite Visitor Logic: core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java | Introduces private helper addSysLimitForJoinSubsearch(CalcitePlanContext) and invokes it after dedup processing in join handling; removes inline join-subsearch maxout logic; removes ORDER BY from dedup window builders; adjusts SEMI/ANTI join control flow. Duplicate helper definitions present in-file. |
| Plan Utilities: core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java | containsRowNumberDedup(RelNode) now detects either ROW_NUMBER_COLUMN_FOR_DEDUP or ROW_NUMBER_COLUMN_FOR_JOIN_MAX_DEDUP when scanning row field names. |
| Integration test harness: integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java | Removed per-test pushdown gating (deleted enabledOnlyWhenPushdownIsEnabled() calls) for join-max tests and related TODO comments. |
| PPL & Dashboard tests: integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/*.java, ppl/src/test/java/org/opensearch/sql/ppl/calcite/*.java | Several PPL dashboard tests add secondary sort keys for deterministic tie-breaking; PPL unit tests updated to reflect removed ORDER BY in ROW_NUMBER() window expressions and adjusted assertions. |
| Expected output, dedup/window changes: integ-test/src/test/resources/expectedOutput/.../explain_dedup_*.yaml, .../explain_output.yaml, .../explain_dedup_keepempty_*.yaml, .../explain_dedup_text_type_no_push.yaml | Removed ORDER BY clauses from ROW_NUMBER() window expressions in multiple logical and corresponding physical (EnumerableWindow) expected-output files. |
| Expected output, join/max-option plan changes: integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_*_max_option.yaml, .../explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml, .../calcite_no_pushdown/explain_join_with_*_max_option.yaml | Reordered plans to apply LogicalSystemLimit (JOIN_SUBSEARCH_MAXOUT) earlier; adjusted pushdown contexts toward aggregation/top_hits and reflected physical-plan changes (merge-join → hash-join or simplified scan paths); added new no-pushdown expected files. |
| Misc test resources & formatting: integ-test/src/test/resources/expectedOutput/... (various), integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java | Multiple YAML expected-output updates to mirror dedup/window and join-subsearch limit changes; removed an unused static import and minor formatting/query string adjustments in dashboard tests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus areas:
    • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java — verify single canonical helper (remove duplicates), correct insertion point of sys-limit across all join paths, and SEMI/ANTI handling.
    • Dedup window changes — ensure dropping ORDER BY from ROW_NUMBER() maintains acceptable determinism for callers/tests.
    • Integration expected-output YAMLs — confirm pushdown context (aggregation/top_hits/composite_buckets) and physical-plan transformations match intended behavior.
    • Tests — validate PPL and dashboard test adjustments (secondary sort keys) and removed pushdown gating are correct.
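
The secondary sort keys mentioned in the last focus area can be illustrated with a small sketch. It assumes rows of {name, count} and hypothetical names (`TieBreakSketch`, `sortWithTieBreak`); the actual dashboard tests express the same idea in PPL sort clauses rather than in Java.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of tie-breaking with a secondary sort key. */
public class TieBreakSketch {

  /** Sorts {name, count} rows by count descending, then by name ascending to break ties. */
  public static List<String[]> sortWithTieBreak(List<String[]> rows) {
    List<String[]> sorted = new ArrayList<>(rows);
    Comparator<String[]> byCountDesc =
        Comparator.comparingInt((String[] r) -> Integer.parseInt(r[1])).reversed();
    // Without the secondary key, rows with equal counts keep their arrival order,
    // so the output would depend on how the input happened to be ordered.
    sorted.sort(byCountDesc.thenComparing(r -> r[0]));
    return sorted;
  }

  public static void main(String[] args) {
    List<String[]> rows = List.of(
        new String[] {"b", "2"},
        new String[] {"a", "2"},
        new String[] {"c", "5"});
    // Prints c 5, a 2, b 2 regardless of the input order of "a" and "b".
    for (String[] row : sortWithTieBreak(rows)) {
      System.out.println(row[0] + " " + row[1]);
    }
  }
}
```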

Suggested labels

pushdown, backport 2.19-dev

Suggested reviewers

  • penghuo
  • anirudha
  • kavithacm
  • derek-ho
  • joshuali925
  • GumpacG
  • Swiddis
  • Yury-Fridlyand
  • ps48

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Pushdown join with max=n option to TopHits aggregation' clearly and specifically summarizes the main change: enabling pushdown of join operations with the max=n option to TopHits aggregation for performance improvement. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining what will be pushed down (right-side subsearch with max=n), the mechanism (TopHits aggregation), and the benefits (potential HashJoin conversion, full DSL pushdown for non-inner joins). |
| Linked Issues check | ✅ Passed | The PR implementation aligns with issue #4927: it pushes down join with max=n to TopHits aggregation, enables DSL pushdown to avoid in-memory WindowFunction execution, and implements the dedup/join max logic through new helper methods and LogicalSystemLimit integration. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the stated objective of pushing down join with max=n to TopHits aggregation: core join processing refactoring, dedup logic simplification, test updates for sorting determinism, and expected output adjustments for the new plan structure. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b76ac35 and 197cf60.

📒 Files selected for processing (2)
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (0 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (2 hunks)
💤 Files with no reviewable changes (1)
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: LantaoJin
Repo: opensearch-project/sql PR: 0
File: :0-0
Timestamp: 2025-12-11T05:27:39.831Z
Learning: In opensearch-project/sql, for SEMI and ANTI join types in CalciteRelNodeVisitor.java, the `max` option has no effect because these join types only use the left side to filter records based on the existence of matches in the right side. The join results are identical regardless of max value (max=1, max=2, or max=∞). The early return for SEMI/ANTI joins before processing the `max` option is intentional and correct behavior.
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Follow existing patterns in `CalciteRelNodeVisitor` and `CalciteRexNodeVisitor` for Calcite integration
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes

Applied to files:

  • integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml
📚 Learning: 2025-12-11T05:27:39.831Z
Learnt from: LantaoJin
Repo: opensearch-project/sql PR: 0
File: :0-0
Timestamp: 2025-12-11T05:27:39.831Z
Learning: In opensearch-project/sql, for SEMI and ANTI join types in CalciteRelNodeVisitor.java, the `max` option has no effect because these join types only use the left side to filter records based on the existence of matches in the right side. The join results are identical regardless of max value (max=1, max=2, or max=∞). The early return for SEMI/ANTI joins before processing the `max` option is intentional and correct behavior.

Applied to files:

  • integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: build-linux (25, integration)
  • GitHub Check: build-linux (25, doc)
  • GitHub Check: build-linux (25, unit)
  • GitHub Check: build-linux (21, integration)
  • GitHub Check: build-linux (21, unit)
  • GitHub Check: build-linux (21, doc)
  • GitHub Check: bwc-tests-rolling-upgrade (25)
  • GitHub Check: bwc-tests-full-restart (21)
  • GitHub Check: bwc-tests-full-restart (25)
  • GitHub Check: bwc-tests-rolling-upgrade (21)
  • GitHub Check: security-it-linux (25)
  • GitHub Check: security-it-linux (21)
  • GitHub Check: security-it-windows-macos (windows-latest, 25)
  • GitHub Check: security-it-windows-macos (macos-14, 21)
  • GitHub Check: security-it-windows-macos (windows-latest, 21)
  • GitHub Check: security-it-windows-macos (macos-14, 25)
  • GitHub Check: build-windows-macos (macos-14, 25, unit)
  • GitHub Check: build-windows-macos (macos-14, 25, doc)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (macos-14, 25, integration)
  • GitHub Check: build-windows-macos (macos-14, 21, doc)
  • GitHub Check: build-windows-macos (macos-14, 21, unit)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (macos-14, 21, integration)
  • GitHub Check: CodeQL-Scan (java)
  • GitHub Check: test-sql-cli-integration (21)
🔇 Additional comments (1)
integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (1)

8-21: No issues found. The expected output file correctly reflects the test query join type=left max=1 lastname with proper field index mappings and optimization transformations. Field index $6 correctly refers to lastname in the right-side projection, and the dedup window ROW_NUMBER() OVER (PARTITION BY $6) appropriately partitions by this field. The test method in CalciteExplainIT.java:1044-1051 loads this expected output and validates the explain plan generation.


@LantaoJin LantaoJin added the enhancement New feature or request label Dec 10, 2025
Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: Lantao Jin <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)

1323-1374: Address the gap between documentation and SEMI/ANTI join implementation.

The documentation allows the max option with semi and anti join types, but the code returns early for SEMI/ANTI joins (lines 1328-1332) before processing the max option or applying the dedup/limit optimizations. This creates an inconsistency: users can specify max with SEMI/ANTI joins per the documented syntax, but it is silently ignored.

Either remove the early return to enable max support for SEMI/ANTI joins, or add a validation error when max is specified with SEMI/ANTI, and update documentation to clarify the limitation.
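
For context on what "silently ignored" means in practice, the following sketch shows why capping the right side cannot change a SEMI or ANTI join result, provided the cap is at least 1: both join types only test whether a matching key exists. The class and method names are hypothetical, and join keys are reduced to plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration: SEMI/ANTI joins depend only on key existence. */
public class SemiAntiJoinSketch {

  /** Keeps left keys that have at least one match on the right. */
  public static List<String> semiJoin(List<String> leftKeys, List<String> rightKeys) {
    Set<String> present = new HashSet<>(rightKeys);
    List<String> out = new ArrayList<>();
    for (String key : leftKeys) {
      if (present.contains(key)) {
        out.add(key);
      }
    }
    return out;
  }

  /** Keeps left keys that have no match on the right. */
  public static List<String> antiJoin(List<String> leftKeys, List<String> rightKeys) {
    Set<String> present = new HashSet<>(rightKeys);
    List<String> out = new ArrayList<>();
    for (String key : leftKeys) {
      if (!present.contains(key)) {
        out.add(key);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> left = List.of("a", "b", "c");
    List<String> rightUncapped = List.of("a", "a", "a", "b"); // no max applied
    List<String> rightCapped = List.of("a", "b");             // what max=1 would leave
    // Both comparisons print true: the cap removes duplicates, never a whole key.
    System.out.println(semiJoin(left, rightUncapped).equals(semiJoin(left, rightCapped)));
    System.out.println(antiJoin(left, rightUncapped).equals(antiJoin(left, rightCapped)));
  }
}
```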

♻️ Duplicate comments (5)
ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java (1)

1034-1050: Consistent test expectation updates.

Same pattern as lines 1005-1020, correctly updating expectations for the simplified window function in the criteria-based join test.

ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java (4)

61-87: Same determinism concern applies to testDedup2.

This test method exhibits the same issue as testDedup1: deterministic expectedResult (lines 65-77) with non-deterministic ROW_NUMBER() OVER (PARTITION BY $7) (line 61).


100-179: Same determinism concern applies to keepempty test methods.

Both testDedupKeepEmpty1 and testDedupKeepEmpty2 use ROW_NUMBER() OVER (PARTITION BY $7, $2) without ORDER BY (lines 100, 143), yet expect deterministic results. The keepempty=true logic adds OR conditions for NULL values, but doesn't address the non-deterministic ordering issue.


192-230: Same determinism concern applies to expression-based dedup tests.

The testDedupExpr method uses computed expressions (NEW_DEPTNO) in dedup operations with unordered ROW_NUMBER() (lines 192, 211, 224), maintaining the same determinism concerns.


242-274: Same determinism concern applies to rename dedup tests.

The testRenameDedup method follows the same pattern with renamed columns (lines 242, 255, 268).

🧹 Nitpick comments (2)
integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_true_not_pushed.yaml (1)

6-11: Verify test determinism with unordered ROW_NUMBER().

The ROW_NUMBER() OVER (PARTITION BY $4) window function lacks an ORDER BY clause, making row numbering within each partition non-deterministic. While this may be intentional for performance (allowing OpenSearch to use its default ordering), it could affect test result predictability.

For dedup operations where the specific rows kept don't matter, this is semantically valid, but the test assertions should account for potential result variation across runs or OpenSearch versions.

Confirm that:

  1. The test expectations in the corresponding test file are flexible enough to handle non-deterministic ordering
  2. This behavioral change is documented in the PR description or test comments
  3. The removal of ORDER BY is intentional and aligns with the TopHits aggregation pushdown strategy
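
As a rough illustration of the first point, the sketch below keeps the first row seen per partition key, mimicking ROW_NUMBER() = 1 without an ORDER BY. Feeding the same rows in two different orders keeps different rows, while the set of surviving keys stays the same. All names here are hypothetical; an order-insensitive test would assert on the keys or the row count rather than on the specific row kept.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of dedup without a defined ordering. */
public class DedupDeterminismSketch {

  /** Keeps the first {key, value} row seen for each key, in encounter order. */
  public static Map<String, String> keepFirstPerKey(List<String[]> rows) {
    Map<String, String> kept = new LinkedHashMap<>();
    for (String[] row : rows) {
      kept.putIfAbsent(row[0], row[1]);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String[]> orderA = List.of(
        new String[] {"10", "alice"},
        new String[] {"10", "bob"});
    List<String[]> orderB = List.of(
        new String[] {"10", "bob"},
        new String[] {"10", "alice"});
    Map<String, String> a = keepFirstPerKey(orderA);
    Map<String, String> b = keepFirstPerKey(orderB);
    System.out.println(a.keySet().equals(b.keySet())); // true: same keys survive
    System.out.println(a.equals(b));                   // false: a different row was kept
  }
}
```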
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)

1383-1393: Consider adding JavaDoc for clarity on system limit vs max option.

The helper method is well-implemented and centralizes the limit application logic. However, since this interacts with the max option in the join paths above, consider adding JavaDoc to clarify:

  1. This applies a system-level join.subsearch_maxout limit, separate from the user-specified max option
  2. The max option controls deduplication, while this limit controls the overall subsearch output size
  3. Why it's called unconditionally (not just when max is set)

This would help future maintainers understand the relationship between these two related but distinct concepts.

Based on learnings: Document Calcite-specific workarounds and optimization patterns in code.

Example JavaDoc:

+  /**
+   * Applies system-level join subsearch limit to the current RelBuilder top.
+   * <p>
+   * This limit is independent of the user-specified {@code max} option:
+   * <ul>
+   *   <li>{@code max} option: controls deduplication of the right side</li>
+   *   <li>join.subsearch_maxout: system limit on total subsearch output</li>
+   * </ul>
+   * Both optimizations work together to enable TopHits aggregation pushdown.
+   *
+   * @param context CalcitePlanContext containing the RelBuilder and system limits
+   */
   private static void addSysLimitForJoinSubsearch(CalcitePlanContext context) {
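
The distinction between the max option and the join.subsearch_maxout limit can also be sketched outside Calcite. The example below is a simplified model under two assumptions: the per-key cap runs first and the system limit is applied to its output (the order the walkthrough describes), and rows are reduced to their join keys. `SubsearchLimitSketch` and `applyMaxThenSysLimit` are hypothetical names, not part of the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of per-key max followed by a total subsearch limit. */
public class SubsearchLimitSketch {

  public static List<String> applyMaxThenSysLimit(
      List<String> rightKeys, int maxPerKey, int subsearchMaxOut) {
    // Stage 1: user-specified max, at most maxPerKey rows for each key.
    Map<String, Integer> seen = new HashMap<>();
    List<String> deduped = new ArrayList<>();
    for (String key : rightKeys) {
      if (seen.merge(key, 1, Integer::sum) <= maxPerKey) {
        deduped.add(key);
      }
    }
    // Stage 2: system limit, at most subsearchMaxOut rows overall.
    int end = Math.min(subsearchMaxOut, deduped.size());
    return new ArrayList<>(deduped.subList(0, end));
  }

  public static void main(String[] args) {
    List<String> right = List.of("a", "a", "a", "b", "b", "c");
    // max=2 leaves a, a, b, b, c; a system limit of 4 then trims it to [a, a, b, b].
    System.out.println(applyMaxThenSysLimit(right, 2, 4));
  }
}
```

The two knobs bound different things: one bounds rows per key, the other bounds the total, so either can be the binding constraint depending on the data.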
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7fc5f5 and 303952d.

📒 Files selected for processing (27)
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (3 hunks)
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java (1 hunks)
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1 hunks)
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java (1 hunks)
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java (4 hunks)
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (2 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex1.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex2.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex3.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex4.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_false_push.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_true_not_pushed.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_push.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_text_type_no_push.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_criteria_max_option.yaml (2 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_fields_max_option.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite/explain_output.yaml (2 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (2 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_false_push.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_true_not_pushed.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_push.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_criteria_max_option.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_fields_max_option.yaml (1 hunks)
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_output.yaml (2 hunks)
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java (14 hunks)
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java (5 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

**/*.java: Use PascalCase for class names (e.g., QueryExecutor)
Use camelCase for method and variable names (e.g., executeQuery)
Use UPPER_SNAKE_CASE for constants (e.g., MAX_RETRY_COUNT)
Keep methods under 20 lines with single responsibility
All public classes and methods must have proper JavaDoc
Use specific exception types with meaningful messages for error handling
Prefer Optional<T> for nullable returns in Java
Avoid unnecessary object creation in loops
Use StringBuilder for string concatenation in loops
Validate all user inputs, especially queries
Sanitize data before logging to prevent injection attacks
Use try-with-resources for proper resource cleanup in Java
Maintain Java 11 compatibility when possible for OpenSearch 2.x
Document Calcite-specific workarounds in code

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java

⚙️ CodeRabbit configuration file

**/*.java: - Verify Java naming conventions (PascalCase for classes, camelCase for methods/variables)

  • Check for proper JavaDoc on public classes and methods
  • Flag redundant comments that restate obvious code
  • Ensure methods are under 20 lines with single responsibility
  • Verify proper error handling with specific exception types
  • Check for Optional usage instead of null returns
  • Validate proper use of try-with-resources for resource management

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
integ-test/**/*IT.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

End-to-end scenarios need integration tests in integ-test/ module

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java

⚙️ CodeRabbit configuration file

integ-test/**/*IT.java: - Verify integration tests are in correct module (integ-test/)

  • Check tests can be run with ./gradlew :integ-test:integTest
  • Ensure proper test data setup and teardown
  • Validate end-to-end scenario coverage

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
**/*IT.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

Name integration tests with *IT.java suffix in OpenSearch SQL

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
**/test/**/*.java

⚙️ CodeRabbit configuration file

**/test/**/*.java: - Verify test coverage for new business logic

  • Check test naming follows conventions (*Test.java for unit, *IT.java for integration)
  • Ensure tests are independent and don't rely on execution order
  • Validate meaningful test data that reflects real-world scenarios
  • Check for proper cleanup of test resources

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
**/calcite/**/*.java

⚙️ CodeRabbit configuration file

**/calcite/**/*.java: - Follow existing patterns in CalciteRelNodeVisitor and CalciteRexNodeVisitor

  • Verify SQL generation and optimization paths
  • Document any Calcite-specific workarounds
  • Test compatibility with Calcite version constraints

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
**/ppl/**/*.java

⚙️ CodeRabbit configuration file

**/ppl/**/*.java: - For PPL parser changes, verify grammar tests with positive/negative cases

  • Check AST generation for new syntax
  • Ensure corresponding AST builder classes are updated
  • Validate edge cases and boundary conditions

Files:

  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
**/*Test.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

**/*Test.java: All new business logic requires unit tests
Name unit tests with *Test.java suffix in OpenSearch SQL

Files:

  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes

Applied to files:

  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_push.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_false_push.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex4.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_fields_max_option.yaml
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex1.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex2.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_push.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex3.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_output.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_criteria_max_option.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_true_not_pushed.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_fields_max_option.yaml
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java
  • integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_false_push.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_true_not_pushed.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_output.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_text_type_no_push.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_criteria_max_option.yaml
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Applies to **/*.java : Document Calcite-specific workarounds in code

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Applies to **/*IT.java : Name integration tests with `*IT.java` suffix in OpenSearch SQL

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Applies to **/*Test.java : Name unit tests with `*Test.java` suffix in OpenSearch SQL

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Follow existing patterns in `CalciteRelNodeVisitor` and `CalciteRexNodeVisitor` for Calcite integration

Applied to files:

  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
🧬 Code graph analysis (2)
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1)
integ-test/src/test/java/org/opensearch/sql/ppl/PPLIntegTestCase.java (1)
  • GlobalPushdownConfig (307-310)
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)
core/src/main/java/org/opensearch/sql/calcite/plan/LogicalSystemLimit.java (1)
  • LogicalSystemLimit (25-90)
🔇 Additional comments (30)
ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java (2)

1005-1020: Test expectations correctly updated for simplified window function.

The removal of ORDER BY from the ROW_NUMBER window function aligns with the PR objective to optimize join dedup operations. The window function now uses ROW_NUMBER() OVER (PARTITION BY ...) without explicit ordering.

Please verify that removing ORDER BY from the dedup window function doesn't affect the correctness of the max=n join semantics. When ORDER BY is absent, the row numbering within each partition becomes non-deterministic. Confirm this is acceptable for the dedup use case or that ordering is handled elsewhere in the execution path.

Based on learnings, verify SQL generation and optimization paths for Calcite integration changes.


1080-1080: Good catch - test logic correction.

This change fixes a test logic bug where root1 was being verified twice. The corrected code now properly tests root2 (SQL-like join syntax) to ensure both join syntaxes behave correctly before the maxout setting is applied.

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_fields_max_option.yaml (3)

8-8: New JOIN_SUBSEARCH_MAXOUT limit type is appropriately introduced.

The logical plan correctly applies LogicalSystemLimit with type [JOIN_SUBSEARCH_MAXOUT] on the right-side subquery, aligning with the PR objective of pushing down the max=n limit to the join subsearch. This replaces in-memory window function processing with DSL-level limit enforcement.


10-11: ROW_NUMBER window function dedup structure is correct.

The window function at line 11 uses ROW_NUMBER() OVER (PARTITION BY $0) without an ORDER BY clause, which aligns with the PR's removal of ORDER BY from dedup window functions. This simplification supports pushdown compatibility.


15-27: The expected output file is correct and requires no changes.

The no-pushdown variant properly uses EnumerableMergeJoin at line 18 because both inputs are preceded by EnumerableSort operations (lines 19 and 22), enabling sort-merge join optimization. This differs from the pushdown variant which uses EnumerableHashJoin with pushed-down limits. The physical plan structure correctly reflects the no-pushdown optimization strategy where sorting happens client-side before the merge join.

integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/WafPplDashboardIT.java (1)

147-148: LGTM! Secondary sort key ensures deterministic top-k results.

Adding the secondary sort by terminatingRuleId ensures that when multiple terminating rules have the same count, the results are consistently ordered. This prevents test flakiness and aligns with the PR's pattern of stabilizing top-k queries across dashboard tests.
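
As a rough illustration of why the tie-breaker matters, the following plain-Java sketch ranks keys by count and then by key name. The class and method names (`TopKTieBreak`, `topK`) and the sample rule names are hypothetical and are not taken from the test suite; the real tests express the same idea with a secondary sort field in PPL.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class TopKTieBreak {
  /** Returns the top-k keys by count (descending), breaking ties by key (ascending). */
  static List<String> topK(Map<String, Long> counts, int k) {
    return counts.entrySet().stream()
        .sorted(
            Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "rule-a" and "rule-b" tie on count; the key comparison decides which one is kept.
    Map<String, Long> counts = Map.of("rule-b", 3L, "rule-a", 3L, "rule-c", 5L);
    System.out.println(topK(counts, 2));
  }
}
```

Without the second comparator, which of the two tied keys makes it into the top 2 would depend on iteration order, which is the kind of flakiness the added sort key avoids.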

integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/NfwPplDashboardIT.java (1)

439-440: LGTM! Secondary sort key improves test determinism.

The addition of event.tcp.tcp_flags as a secondary sort key ensures consistent ordering when TCP flag counts are tied, preventing non-deterministic test results.

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_keepempty_false_push.yaml (1)

6-6: LGTM! Simplified dedup window semantics align with PR objectives.

Removing the ORDER BY clause from the ROW_NUMBER() window function in dedup scenarios simplifies the plan and supports the pushdown optimization for join with max=n option.

Also applies to: 13-13

integ-test/src/test/java/org/opensearch/sql/ppl/dashboard/VpcFlowLogsPplDashboardIT.java (1)

169-171: LGTM! Secondary sort keys ensure deterministic test results.

Multiple test methods now include secondary sort keys (srcaddr/dstaddr) to break ties when aggregate counts are equal. This ensures reproducible, stable test results and aligns with the broader pattern of deterministic top-k queries across dashboard integration tests.

Also applies to: 192-194, 215-215, 237-237

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_push.yaml (1)

6-6: LGTM! Dedup window simplification supports pushdown optimization.

integ-test/src/test/resources/expectedOutput/calcite/explain_output.yaml (1)

6-6: LGTM! Dedup window simplification in both logical and physical plans.

Also applies to: 19-19

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex2.yaml (1)

6-6: LGTM! Multi-column dedup window simplified consistently.

core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java (1)

463-467: No action required—the change correctly supports join-max dedup scenarios.

The expanded detection is necessary and correct. CalciteRelNodeVisitor conditionally generates either ROW_NUMBER_COLUMN_FOR_DEDUP or ROW_NUMBER_COLUMN_FOR_JOIN_MAX_DEDUP depending on the fromJoinMaxOption flag (line 1599). All callers in DedupPushdownRule use containsRowNumberDedup as a simple boolean check and don't depend on which specific dedup column type is present, making them compatible with the expanded detection logic.
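
A minimal sketch of what "expanded detection" could look like, assuming the check only needs to know whether a projection exposes either marker column. This is not the actual `PlanUtils` code: the real method works on Calcite plan nodes, whereas this version takes a plain list of field names. The string `_row_number_join_max_dedup_` appears in the explain output of this PR; the value `_row_number_dedup_` is an assumption.

```java
import java.util.List;
import java.util.Set;

// Simplified stand-in for the detection logic, for illustration only.
final class RowNumberDedupCheck {
  // Assumed value for the plain dedup marker column.
  static final String ROW_NUMBER_COLUMN_FOR_DEDUP = "_row_number_dedup_";
  // Marker column name as it appears in the explain output of this PR.
  static final String ROW_NUMBER_COLUMN_FOR_JOIN_MAX_DEDUP = "_row_number_join_max_dedup_";

  private static final Set<String> DEDUP_MARKERS =
      Set.of(ROW_NUMBER_COLUMN_FOR_DEDUP, ROW_NUMBER_COLUMN_FOR_JOIN_MAX_DEDUP);

  /** Returns true if any projected field name is one of the two dedup marker columns. */
  static boolean containsRowNumberDedup(List<String> fieldNames) {
    return fieldNames.stream().anyMatch(DEDUP_MARKERS::contains);
  }
}
```

Because callers only consume the boolean, adding a second marker name to the set does not change how they behave.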

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_false_push.yaml (1)

6-6: Test expectation updated correctly for dedup window simplification.

The removal of ORDER BY $1 from the ROW_NUMBER() OVER (PARTITION BY $1) window aligns with the PR objective to simplify dedup operations and prepare for TopHits aggregation pushdown. This change is consistent across all dedup-related explain outputs in the PR.

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex3.yaml (1)

6-6: Test expectation updated correctly for multi-column dedup window.

The removal of ORDER BY $4, $7 from the multi-column partition dedup window is consistent with the broader dedup simplification in this PR.

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex1.yaml (1)

6-6: Test expectation updated correctly.

Consistent dedup window simplification removing ORDER BY $4.

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_complex4.yaml (1)

6-6: Test expectation updated correctly.

Consistent dedup window simplification for multi-column partition.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (2)

89-96: Join with max option tests can now run without global pushdown enabled.

Removing the enabledOnlyWhenPushdownIsEnabled() preconditions from these two join tests aligns with the PR objective: the new TopHits aggregation pushdown for join with max=n allows these queries to be optimized regardless of the global pushdown configuration.

Also applies to: 100-107


33-33: No changes needed. The implementation is correct. Setting GlobalPushdownConfig.enabled = false in init() is the intended configuration for CalciteExplainIT. The 93+ tests in this class are designed to work without global pushdown enabled; inherited tests that specifically require pushdown use enabledOnlyWhenPushdownIsEnabled() and will skip appropriately when pushdown is disabled.

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_keepempty_true_not_pushed.yaml (1)

6-6: Test expectation updated correctly for both logical and physical plans.

The dedup window simplification is consistently applied in both the logical plan (line 6) and physical plan (line 11), removing explicit ordering from the ROW_NUMBER() window.

Also applies to: 11-11

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_output.yaml (1)

6-6: Test expectation updated correctly for no-pushdown scenario.

The dedup window simplification is applied consistently in both pushdown and no-pushdown scenarios, as evidenced by this file in the calcite_no_pushdown directory.

Also applies to: 19-19

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_dedup_push.yaml (1)

6-6: Test expectation updated correctly.

Final verification: dedup window simplification is consistently applied across all test expected output files in this PR, affecting both logical and physical plans in both pushdown and no-pushdown scenarios.

Also applies to: 13-13

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (1)

8-27: LGTM with verification needed.

The plan correctly implements the JOIN_SUBSEARCH_MAXOUT optimization:

  • Line 8: System limit positioned after dedup to cap results at 50,000
  • Lines 11-12: Dedup window (_row_number_join_max_dedup_) with IS NOT NULL filter correctly ordered
  • Lines 23-26: Physical plan mirrors logical plan structure

However, the ROW_NUMBER() OVER (PARTITION BY $6) lacks ORDER BY, producing non-deterministic row numbering within each lastname partition.

Verify that the corresponding integration test handles potential result variation due to unordered partitioning, or document why deterministic ordering is not required for this join-with-max scenario.

integ-test/src/test/resources/expectedOutput/calcite/explain_dedup_text_type_no_push.yaml (1)

6-12: Consistent with dedup pattern changes.

The plan shows the same ORDER BY removal from ROW_NUMBER() OVER (PARTITION BY $11) as other dedup files. The metadata fields (_id, _index, _score, etc.) are appropriately retained in the LogicalProject since this is a no-pushdown scenario.

integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_join_with_criteria_max_option.yaml (1)

8-26: New test expectation correctly implements JOIN_SUBSEARCH_MAXOUT optimization.

This new test file correctly documents the expected plan for inner joins with the max=n option:

  • Line 8: JOIN_SUBSEARCH_MAXOUT limit caps the right side at 50,000 rows
  • Lines 11-12: Row number dedup window with IS NOT NULL filter on the partition key
  • Lines 17-26: Physical plan uses EnumerableMergeJoin with appropriate sorting

The right-side subsearch dedup pattern (_row_number_join_max_dedup_) is correctly structured to support the TopHits aggregation pushdown mentioned in the PR objectives.

Note: The ROW_NUMBER() OVER (PARTITION BY $0) lacks ORDER BY (line 11), consistent with other files in this PR, but verify test determinism as noted in other files.

integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_criteria_max_option.yaml (1)

8-24: Aggregation pushdown correctly implements join max=n optimization.

The physical plan successfully pushes down the join subsearch with max=n to a TopHits aggregation:

  • Line 8-12: Logical plan uses JOIN_SUBSEARCH_MAXOUT with row-number dedup (consistent with no-pushdown variant)
  • Line 24: Physical plan converts this to composite_buckets aggregation with top_hits of size=1

The aggregation-based approach ("size":1 in top_hits) is semantically equivalent to the window function approach (ROW_NUMBER() <= 1), and allows OpenSearch to execute the limiting operation natively rather than in-memory.

This aligns with the PR objective to "push down join with max=n option to TopHits aggregation."
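
The claimed equivalence can be sketched in plain Java, with no OpenSearch or Calcite involved. The names (`MaxPerKey`, `byRowNumber`, `byTopHits`) are hypothetical, and rows are modeled as lists whose first element is the join key. One method numbers rows per key and keeps row numbers up to n, the other collects at most n hits per key bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two strategies, for illustration only.
final class MaxPerKey {
  /** Window-style: number rows within each key partition and keep row numbers <= n. */
  static List<List<String>> byRowNumber(List<List<String>> rows, int n) {
    Map<String, Integer> rowNumbers = new HashMap<>();
    List<List<String>> kept = new ArrayList<>();
    for (List<String> row : rows) {
      int rowNumber = rowNumbers.merge(row.get(0), 1, Integer::sum);
      if (rowNumber <= n) {
        kept.add(row);
      }
    }
    return kept;
  }

  /** Bucket-style: one bucket per key, each bucket holding at most n hits. */
  static Map<String, List<List<String>>> byTopHits(List<List<String>> rows, int n) {
    Map<String, List<List<String>>> buckets = new LinkedHashMap<>();
    for (List<String> row : rows) {
      List<List<String>> hits = buckets.computeIfAbsent(row.get(0), k -> new ArrayList<>());
      if (hits.size() < n) {
        hits.add(row);
      }
    }
    return buckets;
  }
}
```

Both methods keep the same number of rows per key. Which specific rows survive depends on encounter order here, which mirrors the fact that the window has no ORDER BY.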

integ-test/src/test/resources/expectedOutput/calcite/explain_join_with_fields_max_option.yaml (1)

8-20: Join optimization successfully converts SortMergeJoin to HashJoin.

This file demonstrates the join side reordering optimization mentioned in the PR objectives:

  • Line 18: Uses EnumerableHashJoin instead of EnumerableMergeJoin
  • Line 19: The side with max=n (originally right) is now left, with aggregation pushdown
  • Line 20: The other side (originally left) is now right, with a simple scan

This reordering allows the use of a HashJoin, which can be more efficient than SortMergeJoin when one side is small (limited to 50,000 rows by the JOIN_SUBSEARCH_MAXOUT).

The transformation aligns with the PR objective: "For inner joins, a SortMergeJoin may be converted to a HashJoin by reordering join sides."

integ-test/src/test/resources/expectedOutput/calcite/explain_complex_sort_expr_pushdown_for_smj_w_max_option.yaml (1)

8-21: Left join correctly preserves join type while optimizing right side.

This file demonstrates proper handling of non-inner joins with the max=n optimization:

  • Line 5: joinType=[left] - join type is preserved (cannot reorder sides for left join)
  • Line 18: Uses EnumerableMergeJoin rather than HashJoin (appropriate for left joins)
  • Line 20: Left side pushes down complex REX_EXTRACT expression as OpenSearch script
  • Line 21: Right side with max=n pushed down as composite_buckets aggregation with top_hits

This aligns with the PR objective: "For non-inner joins, the right side will be fully pushed down to DSL instead of executing the WindowFunction in memory."

The plan correctly optimizes the right side while maintaining left join semantics.

ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLDedupTest.java (1)

28-48: ROW_NUMBER() without ORDER BY in dedup is intentional for this operation's semantics.

The test correctly uses ROW_NUMBER() OVER (PARTITION BY ...) without ORDER BY, which is the intended behavior for the dedup operation—selecting any valid row from each partition rather than a specific ordered one.

Unit tests with Calcite's SCOTT_WITH_TEMPORAL schema execute deterministically against fixed, immutable data. The verifyResult() calls validate that execution produces consistent results with this schema. The Spark SQL generation also correctly mirrors this design choice across all dedup test methods.

No changes needed.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)

1315-1316: Good placement of subsearch limit after dedup.

The addSysLimitForJoinSubsearch call is correctly placed after dedup processing and before the join operation, ensuring the right-side subsearch is both deduplicated and limited before joining. This aligns with the PR's objective to enable pushdown of join with max=n option.
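
To see why the order of the two steps matters, here is a small plain-Java sketch. The class `DedupThenLimit` and its methods are hypothetical and model rows by their join key only. It compares applying the cap after dedup with applying it before.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of the two orderings, for illustration only.
final class DedupThenLimit {
  /** Keeps the first row seen for each key (stand-in for dedup with max=1). */
  static List<String> dedup(List<String> keys) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String key : keys) {
      if (seen.add(key)) {
        out.add(key);
      }
    }
    return out;
  }

  /** Keeps at most n rows (stand-in for the subsearch system limit). */
  static List<String> limit(List<String> keys, int n) {
    return keys.stream().limit(n).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> right = List.of("a", "a", "a", "b", "c");
    System.out.println(limit(dedup(right), 3)); // dedup first, then cap: [a, b, c]
    System.out.println(dedup(limit(right, 3))); // cap first, then dedup: [a]
  }
}
```

With the cap applied last, the limit bounds the number of rows actually joined against. With the cap applied first, duplicates can use up the budget before dedup runs.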

@LantaoJin
Member Author

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)

1323-1374: Address the gap between documentation and SEMI/ANTI join implementation.

The documentation allows max option with semi and anti join types, but the code returns early for SEMI/ANTI joins (lines 1328-1332) before processing the max option and applying dedup/limit optimizations. This creates an inconsistency: users can specify max with SEMI/ANTI joins per the documented syntax, but it will be silently ignored.

Either remove the early return to enable max support for SEMI/ANTI joins, or add a validation error when max is specified with SEMI/ANTI, and update documentation to clarify the limitation.

@coderabbitai the max option has no effect for semi and anti join types, because semi and anti joins only filter the left-side records by whether a match exists on the right side. The join results are therefore exactly the same whether max=1, max=2, or max=∞.

@coderabbitai
Contributor

coderabbitai bot commented Dec 11, 2025

@LantaoJin, thank you for the clarification! You're absolutely right. Since SEMI and ANTI joins are existence-based operations that only check whether matching rows exist (rather than joining all matching rows), the max option inherently has no semantic impact on the result set. The early return for SEMI/ANTI joins at lines 1328-1332 is therefore correct and appropriate.

The behavior is working as intended—even if users specify max with SEMI/ANTI joins, it's silently ignored because it doesn't affect the outcome. No changes are needed.
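
The argument can be checked with a small plain-Java sketch. The class `SemiAntiJoinMax` and its methods are hypothetical and model rows by their join key only. Capping the right side at max rows per key never removes a key entirely (for max of at least 1), so existence checks are unaffected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of semi/anti joins over keys, for illustration only.
final class SemiAntiJoinMax {
  /** Keeps at most max rows per key on the right side (stand-in for max=n). */
  static List<String> capPerKey(List<String> rightKeys, int max) {
    Map<String, Integer> counts = new HashMap<>();
    List<String> kept = new ArrayList<>();
    for (String key : rightKeys) {
      if (counts.merge(key, 1, Integer::sum) <= max) {
        kept.add(key);
      }
    }
    return kept;
  }

  /** Semi join: left rows that have at least one match on the right. */
  static List<String> semiJoin(List<String> left, List<String> right) {
    Set<String> present = new HashSet<>(right);
    return left.stream().filter(present::contains).collect(Collectors.toList());
  }

  /** Anti join: left rows that have no match on the right. */
  static List<String> antiJoin(List<String> left, List<String> right) {
    Set<String> present = new HashSet<>(right);
    return left.stream().filter(k -> !present.contains(k)).collect(Collectors.toList());
  }
}
```

For any max of at least 1, `semiJoin(left, capPerKey(right, max))` returns the same rows as `semiJoin(left, right)`, and the same holds for `antiJoin`.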


✏️ Learnings added
Learnt from: LantaoJin
URL: https://github.com/opensearch-project/sql/pull/4929

Timestamp: 2025-12-11T05:27:39.831Z
Learning: In opensearch-project/sql, for SEMI and ANTI join types in CalciteRelNodeVisitor.java, the `max` option has no effect because these join types only use the left side to filter records based on the existence of matches in the right side. The join results are identical regardless of max value (max=1, max=2, or max=∞). The early return for SEMI/ANTI joins before processing the `max` option is intentional and correct behavior.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Follow existing patterns in `CalciteRelNodeVisitor` and `CalciteRexNodeVisitor` for Calcite integration

Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Applies to **/*.java : Document Calcite-specific workarounds in code

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1)

31-48: Avoid leaking GlobalPushdownConfig.enabled across the IT JVM (restore previous value).
Setting a static global in init() (Line 33) without resetting can make other integration tests order-dependent/flaky.

Suggested direction: capture the previous value and restore it in the appropriate teardown hook for this test framework (e.g., @After / tearDown / base cleanup override).

Proposed patch (adjust method names/annotations to match the actual test lifecycle in ExplainIT):

 public class CalciteExplainIT extends ExplainIT {
+  private boolean previousGlobalPushdownEnabled;
+
   @Override
   public void init() throws Exception {
-    GlobalPushdownConfig.enabled = false;
+    previousGlobalPushdownEnabled = GlobalPushdownConfig.enabled;
+    GlobalPushdownConfig.enabled = false;
     super.init();
     enableCalcite();
     setQueryBucketSize(1000);
     loadIndex(Index.STRINGS);
     loadIndex(Index.BANK_WITH_NULL_VALUES);
     loadIndex(Index.NESTED_SIMPLE);
     loadIndex(Index.TIME_TEST_DATA);
     loadIndex(Index.TIME_TEST_DATA2);
     loadIndex(Index.EVENTS);
     loadIndex(Index.LOGS);
     loadIndex(Index.WORKER);
     loadIndex(Index.WORK_INFORMATION);
     loadIndex(Index.WEBLOG);
     loadIndex(Index.DATA_TYPE_ALIAS);
   }
+
+  @Override
+  public void cleanup() throws Exception {
+    try {
+      super.cleanup();
+    } finally {
+      GlobalPushdownConfig.enabled = previousGlobalPushdownEnabled;
+    }
+  }
 }

(If cleanup() isn’t the right hook, wire this into the correct one.) As per coding guidelines, consider documenting why this Calcite suite must force-disable global pushdown.

🧹 Nitpick comments (2)
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (2)

1315-1316: Consider adding an explanatory comment.

The placement of the system limit after dedup is strategically important for enabling pushdown of the right-side subsearch to DSL (avoiding in-memory WindowFunction execution). Consider adding a brief inline comment explaining this sequencing, similar to:

// Apply system limit after dedup to enable DSL pushdown for the right-side subsearch
addSysLimitForJoinSubsearch(context);

This would improve code maintainability for future developers.


1383-1393: Add JavaDoc for clarity and maintainability.

The method implementation is correct, but it lacks documentation. While not strictly required for private methods, adding JavaDoc would improve maintainability for a method implementing critical join pushdown logic:

/**
 * Applies JOIN_SUBSEARCH_MAXOUT system limit to the right side of a join when configured.
 * This enables pushing the right-side subsearch with max=n down to DSL as a TopHits aggregation,
 * avoiding in-memory WindowFunction execution.
 *
 * @param context CalcitePlanContext containing the RelBuilder and system limits configuration
 */
private static void addSysLimitForJoinSubsearch(CalcitePlanContext context) {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 303952d and b76ac35.

📒 Files selected for processing (2)
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (3 hunks)
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

**/*.java: Use PascalCase for class names (e.g., QueryExecutor)
Use camelCase for method and variable names (e.g., executeQuery)
Use UPPER_SNAKE_CASE for constants (e.g., MAX_RETRY_COUNT)
Keep methods under 20 lines with single responsibility
All public classes and methods must have proper JavaDoc
Use specific exception types with meaningful messages for error handling
Prefer Optional<T> for nullable returns in Java
Avoid unnecessary object creation in loops
Use StringBuilder for string concatenation in loops
Validate all user inputs, especially queries
Sanitize data before logging to prevent injection attacks
Use try-with-resources for proper resource cleanup in Java
Maintain Java 11 compatibility when possible for OpenSearch 2.x
Document Calcite-specific workarounds in code

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java

⚙️ CodeRabbit configuration file

**/*.java: - Verify Java naming conventions (PascalCase for classes, camelCase for methods/variables)

  • Check for proper JavaDoc on public classes and methods
  • Flag redundant comments that restate obvious code
  • Ensure methods are under 20 lines with single responsibility
  • Verify proper error handling with specific exception types
  • Check for Optional usage instead of null returns
  • Validate proper use of try-with-resources for resource management

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
integ-test/**/*IT.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

End-to-end scenarios need integration tests in integ-test/ module

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java

⚙️ CodeRabbit configuration file

integ-test/**/*IT.java: - Verify integration tests are in correct module (integ-test/)

  • Check tests can be run with ./gradlew :integ-test:integTest
  • Ensure proper test data setup and teardown
  • Validate end-to-end scenario coverage

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
**/*IT.java

📄 CodeRabbit inference engine (.rules/REVIEW_GUIDELINES.md)

Name integration tests with *IT.java suffix in OpenSearch SQL

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
**/test/**/*.java

⚙️ CodeRabbit configuration file

**/test/**/*.java: - Verify test coverage for new business logic

  • Check test naming follows conventions (*Test.java for unit, *IT.java for integration)
  • Ensure tests are independent and don't rely on execution order
  • Validate meaningful test data that reflects real-world scenarios
  • Check for proper cleanup of test resources

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
**/calcite/**/*.java

⚙️ CodeRabbit configuration file

**/calcite/**/*.java: - Follow existing patterns in CalciteRelNodeVisitor and CalciteRexNodeVisitor

  • Verify SQL generation and optimization paths
  • Document any Calcite-specific workarounds
  • Test compatibility with Calcite version constraints

Files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
🧠 Learnings (5)
📓 Common learnings
Learnt from: LantaoJin
Repo: opensearch-project/sql PR: 0
File: :0-0
Timestamp: 2025-12-11T05:27:39.831Z
Learning: In opensearch-project/sql, for SEMI and ANTI join types in CalciteRelNodeVisitor.java, the `max` option has no effect because these join types only use the left side to filter records based on the existence of matches in the right side. The join results are identical regardless of max value (max=1, max=2, or max=∞). The early return for SEMI/ANTI joins before processing the `max` option is intentional and correct behavior.
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Follow existing patterns in `CalciteRelNodeVisitor` and `CalciteRexNodeVisitor` for Calcite integration
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Test SQL generation and optimization paths for Calcite integration changes

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
📚 Learning: 2025-12-11T05:27:39.831Z
Learnt from: LantaoJin
Repo: opensearch-project/sql PR: 0
File: :0-0
Timestamp: 2025-12-11T05:27:39.831Z
Learning: In opensearch-project/sql, for SEMI and ANTI join types in CalciteRelNodeVisitor.java, the `max` option has no effect because these join types only use the left side to filter records based on the existence of matches in the right side. The join results are identical regardless of max value (max=1, max=2, or max=∞). The early return for SEMI/ANTI joins before processing the `max` option is intentional and correct behavior.

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Applies to **/*.java : Document Calcite-specific workarounds in code

Applied to files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: Follow existing patterns in `CalciteRelNodeVisitor` and `CalciteRexNodeVisitor` for Calcite integration

Applied to files:

  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
🧬 Code graph analysis (2)
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1)
integ-test/src/test/java/org/opensearch/sql/ppl/PPLIntegTestCase.java (1)
  • GlobalPushdownConfig (307-310)
core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (1)
core/src/main/java/org/opensearch/sql/calcite/plan/LogicalSystemLimit.java (1)
  • LogicalSystemLimit (25-90)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: security-it-linux (21)
  • GitHub Check: build-linux (21, doc)
  • GitHub Check: security-it-linux (25)
  • GitHub Check: build-linux (25, integration)
  • GitHub Check: build-linux (25, doc)
  • GitHub Check: build-linux (21, unit)
  • GitHub Check: build-linux (25, unit)
  • GitHub Check: bwc-tests-full-restart (25)
  • GitHub Check: build-linux (21, integration)
  • GitHub Check: bwc-tests-full-restart (21)
  • GitHub Check: security-it-windows-macos (windows-latest, 21)
  • GitHub Check: security-it-windows-macos (windows-latest, 25)
  • GitHub Check: security-it-windows-macos (macos-14, 21)
  • GitHub Check: security-it-windows-macos (macos-14, 25)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (macos-14, 21, unit)
  • GitHub Check: build-windows-macos (macos-14, 25, integration)
  • GitHub Check: build-windows-macos (macos-14, 21, doc)
  • GitHub Check: build-windows-macos (macos-14, 25, doc)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (macos-14, 25, unit)
  • GitHub Check: build-windows-macos (macos-14, 21, integration)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, unit)
  • GitHub Check: test-sql-cli-integration (21)
  • GitHub Check: CodeQL-Scan (java)
  • GitHub Check: WhiteSource Security Check
🔇 Additional comments (3)
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java (1)

87-107: Restore GlobalPushdownConfig.enabled after test initialization to prevent test isolation issues.

GlobalPushdownConfig.enabled = false is set in init() (line 33) but never restored. Unlike CalciteNoPushdownIT, which properly saves and restores this state using @BeforeClass/@AfterClass, CalciteExplainIT leaves the global configuration disabled for all subsequent tests. Add an @AfterClass method to restore the original state, similar to CalciteNoPushdownIT (lines 111–120).

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (2)

1328-1332: Early return for SEMI/ANTI joins is correct.

The early return for SEMI and ANTI joins before processing the max option and applying the system limit is intentional and correct. These join types only use the left side to filter records based on the existence of matches in the right side, so varying max values or applying subsearch limits on the right side would not change the results.

Based on learnings, this behavior is documented and verified as correct.


1373-1374: Consistent system limit application across join paths.

The placement of addSysLimitForJoinSubsearch after dedup and before the join operation is correct and consistent with the join-with-field-list path at line 1316. This ensures uniform behavior for both join grammar variants.

Signed-off-by: Lantao Jin <[email protected]>
@qianheng-aws qianheng-aws merged commit 4bf5c9c into opensearch-project:main Dec 12, 2025
39 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4929-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 4bf5c9c776e7f8cb11714d68fbc2c9163475ef23
# Push it to GitHub
git push --set-upstream origin backport/backport-4929-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4929-to-2.19-dev.

EnumerableCalc(expr#0..18=[{inputs}], expr#19=[IS NOT NULL($t6)], proj#0..18=[{exprs}], $condition=[$t19])
EnumerableLimit(fetch=[50000])
CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_bank]])
EnumerableLimit(fetch=[50000])
Collaborator

The behavior of the system limit has changed: it used to limit the rows read from the source, and now it limits the results after top hits.

So if we cannot push down the window, it will scan all rows from the source. @LantaoJin

Member Author

I corrected the behavior because plugins.ppl.join.subsearch_maxout is defined as:

The size configures the maximum of rows from subsearch to join against.
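
The difference between the two orderings discussed here can be sketched as follows. This is a hypothetical illustration that models subsearch rows by join key only; the names are not from the plugin, and the real planner works on Calcite relational nodes rather than lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; rows are represented by their join key only.
final class SubsearchLimitOrderSketch {
  private SubsearchLimitOrderSketch() {}

  // Keep at most `max` rows per key (the dedup step implied by max=n).
  static List<Integer> dedup(List<Integer> rows, int max) {
    Map<Integer, Integer> seen = new HashMap<>();
    List<Integer> kept = new ArrayList<>();
    for (Integer key : rows) {
      if (seen.merge(key, 1, Integer::sum) <= max) {
        kept.add(key);
      }
    }
    return kept;
  }

  // Keep the first `n` rows (the system limit).
  static List<Integer> limit(List<Integer> rows, int n) {
    return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
  }

  // Earlier behavior as described in this thread: limit the source, then dedup.
  static List<Integer> limitThenDedup(List<Integer> rows, int max, int sysLimit) {
    return dedup(limit(rows, sysLimit), max);
  }

  // Behavior after this change as described in this thread: dedup, then limit.
  static List<Integer> dedupThenLimit(List<Integer> rows, int max, int sysLimit) {
    return limit(dedup(rows, max), sysLimit);
  }
}
```

With source keys `1, 1, 1, 1, 2, 3`, `max=1`, and a limit of 2, limiting first leaves a single row to join against, while limiting last leaves two. The second ordering matches the quoted definition (the limit counts rows joined against), at the cost that the dedup step sees every source row when it cannot be pushed down.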

Collaborator

Can we ensure the window will always be pushed down? Otherwise, won't this be a regression compared to the previous behavior?

Member Author

No. The window is not pushed down if (1) the join keys contain a text (not keyword) field, or (2) the join keys contain an expression (I am working on supporting #4789, which could resolve this).

LantaoJin added a commit to LantaoJin/search-plugins-sql that referenced this pull request Dec 12, 2025
…project#4929)

* Pushdown join with max=n option to TopHits aggregation

Signed-off-by: Lantao Jin <[email protected]>

* Fix UT

Signed-off-by: Lantao Jin <[email protected]>

* Stabilize the dashboard IT

Signed-off-by: Lantao Jin <[email protected]>

* address comment

Signed-off-by: Lantao Jin <[email protected]>

---------

Signed-off-by: Lantao Jin <[email protected]>
(cherry picked from commit 4bf5c9c)
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Dec 12, 2025
LantaoJin added a commit that referenced this pull request Dec 12, 2025
* Pushdown join with max=n option to TopHits aggregation



* Fix UT



* Stabilize the dashboard IT



* address comment



---------


(cherry picked from commit 4bf5c9c)

Signed-off-by: Lantao Jin <[email protected]>
Labels

backport 2.19-dev
backport-failed
backport-manually: Filed a PR to backport manually.
enhancement: New feature or request

Development

Successfully merging this pull request may close these issues.

[FEATURE] Speedup Join with option max=n by converting to TopHits Aggregation

3 participants