Skip to content

fix(core): disable ReplaceDistinctWithAggregate rule#1294

Merged
douenergy merged 1 commit intoCanner:mainfrom
goldmedal:fix/disable-distinct-to-agg
Aug 22, 2025
Merged

fix(core): disable ReplaceDistinctWithAggregate rule#1294
douenergy merged 1 commit intoCanner:mainfrom
goldmedal:fix/disable-distinct-to-agg

Conversation

@goldmedal
Copy link
Copy Markdown
Contributor

@goldmedal goldmedal commented Aug 22, 2025

DataFusion unparser has some issues when unparsing the derived table generated by ReplaceDistinctWithAggregate. Disable it to avoid generating invalid plans.

Summary by CodeRabbit

  • New Features

    • Enabled information schema support for easier metadata discovery.
    • Improved function discovery by separating parsing and execution contexts for more accurate listings.
  • Bug Fixes

    • Prevented DISTINCT from being rewritten to GROUP BY during SQL transformation, ensuring expected results and reducing unexpected derived tables.
    • Increased stability when switching between transformation modes.
  • Tests

    • Added test coverage to validate DISTINCT behavior and snapshot the resulting SQL transformation.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Aug 22, 2025

Walkthrough

Introduces a separate execution context in PySessionContext alongside the unparsing context, updates default and constructor wiring, and routes function discovery through the execution context. In core MDL, enables information_schema and disables the ReplaceDistinctWithAggregate rule for unparse optimization. Adds a snapshot test verifying DISTINCT is not rewritten to GROUP BY.

Changes

Cohort / File(s) Summary
Python binding: dual contexts
wren-core-py/src/context.rs
Adds exec_ctx: SessionContext; updates Default and new() to create/wire unparsing ctx and execution ctx separately; get_available_functions now reads from exec_ctx.
MDL context configuration
wren-core/core/src/mdl/context.rs
Enables DataFusion information_schema; removes ReplaceDistinctWithAggregate from unparse optimizer rule set; comments rationale.
MDL unparse test
wren-core/core/src/mdl/mod.rs
Adds test_disable_distinct_to_group_by verifying DISTINCT is not rewritten to GROUP BY via snapshot of transformed SQL.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor PyCaller as Python Caller
    participant PyCtx as PySessionContext::new
    participant MDLCtx as create_ctx_with_mdl(Mode::Unparse)
    participant Exec as LocalRuntime::SessionContext

    PyCaller->>PyCtx: new(mdl_base64: Option<String>)
    alt mdl_base64 is Some
        PyCtx->>MDLCtx: build unparsing ctx
        PyCtx->>Exec: build execution ctx
        PyCtx-->>PyCaller: PySessionContext{ ctx=unparser_ctx, exec_ctx=exec_ctx }
    else mdl_base64 is None
        PyCtx->>Exec: reuse provided ctx as exec_ctx
        PyCtx-->>PyCaller: PySessionContext{ ctx=ctx.clone(), exec_ctx=ctx }
    end

    PyCaller->>PySessionContext: get_available_functions()
    PySessionContext->>Exec: query registered functions
    Exec-->>PySessionContext: functions
    PySessionContext-->>PyCaller: functions
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

core, rust, python

Suggested reviewers

  • douenergy

Poem

I split my burrow: one for think, one for run,
Two tunnels humming—work and fun.
DISTINCT hops freely, no GROUP BY snare,
Info schema lanterns light the lair.
With whisker-twitch and tidy stacks,
I bound through contexts—no tracks overlap. 🐇✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added core rust Pull requests that update Rust code python Pull requests that update Python code labels Aug 22, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
wren-core/core/src/mdl/context.rs (1)

227-229: Improve disable comment with tracking link

To ensure future maintainability, add a reference to the upstream issue where this fragility is tracked so it’s easy to revert once fixed.

• File: wren-core/core/src/mdl/context.rs
Lines: 227–229

Suggested update:

-// Unparser has some issues for handling derived table generated by ReplaceDistinctWithAggregate rule
+// Unparser has issues with derived tables generated by ReplaceDistinctWithAggregate.
// See: https://github.com/apache/datafusion/issues/<issue_number> (re-enable when fixed)
 // Arc::new(ReplaceDistinctWithAggregate::new()),
wren-core/core/src/mdl/mod.rs (1)

1709-1743: New DISTINCT test covers the intended behavior; add a cheap guard against GROUP BY regressions

Snapshot is good. Add an explicit assertion to ensure GROUP BY isn't introduced and (optionally) validate the generated SQL still round-trips via DataFusion to catch unparser regressions early.

Apply within this test:

         let result = transform_sql_with_ctx(
             &ctx,
             Arc::clone(&analyzed_mdl),
             &[],
             Arc::new(HashMap::new()),
             sql,
         )
         .await?;
+        // Ensure DISTINCT wasn't rewritten into a GROUP BY
+        assert!(
+            !result.contains(" GROUP BY "),
+            "Unexpected GROUP BY rewrite for DISTINCT"
+        );
+        // Optional: sanity check the output SQL is still executable/round-trippable
+        // (helper is already available in this module)
+        assert_sql_valid_executable(&result).await?;
         assert_snapshot!(
             result,
             @"SELECT DISTINCT customer.c_custkey, customer.c_name FROM (SELECT customer.c_custkey, customer.c_name FROM (SELECT __source.c_custkey AS c_custkey, __source.c_name AS c_name FROM customer AS __source) AS customer) AS customer"
         );

Follow-up ideas (no change required now):

  • Add coverage for DISTINCT on a single column and for DISTINCT nested in a subquery.
wren-core-py/src/context.rs (1)

171-188: Creating distinct unparser_ctx and exec_ctx from the same base state is correct; minor readability tweak

Using create_ctx_with_mdl for both modes ensures consistent registrations (UDFs, config) via new_from_existing(state). Consider renaming the local variable ctx (the base context) to base_ctx for clarity in this constructor.

- let ctx = wren_core::SessionContext::new_with_config(config);
+ let base_ctx = wren_core::SessionContext::new_with_config(config);
...
- .block_on(create_ctx_with_mdl(
-     &ctx,
+ .block_on(create_ctx_with_mdl(
+     &base_ctx,
      Arc::clone(&analyzed_mdl),
      Arc::clone(&properties_ref),
      mdl::context::Mode::Unparse,
 ))
...
- .block_on(create_ctx_with_mdl(
-     &ctx,
+ .block_on(create_ctx_with_mdl(
+     &base_ctx,
      Arc::clone(&analyzed_mdl),
      Arc::clone(&properties_ref),
      mdl::context::Mode::LocalRuntime,
 ))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 92a1c7d and 0bba263.

📒 Files selected for processing (3)
  • wren-core-py/src/context.rs (6 hunks)
  • wren-core/core/src/mdl/context.rs (2 hunks)
  • wren-core/core/src/mdl/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
wren-core-py/src/context.rs (1)
wren-core/core/src/mdl/context.rs (2)
  • new (319-341)
  • create_ctx_with_mdl (45-107)
wren-core/core/src/mdl/mod.rs (1)
wren-core-py/src/context.rs (2)
  • new (81-203)
  • default (62-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: cargo test (win64)
  • GitHub Check: test
  • GitHub Check: cargo test (macos)
  • GitHub Check: cargo check
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: ci
🔇 Additional comments (5)
wren-core/core/src/mdl/context.rs (1)

69-71: Enable information_schema for all derived contexts — good alignment with Python function discovery

Turning on information_schema in create_ctx_with_mdl ensures every Mode (Unparse, LocalRuntime, PermissionAnalyze) exposes routines/views that PySessionContext enumerates. This matches the Python side’s query against information_schema.routines.

wren-core-py/src/context.rs (4)

49-53: exec_ctx separation from ctx (unparser) is a solid design choice

Clear split between transformation (ctx) and execution (exec_ctx) avoids optimizer rule leakage into unparser behavior and enables accurate function enumeration from the runtime context.


61-66: Default initializers keep both contexts usable out of the box

Initializing exec_ctx alongside ctx in Default is consistent and avoids None-guarding elsewhere. Good.


120-126: When no MDL is provided, reusing the same configured SessionContext for both ctx and exec_ctx is fine

Because the base ctx is created with SessionConfig::default().with_information_schema(true), get_available_functions will work in this path as well.


229-235: Rename typo in get_regietered_functions to get_registered_functions

To keep the codebase consistent and improve discoverability, update all three occurrences:

• wren-core-py/src/context.rs:98
Change

-   .block_on(Self::get_regietered_functions(&ctx))
+   .block_on(Self::get_registered_functions(&ctx))

• wren-core-py/src/context.rs:229
Change

-   .block_on(Self::get_regietered_functions(&self.exec_ctx))
+   .block_on(Self::get_registered_functions(&self.exec_ctx))

• wren-core-py/src/context.rs:320 (function definition)
Change

-   async fn get_regietered_functions(
+   async fn get_registered_functions(
     ctx: &wren_core::SessionContext,
 ) -> PyResult<Vec<RemoteFunctionDto>> { … }

All references to the old, misspelled name have been identified (rg found only these three instances), so applying this patch completes the rename.

@goldmedal goldmedal requested a review from douenergy August 22, 2025 08:46
@douenergy douenergy merged commit bdd52c2 into Canner:main Aug 22, 2025
14 checks passed
nhaluc1005 pushed a commit to nhaluc1005/text2sql-practice that referenced this pull request Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core python Pull requests that update Python code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants