Skip to content

feat(core): disable enable_ident_normalization for case-sensitive identifier by default#1305

Merged
douenergy merged 3 commits intomainfrom
feat/disable-ident-normalize
Sep 3, 2025
Merged

feat(core): disable enable_ident_normalization for case-sensitive identifier by default#1305
douenergy merged 3 commits intomainfrom
feat/disable-ident-normalize

Conversation

@goldmedal
Copy link
Copy Markdown
Contributor

@goldmedal goldmedal commented Sep 2, 2025

Description

This PR changes the default behavior of Wren engine for case-sensitive identifier. I disable datafusion.sql_parser.enable_ident_normalization by default.

/// When set to true, SQL parser will normalize ident (convert ident to lowercase when not quoted)
pub enable_ident_normalization: bool, default = true

Now, all the identifier won't be normalized. It means all the identifier will be case-sensitive. It match the logic for finding the Wren model but the user won't need to add the quote anymore.

Summary by CodeRabbit

  • New Features
    • Query tables and columns with non-ASCII (e.g., Chinese) identifiers without forced lowercasing.
  • Bug Fixes
    • Improved handling of quoted/unquoted and mixed-case identifiers to prevent misinterpretation.
    • Clearer error when a table name is ambiguous across models.
  • Tests
    • Added end-to-end coverage for Unicode identifiers, including caching behavior.
    • Expanded tests for ambiguous naming scenarios and case sensitivity.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Adds a Unicode-named table to Postgres test setup, introduces an async test querying that table via the v3 API, disables identifier normalization in DataFusion session config, and updates MDL transform tests to use unquoted/Unicode identifiers and cover ambiguous table names. No public APIs changed.

Changes

Cohort / File(s) Summary
Postgres test setup
ibis-server/tests/routers/v3/connector/postgres/conftest.py
Within the existing setup transaction, creates and populates a table named 中文表 with columns 欄位1, 欄位2 and sample rows.
Postgres API query test
ibis-server/tests/routers/v3/connector/postgres/test_query.py
Adds async test_query_unicode_table that builds a Unicode manifest, queries SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1 via the v3 API with caching, asserts 200, data [1, 2], and int32 dtypes.
DataFusion session config
wren-core/core/src/mdl/context.rs
Sets datafusion.sql_parser.enable_ident_normalization = "false" in create_ctx_with_mdl; no signature or control-flow changes.
MDL transform tests and docs
wren-core/core/src/mdl/mod.rs
Removes a doc comment; adjusts tests to unquoted identifiers (including non-ASCII), updates expectations accordingly; adds tests for ambiguous table names (customer/Customer) including an error case.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test Client
  participant API as v3 Query API
  participant M as MDL Transform
  participant DF as DataFusion Session
  participant PG as Postgres

  Note over DF: Config: enable_ident_normalization = false

  T->>API: POST /query (SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1)
  API->>M: Transform SQL using manifest (Unicode identifiers)
  M-->>API: Transformed SQL (identifiers preserved)
  API->>DF: Execute SQL (session with config)
  DF->>PG: Run query
  PG-->>DF: Rows [[1,2], ...]
  DF-->>API: Result + schema (int32,int32)
  API-->>T: 200 OK, data, dtypes
  Note right of API: Caching enabled per request
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • douenergy

Poem

I nibbled on bytes by the database glade,
Hopped through tables where 中文表 was laid.
欄位1, 欄位2, what a tasty view—
No lowercasing winds, just Unicode true.
Cache a carrot, fetch a row—what delight!
Queries bloom bright in the moonlit byte. 🥕✨

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/disable-ident-normalize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added core ibis rust Pull requests that update Rust code python Pull requests that update Python code labels Sep 2, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-core/core/src/mdl/context.rs (1)

64-67: Disabling identifier normalization: good default; consider making it overridable.

Hardcoding false matches the PR goal. As a small enhancement, allow a session property (e.g., x-wren-enable-ident-normalization) to toggle this for engines that prefer case-insensitive behavior.

Apply minimal change:

-        .set(
-            "datafusion.sql_parser.enable_ident_normalization",
-            &ScalarValue::Utf8(Some("false".to_string())),
-        )
+        .set(
+            "datafusion.sql_parser.enable_ident_normalization",
+            &ScalarValue::Utf8(
+                properties
+                    .get("x-wren-enable-ident-normalization")
+                    .cloned() // Option<Option<String>>
+                    .unwrap_or(Some("false".to_string())),
+            ),
+        )
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1185-1225: Good Unicode/caching coverage; consider asserting cache headers.

Test validates Unicode identifiers and dtypes. Since cache is enabled, optionally assert X-Cache-Hit on a second identical request to exercise cache keys with non-ASCII SQL.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d560ac3 and 3940d94.

📒 Files selected for processing (5)
  • ibis-server/tests/routers/v3/connector/postgres/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_query.py (1 hunks)
  • wren-core/core/src/mdl/context.rs (1 hunks)
  • wren-core/core/src/mdl/function.rs (1 hunks)
  • wren-core/core/src/mdl/mod.rs (11 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-05T02:27:29.829Z
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.

Applied to files:

  • wren-core/core/src/mdl/mod.rs
🧬 Code graph analysis (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)
ibis-server/tests/conftest.py (1)
  • client (18-23)
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
  • connection_info (51-58)
wren-core/core/src/mdl/mod.rs (3)
wren-core/core/src/mdl/context.rs (1)
  • new (323-345)
wren-core-py/src/context.rs (2)
  • new (81-203)
  • default (62-70)
wren-core/core/src/logical_plan/analyze/model_anlayze.rs (2)
  • new (74-84)
  • analyze (47-66)
🪛 GitHub Actions: ibis CI
ibis-server/tests/routers/v3/connector/postgres/test_query.py

[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/test_query.py).

ibis-server/tests/routers/v3/connector/postgres/conftest.py

[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/conftest.py).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: cargo test (macos)
  • GitHub Check: cargo check
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: cargo test (win64)
  • GitHub Check: test
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (13)
wren-core/core/src/mdl/function.rs (1)

390-399: Rename in test is correct and consistent.

date_diff → date_test change is applied consistently in SQL and expected plan; no logic impact.

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)

43-45: Formatting applied — Ruff format has been run on ibis-server/tests/routers/v3/connector/postgres/conftest.py; CI should now pass.

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1-1225: Ruff formatting applied: The file has been reformatted with ruff format and now passes the CI check.

wren-core/core/src/mdl/mod.rs (10)

729-742: Tests updated to unquoted identifiers: LGTM.

Switching to CTest.STest.Customer (unquoted) aligns with disabled normalization; snapshots remain stable.


779-783: UDF call test uses unquoted identifiers: LGTM.

Matches new parser behavior and preserves snapshot intent.


822-839: Unicode column expressions now unquoted: LGTM.

These exercise non-ASCII identifiers under disabled normalization; good coverage.


904-909: Keep using unquoted Unicode in expressions.

Consistent with the new default; no issues.


967-967: Redundant approval for similar change.

No new concerns beyond previous comments.


1002-1002: Redundant approval for similar change.

No new concerns beyond previous comments.


1052-1052: Redundant approval for similar change.

No new concerns beyond previous comments.


1336-1337: Cast expression unquoted: LGTM.

cast(出道時間 as timestamptz) verifies parser respects Unicode identifiers without quotes.


2597-2619: RLAC with Unicode property names: solid test.

Rule uses non-ASCII session property; this aligns with prior learnings on RLAC literal validation to prevent injection. Good signal.


3119-3180: Ambiguous table name test is excellent.

Covers case-sensitive resolution for customer vs Customer and the not-found error on CUSTOMER; assertions look correct with the new normalization setting.

@goldmedal goldmedal requested a review from douenergy September 2, 2025 09:39
@goldmedal goldmedal force-pushed the feat/disable-ident-normalize branch from 94be388 to 9d6ec4b Compare September 3, 2025 02:35
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1203-1210: Avoid name shadowing the manifest_str fixture

Use a distinct local name to prevent confusion with the module-scoped fixture.

-    manifest_str = base64.b64encode(orjson.dumps(manifest)).decode("utf-8")
+    unicode_manifest_b64 = base64.b64encode(orjson.dumps(manifest)).decode("utf-8")
@@
-            "manifestStr": manifest_str,
+            "manifestStr": unicode_manifest_b64,
wren-core/core/src/mdl/mod.rs (1)

3465-3526: Great addition: ambiguity behavior covered for model names differing only by case

Solid coverage for:

  • Resolving customer vs Customer
  • Erroring on CUSTOMER (no matching model)

Consider adding a join case to prove both models can be referenced simultaneously.

@@
     let sql = "select * from CUSTOMER";
     match transform_sql_with_ctx(
         &ctx,
         Arc::clone(&analyzed_mdl),
         &[],
         Arc::clone(&headers),
         sql,
     )
     .await
     {
         Ok(_) => {
             panic!("Expected error, but got SQL");
         }
         Err(e) => assert_snapshot!(
             e.to_string(),
             @"Error during planning: table 'wren.test.CUSTOMER' not found"
         ),
     }
 
+    // Additional: both models in one query to assert simultaneous resolution
+    let sql = r#"select customer.c_name, "Customer"."C_name" from customer join "Customer" on customer.c_name = "Customer".c_name"#;
+    let _both = transform_sql_with_ctx(
+        &ctx,
+        Arc::clone(&analyzed_mdl),
+        &[],
+        Arc::clone(&headers),
+        sql,
+    ).await?;
+
     Ok(())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 94be388 and 9d6ec4b.

📒 Files selected for processing (4)
  • ibis-server/tests/routers/v3/connector/postgres/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_query.py (1 hunks)
  • wren-core/core/src/mdl/context.rs (1 hunks)
  • wren-core/core/src/mdl/mod.rs (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • wren-core/core/src/mdl/context.rs
  • ibis-server/tests/routers/v3/connector/postgres/conftest.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-05T02:27:29.829Z
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.

Applied to files:

  • wren-core/core/src/mdl/mod.rs
🧬 Code graph analysis (2)
wren-core/core/src/mdl/mod.rs (1)
wren-core/core/src/logical_plan/analyze/plan.rs (6)
  • new (72-85)
  • new (112-127)
  • new (778-780)
  • new (895-992)
  • new (1046-1093)
  • new (1152-1154)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (3)
ibis-server/tests/conftest.py (1)
  • client (18-23)
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
  • connection_info (53-60)
ibis-server/tests/routers/v3/connector/postgres/test_validate.py (1)
  • manifest_str (28-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: test
  • GitHub Check: cargo test (macos)
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: cargo test (win64)
  • GitHub Check: cargo check
  • GitHub Check: ci
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (10)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1186-1221: Unicode identifier query path works — LGTM

Good end-to-end coverage for non-ASCII model/column names with DataFusion fallback disabled and cache path exercised.

wren-core/core/src/mdl/mod.rs (9)

737-749: Unquoted qualified identifiers test — LGTM

This validates case-sensitive resolution for CTest.STest.Customer under disabled normalization.


787-791: Remote UDF call with case-sensitive column — LGTM

Exercising unquoted Custkey on Customer matches the new default.


831-846: Unquoted Unicode identifiers in expressions — LGTM

Switching to unquoted 名字/組別/訂閱數 aligns with disabled normalization and keeps snapshots stable.


911-917: Expression parsing for simple column references — LGTM

Unquoted 名字 in expressions is appropriate with the new parser setting.


975-988: Hidden-column scenario with unquoted Unicode — LGTM

Covers CLAC + hidden source column with unquoted derived field.


1061-1071: IN-subquery with unquoted Unicode — LGTM

Confirms predicates and subqueries behave with case-sensitive identifiers.


1361-1367: Cast to timestamptz using unquoted Unicode source — LGTM

Matches the new identifier semantics.


2624-2644: RLAC with Unicode session property and unquoted SQL — LGTM

Keeps property handling intact while relying on case-sensitive identifiers.


267-274: Align expression parsing with disabled identifier normalization

sql_to_expr uses DFParser with default options, which may still normalize ASCII identifiers. That can silently break inference for expressions like C_name || C_name in manifests. Ensure the same “enable_ident_normalization = false” applies here (e.g., via parser options/APIs available in your DataFusion version).

Would you like me to draft a patch that wires parser options here once we confirm the exact DataFusion API in this repo?

@douenergy
Copy link
Copy Markdown
Contributor

Thanks @goldmedal

@douenergy douenergy merged commit 131d82d into main Sep 3, 2025
18 checks passed
@goldmedal goldmedal deleted the feat/disable-ident-normalize branch September 15, 2025 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core ibis python Pull requests that update Python code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants