feat(core): disable `enable_ident_normalization` for case-sensitive identifier by default by goldmedal · Pull Request #1305 · Canner/wren-engine

goldmedal · 2025-09-02T08:19:48Z

Description

This PR changes how the Wren engine handles case-sensitive identifiers by default: `datafusion.sql_parser.enable_ident_normalization` is now disabled by default.

/// When set to true, SQL parser will normalize ident (convert ident to lowercase when not quoted)
pub enable_ident_normalization: bool, default = true

With this change, identifiers are no longer normalized, so every identifier is case-sensitive. This matches the logic used to look up Wren models, and users no longer need to quote identifiers.
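To make the behavior change concrete, here is a minimal Python sketch of what identifier normalization does. It is not DataFusion's or Wren's code: `normalize_ident` and its parameters are hypothetical names, and the only behavior assumed is the one stated in the doc comment above (unquoted identifiers are lowercased when the option is on).

```python
def normalize_ident(ident: str, quoted: bool, enable_ident_normalization: bool) -> str:
    """Hypothetical helper: lowercase an unquoted identifier only when normalization is on."""
    if enable_ident_normalization and not quoted:
        return ident.lower()
    return ident


# Old default (normalization on): an unquoted mixed-case name is lowercased,
# so the user had to quote it to keep its case.
old_unquoted = normalize_ident("Customer", quoted=False, enable_ident_normalization=True)
old_quoted = normalize_ident("Customer", quoted=True, enable_ident_normalization=True)

# New default (normalization off): the name is preserved without quotes.
new_unquoted = normalize_ident("Customer", quoted=False, enable_ident_normalization=False)

print(old_unquoted, old_quoted, new_unquoted)
```

Under the new default the unquoted and quoted forms resolve to the same spelling, which is why the quotes become unnecessary.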

Summary by CodeRabbit

New Features
- Query tables and columns with non-ASCII (e.g., Chinese) identifiers without forced lowercasing.
Bug Fixes
- Improved handling of quoted/unquoted and mixed-case identifiers to prevent misinterpretation.
- Clearer error when a table name is ambiguous across models.
Tests
- Added end-to-end coverage for Unicode identifiers, including caching behavior.
- Expanded tests for ambiguous naming scenarios and case sensitivity.

coderabbitai · 2025-09-02T08:19:54Z

Walkthrough

Adds a Unicode-named table to Postgres test setup, introduces an async test querying that table via the v3 API, disables identifier normalization in DataFusion session config, and updates MDL transform tests to use unquoted/Unicode identifiers and cover ambiguous table names. No public APIs changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Postgres test setup `ibis-server/tests/routers/v3/connector/postgres/conftest.py` | Within the existing setup transaction, creates and populates a table named 中文表 with columns 欄位1, 欄位2 and sample rows. |
| Postgres API query test `ibis-server/tests/routers/v3/connector/postgres/test_query.py` | Adds async test_query_unicode_table that builds a Unicode manifest, queries SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1 via the v3 API with caching, asserts 200, data [1, 2], and int32 dtypes. |
| DataFusion session config `wren-core/core/src/mdl/context.rs` | Sets datafusion.sql_parser.enable_ident_normalization = "false" in create_ctx_with_mdl; no signature or control-flow changes. |
| MDL transform tests and docs `wren-core/core/src/mdl/mod.rs` | Removes a doc comment; adjusts tests to unquoted identifiers (including non-ASCII), updates expectations accordingly; adds tests for ambiguous table names (customer/Customer) including an error case. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test Client
  participant API as v3 Query API
  participant M as MDL Transform
  participant DF as DataFusion Session
  participant PG as Postgres

  Note over DF: Config: enable_ident_normalization = false

  T->>API: POST /query (SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1)
  API->>M: Transform SQL using manifest (Unicode identifiers)
  M-->>API: Transformed SQL (identifiers preserved)
  API->>DF: Execute SQL (session with config)
  DF->>PG: Run query
  PG-->>DF: Rows [[1,2], ...]
  DF-->>API: Result + schema (int32,int32)
  API-->>T: 200 OK, data, dtypes
  Note right of API: Caching enabled per request

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

fix(core): fix the uppercase table name of the model tableReference #1173 — Adjusts SQL identifier handling; aligns with disabling normalization and Unicode identifier support added here.
fix(core): add missing variant and fix quoted issue for Oracle #1178 — Introduces dialect-aware quoting; related to unquoted/Unicode identifier treatment and parser configuration.
fix(core): expand the wildcard before Wren rewrite rules #1145 — Modifies MDL SQL transformation behavior/tests; overlaps with updated tests and context configuration.

Suggested reviewers

douenergy

Poem

I nibbled on bytes by the database glade,
Hopped through tables where 中文表 was laid.
欄位1, 欄位2, what a tasty view—
No lowercasing winds, just Unicode true.
Cache a carrot, fetch a row—what delight!
Queries bloom bright in the moonlit byte. 🥕✨

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

wren-core/core/src/mdl/context.rs (1)
64-67: Disabling identifier normalization: good default; consider making it overridable.

Hardcoding false matches the PR goal. As a small enhancement, allow a session property (e.g., x-wren-enable-ident-normalization) to toggle this for engines that prefer case-insensitive behavior.

Apply minimal change:
-        .set(
-            "datafusion.sql_parser.enable_ident_normalization",
-            &ScalarValue::Utf8(Some("false".to_string())),
-        )
+        .set(
+            "datafusion.sql_parser.enable_ident_normalization",
+            &ScalarValue::Utf8(
+                properties
+                    .get("x-wren-enable-ident-normalization")
+                    .cloned() // Option<Option<String>>
+                    .unwrap_or(Some("false".to_string())),
+            ),
+        )
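The lookup-with-default in the suggested diff can be sketched in Python as follows. This is an illustrative analogue, not code from the repository: it assumes `properties` maps property names to optional strings (as the `Option<Option<String>>` comment suggests), and `resolve_ident_normalization` is a hypothetical name.

```python
from typing import Dict, Optional

# Property name taken from the suggestion above.
PROPERTY_KEY = "x-wren-enable-ident-normalization"


def resolve_ident_normalization(properties: Dict[str, Optional[str]]) -> Optional[str]:
    """Mirror `.get(key).cloned().unwrap_or(Some("false"))` from the diff.

    The "false" default applies only when the key is absent. A key that is
    present with no value is passed through as None, not replaced.
    """
    if PROPERTY_KEY not in properties:
        return "false"
    return properties[PROPERTY_KEY]


print(resolve_ident_normalization({}))                      # key absent
print(resolve_ident_normalization({PROPERTY_KEY: "true"}))  # explicit override
print(resolve_ident_normalization({PROPERTY_KEY: None}))    # present but empty
```

One thing the sketch makes visible: under this assumption, a property that is present but empty is not replaced by the default, so the suggested change may need an extra fallback for that case.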
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1185-1225: Good Unicode/caching coverage; consider asserting cache headers.

Test validates Unicode identifiers and dtypes. Since cache is enabled, optionally assert X-Cache-Hit on a second identical request to exercise cache keys with non-ASCII SQL.
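A rough sketch of the shape such an assertion could take is shown here. It uses a stand-in class rather than the real endpoint: `FakeQueryApi`, the `"true"`/`"false"` header values, and the SHA-256 cache key are all assumptions for illustration, not the ibis-server implementation. Only the `X-Cache-Hit` header name and the SQL text come from this review.

```python
import hashlib
from typing import Dict, List, Tuple


class FakeQueryApi:
    """Stand-in for the v3 query endpoint (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._cache: Dict[str, List[List[int]]] = {}

    def query(self, sql: str) -> Tuple[Dict[str, str], List[List[int]]]:
        # Key the cache on the UTF-8 bytes of the SQL so non-ASCII
        # identifiers take part in the key like any other character.
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key in self._cache:
            return {"X-Cache-Hit": "true"}, self._cache[key]
        rows = [[1, 2]]  # placeholder for the row the real query would return
        self._cache[key] = rows
        return {"X-Cache-Hit": "false"}, rows


api = FakeQueryApi()
sql = "SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1"

first_headers, first_rows = api.query(sql)
second_headers, second_rows = api.query(sql)

# The second identical request should be served from the cache.
assert first_headers["X-Cache-Hit"] == "false"
assert second_headers["X-Cache-Hit"] == "true"
assert first_rows == second_rows
```

In the real test the two calls would go through the test client against the v3 query route; the pattern of issuing the request twice and comparing headers stays the same.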

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d560ac3 and 3940d94.

📒 Files selected for processing (5)

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1 hunks)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1 hunks)
wren-core/core/src/mdl/context.rs (1 hunks)
wren-core/core/src/mdl/function.rs (1 hunks)
wren-core/core/src/mdl/mod.rs (11 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-05-05T02:27:29.829Z

Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.

Applied to files:

wren-core/core/src/mdl/mod.rs

🧬 Code graph analysis (2)

ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)

ibis-server/tests/conftest.py (1)

client (18-23)

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)

connection_info (51-58)

wren-core/core/src/mdl/mod.rs (3)

wren-core/core/src/mdl/context.rs (1)

new (323-345)

wren-core-py/src/context.rs (2)

new (81-203)

default (62-70)

wren-core/core/src/logical_plan/analyze/model_anlayze.rs (2)

new (74-84)

analyze (47-66)

🪛 GitHub Actions: ibis CI

ibis-server/tests/routers/v3/connector/postgres/test_query.py

[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/test_query.py).

ibis-server/tests/routers/v3/connector/postgres/conftest.py

[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/conftest.py).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: cargo test (macos)
GitHub Check: cargo check
GitHub Check: cargo test (macos-aarch64)
GitHub Check: cargo test (win64)
GitHub Check: test
GitHub Check: Analyze (java-kotlin)

🔇 Additional comments (13)

wren-core/core/src/mdl/function.rs (1)

390-399: Rename in test is correct and consistent.

date_diff → date_test change is applied consistently in SQL and expected plan; no logic impact.

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)

43-45: Formatting applied — Ruff format has been run on ibis-server/tests/routers/v3/connector/postgres/conftest.py; CI should now pass.

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1-1225: Ruff formatting applied: The file has been reformatted with ruff format and now passes the CI check.

wren-core/core/src/mdl/mod.rs (10)

729-742: Tests updated to unquoted identifiers: LGTM.

Switching to CTest.STest.Customer (unquoted) aligns with disabled normalization; snapshots remain stable.

779-783: UDF call test uses unquoted identifiers: LGTM.

Matches new parser behavior and preserves snapshot intent.

822-839: Unicode column expressions now unquoted: LGTM.

These exercise non-ASCII identifiers under disabled normalization; good coverage.

904-909: Keep using unquoted Unicode in expressions.

Consistent with the new default; no issues.

967-967: Redundant approval for similar change.

No new concerns beyond previous comments.

1002-1002: Redundant approval for similar change.

No new concerns beyond previous comments.

1052-1052: Redundant approval for similar change.

No new concerns beyond previous comments.

1336-1337: Cast expression unquoted: LGTM.

cast(出道時間 as timestamptz) verifies parser respects Unicode identifiers without quotes.

2597-2619: RLAC with Unicode property names: solid test.

Rule uses non-ASCII session property; this aligns with prior learnings on RLAC literal validation to prevent injection. Good signal.

3119-3180: Ambiguous table name test is excellent.

Covers case-sensitive resolution for customer vs Customer and the not-found error on CUSTOMER; assertions look correct with the new normalization setting.
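The resolution behavior this test exercises can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the Wren planner: `resolve_table`, `TableNotFound`, and the `wren`/`test` catalog and schema defaults are hypothetical names chosen for the example.

```python
from typing import FrozenSet


class TableNotFound(Exception):
    """Raised when no model matches the identifier exactly."""


def resolve_table(
    name: str,
    models: FrozenSet[str],
    catalog: str = "wren",
    schema: str = "test",
) -> str:
    """Resolve a table name by exact, case-sensitive match (no normalization)."""
    if name in models:
        return f"{catalog}.{schema}.{name}"
    raise TableNotFound(f"table '{catalog}.{schema}.{name}' not found")


# Two models whose names differ only by case.
MODELS = frozenset({"customer", "Customer"})

print(resolve_table("customer", MODELS))
print(resolve_table("Customer", MODELS))

try:
    resolve_table("CUSTOMER", MODELS)
except TableNotFound as err:
    print(err)
```

With normalization enabled, the unquoted `Customer` and `CUSTOMER` would both have collapsed to `customer`; with it disabled, each spelling is looked up as written.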

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1203-1210: Avoid name shadowing the manifest_str fixture

Use a distinct local name to prevent confusion with the module-scoped fixture.

-    manifest_str = base64.b64encode(orjson.dumps(manifest)).decode("utf-8")
+    unicode_manifest_b64 = base64.b64encode(orjson.dumps(manifest)).decode("utf-8")
@@
-            "manifestStr": manifest_str,
+            "manifestStr": unicode_manifest_b64,
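For context, the encoding step being renamed can be sketched on its own. This is a minimal example, assuming a simplified manifest: the dictionary below is a hypothetical shape, not the full Wren manifest schema, and the standard-library `json` module stands in for `orjson` so the snippet has no third-party dependency.

```python
import base64
import json

# Hypothetical, trimmed-down manifest with non-ASCII model and column names.
manifest = {
    "catalog": "wren",
    "schema": "public",
    "models": [
        {
            "name": "中文表",
            "columns": [
                {"name": "欄位1", "type": "int"},
                {"name": "欄位2", "type": "int"},
            ],
        }
    ],
}

# Serialize to UTF-8 JSON bytes, then base64-encode to an ASCII string
# that can be placed in the request body's "manifestStr" field.
manifest_bytes = json.dumps(manifest, ensure_ascii=False).encode("utf-8")
unicode_manifest_b64 = base64.b64encode(manifest_bytes).decode("utf-8")

# Round trip: decoding must give back the same Unicode identifiers.
decoded = json.loads(base64.b64decode(unicode_manifest_b64).decode("utf-8"))
assert decoded == manifest
assert decoded["models"][0]["name"] == "中文表"
```

Using a distinct local name such as `unicode_manifest_b64` keeps the test-specific manifest visibly separate from any fixture of a similar name.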

wren-core/core/src/mdl/mod.rs (1)

3465-3526: Great addition: ambiguity behavior covered for model names differing only by case

Solid coverage for:

Resolving customer vs Customer
Erroring on CUSTOMER (no matching model)

Consider adding a join case to prove both models can be referenced simultaneously.

@@
     let sql = "select * from CUSTOMER";
     match transform_sql_with_ctx(
         &ctx,
         Arc::clone(&analyzed_mdl),
         &[],
         Arc::clone(&headers),
         sql,
     )
     .await
     {
         Ok(_) => {
             panic!("Expected error, but got SQL");
         }
         Err(e) => assert_snapshot!(
             e.to_string(),
             @"Error during planning: table 'wren.test.CUSTOMER' not found"
         ),
     }
 
+    // Additional: both models in one query to assert simultaneous resolution
+    let sql = r#"select customer.c_name, "Customer"."C_name" from customer join "Customer" on customer.c_name = "Customer".c_name"#;
+    let _both = transform_sql_with_ctx(
+        &ctx,
+        Arc::clone(&analyzed_mdl),
+        &[],
+        Arc::clone(&headers),
+        sql,
+    ).await?;
+
     Ok(())

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 94be388 and 9d6ec4b.

📒 Files selected for processing (4)

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1 hunks)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1 hunks)
wren-core/core/src/mdl/context.rs (1 hunks)
wren-core/core/src/mdl/mod.rs (11 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

wren-core/core/src/mdl/context.rs
ibis-server/tests/routers/v3/connector/postgres/conftest.py

🧰 Additional context used

🧬 Code graph analysis (2)

wren-core/core/src/mdl/mod.rs (1)

wren-core/core/src/logical_plan/analyze/plan.rs (6)

new (72-85)

new (112-127)

new (778-780)

new (895-992)

new (1046-1093)

new (1152-1154)

ibis-server/tests/routers/v3/connector/postgres/test_query.py (3)

ibis-server/tests/conftest.py (1)

client (18-23)

ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)

connection_info (53-60)

ibis-server/tests/routers/v3/connector/postgres/test_validate.py (1)

manifest_str (28-29)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: test
GitHub Check: cargo test (macos)
GitHub Check: cargo test (macos-aarch64)
GitHub Check: cargo test (win64)
GitHub Check: cargo check
GitHub Check: ci
GitHub Check: Analyze (java-kotlin)

🔇 Additional comments (10)

ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)

1186-1221: Unicode identifier query path works — LGTM

Good end-to-end coverage for non-ASCII model/column names with DataFusion fallback disabled and cache path exercised.

wren-core/core/src/mdl/mod.rs (9)

737-749: Unquoted qualified identifiers test — LGTM

This validates case-sensitive resolution for CTest.STest.Customer under disabled normalization.

787-791: Remote UDF call with case-sensitive column — LGTM

Exercising unquoted Custkey on Customer matches the new default.

831-846: Unquoted Unicode identifiers in expressions — LGTM

Switching to unquoted 名字/組別/訂閱數 aligns with disabled normalization and keeps snapshots stable.

911-917: Expression parsing for simple column references — LGTM

Unquoted 名字 in expressions is appropriate with the new parser setting.

975-988: Hidden-column scenario with unquoted Unicode — LGTM

Covers CLAC + hidden source column with unquoted derived field.

1061-1071: IN-subquery with unquoted Unicode — LGTM

Confirms predicates and subqueries behave with case-sensitive identifiers.

1361-1367: Cast to timestamptz using unquoted Unicode source — LGTM

Matches the new identifier semantics.

2624-2644: RLAC with Unicode session property and unquoted SQL — LGTM

Keeps property handling intact while relying on case-sensitive identifiers.

267-274: Align expression parsing with disabled identifier normalization

sql_to_expr uses DFParser with default options, which may still normalize ASCII identifiers. That can silently break inference for expressions like C_name || C_name in manifests. Ensure the same “enable_ident_normalization = false” applies here (e.g., via parser options/APIs available in your DataFusion version).

Would you like me to draft a patch that wires parser options here once we confirm the exact DataFusion API in this repo?

douenergy · 2025-09-03T03:00:00Z

Thanks @goldmedal

github-actions bot added the core, ibis, rust (Pull requests that update Rust code), and python (Pull requests that update Python code) labels Sep 2, 2025

coderabbitai bot reviewed Sep 2, 2025

goldmedal requested a review from douenergy September 2, 2025 09:39

goldmedal added 3 commits September 3, 2025 10:33

disable ident normalization

3296098

modified and enhance tests

92d10b7

fix fmt

9d6ec4b

goldmedal force-pushed the feat/disable-ident-normalize branch from 94be388 to 9d6ec4b Compare September 3, 2025 02:35

coderabbitai bot reviewed Sep 3, 2025

douenergy approved these changes Sep 3, 2025

douenergy merged commit 131d82d into main Sep 3, 2025
18 checks passed

goldmedal deleted the feat/disable-ident-normalize branch September 15, 2025 09:20

This was referenced Sep 15, 2025

fix(core-py): extract the used tables using the case-sensitive table name #1320

Merged

Table name case sensitivity causes "table not found" errors in MySQL connector #1319

Closed
