feat(core): disable enable_ident_normalization for case-sensitive identifier by default#1305
feat(core): disable enable_ident_normalization for case-sensitive identifier by default#1305
enable_ident_normalization for case-sensitive identifier by default#1305Conversation
WalkthroughAdds a Unicode-named table to Postgres test setup, introduces an async test querying that table via the v3 API, disables identifier normalization in DataFusion session config, and updates MDL transform tests to use unquoted/Unicode identifiers and cover ambiguous table names. No public APIs changed. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant T as Test Client
participant API as v3 Query API
participant M as MDL Transform
participant DF as DataFusion Session
participant PG as Postgres
Note over DF: Config: enable_ident_normalization = false
T->>API: POST /query (SELECT 欄位1, 欄位2 FROM 中文表 LIMIT 1)
API->>M: Transform SQL using manifest (Unicode identifiers)
M-->>API: Transformed SQL (identifiers preserved)
API->>DF: Execute SQL (session with config)
DF->>PG: Run query
PG-->>DF: Rows [[1,2], ...]
DF-->>API: Result + schema (int32,int32)
API-->>T: 200 OK, data, dtypes
Note right of API: Caching enabled per request
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
wren-core/core/src/mdl/context.rs (1)
64-67: Disabling identifier normalization: good default; consider making it overridable.Hardcoding false matches the PR goal. As a small enhancement, allow a session property (e.g., x-wren-enable-ident-normalization) to toggle this for engines that prefer case-insensitive behavior.
Apply minimal change:
- .set( - "datafusion.sql_parser.enable_ident_normalization", - &ScalarValue::Utf8(Some("false".to_string())), - ) + .set( + "datafusion.sql_parser.enable_ident_normalization", + &ScalarValue::Utf8( + properties + .get("x-wren-enable-ident-normalization") + .cloned() // Option<Option<String>> + .unwrap_or(Some("false".to_string())), + ), + )ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
1185-1225: Good Unicode/caching coverage; consider asserting cache headers.Test validates Unicode identifiers and dtypes. Since cache is enabled, optionally assert X-Cache-Hit on a second identical request to exercise cache keys with non-ASCII SQL.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
ibis-server/tests/routers/v3/connector/postgres/conftest.py(1 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py(1 hunks)wren-core/core/src/mdl/context.rs(1 hunks)wren-core/core/src/mdl/function.rs(1 hunks)wren-core/core/src/mdl/mod.rs(11 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-05T02:27:29.829Z
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.
Applied to files:
wren-core/core/src/mdl/mod.rs
🧬 Code graph analysis (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (2)
ibis-server/tests/conftest.py (1)
client(18-23)ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
connection_info(51-58)
wren-core/core/src/mdl/mod.rs (3)
wren-core/core/src/mdl/context.rs (1)
new(323-345)wren-core-py/src/context.rs (2)
new(81-203)default(62-70)wren-core/core/src/logical_plan/analyze/model_anlayze.rs (2)
new(74-84)analyze(47-66)
🪛 GitHub Actions: ibis CI
ibis-server/tests/routers/v3/connector/postgres/test_query.py
[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/test_query.py).
ibis-server/tests/routers/v3/connector/postgres/conftest.py
[error] 1-1: Ruff format check failed. This file would be reformatted by 'ruff format --check' (Would reformat: ibis-server/tests/routers/v3/connector/postgres/conftest.py).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: cargo test (macos)
- GitHub Check: cargo check
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (win64)
- GitHub Check: test
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (13)
wren-core/core/src/mdl/function.rs (1)
390-399: Rename in test is correct and consistent.date_diff → date_test change is applied consistently in SQL and expected plan; no logic impact.
ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
43-45: Formatting applied — Ruff format has been run onibis-server/tests/routers/v3/connector/postgres/conftest.py; CI should now pass.ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
1-1225: Ruff formatting applied: The file has been reformatted withruff formatand now passes the CI check.wren-core/core/src/mdl/mod.rs (10)
729-742: Tests updated to unquoted identifiers: LGTM.Switching to CTest.STest.Customer (unquoted) aligns with disabled normalization; snapshots remain stable.
779-783: UDF call test uses unquoted identifiers: LGTM.Matches new parser behavior and preserves snapshot intent.
822-839: Unicode column expressions now unquoted: LGTM.These exercise non-ASCII identifiers under disabled normalization; good coverage.
904-909: Keep using unquoted Unicode in expressions.Consistent with the new default; no issues.
967-967: Redundant approval for similar change.No new concerns beyond previous comments.
1002-1002: Redundant approval for similar change.No new concerns beyond previous comments.
1052-1052: Redundant approval for similar change.No new concerns beyond previous comments.
1336-1337: Cast expression unquoted: LGTM.cast(出道時間 as timestamptz) verifies parser respects Unicode identifiers without quotes.
2597-2619: RLAC with Unicode property names: solid test.Rule uses non-ASCII session property; this aligns with prior learnings on RLAC literal validation to prevent injection. Good signal.
3119-3180: Ambiguous table name test is excellent.Covers case-sensitive resolution for customer vs Customer and the not-found error on CUSTOMER; assertions look correct with the new normalization setting.
94be388 to
9d6ec4b
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
1203-1210: Avoid name shadowing the manifest_str fixtureUse a distinct local name to prevent confusion with the module-scoped fixture.
- manifest_str = base64.b64encode(orjson.dumps(manifest)).decode("utf-8") + unicode_manifest_b64 = base64.b64encode(orjson.dumps(manifest)).decode("utf-8") @@ - "manifestStr": manifest_str, + "manifestStr": unicode_manifest_b64,wren-core/core/src/mdl/mod.rs (1)
3465-3526: Great addition: ambiguity behavior covered for model names differing only by caseSolid coverage for:
- Resolving customer vs Customer
- Erroring on CUSTOMER (no matching model)
Consider adding a join case to prove both models can be referenced simultaneously.
@@ let sql = "select * from CUSTOMER"; match transform_sql_with_ctx( &ctx, Arc::clone(&analyzed_mdl), &[], Arc::clone(&headers), sql, ) .await { Ok(_) => { panic!("Expected error, but got SQL"); } Err(e) => assert_snapshot!( e.to_string(), @"Error during planning: table 'wren.test.CUSTOMER' not found" ), } + // Additional: both models in one query to assert simultaneous resolution + let sql = r#"select customer.c_name, "Customer"."C_name" from customer join "Customer" on customer.c_name = "Customer".c_name"#; + let _both = transform_sql_with_ctx( + &ctx, + Arc::clone(&analyzed_mdl), + &[], + Arc::clone(&headers), + sql, + ).await?; + Ok(())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
ibis-server/tests/routers/v3/connector/postgres/conftest.py(1 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py(1 hunks)wren-core/core/src/mdl/context.rs(1 hunks)wren-core/core/src/mdl/mod.rs(11 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- wren-core/core/src/mdl/context.rs
- ibis-server/tests/routers/v3/connector/postgres/conftest.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-05T02:27:29.829Z
Learnt from: goldmedal
PR: Canner/wren-engine#1161
File: ibis-server/app/routers/v3/connector.py:78-83
Timestamp: 2025-05-05T02:27:29.829Z
Learning: The row-level access control implementation in Wren Engine filters headers with the prefix `X_WREN_VARIABLE_PREFIX` in `EmbeddedEngineRewriter.get_session_properties` and validates session property expressions in `access_control.rs` to ensure they only contain literal values, preventing SQL injection.
Applied to files:
wren-core/core/src/mdl/mod.rs
🧬 Code graph analysis (2)
wren-core/core/src/mdl/mod.rs (1)
wren-core/core/src/logical_plan/analyze/plan.rs (6)
new(72-85)new(112-127)new(778-780)new(895-992)new(1046-1093)new(1152-1154)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (3)
ibis-server/tests/conftest.py (1)
client(18-23)ibis-server/tests/routers/v3/connector/postgres/conftest.py (1)
connection_info(53-60)ibis-server/tests/routers/v3/connector/postgres/test_validate.py (1)
manifest_str(28-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: test
- GitHub Check: cargo test (macos)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (win64)
- GitHub Check: cargo check
- GitHub Check: ci
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (10)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
1186-1221: Unicode identifier query path works — LGTMGood end-to-end coverage for non-ASCII model/column names with DataFusion fallback disabled and cache path exercised.
wren-core/core/src/mdl/mod.rs (9)
737-749: Unquoted qualified identifiers test — LGTMThis validates case-sensitive resolution for CTest.STest.Customer under disabled normalization.
787-791: Remote UDF call with case-sensitive column — LGTMExercising unquoted Custkey on Customer matches the new default.
831-846: Unquoted Unicode identifiers in expressions — LGTMSwitching to unquoted 名字/組別/訂閱數 aligns with disabled normalization and keeps snapshots stable.
911-917: Expression parsing for simple column references — LGTMUnquoted 名字 in expressions is appropriate with the new parser setting.
975-988: Hidden-column scenario with unquoted Unicode — LGTMCovers CLAC + hidden source column with unquoted derived field.
1061-1071: IN-subquery with unquoted Unicode — LGTMConfirms predicates and subqueries behave with case-sensitive identifiers.
1361-1367: Cast to timestamptz using unquoted Unicode source — LGTMMatches the new identifier semantics.
2624-2644: RLAC with Unicode session property and unquoted SQL — LGTMKeeps property handling intact while relying on case-sensitive identifiers.
267-274: Align expression parsing with disabled identifier normalizationsql_to_expr uses DFParser with default options, which may still normalize ASCII identifiers. That can silently break inference for expressions like C_name || C_name in manifests. Ensure the same “enable_ident_normalization = false” applies here (e.g., via parser options/APIs available in your DataFusion version).
Would you like me to draft a patch that wires parser options here once we confirm the exact DataFusion API in this repo?
|
Thanks @goldmedal |
Description
This PR changes the default behavior of Wren engine for case-sensitive identifier. I disable
datafusion.sql_parser.enable_ident_normalizationby default.Now, all the identifier won't be normalized. It means all the identifier will be case-sensitive. It match the logic for finding the Wren model but the user won't need to add the quote anymore.
Summary by CodeRabbit