Skip to content

feat(core): introduce dialect-specific function list and refactor BigQuery function lists#1366

Merged
douenergy merged 12 commits intoCanner:mainfrom
goldmedal:chore/enhance-bigquery-func
Nov 18, 2025
Merged

feat(core): introduce dialect-specific function list and refactor BigQuery function lists#1366
douenergy merged 12 commits intoCanner:mainfrom
goldmedal:chore/enhance-bigquery-func

Conversation

@goldmedal
Copy link
Copy Markdown
Contributor

@goldmedal goldmedal commented Nov 11, 2025

Description

To provide accurate functions, this PR redefines the function list by the signature of DataFusion. Most functions could have multiple signatures; We defined the function lists by a simple record of csv. It isn't enough to define multiple signatures.

The InnerDialect has 3 methods for the supported functions, which will be registered when the session context is created

    fn supported_udfs(&self) -> Vec<Arc<datafusion::logical_expr::ScalarUDF>> {
        scalar_functions()
    }

    fn supported_udafs(&self) -> Vec<Arc<datafusion::logical_expr::AggregateUDF>> {
        aggregate_functions()
    }

    fn supported_udwfs(&self) -> Vec<Arc<datafusion::logical_expr::WindowUDF>> {
        window_functions()
    }

We can define the supported function for each data source. If the list isn't defined, we register the default function list.

BigQuery Function list

  • Remove the CSV-based function list.
  • Try to register all the common functions of BigQuery in wren-core/core/src/mdl/function/dialect/bigquery.
    • BigQuery Function
    • Some special functions are excluded: AEAD, geo functions, search functions, ...
  • Backward support the white list function in ibis-server/resources/white_function_list/bigquery.csv

Summary by CodeRabbit

  • New Features

    • BigQuery dialect added with a large set of native scalar, aggregate and window functions for richer SQL.
  • Bug Fixes

    • HTTP 422 mapping updated for more consistent error responses.
  • Improvements

    • Session/context and rewrite flows now propagate data-source info; remote-function registration is gated by data source (BigQuery skips remote function registration).

@github-actions github-actions bot added core bigquery ibis rust Pull requests that update Rust code python Pull requests that update Python code labels Nov 11, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 11, 2025

Warning

Rate limit exceeded

@goldmedal has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 48 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0030b62 and cbe2cd1.

📒 Files selected for processing (1)
  • wren-core/core/src/mdl/dialect/inner_dialect.rs (5 hunks)

Walkthrough

Threads an optional data_source through ibis-server, wren-core-py, and wren-core; gates remote-function registration for BigQuery; adds DataSource parsing/error; exposes a BigQuery dialect with many scalar/aggregate/window UDFs; updates UDF macros, ReturnType handling, and call-sites to accept data_source.

Changes

Cohort / File(s) Summary
ibis-server: data_source threading & remote-function gating
ibis-server/app/config.py, ibis-server/app/mdl/core.py, ibis-server/app/mdl/rewriter.py, ibis-server/tools/query_local_run.py, ibis-server/wren/session/__init__.py
Thread data_source into session/context and rewriter APIs; special-case bigquery in remote function list path (return None to skip); update calls to pass data_source into EmbeddedEngineRewriter.rewrite_sync and SessionContext.
ibis-server: HTTP status constant
ibis-server/app/model/error.py
Replace HTTP_422_UNPROCESSABLE_ENTITY with HTTP_422_UNPROCESSABLE_CONTENT in import and fallback mapping.
wren-core-base: DataSource parsing & error
wren-core-base/src/mdl/manifest.rs
Add ParsedDataSourceError and FromStr impl for DataSource to parse strings → DataSource with explicit error.
wren-core-py: PySessionContext & registration gating
wren-core-py/src/context.rs, wren-core-py/src/errors.rs
PySessionContext::new gains data_source param; resolve data_source from manifest or arg; register remote functions only when appropriate (skip for BigQuery); add From → CoreError mapping.
wren-core (core): create_wren_ctx & dialect-aware UDF registration
wren-core/core/src/mdl/mod.rs, wren-core/core/src/mdl/dialect/mod.rs, wren-core/core/src/mdl/dialect/inner_dialect.rs
create_wren_ctx now accepts optional data_source; selects inner dialect and registers dialect-specific scalar/aggregate/window UDFs; make inner_dialect public; add supported_udfs/udafs/udwfs trait methods; refactor BigQuery date-diff handling.
wren-core: function dialect & utilities
wren-core/core/src/mdl/function/mod.rs, wren-core/core/src/mdl/function/dialect/mod.rs, wren-core/core/src/mdl/function/dialect/utils.rs
Add public dialect module and bigquery submodule; add build_document helper for UDF docs; expose bigquery registry entry points.
wren-core: BigQuery scalar functions (large addition)
wren-core/core/src/mdl/function/dialect/bigquery/scalar.rs
Add a comprehensive collection of BigQuery-like scalar UDF definitions (arrays, date/time, JSON, string, numeric, encodings, regex, etc.) with signatures and docs.
wren-core: BigQuery aggregate & window functions
wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs, wren-core/core/src/mdl/function/dialect/bigquery/window.rs
Add BigQuery aggregate UDAFs and window UDWFs (e.g., any_value, approx_, group_concat, percentile_), with signatures, aliases, and documentation.
wren-core: function registration, ReturnType & macros
wren-core/core/src/mdl/function/remote_function.rs, wren-core/core/src/mdl/function/macros.rs, wren-core/core/src/mdl/function/scalar/mod.rs
Add ArrayOfInputFirstArgument ReturnType; update ByPass* constructors to accept ReturnType/Signature/Documentation and aggregate aliases; rename make_udf_functionmake_datafusion_udf_function and add expression-based make_udf_function/make_udaf_function/make_udwf_function using LazyLock singletons.
Examples & tests: call-site updates
wren-core/sqllogictest/src/test_context.rs, wren-core/wren-example/examples/*.rs, wren-core-py/tests/test_modeling_core.py
Update call sites to pass new data_source argument (often None); adjust a test manifest dataSource from "bigquery""datafusion"; minor formatting adjustments.

Sequence Diagram(s)

sequenceDiagram
    participant ibis as ibis-server
    participant wren_py as wren-core-py
    participant wren_rs as wren-core (Rust)
    participant df as DataFusion

    ibis->>wren_py: PySessionContext::new(mdl_base64, remote_path, props, data_source)
    wren_py->>wren_py: parse manifest (if present)
    wren_py->>wren_py: resolve data_source (manifest preferred)
    alt data_source == "bigquery"
        wren_py->>wren_py: skip remote function registration
    else
        wren_py->>wren_py: read & register remote functions (CSV)
    end
    wren_py->>wren_rs: create_wren_ctx(Some(config), data_source)
    wren_rs->>wren_rs: get_inner_dialect(data_source)
    alt dialect present
        wren_rs->>df: register dialect-supported scalar/aggregate/window UDFs
    else
        wren_rs->>df: register default function sets
    end
    wren_rs-->>wren_py: return SessionContext
    wren_py-->>ibis: context initialized
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas to focus during review:

  • BigQuery scalar UDFs file (.../bigquery/scalar.rs) — large volume, many signatures and docs.
  • Macros and function-registration changes (macros.rs, remote_function.rs) — macro semantics, LazyLock singletons, constructor signature migrations.
  • DataSource parsing and error mapping (manifest.rs, wren-core-py/src/errors.rs) — FromStr and error conversions.
  • PySessionContext changes (wren-core-py/src/context.rs) — data_source resolution precedence and gating of remote-function registration.
  • create_wren_ctx and cross-language call-sites — ensure consistent data_source propagation and dialect registration.

Possibly related PRs

Suggested reviewers

  • douenergy
  • wwwy3y3

Poem

🐰 I hopped through manifests and threaded tiny strings,

nudged data_source down tunnels to brighter things.
BigQuery hid its list, while UDF meadows grew,
crates, macros, and docs — a rabbit's tidy view.
Cheers for the code — carrots for every new queue!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.52% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: introducing dialect-specific function lists and refactoring BigQuery functions, which aligns with the extensive changes across dialect registration, UDF macros, and BigQuery-specific function definitions.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@goldmedal
Copy link
Copy Markdown
Contributor Author

BigQuery has been tested locally

poetry run pytest -m 'bigquery'
========================================================================================================= test session starts ==========================================================================================================
platform darwin -- Python 3.11.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 396 items / 349 deselected / 47 selected                                                                                                                                                                                     

tests/routers/v2/connector/test_bigquery.py ..................                                                                                                                                                                   [ 38%]
tests/routers/v3/connector/bigquery/test_functions.py ............                                                                                                                                                               [ 63%]
tests/routers/v3/connector/bigquery/test_query.py .................  

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ibis-server/app/mdl/rewriter.py (1)

159-169: Ensure consistency: rewrite_sync should also pass data_source.

The async rewrite method passes data_source to get_session_context (line 149), but rewrite_sync does not (line 164-166). This inconsistency could lead to different behavior between sync and async execution paths.

Apply this diff to align both methods:

     def rewrite_sync(
         self, manifest_str: str, sql: str, properties: dict | None = None
     ) -> str:
         try:
             processed_properties = self.get_session_properties(properties)
             session_context = get_session_context(
-                manifest_str, self.function_path, processed_properties
+                manifest_str,
+                self.function_path,
+                processed_properties,
+                self.data_source.name if self.data_source else None,
             )
             return session_context.transform_sql(sql)
         except Exception as e:
             raise WrenError(ErrorCode.INVALID_SQL, str(e), ErrorPhase.SQL_PLANNING)
wren-core-py/src/context.rs (1)

90-117: Honor the data_source argument when no manifest is supplied.

When mdl_base64 is None you currently discard the caller‑provided data_source. That leaves us unable to bootstrap a dialect-aware context (e.g. BigQuery) without first constructing a manifest, and it also means the later BigQuery remote-function skip never triggers. Please thread the parsed data_source through this branch as well (convert the &str to DataSource and pass it into create_wren_ctx).

🧹 Nitpick comments (2)
wren-core/core/src/mdl/function/dialect/utils.rs (1)

3-5: Consider unifying with remote_function.rs's build_document.

A similar build_document function exists in wren-core/core/src/mdl/function/remote_function.rs (lines 227-245) with richer parameter handling. If the two functions serve overlapping purposes, consider extracting a common builder utility to reduce duplication.

wren-core/core/src/mdl/dialect/mod.rs (1)

20-20: Consider re-exports instead of exposing inner_dialect directly.

Making inner_dialect public exposes its internal implementation. If only specific items from inner_dialect need to be public, consider using pub use re-exports to maintain a cleaner API surface and preserve encapsulation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 774e6fd and 83221b3.

⛔ Files ignored due to path filters (1)
  • ibis-server/resources/function_list/bigquery.csv is excluded by !**/*.csv
📒 Files selected for processing (24)
  • ibis-server/app/config.py (1 hunks)
  • ibis-server/app/mdl/core.py (1 hunks)
  • ibis-server/app/mdl/rewriter.py (3 hunks)
  • ibis-server/app/model/error.py (2 hunks)
  • ibis-server/tools/query_local_run.py (1 hunks)
  • wren-core-base/src/mdl/manifest.rs (6 hunks)
  • wren-core-py/src/context.rs (3 hunks)
  • wren-core-py/src/errors.rs (2 hunks)
  • wren-core/core/src/mdl/dialect/inner_dialect.rs (5 hunks)
  • wren-core/core/src/mdl/dialect/mod.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/bigquery/mod.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/bigquery/scalar.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/bigquery/window.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/mod.rs (1 hunks)
  • wren-core/core/src/mdl/function/dialect/utils.rs (1 hunks)
  • wren-core/core/src/mdl/function/macros.rs (2 hunks)
  • wren-core/core/src/mdl/function/mod.rs (1 hunks)
  • wren-core/core/src/mdl/function/remote_function.rs (16 hunks)
  • wren-core/core/src/mdl/function/scalar/mod.rs (1 hunks)
  • wren-core/core/src/mdl/mod.rs (79 hunks)
  • wren-core/sqllogictest/src/test_context.rs (1 hunks)
  • wren-core/wren-example/examples/plan-sql.rs (2 hunks)
  • wren-core/wren-example/examples/to-many-calculation.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (16)
wren-core/core/src/mdl/function/dialect/utils.rs (1)
wren-core/core/src/mdl/function/remote_function.rs (1)
  • build_document (228-246)
wren-core-py/src/errors.rs (1)
wren-core-base/src/mdl/manifest.rs (1)
  • new (110-114)
wren-core/core/src/mdl/function/dialect/bigquery/window.rs (2)
wren-core/core/src/mdl/function/dialect/utils.rs (1)
  • build_document (3-5)
wren-core/core/src/mdl/function/remote_function.rs (4)
  • build_document (228-246)
  • new (187-199)
  • new (286-299)
  • new (386-398)
wren-core/wren-example/examples/to-many-calculation.rs (1)
wren-core/core/src/mdl/mod.rs (1)
  • create_wren_ctx (368-396)
ibis-server/tools/query_local_run.py (1)
wren-core/core/src/mdl/mod.rs (1)
  • data_source (327-329)
wren-core/sqllogictest/src/test_context.rs (1)
wren-core/core/src/mdl/mod.rs (1)
  • create_wren_ctx (368-396)
wren-core-py/src/context.rs (4)
wren-core/core/src/mdl/mod.rs (6)
  • mdl (198-233)
  • data_source (327-329)
  • new (129-184)
  • new (567-569)
  • default (65-73)
  • create_wren_ctx (368-396)
wren-core-base/manifest-macro/src/lib.rs (2)
  • data_source (61-116)
  • manifest (26-56)
wren-core-base/src/mdl/manifest.rs (2)
  • new (110-114)
  • from_str (158-183)
wren-core-py/src/manifest.rs (1)
  • to_manifest (18-23)
ibis-server/app/mdl/core.py (3)
wren-core/core/src/mdl/mod.rs (1)
  • data_source (327-329)
wren-core-base/manifest-macro/src/lib.rs (1)
  • data_source (61-116)
wren-core-base/src/mdl/builder.rs (1)
  • data_source (90-93)
wren-core/wren-example/examples/plan-sql.rs (2)
wren-core/core/src/mdl/mod.rs (3)
  • mdl (198-233)
  • create_wren_ctx (368-396)
  • transform_sql_with_ctx (416-476)
wren-core-base/manifest-macro/src/lib.rs (1)
  • manifest (26-56)
wren-core/core/src/mdl/function/remote_function.rs (1)
wren-core-py/src/context.rs (1)
  • new (82-226)
wren-core-base/src/mdl/manifest.rs (2)
wren-core-py/src/context.rs (1)
  • new (82-226)
wren-core-base/src/mdl/builder.rs (7)
  • new (46-58)
  • new (105-119)
  • new (190-205)
  • new (289-298)
  • new (325-337)
  • new (374-382)
  • new (404-411)
wren-core/core/src/mdl/dialect/inner_dialect.rs (5)
wren-core/core/src/mdl/dialect/utils.rs (2)
  • scalar_function_to_sql_internal (51-76)
  • args (30-48)
wren-core/core/src/mdl/function/dialect/bigquery/mod.rs (3)
  • bigquery_aggregate_functions (300-338)
  • bigquery_scalar_functions (73-298)
  • bigquery_window_functions (340-356)
wren-core/core/src/mdl/function/aggregate/mod.rs (1)
  • aggregate_functions (11-51)
wren-core/core/src/mdl/function/scalar/mod.rs (1)
  • scalar_functions (18-192)
wren-core/core/src/mdl/function/window/mod.rs (1)
  • window_functions (5-19)
ibis-server/app/mdl/rewriter.py (1)
ibis-server/app/model/data_source.py (1)
  • DataSource (60-214)
wren-core/core/src/mdl/mod.rs (5)
wren-core/core/src/mdl/dialect/inner_dialect.rs (1)
  • get_inner_dialect (112-121)
wren-core-base/manifest-macro/src/lib.rs (2)
  • data_source (61-116)
  • manifest (26-56)
wren-core-py/src/context.rs (1)
  • new (82-226)
wren-core/core/src/mdl/function/remote_function.rs (3)
  • new_with_return_type (201-211)
  • new_with_return_type (317-328)
  • new_with_return_type (400-410)
wren-core/core/src/logical_plan/utils.rs (1)
  • try_map_data_type (102-107)
wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs (2)
wren-core/core/src/mdl/function/remote_function.rs (8)
  • build_document (228-246)
  • new (187-199)
  • new (286-299)
  • new (386-398)
  • from (215-225)
  • from (332-342)
  • from (414-423)
  • new_with_alias (301-315)
wren-core/core/src/mdl/function/dialect/utils.rs (1)
  • build_document (3-5)
wren-core/core/src/mdl/function/dialect/bigquery/scalar.rs (2)
wren-core/core/src/mdl/function/remote_function.rs (4)
  • build_document (228-246)
  • new (187-199)
  • new (286-299)
  • new (386-398)
wren-core/core/src/mdl/function/dialect/utils.rs (1)
  • build_document (3-5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test
  • GitHub Check: cargo check
  • GitHub Check: ci
🔇 Additional comments (19)
ibis-server/app/config.py (1)

64-67: LGTM! BigQuery function list handled in Wren Core.

The early return for BigQuery is correct and aligns with the PR's introduction of dialect-specific function lists in Wren Core.

ibis-server/app/mdl/rewriter.py (1)

45-47: LGTM! Data source awareness added to embedded rewriter.

The change correctly threads data_source through to EmbeddedEngineRewriter for dialect-aware context creation.

ibis-server/app/model/error.py (1)

8-8: LGTM! Updated to correct Starlette constant.

The change from HTTP_422_UNPROCESSABLE_ENTITY to HTTP_422_UNPROCESSABLE_CONTENT aligns with modern Starlette conventions.

wren-core-py/src/errors.rs (1)

92-96: LGTM! Error conversion follows established patterns.

The From<ParsedDataSourceError> implementation is consistent with other error conversions in this file and properly integrates the new data source error type.

wren-core/core/src/mdl/function/dialect/mod.rs (1)

1-2: LGTM! Clean module organization.

The public bigquery module and private utils module are appropriately scoped for the dialect-specific function list feature.

wren-core/core/src/mdl/function/scalar/mod.rs (1)

12-12: LGTM! Macro rename improves clarity.

The rename from make_udf_function to make_datafusion_udf_function makes the DataFusion dependency explicit and improves code readability.

wren-core/wren-example/examples/to-many-calculation.rs (1)

20-20: LGTM: API update aligns with new data_source parameter.

The addition of the second None parameter correctly adapts to the updated create_wren_ctx signature, which now accepts an optional data_source.

wren-core/core/src/mdl/function/mod.rs (1)

2-2: LGTM: Public dialect module exposure.

The addition of pub mod dialect; appropriately exposes dialect-specific function registries for external access, supporting the BigQuery dialect implementation.

ibis-server/tools/query_local_run.py (1)

80-80: LGTM: data_source parameter threaded to SessionContext.

The addition of the data_source argument properly threads the data source context from the environment variable to the session initialization, enabling dialect-aware behavior.

wren-core/sqllogictest/src/test_context.rs (1)

66-66: LGTM: Test context updated for new API.

The addition of None as the second parameter correctly adapts the test harness to the updated create_wren_ctx signature, maintaining default (non-dialect-specific) behavior for tests.

ibis-server/app/mdl/core.py (1)

7-15: LGTM: Public API extended with data_source parameter.

The addition of the optional data_source parameter to get_session_context properly extends the public API while maintaining backward compatibility through the default None value. The parameter is correctly threaded to the underlying wren_core.SessionContext.

wren-core/wren-example/examples/plan-sql.rs (2)

7-8: LGTM: Updated imports for dialect support.

The addition of DataSource and create_wren_ctx imports correctly supports the dialect-aware context creation demonstrated in this example.


22-22: LGTM: Demonstrates BigQuery dialect usage.

The explicit use of Some(&DataSource::BigQuery) in the context creation effectively demonstrates the new dialect-aware functionality, ensuring BigQuery-specific UDFs are registered.

wren-core/core/src/mdl/function/macros.rs (4)

6-21: LGTM: Renamed macro for clarity.

Renaming the original macro to make_datafusion_udf_function clearly distinguishes it from the new expression-based make_udf_function macro, preventing confusion between the type-based (<$UDF>::new()) and expression-based approaches.


23-39: LGTM: Expression-based UDF macro added.

The new make_udf_function macro appropriately accepts a pre-constructed UDF expression ($UDF:expr) and wraps it in a singleton pattern. This enables more flexible UDF registration, particularly for dialect-specific functions.


41-57: LGTM: Expression-based UDAF macro added.

The make_udaf_function macro follows the same pattern as make_udf_function, providing consistent expression-based aggregate function registration.


59-75: LGTM: Expression-based UDWF macro added.

The make_udwf_function macro completes the set of expression-based macros, providing consistent window function registration following the same singleton pattern.

wren-core/core/src/mdl/function/dialect/bigquery/window.rs (2)

8-19: LGTM: percentile_cont window function definition.

The percentile_cont function is correctly defined with:

  • ReturnType::SameAsInput (appropriate for continuous percentiles)
  • Signature::any(2, Volatility::Immutable) (correct for 2-argument percentile function)
  • Clear documentation with SQL example

21-32: LGTM: percentile_disc window function definition.

The percentile_disc function is correctly defined with:

  • ReturnType::SameAsInput (appropriate for discrete percentiles)
  • Signature::any(2, Volatility::Immutable) (correct for 2-argument percentile function)
  • Clear documentation with SQL example

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
wren-core-py/src/context.rs (1)

116-132: Catch disagreements between manifest and explicit data_source.

Right now we silently prefer the manifest value when both sources are present. If callers accidentally pass a conflicting data_source, they won’t discover the mismatch until much later. Please fail fast in that case so we surface the configuration error immediately.

You can parse the optional argument up front, compare, and only proceed when they agree:

-        // If the manifest has a data source, use it.
-        // Otherwise, if the data_source parameter is provided, use it.
-        // Otherwise, use None.
-        let data_source = if let Some(ds) = &manifest.data_source {
-            Some(*ds)
-        } else if let Some(ds_str) = data_source {
-            Some(DataSource::from_str(ds_str).map_err(CoreError::from)?)
-        } else {
-            None
-        };
+        let param_data_source = data_source
+            .map(|ds| DataSource::from_str(ds).map_err(CoreError::from))
+            .transpose()?;
+
+        if let (Some(manifest_ds), Some(param_ds)) = (manifest.data_source, param_data_source) {
+            if manifest_ds != param_ds {
+                return Err(CoreError::new(
+                    format!(
+                        "Manifest data_source '{manifest_ds:?}' conflicts with explicit data_source '{param_ds:?}'"
+                    )
+                    .as_str(),
+                )
+                .into());
+            }
+        }
+
+        // If the manifest has a data source, use it; otherwise fall back to the parsed argument.
+        let data_source = manifest.data_source.or(param_data_source);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83221b3 and 253ffc0.

📒 Files selected for processing (4)
  • wren-core-py/src/context.rs (4 hunks)
  • wren-core-py/tests/test_modeling_core.py (2 hunks)
  • wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs (1 hunks)
  • wren-core/core/src/mdl/mod.rs (79 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
wren-core/core/src/mdl/mod.rs (3)
wren-core/core/src/mdl/dialect/inner_dialect.rs (1)
  • get_inner_dialect (112-121)
wren-core-py/src/context.rs (1)
  • new (82-208)
wren-core/core/src/mdl/function/remote_function.rs (3)
  • new_with_return_type (201-211)
  • new_with_return_type (317-328)
  • new_with_return_type (400-410)
wren-core-py/src/context.rs (6)
wren-core/core/src/mdl/mod.rs (7)
  • mdl (198-233)
  • data_source (327-329)
  • new (129-184)
  • new (567-569)
  • default (65-73)
  • create_wren_ctx (368-396)
  • register_remote_function (529-554)
wren-core-py/src/extractor.rs (2)
  • mdl_base64 (174-234)
  • new (22-29)
wren-core-base/manifest-macro/src/lib.rs (2)
  • data_source (61-116)
  • manifest (26-56)
wren-core-py/src/errors.rs (12)
  • new (18-22)
  • from (26-28)
  • from (32-34)
  • from (38-40)
  • from (44-46)
  • from (50-52)
  • from (56-65)
  • from (69-71)
  • from (75-77)
  • from (81-83)
  • from (87-89)
  • from (93-95)
wren-core-base/src/mdl/manifest.rs (2)
  • new (110-114)
  • from_str (158-183)
wren-core-py/src/manifest.rs (1)
  • to_manifest (18-23)
wren-core-py/tests/test_modeling_core.py (4)
wren-core/core/src/mdl/mod.rs (1)
  • data_source (327-329)
wren-core-base/manifest-macro/src/lib.rs (1)
  • data_source (61-116)
wren-core-base/src/mdl/py_method.rs (1)
  • data_source (49-51)
wren-core-base/src/mdl/builder.rs (1)
  • data_source (90-93)
wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs (2)
wren-core/core/src/mdl/function/remote_function.rs (8)
  • build_document (228-246)
  • new (187-199)
  • new (286-299)
  • new (386-398)
  • from (215-225)
  • from (332-342)
  • from (414-423)
  • new_with_alias (301-315)
wren-core/core/src/mdl/function/dialect/utils.rs (1)
  • build_document (3-5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ci
  • GitHub Check: test
  • GitHub Check: cargo check
🔇 Additional comments (4)
wren-core/core/src/mdl/function/dialect/bigquery/aggregate.rs (2)

93-104: Past issue resolved: array_concat_agg return type is now correct.

The previous review flagged that array_concat_agg was incorrectly using ReturnType::ArrayOfInputFirstArgument, which would double-wrap array types. The function now correctly uses ReturnType::SameAsInput, ensuring that input arrays maintain their type structure (e.g., ARRAY<INT64> stays ARRAY<INT64> rather than becoming ARRAY<ARRAY<INT64>>).


1-209: Well-structured BigQuery aggregate function definitions.

The implementation is solid:

  • All 14 UDAFs have correct return types matching BigQuery semantics
  • Signatures appropriately balance flexibility (any(n)) with type safety (coercible signatures for logical_and/logical_or)
  • Consistent, clear documentation with SQL examples for each function
  • Proper handling of the deprecated group_concat legacy function
  • The countif alias registration follows convention
wren-core-py/tests/test_modeling_core.py (1)

20-21: DataFusion fixture keeps the coverage relevant.

Switching the shared manifest to datafusion means the CSV-driven remote-function tests still exercise the registration path now that BigQuery relies on the dialect registry. Nice catch.

wren-core/core/src/mdl/mod.rs (1)

369-387: Dialect-aware registration looks solid.

Passing the manifest’s data source into create_wren_ctx and letting the inner dialect supply the UDF registries keeps the BigQuery lists isolated without disturbing the generic fallback.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ibis-server/wren/session/__init__.py (1)

29-31: Constructor call missing data_source argument for consistency.

The EmbeddedEngineRewriter constructor now accepts an optional data_source parameter (see rewriter.py:135-136), and the Context has self.data_source available. However, this constructor call doesn't pass it, relying instead on passing data_source as a parameter to rewrite_sync at line 100.

This is related to the inconsistency flagged in rewriter.py. For consistency with the Rewriter usage pattern (which passes data_source to the constructor), consider updating this call.

Apply this diff:

 self.rewriter = EmbeddedEngineRewriter(
-    get_config().get_remote_function_list_path(data_source)
+    get_config().get_remote_function_list_path(data_source),
+    data_source
 )

Then update line 100 to remove the parameter (assuming rewrite_sync is refactored per the previous comment):

 self.planned_sql = self.context.rewriter.rewrite_sync(
-    self.manifest, self.wren_sql, self.properties, self.context.data_source
+    self.manifest, self.wren_sql, self.properties
 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 253ffc0 and c17a6a5.

📒 Files selected for processing (2)
  • ibis-server/app/mdl/rewriter.py (4 hunks)
  • ibis-server/wren/session/__init__.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
ibis-server/wren/session/__init__.py (1)
wren-core-base/manifest-macro/src/lib.rs (1)
  • manifest (26-56)
ibis-server/app/mdl/rewriter.py (2)
ibis-server/app/model/data_source.py (1)
  • DataSource (60-214)
ibis-server/app/mdl/core.py (1)
  • get_session_context (7-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ci

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
wren-core/core/src/mdl/dialect/inner_dialect.rs (1)

98-111: Consider adding documentation for the new trait methods.

The implementation correctly provides default methods for dialect-specific function registration, which aligns well with the PR objectives. However, adding doc comments would improve maintainability by explaining when and how these methods should be overridden.

Example documentation:

/// Define the supported UDFs for the dialect which will be registered in the execution context.
/// 
/// By default, returns the standard DataFusion scalar functions. Override this method
/// to provide dialect-specific scalar functions.
fn supported_udfs(&self) -> Vec<Arc<datafusion::logical_expr::ScalarUDF>> {
    scalar_functions()
}

/// Define the supported UDAFs for the dialect which will be registered in the execution context.
/// 
/// By default, returns the standard DataFusion aggregate functions. Override this method
/// to provide dialect-specific aggregate functions.
fn supported_udafs(&self) -> Vec<Arc<datafusion::logical_expr::AggregateUDF>> {
    aggregate_functions()
}

/// Define the supported UDWFs for the dialect which will be registered in the execution context.
/// 
/// By default, returns the standard DataFusion window functions. Override this method
/// to provide dialect-specific window functions.
fn supported_udwfs(&self) -> Vec<Arc<datafusion::logical_expr::WindowUDF>> {
    window_functions()
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 269eed0 and 0030b62.

📒 Files selected for processing (3)
  • ibis-server/app/mdl/rewriter.py (4 hunks)
  • ibis-server/wren/session/__init__.py (1 hunks)
  • wren-core/core/src/mdl/dialect/inner_dialect.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/app/mdl/rewriter.py
🧰 Additional context used
🧬 Code graph analysis (1)
wren-core/core/src/mdl/dialect/inner_dialect.rs (5)
wren-core/core/src/mdl/dialect/utils.rs (2)
  • scalar_function_to_sql_internal (51-76)
  • args (30-48)
wren-core/core/src/mdl/function/dialect/bigquery/mod.rs (3)
  • bigquery_aggregate_functions (300-338)
  • bigquery_scalar_functions (73-298)
  • bigquery_window_functions (340-356)
wren-core/core/src/mdl/function/aggregate/mod.rs (1)
  • aggregate_functions (11-51)
wren-core/core/src/mdl/function/scalar/mod.rs (1)
  • scalar_functions (18-192)
wren-core/core/src/mdl/function/window/mod.rs (1)
  • window_functions (5-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test
  • GitHub Check: ci
  • GitHub Check: cargo check
🔇 Additional comments (4)
ibis-server/wren/session/__init__.py (1)

99-103: LGTM! Formatting improvement enhances readability.

The multi-line formatting of the rewrite_sync call improves code readability while preserving the original semantics.

wren-core/core/src/mdl/dialect/inner_dialect.rs (3)

20-21: LGTM!

The Arc import is necessary for the new trait method return types.


23-26: LGTM!

The imports correctly bring in both BigQuery-specific and default function lists needed for the new dialect-specific registration mechanism.


152-162: LGTM!

The BigQuery dialect correctly overrides the new methods to provide BigQuery-specific function lists, enabling proper function registration for BigQuery data sources.

@goldmedal
Copy link
Copy Markdown
Contributor Author

The Oracle test is flaky in github flow. I have tested it locally

poetry run pytest -m 'oracle'
=========================================================================== test session starts ============================================================================
platform darwin -- Python 3.11.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 396 items / 377 deselected / 19 selected                                                                                                                         

tests/routers/v2/connector/test_oracle.py ...........                                                                                                                             [ 57%]
tests/routers/v3/connector/oracle/test_function.py ...                                                                                                                            [ 73%]
tests/routers/v3/connector/oracle/test_query.py .....  

@goldmedal goldmedal requested a review from douenergy November 11, 2025 08:41
@douenergy
Copy link
Copy Markdown
Contributor

Thanks @goldmedal

@douenergy douenergy merged commit 180af86 into Canner:main Nov 18, 2025
9 of 10 checks passed
nhaluc1005 pushed a commit to nhaluc1005/text2sql-practice that referenced this pull request Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bigquery core ibis python Pull requests that update Python code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants