refactor(core): self-manage the list of the core functions#1346
refactor(core): self-manage the list of the core functions#1346douenergy merged 5 commits intoCanner:mainfrom
Conversation
WalkthroughAdds a new SessionContext factory ( Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller
participant Factory as mdl::create_wren_ctx
participant Session as SessionContext
participant Reg as function registries
participant Apply as mdl::context::apply_wren_on_ctx
participant MDLA as AnalyzedWrenMDL
Caller->>Factory: create_wren_ctx(Optional<SessionConfig>)
Factory->>Reg: gather scalar/agg/window/table functions
Reg-->>Factory: Vec<Arc<...>> sets
Factory->>Session: build SessionContext with registered functions
Factory-->>Caller: SessionContext
Caller->>Apply: apply_wren_on_ctx(SessionContext, MDLA, props, Mode)
Apply->>Session: apply analyzed MDL with Mode
Apply-->>Caller: Result<SessionContext>
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (10)
🚧 Files skipped from review as they are similar to previous changes (4)
🧰 Additional context used🧬 Code graph analysis (3)wren-core/core/src/mdl/mod.rs (2)
wren-core/sqllogictest/src/test_context.rs (2)
wren-core-py/src/context.rs (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (9)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
wren-core/core/src/mdl/context.rs (1)
387-396: Avoid panic in TableProvider::scan; return a proper errorunreachable! will abort the process if rules fail to replace this node. Return a DataFusionError instead.
- ) -> Result<Arc<dyn ExecutionPlan>> { - unreachable!("WrenDataSource should be replaced before physical planning") - } + ) -> Result<Arc<dyn ExecutionPlan>> { + Err(datafusion::common::DataFusionError::Internal( + "WrenDataSource should be replaced before physical planning".to_string(), + )) + }Add import (outside this hunk):
use datafusion::common::DataFusionError;wren-core/wren-example/examples/to-many-calculation.rs (1)
4-6: Fix import for Result (DataFusion 48 moved it to common)This won’t compile with 48.x. Use datafusion::common::Result.
-use datafusion::error::Result; +use datafusion::common::Result;Optionally remove unused
SessionContextimport.wren-core-py/src/context.rs (1)
324-362: Harden parsing: avoid unwrap and handle NULLs from information_schemaFunctionType::from_str(..).unwrap() can panic; description may be NULL. Handle errors and NULLs gracefully.
Apply this diff:
- async fn get_registered_functions( + async fn get_registered_functions( ctx: &wren_core::SessionContext, ) -> PyResult<Vec<RemoteFunctionDto>> { let sql = r#" SELECT DISTINCT r.routine_name as name, r.function_type, r.description FROM information_schema.routines r "#; let batches = ctx .sql(sql) .await .map_err(CoreError::from)? .collect() .await .map_err(CoreError::from)?; let mut functions = vec![]; for batch in batches { - let name_array = batch.column(0).as_string::<i32>(); - let function_type_array = batch.column(1).as_string::<i32>(); - let description_array = batch.column(2).as_string::<i32>(); + let name_array = batch.column(0).as_string::<i32>(); + let function_type_array = batch.column(1).as_string::<i32>(); + let description_array = batch.column(2).as_string::<i32>(); for row in 0..batch.num_rows() { - let name = name_array.value(row).to_string(); - let description = description_array.value(row).to_string(); - let function_type = function_type_array.value(row).to_string(); - - functions.push(RemoteFunctionDto { - name, - description: Some(description), - function_type: FunctionType::from_str(&function_type).unwrap(), - }); + // name is expected non-null + if name_array.is_null(row) || function_type_array.is_null(row) { + continue; + } + let name = name_array.value(row).to_string(); + let function_type_str = function_type_array.value(row); + let function_type = FunctionType::from_str(function_type_str) + .map_err(|e| CoreError::new(&format!("Unknown function_type '{function_type_str}': {e}")))?; + let description = if description_array.is_null(row) { + None + } else { + Some(description_array.value(row).to_string()) + }; + functions.push(RemoteFunctionDto { + name, + description, + function_type, + }); } } Ok(functions) }
🧹 Nitpick comments (9)
wren-core/core/src/mdl/function/window/mod.rs (1)
5-19: Avoid re-allocating UDFs: cache with Lazy and clone ArcsConstructing this Vec on every call is unnecessary. Cache once with Lazy, return clones (cheap).
+use once_cell::sync::Lazy; @@ -pub fn window_functions() -> Vec<Arc<WindowUDF>> { - vec![ - cume_dist::cume_dist_udwf(), - row_number::row_number_udwf(), - lead_lag::lead_udwf(), - lead_lag::lag_udwf(), - rank::rank_udwf(), - rank::dense_rank_udwf(), - rank::percent_rank_udwf(), - ntile::ntile_udwf(), - nth_value::first_value_udwf(), - nth_value::last_value_udwf(), - nth_value::nth_value_udwf(), - ] -} +static WINDOW_FUNCS: Lazy<Vec<Arc<WindowUDF>>> = Lazy::new(|| { + vec![ + cume_dist::cume_dist_udwf(), + row_number::row_number_udwf(), + lead_lag::lead_udwf(), + lead_lag::lag_udwf(), + rank::rank_udwf(), + rank::dense_rank_udwf(), + rank::percent_rank_udwf(), + ntile::ntile_udwf(), + nth_value::first_value_udwf(), + nth_value::last_value_udwf(), + nth_value::nth_value_udwf(), + ] +}); + +pub fn window_functions() -> Vec<Arc<WindowUDF>> { + WINDOW_FUNCS.clone() +}wren-core/core/src/mdl/function/scalar/mod.rs (2)
12-186: Cache the large scalar function set; clone Arcs per callThis allocates/builds 100+ UDFs each call. Cache once with Lazy to avoid startup overhead on each context creation.
+use once_cell::sync::Lazy; @@ -pub fn scalar_functions() -> Vec<Arc<ScalarUDF>> { - vec![ +static SCALAR_FUNCS: Lazy<Vec<Arc<ScalarUDF>>> = Lazy::new(|| { + vec![ // datefusion core nullif(), @@ - map_values::map_values_udf(), - ] -} + map_values::map_values_udf(), + ] +}); + +pub fn scalar_functions() -> Vec<Arc<ScalarUDF>> { + SCALAR_FUNCS.clone() +}
146-149: Remove redundant fully-qualified paths for consistencyYou already
use datafusion::functions_nested::*;. These can be shortened.- datafusion::functions_nested::concat::array_append_udf(), - datafusion::functions_nested::concat::array_prepend_udf(), - datafusion::functions_nested::concat::array_concat_udf(), + concat::array_append_udf(), + concat::array_prepend_udf(), + concat::array_concat_udf(), @@ - datafusion::functions_nested::repeat::array_repeat_udf(), + repeat::array_repeat_udf(), @@ - datafusion::functions_nested::reverse::array_reverse_udf(), + reverse::array_reverse_udf(), @@ - datafusion::functions_nested::replace::array_replace_n_udf(), - datafusion::functions_nested::replace::array_replace_all_udf(), - datafusion::functions_nested::replace::array_replace_udf(), + replace::array_replace_n_udf(), + replace::array_replace_all_udf(), + replace::array_replace_udf(),Also applies to: 166-169, 177-179
wren-core/core/src/mdl/context.rs (1)
62-66: Use typed boolean for config option (avoid Utf8 "false")Config options in DataFusion 48 are strongly typed. Set boolean as Boolean, not Utf8.
.set( "datafusion.sql_parser.enable_ident_normalization", - &ScalarValue::Utf8(Some("false".to_string())), + &ScalarValue::Boolean(Some(false)), )Based on learnings
wren-core/core/src/mdl/function/table/mod.rs (1)
8-11: Cache table functions via LazySame rationale: avoid re-creating function objects on each call.
+use once_cell::sync::Lazy; @@ -pub fn table_functions() -> Vec<Arc<TableFunction>> { - vec![generate_series(), range()] -} +static TABLE_FUNCS: Lazy<Vec<Arc<TableFunction>>> = + Lazy::new(|| vec![generate_series(), range()]); + +pub fn table_functions() -> Vec<Arc<TableFunction>> { + TABLE_FUNCS.clone() +}wren-core/wren-example/examples/to-many-calculation.rs (1)
80-85: API usage LGTM; consider avoiding shadowingctx
let ctx = ...;shadows the earlierctx. Rename towren_ctxfor clarity (optional).wren-core-py/src/context.rs (1)
171-187: Unify on apply_wren_on_ctx (deprecate create_ctx_with_mdl)To align with core API and avoid maintaining two paths, replace create_ctx_with_mdl with apply_wren_on_ctx.
Apply this diff for the two call sites:
- let unparser_ctx = runtime - .block_on(create_ctx_with_mdl( - &ctx, - Arc::clone(&analyzed_mdl), - Arc::clone(&properties_ref), - mdl::context::Mode::Unparse, - )) + let unparser_ctx = runtime + .block_on(mdl::context::apply_wren_on_ctx( + &ctx, + Arc::clone(&analyzed_mdl), + Arc::clone(&properties_ref), + mdl::context::Mode::Unparse, + )) .map_err(CoreError::from)?; - let exec_ctx = runtime - .block_on(create_ctx_with_mdl( - &ctx, - Arc::clone(&analyzed_mdl), - Arc::clone(&properties_ref), - mdl::context::Mode::LocalRuntime, - )) + let exec_ctx = runtime + .block_on(mdl::context::apply_wren_on_ctx( + &ctx, + Arc::clone(&analyzed_mdl), + Arc::clone(&properties_ref), + mdl::context::Mode::LocalRuntime, + )) .map_err(CoreError::from)?;Also update the import:
// replace use wren_core::mdl::context::create_ctx_with_mdl; // with use wren_core::mdl::context::apply_wren_on_ctx;wren-core/core/src/mdl/function/aggregate/mod.rs (1)
1-51: Avoid rebuilding the aggregate list on every contextaggregate_functions() allocates a large Vec each call. Cache once and clone.
Apply this diff:
-use std::sync::Arc; +use std::sync::{Arc, OnceLock}; @@ -pub fn aggregate_functions() -> Vec<Arc<AggregateUDF>> { - vec![ +static AGG_FUNCS: OnceLock<Vec<Arc<AggregateUDF>>> = OnceLock::new(); + +pub fn aggregate_functions() -> Vec<Arc<AggregateUDF>> { + AGG_FUNCS.get_or_init(|| { + vec![ array_agg::array_agg_udaf(), @@ - nth_value::nth_value_udaf(), - ] + nth_value::nth_value_udaf(), + ]}) + .clone() }wren-core/core/src/mdl/mod.rs (1)
366-383: Optional: set defaults explicitly for test determinismConsider wiring a default SessionConfig here (when None) to disable collect_statistics for faster/more deterministic tests and always enable information_schema.
Example:
let cfg = SessionConfig::new() .with_information_schema(true) .set_bool("datafusion.execution.collect_statistics", false); let builder = SessionStateBuilder::new().with_config(cfg) ...Based on learnings (DataFusion 48 collects stats by default).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
wren-core-py/src/context.rs(3 hunks)wren-core/core/src/mdl/context.rs(1 hunks)wren-core/core/src/mdl/function/aggregate/mod.rs(1 hunks)wren-core/core/src/mdl/function/mod.rs(1 hunks)wren-core/core/src/mdl/function/scalar/mod.rs(1 hunks)wren-core/core/src/mdl/function/table/mod.rs(1 hunks)wren-core/core/src/mdl/function/window/mod.rs(1 hunks)wren-core/core/src/mdl/mod.rs(78 hunks)wren-core/sqllogictest/src/test_context.rs(5 hunks)wren-core/wren-example/examples/to-many-calculation.rs(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
wren-core/wren-example/examples/to-many-calculation.rs (2)
wren-core/core/src/mdl/mod.rs (2)
mdl(197-232)create_wren_ctx(367-382)wren-core/core/src/mdl/context.rs (1)
apply_wren_on_ctx(46-117)
wren-core-py/src/context.rs (2)
wren-core/core/src/mdl/mod.rs (4)
mdl(197-232)create_wren_ctx(367-382)new(128-183)new(553-555)wren-core-py/src/remote_functions.rs (2)
from(65-99)from(103-139)
wren-core/sqllogictest/src/test_context.rs (2)
wren-core/core/src/mdl/mod.rs (2)
mdl(197-232)create_wren_ctx(367-382)wren-core/core/src/mdl/context.rs (1)
apply_wren_on_ctx(46-117)
wren-core/core/src/mdl/function/mod.rs (4)
wren-core/core/src/mdl/function/aggregate/mod.rs (1)
aggregate_functions(11-51)wren-core/core/src/mdl/function/scalar/mod.rs (1)
scalar_functions(12-186)wren-core/core/src/mdl/function/table/mod.rs (1)
table_functions(9-11)wren-core/core/src/mdl/function/window/mod.rs (1)
window_functions(5-19)
wren-core/core/src/mdl/mod.rs (6)
wren-core/core/src/mdl/context.rs (3)
apply_wren_on_ctx(46-117)new(336-366)properties(93-99)wren-core-py/src/context.rs (3)
new(81-203)remote_functions(88-91)register_remote_function(282-301)wren-core/core/src/mdl/function/scalar/mod.rs (1)
scalar_functions(12-186)wren-core/core/src/mdl/function/aggregate/mod.rs (1)
aggregate_functions(11-51)wren-core/core/src/mdl/function/window/mod.rs (1)
window_functions(5-19)wren-core/core/src/mdl/function/table/mod.rs (1)
table_functions(9-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: cargo check
- GitHub Check: test
- GitHub Check: ci
🔇 Additional comments (5)
wren-core-py/src/context.rs (2)
94-95: Good switch to factory contextUsing mdl::create_wren_ctx(Some(config)) ensures core functions are pre-registered before Python-layer remote UDFs. LGTM.
229-235: OK: function introspection uses exec_ctxQuerying get_registered_functions on exec_ctx (LocalRuntime) is appropriate. LGTM.
wren-core/core/src/mdl/function/mod.rs (1)
1-10: Centralized function registry exports look goodClean module split and re-exports. LGTM.
wren-core/core/src/mdl/mod.rs (1)
415-421: LGTM: apply_wren_on_ctx adoptionMoving from create_ctx_with_mdl to apply_wren_on_ctx is consistent with the new flow; preserves registered functions via new_from_existing.
Also applies to: 480-488
wren-core/sqllogictest/src/test_context.rs (1)
66-67: Test harness migration looks correct
- Use of create_wren_ctx(Some(config)) for deterministic test config.
- Applying MDL via apply_wren_on_ctx(Mode::LocalRuntime) at setup sites is consistent.
Also applies to: 81-89, 314-321, 550-556
513f9c7 to
45e3ff9
Compare
|
Thanks @goldmedal |
Description
To support different dialects, it is better to own the function list ourselves instead of using the default list provided by DataFusion. This refactoring helps define the function signature for each dialect.
Summary by CodeRabbit
New Features
Refactor
Tests
Chores