Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SessionStateBuilder and extract out the registration of defaults #11403

Merged
merged 10 commits into from
Jul 14, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Jul 10, 2024

SessionStateBuilder has been added and the existing SessionState::new* functions updated to use the builder to construct a SessionState. Defaults previously set in the new* functions have been extracted out into SessionStateDefaults

Which issue does this PR close?

Closes #11320

Rationale for this change

Adds the ability to fine grain control how a SessionState is constructed.

What changes are included in this PR?

Code.

Are these changes tested?

All current tests pass

Are there any user-facing changes?

All the builder style functions on SessionState have been deprecated.

api change label required.

@github-actions github-actions bot added core Core DataFusion crate substrait labels Jul 10, 2024
@Omega359 Omega359 marked this pull request as ready for review July 11, 2024 13:11

new_self
SessionStateBuilder::new_with_config_rt(config, runtime)
.with_defaults(true)
Copy link
Contributor

@jayzhan211 jayzhan211 Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can make new_with_config_rt with default: false. And call with_defaults() to set defaults

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to mention / point at SessionStateBuilder in the docs here

As a follow on PR it probably makes sense to deprecate SessionState::new_with_config and new_with_config... and redirect people to use SessionStateBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I waffled back and forth as whether to have default set to false or true by, um, default and decided that it would likely be best to mirror the current system behavior. That being said, it's simple to change.

Copy link
Contributor

@jayzhan211 jayzhan211 Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need either session state with default or not. If we have builder that has default: true , we need another function disable_default() or something similar to get the plain session state. If we have builder that has default: false, we need function like with_default to get the session state with builtin. Both are fine to me, as long as builtin is optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 💯

alamb
alamb previously approved these changes Jul 12, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Omega359 -- I think this PR is looking really nice 😍

I had some API improvement suggestions, but I actually think they could be done as follow on PRs as this PR maintains backwards compatibility with the existing APIs.

Specifically I am thinking we could iterate on the internal implementation of SessionStateBuilder easily.

It might make sense to take a quick look at the suggestions for public API changes (e.g. new() vs new_with_config_and_rt() and how use_defaults works) but otherwise 👍

Let me know what you think


new_self
SessionStateBuilder::new_with_config_rt(config, runtime)
.with_defaults(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to mention / point at SessionStateBuilder in the docs here

As a follow on PR it probably makes sense to deprecate SessionState::new_with_config and new_with_config... and redirect people to use SessionStateBuilder

SessionStateBuilder::new_with_config_rt(config, runtime)
.with_defaults(true)
.with_catalog_list(catalog_list)
.build()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so nice ❤️

impl SessionStateBuilder {
/// Returns new [`SessionStateBuilder`] using the provided
/// [`SessionConfig`] and [`RuntimeEnv`].
pub fn new_with_config_rt(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way we could do this would be simply

let builder = SessionStateBuilder::new()
  .with_config(config)
  .with_runtime_env(runtime_env);

I think the original concern with an API like this is that this would likely create a new SessionConfig::default() just to replace it with config

However, I think we could optimize this eventually (e.g. by deferring the creation of SessionState until calling build() for example) and if we are going to change the API anyways I think we should consider doing the "better" thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that originally however I decided to switch just to mirror the existing behavior. It's a simple change, just would need to document that .build() will fail if config and runtime_env are not provided (assuming that we don't just use defaults there)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that if the config and runtime weren't provided then it would make the defaults

Something like this perhaps

fn build(self) -> SessionConfig {
  let Self { config, ... } = self;
  // config is None if not provided explicitly
  let config  = config.unwrap_or_else(|| SessionConfig::default);
  ...
}

Comment on lines 860 to 882
Self {
state: SessionState {
session_id,
analyzer: Analyzer::new(),
expr_planners: vec![],
optimizer: Optimizer::new(),
physical_optimizers: PhysicalOptimizer::new(),
query_planner: Arc::new(DefaultQueryPlanner {}),
catalog_list,
table_functions: HashMap::new(),
scalar_functions: HashMap::new(),
aggregate_functions: HashMap::new(),
window_functions: HashMap::new(),
serializer_registry: Arc::new(EmptySerializerRegistry),
file_formats: HashMap::new(),
table_options: TableOptions::default_from_session_config(
config.options(),
),
config,
execution_props: ExecutionProps::new(),
table_factories: HashMap::new(),
runtime_env,
function_factory: None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using an inner SessionState, it might be possible to simply replicate the fields on the builder (possibly in Option) -- while this would involve replicated code, it would make doing things like deferred creation of session config and runtime env easier

However, since this is an implementation detail we could also change it as a follow on PR without much issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that however then the builder would need accessors functions so that the end user could actually query into any defaults set and update as required. Not horrendous but not typical of a normal Builder, at least in languages I've used this pattern with in the past.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable -- we could try it as a follow on PR too

/// Returns a new [SessionStateBuilder] based on an existing [SessionState]
/// The session id for the new builder will be set to a unique value; all
/// other fields will be cloned from what is set in the provided session state
pub fn new_from_existing(existing: &SessionState) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this clones SessionState anyways, I think it would make more sense for this to take the argument by value

Suggested change
pub fn new_from_existing(existing: &SessionState) -> Self {
pub fn new_from_existing(existing: SessionState) -> Self {

The rationale being there if the caller already has an owned SessionState they can use that and avoid a clone. This API basically requires them to do a clone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this function, another nice to have UX feature would be a From impl, so users could do

let state: SessionState = get_existing_state();
let state = SessionStateBuilder::from(state)
  .with_optimizer_pass(Arc::new(MyNewOptimizer{}))
  .build()

}
}

/// Set to true (default = true) if defaults for table_factories, expr_planners, file formats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍 this is perfect

I feel like defaults is a somewhat overloaded term. Perhaps we could improve the naming here. Some thoughts 🤔

  • with_built_ins()
  • with_install_defaults()
  • with_default_features()

Copy link
Contributor Author

@Omega359 Omega359 Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like .with_default_features() myself.

self.state.table_factories =
SessionStateDefaults::default_table_factories();
}
if self.state.expr_planners.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the semantics of "use_defaults" is effectively ignored if the user has provided their own pieces might be confusing

I think it would be easier to reason about if the defaults got installed as part of SessionBuilder::with_use_defaults() (rather than a flag that deferred the installation to the end) -- that way the user could decide if they wanted to modify the default lists or start from an empty list

What I am hoping long term is that we could pass in the defaults as an argument (and implement it as a trait) thus breaking the dependency between SessionState and core

let state = SessionStateBuilder::new()
  .with_defaults(SessionStateDefaults::new())?
  .build()

@Omega359 Omega359 marked this pull request as draft July 12, 2024 16:51
@alamb alamb dismissed their stale review July 12, 2024 18:36

More changes planned, will review then

@Omega359 Omega359 marked this pull request as ready for review July 12, 2024 23:15
@Omega359
Copy link
Contributor Author

I've updated the PR with the latest changes that I hope reflect all the feedback received thus far.

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is awesome -- thank you so much @Omega359 and @jayzhan211

I had a few comments, but I also think we could address them as follow on PRs as well

@@ -178,13 +180,18 @@ impl SchemaProvider for DynamicFileSchemaProvider {
// to any command options so the only choice is to use an empty collection
match scheme {
"s3" | "oss" | "cos" => {
state = state.add_table_options_extension(AwsOptions::default());
if let Some(table_options) = builder.table_options() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this code previously always registered the table options and now it only registers them if another table option was registered. Is that intentional?

Maybe it doesn't matter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well .... I updated all of the existing code in the project to reflect the change but I agree that it may be a bit of a surprise to anyone expecting the old behavior when upgrading. 6 of one, 1/2 dozen of another ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, sorry, misread the comment. I think it works only because the builder was created from an existing session state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can clean it up as a follow on -- it is not critical in my opinion

@@ -315,7 +320,7 @@ impl SessionContext {
}

/// Creates a new `SessionContext` using the provided [`SessionState`]
#[deprecated(since = "32.0.0", note = "Use SessionState::new_with_state")]
#[deprecated(since = "32.0.0", note = "Use SessionContext::new_with_state")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on PR perhaps we can remove this API entirely

We could probably change the note to say say "SessionStateBuilder"

let session_state =
SessionState::new_with_config_rt(SessionConfig::new(), runtime)
.with_query_planner(Arc::new(MyQueryPlanner {}));
let session_state = SessionStateBuilder::new()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this just looks so much nicer

.with_query_planner(Arc::new(TopKQueryPlanner {}))
.add_optimizer_rule(Arc::new(TopKOptimizerRule {}));
state.add_analyzer_rule(Arc::new(MyAnalyzerRule {}));
.add_optimizer_rule(Arc::new(TopKOptimizerRule {}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could make these names consistent (rather than add_optimizer_rule it could be with_optimizer_rule -- a follow on PR would be totally fine too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do, easy change.

@@ -89,9 +91,29 @@ use uuid::Uuid;
/// Execution context for registering data sources and executing queries.
/// See [`SessionContext`] for a higher level API.
///
/// Use the [`SessionStateBuilder`] to build a SessionState object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

/// The session id for the new builder will be unset; all other fields will
/// be cloned from what is set in the provided session state
pub fn new_from_existing(existing: SessionState) -> Self {
let cloned = existing.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since existing is owned, why do we need cloned here?

I tried removing this clone locally and it seemed to work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover from a previous iteration, I'll remove. thx for the catch

}
}

/// Set defaults for table_factories, file formats, expr_planners and builtin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great for this PR

As part of the larger catalog refactor, I think we'll ned to figure out how to get SessioNStateDefaults` out of this file -- but I have some ideas (specifically we can make the defaults a trait that could be passed to the builder).

Not for this PR


/// Add `physical_optimizer_rule` to the end of the list of
/// [`PhysicalOptimizerRule`]s used to rewrite queries.
pub fn add_physical_optimizer_rule(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just following the existing (inconsistent) API naming, but since we are making a new set of APIs, perhaps we could make these builder style APIs consistently start with with_


/// Add `optimizer_rule` to the end of the list of
/// [`OptimizerRule`]s used to rewrite queries.
pub fn add_optimizer_rule(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise it would be great to call this with_optimizer_rule

let config = self.config.unwrap_or_default();
let runtime_env = self.runtime_env.unwrap_or(Arc::new(RuntimeEnv::default()));

let mut state = SessionState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically, another way to write this code that might be more boiler plate but would let the compiler ensure all fields are used would be:

let Self {
 session_id, 
 analyzer, 
...
 // by explicitly listing all these fields, the compiler will complain if any fields are missed
} = self;

let session_id = session_id.unwrap_or_else(|| Uuid::new_v4().to_string());
let analyzer = analyzer.unwrap_or_default(),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. That's not an approach I've seen used but I can see how it could be useful. Updating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched a few places in the sql planner to use it after we got a few bugs due to by newly fields being (silently) ignored.

@Omega359
Copy link
Contributor Author

Updated build() method to use suggested let Self{..} = self; approach. Add_xxx functions renamed to with_xxx.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it 🚀

@alamb
Copy link
Contributor

alamb commented Jul 14, 2024

I think technically this is not an API change (though it does deprecate some existing apis) so not marking it

@alamb alamb merged commit 8475806 into apache:main Jul 14, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 14, 2024

Thanks again everyone! this is great

@Omega359 Omega359 deleted the feature/11320_session_state_builder branch July 14, 2024 23:33
@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Thanks @Omega359 and @jayzhan211

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…pache#11403)

* Create a SessionStateBuilder and use it for creating anything but a basic SessionState.

* Updated new_from_existing to take a reference to the existing SessionState and clone it.

* Minor documentation update.

* SessionStateDefaults improvements.

* Reworked how SessionStateBuilder works from PR feedback.

* Bug fix for missing array_expressions cfg feature.

* Review feedback updates + doc fixes for SessionStateDefaults

* Cargo fmt update.
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
…pache#11403)

* Create a SessionStateBuilder and use it for creating anything but a basic SessionState.

* Updated new_from_existing to take a reference to the existing SessionState and clone it.

* Minor documentation update.

* SessionStateDefaults improvements.

* Reworked how SessionStateBuilder works from PR feedback.

* Bug fix for missing array_expressions cfg feature.

* Review feedback updates + doc fixes for SessionStateDefaults

* Cargo fmt update.
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…pache#11403)

* Create a SessionStateBuilder and use it for creating anything but a basic SessionState.

* Updated new_from_existing to take a reference to the existing SessionState and clone it.

* Minor documentation update.

* SessionStateDefaults improvements.

* Reworked how SessionStateBuilder works from PR feedback.

* Bug fix for missing array_expressions cfg feature.

* Review feedback updates + doc fixes for SessionStateDefaults

* Cargo fmt update.
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 30, 2024
…pache#11403)

* Create a SessionStateBuilder and use it for creating anything but a basic SessionState.

* Updated new_from_existing to take a reference to the existing SessionState and clone it.

* Minor documentation update.

* SessionStateDefaults improvements.

* Reworked how SessionStateBuilder works from PR feedback.

* Bug fix for missing array_expressions cfg feature.

* Review feedback updates + doc fixes for SessionStateDefaults

* Cargo fmt update.
@alamb
Copy link
Contributor

alamb commented Jul 31, 2024

BTW @wiedld just moved InfluxDB 3.0 over to use this new API and it is 👌 -- thanks again @Omega359

kezhuw added a commit to kezhuw/datafusion that referenced this pull request Aug 6, 2024
SessionState::new_with_config_rt was deprecated in favor of
SessionStateBuilder in apache#11403 which is not shipped in 40.x.
lewiszlw pushed a commit that referenced this pull request Aug 6, 2024
…#11839)

SessionState::new_with_config_rt was deprecated in favor of
SessionStateBuilder in #11403 which is not shipped in 40.x.
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 10, 2024
The old constructor is deprecated.

Ref: apache/datafusion#11403
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 20, 2024
The old constructor is deprecated.

Ref: apache/datafusion#11403
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Aug 23, 2024
* update datafusion deps to point to githuc.com/apache/datafusion

Datafusion 41 is not yet released on crates.io.

* update TableProvider::scan

Ref: apache/datafusion#11516

* use SessionStateBuilder

The old constructor is deprecated.

Ref: apache/datafusion#11403

* update AggregateFunction

Upstream Changes:
- The field name was switched from `func_name` to func.
- AggregateFunctionDefinition was removed

Ref: apache/datafusion#11803

* update imports in catalog

Catlog API was extracted to a separate crate.

Ref: apache/datafusion#11516

* use appropriate path for approx_distinct

Ref: apache/datafusion#11644

* migrate AggregateExt to ExprFunctionExt

Also removed `sqlparser` dependency since it's re-exported upstream.

Ref: apache/datafusion#11550

* update regr_count tests for new return type

Ref: apache/datafusion#11731

* migrate from function-array to functions-nested

The package was renamed upstream.

Ref: apache/datafusion#11602

* cargo fmt

* lock datafusion deps to 41

* remove todo from cargo.toml

All the datafusion dependencies are re-exported, but I still need to figure out *why*.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract registering default features from SessionState and into its own function
3 participants