Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser option enable_options_value_normalization #11330

Merged
merged 17 commits into from
Jul 25, 2024

Conversation

xinlifoobar
Copy link
Contributor

@xinlifoobar xinlifoobar commented Jul 8, 2024

Which issue does this PR close?

Closes #10853

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate labels Jul 8, 2024
@berkaysynnada
Copy link
Contributor

Thanks @xinlifoobar working on this issue.

What I suggest is that we can let people to define their own normalizations. Rather than directing all values to lowercase, people can declare their own normalization methods for specific option keys. If we choose to do that in this PR, I believe the normalization process should be done during the assignment of table options.

@xinlifoobar
Copy link
Contributor Author

xinlifoobar commented Jul 8, 2024

Thanks @xinlifoobar working on this issue.

What I suggest is that we can let people to define their own normalizations. Rather than directing all values to lowercase, people can declare their own normalization methods for specific option keys. If we choose to do that in this PR, I believe the normalization process should be done during the assignment of table options.

The APIs that could be used in the datafusion-sql crate are limited. Do you think it is a good idea to add a method, e.g., get_options_normalized_function in ContextProvider?

pub trait ContextProvider {
/// Getter for a datasource
fn get_table_source(&self, name: TableReference) -> Result<Arc<dyn TableSource>>;
fn get_file_type(&self, _ext: &str) -> Result<Arc<dyn FileType>> {
not_impl_err!("Registered file types are not supported")
}
/// Getter for a table function
fn get_table_function_source(
&self,
_name: &str,
_args: Vec<Expr>,
) -> Result<Arc<dyn TableSource>> {
not_impl_err!("Table Functions are not supported")
}
/// This provides a worktable (an intermediate table that is used to store the results of a CTE during execution)
/// We don't directly implement this in the logical plan's ['SqlToRel`]
/// because the sql code needs access to a table that contains execution-related types that can't be a direct dependency
/// of the sql crate (namely, the `CteWorktable`).
/// The [`ContextProvider`] provides a way to "hide" this dependency.
fn create_cte_work_table(
&self,
_name: &str,
_schema: SchemaRef,
) -> Result<Arc<dyn TableSource>> {
not_impl_err!("Recursive CTE is not implemented")
}
/// Getter for a UDF description
fn get_function_meta(&self, name: &str) -> Option<Arc<ScalarUDF>>;
/// Getter for a UDAF description
fn get_aggregate_meta(&self, name: &str) -> Option<Arc<AggregateUDF>>;
/// Getter for a UDWF
fn get_window_meta(&self, name: &str) -> Option<Arc<WindowUDF>>;
/// Getter for system/user-defined variable type
fn get_variable_type(&self, variable_names: &[String]) -> Option<DataType>;
/// Get configuration options
fn options(&self) -> &ConfigOptions;
/// Get all user defined scalar function names
fn udf_names(&self) -> Vec<String>;
/// Get all user defined aggregate function names
fn udaf_names(&self) -> Vec<String>;
/// Get all user defined window function names
fn udwf_names(&self) -> Vec<String>;
}

@berkaysynnada
Copy link
Contributor

Thanks @xinlifoobar working on this issue.
What I suggest is that we can let people to define their own normalizations. Rather than directing all values to lowercase, people can declare their own normalization methods for specific option keys. If we choose to do that in this PR, I believe the normalization process should be done during the assignment of table options.

The APIs that could be used in the datafusion-sql crate are limited. Do you think it is a good idea to add a method, e.g., get_options_normalized_function a good idea?

SqlParserOptions is accessible, what do you think about adding it there? But I'm not sure if we can support dynamic method registration. My idea is to provide hard-coded functions like the one below, but I'm uncertain where to bind it:

lazy_static! {
    static ref NORMALIZATION_MAP: Mutex<HashMap<String, fn(String) -> String>> = {
        let mut m = HashMap::new();
        m.insert("hf.user_access_token".to_string(), lowercase as fn(String) -> String);
        // Other normalizations
        Mutex::new(m)
    };
}

fn lowercase(s: String) -> String {
    s.to_lowercase()
}
// Other normalizations

If the key is not found, use the default normalization or do nothing (according to your enable_options_value_normalization).

@xinlifoobar
Copy link
Contributor Author

But

IMO it won't be a good idea... I thought the *configs are static values. What should you do if you would keep a config file for function values?

@alamb
Copy link
Contributor

alamb commented Jul 9, 2024

What I suggest is that we can let people to define their own normalizations. Rather than directing all values to lowercase, people can declare their own normalization methods for specific option keys. If we choose to do that in this PR, I believe the normalization process should be done during the assignment of table options.

I wonder what is the usecase for a user defined normalization? It may be more complicated to code

I like the approach proposed in this PR. As I understand it if follows what we do for SQL: by default apply some hard coded normalization (e.g. to lower_case) and let users opt out of that normalization either by:

  1. Setting a config value like datafusion.sql_parser.enable_ident_normalization
  2. Enclose the value in "

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 10, 2024
@xinlifoobar xinlifoobar changed the title [Draft] option enable_options_value_normalization Add parser option enable_options_value_normalization Jul 10, 2024
@xinlifoobar xinlifoobar marked this pull request as ready for review July 10, 2024 13:20
@berkaysynnada
Copy link
Contributor

Hi again @xinlifoobar. The idea in my mind was something like this: synnada-ai@341b484.

I haven't been able to work on it much, and it can't be compiled right now, but you can understand what I mean as an idea. By removing the lowercase transformations we made while parsing the statement, just like in this PR, it seems that we can easily normalize the fields we want in the desired way.

cc @alamb

@xinlifoobar
Copy link
Contributor Author

Hi again @xinlifoobar. The idea in my mind was something like this: synnada-ai@341b484.

I haven't been able to work on it much, and it can't be compiled right now, but you can understand what I mean as an idea. By removing the lowercase transformations we made while parsing the statement, just like in this PR, it seems that we can easily normalize the fields we want in the desired way.

cc @alamb

Hi, I got the idea and am okay to make the change if we agree. The question is still the normalize go beyond the config itself. I am still unsure about this design because the normalize function is just for a small set of configs. in the future, if we have multiple functions for different purposes, the macro might become huge and difficult to read.

@xinlifoobar
Copy link
Contributor Author

Hi again @xinlifoobar. The idea in my mind was something like this: synnada-ai@341b484.
I haven't been able to work on it much, and it can't be compiled right now, but you can understand what I mean as an idea. By removing the lowercase transformations we made while parsing the statement, just like in this PR, it seems that we can easily normalize the fields we want in the desired way.
cc @alamb

Hi, I got the idea and am okay to make the change if we agree. The question is still the normalize go beyond the config itself. I am still unsure about this design because the normalize function is just for a small set of configs. in the future, if we have multiple functions for different purposes, the macro might become huge and difficult to read.

Hey @berkaysynnada @alamb are we all ok with this design? I plan to update this PR tomorrow.

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Hi @xinlifoobar -- I haven't had a chance to review this PR unfortunately. Perhaps @tinfoil-knight has some time to review the design as the original filer of #10853

Copy link
Contributor

@tinfoil-knight tinfoil-knight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good overall. Some minor changes requested. I prefer the current implementation that changes the normalization for all option values.

@berkaysynnada 's solution allows option-specific normalization but I don't think that kind of fine-grained control over option values is needed.


Sidenote:
We could add traits later on that'd allow users specify custom normalization behavior for identifiers and option values through SqlToRel::new_with_options.

// Example
pub trait Normalizer {
	fn normalize(&self, key: &str, value: Value) -> Result<String>;
}

datafusion/sql/tests/sql_integration.rs Outdated Show resolved Hide resolved
datafusion/sql/src/utils.rs Outdated Show resolved Hide resolved
datafusion/sql/src/statement.rs Show resolved Hide resolved
datafusion/sql/src/statement.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added documentation Improvements or additions to documentation development-process Related to development process of DataFusion logical-expr Logical plan and expressions labels Jul 17, 2024
@github-actions github-actions bot removed logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules substrait labels Jul 18, 2024
datafusion/sql/src/planner.rs Outdated Show resolved Hide resolved
datafusion/sql/src/utils.rs Outdated Show resolved Hide resolved
datafusion/sql/src/statement.rs Outdated Show resolved Hide resolved
@xinlifoobar xinlifoobar marked this pull request as ready for review July 21, 2024 03:10
Copy link
Contributor

@tinfoil-knight tinfoil-knight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb The current design and PR looks good to me. Can we get this merged?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this looks great to me. Than you @xinlifoobar and @tinfoil-knight

I'll plan to merge this tomorrow unless there are other comments.

cc @berkaysynnada

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@berkaysynnada 's solution allows option-specific normalization but I don't think that kind of fine-grained control over option values is needed.

We actually do need this in our use case. Let's do the general solution to avoid API churn on such a user-visible part of DataFusion.

@xinlifoobar and @tinfoil-knight, do you have time to apply @berkaysynnada's suggestion? It's OK if not, we can work on it instead.

cc @berkaysynnada

@xinlifoobar
Copy link
Contributor Author

@berkaysynnada 's solution allows option-specific normalization but I don't think that kind of fine-grained control over option values is needed.

We actually do need this in our use case. Let's do the general solution to avoid API churn on such a user-visible part of DataFusion.

@xinlifoobar and @tinfoil-knight, do you have time to apply @berkaysynnada's suggestion? It's OK if not, we can work on it instead.

cc @berkaysynnada

Hey, since this is pending long. I could do this as a follow up after this is merged.

@ozankabak
Copy link
Contributor

Sure -- let's just make sure to avoid API churn and complete the work 🙂

@alamb
Copy link
Contributor

alamb commented Jul 23, 2024

Sure -- let's just make sure to avoid API churn and complete the work 🙂

How about we file a ticket explaining what else is needed prior to merging this PR. I can do this but I want to make sure I have the use case down correctly

@berkaysynnada says:

What I suggest is that we can let people to define their own normalizations. Rather than directing all values to lowercase, people can declare their own normalization methods for specific option keys. If we choose to do that in this PR, I believe the normalization process should be done during the assignment of table options.

The code in synnada-ai@341b484 I think seems to show a way to add an annotation to a built in config setting value that controls how it will be normalized.

Is the use case to

  1. Provide a new set of ConfigOptions (not in DataFusion) that have custom normalization rules?
  2. Avoid normalization on some of the built in configs?

Something else?

@ozankabak
Copy link
Contributor

Both. We would like to have key-level control on what normalization means for config options, built-in and extension.

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Both. We would like to have key-level control on what normalization means for config options, built-in and extension.

In the interest of keeping the code moving, I filed #11650 to track the additional feature and I merged up from main to resolve a conflict

@ozankabak
Copy link
Contributor

Sounds good -- this is good to go from my perspective.

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Clippy failure seems unrelated to this PR. I think it is #11651

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Merged up to get the clippy fix

@alamb alamb merged commit d452d51 into apache:main Jul 25, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

🚀 thanks and sorry for the delay @xinlifoobar

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop changing the case for COPY TO option values
5 participants