Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataframe.with_column_rename has unintuitive behavior when using case sensitive column names #8800

Closed
Omega359 opened this issue Jan 9, 2024 · 7 comments · Fixed by #8858
Labels
bug Something isn't working

Comments

@Omega359
Copy link
Contributor

Omega359 commented Jan 9, 2024

Describe the bug

If a data frame has column names such as FileId, FileType, SequenceId attempting to rename this with calls such as
data_frame = data_frame.with_column_renamed("FileId", "file_id").unwrap();
will return success however the column will not be renamed. After some investigation by Andy Grove it was determined that the names would have to be wrapped in quotes (ala sql) to have the columns renamed correctly.
data_frame = data_frame.with_column_renamed("\"FileId\"", "file_id").unwrap();

This is neither intuitive, documented nor normal for any dataframe api I've encountered in the past.

To Reproduce

I wrote a quick test case when investigating this issue.

Expected behavior

For me the expectation is that referring to the column name in the first parameter would be 'as-is' without any configuration changes (say, to use case insensitive names or to require quotes ala sql)

Additional context

No response

@Omega359 Omega359 added the bug Something isn't working label Jan 9, 2024
@alamb
Copy link
Contributor

alamb commented Jan 9, 2024

I agree this is confusing -- in general people using the DataFrame API have been confused by this as DataFrame APIs typically are case sensitive, while SQL is not (by default)

@comphead
Copy link
Contributor

comphead commented Jan 9, 2024

It was some changes here to support case sensitive #7063

@Omega359 am I right, the concern is if the column has not been renamed, you expect None, instead of Some?

@Omega359 Omega359 changed the title dataframe.with_column_rename has unintuitive bahavior when using case sensitive column names dataframe.with_column_rename has unintuitive behavior when using case sensitive column names Jan 9, 2024
@Omega359
Copy link
Contributor Author

Omega359 commented Jan 9, 2024

It was some changes here to support case sensitive #7063

@Omega359 am I right, the concern is if the column has not been renamed, you expect None, instead of Some?

One of the concerns, yes. I also was expecting as Andrew pointed out that the names would be case sensitive. Using double quotes is a solution however imho it's not an ideal one.

@comphead
Copy link
Contributor

Ident normalization is a standard practice
However you also can use datafusion.sql_parser.enable_ident_normalization = false

+--------------------------------------------------+-------+------------------------------------------------------------------------------------------------+
| name                                             | value | description                                                                                    |
+--------------------------------------------------+-------+------------------------------------------------------------------------------------------------+
| datafusion.sql_parser.enable_ident_normalization | true  | When set to true, SQL parser will normalize ident (convert ident to lowercase when not quoted) |
+--------------------------------------------------+-------+------------------------------------------------------------------------------------------------+

parameter to override this behaviuor.

Please refer to with_column_renamed_case_sensitive test

@comphead
Copy link
Contributor

The signature and comments for with_column_renamed looks reasonable for me

    /// Rename one column by applying a new projection. This is a no-op if the column to be
    /// renamed does not exist.
    pub fn with_column_renamed(
        self,
        old_name: impl Into<String>,
        new_name: &str,
    ) -> Result<DataFrame>

@Omega359 do you think anything still left that can be improved?

@Omega359
Copy link
Contributor Author

Omega359 commented Jan 13, 2024

@Omega359 do you think anything still left that can be improved?

Yes. Hiding rename case behaviour behind a flag that as far as I can find isn't prominently noted in the docs isn't great. It's really not obvious as a user of the api why it isn't working when you provided the 'correct' name. Having to dig into the source code, while awesome to have it available, is not ideal to solving an issue like this, especially for users who are still getting up to speed on rust. I didn't have a solution to my issue until a core developer popped into the discord channel to help.

A few possible solutions I see of which I'll help with #2 time permitting:

  1. Fix the naming so that it behaves as (well, to me anyways) one would expect - either match the name regardless of case or match based on case. The current implementation seems to conflate sql behaviour with dataframe behaviour.
  2. Update the user guide to make it clear what the default behaviour is - and if that isn't the desired behaviour, how to change it. Even better would be to link each dataframe (and sql) with a page that outlines the function in more detail with examples, caveats and any corner case behaviour.
  3. Update the api docs with information related to any flags that could impact behaviour/performance/etc and link to the examples provided in # 2 (or datafusion-examples, whatever works)

@comphead
Copy link
Contributor

comphead commented Jan 13, 2024

Somehow we missed the description for this param in https://github.com/apache/arrow-datafusion/blob/main/docs/source/user-guide/configs.md

I'll create a PR as well as document the behavior for with_column_renamed method

UPD: My bad we have this param documented in https://github.com/apache/arrow-datafusion/blob/main/docs/source/user-guide/configs.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants