[SPARK-25430][SQL] Add map parameter for withColumnRenamed #22428
| * }}} | ||
| * | ||
| * @group untypedrel | ||
| * @since 2.4.0 |
branch-2.4 has already been cut. We will probably target 3.0.0 if we do end up adding new APIs.
|
Can we simply call the API multiple times? I don't think we have usually added such aliases for an API unless there's a strong argument for it. |
|
@HyukjinKwon, thanks for your review. Actually, that is the reason I opened this pull request. I think it is better to give users a reusable option than to have them repeat the same code throughout their analysis. In a notebook environment, whenever a visualization is needed in the middle of an analysis, I have to rename columns rather than use them as they are so that the report conveys the right message to its readers. In the process, I had to repeat withColumnRenamed too many times. So I looked into how other users work around this limitation. It seems that users tend to use foldLeft or a for loop with withColumnRenamed, which can cause performance issues by creating too many DataFrames inside the Spark engine without the user even knowing it. Supporting references:
StackOverflow
Spark Issues: [SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead |
|
The performance issue was caused by repeated query plan analysis, which is resolved in the current master if I am not mistaken. If you're in doubt, I would suggest doing a quick benchmark. I think this is something that should be done with a one-line helper in application-side code. |
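For illustration, the application-side helper mentioned above might look roughly like the following sketch. The object and method names (`ColumnRenameHelpers`, `renamedColumns`, `renameByFold`, `renameAtOnce`) are hypothetical and not part of Spark; only `withColumnRenamed(String, String)`, `columns`, and `toDF(String*)` are existing DataFrame methods. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical application-side helpers; none of these names exist in Spark.
object ColumnRenameHelpers {

  // Pure helper: map each existing column name to its new name,
  // leaving names that are not keys of `mapping` unchanged.
  // Matching is by exact string equality.
  def renamedColumns(columns: Seq[String], mapping: Map[String, String]): Seq[String] =
    columns.map(c => mapping.getOrElse(c, c))

  // Variant 1: chain the existing two-argument API once per entry.
  // Each step produces a new DataFrame, so renames are applied one after another.
  def renameByFold(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }

  // Variant 2: compute all new names first, then rename every column in one call.
  def renameAtOnce(df: DataFrame, mapping: Map[String, String]): DataFrame =
    df.toDF(renamedColumns(df.columns.toSeq, mapping): _*)
}
```

The two variants are not fully equivalent. With chained renames, a mapping such as `a -> b, b -> c` depends on the order in which entries are applied, whereas the single `toDF` call renames all columns simultaneously. The pure helper also matches names by exact string equality, which may differ from how Spark itself resolves column names.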
gatorsmile
left a comment
Adding a new API is not needed, especially after we improved our resolution algorithm in the 2.4 release. See the commit: 4e861db
|
Awesome! @HyukjinKwon, @gatorsmile, thanks for the good information. Let me look into it further. By the way, I still hope this conversation stays open to users' voices and is not limited to the developers' perspective. For people like me who do data wrangling/engineering every day, this makes things easier. |
|
Can one of the admins verify this patch? |
|
This can be easily worked around. I think there should be no perf issue now; even if there is, I don't think that justifies adding a new API. We should fix the perf issue instead. I'm leaving this closed unless there are other factors to consider. |
| * @group untypedrel | ||
| * @since 3.0.0 | ||
| */ | ||
| def withColumnRenamed(columnMap: Map[String, String]): DataFrame = { |
Btw, this won't support Java's Map.
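As a rough sketch of the Java interoperability point above: a signature taking a Scala `Map[String, String]` is awkward to call from Java, so a Java-friendly entry point would have to accept `java.util.Map` and convert it. The names below (`RenameJavaInterop`, `toScalaRenameMap`, `withColumnsRenamed`) are hypothetical and not part of Spark or of this PR. The sketch assumes Scala 2.11 or 2.12 with Spark SQL on the classpath; `scala.collection.JavaConverters` is deprecated in Scala 2.13.

```scala
import java.util.{Map => JMap}

import scala.collection.JavaConverters._

import org.apache.spark.sql.DataFrame

// Hypothetical helper object; not part of Spark or of this PR.
object RenameJavaInterop {

  // Convert a Java map into an immutable Scala map.
  def toScalaRenameMap(columnMap: JMap[String, String]): Map[String, String] =
    columnMap.asScala.toMap

  // Java-friendly variant: accept java.util.Map and apply the existing
  // two-argument withColumnRenamed once per entry.
  def withColumnsRenamed(df: DataFrame, columnMap: JMap[String, String]): DataFrame =
    toScalaRenameMap(columnMap).foldLeft(df) {
      case (acc, (from, to)) => acc.withColumnRenamed(from, to)
    }
}
```

From Java, such a helper could be called as `RenameJavaInterop.withColumnsRenamed(df, javaMap)`, since Scala objects expose static forwarders for their methods.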
What changes were proposed in this pull request?
This PR allows withColumnRenamed to take a map as its input argument.
How was this patch tested?
unit tests