Skip to content

Conversation

@goungoun
Copy link
Contributor

What changes were proposed in this pull request?

This PR allows withColumnRenamed with a map input argument

How was this patch tested?

unit tests

* }}}
*
* @group untypedrel
* @since 2.4.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

branch-2.4 is cut out. We will probably target 3.0.0 if we happen to add new APIs.

@HyukjinKwon
Copy link
Member

Can we simply call the API multiple times? I think we haven't usually added such aliases for an API unless there's strong argument for it.

@goungoun
Copy link
Contributor Author

goungoun commented Sep 17, 2018

@HyukjinKwon , thanks for your review. Actually, that is the reason that I open this pull request. I think it is better to give reusable option to users than repeating too much of same code in their analysis. In notebook environment, whenever visualization is required in the middle of the analysis, I had to convert column names rather than using it as it is so that I can deliver right messages to the report readers. During the process, I had to repeat withColumenRenamed too many times.

So, I've researched how the other users are trying to overcome the limitation. It seems that users tend to use foldleft or for loop with withColumnRenamed which can cause performance issue creating too many dataframes inside of Spark engine even without knowing it. The arguments can be found as follows.

StackOverflow

Spark Issues
[SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
https://issues.apache.org/jira/browse/SPARK-12225

[SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead
If foldleft is used, too many columns can cause performance issue
https://issues.apache.org/jira/browse/SPARK-21582

@HyukjinKwon
Copy link
Member

The performance issue was introduced by repeating query plan analysis, which is resolved in the current master if I am not mistaken - if you're in doubt, I would suggest to do a quick benchamrk. I think this is something we should do it with one liner helper in application side code.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a new API is not needed especially after we improve our resolution algorithm in 2.4 release. See the commit: 4e861db

@goungoun
Copy link
Contributor Author

goungoun commented Oct 2, 2018

Awesome! @HyukjinKwon , @gatorsmile thanks for good information. Let me look into it further. By the way, I still hope this conversation is open to users' voice, not limited with developers' perspective. Like me who have to do data wrangling/engineering everyday, it makes things easier.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

This can be easily worked around. I think no perf issue should be there now - even if there is, I don't think that justify to add a new API. We should fix the perf issue.

I'm leaving this closed unless there are other factors to consider more.

* @group untypedrel
* @since 3.0.0
*/
def withColumnRenamed(columnMap: Map[String, String]): DataFrame = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw this won't support java's map.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants