[SPARK-25430][SQL] Add map parameter for withColumnRenamed #22428

goungoun · 2018-09-15T13:43:08Z

What changes were proposed in this pull request?

This PR allows withColumnRenamed to accept a map as its input argument, so that several columns can be renamed in one call.

How was this patch tested?

unit tests

HyukjinKwon · 2018-09-17T01:49:16Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * }}}
+   *
+   * @group untypedrel
+   * @since 2.4.0


branch-2.4 has already been cut. We will probably target 3.0.0 if we end up adding new APIs.

HyukjinKwon · 2018-09-17T01:50:07Z

Can we simply call the API multiple times? I think we haven't usually added such aliases for an API unless there's a strong argument for it.

goungoun · 2018-09-17T04:41:40Z

@HyukjinKwon, thanks for your review. Actually, that is the reason I opened this pull request. I think it is better to give users a reusable option than to have them repeat the same code over and over in their analysis. In a notebook environment, whenever visualization is required in the middle of an analysis, I have to convert the column names rather than use them as they are, so that I can deliver the right message to the report readers. During that process, I had to repeat withColumnRenamed too many times.

So, I researched how other users try to overcome this limitation. It seems that users tend to use foldLeft or a for loop with withColumnRenamed, which can cause performance issues by creating too many DataFrames inside the Spark engine without the user even knowing it. The supporting references are as follows.

StackOverflow

Spark Issues
[SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
https://issues.apache.org/jira/browse/SPARK-12225

[SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead
If foldLeft is used, too many columns can cause a performance issue
https://issues.apache.org/jira/browse/SPARK-21582
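
As a rough illustration of the workarounds described above, the sketch below shows the foldLeft pattern next to a single-projection alternative. The object and method names (`RenameWorkarounds`, `renameWithFold`, `renameWithSelect`) are hypothetical and not part of Spark; only `withColumnRenamed`, `select`, `col` and `as` are existing Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical application-side helpers; not part of the Spark API.
object RenameWorkarounds {
  // Chained renames: one withColumnRenamed call (and one new DataFrame) per map entry.
  def renameWithFold(df: DataFrame, renames: Map[String, String]): DataFrame =
    renames.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }

  // Single projection: every column is selected once and aliased if the map has an entry.
  // Backticks keep column names that contain dots from being parsed as nested fields.
  def renameWithSelect(df: DataFrame, renames: Map[String, String]): DataFrame =
    df.select(df.columns.map(c => col(s"`$c`").as(renames.getOrElse(c, c))): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("rename-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "val")
    val renames = Map("id" -> "ID", "val" -> "Value")
    renameWithFold(df, renames).printSchema()
    renameWithSelect(df, renames).printSchema()
    spark.stop()
  }
}
```

The two variants are not fully interchangeable: when a new name collides with an existing column name (for example, swapping two names), the chained version depends on the order of the renames, while the projection renames all columns at once.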

HyukjinKwon · 2018-09-17T04:48:00Z

The performance issue was caused by repeated query plan analysis, which is resolved in the current master if I am not mistaken. If you're in doubt, I would suggest doing a quick benchmark. I think this is something that should be done with a one-liner helper in application-side code.
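
A quick benchmark along these lines could look like the sketch below. It is only an assumption of how one might measure it: the object name `RenameBenchmark`, the column count, and the column names are made up for illustration, and a single wall-clock timing like this is noisy (no JVM warm-up, no repetition).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Hypothetical micro-benchmark; the numbers it prints depend on the Spark version and machine.
object RenameBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("rename-benchmark").getOrCreate()

    val numColumns = 200 // arbitrary width, chosen only for illustration
    // Build a wide one-row DataFrame with columns c0, c1, ... in a single projection.
    val wide = spark.range(1).select((0 until numColumns).map(i => lit(i).as(s"c$i")): _*)
    val renames = (0 until numColumns).map(i => s"c$i" -> s"renamed_$i")

    // Time the chained renames. Each call builds and analyzes a new plan,
    // so no action (count, collect) is run inside the timed section.
    val start = System.nanoTime()
    val renamed = renames.foldLeft(wide) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"$numColumns chained renames took $elapsedMs%.1f ms")
    println(s"first column is now ${renamed.columns.head}")
    spark.stop()
  }
}
```

Running the same program against two Spark versions would give a rough before/after comparison of the analysis cost.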

gatorsmile

Adding a new API is not needed, especially after we improved our resolution algorithm in the 2.4 release. See the commit: 4e861db

goungoun · 2018-10-02T06:07:52Z

Awesome! @HyukjinKwon, @gatorsmile, thanks for the good information. Let me look into it further. By the way, I still hope this conversation stays open to users' voices and is not limited to the developers' perspective. For people like me who have to do data wrangling/engineering every day, it makes things easier.

AmplabJenkins · 2019-01-11T21:03:24Z

Can one of the admins verify this patch?

HyukjinKwon · 2019-01-12T16:22:50Z

This can be easily worked around. I think there should be no perf issue now. Even if there is, I don't think that justifies adding a new API. We should fix the perf issue.

I'm leaving this closed unless there are other factors to consider.

HyukjinKwon · 2019-01-12T16:25:02Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * @group untypedrel
+   * @since 3.0.0
+   */
+  def withColumnRenamed(columnMap: Map[String, String]): DataFrame = {


Btw, this won't support Java's Map.
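
To illustrate the point, a helper that is meant to be callable from Java would typically need a second overload taking `java.util.Map`. The sketch below is an application-side assumption, not the code from this PR: `RenameHelpers` and `withColumnsRenamed` are hypothetical names.

```scala
import java.util.{Map => JMap}
import scala.collection.JavaConverters._ // deprecated in Scala 2.13 in favor of scala.jdk.CollectionConverters
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; not part of the Spark API.
object RenameHelpers {
  // Scala-facing variant: applies one withColumnRenamed call per map entry.
  def withColumnsRenamed(df: DataFrame, renames: Map[String, String]): DataFrame =
    renames.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }

  // Java-facing overload: converts the java.util.Map to an immutable Scala Map and delegates.
  def withColumnsRenamed(df: DataFrame, renames: JMap[String, String]): DataFrame =
    withColumnsRenamed(df, renames.asScala.toMap)
}
```

From Java, a caller could then pass a `java.util.HashMap<String, String>` directly to `RenameHelpers.withColumnsRenamed`.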

[SPARK-18073][SQL] Add map parameter for withColumnRenamed

eb08589

HyukjinKwon reviewed Sep 17, 2018

target fix

2cd4774

gatorsmile reviewed Oct 2, 2018

HyukjinKwon closed this Jan 12, 2019

HyukjinKwon reviewed Jan 12, 2019
