Add an option to enable operations on different DataFrames #633

HyukjinKwon · 2019-08-09T09:43:10Z

This PR proposes to add an environment variable, OPS_ON_DIFF_FRAMES, to enable operations on different DataFrames.

To use this feature, set the OPS_ON_DIFF_FRAMES environment variable to true and run Koalas code.
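
As a rough illustration, the variable can be set from Python before Koalas is imported. When the flag is actually read is not stated here, so setting it first is only the cautious assumption, and `ops_on_diff_frames_enabled` is a hypothetical helper showing how such a flag could be interpreted, not a Koalas function:

```python
import os

# Assumption: the flag is taken from the process environment, so it is set
# before any Koalas import to be on the safe side.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def ops_on_diff_frames_enabled() -> bool:
    """Hypothetical helper: treat the variable as a case-insensitive boolean."""
    return os.environ.get("OPS_ON_DIFF_FRAMES", "false").lower() == "true"


print(ops_on_diff_frames_enabled())
```

From a shell, the equivalent would be exporting OPS_ON_DIFF_FRAMES=true before launching the Python process.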

The changes here are fairly large because they match the behaviour of pandas and generalize it. However, how it works is pretty straightforward: it joins the two DataFrames on their index columns and then performs the operation. See the rough Spark examples below:

>>> df1.show()

+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  1|
|                1|  2|
|                2|  3|
+-----------------+---+

>>> df2.show()

+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  1|
|                1|  2|
|                2|  3|
+-----------------+---+

>>> df1.join(df2, on="__index_level_0__", how="full").show()

+-----------------+---+---+
|__index_level_0__|  a|  a|
+-----------------+---+---+
|                0|  1|  1|
|                1|  2|  2|
|                2|  3|  3|
+-----------------+---+---+
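
The join step can be modelled in plain Python, as a sketch only. Rows are dicts, the index is assumed to be unique on each side, and `full_outer_join` is a hypothetical name rather than anything from Koalas or Spark. The output is sorted by index for readability, which a real Spark join does not guarantee:

```python
INDEX = "__index_level_0__"


def full_outer_join(left, right, column="a"):
    """Pair up the `column` values of two row lists by index value.

    An index value missing on one side yields None for that side,
    which is what makes the join "full outer".
    """
    left_by_index = {row[INDEX]: row[column] for row in left}
    right_by_index = {row[INDEX]: row[column] for row in right}
    return [
        (idx, left_by_index.get(idx), right_by_index.get(idx))
        for idx in sorted(left_by_index.keys() | right_by_index.keys())
    ]


df1 = [{INDEX: i, "a": v} for i, v in enumerate([1, 2, 3])]
df2 = [{INDEX: i, "a": v} for i, v in enumerate([1, 2, 3])]
print(full_outer_join(df1, df2))  # [(0, 1, 1), (1, 2, 2), (2, 3, 3)]
```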

The annoying part here is how to resolve the duplicated column name a. In the current implementation, it is tricky to use the DataFrame's alias.

In this PR, I had to work around it by aliasing the columns. For instance:

+-----------------+--------+--------+
|__index_level_0__|__this_a|__that_a|
+-----------------+--------+--------+
|                0|       1|       1|
|                1|       2|       2|
|                2|       3|       3|
+-----------------+--------+--------+

and then perform the operations, e.g.:

+-----------------+---------------------+
|__index_level_0__|(__this_a + __that_a)|
+-----------------+---------------------+
|                0|                    2|
|                1|                    4|
|                2|                    6|
+-----------------+---------------------+

and alias back:

+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  2|
|                1|  4|
|                2|  6|
+-----------------+---+
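
The three steps (alias, operate, alias back) can be sketched end to end in plain Python. This is a toy model of the idea, not the Koalas implementation: rows are dicts, index values are assumed unique, the helper names `alias` and `op_on_diff_frames` are hypothetical, and returning None when one side is missing is an assumption standing in for a null result:

```python
import operator

INDEX = "__index_level_0__"


def alias(rows, prefix):
    """Rename every non-index column to `<prefix><name>`, keyed by index value."""
    return {
        row[INDEX]: {prefix + name: value for name, value in row.items() if name != INDEX}
        for row in rows
    }


def op_on_diff_frames(this_rows, that_rows, op=operator.add):
    this, that = alias(this_rows, "__this_"), alias(that_rows, "__that_")
    names = sorted({n for row in this_rows + that_rows for n in row if n != INDEX})
    result = []
    for idx in sorted(this.keys() | that.keys()):  # full outer join on the index
        joined = {**this.get(idx, {}), **that.get(idx, {})}  # no clash thanks to aliases
        out = {INDEX: idx}
        for name in names:
            left = joined.get("__this_" + name)
            right = joined.get("__that_" + name)
            # Apply the operation and store it under the original column name.
            out[name] = None if left is None or right is None else op(left, right)
        result.append(out)
    return result


df1 = [{INDEX: 0, "a": 1}, {INDEX: 1, "a": 2}, {INDEX: 2, "a": 3}]
df2 = [{INDEX: 0, "a": 1}, {INDEX: 1, "a": 2}, {INDEX: 2, "a": 3}]
print(op_on_diff_frames(df1, df2))
```

With the two frames from the tables above, the printed rows carry a = 2, 4, 6 for index 0, 1, 2.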

Resolves #624

HyukjinKwon · 2019-08-09T09:52:30Z

databricks/koalas/frame.py

+            # Different DataFrames
+            def apply_op(kdf, this_columns, that_columns):
+                for this_column, that_column in zip(this_columns, that_columns):
+                    yield getattr(kdf[this_column], op)(kdf[that_column])


There are some docs in align_diff_frames describing how the function passed to it should be written.
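
To see the getattr dispatch in the snippet above in isolation, here is a minimal sketch. `Column` is a stand-in for a Koalas Series, `make_apply_op` is a hypothetical wrapper that supplies `op` (a closure variable in the diff), and a plain dict plays the role of the joined frame; none of this reflects the real align_diff_frames contract:

```python
class Column:
    """Stand-in for a Series: a list of values with elementwise addition."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return Column(x + y for x, y in zip(self.values, other.values))


def make_apply_op(op):
    """Hypothetical wrapper: bind the operator method name, e.g. "__add__"."""

    def apply_op(kdf, this_columns, that_columns):
        # Pair each aliased "this" column with its "that" counterpart and
        # look up the operator method by name on the left-hand column.
        for this_column, that_column in zip(this_columns, that_columns):
            yield getattr(kdf[this_column], op)(kdf[that_column])

    return apply_op


kdf = {"__this_a": Column([1, 2, 3]), "__that_a": Column([1, 2, 3])}
results = list(make_apply_op("__add__")(kdf, ["__this_a"], ["__that_a"]))
print(results[0].values)  # [2, 4, 6]
```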

codecov-io · 2019-08-09T13:09:27Z

Codecov Report

Merging #633 into master will increase coverage by 0.07%.
The diff coverage is 95.77%.

@@            Coverage Diff             @@
##           master     #633      +/-   ##
==========================================
+ Coverage   92.95%   93.02%   +0.07%     
==========================================
  Files          31       31              
  Lines        5093     5222     +129     
==========================================
+ Hits         4734     4858     +124     
- Misses        359      364       +5

Impacted Files               Coverage Δ
databricks/koalas/base.py    92.07% <100%> (+0.24%)    ⬆️
databricks/koalas/frame.py   94.57% <89.65%> (-0.07%)  ⬇️
databricks/koalas/utils.py   97.9% <97.11%> (-2.1%)    ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 82e2e41...18e883c.

softagram-bot · 2019-08-12T02:05:29Z

Softagram Impact Report for pull/633 (head commit: `18e883c`)

HyukjinKwon · 2019-08-16T00:34:07Z

I am merging this to unblock #624

Currently the master build is failing. It seems there are conflicting changes between #633 and #639. I'd skip the test for now to unblock other PRs.

HyukjinKwon mentioned this pull request Aug 9, 2019

Cannot operate on 2 different ks.DataFrames #624

Closed

HyukjinKwon requested a review from ueshin August 9, 2019 09:45

HyukjinKwon commented Aug 9, 2019

HyukjinKwon force-pushed the ops-diff-dfs branch 3 times, most recently from 39a55a1 to 6e85375 Compare August 9, 2019 10:04

HyukjinKwon force-pushed the ops-diff-dfs branch from 05f9fcd to ec4de53 Compare August 12, 2019 01:33

Add an option to enable operations on different DataFrames

18e883c

HyukjinKwon force-pushed the ops-diff-dfs branch from ec4de53 to 18e883c Compare August 12, 2019 02:04

HyukjinKwon mentioned this pull request Aug 14, 2019

Allow to omit type hint in GroupBy.transform, filter, apply #646

Merged

HyukjinKwon merged commit c97c5f1 into databricks:master Aug 16, 2019

ueshin mentioned this pull request Aug 16, 2019

Skip a test OpsOnDiffFramesEnabledTest.test_no_index. #651

Merged

ueshin added a commit that referenced this pull request Aug 16, 2019

Skip a test OpsOnDiffFramesEnabledTest.test_no_index. (#651)

f0f1859

Currently the master build is failing. It seems there are conflicting changes between #633 and #639. I'd skip the test for now to unblock other PRs.

HyukjinKwon mentioned this pull request Aug 20, 2019

Exclude Index columns for exposed Spark DataFrame and disallow Koalas DataFrame with no index #655

Merged

HyukjinKwon deleted the ops-diff-dfs branch November 6, 2019 02:23
