Add an option to enable operations on different DataFrames #633

Merged Aug 16, 2019 (1 commit)

HyukjinKwon (Member) commented Aug 9, 2019

This PR proposes to add an environment variable, OPS_ON_DIFF_FRAMES, to enable operations on different DataFrames.

To use this feature, set the OPS_ON_DIFF_FRAMES environment variable to true before running Koalas code.
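
As a minimal sketch of setting and checking the flag from Python: the helper name ops_on_diff_frames_enabled and its case-insensitive parsing are assumptions for illustration only, not necessarily how Koalas reads the variable.

```python
import os


def ops_on_diff_frames_enabled(environ=os.environ):
    """Hypothetical helper: report whether OPS_ON_DIFF_FRAMES is set to "true".

    The case-insensitive comparison is an assumption, not Koalas' documented parsing.
    """
    return environ.get("OPS_ON_DIFF_FRAMES", "false").strip().lower() == "true"


# Set the flag for the current process before running Koalas code.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"
enabled = ops_on_diff_frames_enabled()
print(enabled)
```

Setting the variable in the shell that launches Python has the same effect as assigning it through os.environ at the top of the script.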

The changes are fairly large because they match pandas' behaviour and generalize the logic. How it works is straightforward, though: it joins the two DataFrames on their index columns and then performs the operation. See the rough Spark examples below:

>>> df1.show()
+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  1|
|                1|  2|
|                2|  3|
+-----------------+---+
>>> df2.show()
+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  1|
|                1|  2|
|                2|  3|
+-----------------+---+
>>> df1.join(df2, on="__index_level_0__", how="full").show()
+-----------------+---+---+
|__index_level_0__|  a|  a|
+-----------------+---+---+
|                0|  1|  1|
|                1|  2|  2|
|                2|  3|  3|
+-----------------+---+---+
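
The example above uses identical indexes, so every row finds a partner. The following plain-Python sketch (not Spark; the function name full_join_on_index and the sample values are made up for illustration) models what a full join on the index does when the indexes only partly overlap:

```python
def full_join_on_index(left, right):
    """Full outer join of two {index: value} mappings.

    An index present on only one side gets None for the other side.
    """
    index = sorted(set(left) | set(right))
    return [(i, left.get(i), right.get(i)) for i in index]


df1 = {0: 1, 1: 2, 2: 3}
df2 = {1: 10, 2: 20, 3: 30}

joined = full_join_on_index(df1, df2)
for row in joined:
    print(row)
```

Index 0 exists only on the left and index 3 only on the right, so each of those rows carries a None for the missing side.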

The annoying part is resolving the duplicated column name a. In the current implementation, it is tricky to use the DataFrame's alias.

In this PR, I had to work around this by aliasing the columns. For instance:

+-----------------+--------+--------+
|__index_level_0__|__this_a|__that_a|
+-----------------+--------+--------+
|                0|       1|       1|
|                1|       2|       2|
|                2|       3|       3|
+-----------------+--------+--------+

and then, perform the operations, e.g.:

+-----------------+---------------------+
|__index_level_0__|(__this_a + __that_a)|
+-----------------+---------------------+
|                0|                    2|
|                1|                    4|
|                2|                    6|
+-----------------+---------------------+

and alias back:

+-----------------+---+
|__index_level_0__|  a|
+-----------------+---+
|                0|  2|
|                1|  4|
|                2|  6|
+-----------------+---+
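
The three steps (alias, operate, alias back) can be sketched in plain Python. This is only a model of the idea, not the PR's implementation: the function name binary_op_on_index and the dict-based frames are hypothetical, and returning None when one side is missing is an assumption that mirrors how Spark SQL arithmetic on null yields null.

```python
import operator


def binary_op_on_index(this, that, column, op):
    """Model of the pipeline for frames shaped as {index: {column: value}}."""
    index = sorted(set(this) | set(that))

    # Step 1: join on the index and alias the clashing column names.
    joined = {
        i: {
            "__this_" + column: this.get(i, {}).get(column),
            "__that_" + column: that.get(i, {}).get(column),
        }
        for i in index
    }

    result = {}
    for i, row in joined.items():
        left, right = row["__this_" + column], row["__that_" + column]
        # Step 2: apply the operation; a missing side yields None (assumption).
        value = None if left is None or right is None else op(left, right)
        # Step 3: alias the result back to the original column name.
        result[i] = {column: value}
    return result


df1 = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}}
df2 = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}}

summed = binary_op_on_index(df1, df2, "a", operator.add)
print(summed)
```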

Resolves #624

# Different DataFrames
def apply_op(kdf, this_columns, that_columns):
    for this_column, that_column in zip(this_columns, that_columns):
        yield getattr(kdf[this_column], op)(kdf[that_column])
Member Author

There are some docs at align_diff_frames describing how this function should be passed to it.
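
To illustrate how the apply_op generator dispatches an operator by name, here is a self-contained toy. The Column class and make_apply_op factory are hypothetical stand-ins (Column plays the role of a Koalas Series, and a plain dict plays the role of the joined frame); align_diff_frames itself is not reproduced here.

```python
class Column:
    """Toy stand-in for a Koalas Series: a list of values with two operators."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return Column(a + b for a, b in zip(self.values, other.values))

    def __sub__(self, other):
        return Column(a - b for a, b in zip(self.values, other.values))


def make_apply_op(op):
    """Bind an operator method name such as "__add__" into an apply_op generator."""

    def apply_op(frame, this_columns, that_columns):
        for this_column, that_column in zip(this_columns, that_columns):
            # Look up the operator method by name and call it with the other column.
            yield getattr(frame[this_column], op)(frame[that_column])

    return apply_op


joined = {"__this_a": Column([1, 2, 3]), "__that_a": Column([1, 2, 3])}
apply_add = make_apply_op("__add__")
results = [col.values for col in apply_add(joined, ["__this_a"], ["__that_a"])]
print(results)
```

Because the operator is looked up by name with getattr, the same generator covers subtraction or any other method the column type defines.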

HyukjinKwon force-pushed the ops-diff-dfs branch 3 times, most recently from 39a55a1 to 6e85375 (August 9, 2019 10:04)
codecov-io commented Aug 9, 2019

Codecov Report

Merging #633 into master will increase coverage by 0.07%.
The diff coverage is 95.77%.

@@            Coverage Diff             @@
##           master     #633      +/-   ##
==========================================
+ Coverage   92.95%   93.02%   +0.07%     
==========================================
  Files          31       31              
  Lines        5093     5222     +129     
==========================================
+ Hits         4734     4858     +124     
- Misses        359      364       +5

Impacted Files                Coverage Δ
databricks/koalas/base.py     92.07% <100%> (+0.24%) ⬆️
databricks/koalas/frame.py    94.57% <89.65%> (-0.07%) ⬇️
databricks/koalas/utils.py    97.9% <97.11%> (-2.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 82e2e41...18e883c.

@softagram-bot
Softagram Impact Report for pull/633 (head commit: 18e883c)

HyukjinKwon (Member, Author) commented

I am merging this to unblock #624

HyukjinKwon merged commit c97c5f1 into databricks:master Aug 16, 2019
ueshin added a commit that referenced this pull request Aug 16, 2019
Currently the master build is failing.
It seems there are conflicting changes between #633 and #639.
I'll skip the test for now to unblock other PRs.
HyukjinKwon deleted the ops-diff-dfs branch November 6, 2019 02:23
Successfully merging this pull request may close these issues.

Cannot operate on 2 different ks.DataFrames