Add an option to enable operations on different DataFrames #633
    # Different DataFrames
    def apply_op(kdf, this_columns, that_columns):
        for this_column, that_column in zip(this_columns, that_columns):
            yield getattr(kdf[this_column], op)(kdf[that_column])
There are some docs at `align_diff_frames` about how the function should be given to `align_diff_frames`.
Codecov Report
    @@            Coverage Diff             @@
    ##           master     #633      +/-   ##
    ==========================================
    + Coverage   92.95%   93.02%   +0.07%
    ==========================================
      Files          31       31
      Lines        5093     5222     +129
    ==========================================
    + Hits         4734     4858     +124
    - Misses        359      364       +5
I am merging this to unblock #624.
This PR proposes to add an environment variable, `OPS_ON_DIFF_FRAMES`, to enable operations on different DataFrames.
To use this feature, set the `OPS_ON_DIFF_FRAMES` environment variable to `true` and run your Koalas code.
The changes here are fairly large because they match the behaviour of pandas and generalize it. However, how it works is pretty straightforward: it joins the DataFrames on their index columns and then performs the operations. See the rough Spark examples below:
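The join-then-operate idea can also be modelled without Spark. The sketch below is plain Python, not the Koalas implementation: each "DataFrame" is a hypothetical dictionary mapping an index value to a row, and the helper names `join_on_index` and `apply_binary_op` are made up for illustration. It assumes an outer join on the index, so rows present on only one side produce a null-like `None`, similar to how pandas aligns on the union of indexes.

```python
from operator import add

# Hypothetical stand-ins for two DataFrames: index value -> row of column values.
left = {0: {"x": 1}, 1: {"x": 2}, 2: {"x": 3}}
right = {1: {"y": 10}, 2: {"y": 20}, 3: {"y": 30}}


def join_on_index(this, that):
    """Full outer join of two index-keyed tables; missing cells become None."""
    this_cols = sorted({c for row in this.values() for c in row})
    that_cols = sorted({c for row in that.values() for c in row})
    joined = {}
    for idx in sorted(this.keys() | that.keys()):
        row = {c: this.get(idx, {}).get(c) for c in this_cols}
        row.update({c: that.get(idx, {}).get(c) for c in that_cols})
        joined[idx] = row
    return joined


def apply_binary_op(joined, this_col, that_col, op):
    """Apply op row by row; a missing operand yields None (null-like)."""
    result = {}
    for idx, row in joined.items():
        a, b = row[this_col], row[that_col]
        result[idx] = None if a is None or b is None else op(a, b)
    return result


joined = join_on_index(left, right)
total = apply_binary_op(joined, "x", "y", add)
print(total)  # {0: None, 1: 12, 2: 23, 3: None}
```

Only index values 1 and 2 exist on both sides, so only those rows get a numeric result. The column names here are distinct on purpose; the duplicated-name case is the harder part.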
The annoying part is how to resolve the duplicated column name `a`. In the current implementation, it is tricky to use a DataFrame's alias, so in this PR I had to work around it by aliasing the columns. For instance:
and then perform the operations, e.g.:
and alias back:
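The three steps above (alias, operate, alias back) can be sketched in plain Python as follows. This is an illustration under assumptions, not the code from this PR: the prefixes `__this_` and `__that_` and the helper `alias_columns` are hypothetical, dictionaries stand in for Spark DataFrames, and an inner join on the shared index values is used to keep the sketch short.

```python
# Hypothetical prefixes; the names actually used in the PR are not shown here.
THIS_PREFIX = "__this_"
THAT_PREFIX = "__that_"

# Two index-keyed tables that both have a column named "a".
this = {0: {"a": 1}, 1: {"a": 2}}
that = {0: {"a": 10}, 1: {"a": 20}}


def alias_columns(table, prefix):
    """Return a copy of the table with every column name prefixed."""
    return {
        idx: {prefix + col: val for col, val in row.items()}
        for idx, row in table.items()
    }


# Step 1: alias both sides so the joined rows have no duplicated names.
this_aliased = alias_columns(this, THIS_PREFIX)
that_aliased = alias_columns(that, THAT_PREFIX)

# Join on the index values shared by both sides.
joined = {
    idx: {**this_aliased[idx], **that_aliased[idx]}
    for idx in sorted(this_aliased.keys() & that_aliased.keys())
}

# Step 2: perform the operation on the now-unambiguous columns.
summed = {
    idx: {THIS_PREFIX + "a": row[THIS_PREFIX + "a"] + row[THAT_PREFIX + "a"]}
    for idx, row in joined.items()
}

# Step 3: alias back by stripping the prefix from the result column.
result = {
    idx: {col[len(THIS_PREFIX):]: val for col, val in row.items()}
    for idx, row in summed.items()
}
print(result)  # {0: {'a': 11}, 1: {'a': 22}}
```

Prefixing both sides keeps the two `a` columns distinguishable after the join, and stripping the prefix at the end restores the name the user expects.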
Resolves #624