-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DataFrame.join for MultiIndex #1771
Conversation
I don't think this is a right fix. Instead, we should check the index names as well as the column names. E.g., >>> toy_pd = pd.DataFrame(columns = ['day','item','size'], data = [[5, 0, 500],[5, 0, 550],[5, 1, 1500],[5, 1, 700],[5, 1, 900],
... [6, 0, 400],[6, 0, 300],[6, 0, 600], [6, 1, 800],[6, 1, 200],
... [7, 0, 600],[7, 1, 700],[7, 1, 700], [7, 2, 750],[7, 2, 500]])
>>>
>>> toy_pd1 = toy_pd.set_index('day')
>>> toy_pd2 = toy_pd.set_index('day')
>>> toy_pd1.join(toy_pd2, on='day', rsuffix='r')
item size itemr sizer
day
5 0 500 0 500
5 0 500 0 550
5 0 500 1 1500
5 0 500 1 700
5 0 500 1 900
.. ... ... ... ...
7 2 500 0 600
7 2 500 1 700
7 2 500 1 700
7 2 500 2 750
7 2 500 2 500
[75 rows x 4 columns] whereas with the current fix: >>> ks.from_pandas(toy_pd1).join(ks.from_pandas(toy_pd2), on ='day', rsuffix='r')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 6953, in join
self = self.set_index(on)
File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 3207, in set_index
raise KeyError(key)
KeyError: 'day' |
Addressed and added related tests. Thanks, @ueshin ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, LGTM.
Thanks! merging. |
'len(left_on) must equal the number of levels in the index of "right"' | ||
) | ||
|
||
need_set_index = len(set(on) & set(self.index.names)) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@itholic Would you please help me understand this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xinrong-databricks
Sure!
This line checks if the given join keys are already included in the Index
or not.
If not (True, in this statement, because we're checking if the intersection count of the set is 0
), we need to set the given join keys as an Index
using set_index
below.
If you have any questions more, please feel free to ask ! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Your explanation is so clear :)!
May I ask for the reason for set the given join keys as an Index
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xinrong-databricks
Sure!
This is because we uses merge
after then.
If the given join keys are not in Index, the result of merge
will be not correct.
For example, let's say we have two DataFrames as below.
>>> kdf1
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K3 A3
>>> kdf2
B
key
K0 B0
K1 B1
K2 B2
And we can expect the result of join
with on='keys'
as below.
>>> kdf1.join(kdf2, on=['key'])
key A B
0 K3 A3 None
1 K0 A0 B0
2 K1 A1 B1
3 K2 A2 B2
We can make the same result with merge
as below.
>>> kdf1.set_index('key').merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
key A B
0 K3 A3 None
1 K0 A0 B0
2 K1 A1 B1
3 K2 A2 B2
At this point, If we didn't kdf1.set_index('key')
, the result will be different as below.
>>> kdf1.merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
index key A B
0 0 K0 A0 None
1 1 K1 A1 None
2 3 K3 A3 None
3 2 K2 A2 None
So, that's why we need set_index
here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @itholic ! That's so clear :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xinrong-databricks np! Glad to know I helped :)
This should resolve #1770