Fix DataFrame.join for MultiIndex #1771

itholic · 2020-09-14T15:29:28Z

This should resolve #1770

>>> toy_pd = pd.DataFrame(columns = ['day','item','size'], data = [[5, 0, 500],[5, 0, 550],[5, 1, 1500],[5, 1, 700],[5, 1, 900],
... [6, 0, 400],[6, 0, 300],[6, 0, 600], [6, 1, 800],[6, 1, 200],
... [7, 0, 600],[7, 1, 700],[7, 1, 700], [7, 2, 750],[7, 2, 500]])

>>> toy_ks1 = ks.from_pandas(toy_pd).groupby(['day','item']).agg({'size':'mean'})
>>> toy_ks2 = ks.from_pandas(toy_pd).groupby(['day','item']).agg({'size':'mean'})

>>> toy_ks1.join(toy_ks2, on = ['day','item'], rsuffix='r')
                 size        sizer
day item
5   1     1033.333333  1033.333333
7   1      700.000000   700.000000
    2      625.000000   625.000000
6   1      500.000000   500.000000
7   0      600.000000   600.000000
6   0      433.333333   433.333333
5   0      525.000000   525.000000

ueshin · 2020-09-15T17:53:43Z

I don't think this is the right fix. Instead, we should check the index names as well as the column names.

E.g.,

>>> toy_pd = pd.DataFrame(columns = ['day','item','size'], data = [[5, 0, 500],[5, 0, 550],[5, 1, 1500],[5, 1, 700],[5, 1, 900],
...                                                                [6, 0, 400],[6, 0, 300],[6, 0, 600], [6, 1, 800],[6, 1, 200],
...                                                                [7, 0, 600],[7, 1, 700],[7, 1, 700], [7, 2, 750],[7, 2, 500]])
>>>
>>> toy_pd1 = toy_pd.set_index('day')
>>> toy_pd2 = toy_pd.set_index('day')
>>> toy_pd1.join(toy_pd2, on='day', rsuffix='r')
     item  size  itemr  sizer
day
5       0   500      0    500
5       0   500      0    550
5       0   500      1   1500
5       0   500      1    700
5       0   500      1    900
..    ...   ...    ...    ...
7       2   500      0    600
7       2   500      1    700
7       2   500      1    700
7       2   500      2    750
7       2   500      2    500

[75 rows x 4 columns]

whereas with the current fix:

>>> ks.from_pandas(toy_pd1).join(ks.from_pandas(toy_pd2), on ='day', rsuffix='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 6953, in join
    self = self.set_index(on)
  File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 3207, in set_index
    raise KeyError(key)
KeyError: 'day'

itholic · 2020-09-16T06:11:38Z

Addressed and added related tests. Thanks, @ueshin !

ueshin

Otherwise, LGTM.

databricks/koalas/frame.py

ueshin · 2020-09-21T21:19:21Z

Thanks! merging.

xinrong-meng · 2021-01-13T01:19:03Z

databricks/koalas/frame.py

+                    'len(left_on) must equal the number of levels in the index of "right"'
+                )
+
+            need_set_index = len(set(on) & set(self.index.names)) == 0


@itholic Would you please help me understand this line?

itholic · 2021-01-13

@xinrong-databricks

Sure!

This line checks whether the given join keys are already included in the Index.

If they are not (the expression is True, because we check whether the intersection of the two sets is empty), we need to set the given join keys as the Index using set_index below.

If you have any more questions, please feel free to ask! :)
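The check described above can be sketched in plain Python, without Koalas. This is only an illustration: the helper name `need_set_index` and the sample key lists are hypothetical, and only the set-intersection expression is taken from the diff.

```python
from typing import Optional, Sequence


def need_set_index(on: Sequence[str], index_names: Sequence[Optional[str]]) -> bool:
    """Return True when none of the join keys is already an index level name."""
    # Same expression as in the diff: an empty intersection means the keys
    # are ordinary columns and still have to be moved into the index.
    return len(set(on) & set(index_names)) == 0


# Join key is a regular column and the index is unnamed: set_index is needed.
keys_are_columns = need_set_index(["key"], [None])

# Join keys already name the index levels: set_index must be skipped,
# otherwise it would look for columns that do not exist.
keys_are_index_levels = need_set_index(["day", "item"], ["day", "item"])

print(keys_are_columns, keys_are_index_levels)
```

The second case corresponds to the MultiIndex example from the PR description, where `day` and `item` are index levels rather than columns.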

xinrong-meng · 2021-01-13

Thank you! Your explanation is so clear :)!

May I ask why the given join keys are set as an Index?

itholic · 2021-01-14

@xinrong-databricks

Sure!

This is because we use merge afterwards.

If the given join keys are not in the Index, the result of merge will not be correct.

For example, let's say we have two DataFrames as below.

>>> kdf1
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
>>> kdf2
      B
key
K0   B0
K1   B1
K2   B2

And we can expect the result of join with on=['key'] as below.

>>> kdf1.join(kdf2, on=['key'])
  key   A     B
0  K3  A3  None
1  K0  A0    B0
2  K1  A1    B1
3  K2  A2    B2

We can make the same result with merge as below.

>>> kdf1.set_index('key').merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
  key   A     B
0  K3  A3  None
1  K0  A0    B0
2  K1  A1    B1
3  K2  A2    B2

At this point, if we didn't call kdf1.set_index('key'), the result would be different, as below.

>>> kdf1.merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
   index key   A     B
0      0  K0  A0  None
1      1  K1  A1  None
2      3  K3  A3  None
3      2  K2  A2  None

So, that's why we need set_index here!
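The equivalence described above can be reproduced with plain pandas as a rough sketch. It assumes pandas stands in for Koalas here (so missing values show up as NaN rather than None, and row order follows the left frame); the variable names `joined` and `merged` are illustrative only.

```python
import pandas as pd

# Left frame: the join key is an ordinary column.
kdf1 = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["A0", "A1", "A2", "A3"]})
# Right frame: the join key is the index.
kdf2 = pd.DataFrame({"B": ["B0", "B1", "B2"]},
                    index=pd.Index(["K0", "K1", "K2"], name="key"))

# join with on= matches the left column "key" against the right index.
joined = kdf1.join(kdf2, on="key")

# The same result via merge: move "key" into the index first, then
# merge index-to-index and restore "key" as a column.
merged = (
    kdf1.set_index("key")
    .merge(kdf2, left_index=True, right_index=True, how="left")
    .reset_index()
)

print(joined)
print(merged)
```

Without the `set_index("key")` step, the index-to-index merge would compare the left frame's default integer index with the right frame's string keys, which is the mismatch shown in the last example above.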

xinrong-meng · 2021-01-15

Thank you @itholic ! That's so clear :).

itholic · 2021-01-15

@xinrong-databricks np! Glad to know I helped :)

Fix DataFrame.join for MultiIndex

8ab8474

itholic added 3 commits September 16, 2020 14:56

Check the index names

99ae58b

Remove unused import

a3d8701

Fix mypy

e236af6

ueshin reviewed Sep 16, 2020


minor test fix

6bad029

ueshin approved these changes Sep 21, 2020


ueshin merged commit 4241205 into databricks:master Sep 21, 2020

itholic deleted the fix_f_join branch October 6, 2020 00:17

xinrong-meng reviewed Jan 13, 2021


