Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DataFrame.join for MultiIndex #1771

Merged
merged 5 commits into from
Sep 21, 2020
Merged

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Sep 14, 2020

This should resolve #1770

>>> toy_pd = pd.DataFrame(columns = ['day','item','size'], data = [[5, 0, 500],[5, 0, 550],[5, 1, 1500],[5, 1, 700],[5, 1, 900],
... [6, 0, 400],[6, 0, 300],[6, 0, 600], [6, 1, 800],[6, 1, 200],
... [7, 0, 600],[7, 1, 700],[7, 1, 700], [7, 2, 750],[7, 2, 500]])

>>> toy_ks1 = ks.from_pandas(toy_pd).groupby(['day','item']).agg({'size':'mean'})
>>> toy_ks2 = ks.from_pandas(toy_pd).groupby(['day','item']).agg({'size':'mean'})

>>> toy_ks1.join(toy_ks2, on = ['day','item'], rsuffix='r')
                 size        sizer
day item
5   1     1033.333333  1033.333333
7   1      700.000000   700.000000
    2      625.000000   625.000000
6   1      500.000000   500.000000
7   0      600.000000   600.000000
6   0      433.333333   433.333333
5   0      525.000000   525.000000

@ueshin
Copy link
Collaborator

ueshin commented Sep 15, 2020

I don't think this is a right fix. Instead, we should check the index names as well as the column names.

E.g.,

>>> toy_pd = pd.DataFrame(columns = ['day','item','size'], data = [[5, 0, 500],[5, 0, 550],[5, 1, 1500],[5, 1, 700],[5, 1, 900],
...                                                                [6, 0, 400],[6, 0, 300],[6, 0, 600], [6, 1, 800],[6, 1, 200],
...                                                                [7, 0, 600],[7, 1, 700],[7, 1, 700], [7, 2, 750],[7, 2, 500]])
>>>
>>> toy_pd1 = toy_pd.set_index('day')
>>> toy_pd2 = toy_pd.set_index('day')
>>> toy_pd1.join(toy_pd2, on='day', rsuffix='r')
     item  size  itemr  sizer
day
5       0   500      0    500
5       0   500      0    550
5       0   500      1   1500
5       0   500      1    700
5       0   500      1    900
..    ...   ...    ...    ...
7       2   500      0    600
7       2   500      1    700
7       2   500      1    700
7       2   500      2    750
7       2   500      2    500

[75 rows x 4 columns]

whereas with the current fix:

>>> ks.from_pandas(toy_pd1).join(ks.from_pandas(toy_pd2), on ='day', rsuffix='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 6953, in join
    self = self.set_index(on)
  File "/Users/ueshin/workspace/databricks-koalas/worktrees/work/databricks/koalas/frame.py", line 3207, in set_index
    raise KeyError(key)
KeyError: 'day'

@itholic
Copy link
Contributor Author

itholic commented Sep 16, 2020

Addressed and added related tests. Thanks, @ueshin !

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.

databricks/koalas/frame.py Show resolved Hide resolved
@ueshin
Copy link
Collaborator

ueshin commented Sep 21, 2020

Thanks! merging.

@ueshin ueshin merged commit 4241205 into databricks:master Sep 21, 2020
@itholic itholic deleted the fix_f_join branch October 6, 2020 00:17
'len(left_on) must equal the number of levels in the index of "right"'
)

need_set_index = len(set(on) & set(self.index.names)) == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic Would you please help me understand this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinrong-databricks

Sure!

This line checks if the given join keys are already included in the Index or not.

If not (True, in this statement, because we're checking if the intersection count of the set is 0), we need to set the given join keys as an Index using set_index below.

If you have any questions more, please feel free to ask ! :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Your explanation is so clear :)!

May I ask for the reason for set the given join keys as an Index?

Copy link
Contributor Author

@itholic itholic Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinrong-databricks

Sure!

This is because we uses merge after then.

If the given join keys are not in Index, the result of merge will be not correct.

For example, let's say we have two DataFrames as below.

>>> kdf1
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3

>>> kdf2
      B
key
K0   B0
K1   B1
K2   B2

And we can expect the result of join with on='keys' as below.

>>> kdf1.join(kdf2, on=['key'])
  key   A     B
0  K3  A3  None
1  K0  A0    B0
2  K1  A1    B1
3  K2  A2    B2

We can make the same result with merge as below.

>>> kdf1.set_index('key').merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
  key   A     B
0  K3  A3  None
1  K0  A0    B0
2  K1  A1    B1
3  K2  A2    B2

At this point, If we didn't kdf1.set_index('key'), the result will be different as below.

>>> kdf1.merge(kdf2, left_index=True, right_index=True, how='left').reset_index()
   index key   A     B
0      0  K0  A0  None
1      1  K1  A1  None
2      3  K3  A3  None
3      2  K2  A2  None

So, that's why we need set_index here!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @itholic ! That's so clear :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinrong-databricks np! Glad to know I helped :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError when joining on index level
3 participants