Fix DataFrame.join for MultiIndex #1771
@itholic Would you please help me understand this line?
@xinrong-databricks
Sure!
This line checks whether the given join keys are already included in the `Index`.
If not (`True` in this statement, because we're checking whether the intersection count of the set is `0`), we need to set the given join keys as an `Index` using `set_index` below.
If you have any more questions, please feel free to ask! :)
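The line under review is not reproduced in this thread, so the following is only a minimal sketch of the kind of check described above, assuming the index names and join keys are available as plain lists. The helper name `needs_set_index` and its parameters are hypothetical and are not taken from the PR.

```python
def needs_set_index(index_names, on):
    """Return True when none of the join keys are already index levels.

    Hypothetical helper: `index_names` are the current index level names,
    `on` is a single join key or a list of join keys.
    """
    keys = [on] if isinstance(on, str) else list(on)
    # An empty intersection means no join key is part of the index yet.
    return len(set(index_names) & set(keys)) == 0


print(needs_set_index(["id"], "key"))            # True: "key" is not an index level
print(needs_set_index(["key"], "key"))           # False: "key" is already the index
print(needs_set_index(["a", "b"], ["b", "c"]))   # False: "b" overlaps
```

When the check is `True`, the caller would go on to call `set_index` with the join keys before merging.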
Thank you! Your explanation is so clear :)!
May I ask why we set the given join keys as an `Index`?
@xinrong-databricks
Sure!
This is because we use `merge` afterwards.
If the given join keys are not in the `Index`, the result of `merge` will not be correct.
For example, let's say we have two DataFrames as below.
And we can expect the result of `join` with `on='keys'` as below.
We can make the same result with `merge` as below.
At this point, if we didn't call `kdf1.set_index('key')`, the result would be different, as below.
So, that's why we need `set_index` here!
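The code blocks from this comment are not reproduced here. The following is a minimal sketch of the kind of comparison described, written in plain pandas rather than Koalas. The frames `left` and `right` and their values are hypothetical and are not the DataFrames from the original comment.

```python
import pandas as pd

# Hypothetical data: the "key" values are deliberately in a different order
# from the default RangeIndex so that a positional mismatch is visible.
left = pd.DataFrame({"key": [2, 1, 0], "A": ["a2", "a1", "a0"]})
right = pd.DataFrame({"key": [0, 1, 2], "B": ["b0", "b1", "b2"]})

# Expected behaviour: join matches left["key"] against right's index.
joined = left.join(right.set_index("key"), on="key")

# Same result via an index-on-index merge, after moving "key" into the
# index on the left side as well.
merged = (
    left.set_index("key")
    .merge(right.set_index("key"), left_index=True, right_index=True, how="left")
    .reset_index()
)

# Without set_index on the left, the merge aligns left's RangeIndex
# (0, 1, 2) with right's "key" index, so rows are paired incorrectly.
wrong = left.merge(
    right.set_index("key"), left_index=True, right_index=True, how="left"
)

print(joined)
print(merged)
print(wrong)
```

In this sketch, `joined` and `merged` pair `a2` with `b2`, while `wrong` pairs `a2` with `b0`, which is the kind of mismatch that calling `set_index` on the join keys avoids.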
Thank you @itholic ! That's so clear :).
@xinrong-databricks np! Glad to know I helped :)