Fix DataFrame.merge to work properly #2060

Merged (4 commits) on Feb 22, 2021

itholic (Contributor) commented Feb 19, 2021

This should resolve #2055

Before:

>>> kdf = ks.DataFrame(
...     {
...         "lkey": ["foo", "bar", "baz", "foo", "bar", "l"],
...         "rkey": ["baz", "foo", "bar", "baz", "foo", "r"],
...         "value": [1, 1, 3, 5, 6, 7],
...         "x": list("abcdef"),
...         "y": list("efghij"),
...     },
...     columns=["lkey", "rkey", "value", "x", "y"],
... )

>>> left_kdf = kdf[["lkey", "value", "x"]]
>>> right_kdf = kdf[["rkey", "value", "y"]]

>>> ks.merge(left_kdf, right_kdf, on="value")
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Resolved attribute(s) y#801,rkey#798 missing from lkey#797,y#842,rkey#839,value#799L,__index_level_0__#837L,__natural_order__#808L,__natural_order__#836L,x#800,__index_level_0__#796L,value#840L in operator !Project [lkey#797, value#799L, x#800, rkey#798, y#801]. Attribute(s) with the same name appear in the operation: y,rkey. Please check if the right attribute(s) are used.;;
...
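
The message above names two attributes (`y#801`, `rkey#798`) that the `Project` operator still asks for, while attributes with the same names but different IDs are what is actually available. A minimal sketch for reading that mismatch out of the message, using only the standard library; the helper name `parse_attributes` is hypothetical, and the reading that the right-hand side of the join received fresh IDs is an interpretation of the message, not something the traceback states:

```python
import re

# Error text copied from the traceback above (trailing hint omitted).
message = (
    "Resolved attribute(s) y#801,rkey#798 missing from "
    "lkey#797,y#842,rkey#839,value#799L,__index_level_0__#837L,"
    "__natural_order__#808L,__natural_order__#836L,x#800,"
    "__index_level_0__#796L,value#840L in operator "
    "!Project [lkey#797, value#799L, x#800, rkey#798, y#801]."
)


def parse_attributes(text):
    """Split 'name#id,name#id' into (name, id) pairs, dropping a trailing 'L'."""
    pairs = []
    for item in text.split(","):
        name, attr_id = item.strip().split("#")
        pairs.append((name, attr_id.rstrip("L")))
    return pairs


match = re.search(r"attribute\(s\) (.+?) missing from (.+?) in operator", message)
missing = parse_attributes(match.group(1))
available = parse_attributes(match.group(2))

# For each missing attribute, list the IDs that are available under the same name.
stale = {
    name: (attr_id, [i for n, i in available if n == name])
    for name, attr_id in missing
}
print(stale)
```

This prints `{'y': ('801', ['842']), 'rkey': ('798', ['839'])}`: the projection refers to IDs 801 and 798, but only 842 and 839 exist for those names after the join.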

After:

>>> kdf = ks.DataFrame(
...     {
...         "lkey": ["foo", "bar", "baz", "foo", "bar", "l"],
...         "rkey": ["baz", "foo", "bar", "baz", "foo", "r"],
...         "value": [1, 1, 3, 5, 6, 7],
...         "x": list("abcdef"),
...         "y": list("efghij"),
...     },
...     columns=["lkey", "rkey", "value", "x", "y"],
... )

>>> left_kdf = kdf[["lkey", "value", "x"]]
>>> right_kdf = kdf[["rkey", "value", "y"]]

>>> ks.merge(left_kdf, right_kdf, on="value")
  lkey  value  x rkey  y
0    l      7  f    r  j
1  bar      6  e  foo  i
2  foo      5  d  baz  h
3  foo      1  a  baz  e
4  foo      1  a  foo  f
5  bar      1  b  baz  e
6  bar      1  b  foo  f
7  baz      3  c  bar  g
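
The result has 8 rows from two 6-row inputs because `value == 1` occurs twice on each side, so those rows pair up 2 x 2. A minimal pure-Python sketch of an inner join on `value` reproduces the same set of rows; it is only an illustration of the join semantics, not how Koalas or Spark implements the merge, and the helper `inner_join` is hypothetical:

```python
from collections import defaultdict

# Rows mirror left_kdf and right_kdf from the example above.
left = [
    {"lkey": "foo", "value": 1, "x": "a"},
    {"lkey": "bar", "value": 1, "x": "b"},
    {"lkey": "baz", "value": 3, "x": "c"},
    {"lkey": "foo", "value": 5, "x": "d"},
    {"lkey": "bar", "value": 6, "x": "e"},
    {"lkey": "l", "value": 7, "x": "f"},
]
right = [
    {"rkey": "baz", "value": 1, "y": "e"},
    {"rkey": "foo", "value": 1, "y": "f"},
    {"rkey": "bar", "value": 3, "y": "g"},
    {"rkey": "baz", "value": 5, "y": "h"},
    {"rkey": "foo", "value": 6, "y": "i"},
    {"rkey": "r", "value": 7, "y": "j"},
]


def inner_join(left_rows, right_rows, on):
    """Hash join: index the right rows by key, then pair each left row with its matches."""
    index = defaultdict(list)
    for right_row in right_rows:
        index[right_row[on]].append(right_row)
    joined = []
    for left_row in left_rows:
        for right_row in index.get(left_row[on], []):
            # The key column is shared, so merging the two dicts keeps it once.
            joined.append({**left_row, **right_row})
    return joined


merged = inner_join(left, right, on="value")
for row in merged:
    print(row["lkey"], row["value"], row["x"], row["rkey"], row["y"])
```

Row order differs from the Koalas output above, which is expected: the distributed result carries no ordering guarantee unless it is sorted explicitly.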

itholic (Contributor, Author) commented Feb 19, 2021

FYI: I believe the issue reported in #2055 should also be resolved with this change.

>>> kdf=ks.DataFrame({'transport_order_number': {11059585: ('696530708053'),  36538499: '696530708053',  41914814: '696530708053',  58878846: '696530708053',  83502171: '696530708053',  87335732: '696530708053',  89651819: '696530708053'},
...  'event_description': {11059585: 'PIEZA EN RUTA AL DESTINATARIO',  36538499: 'TRANSFERENCIA RUTA (OTBCS)',  41914814: 'RECEPCION TRANS. PIEZA',  58878846: 'RETIRO DESDE PDT',  83502171: 'RECEPCION TRANS. CONT.',  87335732: 'RECEPCIONADA',  89651819: 'PIEZA ENTREGADA A DESTINATARIO'},
...  'event_date': {11059585: ('2020-12-15 09:05:12.743000'),  36538499: ('2020-12-15 06:42:22.477000'),  41914814: ('2020-12-15 06:42:34.083000'),  58878846: ('2020-12-14 13:41:00'),  83502171: ('2020-12-15 06:42:00'),  87335732: ('2020-12-14 14:41:00'),
...   89651819: ('2020-12-15 12:53:00')}})
>>> recepcion=kdf.loc[kdf['event_description']=='RECEPCIONADA']
>>> retiro=kdf.loc[kdf['event_description']=='RETIRO DESDE PDT']

>>> ks.merge(recepcion, retiro, on='transport_order_number', how='outer', suffixes=('_recepcion','_retiro'))
  transport_order_number event_description_recepcion event_date_recepcion event_description_retiro    event_date_retiro
0           696530708053                RECEPCIONADA  2020-12-14 14:41:00         RETIRO DESDE PDT  2020-12-14 13:41:00
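
The column names in this output follow the pandas-style suffix rule: the join key is kept once, and every other column present on both sides gets the left or right suffix. A short sketch of that naming rule, assuming a single join key; `merged_columns` is a hypothetical helper for illustration and is not part of the Koalas API:

```python
def merged_columns(left_cols, right_cols, on, suffixes=("_x", "_y")):
    """Return output column names for a merge on a single key column."""
    # Columns present on both sides, other than the key, need disambiguating.
    overlap = (set(left_cols) & set(right_cols)) - {on}
    left_out = [c + suffixes[0] if c in overlap else c for c in left_cols]
    right_out = [
        c + suffixes[1] if c in overlap else c
        for c in right_cols
        if c != on  # the key column appears only once in the result
    ]
    return left_out + right_out


cols = ["transport_order_number", "event_description", "event_date"]
result = merged_columns(
    cols, cols, on="transport_order_number", suffixes=("_recepcion", "_retiro")
)
print(result)
```

The printed list matches the header row of the merge result above.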

ueshin (Collaborator) left a comment

LGTM, except for two nits.

databricks/koalas/frame.py (outdated, resolved)
databricks/koalas/frame.py (outdated, resolved)

ueshin mentioned this pull request Feb 20, 2021

itholic (Contributor, Author) commented Feb 21, 2021

Thanks for the review, @ueshin

codecov-io commented Feb 21, 2021

Codecov Report

Merging #2060 (16c664d) into master (2618c52) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2060   +/-   ##
=======================================
  Coverage   94.74%   94.75%           
=======================================
  Files          54       54           
  Lines       11675    11683    +8     
=======================================
+ Hits        11062    11070    +8     
  Misses        613      613           

Impacted Files                Coverage Δ
databricks/koalas/frame.py    96.58% <100.00%> (+0.01%) ⬆️

ueshin (Collaborator) commented Feb 22, 2021

Thanks! Merging.

ueshin merged commit 1af51a1 into databricks:master on Feb 22, 2021

Successfully merging this pull request may close these issues.

join/merge problems