Control duplicate rows in joins #5413

Closed
Nico-Rojas opened this issue Jul 15, 2020 · 0 comments · Fixed by #5910
Labels
feature (a feature request or enhancement), tables 🧮 (joins and set operations)

Comments

@Nico-Rojas

It would be incredible if you could specify which kinds of duplicates in the join variable(s) are acceptable, as in the STATA merge command.

This means having the possibility to decide whether it's ok to have:

  • Duplicates in the join variables of the master data but not in the join variables of the using data (an m:1 join)
  • Duplicates in the join variables of the using data but not in the join variables of the master data (a 1:m join)
  • Duplicates in both cases (an m:m join)
  • No duplicates (a 1:1 join)
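
The four cases above can be sketched as a check on key uniqueness. This is a plain-Python illustration only, not dplyr or STATA code: the helper names (`has_duplicates`, `classify_relationship`, `check_relationship`) and the sample data are hypothetical, and it assumes that `m` means duplicates are allowed on that side rather than required.

```python
from collections import Counter

def has_duplicates(rows, key):
    """Return True if any value of `key` appears in more than one row."""
    counts = Counter(row[key] for row in rows)
    return any(n > 1 for n in counts.values())

def classify_relationship(master, using, key):
    """Describe the observed relationship as '1:1', 'm:1', '1:m' or 'm:m'."""
    left = "m" if has_duplicates(master, key) else "1"
    right = "m" if has_duplicates(using, key) else "1"
    return f"{left}:{right}"

def check_relationship(master, using, key, expected):
    """Raise ValueError if a side declared as '1' has duplicate keys.

    Assumption: 'm' permits duplicates on that side but does not require them.
    """
    if expected not in {"1:1", "m:1", "1:m", "m:m"}:
        raise ValueError(f"unknown relationship: {expected!r}")
    left, right = expected.split(":")
    if left == "1" and has_duplicates(master, key):
        raise ValueError(f"duplicate {key!r} values in master data")
    if right == "1" and has_duplicates(using, key):
        raise ValueError(f"duplicate {key!r} values in using data")

# Hypothetical data: many students per school, one row per school.
students = [
    {"student": "ana", "school_id": 1},
    {"student": "ben", "school_id": 1},
    {"student": "cal", "school_id": 2},
]
schools = [
    {"school_id": 1, "name": "North"},
    {"school_id": 2, "name": "South"},
]

observed = classify_relationship(students, schools, "school_id")
check_relationship(students, schools, "school_id", "m:1")  # passes silently
```

Declaring `"1:1"` for the same data would raise, because `school_id` repeats in `students`; that is the merge-plus-quality-control behaviour requested here.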

I work in education, and sometimes data sets are poorly digitized, so some students appear to have been enrolled in more than one school (duplicates in the join variable). When that happens in both the master and the using data, those students' rows multiply in the resulting data frame (every duplicate on one side is paired with every duplicate on the other), as happens now with, for example, left_join. You may not want that. Also, if you are not careful, you may overlook duplicate students, which silently add more observations to your results.
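
To make the row multiplication concrete, here is a minimal plain-Python sketch of a naive left join. The `left_join` function below is a hypothetical stand-in written for this illustration, not dplyr's implementation, and the enrollment and score data are invented.

```python
from collections import defaultdict

def left_join(master, using, key):
    """Naive left join: pair each master row with every matching using row."""
    index = defaultdict(list)
    for row in using:
        index[row[key]].append(row)

    joined = []
    for left_row in master:
        matches = index.get(left_row[key])
        if not matches:
            joined.append(dict(left_row))  # unmatched master rows are kept
            continue
        for right_row in matches:
            joined.append({**left_row, **right_row})
    return joined

# Hypothetical data: student 7 was recorded at two schools by mistake
# and also has three score records.
enrollment = [
    {"student_id": 7, "school": "North"},
    {"student_id": 7, "school": "South"},
    {"student_id": 8, "school": "North"},
]
scores = [
    {"student_id": 7, "score": 81},
    {"student_id": 7, "score": 64},
    {"student_id": 7, "score": 90},
    {"student_id": 8, "score": 75},
]

result = left_join(enrollment, scores, "student_id")
rows_for_7 = [r for r in result if r["student_id"] == 7]
```

Student 7 contributes 2 × 3 = 6 rows and student 8 contributes 1, so a 3-row master table becomes a 7-row result without any warning.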

STATA used to have the same issue: it had only two commands, "merge" and "join", somewhat similar to R, and did not let you decide when duplicates are acceptable (e.g., duplicate information in the using data when merging students to school characteristics, where school id is the join variable) and when they may not be (e.g., duplicate information in both the master and the using data when merging a student list to student characteristics, where student id is the join variable). At some point that feature was added, and it's great because it lets you merge and do quality control in a single step.

Thank you for your time and amazing work. Hope this helps.

@romainfrancois added the "tables 🧮 joins and set operations" label on Aug 11, 2020
@hadley added the "feature a feature request or enhancement" label on Nov 16, 2020
@hadley changed the title from "Duplicate control of rows in join commands" to "Control duplicate rows in joins" on Nov 16, 2020
3 participants