Control duplicate rows in joins #5413

Nico-Rojas · 2020-07-15T20:41:24Z

It would be incredible if you could specify which kinds of duplicates in the join variable(s) are acceptable, as in the STATA merge command.

This means having the possibility to decide whether it's ok to have:

- Duplicates in the join variables of the master data but not in the join variables of the using data (an m:1 join)
- Duplicates in the join variables of the using data but not in the join variables of the master data (a 1:m join)
- Duplicates in both (an m:m join)
- No duplicates in either (a 1:1 join)
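
To make the four cases concrete, here is a minimal sketch in plain Python (not dplyr or STATA code) of how the relationship could be read off the key columns. The names `has_duplicates` and `join_relationship` and the sample ids are hypothetical, used only for illustration.

```python
from collections import Counter


def has_duplicates(keys):
    """Return True if any key value appears more than once."""
    return any(count > 1 for count in Counter(keys).values())


def join_relationship(master_keys, using_keys):
    """Classify a join as '1:1', 'm:1', '1:m', or 'm:m' from its key columns."""
    left = "m" if has_duplicates(master_keys) else "1"
    right = "m" if has_duplicates(using_keys) else "1"
    return f"{left}:{right}"


# Hypothetical example: students (master) joined to schools (using) by school id.
student_school_ids = [10, 10, 20]  # several students share a school
school_ids = [10, 20]              # each school is listed once
relationship = join_relationship(student_school_ids, school_ids)
print(relationship)  # m:1
```

A join that declares the relationship it expects could compare it with this result and stop with an error on a mismatch, instead of returning extra rows without warning.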

I work in education, and sometimes datasets are poorly digitized, so some students appear to be enrolled in more than one school (duplicates in the join variable). When that happens in both the master and the using data, those students multiply in the resulting data frame, because every duplicate row in the master data is matched with every duplicate row in the using data. This is what happens now with, for example, left_join. You may not want that. Also, if you are not careful, you may overlook duplicate students that silently add observations to your results.
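
The row multiplication can be reproduced with a naive left join over lists of dicts. This is a sketch under stated assumptions: the `left_join` function below is a hypothetical stand-in written for illustration, not dplyr's implementation, and the student data is made up.

```python
def left_join(master, using, key):
    """Naive left join: each master row pairs with every matching using row."""
    joined = []
    for m_row in master:
        matches = [u_row for u_row in using if u_row[key] == m_row[key]]
        if not matches:
            joined.append(dict(m_row))  # keep unmatched master rows as they are
        for u_row in matches:
            joined.append({**m_row, **u_row})
    return joined


# Hypothetical data: student 1 appears twice in both tables.
enrollment = [
    {"student_id": 1, "school": "A"},
    {"student_id": 1, "school": "B"},
    {"student_id": 2, "school": "A"},
]
scores = [
    {"student_id": 1, "score": 70},
    {"student_id": 1, "score": 75},
    {"student_id": 2, "score": 90},
]

result = left_join(enrollment, scores, "student_id")
rows_for_student_1 = [row for row in result if row["student_id"] == 1]
print(len(result), len(rows_for_student_1))  # 5 4
```

Student 1 has two rows in each input and ends up with 2 × 2 = 4 rows in the output, so three master rows become five without any warning.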

STATA used to have the same issue: it had only two commands, "merge" and "join", somewhat similar to R, and did not let you decide when duplicates are acceptable (e.g. duplicate information in the using data when merging students to school characteristics, where school id is the join variable) and when they are not (e.g. duplicate information in both the master and using data when merging a student list to student characteristics, where student id is the join variable). At some point they added that feature, and it's great because it lets you merge and do quality control in a single step.

Thank you for your time and amazing work. Hope this helps.

romainfrancois added the tables 🧮 joins and set operations label Aug 11, 2020

hadley added the feature a feature request or enhancement label Nov 16, 2020

hadley changed the title ~~Duplicate control of rows in join commands~~ Control duplicate rows in joins Nov 16, 2020

DavisVaughan mentioned this issue Jun 18, 2021

Flexible joins #5910

Merged

DavisVaughan closed this as completed in #5910 May 9, 2022
