It would be incredible if you could specify which kinds of duplicates in the join variable(s) are acceptable, as in the Stata merge command.
This means having the possibility to decide whether it's ok to have:
Duplicates in the join variables of the master data but not in the join variables of the using data (an m:1 join)
Duplicates in the join variables of the using data but not in the join variables of the master data (a 1:m join)
Duplicates in both cases (an m:m join)
No duplicates (a 1:1 join)
I work in education, and sometimes datasets are poorly digitized, so some students appear to have been enrolled in more than one school (duplicates in the join variable). When that happens in both the master and the using data, those students are multiplied in the resulting data frame (one row for every pair of matching rows), as currently happens with, for example, left_join. You may not want that. Also, if you are not careful, you may overlook duplicate students, who silently add observations to your results.
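To illustrate the row multiplication, here is a minimal sketch using base R's `merge()` as a stand-in for a left join. The data frames and column names are made up for this example.

```r
# Hypothetical toy data: student 1 appears twice in each table.
master <- data.frame(
  student_id = c(1, 1, 2),
  school_id  = c("A", "B", "A")
)
using <- data.frame(
  student_id = c(1, 1, 2),
  score      = c(90, 85, 70)
)

# Base R left join on student_id (all.x = TRUE keeps every master row).
joined <- merge(master, using, by = "student_id", all.x = TRUE)

nrow(master)  # 3
nrow(joined)  # 5: student 1 yields 2 x 2 = 4 rows, student 2 yields 1
```

Nothing warns you that the result has more rows than the master data, which is exactly the situation I would like to be able to catch.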
Stata used to have the same issue: it only had two commands, "merge" and "join", somewhat similar to R, and did not let you decide when duplicates are acceptable (e.g. duplicate information in the using data when merging students to school characteristics, where school id is the join variable) and when they may not be acceptable (e.g. duplicate information in the master and using data when merging a student list to student characteristics, where student id is the join variable). At some point they added that feature, and it's great because it lets you merge and do quality control in one single step.
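As a rough sketch of the behaviour I have in mind, something like the following could work. `checked_left_join()` and its `type` argument are hypothetical names for this example, not an existing dplyr or base R function, and base R's `merge()` again stands in for the join.

```r
# Hypothetical helper (not part of dplyr or base R): a left join that fails
# when the join keys violate the declared relationship.
checked_left_join <- function(master, using, by,
                              type = c("1:1", "m:1", "1:m", "m:m")) {
  type <- match.arg(type)

  # TRUE if any combination of the key columns occurs more than once.
  has_dups <- function(df) anyDuplicated(df[, by, drop = FALSE]) > 0

  # The master side must be unique for 1:1 and 1:m joins.
  if (type %in% c("1:1", "1:m") && has_dups(master)) {
    stop("join keys are not unique in the master data")
  }
  # The using side must be unique for 1:1 and m:1 joins.
  if (type %in% c("1:1", "m:1") && has_dups(using)) {
    stop("join keys are not unique in the using data")
  }

  merge(master, using, by = by, all.x = TRUE)
}

# Hypothetical example data: many students per school, one row per school.
students <- data.frame(student_id = 1:3, school_id = c("A", "A", "B"))
schools  <- data.frame(school_id = c("A", "B"), region = c("North", "South"))

# An m:1 join passes, because school_id is unique in the using data.
res <- checked_left_join(students, schools, by = "school_id", type = "m:1")

# Declaring type = "1:1" here would stop with an error instead,
# because school_id repeats in the master data.
```

The point is that the check and the join happen in the same call, so an unexpected duplicate stops the merge instead of silently adding rows.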
Thank you for your time and amazing work. Hope this helps.