Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to validate the cardinality in joins #5914

Closed
lorenzwalthert opened this issue Jun 10, 2021 · 2 comments · Fixed by #5910
Closed

Allow users to validate the cardinality in joins #5914

lorenzwalthert opened this issue Jun 10, 2021 · 2 comments · Fixed by #5910
Labels
feature a feature request or enhancement tables 🧮 joins and set operations

Comments

@lorenzwalthert
Copy link
Contributor

lorenzwalthert commented Jun 10, 2021

One thing I often do in production code after a merge is validating my assumptions on unique keys and relationship between them in both merge inputs (i.e. cardinality checks). For example, if I have duplicated keys in one data frame and perform an inner join, I end up with more rows that both input tables had, which might come as a surprise and can cause downstream problems if I don't manually assert this.

In a script, checking these conditions and raising a (helpful) error can take a few lines of code quickly, let alone the verbosity if you have multiple sequential joins. For that reason, I propose to offer the user the option to opt in for cardinality validation as part of the join, very much like Python's pandas implemented this:

import pandas
pandas.merge(..., validate = 'one_to_one')

In {dplyr}, it could be

dplyr::inner_join(..., validate = "1:N")

Maybe we could the official abbreviations (not sure they even exist, maybe case insensitive), e.g. 1:n, m:n etc.

If this is out of scope for {dplyr}, maybe it's something for @krlmlr in {dm}?

@lorenzwalthert lorenzwalthert changed the title Validate cardinality for join functions Allow users to validate the cardinality in joins Jun 10, 2021
@hadley hadley added feature a feature request or enhancement tables 🧮 joins and set operations labels Sep 16, 2021
@alexpghayes
Copy link

I would be thrilled to see this in dplyr, this kind of check would have saved me an incredible amount of time on numerous projects. It would also be nice if there were an option() allowing users to set custom default values for this, so that in my .Rprofile I could opt-in to one-to-one checking by default.

@lorenzwalthert
Copy link
Contributor Author

lorenzwalthert commented Jan 13, 2022

Also see {powerjoin} for how other people implemented that functionality as a {dplyr} extension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tables 🧮 joins and set operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants