I would be thrilled to see this in dplyr; this kind of check would have saved me an incredible amount of time on numerous projects. It would also be nice if there were an option() allowing users to set custom default values for this, so that in my .Rprofile I could opt in to one-to-one checking by default.
One thing I often do in production code after a merge is validate my assumptions about unique keys and the relationship between them in both merge inputs (i.e. cardinality checks). For example, if I have duplicated keys in one data frame and perform an inner join, I can end up with more rows than the input tables had, which might come as a surprise and can cause downstream problems if I don't manually assert this.
In a script, checking these conditions and raising a (helpful) error can quickly take a few lines of code, and the verbosity adds up if you have multiple sequential joins. For that reason, I propose letting the user opt in to cardinality validation as part of the join, much like Python's pandas does with the validate argument of its merge function.
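For illustration, here is a minimal sketch of the pandas behaviour referred to above. The data frames `customers` and `orders` and their columns are made up for this example; only `merge`, its `validate` argument, and `pandas.errors.MergeError` come from pandas itself.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical example data: `orders` repeats the key customer_id == 1.
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})

# Without validation the join succeeds silently and customer 1 is repeated.
joined = customers.merge(orders, on="customer_id", how="inner")
n_rows = len(joined)

# With validate="one_to_one", pandas checks that the keys are unique on
# both sides and raises MergeError because `orders` has a duplicate.
try:
    customers.merge(orders, on="customer_id", how="inner", validate="one_to_one")
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

# The relationship that actually holds here (unique left, repeated right)
# passes the check and returns the joined frame.
checked = customers.merge(
    orders, on="customer_id", how="inner", validate="one_to_many"
)
```

Here `joined` has more rows than `customers`, the one-to-one check fails, and the one-to-many check passes. pandas also accepts the short forms "1:1", "1:m", "m:1" and "m:m" for `validate`.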
In {dplyr}, it could be
Maybe we could use the official abbreviations (not sure they even exist, maybe case insensitive), e.g. 1:n, m:n, etc.
If this is out of scope for {dplyr}, maybe it's something for @krlmlr in {dm}?