Skip to content

Conversation

delsner
Copy link
Member

@delsner delsner commented Jun 25, 2025

Motivation

filter_relationship_one_to_one does only check for a 1:1 relationship in case both data frames are unique w.r.t. the provided join key. Also, the docstrings are incorrect as the join columns cannot be inferred.

Changes

  • Add keep_only_unique to allow filtering for 1:1 relationships in case join columns do not uniquely identify rows
  • Fix docstrings

@delsner delsner self-assigned this Jun 25, 2025
@github-actions github-actions bot added the fix label Jun 25, 2025
@delsner
Copy link
Member Author

delsner commented Jun 25, 2025

@borchero @AndreasAlbertQC wdyt of this fix? One could argue that these functions are only to be used inside dy.filter where you would typically pass the common primary key as join key, but I still find it unintuitive that we don't actually filter for a 1:1 relationship.

Happy to adjust the API if you have a better idea.

Copy link

codecov bot commented Jun 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (491de2e) to head (5c6b97b).
Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #67   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           41        41           
  Lines         2207      2210    +3     
=========================================
+ Hits          2207      2210    +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@delsner delsner requested review from borchero and robinholzi June 25, 2025 17:03
@delsner delsner marked this pull request as ready for review July 1, 2025 19:53
@delsner delsner requested a review from AndreasAlbertQC as a code owner July 1, 2025 19:53
@AndreasAlbertQC AndreasAlbertQC changed the title fix: Add drop_non_unique parameter to filter_relationship_one_to_one fix: Add keep_only_unique parameter to filter_relationship_one_to_one Jul 2, 2025
@AndreasAlbertQC AndreasAlbertQC changed the title fix: Add keep_only_unique parameter to filter_relationship_one_to_one fix: Add keep_only_unique parameter to filter_relationship_one_to_one Jul 2, 2025
@AndreasAlbertQC
Copy link
Collaborator

@delsner I think your change is good, but I wonder if we need to do something else to mitigate confusing results if the dataframes are not unique on the join keys already. If I don't set the new flag, I will get nonsensical results, right?
Possible solutions:

  1. check for uniqueness and if its violated, do one of the following:
    1.1 hard error
    1.2 warning
    1.3 automatically activate the new flag plus warning
  2. Don't check anything, and document that you will get nonsensical results if you don't ensure that the keys are unique already, so the user has to take care of it.

I am hesitating between 1.3 and 2.

@delsner
Copy link
Member Author

delsner commented Jul 2, 2025

If I don't set the new flag, I will get nonsensical results, right?

Potentially, yes.

I am hesitating between 1.3 and 2.

2 is easy, I can just update the docstring.
IMO 1.3 is the right way to do it, but this is a breaking change, albeit a bug-fixing one. I think hardly anyone is using this feature since it's undocumented, so I'd also be fine to go with 1.3. Wdyt?

Any thoughts @borchero?

@AndreasAlbertQC
Copy link
Collaborator

@delsner I'd be happy with counting option 1.3 as a bug fix bc I think the current behavior generates incorrect output data

@borchero
Copy link
Member

borchero commented Jul 2, 2025

I made the current choice deliberately as additional uniqueness validations are really expensive. Hence, my initial thought would be to go for (2).

Nevertheless, it potentially makes sense to make this behavior opt-in, i.e. require the user to set some kind of flag to skip the additional validation step.

However, I'm not sure I love the extension of the filter_relationship_one_to_one method; we're somehow evaluating a "many-to-many" relationship here, don't we? 🤔

@borchero borchero added the v2 label Jul 15, 2025
@borchero borchero added this to the v2 milestone Jul 15, 2025
@borchero borchero removed the v2 label Jul 15, 2025
@delsner
Copy link
Member Author

delsner commented Jul 15, 2025

@borchero and I decided to rename these functions to require_* and also will update the one_to_at_least_one function. We'll also rename the parameter to is_unique and change its default to False.

@delsner delsner marked this pull request as draft July 15, 2025 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants