Fix for delta with retry when multiple error rows with same id
#1310
Conversation
Reviewer's Guide

This PR replaces the use of `.subtract()` with `.diff()` in the delta retry chain.

Sequence diagram for delta retry chain with diff instead of subtract:

```mermaid
sequenceDiagram
    participant RetryChain
    participant DiffChain
    participant Result
    RetryChain->>DiffChain: diff(on="id", added=true, same=true, modified=false, deleted=false)
    DiffChain-->>Result: Returns all error rows, including duplicates with same id
    Result-->>RetryChain: Retry chain proceeds with correct error rows
```

Class diagram for RetryChain and DiffChain interaction update:

```mermaid
classDiagram
    class RetryChain {
        +diff(chain, on, added, same, modified, deleted)
    }
    class DiffChain
    RetryChain --> DiffChain: uses diff()
    %% Previously: RetryChain --> DiffChain: uses subtract()
```
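The behavioral difference driving this change can be sketched in plain Python. This is a toy model, not DataChain's implementation: `subtract_dedup` mimics the buggy SQLite behavior observed here (same-key duplicates get dropped), while `diff_keep_dupes` keeps every matching left row, which is the behavior the retry chain needs.

```python
# Toy model of the two behaviors (NOT DataChain code).
def subtract_dedup(left, right, key):
    # Buggy behavior: drops duplicate keys on the left side.
    right_keys = {key(r) for r in right}
    seen = set()
    out = []
    for row in left:
        k = key(row)
        if k not in right_keys and k not in seen:
            seen.add(k)  # duplicates with the same key are lost here
            out.append(row)
    return out


def diff_keep_dupes(left, right, key):
    # Desired behavior: every left row whose key is absent on the
    # right survives, duplicates included.
    right_keys = {key(r) for r in right}
    return [row for row in left if key(row) not in right_keys]


left = [(1, "err-a"), (1, "err-b"), (2, "ok")]  # two error rows share id=1
right = [(2, "ok")]
k = lambda row: row[0]
```

With this data, `diff_keep_dupes` returns both id=1 error rows, while `subtract_dedup` silently keeps only the first.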
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tests/unit/lib/test_datachain.py:2234` </location>
<code_context>
assert set(chain4.subtract(chain5, on="d", right_on="a").to_list()) == {(3, "z")}
+def test_subtract_duplicated_rows(test_session):
+ chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
+ chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
+ sub = chain1.subtract(chain2, on="id")
+ assert set(sub.to_list()) == {(1, "1"), (1, "1")}
+
+
</code_context>
<issue_to_address>
Consider adding assertions for edge cases with more complex duplicates.
Adding tests with duplicate rows that differ in 'name', or with duplicates in the right chain, will better validate the subtract logic against varied duplication scenarios.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_subtract_duplicated_rows(test_session):
chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
sub = chain1.subtract(chain2, on="id")
assert set(sub.to_list()) == {(1, "1"), (1, "1")}
=======
def test_subtract_duplicated_rows(test_session):
# Left chain has duplicate rows with same id and name
chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
sub = chain1.subtract(chain2, on="id")
assert set(sub.to_list()) == {(1, "1"), (1, "1")}
def test_subtract_left_duplicates_different_names(test_session):
# Left chain has duplicate ids but different names
chain1 = dc.read_values(id=[1, 1, 2], name=["a", "b", "c"], session=test_session)
chain2 = dc.read_values(id=[2], name=["c"], session=test_session)
sub = chain1.subtract(chain2, on="id")
# Only rows with id=1 should remain, both "a" and "b"
assert set(sub.to_list()) == {(1, "a"), (1, "b")}
def test_subtract_right_duplicates(test_session):
# Right chain has duplicate rows
chain1 = dc.read_values(id=[1, 2, 3], name=["x", "y", "z"], session=test_session)
chain2 = dc.read_values(id=[2, 2], name=["y", "y"], session=test_session)
sub = chain1.subtract(chain2, on="id")
# Only rows with id=1 and id=3 should remain
assert set(sub.to_list()) == {(1, "x"), (3, "z")}
>>>>>>> REPLACE
</suggested_fix>
```python
def test_subtract_duplicated_rows(test_session):
    chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
    chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
    sub = chain1.subtract(chain2, on="id")
    assert set(sub.to_list()) == {(1, "1"), (1, "1")}
```
suggestion (testing): Consider adding assertions for edge cases with more complex duplicates.
Adding tests with duplicate rows that differ in 'name', or with duplicates in the right chain, will better validate the subtract logic against varied duplication scenarios.
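One subtlety with the test as written: the assertion compares `set`s, and a set collapses equal tuples, so `{(1, "1"), (1, "1")}` is a one-element set and the check passes whether one or two copies survive. A count-based check (plain-Python sketch below) would actually detect a dropped duplicate:

```python
from collections import Counter

rows = [(1, "1"), (1, "1")]  # two duplicate rows, as in the test

# Set-based assertion: cannot distinguish one surviving copy from two.
assert set(rows) == {(1, "1")}

# Counter preserves multiplicity, so it would catch deduplication.
assert Counter(rows) == Counter({(1, "1"): 2})
```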
```python
    .gen(func, output={"id": int, "name": str, "error": str})
    .save("processed_data")
)
return dc.read_dataset("processed_data")
```
suggestion (code-quality): Remove unreachable code (remove-unreachable-code)
```python
return dc.read_dataset("processed_data")
```
```diff
@@ -124,7 +124,13 @@ def _get_retry_chain(
     # Subtract also diff chain since some items might be picked
```
comment needs an update
```diff
     # result dataset atm)
-    return retry_chain.subtract(diff_chain, on=on) if retry_chain else None
+    return (
+        retry_chain.diff(
```
do we really want these semantics? do we really want to retry multiple times per item? it seems wrong in most cases tbh
This line just replaces `.subtract()` with `.diff()`, which should produce the same output (nothing should change). The reason I added this is that, for some still unexplained reason, in the CLI (SQLite) subtract here works differently than expected (wrongly). Because of this strange behavior it was accidentally removing duplicates in that client's example, and I thought that the Studio / CH code was the issue and CLI / SQLite was ok, but it was the other way around.
So to break it down:
- The client issue is not actually an issue / bug. Duplicates are expected because of their code; please look at my explanation to them here.
- On the other hand, there is an issue in CLI / SQLite caused by this wrong subtract behavior, which I accidentally found when debugging the client "issue". It is fixed by using `.diff()` instead, which works as expected, and now CH and SQLite work the same / consistently.
- I couldn't reproduce this wrong `.subtract()` SQLite behavior in an isolated test (added it in this PR), which means it's probably something related to the specific context in `delta` logic, but I don't think we should waste more time on that (I've already wasted too much IMO).
Client issue is not actually an issue / bug. Duplicates are expected because of their code, please look at my explanation to them here
I think I understand the reason; my question is - do we really want to keep those semantics, or would it be better to deduplicate records based on `on`? So, even if the target table has multiple errors with the same measurement_ids, we take it only once. I think that is more expected.
This will solve the issue from the product perspective and remove this discrepancy anyway (though it is an interesting thing - it might hurt us somewhere else down the road).
Yea, it's an interesting question, but I think we should not do that, as we might disrupt specific client logic with it. Note that those error rows from the client issue are not completely the same (they differ in one column), and the first question is which one to discard and which one to keep. If they don't want this kind of behavior, they can easily fix it in their business logic.
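The deduplication being floated in this thread (take each id only once before retrying) could look roughly like this in plain Python, a sketch rather than DataChain's delta code:

```python
def dedupe_by_key(rows, key):
    # Keep only the first row seen for each key value; later
    # duplicates (e.g. a second error row for the same id) are dropped.
    seen = set()
    out = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


# Two error rows share id=7 but differ in the error column - the
# "which one to discard" question raised above.
errors = [(7, "timeout"), (7, "parse error"), (9, "timeout")]
```

Here `dedupe_by_key(errors, key=lambda r: r[0])` keeps `(7, "timeout")` and `(9, "timeout")`, silently discarding `(7, "parse error")`.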
Tbh, looking more into this, I think retrying once is the only right option here. We are retrying input items, and it is a single one in this case, unless I'm missing something. We are not retrying outputs.
They could indeed change the logic, but it might be very inconvenient. It is fine to use gen and produce multiple items for a single row.
It also doesn't make sense that the shape keeps changing on a retry run with a very regular gen.
But we are not retrying input items; the whole implementation now gets error rows from the result, which makes sense as that's where the errors will be. Output is not 1-1 with input, especially if you have gen, which can generate a totally arbitrary number of various rows (including error rows), and input rows are not kept after it. We cannot know from which input we got an error at the end in that case.
If they didn't generate 2 error rows from one input row, there wouldn't be any issue.
Aha, sorry, we do merge source records with error rows based on the `on` column, and we use the source schema ..
They might need multiple errors. Moreover, it can be a mix (errors and not). It would be wrong and complicated to change it to generate a single row in case of any errors.
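A toy illustration of the pattern under discussion: a `gen`-style function that yields several rows for a single input, including more than one error row sharing the same id, and a mix of error and non-error rows (plain Python, not the client's actual code):

```python
from collections.abc import Iterator


def process(item_id: int) -> Iterator[tuple[int, str, str]]:
    # One input row fans out into several output rows: a partial
    # result plus multiple error rows that all share the same id.
    yield (item_id, "partial-result", "")
    yield (item_id, "", "stage-1 failed")
    yield (item_id, "", "stage-2 failed")


rows = list(process(1))
```

A single input here produces three output rows, two of which are error rows with the same id - exactly the shape that made the dedup-on-subtract behavior visible.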
tests/func/test_retry.py
Outdated
```python
def test_repeating_errors(test_session):
    from collections.abc import Iterator

    def create_input(num_values):
```
can we reuse the existing numerous helpers in this file:
- `_create_sample_data`
- `_simple_process`
- `_process_with_errors`
- etc etc ....
Added `_create_sample_data`. The others are not useful for this test.
So, does subtract itself work fine or not? Is it the expected behavior? I'm not sure I fully understand the description ...
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #1310   +/-   ##
=======================================
  Coverage   88.74%   88.74%
=======================================
  Files         155      155
  Lines       14148    14149       +1
  Branches     1999     1999
=======================================
+ Hits        12556    12557       +1
  Misses       1125     1125
  Partials      467      467
```
Flags with carried forward coverage won't be shown.
Deploying datachain-documentation with Cloudflare Pages

| Latest commit: | e0c0937 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://764521ce.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-12068-duplicated-row.datachain-documentation.pages.dev |
Currently, if there are multiple error rows generated with the same `id` (the name of the column used for matching), then, for some unexplained reason, subtract filters out the "duplicates" and leaves only one error row in the result. I've also added a test for subtract that should have failed as well, but for some yet unknown reason the test passes, which means that subtract behaves differently in the test and in the retry delta code.

Instead of `.subtract()`, `.diff()` is now used, which works as expected. Note that `.subtract()` was working fine with the Clickhouse DB (also not explainable right now).

This issue is related to https://github.com/iterative/studio/issues/12068
Summary by Sourcery
Fix delta retry logic to correctly include multiple error rows with the same id by replacing subtract with diff, and add tests to validate duplicate handling.