Fix for delta with retry when multiple error rows with same id (#1310)
```diff
@@ -125,7 +125,15 @@ def _get_retry_chain(
     # Subtract also diff chain since some items might be picked
     # up by `delta=True` itself (e.g. records got modified AND are missing in the
     # result dataset atm)
-    return retry_chain.subtract(diff_chain, on=on) if retry_chain else None
+    on = [on] if isinstance(on, str) else on
+
+    return (
+        retry_chain.diff(
```
Contributor: Do we really want this semantics? Do we really want to retry multiple times per item? It seems wrong in most cases, tbh.
Author: This line is just replacing the previous `subtract` call.
Contributor: I think I understand the reason; my question is: do we really want to keep that semantics, or would it be better to deduplicate records? That would solve the issue from the product perspective and remove this discrepancy anyway (though it is an interesting case - it might hurt us somewhere else down the road).
Author: Yea, it's an interesting question, but I think we should not do that, as we might disrupt specific client logic with it. Note that those error rows from the client issue are not completely the same (they differ in one column), and the first question is which one to discard and which one to keep. If they don't want this kind of behavior, they can easily fix it in their business logic.
Contributor: Tbh, looking more into this, I think retrying once is the only right option here. We are retrying input items, and it is a single one in this case, unless I'm missing something. We are not retrying outputs. They could indeed change their logic, but it might be very inconvenient. It is fine to use `gen` and produce multiple items per single input row.
Contributor: It also doesn't make sense that the shape keeps changing on a retry run with a very regular `gen`.
Author: But we are not retrying input items; the whole implementation now gets error rows from the result, which makes sense, as that's where the errors will be. Output is not 1-1 with input, especially if you have
Author: If they didn't generate 2 error rows from one input row, there wouldn't be any issue.
Author: Aha, sorry, we do merge source records with error rows based on
Contributor: They might need multiple errors. Moreover, it can be a mix (errors and not). It would be wrong and complicated to change it to generate a single row in case of any errors.
```diff
+            diff_chain, on=on, added=True, same=True, modified=False, deleted=False
+        ).distinct(*on)
+        if retry_chain
+        else None
+    )
 
 
 def _get_source_info(
```
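For illustration, here is a pure-Python sketch of the selection the new expression performs on plain lists of dicts (a hypothetical helper, not the DataChain API): `diff(..., added=True, same=True, modified=False, deleted=False)` keeps retry rows that are new or unchanged relative to the diff chain and drops modified ones (those are already picked up by `delta=True`), and `distinct(*on)` collapses multiple error rows with the same key into a single retry.

```python
def pick_retry_rows(retry_rows, diff_rows, on):
    """Sketch of retry_chain.diff(diff_chain, on=on, added=True, same=True,
    modified=False, deleted=False).distinct(*on) over lists of dicts.
    Hypothetical helper for illustration only."""
    # Index diff-chain rows by their key tuple.
    diff_by_key = {}
    for row in diff_rows:
        diff_by_key.setdefault(tuple(row[c] for c in on), []).append(row)

    seen_keys, result = set(), []
    for row in retry_rows:
        key = tuple(row[c] for c in on)
        # 'modified': the key exists in the diff chain but the row differs
        # there, so delta=True already reprocesses it -- skip the retry.
        if key in diff_by_key and row not in diff_by_key[key]:
            continue
        # distinct(*on): retry each key at most once, even when the previous
        # run produced several error rows for the same id.
        if key not in seen_keys:
            seen_keys.add(key)
            result.append(row)
    return result
```

With two error rows sharing `id=1` and an empty diff chain, only one row comes back, which is exactly the duplicate-retry behavior this PR fixes.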
```diff
@@ -1,3 +1,4 @@
+from collections.abc import Iterator
 from datetime import datetime, timezone
 from typing import TYPE_CHECKING
```

```diff
@@ -425,3 +426,42 @@ def test_delta_and_delta_retry_no_duplicates(test_session):
     assert len(ids_in_result) == 4
     assert len(set(ids_in_result)) == 4  # No duplicate IDs
     assert set(ids_in_result) == {1, 2, 3, 4}
+
+
+def test_repeating_errors(test_session):
+    def run_delta():
+        def func(id) -> Iterator[tuple[int, str, str]]:
+            yield id, "name1", "error"
+            yield id, "name2", "error"
+
+        return (
+            dc.read_dataset(
+                "sample_data",
+                delta=True,
+                delta_on="id",
+                delta_result_on="id",
+                delta_retry="error",
+                session=test_session,
+            )
+            .gen(func, output={"id": int, "name": str, "error": str})
+            .save("processed_data")
+        )
+        return dc.read_dataset("processed_data")
```
|
Contributor: suggestion (code-quality): Remove unreachable code (the second `return` in `run_delta` can never execute).
```diff
+
+    _create_sample_data(
+        test_session, ids=list(range(1)), contents=[str(i) for i in range(1)]
+    )
+    ch1 = run_delta()
+    assert sorted(ch1.collect("id")) == [0, 0]
+
+    _create_sample_data(
+        test_session, ids=list(range(2)), contents=[str(i) for i in range(2)]
+    )
+    ch2 = run_delta()
+    assert sorted(ch2.collect("id")) == [0, 0, 1, 1]
+
+    _create_sample_data(
+        test_session, ids=list(range(3)), contents=[str(i) for i in range(3)]
+    )
+    ch3 = run_delta()
+    assert sorted(ch3.collect("id")) == [0, 0, 1, 1, 2, 2]
```
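The accumulation the test asserts can be modeled in plain Python (a toy simulation for illustration, not the DataChain implementation): every output row carries an error, so each run retries every already-processed id exactly once (thanks to the deduplication fix) and processes any new ids, and each id always contributes exactly two rows.

```python
def run(source_ids, previous_result):
    """Toy model of one delta+retry run; rows are (id, name, error) tuples."""
    # Error rows from the previous result, deduplicated by id (the fix):
    retry_ids = {id_ for (id_, _name, error) in previous_result if error}
    processed = {id_ for (id_, _name, _error) in previous_result}
    new_ids = set(source_ids) - processed
    # Retried ids are reprocessed, so drop their stale rows first.
    kept = [row for row in previous_result if row[0] not in retry_ids]
    for id_ in sorted(retry_ids | new_ids):
        kept += [(id_, "name1", "error"), (id_, "name2", "error")]
    return kept

result = run(range(1), [])
assert sorted(r[0] for r in result) == [0, 0]
result = run(range(2), result)
assert sorted(r[0] for r in result) == [0, 0, 1, 1]
result = run(range(3), result)
assert sorted(r[0] for r in result) == [0, 0, 1, 1, 2, 2]
```

Before the fix, the two error rows per id each triggered their own retry, so the result would grow by more than two rows per id on every run.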
```diff
@@ -2230,6 +2230,13 @@ def test_subtract(test_session):
     assert set(chain4.subtract(chain5, on="d", right_on="a").to_list()) == {(3, "z")}
+
+
+def test_subtract_duplicated_rows(test_session):
+    chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
+    chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
+    sub = chain1.subtract(chain2, on="id")
+    assert set(sub.to_list()) == {(1, "1"), (1, "1")}
```
|
Contributor (on lines +2233 to +2237): suggestion (testing): Consider adding assertions for edge cases with more complex duplicates. Tests with duplicate rows that differ in `name`, or with duplicates in the right chain, will better validate the subtract logic against varied duplication scenarios.
|
|
```diff
 
 def test_subtract_error(test_session):
     chain1 = dc.read_values(a=[1, 1, 2], b=["x", "y", "z"], session=test_session)
     chain2 = dc.read_values(a=[1, 2], b=["x", "y"], session=test_session)
```
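As a rough model of the `subtract` semantics `test_subtract_duplicated_rows` pins down (sketch only; the real implementation operates on chains, not Python lists): subtract behaves as an anti-join on the key columns, so duplicate left rows whose key is absent on the right all survive.

```python
def subtract(left, right, on):
    """Anti-join sketch: keep every left row (duplicates included)
    whose key tuple never appears in the right side."""
    right_keys = {tuple(row[c] for c in on) for row in right}
    return [row for row in left if tuple(row[c] for c in on) not in right_keys]
```

Under this model, subtracting `id=[2]` from two identical `id=1` rows returns both rows unchanged, mirroring the new test.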
Contributor: comment needs an update.