
Conversation

@ilongin ilongin commented Aug 27, 2025

Currently, if multiple error rows are generated with the same id (the column used for matching), then, for some unexplained reason, subtract filters out the "duplicates" and leaves only one error row in the result. I've also added a test for subtract that should have failed as well, but for some as-yet-unknown reason it passes, which means that subtract behaves differently in the test than in the retry delta code.
Instead of .subtract(), .diff() is now used, which works as expected. Note that .subtract() was working fine with the ClickHouse DB (also not explainable right now).
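
For reference, a sketch of the change in src/datachain/delta.py, reconstructed from the diff discussed in the review below (surrounding code abridged):

```python
# Before: on SQLite this dropped "duplicate" error rows sharing the same id
# return retry_chain.subtract(diff_chain, on=on) if retry_chain else None

# After: diff() configured to keep added and same rows, so repeated error
# rows with the same id survive into the retry chain
return (
    retry_chain.diff(
        diff_chain,
        on=on,
        added=True,
        same=True,
        modified=False,
        deleted=False,
    )
    if retry_chain
    else None
)
```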

This issue is related to https://github.com/iterative/studio/issues/12068

Summary by Sourcery

Fix delta retry logic to correctly include multiple error rows with the same id by replacing subtract with diff, and add tests to validate duplicate handling.

Bug Fixes:

  • Preserve repeated error rows in delta retry by using diff instead of subtract

Tests:

  • Add functional test for repeating errors in delta retry to verify multiple rows per id
  • Add unit test to ensure subtract retains duplicated rows by id


sourcery-ai bot commented Aug 27, 2025

Reviewer's Guide

This PR replaces the use of .subtract() with .diff() in the delta retry chain to correctly handle multiple error rows sharing the same id, and it adds both functional and unit tests to ensure duplicate rows are preserved.

Sequence diagram for delta retry chain with diff instead of subtract

```mermaid
sequenceDiagram
    participant RetryChain
    participant DiffChain
    participant Result
    RetryChain->>DiffChain: diff(on="id", added=true, same=true, modified=false, deleted=false)
    DiffChain-->>Result: Returns all error rows, including duplicates with same id
    Result-->>RetryChain: Retry chain proceeds with correct error rows
```

Class diagram for RetryChain and DiffChain interaction update

```mermaid
classDiagram
    class RetryChain {
        +diff(chain, on, added, same, modified, deleted)
    }
    class DiffChain
    RetryChain --> DiffChain: uses diff()
    %% Previously: RetryChain --> DiffChain: uses subtract()
```

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Switch retry logic from subtract to diff to preserve duplicate error rows | Replaced `retry_chain.subtract(...)` with `retry_chain.diff(...)`; configured diff to include added and same rows only (`added=True`, `same=True`) | `src/datachain/delta.py` |
| Add functional test covering repeating error rows in delta retry | Introduced `test_repeating_errors` with helper functions `create_input` and `run_delta`; asserted correct counts for duplicate ids across multiple runs | `tests/func/test_retry.py` |
| Add unit test for the subtract method with duplicated rows | Added `test_subtract_duplicated_rows` to validate that subtract preserves duplicates; asserted that the subtract output contains both identical rows | `tests/unit/lib/test_datachain.py` |



@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!


Comment on lines +2234 to +2238
```python
def test_subtract_duplicated_rows(test_session):
    chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
    chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
    sub = chain1.subtract(chain2, on="id")
    assert set(sub.to_list()) == {(1, "1"), (1, "1")}
```
Contributor

suggestion (testing): Consider adding assertions for edge cases with more complex duplicates.

Adding tests with duplicate rows that differ in 'name', or with duplicates in the right chain, will better validate the subtract logic against varied duplication scenarios.

Suggested change

```python
def test_subtract_duplicated_rows(test_session):
    # Left chain has duplicate rows with same id and name
    chain1 = dc.read_values(id=[1, 1], name=["1", "1"], session=test_session)
    chain2 = dc.read_values(id=[2], name=["2"], session=test_session)
    sub = chain1.subtract(chain2, on="id")
    assert set(sub.to_list()) == {(1, "1"), (1, "1")}


def test_subtract_left_duplicates_different_names(test_session):
    # Left chain has duplicate ids but different names
    chain1 = dc.read_values(id=[1, 1, 2], name=["a", "b", "c"], session=test_session)
    chain2 = dc.read_values(id=[2], name=["c"], session=test_session)
    sub = chain1.subtract(chain2, on="id")
    # Only rows with id=1 should remain, both "a" and "b"
    assert set(sub.to_list()) == {(1, "a"), (1, "b")}


def test_subtract_right_duplicates(test_session):
    # Right chain has duplicate rows
    chain1 = dc.read_values(id=[1, 2, 3], name=["x", "y", "z"], session=test_session)
    chain2 = dc.read_values(id=[2, 2], name=["y", "y"], session=test_session)
    sub = chain1.subtract(chain2, on="id")
    # Only rows with id=1 and id=3 should remain
    assert set(sub.to_list()) == {(1, "x"), (3, "z")}
```
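
Worth noting about both versions of the test: Python's set() collapses equal tuples, so {(1, "1"), (1, "1")} is the same set as {(1, "1")}, and the assertion passes whether subtract returns one row or two. A count-preserving variant, as a minimal sketch, would actually pin down the duplicate behavior:

```python
# Comparing sorted lists (not sets) verifies that both duplicate rows are
# really returned; sorted() guards against nondeterministic row order.
rows = sorted(chain1.subtract(chain2, on="id").to_list())
assert rows == [(1, "1"), (1, "1")]
```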

```python
        .gen(func, output={"id": int, "name": str, "error": str})
        .save("processed_data")
    )
    return dc.read_dataset("processed_data")
```
Contributor

suggestion (code-quality): Remove unreachable code (remove-unreachable-code)

Suggested change

```diff
-    return dc.read_dataset("processed_data")
```

```diff
@@ -124,7 +124,13 @@ def _get_retry_chain(
     # Subtract also diff chain since some items might be picked
```
Contributor

The comment needs an update.

```diff
     # result dataset atm)
-    return retry_chain.subtract(diff_chain, on=on) if retry_chain else None
+    return (
+        retry_chain.diff(
```
Contributor

Do we really want these semantics? Do we really want to retry multiple times per item? It seems wrong in most cases, tbh.

Contributor Author

This line just replaces .subtract() with .diff(), which should produce the same output (nothing should change). The reason I added this is that, for some still unexplained reason, in the CLI (SQLite) subtract works differently here than expected (i.e., wrong). Because of this strange behavior it was accidentally removing duplicates in that client's example, and I thought the Studio / CH code was the issue and CLI / SQLite was OK, but it was the other way around.

So, to break it down:

  1. The client issue is not actually an issue / bug. Duplicates are expected because of their code; please look at my explanation to them here.
  2. On the other hand, there is an issue in CLI / SQLite caused by this wrong subtract behavior, which I accidentally found when debugging the client "issue"; it is fixed by using .diff() instead, which works as expected, and now CH and SQLite behave the same / consistently.
  3. I couldn't reproduce this wrong .subtract() SQLite behavior in an isolated test (added in this PR), which means it's probably related to the specific context of the delta logic, but I don't think we should spend more time on it (I've already spent too much, IMO).

Contributor

> The client issue is not actually an issue / bug. Duplicates are expected because of their code; please look at my explanation to them here.

I think I understand the reason; my question is: do we really want to keep those semantics, or would it be better to deduplicate records based on `on`? So, even if the target table has multiple errors with the same measurement_ids, we take each only once. I think that is more expected.

This will solve the issue from the product perspective and remove this discrepancy anyway. (Though it is an interesting thing; it might hurt us somewhere else down the road.)
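
For reference, a minimal sketch of the deduplication alternative being discussed, assuming DataChain's distinct() accepts the match column name (the variable names are illustrative, not from this PR):

```python
# Hypothetical: keep a single error row per value of the matching column
# before building the retry chain, instead of retrying every duplicate.
deduped_errors = retry_chain.distinct("id")
```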

Contributor Author

Yea, it's an interesting question, but I think we should not do that, as we might disrupt specific client logic. Note that those error rows from the client issue are not completely the same (they differ in one column), and the first question is which one to discard and which one to keep. If they don't want this kind of behavior, they can easily fix it in their business logic.

Contributor

Tbh, looking more into this, I think retrying once is the only right option here. We are retrying input items, and there is a single one in this case, unless I'm missing something. We are not retrying outputs.

They could indeed change their logic, but it might be very inconvenient. It is fine to use gen and produce multiple items per single input row.

Contributor

It also doesn't make sense that the shape keeps changing on a retry run with a very regular gen.

Contributor Author

But we are not retrying input items; the whole implementation now gets error rows from the result, which makes sense, as that's where the errors will be. Output is not 1-1 with input, especially with gen, which can generate an arbitrary number of rows (including error rows), and input rows are not kept after it. In that case we cannot know which input produced a given error at the end.
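
To illustrate, a hypothetical gen UDF (a sketch, not code from this PR; it assumes gen binds its input column by the parameter name):

```python
from collections.abc import Iterator

import datachain as dc


def process(id: int) -> Iterator[tuple[int, str, str]]:
    # One input row fans out into several output rows, two of which are
    # error rows sharing the same id; the input rows are not kept, so an
    # error row cannot be traced back to a unique input afterwards.
    yield id, "chunk-a", "timeout"
    yield id, "chunk-b", "timeout"
    yield id, "chunk-c", ""


chain = dc.read_values(id=[1]).gen(
    process, output={"id": int, "name": str, "error": str}
)
```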

Contributor Author

If they didn't generate 2 error rows from one input row, there wouldn't be any issue.

Contributor Author

Aha, sorry, we do merge source records with error rows based on the `on` column, and we use the source schema...

Contributor

They might need multiple errors. Moreover, it can be a mix (errors and not). It would be wrong and complicated to change it to generate a single row in case of any errors.

```python
def test_repeating_errors(test_session):
    from collections.abc import Iterator

    def create_input(num_values):
```
Contributor

Can we reuse the existing helpers in this file:

_create_sample_data
_simple_process
_process_with_errors

etc.?

Contributor Author

Added _create_sample_data. The others are not useful for this test.

@shcheklein
Contributor

So, does subtract itself work fine or not? Is this expected behavior? I'm not sure I fully understand the description...


codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.74%. Comparing base (80b5787) to head (e0c0937).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

```text
@@           Coverage Diff           @@
##             main    #1310   +/-   ##
=======================================
  Coverage   88.74%   88.74%
=======================================
  Files         155      155
  Lines       14148    14149    +1
  Branches     1999     1999
=======================================
+ Hits        12556    12557    +1
  Misses       1125     1125
  Partials      467      467
```
| Flag | Coverage Δ |
| --- | --- |
| datachain | 88.68% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/datachain/delta.py | 92.59% <100.00%> (+0.09%) ⬆️ |


cloudflare-workers-and-pages bot commented Aug 28, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: e0c0937
Status: ✅  Deploy successful!
Preview URL: https://764521ce.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-12068-duplicated-row.datachain-documentation.pages.dev


@ilongin ilongin requested a review from shcheklein August 28, 2025 09:41
@ilongin ilongin merged commit 3607c0b into main Aug 29, 2025
37 of 38 checks passed
@ilongin ilongin deleted the ilongin/12068-duplicated-rows-delta-retry branch August 29, 2025 22:01
