fix delta expecting sys columns in apply steps by shcheklein · Pull Request #1412 · datachain-ai/datachain

shcheklein · 2025-10-19T18:20:19Z

One more sys_id fix.

We are applying steps in delta directly on top of a newly generated step. Generally speaking, this is not safe: we are not rebuilding the steps properly, we are applying cached versions. Now that union drops the sys columns, we reintroduce them here.

Ideally, we should generate steps that don't depend on sys__id in the first place.
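To make the failure mode concrete, here is a minimal toy model in plain Python. It is not datachain code: rows are dicts, and `union`, `regenerate_system_columns`, and `cached_step` are hypothetical stand-ins for the behavior described above (union drops `sys__id`, a cached step still expects it, regeneration puts it back).

```python
from itertools import count

SYS_ID = "sys__id"


def union(left, right):
    """Toy union: concatenate rows and drop the system column."""
    return [{k: v for k, v in row.items() if k != SYS_ID} for row in left + right]


def regenerate_system_columns(rows):
    """Toy stand-in for the warehouse call: add a fresh sys__id to each row."""
    ids = count(1)
    return [{SYS_ID: next(ids), **row} for row in rows]


def cached_step(rows):
    """A previously built step that assumes sys__id is present."""
    return sorted(rows, key=lambda row: row[SYS_ID])


old = [{SYS_ID: 1, "measurement_id": 1}]
new = [{SYS_ID: 1, "measurement_id": 2}]
merged = union(old, new)

# Replaying the cached step directly fails because sys__id is gone.
try:
    cached_step(merged)
    failed = False
except KeyError:
    failed = True

# Regenerating the system column first lets the cached step run.
replayed = cached_step(regenerate_system_columns(merged))
```

In this sketch the regeneration is a workaround in the same sense as in the PR: the cached step still depends on `sys__id`, it is just given one again.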

Summary by Sourcery

Regenerate system columns before applying delta steps to ensure cached steps have required sys columns

New Features:

Add _RegenerateSystemColumnsStep to re-add system columns via warehouse._regenerate_system_columns
Modify _append_steps to insert the regeneration step before merging delta steps

Bug Fixes:

Fix missing system columns in delta apply steps by regenerating them

Tests:

Add test verifying that delta replay regenerates system columns and preserves measurement_id

sourcery-ai · 2025-10-19T18:20:25Z

Reviewer's Guide

Introduces a regeneration step for system columns in delta application so that cached steps find the system columns they expect, and adds a functional test to verify that system columns are correctly regenerated when replaying a delta.

Sequence diagram for applying _RegenerateSystemColumnsStep in DataChain

sequenceDiagram
    participant DataChain
    participant "_RegenerateSystemColumnsStep"
    participant QueryGenerator
    participant Catalog
    participant Warehouse
    DataChain->>"_RegenerateSystemColumnsStep": apply(query_generator, temp_tables)
    "_RegenerateSystemColumnsStep"->>QueryGenerator: select()
    "_RegenerateSystemColumnsStep"->>Catalog: warehouse._regenerate_system_columns(selectable, keep_existing_columns=True, regenerate_columns=None)
    Catalog->>Warehouse: _regenerate_system_columns(...)
    "_RegenerateSystemColumnsStep"->>QueryGenerator: step_result(q, regenerated.selected_columns)
    "_RegenerateSystemColumnsStep"-->>DataChain: result

Class diagram for the new _RegenerateSystemColumnsStep class

classDiagram
    class Step {
    }
    class _RegenerateSystemColumnsStep {
        +Catalog catalog
        +hash_inputs() str
        +apply(query_generator: QueryGenerator, temp_tables: list[str])
    }
    _RegenerateSystemColumnsStep --|> Step
    _RegenerateSystemColumnsStep o-- "1" Catalog
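Read together, the two diagrams above suggest roughly the following shape. This is only a sketch under stated assumptions: `FakeWarehouse`, `FakeCatalog`, and `FakeQueryGenerator` are hypothetical stand-ins rather than datachain classes, a frozen dataclass stands in for `attrs.frozen`, the hashed constant is invented, and the return value is simplified to a list of column names instead of a `step_result(...)`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeWarehouse:
    """Hypothetical warehouse that records how it was called."""
    calls: list = field(default_factory=list)

    def _regenerate_system_columns(
        self, selectable, keep_existing_columns=True, regenerate_columns=None
    ):
        self.calls.append(("regenerate", keep_existing_columns, regenerate_columns))
        # Put sys__id first and keep every other column as it was.
        return ["sys__id", *[c for c in selectable if c != "sys__id"]]


@dataclass
class FakeCatalog:
    warehouse: FakeWarehouse


class FakeQueryGenerator:
    """Hypothetical query generator; select() returns column names."""

    def __init__(self, columns):
        self.columns = columns

    def select(self):
        return list(self.columns)


@dataclass(frozen=True)
class RegenerateSystemColumnsSketch:
    catalog: FakeCatalog

    def hash_inputs(self) -> str:
        # Assumption: a constant digest, matching the "static hash" the review mentions.
        return hashlib.sha256(b"regenerate_system_columns").hexdigest()

    def apply(self, query_generator, temp_tables):
        selectable = query_generator.select()
        return self.catalog.warehouse._regenerate_system_columns(
            selectable, keep_existing_columns=True, regenerate_columns=None
        )


warehouse = FakeWarehouse()
step = RegenerateSystemColumnsSketch(FakeCatalog(warehouse))
columns = step.apply(FakeQueryGenerator(["measurement_id", "num"]), temp_tables=[])
```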

File-Level Changes

Change	Details	Files
Regenerate system columns in delta apply steps	Imported hashlib, attrs.frozen, Catalog, QueryGenerator to support the new step; defined _RegenerateSystemColumnsStep with hash_inputs and apply methods invoking catalog.warehouse._regenerate_system_columns; injected the regeneration step in _append_steps before appending other steps	`src/datachain/delta.py`
Add functional test for system column regeneration on delta replay	Introduced test_delta_replay_regenerates_system_columns to test delta replay behavior; built chains with and without delta, replayed delta, and asserted measurement_id values	`tests/func/test_delta.py`

Possibly linked issues

remove docstring from DataModel.__pydantic__init_subclass__ #123: The PR adds a step to regenerate sys__id columns, resolving the 'no such column: sys__id' error in the issue.
#None: The PR fixes the KeyError: 'sys__id' by regenerating system columns during delta operations, directly addressing the issue's reported error.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Make the regenerate step’s hash_inputs incorporate pipeline‐specific details (e.g. input schema or column names) instead of a static hash to avoid cache collisions across different chains.
Rather than unconditionally appending the _RegenerateSystemColumnsStep in _append_steps, only inject it when the downstream chain actually needs system columns to prevent redundant regeneration overhead.
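One way to read the two suggestions above, as a sketch only. The function names and the hashed prefix are hypothetical, and whether column order should affect the hash is a design choice; here the names are sorted, so order is ignored.

```python
import hashlib

PREFIX = b"regenerate_system_columns"  # hypothetical constant


def static_hash() -> str:
    """Same digest for every chain, which is what the review warns about."""
    return hashlib.sha256(PREFIX).hexdigest()


def schema_aware_hash(column_names) -> str:
    """Mix the (sorted) column names into the digest so schemas differ."""
    digest = hashlib.sha256(PREFIX)
    for name in sorted(column_names):
        # A NUL separator keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(b"\x00" + name.encode("utf-8"))
    return digest.hexdigest()


def needs_system_columns(downstream_columns) -> bool:
    """Assumption: system columns are the ones prefixed with sys__."""
    return any(name.startswith("sys__") for name in downstream_columns)


def maybe_prepend_regeneration(steps, downstream_columns, regenerate_step):
    """Prepend the regeneration step only when something downstream needs it."""
    if needs_system_columns(downstream_columns):
        return [regenerate_step, *steps]
    return list(steps)
```

For example, `maybe_prepend_regeneration(["filter"], ["num"], "regen")` leaves the step list unchanged, while passing `["sys__id", "num"]` as the downstream columns puts `"regen"` first.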

## Individual Comments

### Comment 1
<location> `tests/func/test_delta.py:228-237` </location>
<code_context>
+def test_delta_replay_regenerates_system_columns(test_session):
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding assertions to verify the presence and correctness of regenerated system columns.

Add assertions to confirm that regenerated system columns, such as sys__id, exist and contain correct values after replay.
</issue_to_address>

### Comment 2
<location> `tests/func/test_delta.py:241-242` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
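The flagged lines (241-242) are not shown here, so the following is only a hypothetical illustration of the "move the complex logic into helpers" option, reusing the `measurement_id` and `err` column names from the test data. The helper name and the filtering rule are assumptions.

```python
def measurement_ids(rows):
    """Return sorted measurement_id values, skipping rows that carry an error."""
    return sorted(row["measurement_id"] for row in rows if not row["err"])


def test_replay_keeps_measurement_ids():
    rows = [
        {"measurement_id": 2, "err": ""},
        {"measurement_id": 1, "err": ""},
    ]
    # The test body is a single comparison with no branches or loops.
    assert measurement_ids(rows) == [1, 2]


test_replay_keeps_measurement_ids()
```

The branching still exists, but it lives in a small helper that can be checked on its own, and the test itself reads as one expected value.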

sourcery-ai · 2025-10-19T18:21:01Z

tests/func/test_delta.py

+def test_delta_replay_regenerates_system_columns(test_session):
+    source_name = f"regen_source_{uuid.uuid4().hex[:8]}"
+    result_name = f"regen_result_{uuid.uuid4().hex[:8]}"
+
+    dc.read_values(
+        measurement_id=[1, 2],
+        err=["", ""],
+        num=[1, 2],
+        session=test_session,
+    ).save(source_name)


suggestion (testing): Consider adding assertions to verify the presence and correctness of regenerated system columns.

Add assertions to confirm that regenerated system columns, such as sys__id, exist and contain correct values after replay.
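A sketch of what such assertions could look like, assuming the replayed rows are available as plain dicts (how they are read back from the chain is left out) and assuming "correct" means every row has an integer `sys__id` with no duplicates. The helper name is hypothetical.

```python
def assert_sys_ids_regenerated(rows):
    """Check that every row carries a unique integer sys__id."""
    ids = [row.get("sys__id") for row in rows]
    assert all(isinstance(value, int) for value in ids), "every row needs an integer sys__id"
    assert len(set(ids)) == len(ids), "sys__id values must be unique"


# Passes: both rows have distinct integer ids.
assert_sys_ids_regenerated(
    [
        {"sys__id": 1, "measurement_id": 1},
        {"sys__id": 2, "measurement_id": 2},
    ]
)
```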

codecov · 2025-10-19T18:27:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.73%. Comparing base (f5320b1) to head (94ed794).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1412   +/-   ##
=======================================
  Coverage   87.73%   87.73%           
=======================================
  Files         160      160           
  Lines       15126    15128    +2     
  Branches     2171     2172    +1     
=======================================
+ Hits        13271    13273    +2     
  Misses       1356     1356           
  Partials      499      499

Flag	Coverage Δ
datachain	`87.69% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
src/datachain/lib/dc/datasets.py	`95.23% <100.00%> (+0.11%)`	⬆️

cloudflare-workers-and-pages · 2025-10-19T18:39:05Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`d4af84a`
Status:	✅ Deploy successful!
Preview URL:	https://f4ad9117.datachain-documentation.pages.dev
Branch Preview URL:	https://fix-sys-id-delta.datachain-documentation.pages.dev

dmpetrov

LG - haven't looked deep

fix delta expecting sys columns in apply steps

d4af84a

sourcery-ai bot reviewed Oct 19, 2025

shcheklein force-pushed the fix-sys-id-delta branch from 94ed794 to d4af84a on October 19, 2025 18:52

shcheklein requested review from a team and 0x2b3bfa0 October 19, 2025 19:02

dmpetrov approved these changes Oct 19, 2025

shcheklein merged commit a25e715 into main Oct 19, 2025
62 of 68 checks passed

shcheklein deleted the fix-sys-id-delta branch October 19, 2025 19:20
