fix delta expecting sys columns in apply steps#1412
Conversation
Reviewer's GuideIntroduces a regeneration step for system columns in delta application to avoid using cached steps and adds a functional test to verify that system columns are correctly regenerated when replaying a delta. Sequence diagram for applying _RegenerateSystemColumnsStep in DataChainsequenceDiagram
participant DataChain
participant "_RegenerateSystemColumnsStep"
participant QueryGenerator
participant Catalog
participant Warehouse
DataChain->>"_RegenerateSystemColumnsStep": apply(query_generator, temp_tables)
"_RegenerateSystemColumnsStep"->>QueryGenerator: select()
"_RegenerateSystemColumnsStep"->>Catalog: warehouse._regenerate_system_columns(selectable, keep_existing_columns=True, regenerate_columns=None)
Catalog->>Warehouse: _regenerate_system_columns(...)
"_RegenerateSystemColumnsStep"->>QueryGenerator: step_result(q, regenerated.selected_columns)
"_RegenerateSystemColumnsStep"-->>DataChain: result
Class diagram for the new _RegenerateSystemColumnsStep classclassDiagram
class Step {
}
class _RegenerateSystemColumnsStep {
+Catalog catalog
+hash_inputs() str
+apply(query_generator: QueryGenerator, temp_tables: list[str])
}
_RegenerateSystemColumnsStep --|> Step
_RegenerateSystemColumnsStep o-- "1" Catalog
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- Make the regenerate step’s hash_inputs incorporate pipeline‐specific details (e.g. input schema or column names) instead of a static hash to avoid cache collisions across different chains.
- Rather than unconditionally appending the _RegenerateSystemColumnsStep in _append_steps, only inject it when the downstream chain actually needs system columns to prevent redundant regeneration overhead.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Make the regenerate step’s hash_inputs incorporate pipeline‐specific details (e.g. input schema or column names) instead of a static hash to avoid cache collisions across different chains.
- Rather than unconditionally appending the _RegenerateSystemColumnsStep in _append_steps, only inject it when the downstream chain actually needs system columns to prevent redundant regeneration overhead.
## Individual Comments
### Comment 1
<location> `tests/func/test_delta.py:228-237` </location>
<code_context>
+def test_delta_replay_regenerates_system_columns(test_session):
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding assertions to verify the presence and correctness of regenerated system columns.
Add assertions to confirm that regenerated system columns, such as sys__id, exist and contain correct values after replay.
</issue_to_address>
### Comment 2
<location> `tests/func/test_delta.py:241-242` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))
<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals
Some ways to fix this:
* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.
> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def test_delta_replay_regenerates_system_columns(test_session): | ||
| source_name = f"regen_source_{uuid.uuid4().hex[:8]}" | ||
| result_name = f"regen_result_{uuid.uuid4().hex[:8]}" | ||
|
|
||
| dc.read_values( | ||
| measurement_id=[1, 2], | ||
| err=["", ""], | ||
| num=[1, 2], | ||
| session=test_session, | ||
| ).save(source_name) |
There was a problem hiding this comment.
suggestion (testing): Consider adding assertions to verify the presence and correctness of regenerated system columns.
Add assertions to confirm that regenerated system columns, such as sys__id, exist and contain correct values after replay.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1412 +/- ##
=======================================
Coverage 87.73% 87.73%
=======================================
Files 160 160
Lines 15126 15128 +2
Branches 2171 2172 +1
=======================================
+ Hits 13271 13273 +2
Misses 1356 1356
Partials 499 499
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Deploying datachain-documentation with
|
| Latest commit: |
d4af84a
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://f4ad9117.datachain-documentation.pages.dev |
| Branch Preview URL: | https://fix-sys-id-delta.datachain-documentation.pages.dev |
94ed794 to
d4af84a
Compare
One more sys_id fix.
We are applying steps in delta directly on top of newly generated step. Generally speaking this is not safe (we are not rebuilding steps properly, we are applying cached versions). Now as union drops columns we are reintroducing them here.
Ideally, we should be generating steps in the first place where sys__ids don't exist.
Summary by Sourcery
Regenerate system columns before applying delta steps to ensure cached steps have required sys columns
New Features:
Bug Fixes:
Tests: