Conversation
Reviewer's GuideThis PR introduces a new Sequence diagram for method invocation with delta_unsafe overridesequenceDiagram
participant User
participant DataChain
participant DeltaGuard
User->>DataChain: Call merge() on delta DataChain
DataChain->>DeltaGuard: Check delta and delta_unsafe
alt delta_unsafe is False
DeltaGuard-->>User: Raise NotImplementedError
else delta_unsafe is True
DeltaGuard-->>DataChain: Allow merge()
end
Class diagram for DataChain with delta_unsafe supportclassDiagram
class DataChain {
- _delta: bool
- _delta_unsafe: bool
+ delta: bool
+ delta_unsafe: bool
+ _as_delta(..., delta_unsafe: bool)
}
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Deploying datachain-documentation with
|
| Latest commit: |
dedd402
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1ab6a9dc.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1276-unsafe-delta.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tests/func/test_delta.py:179` </location>
<code_context>
+ ])
+ """
+
+ assert (dc.read_dataset(ds_name, version="1.0.0").order_by("file.path")).to_list(
+ "file.path", "value"
+ ) == [
+ ("img1.jpg", 1),
+ ("img2.jpg", 2),
</code_context>
<issue_to_address>
Consider adding a test for error conditions when delta_unsafe is not used.
Please add a test to confirm that calling `merge` in delta mode without `delta_unsafe()` raises the expected exception, ensuring the safety mechanism works as intended.
</issue_to_address>
### Comment 2
<location> `tests/func/test_delta.py:170` </location>
<code_context>
+ # second version of delta dataset
+ create_delta_dataset(ds_name)
+
+ """
+ assert _get_dependencies(catalog, ds_name, "1.0.1") == [
+ (dependency_ds_name, "1.0.1")
</code_context>
<issue_to_address>
Remove or clarify commented-out assertions.
If the commented-out assertions are obsolete, please remove them. If they're still needed, clarify their intent or mark them with a TODO.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
tests/func/test_delta.py
Outdated
| # second version of delta dataset | ||
| create_delta_dataset(ds_name) | ||
|
|
||
| """ |
There was a problem hiding this comment.
nitpick: Remove or clarify commented-out assertions.
If the commented-out assertions are obsolete, please remove them. If they're still needed, clarify their intent or mark them with a TODO.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1279 +/- ##
==========================================
- Coverage 88.75% 88.74% -0.01%
==========================================
Files 155 155
Lines 14143 14146 +3
Branches 1999 1999
==========================================
+ Hits 12552 12554 +2
- Misses 1124 1125 +1
Partials 467 467
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
@ilongin I don't think it is a priority atm. |
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- Consider adding the delta_unsafe flag to the chain's _setup dictionary so it's persisted when serializing the chain.
- The docstrings for delta_unsafe list some allowed methods but omit 'merge'; update the documentation to include all permitted operations.
- The test_delta_update_unsafe function is lengthy and repeats setup code—refactor it using fixtures or helper functions to improve clarity and reduce duplication.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding the delta_unsafe flag to the chain's _setup dictionary so it's persisted when serializing the chain.
- The docstrings for delta_unsafe list some allowed methods but omit 'merge'; update the documentation to include all permitted operations.
- The test_delta_update_unsafe function is lengthy and repeats setup code—refactor it using fixtures or helper functions to improve clarity and reduce duplication.
## Individual Comments
### Comment 1
<location> `tests/func/test_delta.py:96` </location>
<code_context>
create_delta_dataset(ds_name)
</code_context>
<issue_to_address>
Suggestion to test other unsafe operations beyond merge.
Please add tests for agg, union, group_by, and distinct to verify they work as expected with delta_unsafe enabled.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
create_delta_dataset(ds_name)
=======
create_delta_dataset(ds_name)
def test_delta_agg_unsafe(test_session, tmp_dir, tmp_path):
catalog = test_session.catalog
ds_name = "agg_ds"
create_delta_dataset(ds_name)
ds = catalog.get_dataset(ds_name)
result = ds.agg({"value": "sum"}, delta_unsafe=True)
assert result is not None
def test_delta_union_unsafe(test_session, tmp_dir, tmp_path):
catalog = test_session.catalog
ds_name1 = "union_ds1"
ds_name2 = "union_ds2"
create_delta_dataset(ds_name1)
create_delta_dataset(ds_name2)
ds1 = catalog.get_dataset(ds_name1)
ds2 = catalog.get_dataset(ds_name2)
result = ds1.union(ds2, delta_unsafe=True)
assert result is not None
def test_delta_group_by_unsafe(test_session, tmp_dir, tmp_path):
catalog = test_session.catalog
ds_name = "groupby_ds"
create_delta_dataset(ds_name)
ds = catalog.get_dataset(ds_name)
result = ds.group_by("category").agg({"value": "mean"}, delta_unsafe=True)
assert result is not None
def test_delta_distinct_unsafe(test_session, tmp_dir, tmp_path):
catalog = test_session.catalog
ds_name = "distinct_ds"
create_delta_dataset(ds_name)
ds = catalog.get_dataset(ds_name)
result = ds.distinct(delta_unsafe=True)
assert result is not None
>>>>>>> REPLACE
</suggested_fix>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| @@ -96,6 +96,101 @@ def create_delta_dataset(ds_name): | |||
| create_delta_dataset(ds_name) | |||
There was a problem hiding this comment.
suggestion (testing): Suggestion to test other unsafe operations beyond merge.
Please add tests for agg, union, group_by, and distinct to verify they work as expected with delta_unsafe enabled.
| create_delta_dataset(ds_name) | |
| create_delta_dataset(ds_name) | |
| def test_delta_agg_unsafe(test_session, tmp_dir, tmp_path): | |
| catalog = test_session.catalog | |
| ds_name = "agg_ds" | |
| create_delta_dataset(ds_name) | |
| ds = catalog.get_dataset(ds_name) | |
| result = ds.agg({"value": "sum"}, delta_unsafe=True) | |
| assert result is not None | |
| def test_delta_union_unsafe(test_session, tmp_dir, tmp_path): | |
| catalog = test_session.catalog | |
| ds_name1 = "union_ds1" | |
| ds_name2 = "union_ds2" | |
| create_delta_dataset(ds_name1) | |
| create_delta_dataset(ds_name2) | |
| ds1 = catalog.get_dataset(ds_name1) | |
| ds2 = catalog.get_dataset(ds_name2) | |
| result = ds1.union(ds2, delta_unsafe=True) | |
| assert result is not None | |
| def test_delta_group_by_unsafe(test_session, tmp_dir, tmp_path): | |
| catalog = test_session.catalog | |
| ds_name = "groupby_ds" | |
| create_delta_dataset(ds_name) | |
| ds = catalog.get_dataset(ds_name) | |
| result = ds.group_by("category").agg({"value": "mean"}, delta_unsafe=True) | |
| assert result is not None | |
| def test_delta_distinct_unsafe(test_session, tmp_dir, tmp_path): | |
| catalog = test_session.catalog | |
| ds_name = "distinct_ds" | |
| create_delta_dataset(ds_name) | |
| ds = catalog.get_dataset(ds_name) | |
| result = ds.distinct(delta_unsafe=True) | |
| assert result is not None |
There was a problem hiding this comment.
If you've added tests for the other unsafe operations (agg, union, group_by, distinct) with delta_unsafe enabled and verified they pass, then yes, you can resolve this suggestion. If not, it's best to add those tests to ensure full coverage of the new functionality before resolving.
Let me know if you need help with the test implementations or want to discuss specific cases!
There was a problem hiding this comment.
Great change. I know it's not fully done but provided some comments.
Main point: unit tests are missing. We should focus on unit tests over functional tests here. The current tests feel heavier than needed - they check merge dependencies and merge value correctness, while what we really need is to test if unsafe blocks the exception.
It might be a great idea to check the dependencies and merge - please create separate tests for this. It's not related to unsafe logic. Unit tests are strongly preffered.
src/datachain/lib/dc/datasets.py
Outdated
| will only fetch the dataset from Studio if it is not found locally. | ||
| delta_unsafe: If True, allows the use of methods that are normally unsafe | ||
| and forbidden during a delta update. The following methods will be | ||
| permitted: `merge`, `agg`, `union`, `group_by`, and `distinct`. |
There was a problem hiding this comment.
It needs to be shorten - too many words. Or We can put more light on the reasons. Like:
Allow restricted ops in delta: `merge`, `agg`, `union`, `group_by`, `distinct`.
Caller must ensure datasets are consistent and not partially updated.
There was a problem hiding this comment.
We have also a guide for delta updates - I think we can put a separate paragpraph there ## Using delta with merge, agg, etc ... Where we can explain some reasoning behind it
There was a problem hiding this comment.
Yea, that's good idea. I've added new section in docs and simplified this function docs at the same time.
tests/func/test_delta.py
Outdated
| def test_delta_update_unsafe(test_session, tmp_dir, tmp_path): | ||
| catalog = test_session.catalog | ||
| default_namespace_name = catalog.metastore.default_namespace_name | ||
| default_project_name = catalog.metastore.default_project_name |
There was a problem hiding this comment.
Do we need to deal with the namespaces here? Can we just run it in the default namespace?
There was a problem hiding this comment.
I need to check full dataset name and since this tests must run in Studio as well, I cannot just hardcode local as default namespace is different there.
tests/func/test_delta.py
Outdated
| # first version of delta dataset | ||
| create_delta_dataset(ds_name) | ||
| assert sorted(_get_dependencies(catalog, ds_name, "1.0.0")) == sorted( | ||
| [(dependency_ds_name, "1.0.0"), (dependency_ds_merge_name, "1.0.0")] |
There was a problem hiding this comment.
How dependency check is relevant to delta unsafe? can we avoid testing it here?
There was a problem hiding this comment.
Looked deeper - yes, the dependency deserved it's own test. But it should not be combined with the unsafe test. PLease extract or remove (if it's exist already)
There was a problem hiding this comment.
I would keep dependency check in this test since this is func test. I can create unit tests to test more granular details. This needs to be checked with unsafe as we are using things like merge that affects dependencies
…chain into ilongin/1276-unsafe-delta
src/datachain/lib/dc/datasets.py
Outdated
| some version of the dataset exists locally already. If False (default), it | ||
| will only fetch the dataset from Studio if it is not found locally. | ||
| delta_unsafe: Allow restricted ops in delta: merge, agg, union, group_by, | ||
| distinct. Caller must ensure datasets are consistent and not partially |
There was a problem hiding this comment.
can we clarify that we mean results datasets?
There was a problem hiding this comment.
I'm not sure I'm following what you meant by results dataset. Actually, when I read this sentence I'm not sure I completely understand it in general. @dmpetrov can you clarify what did you mean by it?
These operations are not safe from different reasons, e.g
- union / merge -> producing duplicates
- distinct -> must be called on whole dataset which conflicts with delta logic which runs delta part on new + updated rows and then merges results
There was a problem hiding this comment.
When you mention datasets in the message, what datasets are we talking about? It should be clear to end users
There was a problem hiding this comment.
Removed that sentence all together (see #1279 (comment))
| delta_unsafe=True, | ||
| ).merge(merge_ds, on="id", inner=True).save(ds_name) | ||
|
|
||
| assert set(_get_dependencies(catalog, ds_name, "1.0.1")) == { |
There was a problem hiding this comment.
probably we need a proper method for gependencies? or even chain.dataset should be returning a record that can be expanded to dependencies ...
There was a problem hiding this comment.
We already have method to return dependencies in Catalog but this one is just helper for tests to extract dataset name + version out of it as that ones are being asserted.
There was a problem hiding this comment.
Don’t u think users might need the same? Not sure why this can’t be accessed via chain itslef
There was a problem hiding this comment.
We didn't have this kind of request yet. If needed, we can add it to chain (public API) but I would do it in separate issue / PR as it's really not related to this one.
tests/func/test_delta.py
Outdated
| dependency_ds_name = ( | ||
| f"{default_namespace_name}.{default_project_name}.{starting_ds_name}" | ||
| ) | ||
| dependency_ds_merge_name = ( |
There was a problem hiding this comment.
is it only to use (_get_dependencies(catalog, ds_name, "1.0.1") ? (quite unfortunate tbh)
There was a problem hiding this comment.
It's to check if dependencies are ok but still you are right it's not needed. I don't think test of correct namespace / project is in scope of this tests so I will just return pure dataset name from that helper method and assert with that one in tests.
shcheklein
left a comment
There was a problem hiding this comment.
some minor comments, otherwise good to go I think
dmpetrov
left a comment
There was a problem hiding this comment.
Approving to move faster. Nut unit-test issue is still here - too much in functional test without a need and no unit tests.
|
Just to I understand the terminology and be on the same page in the terminology we use: https://github.com/iterative/datachain/pull/1279/files#diff-2a81289be4719523a37fa0369702d59af21a78b3522f155ebbbfb48c4e4ed2e8R382 - is this one unit or not, for example? (not matter what directory it is in). |
yes, these are unit and should be in the unit test dir. the previous one is not. |
yep, I don't call them unit (I call unit tests those that usually test only a very particular function, module in isolation ... usually with mocks, etc, etc), but that's fine ... I'm totally fine to call them whatever and place them in unit or func.
you mean test_delta_update_unsafe or something else? |
yep |
okay, I see my 2cs on this - that one looks heavy indeed (some cleanup can be done?) - but it is a single test for the whole unsafe stuff and it is also using in-memory db AFAIU, no file systems, etc, etc. Some complexity comes from the delta itself - requires multiple versions of source / destinations, etc, etc. Not sure what would be a better alternative here tbh if we want to run delta + merge. |
|
|
||
| dep = dependencies[0] | ||
| if not dep: | ||
| source_ds_dep = next((d for d in dependencies if d.name == source_ds.name), None) |
There was a problem hiding this comment.
was it catched by the test?
There was a problem hiding this comment.
yes, test was inconsistent and there was a bug when things like merge used - we didn't expect more than 1 dependencies before but I guess now it can happen
Added ability to use methods like
merge,unionetc. with delta updates, which are normally disabled.Summary by Sourcery
Enable unsafe datachain operations during delta updates by introducing a delta_unsafe flag to override method restrictions and ensure correct handling in chain evolution and dataset reads.
New Features:
Tests: