Incremental (delta) update #928

Merged
ilongin merged 75 commits into main from ilongin/798-incremental-update on May 15, 2025
Conversation

@ilongin (Contributor) commented Feb 19, 2025

This PR adds the ability to do incremental (delta) updates of a dataset. To run a delta update, the user simply re-runs the whole script that creates the dataset, with one small modification: adding delta=True to the DataChain.save() method.

The general idea behind a delta update is to avoid re-building the whole dataset from its source (or sources) once the source has changed (new or modified files count as changes). Instead, we run the whole chain in "diff" mode and union the result with the latest version of the resulting dataset. Running the chain in "diff" mode means we run the chain as is, but every starting step (.from_storage() or .from_dataset()) returns only the diff between the latest version of that source and the version used the last time the chain was run and the dataset was created.
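
For intuition, here is a minimal conceptual sketch (plain DataChain calls, not the library internals) of what a delta update does for a chain with a single starting dataset; `apply_chain_steps` is a hypothetical placeholder for the user's own map/gen/filter steps:

```python
from datachain import DataChain

def delta_update(source_name, prev_version, result_name, apply_chain_steps):
    # Diff between the current source version and the version used in the last run
    source_now = DataChain.from_dataset(source_name)
    source_prev = DataChain.from_dataset(source_name, version=prev_version)
    changed = source_now.diff(source_prev)  # added + modified rows by default

    # Run the same chain steps, but only on the changed rows
    delta_part = apply_chain_steps(changed)

    # Union the delta with the latest version of the resulting dataset
    latest_result = DataChain.from_dataset(result_name)
    return latest_result.union(delta_part).save(result_name)
```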

Facts:

  1. The starting points mentioned above become direct dataset dependencies.
  2. A dataset can have multiple direct dependencies only if .union() or .merge() was used in the chain.
  3. Indirect dependencies are the ones 2 or more steps away from our dataset (see picture below). They can happen if one of the direct dependencies is a regular dataset (not a bucket-listing dataset), since that dataset can have its own direct (and indirect) dependencies, and so on.

!! NOTE that this PR implements only a subset of the above: the situation where there is a single direct dataset dependency. This means .union() won't work as expected (it will create duplicates), and .merge() might work as expected only if certain conditions are met. Follow-up PRs will cover the rest.

Example chain:

# query.py
(
    DataChain.from_storage( "gs://datachain-demo/50k-laion-files/")
    .save("laion", delta=True)
)

(
    DataChain.from_storage( "s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .save("dogs-cats", delta=True)
)

(
    DataChain.from_dataset("dogs-cats")
    .union(DataChain.from_dataset("laion"))
    .save("dogs-cats-laion", delta=True)
)

In the above example there are 2 "starting points" when creating the dogs-cats-laion dataset:

  • DataChain.from_dataset("dogs-cats")
  • DataChain.from_dataset("laion")

Suppose that on the first run the chain was created with the dogs-cats dataset at version v3, and on the next run dogs-cats is at version v5. Then DataChain.from_dataset("dogs-cats") will return diff(dogs-cats@v5, dogs-cats@v3). The same goes for all other sources / dependencies, which can be datasets or direct listings (also datasets under the hood).
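
In code, that starting step effectively behaves like the following sketch (using the versions from this example):

```python
from datachain import DataChain

prev = DataChain.from_dataset("dogs-cats", version=3)  # version used in the previous run
curr = DataChain.from_dataset("dogs-cats", version=5)  # latest version
dogs_cats_delta = curr.diff(prev)  # only new / modified rows flow down the chain
```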

Note that from the standpoint of a delta update of a chain, only direct dataset dependencies are visible. Indirect dependencies are not taken into consideration, as they are created by another, unrelated chain (possibly in another query file).
For example, this is the dependency graph of the above example:

[Image: dependency graph of the example chain (Selection_011)]

In the graph above, the final dogs-cats-laion dataset has 2 direct dataset dependencies. Those in turn have listing datasets as their own direct dependencies, but from the point of view of this chain that is irrelevant.
If the user wants to update dogs-cats-laion taking into account changes in the s3 and gcs buckets, which are not its direct dependencies, then the chains for the dogs-cats and laion datasets must be run first (those chains can use delta updates as well). The easiest way is to put them in the same query file and re-run the whole thing, as in the example above.

Comparison between the current approach and the final / ideal one:

| Comparison | 1. Current approach | 3. Apply diff on every starting step |
| --- | --- | --- |
| Description | From the dataset dependencies of the resulting dataset we get the starting dataset name and version (a listing or any other "normal" dataset) and calculate the diff between that version and the latest version of the starting dataset. We then apply the chain functions to that diff and merge the result with the current version of the resulting dataset to create a new version of it. | Similar to the current approach, but the diff is done for every starting point of the chain / direct dependency. The idea is to run the whole chain in a special "delta" mode which ensures that each starting step (reading from a listing or a normal dataset) doesn't read the whole data but calculates the diff and runs on that. We would then union that "delta" chain with the current latest version of the dataset. The problem is that this is not currently possible to implement because of the structure of our codebase: diff is implemented at the DataChain ("upper") level, but it would need to be used at the DatasetQuery ("lower") level ... we need to refactor / remove DatasetQuery first. |
| Someone removes the starting dataset and all of its versions (e.g. a listing dataset) | We wouldn't have the exact listing version for the delta -> we would need to re-create the whole dataset as if it was run for the first time (without the delta performance gains). | We wouldn't have the exact listing version for the delta -> we would need to re-create the whole dataset as if it was run for the first time (without the delta performance gains). |
| A File object is removed in the resulting dataset (e.g. using one of the chain functions) | Delta update doesn't work. | Delta update doesn't work. |
| How it will look in the UI | The same as locally (just adding the delta=True flag in .save()). | The same as locally (just adding the delta=True flag in .save()). |
| Not supported DataChain methods | union() | / |
| agg(...) | Works fine when partition values cannot be found in both the delta and the old dataset (each partition's values are confined to one of the two sets). This is important because of how the delta update works -> we run the chain for the new + changed rows in the source and then union with the current dataset version to create a new version. An example that works fine is examples/computer_vision/openimage-detect.py (see the toy illustration below the table). | Same as the first column. |
| group_by(...) | Similar to .agg(): works only if identical groups are not found in both sets (the delta part and the old dataset). | Same as the first column. |
| distinct(...) | Similar to .agg(): works only if distinct values are not found in both sets (the delta part and the old dataset). | Same as the first column. |
| union() | This will always produce duplicates, as we are always doing a union between the diff and another full, non-diff dataset. | Works as expected, since all parts of the union are just diffs. |
| merge() | Works only if inner=True is used and, similar to agg(), only if the rows to be merged are not found in both sets (the delta / diff and the old dataset) but are isolated in one of them. | Same as the first column, except it should also work for inner=False. |
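
To make the agg() / group_by() caveat concrete, here is a toy illustration in plain Python (not actual DataChain code) of how a group that appears in both the delta part and the old dataset ends up duplicated after the union step:

```python
# Per-group counts produced by the first run (the "old" version of the dataset)
old_result = [("a", 2), ("b", 3)]

# On the second run, the new/changed source rows fall into groups "a" and "c",
# so aggregating only the delta produces its own row for group "a"
delta_result = [("a", 1), ("c", 1)]

# The union step of the delta update then duplicates group "a"
new_version = old_result + delta_result
print(new_version)  # [('a', 2), ('b', 3), ('a', 1), ('c', 1)]
```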

Q&A

Q: In the second approach the idea was to use delta on the source side (per each source). This would allow having a single-file source, for example. Both approaches don't allow that at the moment, AFAIU. Are there cases like this?
A: In both approaches, when we speak about a "source" we actually mean the source or starting dataset from which the resulting dataset is created. It can be a listing dataset if .from_storage() is used, or a regular dataset if .from_dataset() is used. Note that there can be multiple starting datasets in a chain, as in the example above with .union(). If I understood the question correctly, the suggestion is to get distinct values from the source column, compute a diff for each of them, apply the chain to each diff, and then union with the current dataset. This approach is similar to the ideal solution described in the description and the table, which should be done in the future. We need to look at the starting datasets, i.e. the direct dataset dependencies, rather than distinct source values, because some of those sources could have come from an indirect dependency (another dataset further down the tree that is created by another, unrelated chain).

Regarding the single file: yes, that one is tricky. With the current implementation it doesn't break, but the dataset is recalculated from scratch every time (delta doesn't work because the created dataset has no dependency: no listing is created since we just extract data from the file and create dataset rows). Example:

from pathlib import Path
import pandas as pd
from datachain import DataChain

DF_DATA = {
    "first_name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "age": [25, 30, 35, 40, 45],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
}
# Single local file: no listing dataset is created, so delta has nothing to diff against
path = Path.cwd() / "test.parquet"
pd.DataFrame(DF_DATA).to_parquet(path)
DataChain.from_storage(path.as_uri()).parse_tabular().save("tabular", delta=True)

@ilongin ilongin marked this pull request as draft February 19, 2025 15:13
@ilongin ilongin linked an issue Feb 19, 2025 that may be closed by this pull request

@dmpetrov (Contributor) commented:

@ilongin it would be great to extract all logic outside of the fat file dc.py to increment.py or dc_incremental.py

Also, should we call it incremental or delta? :) Delta seems better but I don't like it due to a conflict with Delta Lake. Any ideas? :)

@ilongin (Contributor Author) commented Feb 21, 2025

> @ilongin it would be great to extract all logic outside of the fat file dc.py to increment.py or dc_incremental.py

> Also, should we call it incremental or delta? :) Delta seems better but I don't like it due to a conflict with Delta Lake. Any ideas? :)

@dmpetrov one question just to be 100% sure. How do we deal with the different statuses: added, modified, removed, same?

My assumption is:

  1. Added records are appended to the previous dataset (its current latest version)
  2. Modified records replace the matched records from the previous dataset in the new dataset
  3. Deleted records -> do nothing about them, but maybe we should remove them from the new dataset??
  4. Same -> nothing to do here

Currently DataChain.diff() returns only added and changed records by default; for the other statuses explicit flags must be set.

Regarding the name: delta makes more sense if we are not just appending new records, otherwise it's more like incremental, but I don't have a strong opinion here... both sound reasonable to me.
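
For reference, a minimal sketch of the behaviour described in the comment above; the boolean flag names (added / deleted / modified / same) are an assumption based on this discussion, not a spec:

```python
from datachain import DataChain

old = DataChain.from_dataset("laion", version=1)
new = DataChain.from_dataset("laion", version=2)

# Default: only added and modified records come back
delta_default = new.diff(old)

# Other statuses have to be requested with explicit flags (names assumed)
delta_with_deletes = new.diff(old, deleted=True)
```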

@dmpetrov (Contributor) commented:

> Currently DataChain.diff() returns only added and changed records by default...

Let's use the same default for the incremental update.

> delta makes more sense

Then let's use Delta 🙂

codecov bot commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 94.82759% with 6 lines in your changes missing coverage. Please review.

Project coverage is 87.97%. Comparing base (41af6ff) to head (ee27458).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/delta.py | 95.55% | 1 Missing and 1 partial ⚠️ |
| src/datachain/diff/__init__.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| src/datachain/lib/dc/datachain.py | 93.93% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   87.93%   87.97%   +0.04%     
==========================================
  Files         147      148       +1     
  Lines       12655    12747      +92     
  Branches     1772     1783      +11     
==========================================
+ Hits        11128    11214      +86     
- Misses       1091     1094       +3     
- Partials      436      439       +3     
| Flag | Coverage Δ |
| --- | --- |
| datachain | 87.90% <94.82%> (+0.04%) ⬆️ |


@ilongin ilongin marked this pull request as ready for review February 25, 2025 15:34
@ilongin ilongin requested a review from shcheklein April 30, 2025 13:22
self.print_schema(file=file)
return file.getvalue()

def _as_delta(self, delta: bool = False) -> "Self":
Contributor

can we make it a regular constructor parameter (like we do with all other attrs) ... why do we need a special way of setting this up?

Contributor Author

The user should never set this parameter, so I wanted to "hide" it a little by making it a private attribute with a special private method to set it to True.

Contributor

I think users never actually use the DataChain constructor anyway - it is kind of private already (and it probably already has some attributes that are technical, not user-facing).

return self

@property
def delta(self) -> bool:
Contributor

is it public? do we want it to be public?

Contributor Author

I don't see a reason why it shouldn't be public. A user could check whether some chain is in "delta" mode or not. It is also used in some other internal methods, for which it doesn't need to be public.
I can make it private as well; I don't have a strong opinion.

Contributor

no, that's fine .. but if we keep it public we need proper docs for it then ... and an example if you have something in mind

Contributor

one minor thing that is left here @ilongin ... let's please take care of it

@shcheklein (Contributor) left a comment

Can we test it on the cattle care scenario? Video files, we extract and save frames for them / or extract and save subvideos ... (it is similar to one of the tutorials). Should it be working for such a scenario? (let me know if you need access to the code)

@ilongin (Contributor Author) commented Apr 30, 2025

> Can we test it on the cattle care scenario? Video files, we extract and save frames for them / or extract and save subvideos ... (it is similar to one of the tutorials). Should it be working for such a scenario? (let me know if you need access to the code)

Yes, please give me access to the code as I don't think I have it

@shcheklein (Contributor) commented:

> Yes, please give me access to the code as I don't think I have it

DMed you the link!
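
For context, a rough sketch of what the frame-extraction scenario could look like with a delta update enabled; the bucket path and the extract_frames UDF below are hypothetical, not the actual cattle-care code:

```python
from collections.abc import Iterator

from datachain import DataChain, File

def extract_frames(file: File) -> Iterator[str]:
    # Hypothetical stand-in for real frame extraction (e.g. via ffmpeg / OpenCV)
    yield f"{file.path}#frame-0"

(
    DataChain.from_storage("s3://videos-bucket/recordings/")  # hypothetical bucket
    .gen(frame=extract_frames)  # one output row per extracted frame
    .save("video-frames", delta=True)  # only new / changed videos get re-processed
)
```

Note that frames produced from a modified video would rely on the delta matching fields discussed further down this thread to replace, rather than duplicate, the old rows.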

If two rows have the same values, they are considered the same (e.g., they
could be different versions of the same row in a versioned source).
This is used in the delta update to calculate the diff.
delta_right_on: A list of fields in the final dataset that correspond to the
Contributor

what is the final dataset? is it the result itself of the whole chain?

should we then specify this in save? 🤔

Contributor Author

Yes, that's it. I don't think we should specify this in .save() as these fields can be specific to the source. For example, once we are able to handle multiple sources in a delta update, each source could have its own match fields that correspond to different fields in the final dataset. I was thinking of renaming this field but haven't found a better name...

Contributor

yep, right_on (why right, why not left?) is not clear at all

> once we are able to handle multiple sources in a delta update, each source could have its own match fields that correspond to different fields in the final dataset

are we going to do subtract per source? is it pretty much about changed / removed objects only?

Contributor Author

> yep, right_on (why right, why not left?) is not clear at all

I've updated it to delta_result_on and added a better explanation in the docs. Let me know if this is better.

> are we going to do subtract per source? is it pretty much about changed / removed objects only?

Currently delta works only for one (the first) "starting point", e.g. from_storage(...) or .from_dataset(...), but in the future it will work for any starting point in the chain. E.g. here we have 2 "starting points":

(
    dc.from_storage("s3://first-bucket", delta=True, delta_on=["id_first"], delta_result_on=["id"])
    .map(...)
    .union(dc.from_storage("s3://second-bucket", delta=True, delta_on=["id_second"], delta_result_on=["id"]))
    .mutate(id=ifelse(isnone(dc.C("id_first")), dc.C("id_second"), dc.C("id_first")))
    .select_except("id_first", "id_second")
)

In this example, id_first and id_second are both normalized to just id in the final dataset, and we then need to use that to match the "diff" datasets against the final dataset. Matching the "diff" datasets with the final dataset is important in order to keep only the latest modified files, i.e. to remove the old ones from the final dataset.

This is used in the delta update to calculate the diff.
delta_result_on: A list of fields in the resulting dataset that correspond
to the `delta_on` fields from the source.
This is needed to identify rows that have changed in the source but are
Contributor

do we / can we detect deletions btw?

Contributor Author

At the beginning we decided to ignore deletions for now. I do think we should reconsider this.

@shcheklein (Contributor) left a comment

If we tested at least on some real scenarios, let's merge it.

@ilongin (Contributor Author) commented May 13, 2025

> If we tested at least on some real scenarios, let's merge it.

Testing today on cattle-care with some scripts... found an issue with dataset dependencies and pushed a fix. I want to test a little bit more this evening and then merge.

@ilongin ilongin merged commit b57275a into main May 15, 2025
35 checks passed
@ilongin ilongin deleted the ilongin/798-incremental-update branch May 15, 2025 12:51