Conversation
**Deploying datachain-documentation with Cloudflare Pages**

| Latest commit: | ee27458 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://549e375b.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
@ilongin it would be great to extract all logic outside of the fat file. Also, should we call it incremental or delta? :) Delta seems better, but I don't like it due to a conflict with Delta Lake. Any ideas? :)
**Deploying datachain-documentation with Cloudflare Pages**

| Latest commit: | 8fa1534 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://c897b9bc.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
@dmpetrov one question just to be 100% sure. How do we deal with the different statuses: added, modified, removed, same? My assumption is to: ...

Regarding the name, delta makes more sense if we are not just appending new ones; otherwise it's more like incremental. But I don't have a strong opinion here... both sound reasonable to me.
**Deploying datachain-documentation with Cloudflare Pages**

| Latest commit: | 67824e6 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://b84a6f31.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
Let's use the same default for the incremental update.

Then let's use Delta 🙂
**Codecov Report**

Attention: Patch coverage is ...

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   87.93%   87.97%   +0.04%
==========================================
  Files         147      148       +1
  Lines       12655    12747      +92
  Branches     1772     1783      +11
==========================================
+ Hits        11128    11214      +86
- Misses       1091     1094       +3
- Partials      436      439       +3
```
src/datachain/lib/dc/datachain.py (Outdated)

```python
        self.print_schema(file=file)
        return file.getvalue()

    def _as_delta(self, delta: bool = False) -> "Self":
```
can we make it a regular ctr parameter (like we do with all other attrs) ... why do we need a special way of setting this up?
Users should never set this parameter, so I wanted to "hide" it a little bit by making it a private attribute with a special private method to set it to True.
I think users never actually use the DataChain ctr anyway - it is kinda private already (and it probably has some attributes already that are technical, not user facing).
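For reference, a minimal sketch of the two options under discussion (simplified; the `_delta` attribute name is an assumption, not the actual implementation):

```python
class DataChain:
    # Option A: a regular constructor parameter, like other attrs
    def __init__(self, delta: bool = False):
        self._delta = delta  # attribute name assumed

    # Option B (what the PR does): a private setter, so the knob
    # stays out of the user-facing constructor signature
    def _as_delta(self, delta: bool = False) -> "DataChain":
        self._delta = delta
        return self
```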
```python
        return self

    @property
    def delta(self) -> bool:
```
is it public? do we want it to be public?
I don't see a reason why it shouldn't be public. The user could check whether some chain is in "delta" mode or not. It is also used in some other internal methods, though for that it wouldn't need to be public.

I can make it private as well; I don't have a strong opinion.
no, that's fine .. but if we keep it public we need proper docs for it then ... and an example if you have something in mind
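For instance, the docstring could carry a tiny usage example along these lines (just a sketch, not final wording; the `_delta` attribute name is assumed):

```python
class DataChain:
    ...

    @property
    def delta(self) -> bool:
        """Whether this chain runs in delta (incremental) update mode.

        Example:
            chain = DataChain.from_storage("s3://bucket", delta=True)
            assert chain.delta
        """
        return self._delta  # attribute name assumed
```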
one minor thing is left here @ilongin ... let's please take care of it
shcheklein left a comment
Can we test it on the cattle care scenario? Video files, we extract and save frames for them / or extract and save subvideos ... (it is similar to one of the tutorials). Should it be working for such a scenario? (let me know if you need access to the code)
Yes, please give me access to the code as I don't think I have it.

DMed you the link!
src/datachain/lib/dc/datasets.py (Outdated)

```python
        If two rows have the same values, they are considered the same (e.g., they
        could be different versions of the same row in a versioned source).
        This is used in the delta update to calculate the diff.
    delta_right_on: A list of fields in the final dataset that correspond to the
```
what is the final dataset? is it the result itself of the whole chain?
should we then specify this in save? 🤔
Yes, that's it. I don't think we should specify this in .save(), as these fields can be specific to the source. For example, when we are able to handle multiple sources in a delta update, each source could have its own match fields that correspond to different fields in the final dataset. I was thinking to rename this field but haven't found a better name...
yep, right_on (why right, why not left?) is not clear at all

> we will be able to handle multiple sources in delta update each source could have it's own match fields that correspond to different fields in final dataset

are we going to do subtract per source? is it pretty much about changed / removed objects only?
> yep, right_on (why right, why not left?) is not clear at all

I've updated it to delta_result_on and added a better explanation in the docs. Let me know if this is better.

> are we going to do subtract per source? is it pretty much about changed / removed objects only?
Currently delta works only for one (the first) "starting" point, e.g. from_storage(...) or .from_dataset(...), but in the future it will work for any starting point in the chain. E.g. here we have 2 "starting points":

```python
(
    dc.from_storage("s3://first-bucket", delta=True, delta_on=["id_first"], delta_result_on=["id"])
    .map(...)
    .union(dc.from_storage("s3://second-bucket", delta=True, delta_on=["id_second"], delta_result_on=["id"]))
    .mutate(id=ifelse(isnone(C("id_first")), C("id_second"), C("id_first")))
    .select_except("id_first", "id_second")
)
```

In this example id_first and id_second are both normalized to just id in the final dataset, and then we need to use that to match the "diff" datasets with the final dataset. Matching the "diff" datasets with the final dataset is important to keep only the latest modified files, i.e. remove the old ones from the final dataset.
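To make the matching concrete, a toy illustration with invented values (not from the PR):

```
final dataset (latest)     diff from the source
id | data                  id | data
---+-------                ---+-------
 1 | old                    1 | new     <- modified: matched on id, old row replaced
 2 | old                    3 | new     <- added: appended to the result
```

Matching on `id` (the `delta_result_on` field) is what allows the old version of row 1 to be dropped from the final dataset.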
```python
        This is used in the delta update to calculate the diff.
    delta_result_on: A list of fields in the resulting dataset that correspond
        to the `delta_on` fields from the source.
        This is needed to identify rows that have changed in the source but are
```
do we / can we detect deletions btw?
At the beginning we decided to ignore deletions for now. I do think we should reconsider this.
shcheklein left a comment
If we've tested it at least on some real scenarios, let's merge it.

Testing today on cattle-care for some scripts... found an issue with dataset dependencies, pushed a fix. I want to test a little bit more this evening and then merge.
Adding the ability to do incremental, or delta, updates of a dataset. The way the user runs a delta update is to just re-run the whole script that creates the dataset, with one small modification: adding `delta=True` on the `DataChain.save()` method.

The general idea behind delta update is to not re-build the whole dataset from the source (or sources) once the source has some changes (new or modified files are considered changed), but to run the whole chain in "diff" mode and union the result with the latest version of the resulting dataset. Running the chain in "diff" mode means that we run the whole chain as is, but every starting step (`.from_storage()` or `.from_dataset()`) returns the diff between the latest version and the one the chain was last run and created with.

Facts:

- ... `.union()` or `.merge()` was used in the chain.

!! NOTE that in this PR we implemented only a subset of the above: the situation where we have only one direct dataset dependency, which means `.union()` won't work as expected (it will create duplicates) and `.merge()` might work as expected if some circumstances are met. There will be follow-up PRs for the rest.

Example chain:
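The example chain itself did not survive in this text; a minimal sketch of what it could look like, reconstructed only from the dataset names discussed below:

```python
from datachain import DataChain

# Build "dogs-cats-laion" from two existing datasets; with delta=True,
# re-running this script only processes changes in the dependencies.
(
    DataChain.from_dataset("dogs-cats")
    .union(DataChain.from_dataset("laion"))
    .save("dogs-cats-laion", delta=True)
)
```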
In the above example we have 2 "starting points" when creating the `dogs-cats-laion` dataset:

- `DataChain.from_dataset("dogs-cats")`
- `DataChain.from_dataset("laion")`

In the first run maybe we created the chain with the `dogs-cats` dataset at version `v3`, and on the next chain run the `dogs-cats` dataset is at version `v5`, so `DataChain.from_dataset("dogs-cats")` will return `diff(dogs-cats@v5, dogs-cats@v3)`. The same goes for all other sources / dependencies, which can be datasets or direct listings (which are also datasets behind the scenes).

Note that from the standpoint of the delta update of a chain, only direct dataset dependencies are visible; indirect dependencies are not taken into consideration, as they are created in another, unrelated chain (which could be in another query file as well).
For example, this is the dependency graph of the above example:
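The graph image is not preserved here; a rough text rendering based on the description that follows (bucket paths are placeholders):

```
dogs-cats-laion
├── dogs-cats          (direct dataset dependency)
│   └── listing of s3://...    (indirect: listing dataset)
└── laion              (direct dataset dependency)
    └── listing of gcs://...   (indirect: listing dataset)
```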
In the above, the final `dogs-cats-laion` dataset has 2 direct dataset dependencies. Those in turn have listing datasets as direct dependencies, but from the point of view of this chain that's irrelevant.

If the user wants to update the `dogs-cats-laion` dataset taking into consideration changes in the `s3` and `gcs` buckets, which are not its direct dependencies, then they must run the chains for the `dogs-cats` and `laion` datasets first (those chains can utilize `delta` update as well). Easiest is to put those in the same query file and just re-run the whole thing, as in the above example; see the sketch below.
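A hedged sketch of such a query file (bucket paths are hypothetical; only the dataset names come from the example):

```python
from datachain import DataChain

# Refresh the direct dependencies first (these chains can use delta too).
DataChain.from_storage("s3://dogs-cats-bucket").save("dogs-cats", delta=True)   # hypothetical path
DataChain.from_storage("gs://laion-bucket").save("laion", delta=True)           # hypothetical path

# Then rebuild the combined dataset; it now sees fresh direct dependencies.
(
    DataChain.from_dataset("dogs-cats")
    .union(DataChain.from_dataset("laion"))
    .save("dogs-cats-laion", delta=True)
)
```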
Comparison between the current approach and the final / ideal one:

- Current: implemented at the `DataChain` or "upper" level; ideal: it would need to be used at the `DatasetQuery` or "lower" level ... we need to refactor / remove `DatasetQuery` first.
- Current: enabled with the `delta=True` flag in `.save()`; ideal: enabled with the `delta=True` flag in `.save()` as well.
- `DataChain` methods `union()`, `agg(...)`, `group_by(...)`: with `.agg()`, it works only if identical groups are not found in both sets (delta part and old dataset); with `distinct(...)` / `.agg()`, it works only if distinct group values are not found in both sets (delta part and old dataset).
- `union()` / `merge()`: works if `inner=True` is used, and similar with `agg()` if rows to be merged are not found in both sets (delta / diff and old dataset) but are isolated in only one of those; ideal: works with `inner=False` as well.

Q&A
Q: In the second approach the idea was to use delta on the source side (per each source). This would allow having a single-file source, for example. Neither approach allows that atm, AFAIU. Are there cases like this?
A: In both approaches, when we speak about "source" we actually mean the source or starting dataset from which the resulting dataset is created. It can be a listing dataset if `.from_storage()` is used, or just a regular dataset if `.from_dataset()` is used, for example. Note that there can be multiple starting datasets in a chain, as in the above example with `.union()`. In the question, if I understood correctly, the suggestion is to get distinct values from the `source` column, do a `diff` for each of those, apply the chain to each diff, and then union with the current dataset. This approach is similar to the ideal solution described in the description and table, which should be done in the future. We need to look at starting datasets, or direct dataset dependencies, and not distinct `source` values, as some of those sources could have come from an indirect dependency (another dataset down the tree which is created in another, unrelated chain).

Regarding the one file, yes, that one is tricky. With the current implementation it doesn't break, but every time the dataset is calculated from scratch (delta doesn't work, as there is no dependency in the created dataset: no listing is created, since we just extract data from the file and create dataset rows) .. example:
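The example was cut off here; a hypothetical illustration of such a single-file chain (the path and dataset name are invented):

```python
from datachain import DataChain

# Rows come straight from a single file: no listing dataset is created,
# so the saved dataset has no starting-dataset dependency and delta
# mode has nothing to diff against.
(
    DataChain.from_csv("s3://bucket/annotations.csv")  # hypothetical path
    .save("annotations", delta=True)  # delta has no effect here
)
```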
.from_storage()is used or just a regular dataset if.from_dataset()is used for example. Note that there can be multiple starting datasets in chain as in above example with.union(). In the question, if I understood correctly, suggestion is to get distinct values fromsourcecolumn and then dodifffor each of those and apply chain to each diff and then union with current dataset. This approach is similar to the ideal solution described in description and table which should be done in future. We need to look at starting datasets or direct dataset dependencies and not distinctsourcevalues as some of those sources could have come from indirect dependency (another dataset down the tree which is created in another, not related chain)Regarding the one file, yes that one is tricky. With current implementation it doesn't break but every time dataset is calculated from start (delta doesn't work as there is no dependency in created dataset as there is no listing created since we just extract data from file and create dataset rows) .. example: