Add bulk_remote_exists #637
Definitely interested, and open to contributions. I actually implemented something similar a few months ago in DVC: That approach, however, broke And I'd like to maintain the compatibility.
Regarding the implementation, for bulk checks, we should leverage It can make batched asyncio calls to Utilizing that, I think we can implement batched remote exists check that would be fast enough for all cases.
Do you use DVC pushes
Hi @skshetry! Sami asked me to take over his PR for now. I've significantly rewritten the code to fit with your suggestions. So to summarize:
Changes are available here (these are links to diffs from the current respective
I can make a new PR with the other two repos later on, but I first wanted to comment here and see if you'd accept this approach at all or if there are any bigger changes I should implement first.
Contributions are always welcome. Please go ahead and open the pull requests, and we can discuss details during review.
- use return_exceptions=True for batch retrieval
- skip unnecessary network calls by accepting cached_info
- do a single fs.info call, then pass that info to build_entry
- group storage instances by their underlying ODB path to unify batches, and perform the fs.info call for the entire batch
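For readers following along, here is a minimal sketch of the batching idea in the commit message above. It is not the code from this PR: `FakeRemote`, `bulk_info`, `group_by_odb_path` and `demo` are hypothetical names, and the in-memory filesystem stands in for a real async remote. It only illustrates grouping paths per ODB and gathering `info` calls with `return_exceptions=True`, so that a missing object shows up as a result instead of aborting the batch.

```python
import asyncio
from collections import defaultdict


class FakeRemote:
    """Hypothetical in-memory stand-in for an async remote filesystem."""

    def __init__(self, files):
        self.files = files  # maps path -> size
        self.calls = 0      # counts info() calls, for illustration only

    async def info(self, path):
        self.calls += 1
        if path not in self.files:
            raise FileNotFoundError(path)
        return {"name": path, "size": self.files[path]}


async def bulk_info(fs, paths, batch_size=2):
    """Return {path: info or None}, running info() calls in concurrent batches."""
    results = {}
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        # return_exceptions=True: a failing call yields its exception object
        # in the result list instead of propagating out of gather().
        infos = await asyncio.gather(
            *(fs.info(p) for p in batch), return_exceptions=True
        )
        for path, info in zip(batch, infos):
            if isinstance(info, FileNotFoundError):
                results[path] = None  # object is not in the remote
            elif isinstance(info, BaseException):
                raise info  # unexpected error: surface it
            else:
                results[path] = info
    return results


def group_by_odb_path(entries):
    """Group (odb_path, object_path) pairs so each ODB is checked in one go."""
    groups = defaultdict(list)
    for odb_path, obj_path in entries:
        groups[odb_path].append(obj_path)
    return dict(groups)


def demo():
    fs = FakeRemote({"odb-a/00/aa": 10, "odb-b/11/bb": 20})
    entries = [
        ("odb-a", "odb-a/00/aa"),
        ("odb-a", "odb-a/00/missing"),
        ("odb-b", "odb-b/11/bb"),
    ]
    exists = {}
    for _odb_path, paths in group_by_odb_path(entries).items():
        infos = asyncio.run(bulk_info(fs, paths))
        exists.update({p: info is not None for p, info in infos.items()})
    return exists, fs.calls
```

Calling `demo()` returns the existence map and the number of `info` calls made. Each path is queried once, and the returned info dict could then be handed to whatever builds the entry, rather than fetching it a second time.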
62d09ad to 881f6b4
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #637      +/-   ##
==========================================
+ Coverage   72.42%   73.63%   +1.21%
==========================================
  Files          67       65       -2
  Lines        5132     5307     +175
  Branches      605      618      +13
==========================================
+ Hits         3717     3908     +191
+ Misses       1227     1206      -21
- Partials      188      193       +5
881f6b4 to c6dc3d1
Thanks for the review. All comments should be addressed now. (The commit history is a bit messy at this point, so tell me if I should squash/rebase before we merge.)
Thank you for contributing. I will create new releases of dvc-objects and dvc-data tomorrow, and then review the PR on the DVC side.
NOTE: The code below was entirely LLM-written. Will gladly clean up / rewrite if a contribution of this type would be accepted. Please read on for context.
Our org uses DVC for a bunch of stuff. In most of our pipeline repos we have a CI check that verifies that the pipeline has been fully reproduced (dvc repro --dry --allow-missing) and all data has been pushed to the remote (dvc data status --not-in-remote) before merging to main. Recently, this CI check on one of our repos started timing out because of how long it's taking. I thought I knew what the bottleneck was (inefficient remote checking), confirmed with a profiler, then stuck models on the problem. I've included flamegraphs and cProfile logs from before and after for comparison.
So, is there a version of this change that you'd accept? Note that it also requires a small change to DVC itself.
Cheers!
BEFORE

AFTER

dvc_cprofile.zip