@sjawhar
Contributor

@sjawhar sjawhar commented Dec 3, 2025

NOTE: The code below was entirely LLM-written. Will gladly clean up / rewrite if a contribution of this type would be accepted. Please read on for context.

Our org uses DVC for a bunch of stuff. In most of our pipeline repos we have a CI check that verifies that the pipeline has been fully reproduced (dvc repro --dry --allow-missing) and all data has been pushed to the remote (dvc data status --not-in-remote) before merging to main. Recently, this CI check on one of our repos started timing out because of how long it's taking. I thought I knew what the bottleneck was (inefficient remote checking), confirmed with a profiler, then stuck models on the problem. I've included flamegraphs and cProfile logs from before and after for comparison.

So, is there a version of this change that you'd accept? Note that it also requires a small change to DVC itself.

Cheers!

BEFORE
[image: dvc_flamegraph]

AFTER
[image: dvc_profile_final]

dvc_cprofile.zip

@github-project-automation github-project-automation bot moved this to Backlog in DVC Dec 3, 2025
@CLAassistant

CLAassistant commented Dec 3, 2025

CLA assistant check
All committers have signed the CLA.

@skshetry
Collaborator

skshetry commented Dec 3, 2025

Definitely interested, and open to contributions. I actually implemented something similar a few months ago in DVC:

That approach, however, broke --no-remote-refresh, which I didn’t want to break in a minor release. The full implementation turned out to be more complex, so I ended up reverting the PR. If we remove the --no-remote-refresh flag, the implementation becomes simpler, but that breaks compatibility, and I'd like to maintain compatibility.

build_entry() internally calls fs.info(), so if we can pass the info to it, it would not need to call fs.info() again. Your implementation currently does make that second call.
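A minimal sketch of that idea, assuming a simplified stand-in rather than the real build_entry() signature. The names build_entry_sketch, CountingFS, and the returned dictionary layout are hypothetical and exist only to show how a precomputed info avoids a second lookup:

```python
from typing import Any, Callable, Optional

Info = dict[str, Any]


def build_entry_sketch(
    path: str,
    stat: Callable[[str], Info],
    info: Optional[Info] = None,
) -> Info:
    """Hypothetical stand-in for build_entry(): reuse `info` when given."""
    if info is None:
        # Only hit the filesystem when the caller did not pass info in.
        info = stat(path)
    return {
        "path": path,
        "size": info.get("size"),
        "isdir": info.get("type") == "directory",
    }


class CountingFS:
    """Hypothetical in-memory filesystem that counts info() calls."""

    def __init__(self, files: dict[str, Info]) -> None:
        self.files = files
        self.calls = 0

    def info(self, path: str) -> Info:
        self.calls += 1
        return self.files[path]
```

With this shape, a caller that already holds the info from a batched lookup passes it through, and the call counter stays unchanged.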

Regarding the implementation, for bulk checks, we should leverage fs.info() in batches. This functionality already exists in some form:

https://github.com/treeverse/dvc-objects/blob/0c04cec4c0d97416fad9535e19d0de39f288556a/src/dvc_objects/fs/base.py#L587

It can make batched asyncio calls to fs._info(), or fall back to calling fs.info() in a thread pool executor. The only issue is that it currently raises an error if a file is missing, even in batch mode. That is something we’d need to handle, maybe by extending fs.info() with return_exceptions=True|False or by some other mechanism.

Utilizing that, I think we can implement batched remote exists check that would be fast enough for all cases.
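For illustration, here is a rough sketch of how a return_exceptions switch could turn missing files into results instead of errors. It is not the dvc-objects implementation: FakeAsyncFS, batch_info, and bulk_exists are hypothetical names, and only asyncio.gather(..., return_exceptions=...) is a real standard-library API:

```python
import asyncio
from typing import Any


class FakeAsyncFS:
    """Hypothetical in-memory filesystem; not a dvc-objects class."""

    def __init__(self, files: dict[str, int]) -> None:
        self._files = files

    async def _info(self, path: str) -> dict[str, Any]:
        if path not in self._files:
            raise FileNotFoundError(path)
        return {"name": path, "size": self._files[path], "type": "file"}


async def batch_info(
    fs: FakeAsyncFS, paths: list[str], return_exceptions: bool = False
) -> list[Any]:
    # With return_exceptions=True, gather() puts each exception in the
    # result list instead of raising the first one.
    coros = [fs._info(path) for path in paths]
    return await asyncio.gather(*coros, return_exceptions=return_exceptions)


def bulk_exists(fs: FakeAsyncFS, paths: list[str]) -> dict[str, bool]:
    results = asyncio.run(batch_info(fs, paths, return_exceptions=True))
    exists: dict[str, bool] = {}
    for path, result in zip(paths, results):
        if isinstance(result, FileNotFoundError):
            exists[path] = False  # a missing file is an answer, not an error
        elif isinstance(result, BaseException):
            raise result  # anything else is still a real failure
        else:
            exists[path] = True
    return exists
```

Only FileNotFoundError is treated as "does not exist" here; other exceptions are re-raised so that network or permission failures are not silently reported as missing data.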

@skshetry
Collaborator

skshetry commented Dec 3, 2025

> all data has been pushed to the remote (dvc data status --not-in-remote) before merging to main. Recently, this CI check on one of our repos started timing out because of how long it's taking.

Do you use --not-in-remote with --granular?
If not, how many .dvc files or outputs do you have? Without --granular, DVC makes only a single request per output. For tracked directories, it won't check the files inside, just the .dir file.

DVC pushes the .dir file last, after all the entries tracked by that .dir file have been pushed, so under ideal conditions the result should be the same.

@falko17
Contributor

falko17 commented Dec 7, 2025

Hi @skshetry! Sami asked me to take over his PR for now.

I've significantly rewritten the code [1] to fit with your suggestions. So to summarize:

  • fs.info now has a return_exceptions parameter, which is used by the bulk_*_exists methods
  • We group storage instances by their underlying ODB path to unify batches, then perform the fs.info call for the entire batch and pass the resulting info to build_entry.
  • This also uses the existing batch functionality from fs.info instead of using another ThreadPool on top of it.
  • Finally, there are some smaller fixes/changes, such as the progress bar now updating correctly for bulk calls.
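The grouping step in the summary above could look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: Storage, group_by_odb, bulk_check, and the batch_exists callback are hypothetical names standing in for the real storage instances and the batched fs.info call:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Storage:
    """Hypothetical stand-in for a storage instance backed by an ODB."""

    name: str
    odb_path: str


Request = tuple[Storage, str]  # (storage, object path relative to the ODB)


def group_by_odb(requests: Iterable[Request]) -> dict[str, list[Request]]:
    # Storages that share an ODB path end up in the same batch.
    batches: dict[str, list[Request]] = defaultdict(list)
    for storage, obj in requests:
        batches[storage.odb_path].append((storage, obj))
    return dict(batches)


def bulk_check(
    requests: Iterable[Request],
    batch_exists: Callable[[list[str]], list[bool]],
) -> dict[tuple[str, str], bool]:
    results: dict[tuple[str, str], bool] = {}
    for odb_path, items in group_by_odb(requests).items():
        paths = [f"{odb_path}/{obj}" for _, obj in items]
        found = batch_exists(paths)  # one batched call per underlying ODB
        for (storage, obj), ok in zip(items, found):
            results[(storage.name, obj)] = ok
    return results
```

The point of the grouping is that the number of batched calls scales with the number of distinct ODB paths rather than with the number of storage instances or objects.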

Changes are available here (these are links to diffs from the current respective treeverse:main):

I can open new PRs for the other two repos later on, but I first wanted to comment here and see whether you'd accept this approach at all, or whether there are any bigger changes I should implement first.

Footnotes

  1. It's still a bit messy and could be improved, but I wanted to get your opinion on the approach first.

@skshetry
Collaborator

skshetry commented Dec 7, 2025

Contributions are always welcome. Please go ahead and open the pull requests, and we can discuss details during review.

sjawhar and others added 3 commits December 14, 2025 02:55
- use return_exceptions=True for batch retrieval
- skip unnecessary network calls by accepting cached_info
- do a single fs.info call, then pass that info to build_entry
- we group storage instances by their underlying ODB path to unify
  batches and perform the fs.info call for the entire batch
@sjawhar sjawhar force-pushed the feature/bulk-remote-exists branch 5 times, most recently from 62d09ad to 881f6b4 Compare December 14, 2025 03:32
@codecov

codecov bot commented Dec 14, 2025

Codecov Report

❌ Patch coverage is 96.37306% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.63%. Comparing base (e38c8c2) to head (76a4a5d).
⚠️ Report is 62 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| src/dvc_data/index/index.py | 89.39% | 4 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #637      +/-   ##
==========================================
+ Coverage   72.42%   73.63%   +1.21%     
==========================================
  Files          67       65       -2     
  Lines        5132     5307     +175     
  Branches      605      618      +13     
==========================================
+ Hits         3717     3908     +191     
+ Misses       1227     1206      -21     
- Partials      188      193       +5     


@sjawhar sjawhar force-pushed the feature/bulk-remote-exists branch from 881f6b4 to c6dc3d1 Compare December 14, 2025 03:40
@falko17
Contributor

falko17 commented Dec 17, 2025

Thanks for the review; the comments should now be addressed. (The commit history is a bit messy by now, so tell me if I should squash/rebase before we merge.)

@skshetry skshetry changed the title from "Add optional bulk_remote_exists" to "Add bulk_remote_exists" Dec 18, 2025
@skshetry skshetry merged commit df839dc into treeverse:main Dec 18, 2025
25 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in DVC Dec 18, 2025
@skshetry
Collaborator

Thank you for contributing. I will create new releases of dvc-objects and dvc-data tomorrow, and then review the PR on the DVC side.
