dvc checkout: Checkout takes a huge amount of time despite using hardlink cache type and having multiple .dvc files for each data folder #10491
Comments
Can you attach the profiling data?
I added the picture of the snakeviz visualization of the
You can try zip it and upload.
There you go!
Looks like you have almost 4 million files, which is a lot for dvc to handle.
I see. I did not find any documentation that mentions an upper bound I should be aware of. Do you suggest an upper bound on the number of files?
Can you show the rest of the output from
Thanks for taking a closer look at this problem, guys! I ran
I also forgot to mention (updated in the description) that I am installing and using DVC inside a conda environment. Do you see this as a potential problem? I have also listed the sequence of commands I ran to install DVC inside this virtual env. Do you see anything wrong with that?
Also, I am not very familiar with sqlite, so let me know how I can provide more info about this package. Running
I will be checking this thread regularly to get you the info as soon as possible. Thanks again.
Sorry, you need to run it from inside the project to get additional output.
No, I doubt this is related.
Updated in the description.
This is running in cProfile mode, which makes the call at least twice as slow. The above cProfile output shows that it took 4372µs for a single I ran
Good point, but that's the difference between spending 26 minutes and 11 minutes in sqlite, so it's worth noting that different filesystems and hardware can make a significant difference. Also, subsequent checkouts will have very different profiles and may take less time since this one seems to be done on a fresh index where a lot of hashing and index building is happening.
It should not make any difference to create individual .dvc files since
This may be fixed if we introduce batch-save/bulk-save of the state entries and parallelize hashing of files. For that, iterative/dvc-data#522 is a prerequisite, which might improve building objects and
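To illustrate the batch-save idea mentioned above, here is a minimal sketch using Python's standard sqlite3 module. It is not DVC's actual state code: the `state` table, its columns, and the function names are hypothetical, chosen only to contrast one commit per entry with a single transaction covering all entries.

```python
import sqlite3


def make_db():
    # Hypothetical stand-in for a state database; not DVC's real schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE state (key TEXT PRIMARY KEY, value TEXT)")
    return conn


def save_one_by_one(conn, entries):
    # One INSERT and one commit per entry: many small transactions.
    for key, value in entries:
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, value),
        )
        conn.commit()


def save_in_bulk(conn, entries):
    # All entries written in a single transaction via executemany.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            entries,
        )


if __name__ == "__main__":
    rows = [(f"images/{i}.png", f"hash-{i}") for i in range(1000)]
    db = make_db()
    save_in_bulk(db, rows)
    print(db.execute("SELECT COUNT(*) FROM state").fetchone()[0])
```

Both functions store the same rows; the difference is how many transactions are issued. On an on-disk database each commit typically involves a sync to storage, which is why per-entry writes can dominate when there are millions of entries. An in-memory database like the one here will not show that cost.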
Thanks again for investigating this further. I checked the other issues you mentioned (especially iterative/dvc-data#522). It appears you already have some plans for making changes that will yield significant speed improvements. Would it be possible at this point to give an ETA for when I will be able to try these new speed improvements?
I have about 1000 files in my repo, and checkout is extremely slow; 47 minutes and it's still working. My repo is on a ZFS raid array, and the destination is two striped NVMe drives. i9-13900K, 128GB RAM, so I don't think hardware is the problem.
@JohnAtl, we have recently added concurrent hashing support. It's not released yet, but you can install it with the following command (in the same environment as dvc):
pip install "dvc-data @ git+https://github.com/iterative/dvc-data.git"
As you have pointed out, md5 is slower than other hashing methods. xxhash is a non-cryptographic hash, but we need a cryptographic hash function. So far, blake3 and sha256 are good alternatives. See: But it's very unlikely to happen. Also, dvc caches the hash of a file based on its mtime (and size and inode), so dvc will only hash a file once until it gets modified.
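The two ideas in the comment above, caching a hash keyed on mtime, size, and inode, and hashing files concurrently, can be sketched roughly as follows. This is an illustration only and not dvc-data's implementation: the in-memory `_cache` dict and the function names `cached_md5` and `hash_many` are hypothetical, and a thread pool is just one assumed way to run the hashing in parallel.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a persistent state store.
_cache = {}


def file_md5(path):
    # Hash the file contents in 1 MiB chunks.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path):
    # Reuse the stored hash while (mtime, size, inode) is unchanged.
    st = os.stat(path)
    signature = (st.st_mtime_ns, st.st_size, st.st_ino)
    entry = _cache.get(path)
    if entry is not None and entry[0] == signature:
        return entry[1]
    value = file_md5(path)
    _cache[path] = (signature, value)
    return value


def hash_many(paths, workers=4):
    # Hash several files concurrently; returns {path: hex digest}.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(cached_md5, paths)))
```

With a cache like this, only the first pass over a dataset pays the full hashing cost; later passes reduce to one `os.stat` call per file unless a file has been modified.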
Closed by iterative/dvc-data#546, and released in https://github.com/iterative/dvc/releases/tag/3.54.0.
Bug Report
Description
Recently, I started using DVC for a large project and encountered some problems that I couldn't solve using the provided documentation. I posted my issues in discord and @shcheklein recommended that I should raise an issue here.
Background
images and annotations. The images folder has multiple subfolders containing the actual data.
dvc add images/, which created a single images.dvc file tracking the whole dataset.
Problems
1. With this workflow, when I use the dvc checkout operation to switch between branches, it takes more than an hour just to checkout branches.
2. When I pull a new branch from remote storage, even if the new branch doesn't have many changes, the pull still takes more than an hour.
3. The lengthy pull and checkout times make it hard to convince my team to use DVC.
Tried Solutions
cache.type was defaulting to copy and changed it to hardlink, symlink. This reduced the cache space on my system but didn't affect the dvc checkout times, which are still over an hour. The DVC documentation says the checkout should be instantaneous, but this isn't the case for me.
images.dvc file, I created individual .dvc files for each subfolder inside the images folder. I hoped that since each folder has its own md5 hash value, the dvc checkout operation would be faster, only spending time on folders with different hash values between branches. However, DVC still takes the same amount of time to run dvc checkout. I really had great hopes for this experiment since in this case, when I switched branches, the difference was just one .dvc file having a different hash value, but sadly it still took more than an hour.
I'm not sure what else I can do to speed up the dvc checkout and dvc pull operations. I'd love to hear from the experts here on what else I can try. Thank you.
I am also attaching the result from the profiler here.
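As a rough illustration of what the hardlink, symlink setting described above means at the filesystem level, the sketch below links a file out of a cache directory instead of copying it. It is not DVC's checkout code; `link_from_cache` and the file names are hypothetical. It also hints at why the link type alone may not change checkout time: linking removes the copy, but per-file work such as hashing and state bookkeeping, which the maintainers' comments point to, still has to happen.

```python
import os
import tempfile


def link_from_cache(cache_path, workspace_path):
    # Try a hardlink first, then fall back to a symlink,
    # mirroring the order "hardlink, symlink".
    try:
        os.link(cache_path, workspace_path)
        return "hardlink"
    except OSError:
        os.symlink(cache_path, workspace_path)
        return "symlink"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache_object")
        with open(cached, "wb") as f:
            f.write(b"image bytes")
        linked = os.path.join(tmp, "image.png")
        kind = link_from_cache(cached, linked)
        # Both names resolve to the same underlying file; no data was copied.
        print(kind, os.path.samefile(cached, linked))
```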
Reproduce
Unfortunately, I won't be able to share the dataset. But the problem is simply that dvc checkout takes a lot of time even when I have multiple .dvc files (one for each subfolder), and dvc pull is also very slow even when the changes in the dataset are minimal.
Expected
dvc checkout is fast between branches where nothing much changed.
dvc pull is fast when pulling branches where nothing much changed.
Edit 1:
Output of dvc doctor -v when run from within the project:
Commands used when installing DVC in a conda virtual env
Do you think me installing and running DVC in a conda or python virtual environment may be causing the slowdown?
It would be great if you could help me solve this soon. If you need more info, I will be prompt in providing it.
Thank you.