
dvc checkout: Checkout takes a huge amount of time despite using hardlink cache type and having multiple .dvc files for each data folder #10491

Closed
ameyskulkarni opened this issue Jul 26, 2024 · 17 comments · Fixed by iterative/dvc-data#546
Labels: performance (improvement over resource / time consuming tasks)

Comments

@ameyskulkarni commented Jul 26, 2024

Bug Report

Description

Recently, I started using DVC for a large project and ran into some problems that I couldn't solve using the provided documentation. I posted my issues on Discord, and @shcheklein recommended that I raise an issue here.

Background

  1. I'm working on a CV+ML project with a huge image dataset. The size is only a few tens of GBs, but the dataset has over 2 million files.
  2. The dataset folder contains two subfolders: images and annotations. The images folder has multiple subfolders containing the actual data.
  3. Initially, I started tracking the images folder using dvc add images/, which created a single images.dvc file tracking the whole dataset.

Problems
1. With this workflow, when I use dvc checkout to switch between branches, it takes more than an hour just to check out a branch.
2. When I pull a new branch from remote storage, even if the new branch doesn't have many changes, the pull still takes more than an hour.
3. The lengthy pull and checkout times make it hard to convince my team to use DVC.

Tried Solutions

  1. I realized that cache.type was defaulting to copy and changed it to hardlink, symlink (see the command sketch after this list). This reduced the cache space on my system but didn't affect the dvc checkout times, which are still over an hour. The DVC documentation says the checkout should be instantaneous, but this isn't the case for me.
  2. Instead of a single images.dvc file, I created individual .dvc files for each subfolder inside the images folder. I hoped that, since each folder has its own md5 hash value, the dvc checkout operation would be faster, only spending time on folders whose hash values differ between branches. However, dvc checkout still takes the same amount of time. I had high hopes for this experiment, since when I switched branches the difference was just one .dvc file with a different hash value, but it still took more than an hour.
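
For reference, the cache-type change in item 1 looks roughly like this (a minimal sketch using standard DVC commands; --relink forces existing workspace files to be re-linked against the cache after the config change):

dvc config cache.type "hardlink,symlink"
dvc checkout --relink  # re-link existing workspace files using the new link type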

I'm not sure what else I can do to speed up the dvc checkout and dvc pull operations. I'd love to hear from the experts here on what else I can try. Thank you.

I am also attaching the result from the profiler here:
[screenshot: snakeviz visualization of the dvc checkout profile]

Reproduce

Unfortunately, I won't be able to share the dataset. But the problem, simply put, is that dvc checkout takes a lot of time even when I have multiple .dvc files (one for each subfolder), and dvc pull is also very slow even when the changes in the dataset are minimal.

Expected

  1. dvc checkout is fast between branches where nothing much changed.
  2. dvc pull is fast when pulling branches where nothing much changed.

Edit 1:
Output of dvc doctor -v when run from within the project:

DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.9.19 on Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.15.1
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.4.0
	scmrepo = 3.3.5
Supports:
	azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.1),
	http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	ssh (sshfs = 2024.6.0)
Config:
	Global: /home/ak2004a/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: zfs on rpool/USERDATA/ak2004a_n24lgo
Caches: local
Remotes: None
Workspace directory: zfs on rpool/USERDATA/ak2004a_n24lgo
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/6824f05ebbed124a957b58a7174c6b6f
2024-07-29 09:47:38,286 DEBUG: Check for update is enabled.
2024-07-29 09:47:38,313 DEBUG: Trying to spawn ['daemon', 'updater', '-v']
2024-07-29 09:47:38,318 DEBUG: Spawned ['daemon', 'updater', '-v'] with pid 1680990
2024-07-29 09:47:38,321 DEBUG: Analytics is enabled.
2024-07-29 09:47:38,322 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpkwen9qdc', '-v']
2024-07-29 09:47:38,327 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpkwen9qdc', '-v'] with pid 1680993

Commands used when installing DVC in a conda virtual env

Do you think installing and running DVC in a conda or Python virtual environment may be causing the slowdown?

1. pip install dvc
2. pip install dvc[ssh]
3. pip install dvc-ssh
4. pip install pyopenssl==24.0.0
5. pip install dvc-azure

It would be great if you could help me solve this soon. If you need more info, I will be prompt in providing it. Thank you.

@skshetry (Member)

Can you attach the profiling data?

@skshetry added the awaiting response label on Jul 26, 2024
@ameyskulkarni (Author)

I added a picture of the snakeviz visualization of the dvc checkout command to the description, since GitHub will not let me upload a .prof file. How do you recommend I upload this file? Should I add more screenshots of the snakeviz visualizer?
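
For anyone else profiling this: a dump like that can be produced with DVC's hidden cProfile debug flag and inspected with snakeviz, roughly as below (--cprofile-dump is an internal debugging option rather than a documented interface, so treat this as an assumption):

pip install snakeviz
dvc checkout --cprofile-dump dump.prof
snakeviz dump.prof  # opens an interactive view of the profile in the browser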

@skshetry (Member)

You can try zipping it and uploading that.

@ameyskulkarni (Author)

dump_master_to_feature.zip

There you go!

@skshetry (Member) commented Jul 26, 2024

Looks like you have almost 4 million files, which is a lot for dvc to handle.

@ameyskulkarni (Author) commented Jul 26, 2024

I see. I did not see any documentation that mentions an upper bound I should be aware of. Do you suggest an upper bound?
Also, as described in my issue description, I tried making individual .dvc files for each data subfolder I have. In most cases, only one folder is added, deleted, or changed, so the hash value of only one .dvc file changes. Can DVC not just check the .dvc files that changed and make checkout and pull faster? I notice that the indexing operation runs for a huge amount of time. Can we not reduce this time somehow?

@dberenbaum (Collaborator)

Can you show the rest of the output from dvc doctor? It looks like sqlite is working very slowly. Even for 4 million files, spending 26 minutes in sqlite looks very slow to me. For comparison, I ran dvc checkout on 1.5 million files on an M1 Mac with ~70 secs spent in sqlite. We have discussed batching sqlite operations to speed this up, but it looks odd that it's this slow anyway. @skshetry Do you agree?

@ameyskulkarni (Author) commented Jul 29, 2024

Thanks for taking a closer look at this problem, guys!

I ran dvc doctor -v and edited the original description with the output. That is all I see from the dvc doctor output. Please let me know if I should run a different command to get you a more verbose or informative output.

I also forgot to mention (updated in description) that I am installing and using DVC inside a conda environment. Do you see this as a potential problem? I have also mentioned the sequence of commands I ran to install DVC inside this virtual env. Do you see anything wrong with that?

Also, I am not very well acquainted with sqlite, so let me know how I can provide you more info about it. Running sqlite3 --version reports 3.45.3 as the version number.

Will be checking this thread regularly to get you guys info asap. Thanks again.

@dberenbaum (Collaborator)

> I ran dvc doctor -v and edited the original description with the output. That is all I see from the dvc doctor output. Please let me know if I should run a different command to get you a more verbose or informative output.

Sorry, you need to run it from inside the project to get additional output.

> I also forgot to mention (updated in description) that I am installing and using DVC inside a conda environment. Do you see this as a potential problem? I have also mentioned the sequence of commands I ran to install DVC inside this virtual env. Do you see anything wrong with that?

No, I doubt this is related.

@ameyskulkarni (Author)

Updated in the description; see "Edit 1" above for the full dvc doctor -v output, run from within the DVC repository.

@skshetry (Member) commented Jul 30, 2024

> It looks like sqlite is working very slowly. Even for 4 million files, spending 26 minutes in sqlite looks very slow to me. For comparison, I ran dvc checkout on 1.5 million files on an M1 Mac with ~70 secs spent in sqlite. We have discussed batching sqlite operations (#9813) to speed this up, but it looks odd that it's this slow anyway. @skshetry Do you agree?

This is running in cProfile mode, which makes the calls at least twice as slow. The above profile shows that it took 4372µs for a single __setitem__ call for sqlite (cumulative, per call).

I ran dvc checkout over a small mnist dataset in cProfile mode on my machine (an Intel Mac), and it takes 1871µs for a single __setitem__ call, so the user's machine is only 2.33x slower than mine.

@dberenbaum (Collaborator)

> I ran dvc checkout over a small mnist dataset in cProfile mode on my machine (an Intel Mac), and it takes 1871µs for a single __setitem__ call, so the user's machine is only 2.33x slower than mine.

Good point, but that's the difference between spending 26 minutes and 11 minutes in sqlite, so it's worth noting that different filesystems and hardware can make a significant difference.

Also, subsequent checkouts will have very different profiles and may take less time since this one seems to be done on a fresh index where a lot of hashing and index building is happening.

> 2. Instead of a single images.dvc file, I created individual .dvc files for each subfolder inside the images folder. I hoped that since each folder has its own md5 hash value, the dvc checkout operation would be faster, only spending time on folders with different hash values between branches.

It should not make any difference to create individual .dvc files, since dvc checkout still needs to check what has changed across the entire repo. What can speed up operations instead is guiding dvc when you know which folders have changed: for example, dvc checkout data/subdir will be faster than dvc checkout. Note that even if you have a single data.dvc file, you can still do dvc checkout data/subdir, as sketched below.
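
A minimal sketch of the difference (the subfolder path here is hypothetical):

dvc checkout                  # walks and compares the entire workspace
dvc checkout images/subdir_a  # only checks and relinks the given path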

@skshetry (Member)

This may be fixed if we introduce batch-save/bulk-save of the state entries, and parallelize hashing of files.

For that, iterative/dvc-data#522 is a prerequisite, which might improve building objects and dvc add --no-relink significantly.
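
As a generic illustration of why batch-saving matters for a sqlite-backed state database (the file, table, and schema below are made up for the demo and are not DVC's actual state layout): committing each row separately pays the journal/fsync overhead once per insert, while a single transaction pays it once overall.

sqlite3 demo.db "CREATE TABLE state (path TEXT PRIMARY KEY, md5 TEXT);"
# one implicit transaction per INSERT: one commit for every row
time for i in $(seq 1 500); do sqlite3 demo.db "INSERT INTO state VALUES ('file$i', 'hash$i');"; done
# one explicit transaction for the whole batch: a single commit
time { echo 'BEGIN;'; for i in $(seq 501 1000); do echo "INSERT INTO state VALUES ('file$i', 'hash$i');"; done; echo 'COMMIT;'; } | sqlite3 demo.db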

@ameyskulkarni (Author)

Thanks again for investigating this further. I checked the other issues you mentioned (especially iterative/dvc-data#522). It appears you already have plans for changes that will yield significant speed improvements. Would it be possible at this point to give an ETA for when I will be able to try these improvements?
My team will be unblocked in a big way if we can get the 10-15x speed improvements mentioned!

@skshetry added the performance and optimize labels and removed the awaiting response label on Aug 6, 2024
@skshetry reopened this on Aug 7, 2024
@JohnAtl commented Aug 9, 2024

I have about 1000 files in my repo, and checkout is extremely slow: 47 minutes and it's still working. My repo is on a ZFS raid array, and the destination is two striped NVMe drives, with an i9-13900K and 128GB RAM, so I don't think hardware is the problem.
The files are large, 500-950 MiB each.
Could the bottleneck be calculating the MD5 hashes?
My quick testing comparing the Python xxhash package against hashlib.md5 shows that xxHash128 is about 5x faster than MD5.
Assuming an average of 7 seconds per file, that would be 117 minutes for MD5, versus 1.3 seconds per file, or 22 minutes, for xxHash.
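
If anyone wants to reproduce that comparison, a quick sketch with timeit (each call hashes 1 MiB of random bytes; assumes the xxhash package from PyPI is installed):

python -m timeit -s 'import os, hashlib; data = os.urandom(1 << 20)' 'hashlib.md5(data).hexdigest()'
python -m timeit -s 'import os, xxhash; data = os.urandom(1 << 20)' 'xxhash.xxh128(data).hexdigest()'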

@skshetry (Member) commented Aug 10, 2024

@JohnAtl, we have recently added concurrent hashing support. It's not released yet, but you can install it with the following command (in the same environment as dvc):

pip install "dvc-data @ git+https://github.com/iterative/dvc-data.git"

As you have pointed out, md5 is slower than other hashing methods. xxhash is a non-cryptographic hash, but we need a cryptographic hash function; so far, blake3 and sha256 are good alternatives. But a switch is very unlikely to happen.

Also, dvc caches the hash of each file based on its mtime (plus size and inode), so dvc will only hash a file once, until it is modified.
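
One way to see that state cache at work is to time two consecutive runs (illustrative; the second run should be answered almost entirely from the state database instead of rehashing):

time dvc status  # first run: hashes any files not yet in the state db
time dvc status  # second run: mtime/size/inode hits, little to no hashing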

@skshetry (Member)

Closed by iterative/dvc-data#546, and released in https://github.com/iterative/dvc/releases/tag/3.54.0.
