dvc diff slow when there are many unique additions and deletions #10515

Closed
Northo opened this issue Aug 9, 2024 · 0 comments · Fixed by iterative/dvc-data#550

Northo commented Aug 9, 2024

Bug Report

diff: slow when there are many additions and deletions.

Description

For changes with a large number of additions and deletions, dvc diff is much slower than for comparably sized changes consisting only of additions, deletions, or renames. The slowdown comes from an O(n^2) search for renames in dvc_data.index.diff._detect_renames.

In the minimal example below, normal diff operations take ~0.5 seconds, while the scenario described takes ~15 seconds. For the large datasets we work with, normal operations take on the order of 1 minute, while the scenario described can take from 30 minutes to several hours.
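To illustrate why a pairwise rename search scales quadratically, here is a minimal sketch. It is not the code in dvc_data.index.diff._detect_renames, and it is not necessarily the fix merged in iterative/dvc-data#550. The function names and the (path, hash) tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict, deque


def detect_renames_quadratic(added, deleted):
    """Pair each added entry with the first unmatched deleted entry that has
    the same hash. Inputs are lists of (path, hash) tuples (hypothetical layout).
    Worst case: len(added) * len(deleted) hash comparisons."""
    renames = []
    remaining = list(deleted)
    for new_path, new_hash in added:
        for idx, (old_path, old_hash) in enumerate(remaining):
            if new_hash == old_hash:
                renames.append((old_path, new_path))
                del remaining[idx]
                break
    return renames


def detect_renames_indexed(added, deleted):
    """Same pairing rule, but deleted entries are grouped by hash first,
    so each added entry needs a single dictionary lookup."""
    by_hash = defaultdict(deque)
    for old_path, old_hash in deleted:
        by_hash[old_hash].append(old_path)
    renames = []
    for new_path, new_hash in added:
        candidates = by_hash.get(new_hash)
        if candidates:
            renames.append((candidates.popleft(), new_path))
    return renames
```

In the reproduction below no added file shares a hash with a deleted file, so a pairwise scan never breaks out early: 10,000 additions against 10,000 deletions means 10^8 comparisons, whereas the indexed variant does 10,000 lookups.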

Reproduce

  1. git init && dvc init
  2. Add and commit some dataset. For example:
# dvc.yaml
stages:
  Create dataset:
    cmd: mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
    outs:
    - data
  3. Remove the old dataset, create a new one, and commit. Make sure that both the content of the files and the filenames change.
# dvc.yaml
stages:
  Create dataset:
    cmd: mkdir data2; for i in {1..10000}; do echo $((i + 20000)) > data2/file_$i; done
    outs:
    - data2
  4. dvc diff HEAD^ takes an unreasonably long time compared to changes with only additions, deletions, or renames.
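For anyone who prefers not to rely on shell brace expansion, the two datasets from the steps above can also be generated with a small Python helper. This is a sketch: write_dataset and its parameters are hypothetical names, and it only creates the files. Tracking them with DVC and committing still has to be done separately.

```python
from pathlib import Path


def write_dataset(root, dirname, count, offset=0):
    """Create <root>/<dirname>/file_1 .. file_<count>, each holding one number.
    Mirrors the shell loops in the dvc.yaml stages above (hypothetical helper)."""
    target = Path(root) / dirname
    # Unlike plain `mkdir`, this does not fail if the directory already exists.
    target.mkdir(parents=True, exist_ok=True)
    for i in range(1, count + 1):
        # `echo` writes a trailing newline, so do the same here.
        (target / f"file_{i}").write_text(f"{i + offset}\n")
    return target
```

Calling write_dataset(".", "data", 10000) corresponds to the first stage, and write_dataset(".", "data2", 10000, offset=20000) to the second.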

Expected

dvc diff should be fast.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.53.1 (pip)
-------------------------
Platform: Python 3.12.4 on macOS-14.5-arm64-arm-64bit
Subprojects:
        dvc_data = 3.15.3.dev5+g6ad5866
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.1, aiohttp-retry = 2.8.3)
Config:
        Global: /Users/thorvald/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/ffec761e9e6d289bf04d31239249a168

Additional Information (if any):

Profiling of the worst-case scenario (see ncalls of HashInfo.__eq__).
[Image: profiler output for the worst-case scenario]

Profiling of normal scenario.
[Image: profiler output for the normal scenario]
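As a rough illustration of how the ncalls figure for an equality method grows under a pairwise scan, the sketch below profiles a stand-in class with Python's cProfile. FakeHashInfo, pairwise_matches, and eq_call_count are hypothetical names, not part of dvc_data, and this assumes the profiles above were produced with cProfile or a compatible tool.

```python
import cProfile
import pstats


class FakeHashInfo:
    """Hypothetical stand-in for a hash record; not the dvc_data class."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        return (
            isinstance(other, FakeHashInfo)
            and self.name == other.name
            and self.value == other.value
        )

    def __hash__(self):
        return hash((self.name, self.value))


def pairwise_matches(added, deleted):
    """Compare every added hash with every deleted hash (quadratic)."""
    return sum(1 for a in added for d in deleted if a == d)


def eq_call_count(n):
    """Profile a pairwise scan over n additions and n deletions with no
    matching hashes, and return the recorded call count for __eq__."""
    added = [FakeHashInfo("md5", f"new-{i}") for i in range(n)]
    deleted = [FakeHashInfo("md5", f"old-{i}") for i in range(n)]
    profiler = cProfile.Profile()
    profiler.enable()
    pairwise_matches(added, deleted)
    profiler.disable()
    stats = pstats.Stats(profiler)
    total = 0
    # stats.stats maps (file, line, function name) to
    # (primitive calls, total calls, total time, cumulative time, callers).
    for (_file, _line, func), entry in stats.stats.items():
        if func == "__eq__":
            total += entry[1]
    return total
```

With n additions and n deletions that share no hashes, the count is n * n, which matches the quadratic growth described in the report.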
