diff: slow when there is a large amount of additions and deletions.
Description
For changes with a large number of additions and deletions, dvc diff is much slower than for comparably sized changes consisting only of additions, deletions, or renames. The slowdown comes from an O(n^2) search for renames in dvc_data.index.diff._detect_renames.
In the minimal example below, normal diff operations take ~0.5 seconds, while the situation described takes ~15 seconds. For the large datasets we work with, normal operations take on the order of 1 minute, while the situation described can take from 30 minutes to several hours.
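To illustrate why a pairwise rename search scales quadratically, here is a minimal sketch. It is not the dvc_data code: the function names and the (path, hash) tuple representation are assumptions made for illustration only. The first function scans every remaining added entry for each deleted entry, so when nothing matches (all content changed) it performs len(deleted) * len(added) comparisons. The second builds a hash-to-paths index once and should find the same pairs in roughly linear time.

```python
from collections import defaultdict


def detect_renames_quadratic(deleted, added):
    """Pair deleted and added entries with equal hashes by nested scanning.

    `deleted` and `added` are lists of (path, hash) tuples (hypothetical
    representation). Worst case: no hashes match, so every pair is compared.
    """
    remaining = list(added)
    renames = []
    for old_path, old_hash in deleted:
        for i, (new_path, new_hash) in enumerate(remaining):
            if old_hash == new_hash:
                renames.append((old_path, new_path))
                del remaining[i]  # each added entry is matched at most once
                break
    return renames


def detect_renames_indexed(deleted, added):
    """Same pairing, but using a one-time index from hash to added paths."""
    by_hash = defaultdict(list)
    for new_path, new_hash in added:
        by_hash[new_hash].append(new_path)
    # Reverse each bucket so pop() yields paths in their original order.
    for paths in by_hash.values():
        paths.reverse()

    renames = []
    for old_path, old_hash in deleted:
        candidates = by_hash.get(old_hash)
        if candidates:
            renames.append((old_path, candidates.pop()))
    return renames
```

With 10000 deleted and 10000 added files whose contents all differ, as in the reproduction below, the nested version makes on the order of 10^8 hash comparisons while the indexed version makes about 2 * 10^4 dictionary operations.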
Reproduce
git init && dvc init
Add and commit some dataset. For example
# dvc.yaml
stages:
  Create dataset:
    cmd: mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
    outs:
      - data
Remove the old dataset, create a new one, and commit. Make sure that both the content of the files and the filenames change.
# dvc.yaml
stages:
  Create dataset:
    cmd: mkdir data2; for i in {1..10000}; do echo $((i + 20000)) > data2/file_$i; done
    outs:
      - data2
dvc diff HEAD^ takes an unreasonably long time compared to changes with only additions, deletions, or renames.
Expected
dvc diff should be fast.
Environment information
Output of dvc doctor:
Additional Information (if any):
Profiling in worst case scenario (see ncalls of HashInfo.__eq__).
Profiling of normal scenario.
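For readers who want to see how an ncalls figure like the one mentioned above can be obtained, the following is a rough sketch using the standard cProfile and pstats modules. FakeHashInfo, naive_scan, and count_eq_calls are hypothetical names for this example; FakeHashInfo is only a stand-in with a Python-level __eq__, not the HashInfo class from dvc-data.

```python
import cProfile
import pstats


class FakeHashInfo:
    """Hypothetical stand-in for a hash object; not dvc-data's HashInfo."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


def naive_scan(deleted, added):
    """Compare every deleted entry against added entries until one matches."""
    matches = 0
    for old in deleted:
        for new in added:
            if old == new:
                matches += 1
                break
    return matches


def count_eq_calls(func, *args):
    """Run func under cProfile and return the total call count of __eq__."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive_calls, total_calls, tottime, cumtime, callers).
    for (_filename, _lineno, name), (_cc, ncalls, *_rest) in stats.stats.items():
        if name == "__eq__":
            return ncalls
    return 0


if __name__ == "__main__":
    # Worst case: no value in `added` equals any value in `deleted`.
    deleted = [FakeHashInfo(i) for i in range(100)]
    added = [FakeHashInfo(i + 1000) for i in range(100)]
    print(count_eq_calls(naive_scan, deleted, added))
```

When no entries match, the reported count equals len(deleted) * len(added), which is the quadratic growth the worst-case profile points to.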