Performance(diff): faster rename detection #550
@Northo, can you share benchmarks or profiling information? Also, can you please add a description?
@skshetry, sorry. I intended to make only a draft pull request at this time. I will supply more detailed information soon. For now, see iterative/dvc#10515.
OK, thanks! In that case, consider my PR open for review :)
EDIT: Sorry, I was not deleting the directories between stages. I see the improvements now. :) Since we are using an iterator, I used the following patch to isolate the rename detection step:
diff --git a/src/dvc_data/index/diff.py b/src/dvc_data/index/diff.py
index 40fd5eb..491dfcf 100644
--- a/src/dvc_data/index/diff.py
+++ b/src/dvc_data/index/diff.py
@@ -5,6 +5,7 @@ from typing import TYPE_CHECKING, Any, Callable, Optional, cast
 from attrs import define
 from fsspec.callbacks import DEFAULT_CALLBACK, Callback
+import funcy
 if TYPE_CHECKING:
     from dvc_data.hashfile.hash_info import HashInfo
@@ -328,6 +329,9 @@ def diff( # noqa: PLR0913
     if with_renames and old is not None and new is not None:
         assert not meta_only
-        yield from _detect_renames(changes)
+        changes_list = list(changes)
+        with funcy.print_durations("Detecting renames"):
+            changes = list(_detect_renames(changes_list))
+        yield from changes
     else:
         yield from changes
Here's the script that I tried, for the record:
cd "$(mktemp -d)"
git init
dvc init -q
dvc config -q core.autostage true
git commit -m "init"
mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
dvc add data -q
git commit -am "first"
first=$(git rev-parse HEAD)
rm -rf data
mkdir data; for i in {1..10000}; do echo $i > data/file_$i.ext; done
dvc add data -q
git commit -am "second"
second=$(git rev-parse HEAD)
rm -rf data
mkdir data; for i in {1..10000}; do echo $((i + 20000)) > data/file_$i.ext; done
dvc add data -q
git commit -am "third"
third=$(git rev-parse HEAD)
dvc diff $first $second | tail -n1
dvc diff $second $third | tail -n1
dvc diff $first $third | tail -n1
Output from this PR
Output from
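The patch in the comment above turns `changes` into a list before timing `_detect_renames`. A minimal sketch of why that matters: when stages are chained as generators, timing the consumer also counts the work done by the upstream stage. The names `slow_source`, `downstream`, and `time_stage` are hypothetical and are not part of dvc-data. The sketch uses only the standard library instead of `funcy`.

```python
import time
from collections.abc import Iterable, Iterator


def slow_source(n: int) -> Iterator[int]:
    """Hypothetical upstream stage that does some work per item."""
    for i in range(n):
        time.sleep(0.001)  # stand-in for the cost of producing each change
        yield i


def downstream(items: Iterable[int]) -> Iterator[int]:
    """Hypothetical stage whose cost we want to measure on its own."""
    for item in items:
        yield item * 2


def time_stage(n: int) -> tuple[float, float]:
    # Lazy pipeline: draining downstream() also pulls from slow_source(),
    # so this measurement includes the upstream cost.
    start = time.perf_counter()
    list(downstream(slow_source(n)))
    lazy = time.perf_counter() - start

    # Isolated: materialize the upstream stage first, then time only the
    # downstream stage on the already-built list.
    materialized = list(slow_source(n))
    start = time.perf_counter()
    list(downstream(materialized))
    isolated = time.perf_counter() - start
    return lazy, isolated
```

Calling `time_stage(20)` returns the two durations. The isolated figure should be much smaller, because it no longer includes the upstream sleeps.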
Happy to hear that! I was struggling trying to figure out why it could not be reproduced ;)
Thank you for contributing, and making an amazing improvement in performance to dvc. 🙂
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #550      +/-   ##
==========================================
+ Coverage   62.98%   70.99%   +8.00%
==========================================
  Files          62       67       +5
  Lines        4342     4941     +599
  Branches      740      829      +89
==========================================
+ Hits         2735     3508     +773
+ Misses       1448     1223     -225
- Partials      159      210      +51
☔ View full report in Codecov by Sentry.
Thanks for the really swift follow-up on the PR, @skshetry!
Fixes iterative/dvc#10515
Notes and questions
- [...] `added` and `deleted` before was to optimize the runtime (more likely to get an early hit in the inner loop). I included this now to maintain the former output. In the case of multiple files with the same hash, changing this may result in other pairs being detected as renames. However, if maintaining this is not a priority, I think we can remove it, making it simpler. It will still be deterministic.
- [...] `change.new` and `change.old`, as this seemed to be the convention. LMK if I should change for exceptions, or if we should just assume `_diff` to always supply valid changes.
- [...] `dvc diff` slow when there are many unique additions and deletions dvc#10515?
- `HashInfo` has `unsafe_hash=True`; however, as far as I understand, it should not affect this implementation.
Benchmark
Consider a similar setup as the minimal reproduction in iterative/dvc#10515.
We make three versions of our dataset:
mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
mkdir data; for i in {1..10000}; do echo $i > data/file_$i.ext; done
mkdir data; for i in {1..10000}; do echo $((i + 20000)) > data/file_$i.ext; done
From Tag1 -> Tag2, we rename the files without changing the content. From Tag2 -> Tag3, we change the content of the files.
Run `dvc diff <from> <to>`.
PS: Results show timings from only one experiment.
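The notes above mention an inner loop over deleted entries and the effect of ordering when several files share a hash. Below is a minimal sketch of one way to pair added and deleted entries by hash without a nested scan. It is not this PR's actual implementation. `Entry` and `detect_renames` are hypothetical names, and the hash is a plain string rather than a `HashInfo`.

```python
from collections import defaultdict, deque
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical stand-in for an index entry: a path key and a content hash."""

    key: str
    hash: Optional[str]


def detect_renames(added, deleted):
    # Sort both sides by key so that, when several files share a hash,
    # the pairing is deterministic and follows key order.
    added = sorted(added, key=lambda e: e.key)
    deleted = sorted(deleted, key=lambda e: e.key)

    # Group deleted entries by hash: one pass instead of rescanning the
    # deleted list for every added entry.
    by_hash = defaultdict(deque)
    for entry in deleted:
        by_hash[entry.hash].append(entry)

    renames, still_added = [], []
    for entry in added:
        # Entries without a hash are never treated as renames.
        candidates = by_hash.get(entry.hash) if entry.hash is not None else None
        if candidates:
            renames.append((candidates.popleft(), entry))
        else:
            still_added.append(entry)

    # Whatever was not consumed remains a plain deletion.
    still_deleted = sorted(
        (e for queue in by_hash.values() for e in queue), key=lambda e: e.key
    )
    return renames, still_added, still_deleted
```

With the benchmark's Tag1 -> Tag2 case (same content, new names), every added entry finds its partner with one dictionary lookup. In the Tag2 -> Tag3 case (new content), every lookup misses, so nothing is scanned repeatedly.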