Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance(diff): faster rename detection #550

Merged
merged 18 commits into from
Aug 13, 2024

Conversation

Northo
Copy link
Contributor

@Northo Northo commented Aug 9, 2024

Fixes iterative/dvc#10515

Notes and questions

  • I presume the reason for sorting added and deleted before was to optimize the runtime (more likely to get an early hit in the inner loop). I included this now maintain the former output. In the case of multiple files with the same hash, changing this may result in other pairs being detected as renames. However, if maintaining this is not a priority, I think we can remove it, making it simpler. It will still be deterministic.
  • I used asserts for verifying change.new and change.old, as this seemed to be the convention. LMK. if I should change for exceptions, or if we should just assume _diff to always supply valid changes.
  • I'm only just aware of the https://github.com/iterative/dvc-bench repo. Should we add a test case to it that would have uncovered dvc diff slow when there are many unique additions and deletions dvc#10515?
  • It is not entirely clear to me why HashInfo has unsafe_hash=True, however, as far as I understant, it should not affect this implementation.

Benchmark

Consider a similar setup as the minimal reproduction in iterative/dvc#10515.
We make three versions of our dataset:

  • Tag1: mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
  • Tag2: mkdir data; for i in {1..10000}; do echo $i > data/file_$i.ext; done
  • Tag3: mkdir data; for i in {1..10000}; do echo $((i + 20000)) > data/file_$i.ext; done
    From Tag1 -> Tag2, we rename the files, without changing the content. From Tag2-> Tag3, we change the content of the files.

Run dvc diff <from> <to>.

Change Old New
Tag1 -> Tag2 0.49 s 0.50 s
Tag2 -> Tag3 0.46 s 0.46 s
Tag1 -> Tag3 14.6 s 0.56 s

Ps. Results show timings from only one experiment.

@Northo Northo force-pushed the performance/faster-rename-detection branch from 4278ea5 to 3817baa Compare August 9, 2024 14:03
@skshetry
Copy link
Member

skshetry commented Aug 9, 2024

@Northo, can you share benchmarks or profiling information? Also, can you please add a description?

@Northo Northo marked this pull request as draft August 9, 2024 14:57
@Northo
Copy link
Contributor Author

Northo commented Aug 9, 2024

@Northo, can you share benchmarks or profiling information? Also, can you please add a description?

@skshetry, sorry. Inteded to make a draft pull request only, at this time. Will supply more detailed inormation soon. For now, see iterative/dvc#10515.

@Northo Northo marked this pull request as ready for review August 12, 2024 08:19
@Northo Northo changed the title Performance/faster rename detection Performance/faster rename detection in diff Aug 12, 2024
@Northo Northo changed the title Performance/faster rename detection in diff Performance(diff): faster rename detection Aug 12, 2024
@Northo Northo force-pushed the performance/faster-rename-detection branch from 66d986e to 21e0223 Compare August 12, 2024 08:55
@Northo
Copy link
Contributor Author

Northo commented Aug 12, 2024

@skshetry, one benchmark test failed with regression. I cannot see that it is related to these changes, and also see that it first failed and then suceeded on #552. Have I made a mistake, or is it just flaky?

@skshetry
Copy link
Member

@skshetry, one benchmark test failed with regression. I cannot see that it is related to these changes, and also see that it first failed and then suceeded on #552. Have I made a mistake, or is it just flaky?

yeah, it's flaky. Please ignore.

@Northo
Copy link
Contributor Author

Northo commented Aug 12, 2024

@skshetry, one benchmark test failed with regression. I cannot see that it is related to these changes, and also see that it first failed and then suceeded on #552. Have I made a mistake, or is it just flaky?

yeah, it's flaky. Please ignore.

Ok, thanks! In that case, consider my PR open for review:)

@skshetry
Copy link
Member

skshetry commented Aug 13, 2024

I tried your script above (but created 100,000 files instead of 10,000). But I did not see any meaningful changes, and in fact it was slightly slower than before.

EDIT: Sorry, I was not deleting the directories between each stages. I see the improvements now. :)

Since we are using an iterator, I used the following patch to isolate _detect_renames.

diff --git a/src/dvc_data/index/diff.py b/src/dvc_data/index/diff.py
index 40fd5eb..491dfcf 100644
--- a/src/dvc_data/index/diff.py
+++ b/src/dvc_data/index/diff.py
@@ -5,6 +5,7 @@ from typing import TYPE_CHECKING, Any, Callable, Optional, cast

 from attrs import define
 from fsspec.callbacks import DEFAULT_CALLBACK, Callback
+import funcy

 if TYPE_CHECKING:
     from dvc_data.hashfile.hash_info import HashInfo
@@ -328,6 +329,9 @@ def diff(  # noqa: PLR0913

     if with_renames and old is not None and new is not None:
         assert not meta_only
-        yield from _detect_renames(changes)
+        changes_list = list(changes)
+        with funcy.print_durations("Detecting renames"):
+            changes = list(_detect_renames(changes_list))
+        yield from changes
     else:
         yield from changes

Here's the script that I tried for the record:

Details

cd "$(mktemp -d)"
git init
dvc init -q
dvc config -q core.autostage true
git commit -m "init"

mkdir data; for i in {1..10000}; do echo $i > data/file_$i; done
dvc add data -q
git commit -am "first"
first=$(git rev-parse HEAD)

rm -rf data
mkdir data; for i in {1..10000}; do echo $i > data/file_$i.ext; done
dvc add data -q
git commit -am "second"
second=$(git rev-parse HEAD)

rm -rf data
mkdir data; for i in {1..10000}; do echo $((i + 20000)) > data/file_$i.ext; done
dvc add data -q
git commit -am "third"
third=$(git rev-parse HEAD)

dvc diff $first $second | tail -n1
dvc diff $second $third | tail -n1
dvc diff $first $third | tail -n1

Output from this PR

Initialized empty Git repository in /tmp/tmp.SL6nBZbf4r/.git/
[main (root-commit) 797a1c0] init
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
[main 7410fd4] first
 3 files changed, 9 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 data.dvc
[main b6a8d9c] second
 1 file changed, 1 insertion(+), 1 deletion(-)
[main b510be5] third
 1 file changed, 2 insertions(+), 2 deletions(-)
   34.54 ms in Detecting renames
files summary: 10000 renamed
  458.94 mks in Detecting renames
files summary: 10000 modified
   26.89 ms in Detecting renames
files summary: 10000 added, 10000 deleted

Output from main

Initialized empty Git repository in /tmp/tmp.z655bZNqCi/.git/
[main (root-commit) f8ba981] init
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
[main caa3a37] first
 3 files changed, 9 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 data.dvc
[main b697821] second
 1 file changed, 1 insertion(+), 1 deletion(-)
[main e413ebe] third
 1 file changed, 2 insertions(+), 2 deletions(-)
   25.62 ms in Detecting renames
files summary: 10000 renamed
  476.12 mks in Detecting renames
files summary: 10000 modified
   15.20 s in Detecting renames
files summary: 10000 added, 10000 deleted

@Northo
Copy link
Contributor Author

Northo commented Aug 13, 2024

EDIT: Sorry, I was not deleting the directories between each stages. I see the improvements now. :)

Happy to hear that! I was struggling trying to figure out why it could not be reproduced;)
Thanks for the well-formed reproduction script!

@Northo Northo requested a review from skshetry August 13, 2024 11:59
Copy link
Member

@skshetry skshetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing, and making an amazing improvement in performance to dvc. 🙂

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.99%. Comparing base (081a1d8) to head (9e53ce5).
Report is 33 commits behind head on main.

Files Patch % Lines
src/dvc_data/index/diff.py 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #550      +/-   ##
==========================================
+ Coverage   62.98%   70.99%   +8.00%     
==========================================
  Files          62       67       +5     
  Lines        4342     4941     +599     
  Branches      740      829      +89     
==========================================
+ Hits         2735     3508     +773     
+ Misses       1448     1223     -225     
- Partials      159      210      +51     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skshetry skshetry merged commit 98b5e73 into iterative:main Aug 13, 2024
15 of 16 checks passed
@Northo
Copy link
Contributor Author

Northo commented Aug 13, 2024

Thank you for contributing, and making an amazing improvement in performance to dvc. 🙂

Thanks for really swift follow-up on the PR @skshetry !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc diff slow when there are many unique additions and deletions
3 participants