Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues with large git indexes #2938

Closed
krlvi opened this issue Feb 28, 2024 — with Linear · 11 comments
Closed

Performance issues with large git indexes #2938

krlvi opened this issue Feb 28, 2024 — with Linear · 11 comments
Labels
performance Application is too slow

Comments

Copy link
Member

krlvi commented Feb 28, 2024

Repositories that have high cardinality of files (20K+ of distinct files) are slow to operate on.
Initial investigation indicates that this is related to frequent operations on a large index.

Discord thread:https://discord.com/channels/1060193121130000425/1073202153163857920/1211957896968142912

@krlvi krlvi added the performance Application is too slow label Feb 28, 2024 — with Linear
@rhizoome
Copy link

rhizoome commented Mar 6, 2024

I presented GitButler to my colleagues and they of course immediately tried to use it on our largest repository. Moving changes between virtual-branches takes over two seconds. So I started to dig into this. And ran a profiler.

92% of the time is spent in git_diff_tree_to_workdir.

perf

So I am currently reading the code and thinking about the issue. I just wanted to give feedback early.

EDIT

Findings

  • We seem to endup in modified_uncertain way too often, maybe rehashing the complete workdir.
  • Yes, definitely every file gets hashed because the item that comes from the tree has no mtime, ctime, size or anything like that set.
oitem	const git_index_entry *	0x118fdb8e8	0x0000000118fdb8e8
ctime	git_index_time	
seconds	int32_t	0
nanoseconds	uint32_t	0
mtime	git_index_time	
seconds	int32_t	0
nanoseconds	uint32_t	0
dev	uint32_t	0
ino	uint32_t	0
mode	uint32_t	33188
uid	uint32_t	0
gid	uint32_t	0
file_size	uint32_t	0
id	git_oid	
flags	uint16_t	0
flags_extended	uint16_t	0
path	const char *	"django/camac/alexandria/extensions/__init__.py"	0x00006000012c4c80
  • We end up in git_odb__hashfd for every file, every time we move commit
  • But what then gets hashed is the item in the workdir and it is possible to cache these hashes since most files obviously do not change. It is just not a feature libgit2 has.

@krlvi
Copy link
Member Author

krlvi commented Mar 6, 2024

hey! thanks for the profiling info! how many tracked files do you have? git ls-files | wc -l
The app updates the index too frequently, and one immediate optimization I was working on is to detect noops and not write in those cases.

@krlvi
Copy link
Member Author

krlvi commented Mar 6, 2024

something that we will try out soon, is see if can use gitoxide for the heaviest operations GitoxideLabs/gitoxide#1287 but either way we'd need to make the flow efficient the the the snappiest possible experience

@rhizoome
Copy link

rhizoome commented Mar 6, 2024

$ git ls-files | wc -l
11611

It seems like rehashing a lot, we also have a lot large snapshot-test files. (See findings)

@rhizoome
Copy link

rhizoome commented Mar 15, 2024

I think I found the problem. In git_diff_tree_to_workdir the hashes from the workdir (filesystem) are not cached. It will rehash every unchanged file. The cache seems to be missing, because we compare to a (virtual) tree. I assume it is not implemented, because the cache cannot be stored in existing git-structures. libgit2 would have to store this cache at a place currently not existing. But it also seems difficult to fix that outside libgit2. (See also edits above)

@rhizoome
Copy link

If we use git_diff_tree_to_workdir_with_index we get a cache. I guess this means we could fix this, by reverting the effect of staged data before doing the diff. I assume users shouldn't stage anything when using GitButler anyways.

diff --git a/gitbutler-app/src/git/repository.rs b/gitbutler-app/src/git/repository.rs
index 7893c12b..db025c65 100644
--- a/gitbutler-app/src/git/repository.rs
+++ b/gitbutler-app/src/git/repository.rs
@@ -148,7 +148,7 @@ impl Repository {
         opts: Option<&mut git2::DiffOptions>,
     ) -> Result<git2::Diff<'_>> {
         self.0
-            .diff_tree_to_workdir(old_tree.map(Into::into), opts)
+            .diff_tree_to_workdir_with_index(old_tree.map(Into::into), opts)
             .map_err(Into::into)
     }

@krlvi
Copy link
Member Author

krlvi commented Mar 15, 2024

Thank you so much for showing this @ganwell! You are right that staging is not something that is in the flow. I am gonna merge your patch to do some testing with the nightly build over the weekend 🙇

@rhizoome
Copy link

rhizoome commented Mar 20, 2024

Thank you so much for showing this @ganwell! You are right that staging is not something that is in the flow. I am gonna merge your patch to do some testing with the nightly build over the weekend 🙇

I saw the failure at #3158. I checked cargo test and came to the conclusion the index should be updated before being used.

Steps

  • I updated my code to test lastest version b594c68
  • Checked with the profiler that the problem still exists
  • Run the cargo test with the old patch -> fails
  • Added index.update_all the to diff_tree_to_workdir
  • Checked the profiler: It does not show reindexing. The GUI also feels fast
  • Run the cargo test with the new patch -> ok

Considerations

  • I read to code and could not find anything that could be hurt by the update_all
  • I am kind of confused why the index never gets updated currently, but the index is only used with a checkout. Maybe on checkout it gets updated automatically
  • I currently assume that if the index does not exist, it is created on diff_tree_to_workdir_with_index, but I haven't read to code the check that. At least in unittests all conditions seem to be met.
  • EDIT: Important: I still haven't tested what happens if a user would stage something. Maybe we could just reset any staging, like additional commits get removed if the repo is in GitButler Mode -> GitButler reverts staging. We should be save!
diff --git a/gitbutler-app/src/git/repository.rs b/gitbutler-app/src/git/repository.rs
index 7893c12b..0017fe41 100644
--- a/gitbutler-app/src/git/repository.rs
+++ b/gitbutler-app/src/git/repository.rs
@@ -147,8 +147,11 @@ impl Repository {
         old_tree: Option<&Tree<'_>>,
         opts: Option<&mut git2::DiffOptions>,
     ) -> Result<git2::Diff<'_>> {
+        if let Ok(mut index) = self.0.index() {
+            index.update_all(vec!["*"], None)?;
+        }
         self.0
-            .diff_tree_to_workdir(old_tree.map(Into::into), opts)
+            .diff_tree_to_workdir_with_index(old_tree.map(Into::into), opts)
             .map_err(Into::into)
     }

@krlvi
Copy link
Member Author

krlvi commented Mar 20, 2024

thanks! let me get that patch in - i got distracted with other things. I really appreciated your help here, i wanna buy you a beer!
regarding staging, i think it is already a corner case that may haver unknown outcomes - i will investigate

@rhizoome
Copy link

rhizoome commented Mar 20, 2024

regarding staging, i think it is already a corner case that may haver unknown outcomes - i will investigate

GitButler reverts staging as soon as it does its update-cycle. I to my understanding that means using the index is save. If nothing is staged diff_tree_to_workdir and diff_tree_to_workdir_with_index should be equal.

@krlvi
Copy link
Member Author

krlvi commented Mar 20, 2024

Once again, than you for debugging and profiling this @ganwell ! I have merged your patch and if you spot anything else, you are welcome to open a PR directly too. I think i'm gonna close this issue.

If you would like, ping me on [email protected] and i will make sure that multiple beers get delivered to your door 🍻!

@krlvi krlvi closed this as completed Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Application is too slow
Projects
None yet
Development

No branches or pull requests

2 participants