add db-gen program to create DBs 3x to 20x as fast (powered by gitoxide) #6
base: master
I ran the version with optimized caches against cpython for the first time and it finished in 16s.
Wow, thanks for the input. I was looking into using GitPython to try and generate this.
Glad I came along to prevent this - GitPython isn't good, trust me, I know ;).
Probably not, as it can't yet do rename tracking. I saw that the Python script relies on an orderly processing of files, from first commit to last, and that's not done here either. The good thing is that the order can be re-introduced by adding sequential ids to chunks, so that's absolutely solvable without losing parallelism. What's more concerning is that rename tracking isn't implemented in …
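The idea of tagging chunks with sequential ids can be sketched roughly as follows. This is only an illustration in Python using the standard library, not how `db-gen` does it: `process_chunk` and `ordered_results` are hypothetical names, and the per-chunk work is a placeholder for the real diffing.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(seq_id, commits):
    # Placeholder for the expensive per-chunk diff work (hypothetical).
    return seq_id, [f"diff:{c}" for c in commits]


def ordered_results(chunks, workers=4):
    """Yield chunk results in submission order, even though chunks finish in any order."""
    pending = []   # min-heap keyed by sequential id
    next_id = 0    # id of the next chunk that may be emitted
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, i, chunk) for i, chunk in enumerate(chunks)]
        for future in as_completed(futures):
            heapq.heappush(pending, future.result())
            # Flush every result that is now contiguous with what was already emitted.
            while pending and pending[0][0] == next_id:
                yield heapq.heappop(pending)[1]
                next_id += 1
```

The workers stay fully parallel; only the consumer waits for the next id in sequence, buffering results that arrive early.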
I took another look at the rename tracking problem and realized, to my surprise, that the default is to do rename tracking, and to consider files that are at least 50% similar as renames. Since we already know how to do diffs, this would just be another version of it, causing many more diffs to be created between the deleted and added files (to determine their similarity). If you don't mind, please feel free to keep this PR open even without rename tracking, and I will implement it in …
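To make the cost of this concrete, here is a minimal sketch of similarity-based rename detection, assuming the file contents are available as strings. It is not git's or gitoxide's algorithm: `difflib.SequenceMatcher` stands in for the real similarity measure, and `detect_renames` is a hypothetical name. Note the nested loop, which is where the "many more diffs" come from.

```python
from difflib import SequenceMatcher


def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted and added paths whose contents are at least `threshold` similar.

    `deleted` and `added` map path -> file content (str).
    """
    candidates = []
    # Every deleted file is compared against every added file.
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold:
                candidates.append((score, old_path, new_path))

    renames, used_old, used_new = {}, set(), set()
    # Greedily accept the best-scoring pairs first; each path is used at most once.
    for score, old_path, new_path in sorted(candidates, reverse=True):
        if old_path not in used_old and new_path not in used_new:
            renames[old_path] = new_path
            used_old.add(old_path)
            used_new.add(new_path)
    return renames
```

For example, a deleted `a.py` and an added `b.py` with identical content would be reported as a rename, while an unrelated added file would be left alone.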
025b9d7 to 93a3d92
@jmforsythe Would you mind adding a license file to the repository? I am now working on rename tracking within …
I haven't decided on a license yet. What specifically do you need it for? If it is just the database schema, then go ahead.
Thanks, that's exactly what I would have needed it for. I will then feel free to use the DB schema as is, and will probably link to your comment somewhere in the example code for reference and attribution.
93a3d92 to 380182f
@jmforsythe Rename tracking has been implemented in … In any case, please let me know what you think.

Edit: it looks like rename tracking might violate a constraint, which seems to happen when building indices at the very end of the run. It probably hadn't finished yet, as the expected runtime was 21 minutes. This is probably a sign that an investigation is needed here; updates will follow.
…ide`) It produces diffs in parallel, and despite wasting quite a bit of CPU due to less-than-stellar object access performance for diffs, it still manages to create a linux kernel database in ~21 minutes (M1 Pro).
380182f to ec7628a
Thanks for the patience - I believe the underlying issue was addressed, so this implementation will track renames as well. You probably want to validate the tool's output against the baseline as well, which is something I have never done. I'd expect it to be indistinguishable for the most part, yet I would be very interested to see a data-diff in case there are indeed differences, and how this looks in practice.
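A data-diff of this kind could be sketched as below. This assumes both generators write SQLite databases with the same schema, which is an assumption here; `data_diff` and `table_rows` are hypothetical helpers, and the table names must be supplied by the caller.

```python
import sqlite3


def table_rows(conn, table):
    # Table names cannot be bound as SQL parameters, so only pass trusted names.
    return set(conn.execute(f'SELECT * FROM "{table}"'))


def data_diff(baseline, candidate, tables):
    """Return {table: (rows only in baseline, rows only in candidate)} for differing tables."""
    diff = {}
    for table in tables:
        a, b = table_rows(baseline, table), table_rows(candidate, table)
        if a != b:
            diff[table] = (a - b, b - a)
    return diff
```

Because rows are compared as sets, duplicate rows and row order are ignored, and auto-generated ids that differ between the two runs would show up as differences, so such columns may need to be excluded first.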
I'll try and write some tests soon to validate your generator.
ff96bf3 to 6622df0
The db-gen program uses gitoxide to produce diffs in parallel, and despite wasting quite a bit of CPU due to less-than-stellar object access performance for diffs, it still manages to create a Linux kernel database in 21 minutes (M1 Pro).

Tasks
gitoxide
gitoxide
version