add db-gen program to create DBs 3x to 20x as fast (powered by gitoxide) #6

Open
wants to merge 1 commit into
base: master


@Byron Byron commented Jan 29, 2023

The db-gen program uses gitoxide to produce diffs in parallel. Despite wasting quite a bit of CPU due to less-than-stellar object access performance for diffs, it still manages to create a Linux kernel database in ~21 minutes (M1 Pro).

Tasks

  • wait for rename-tracking support in gitoxide
  • optional 'find-copies', implemented as '--find-copies-harder'
  • ordering of diff stats in consumer to get renames into the right order as well
  • upgrade to latest gitoxide version
  • fix consistency issue

@Byron
Author

Byron commented Jan 29, 2023

I ran the version with optimized caches against cpython for the first time and it finished in 16s.

@Byron Byron changed the title from "add db-gen program to create DBs twice as fast (powered by gitoxide)" to "add db-gen program to create DBs thrice as fast (powered by gitoxide)" Jan 29, 2023
@Byron Byron changed the title from "add db-gen program to create DBs thrice as fast (powered by gitoxide)" to "add db-gen program to create DBs 3x to 20x as fast (powered by gitoxide)" Jan 29, 2023
@jmforsythe
Owner

Wow thanks for the input, I was looking into using gitpython to try and generate this.
I'm not certain that this can be done in parallel, as the file renaming/creating/deleting detection was very fiddly to get right.
I haven't fully read your branch yet, does it account for renaming?

@Byron
Author

Byron commented Jan 29, 2023

> Wow thanks for the input, I was looking into using gitpython to try and generate this.

Glad I came along to prevent this - GitPython isn't good, trust me, I know ;).

> I haven't fully read your branch yet, does it account for renaming?

Probably not, as it can't yet do rename tracking. I saw that the python script is relying on an orderly invocation of files, from first commit to last, and that's not done here either.

The good thing is that the order can be re-introduced by adding sequential ids to chunks, so that's absolutely solvable without losing parallelism. What's more concerning is that rename tracking isn't implemented in gitoxide yet, so simple rename tracking would have to be implemented here, which could then be backported (simple, as in the renamed file wasn't changed and has the same hash).
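To illustrate the sequential-id idea, here is a minimal sketch using only the Rust standard library. It is not taken from the db-gen branch: `process_in_order`, its parameters, and the one-thread-per-chunk layout are assumptions made for the example (a real implementation would more likely use a fixed-size worker pool).

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Process `chunks` on worker threads, then hand the results to `consume`
/// in the original chunk order, regardless of which worker finishes first.
/// Hypothetical helper for illustration; not part of db-gen or gitoxide.
fn process_in_order<T, R>(chunks: Vec<T>, work: fn(T) -> R, mut consume: impl FnMut(R))
where
    T: Send + 'static,
    R: Send + 'static,
{
    let (tx, rx) = mpsc::channel::<(usize, R)>();

    // Tag every chunk with a sequential id before handing it to a worker.
    let handles: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(id, chunk)| {
            let tx = tx.clone();
            thread::spawn(move || {
                // A send error only means the consumer is gone; ignore it.
                let _ = tx.send((id, work(chunk)));
            })
        })
        .collect();
    // Drop the original sender so the channel closes when the last worker is done.
    drop(tx);

    // Buffer out-of-order results until the next expected id shows up.
    let mut pending = BTreeMap::new();
    let mut next_id = 0;
    for (id, result) in rx {
        pending.insert(id, result);
        while let Some(result) = pending.remove(&next_id) {
            consume(result);
            next_id += 1;
        }
    }

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
}
```

The workers stay fully parallel; only the consumer serializes, and it holds back at most the results that arrived ahead of the next expected id.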

@Byron
Author

Byron commented Jan 30, 2023

I took another look at the rename tracking problem and realized, to my surprise, that Git's default is to do rename tracking, and to treat files that are at least 50% similar as renames. Since we already know how to do diffs, this would just be another version of it, causing many more diffs to be created between the deleted and added files (to determine their similarity).
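A rough sketch of what similarity-based rename detection looks like, assuming a simplified line-based similarity measure. This is not the algorithm Git or gitoxide use, and `line_similarity` and `find_renames` are hypothetical names. It does show why the cost grows: every added file may be compared against every deleted file.

```rust
use std::collections::HashMap;

/// Fraction of lines shared between `old` and `new`, in the range 0.0..=1.0.
/// A simplified stand-in for a real similarity measure.
fn line_similarity(old: &str, new: &str) -> f64 {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in old.lines() {
        *counts.entry(line).or_insert(0) += 1;
    }
    let mut shared = 0usize;
    for line in new.lines() {
        if let Some(n) = counts.get_mut(line) {
            if *n > 0 {
                *n -= 1;
                shared += 1;
            }
        }
    }
    let longest = old.lines().count().max(new.lines().count());
    if longest == 0 {
        1.0
    } else {
        shared as f64 / longest as f64
    }
}

/// Pair each added file with the first unused deleted file that is at least
/// `threshold` similar. Files are (path, content) tuples; the result holds
/// (old_path, new_path) pairs.
fn find_renames<'a>(
    deleted: &[(&'a str, &'a str)],
    added: &[(&'a str, &'a str)],
    threshold: f64,
) -> Vec<(&'a str, &'a str)> {
    let mut used = vec![false; deleted.len()];
    let mut renames = Vec::new();
    for (new_path, new_content) in added {
        for (i, (old_path, old_content)) in deleted.iter().enumerate() {
            if !used[i] && line_similarity(old_content, new_content) >= threshold {
                used[i] = true;
                renames.push((*old_path, *new_path));
                break;
            }
        }
    }
    renames
}
```

Calling `find_renames(&deleted, &added, 0.5)` corresponds to the 50% default mentioned above.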

If you don't mind, please feel free to keep this PR open even without rename tracking, and I will implement it in gitoxide and be back here to finish it up.

@Byron
Author

Byron commented Feb 9, 2023

@jmforsythe Would you mind adding a license file to the repository? I am now working on rename tracking within gitoxide and am considering this program here as an example for 'how to use gitoxide to generate a DB of information from a git repository'. If your license was MIT or Apache, I would be able to do that easily, given that I copied the database definition verbatim. Thank you.

@jmforsythe
Owner

I haven't decided on a license yet. What specifically do you need it for? If it is just the database schema, then go ahead.

@Byron
Author

Byron commented Feb 10, 2023

Thanks, that's exactly what I would have needed it for. Then I will feel free to use the DB schema as is and probably link to your comment somewhere in the example code for reference and attribution.

@Byron
Author

Byron commented Feb 20, 2023

@jmforsythe Rename tracking has been implemented in gitoxide and luckily, it's just as fast as it was before. Copy tracking could also be activated without noticeable cost, but the database schema doesn't support that yet. Please note that the results of the rename tracking might occasionally differ, as gitoxide uses the first suitable candidate whereas git will try up to 4 candidates and use the best. Hence, gitoxide is currently less precise.
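The difference between the two selection strategies can be sketched as follows. The functions take precomputed similarity scores and are hypothetical; in particular, whether the limit of 4 applies to all candidates or only to suitable ones is an assumption of this sketch, not something stated in the thread.

```rust
/// Index of the first candidate whose similarity reaches `threshold`
/// ("first suitable candidate").
fn first_suitable(scores: &[f64], threshold: f64) -> Option<usize> {
    scores.iter().position(|&s| s >= threshold)
}

/// Index of the best among the first `limit` candidates that reach
/// `threshold` ("try up to N candidates and use the best").
fn best_of(scores: &[f64], threshold: f64, limit: usize) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .filter(|(_, s)| **s >= threshold)
        .take(limit)
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

With scores `[0.3, 0.6, 0.9, 0.7]` and a threshold of 0.5, `first_suitable` picks index 1 while `best_of(.., 4)` picks index 2, which is the kind of occasional divergence described above.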

In any case, please let me know what you think.

Edit: it looks like rename tracking might violate a constraint, which seems to happen when building indices at the very end of the run. The run probably wasn't finished yet, as the expected runtime was 21 minutes. This is probably a sign that an investigation is needed here; updates will follow.

cargo build --release && rm linux*; /usr/bin/time -lp ./target/release/db-gen /Users/byron/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux -o 800
[..]
21:08:38 traverse commit graph done 1.1M commits in 13.55s (83.9k commits/s)
Error: UNIQUE constraint failed: commitFile.hash, commitFile.fileID

Caused by:
    Error code 1555: A PRIMARY KEY constraint failed
real 1151.10
user 10680.99
sys 53.93
          5487312896  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              362162  page reclaims
              186714  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              100927  voluntary context switches
             7134379  involuntary context switches
      84275809400944  instructions retired
      30851432674389  cycles elapsed
          3668195840  peak memory footprint

Git-Heat-Map/db-gen ( faster-db-generation) [?] took 19m23s
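The cause of the constraint failure is not established in this thread. One way to start investigating would be to check a batch of rows for repeated (hash, fileID) keys before inserting them. The sketch below assumes rows are available in memory as `(hash, file_id)` tuples; `duplicate_keys` is a hypothetical helper, not part of db-gen.

```rust
use std::collections::HashSet;

/// Return the (hash, file_id) pairs that occur more than once in `rows`,
/// each reported once, in the order their first repeat is seen.
fn duplicate_keys<'a>(rows: &[(&'a str, u64)]) -> Vec<(&'a str, u64)> {
    let mut seen = HashSet::new();
    let mut reported = HashSet::new();
    let mut duplicates = Vec::new();
    for &row in rows {
        // `insert` returns false when the key was already present.
        if !seen.insert(row) && reported.insert(row) {
            duplicates.push(row);
        }
    }
    duplicates
}
```

Any pair this returns would trip a UNIQUE constraint on `commitFile.hash, commitFile.fileID` like the one in the error above.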

…ide`)

It produces diffs in parallel, and despite wasting quite a bit of CPU
due to less-than-stellar object access performance for diffs, it still
manages to create a linux kernel database in ~21 minutes (M1 Pro).
@Byron
Author

Byron commented Feb 24, 2023

Thanks for your patience. I believe the underlying issue has been addressed, so this implementation now tracks renames as well. You probably want to validate the tool's output against the baseline, which is something I have never done. I'd expect it to be indistinguishable for the most part, but I would be very interested to see a data-diff in case there are differences, and what they look like in practice.
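A data-diff of this kind could be sketched as a set comparison, assuming the rows of one table from each database have already been loaded into memory. `data_diff` is a hypothetical helper, and how the rows are read from the two databases is left out.

```rust
use std::collections::BTreeSet;

/// Rows present in only one of the two inputs, returned as
/// (only_in_baseline, only_in_candidate), each in sorted order.
fn data_diff<T: Ord + Clone>(baseline: &[T], candidate: &[T]) -> (Vec<T>, Vec<T>) {
    let base: BTreeSet<&T> = baseline.iter().collect();
    let cand: BTreeSet<&T> = candidate.iter().collect();
    let only_base = base.difference(&cand).map(|r| (*r).clone()).collect();
    let only_cand = cand.difference(&base).map(|r| (*r).clone()).collect();
    (only_base, only_cand)
}
```

Two empty result vectors would mean the generators agree on that table; anything else points directly at the rows to inspect, for example renames resolved differently.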

@jmforsythe
Copy link
Owner

I'll try and write some tests soon to validate your generator.

@jmforsythe jmforsythe force-pushed the master branch 2 times, most recently from ff96bf3 to 6622df0 Compare July 20, 2023 01:17