
Fastest way to get the latest commit? #240

Closed
ercpe opened this issue Jan 19, 2015 · 6 comments

@ercpe

ercpe commented Jan 19, 2015

I'm trying to get the latest commit for each item in a list of Tree or Blob objects (something like a repo browser). Currently I'm doing something like:

import git

repo = git.Repo("test.git")
tree = repo.tree()
for obj in tree:
    print(obj, obj.path, next(repo.iter_commits(paths=obj.path, max_count=1)))

but this is incredibly slow. Another solution is to use repo.iter_commits() to fill a dict mapping path to commit:

latest_commits = {}

for commit in repo.iter_commits(paths=paths):
    for f in commit.stats.files.keys():
        p = f[:f.index('/')] if '/' in f else f
        if p in latest_commits:
            continue

        latest_commits[p] = commit

However, in the worst case this one iterates over all commits in the entire repository (which is obviously a very bad idea).
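The top-level bucketing expression in that loop can be pulled into a small helper, which makes the intent easier to see (a sketch; `top_level` is a name I made up, not part of GitPython):

```python
def top_level(path):
    """Return the first component of a relative path.

    Equivalent to the inline expression
    f[:f.index('/')] if '/' in f else f
    e.g. 'arch/x86/kvm/x.c' -> 'arch', 'Makefile' -> 'Makefile'.
    """
    head, _sep, _tail = path.partition('/')
    return head
```

`str.partition` returns the whole string as the head when the separator is absent, so the single-file case needs no special-casing.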

Is there a faster solution?

@Byron Byron added the Q&A label Jan 19, 2015
@Byron Byron added this to the v0.3.6 - Features milestone Jan 19, 2015
@Byron
Member

Byron commented Jan 19, 2015

I don't think so. My general advice is to use a GitCmdObjectDB in your Repo instance - this might already improve performance.

git.Repo(path, odbt=git.GitCmdObjectDB)

The operation you are trying to perform is inherently expensive, and I believe there is no better way if caches cannot be used. The second approach seems best, as you use the git command to apply the path filter and then Python to find the actual commit per path. However, commit.stats is implemented in pure python, which could be a bottleneck.
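One way to avoid per-commit stats calls entirely would be a single `git log --name-only --format=%H` invocation, parsing its output in Python. A sketch of the parsing side, fed a literal sample here (`latest_commit_per_dir` is an invented name, and the exact blank-line layout of `git log` output is an assumption; a real file whose name is 40 hex characters would also confuse this naive boundary check):

```python
import re

# A commit boundary: a line that is exactly 40 hex characters (an SHA-1).
HASH_RE = re.compile(r'^[0-9a-f]{40}$')

def latest_commit_per_dir(log_output):
    """Map each top-level path component to the first (newest) commit
    hash that touched it, given `git log --name-only --format=%H` text."""
    latest = {}
    current = None
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if HASH_RE.match(line):
            current = line          # entering a new commit's file list
        elif current is not None:
            top = line.split('/', 1)[0]
            latest.setdefault(top, current)  # keep only the newest hit
    return latest
```

Since `git log` emits commits newest-first, `setdefault` keeps the first commit seen per directory, and the whole repository costs one subprocess instead of one per commit.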

Also, I believe you should profile your application to see where the time is actually spent; maybe more ideas will arise from that.

I'd be interested to learn about your findings - please feel free to post them here.
Please also note that I will close this issue when 0.3.6 is due for release.

@ercpe
Author

ercpe commented Jan 20, 2015

Thanks for your reply. Here are some findings from my tests.
The repository is a bare clone of Torvalds' Linux sources - the biggest git repo I'm aware of.

test1.py:

import git

repo = git.Repo("~/repos/linux.git")
tree = repo.tree()
for obj in tree:
    print(obj, obj.path, next(repo.iter_commits(paths=obj.path, max_count=1)))

test2.py:

import git

repo = git.Repo("~/repos/linux.git", odbt=git.GitCmdObjectDB)
tree = repo.tree()
for obj in tree:
    print(obj, obj.path, next(repo.iter_commits(paths=obj.path, max_count=1)))

test1.py and test2.py are repeatably pretty close - ranging from 1.5 to 2.1 sec. On the first run, test2 was much faster (fs cache effect?).

test3.py:

import git

repo = git.Repo("~/repos/linux.git")
tree = repo.tree()
paths = [obj.path for obj in tree]

latest_commits = {}

for commit in repo.iter_commits(paths=paths):
    for f in commit.stats.files.keys():
        p = f[:f.index('/')] if '/' in f else f
        if p in latest_commits:
            continue

        print("adding %s for %s (was: %s)" % (commit, p, f))
        latest_commits[p] = commit

    if len(latest_commits) == len(paths):
        break

test3.py completes in just over a minute and looks at over 6000 commits. commit.stats.files.keys() can contain something like {virt => arch/x86}/kvm/ioapic.h when a rename happens, so my test code may be at fault. However, even before the loop hits that commit, it has already exceeded the runtime of test1/test2.
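Resolving that rename notation to the post-rename path is doable in a few lines. A sketch, assuming git's two rename spellings (`{old => new}` embedded in a path, and a bare `old => new`); `new_path` is an invented helper name:

```python
import re

# Matches the embedded rename form, e.g. '{virt => arch/x86}' in
# '{virt => arch/x86}/kvm/ioapic.h'.
RENAME_RE = re.compile(r'\{(.*) => (.*)\}')

def new_path(stat_path):
    """Resolve git's rename notation to the post-rename path.

    '{virt => arch/x86}/kvm/ioapic.h' -> 'arch/x86/kvm/ioapic.h'
    'old.c => new.c'                  -> 'new.c'
    'Makefile'                        -> 'Makefile'
    """
    if '{' in stat_path:
        # Substitute the braces with the new component; an empty new
        # component can leave a doubled or leading slash, so clean up.
        resolved = RENAME_RE.sub(lambda m: m.group(2), stat_path)
        return resolved.replace('//', '/').lstrip('/')
    if ' => ' in stat_path:
        return stat_path.split(' => ')[-1]
    return stat_path
```

With this, the keys from commit.stats.files could be normalized before the top-level lookup, so renamed files land in the right bucket.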

Here is a call graph of test3:

[call graph image: test3]
If i read it correctly, git.cmd.Git.execute is called for every commit.

test4.py is the same as test3.py but with odbt=git.GitCmdObjectDB. I haven't managed to get it to finish. The process is probably still looking for revisions...

@Byron
Member

Byron commented Jan 20, 2015

Thanks for sharing your results! I was quite surprised to see

  • test1 and test2 finish that quickly
  • a call to git is made when commit.stats is used - for some reason I thought this was pure Python (and it's probably a good thing that it isn't ;))

It would of course be even more interesting to see how fast pygit2 (bindings to libgit2) can be - apparently it has the required stats() function as well.

@ercpe
Author

ercpe commented Jan 21, 2015

I've redone the tests with test1 and test2 - test2 is on average only slightly faster (~500 ms).

I have looked at pygit2, which feels faster but has a much worse interface. Unfortunately, git_diff_get_stats isn't implemented yet (libgit2/pygit2#406) :/

@Byron
Member

Byron commented Jan 21, 2015

Ah, I was confused for a moment, but finally understood that pygit2 just doesn't bind to it yet. Interestingly, they implement these bindings manually! It's amazing that people still do this nowadays, as I believed binding generators were the standard way to approach this issue.

Maybe with a little bit of luck, they will get to it soon so you can give that one a shot.

@Byron Byron closed this as completed Jan 22, 2015
@graingert
Contributor

@ercpe looks like it's implemented now
