
Make ChangeSet more full-featured #100

Open
davisjam wants to merge 11 commits into base: master

Conversation

@davisjam (Contributor) commented Oct 19, 2017

A ChangeSet now includes a message as well as author/committer objects
modeled on JGit's PersonIdent.

This is a breaking change for anyone who uses ChangeSet's time field.
The payoff is that on complex git-based repositories, users can now better handle the distributed nature of git.

The time-based CommitRanges provided by RepoDriller now use the committer time when doing time comparisons.
This is what the user generally intends: "the time at which this commit entered the repository".
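
For a concrete picture, the new ChangeSet carries roughly this shape (a sketch; class and field names approximate the PR's code rather than quoting it):

import java.util.Calendar;

/* Sketch only -- approximates the PR's classes, not the exact API. */
class CommitPerson {               // modeled on JGit's PersonIdent
    public String name;
    public String email;
    public Calendar time;
}

class ChangeSet {
    public String id;              // commit hash
    public String message;         // full commit message
    public CommitPerson author;    // who wrote the change
    public CommitPerson committer; // who entered it into the repository
}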

Addresses #96.

@davisjam (Contributor, Author) left a comment

I made this review

}

@Override
public int hashCode() {
@davisjam (Contributor, Author):

Should I just remove this?

* A POJO to track details about a person involved with a commit.
*/
public class CommitPerson {
public String name;
@davisjam (Contributor, Author):

I just access these fields directly. Do you want me to use getters instead or to make these fields final?
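
For reference, the final-fields option would look something like this (a sketch only; the email and time fields are assumed from the PersonIdent model):

import java.util.Calendar;

// Immutable variant: fields stay public for direct access but cannot be reassigned.
public class CommitPerson {
    public final String name;
    public final String email;
    public final Calendar time;

    public CommitPerson(String name, String email, Calendar time) {
        this.name = name;
        this.email = email;
        this.time = time;
    }
}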

@mauricioaniche (Owner) left a comment

I have two concerns about this:

  1. The idea of the ChangeSet class was for it to be a lightweight way to get all commit hashes that need to be investigated later. I think we can definitely add the two dates Git stores. However, in this PR the ChangeSet contains the message and authors, which can be a burden at that moment, when repodriller just needs the commit IDs.

  2. I need to carefully review it, but I suggest that we keep the behaviour of existing filters and add a new one that looks at the committer date, as you need.

ChangeSets are only used internally, so I don't see a problem with changing it.

Thoughts?

@davisjam (Contributor, Author) commented Oct 19, 2017

  1. "The idea of the ChangeSet class was for it to be a lightweight way to get all commit hashes that need to be investigated later. I think we can definitely add the two dates Git stores. However, In this PR, the ChangeSet contains the message and authors, which can be a burden at that moment where repodriller just needs the commit IDs."

Since a ChangeSet includes a time, and some CommitRanges (e.g. MonthlyCommits) explicitly rely on this field, I think a ChangeSet is more accurately described as "as much commit metadata as we can extract cheaply, for easy filtering".

Storage cost: The new ChangeSet is admittedly slightly larger, but it's still a relatively small class composed of Strings and Calendars. I don't think the addition of a Calendar and some Strings will noticeably inflate the overhead, especially since we only load the ChangeSets for one SCM at a time.

Computation cost: Obtaining the message and authors uses the same JGit/SVN objects that were used to obtain the commit IDs; we just extract more information from them. So there's not really any extra computational cost.

I don't see any downside to having more information in the ChangeSet. The upside is that a CommitRange can now make better decisions. The current example is that the committer time, not the author time, is the time we probably want.

  1. "I need to carefully review it, but I suggest that we keep the behaviour of existing filters, and we add a new one that now looks to the committer date, as you need."

This has two parts:

  1. I think the existing RepoDriller-provided MonthlyCommits filter is not doing what it is supposed to in more complicated repositories, as described in #96 (MonthlyCommits is unsafe). This PR makes the behavior what the user actually expects.

  2. Breaking existing user-created filters that use timestamps is unavoidable here, because this PR introduces a second definition of time. Do you think this will significantly inconvenience the userbase? Is this a popular feature? Are there enough long-term users that accommodating their needs matters?

Another way to phrase this second part is "If we had to choose between making an improvement and maintaining backwards compatibility, which do we pick?". My opinion is that RepoDriller doesn't have a large-enough user base to hold back on improvements just to avoid breaking users. Making a breaking change without a good reason is silly, of course, but I think this is a case where the upsides justify it. A project like BOA has to consider backwards compatibility, but I don't think RepoDriller is there yet. Interested in your thoughts.

@mauricioaniche (Owner)

I am sure commit messages can be quite large. As we don't have a filter based on commit messages, I suggest we do not store them in ChangeSet.

The current MonthlyCommits gives commits according to one of the two dates Git saves. It's not wrong; it just does not cover all the cases. My suggestion is, instead of breaking the existing behavior of MonthlyCommits, to either 1) add a CommitterMonthlyCommits or 2) create a constructor in MonthlyCommits that receives a flag indicating which date to use, with the existing behavior as the default. We add nice documentation about it in the README.

No, we do not have hundreds of users using it, but I and the PhD students here have a lot of studies built on it. Changing it would require us to change a lot of our existing code base (which we will not do, as we'll forget about this breaking change two days after the PR is merged, hehe).

@davisjam (Contributor, Author)

  1. "As we don't have filter based on commit messages, I suggest we do not store them in ChangeSet."
    I've seen some research studies that do filter based on commit messages. Is this something we might want in the future?

  2. I like the idea of controlling which date the MonthlyCommits filter uses. There's a between-dates filter that will be similarly affected. I'll try something along these lines (sketched below).
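
A rough sketch of the constructor-flag idea (the enum name and wiring are my assumptions, not final code):

// Let the caller pick which of Git's two dates the filter compares.
enum CommitDate { AUTHOR, COMMITTER }

class MonthlyCommits /* implements CommitRange */ {
    private final int months;
    private final CommitDate dateToUse;

    MonthlyCommits(int months) {
        this(months, CommitDate.AUTHOR); // default preserves existing behavior
    }

    MonthlyCommits(int months, CommitDate dateToUse) {
        this.months = months;
        this.dateToUse = dateToUse;
    }

    // get(SCM) would then compare each ChangeSet's author or committer
    // time against the cutoff, depending on dateToUse.
}

The between-dates filter could take the same flag.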

Address mauricioaniche's comments.

Note: We are defaulting to "probably not the value the user wants".
Is this a good idea?
@davisjam (Contributor, Author)

4b88f68 addresses the second suggestion.

I've kept the commit messages for now. I don't think it makes sense to worry about the storage overhead of messages -- do you have an example repository with large commit messages I could evaluate?

Mark this deprecated.

Existing analyses should migrate to getAuthor().time or getCommitter().time.
I suspect everyone should use getCommitter().time.
@davisjam (Contributor, Author)

6ba0cf8 means that existing ChangeSet filters won't break, though I've marked ChangeSet.getTime deprecated.

@mauricioaniche (Owner)

If one needs to do something (like filtering) with the commit message, then it can be done at the CommitVisitor, e.g.

if (commit.getMsg().contains("bug")) return;
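
Spelled out as a complete visitor, that might look like this (a sketch; I'm assuming the usual CommitVisitor signature taking SCMRepository, Commit, and PersistenceMechanism):

public class SkipBugFixesVisitor implements CommitVisitor {
    @Override
    public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
        if (commit.getMsg().contains("bug")) return; // message-based filtering happens here
        writer.write(commit.getHash(), commit.getMsg());
    }

    @Override
    public String name() {
        return "skip-bug-fixes";
    }
}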

I still do not see a reason to make ChangeSet almost the same as Commit. It seems like a lot of duplicated data for no big reason. Maybe the whole Filter part of the API should change and make use of Commit instead of ChangeSet.

Let me reflect on this.

@davisjam (Contributor, Author) commented Oct 21, 2017

It seems like the question is how much it costs to extract a full Commit, and what the user might want to filter on.

My guess is that:

  1. All of the metadata (author, message, time, etc.) is "cheap" (relatively small and predictable in size).
  2. Some diffs are small (one-liners), but diffs can be huge. We shouldn't load these if we can avoid it.
  3. The list of files touched is metadata and is a reasonable thing to include in the Filter (e.g. "files that end in .java"). This list can be long but I would expect it to be expensive only in rare cases (e.g. directory re-arrangements).

I don't have any data to back up my intuition, but we could certainly take some measurements -- say, on the Linux kernel, Ruby, NodeJS, apache, whatever @ayaankazerouni is working on, and so on.

I think a user might want to filter on any of the metadata. "Commits from author X", "Commits between times A and B", "Commits that touch .java files", "Commits whose checkin messages contain profanity", and so on.

While a user could filter on data ("Commits that modified class X or introduced method Y"), I think this is a rare enough case that the user can apply their own filter while processing the Commit.

Since my intuition is that "metadata is cheap, data is expensive", I suggest that the ChangeSet have all of the metadata we can get ahold of, but not the data. Basically this is everything printed by git log --stat.

@mauricioaniche (Owner)

Two more cents on this:

  1. Maybe we should once and for all remove the redundancy between CommitRange and Filter. They are meant for the same thing: filtering. However, one looks at the ChangeSet (and thus the filtering happens before capturing all the Modifications) and the other looks at the Commit (therefore, after creating the 'expensive' Commit object). I am still not sure if we should keep both or simplify it once and for all. At the beginning, they were different just to save the trouble of loading the Commit object. But now, I do not know if it's worth it.

  2. We can keep cheap things in ChangeSet. I also do not have data to back up my opinion. It will definitely increase memory usage, as we'd have two objects holding the same data, unless Commit starts to internally store the ChangeSet so that we save the commit message just once.

What do you think?

@davisjam (Contributor, Author)

If you like my "metadata vs. data" proposal, then I think it makes sense to refactor Commit to contain a ChangeSet and a List<Modification>. The data duplicated between ChangeSet and Commit would then be only the list of affected files, which should be small.

If we drop ChangeSet filters and only offer Commit filters, we will at some point have loaded every commit into a Commit. This seems undesirably expensive.

To avoid having to process the entire repository through Commits at some point during the repository visit, I think it makes sense to load the metadata from the git repository into ChangeSets and let the user apply filters to them.

Then we can walk through the surviving ChangeSets and build Commits one-at-a-time.

As a future possibility, we could add a nextChangeSet() and nextCommit() API to SCM. Then we could stream ChangeSets/Commits instead of loading all of the {ChangeSets, Commits} into memory at once.
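
A hypothetical shape for that API (only the method names come from this comment; the Optional-based protocol is an assumption):

import java.util.Optional;

// Hypothetical streaming additions to SCM -- not part of this PR.
interface StreamingSCM {
    Optional<ChangeSet> nextChangeSet(); // empty once the history is exhausted
    Optional<Commit> nextCommit();       // builds the expensive Commit lazily
}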

Since we already load all of the (minimal) ChangeSets into memory and no one has complained, the easiest approach is to expand the ChangeSets and see if it breaks anything. But that nextChangeSet/nextCommit thing would be cool.

@mauricioaniche (Owner)

That's how it works right now, right? We get a list of cheap ChangeSets and we filter them. Then, for each ChangeSet, we build an "expensive" Commit and give it to the Visitor. Commits are built one by one, not in a list.

We could definitely offer a Stream API to users, but that can be in another PR.

I think for this one, I'd just inject the ChangeSet inside the Commit object, so that the commit message is only stored once in memory. I'd also love to see some memory measurements in any repository (e.g. Rails, Linux). Hopefully, storing all ChangeSets upfront will not blow things up.

Just to give you a concrete example of how I use it and why memory is important to me: I often run my experiments on cheap Digital Ocean machines with 200 MB of RAM available to me. I'd love to keep this 200 MB as our goal.

@davisjam (Contributor, Author)

Right, at the moment we load all the ChangeSets at once, filter, and then one-at-a-time build Commits. But the total cost of the Commits is equal to the overall size of the repository if there is no filtering -- the goal of the filters is partly for the user's convenience and partly to make sure we only create Commits when we have to.

I'll add a commit moving ChangeSet inside Commit to reduce duplication.

Addresses concerns on mauricioaniche#100.

This version still maintains two copies of a ChangeSet, one in
the Commit and one in the list of ChangeSets that survived the filter.

This is because the SCM interface doesn't have a way to pass in
(and save a reference to) a ChangeSet; just the id is passed at the moment.

Per comments on mauricioaniche#100, don't duplicate the ChangeSet when creating a Commit.
This is an optimization.

Add an SCM.getCommit(ChangeSet cs) API.
Refactor RepositoryMining to use this API.
Deprecate the SCM.getCommit(String id) API.
@davisjam (Contributor, Author)

@mauricioaniche, commits 91f3bee and 78537b3 refactor Commit to use a ChangeSet, and add a new SCM API getCommit(ChangeSet) to avoid duplicating the ChangeSet when creating a Commit from a ChangeSet.

@mauricioaniche (Owner)

@davisjam, yes, but we do not store a list of Commits in memory, right? This means that after a Commit is processed, the object is ready to be GC'ed. ChangeSets, unfortunately, stay in memory throughout the execution.

I'll review this commit today at some point!

@davisjam (Contributor, Author)

@mauricioaniche Right, my point is just that even if we don't run out of memory, we still don't want to pay the cost of loading every Commit if we can filter some out ahead of time.

It wouldn't be hard to pop and discard ChangeSets after we process them. That would reduce the amount of time we stay in the "high memory usage" zone, until a future PR streams the ChangeSets themselves.

@davisjam (Contributor, Author)

ee75977 fixes merge conflicts.

@mauricioaniche (Owner)

We are agreeing on not loading more Commits than necessary. I just want to test how much bigger our memory footprint will be once ChangeSets are bigger.

Review is still on my ToDo list. Sorry!

@mauricioaniche (Owner) left a comment

This PR sounds ok!

Before merging, I'd love to see a memory comparison. I'll do it as soon as I have time (I have a flight from SFO -> Chicago, on which I'll have some hours).

I think a simple CommitVisitor that tracks the free memory is enough for us to observe the differences.

Runtime runtime = Runtime.getRuntime();
long memory = runtime.totalMemory() - runtime.freeMemory();
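
Wrapped into a throwaway visitor, that could look like this (a sketch; the peak tracking and write call are my additions):

public class MemoryFootprintVisitor implements CommitVisitor {
    private long peakBytes = 0;

    @Override
    public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
        Runtime runtime = Runtime.getRuntime();
        long used = runtime.totalMemory() - runtime.freeMemory();
        peakBytes = Math.max(peakBytes, used);
        writer.write(commit.getHash(), used); // one sample per visited commit
    }

    @Override
    public String name() {
        return "memory-footprint";
    }
}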

Nevertheless, a good optimization would be to remove each ChangeSet from the list after it is used, and let the GC reclaim it.


private Set<String> parentIds; /* Parent(s) of this commit. Only "merge commits" have more than one parent. Does the root commit have 0? */

private Set<String> branches; /* Set of branches that contain this commit, i.e. the branches for which this commit is an ancestor of the most recent commit in that branch. */
@mauricioaniche (Owner):

Yeah, this will definitely increase the memory footprint. Maybe we should indeed discard ChangeSets after they are used.

@davisjam (Contributor, Author):

Discarding ChangeSets after they are used will definitely shorten the time we spend at maximum footprint (the ChangeSet footprint shrinks as the expected footprint from CommitVisitors rises). I'll look into adding this; it should just be a few lines of code (sketched below).
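
Something along these lines inside RepositoryMining.processRepo (a sketch; pending, changeSets, and processChangeSet are illustrative names, not the real ones):

// Drain a java.util.ArrayDeque so each ChangeSet becomes
// garbage-collectable as soon as it has been processed.
Deque<ChangeSet> pending = new ArrayDeque<>(changeSets);
while (!pending.isEmpty()) {
    ChangeSet cs = pending.pollFirst(); // drop our reference as we go
    processChangeSet(cs);               // build the Commit, run the visitors
}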

@davisjam (Contributor, Author):

See 1b7affc.

}

@Override
public int hashCode() {
@mauricioaniche (Owner):

Is this the one that IntelliJ generates automatically?

@davisjam (Contributor, Author):

No. I didn't realize an IDE could do that.

String msg = r.getFullMessage().trim();
CommitContributor author = extractCommitContributor(r.getAuthorIdent());
CommitContributor committer = extractCommitContributor(r.getCommitterIdent());
List<RevCommit> parents = Arrays.asList(r.getParents());
@mauricioaniche (Owner):

This is the point where I believe we can connect to #104. We could just load what the user really needs in her particular study.

@davisjam (Contributor, Author):

Agreed. I don't think we should do that in this PR though.

Problem:
We currently construct all ChangeSets in memory and then process each one.
We maintain pointers to all of the ChangeSets during this process, so the
overall memory footprint only increases due to the CommitVisitors.
See RepositoryMining.processRepo.

The "full-featured ChangeSet" PR increases the size of each ChangeSet,
and there is concern that this may lead to OOM issues on small machines.

Optimization:
Discard the pointer to each ChangeSet after we process it.
If a CommitVisitor does not save the Commits it visits, then
a ChangeSet can be GC'd after a worker processes it.

Longer-term suggestions:
1. If we stream ChangeSets instead of constructing them up front, this
   concern goes away entirely.
2. If we make the amount of data stored in a ChangeSet tunable, the
   old ChangeSet memory footprint can still be achieved.
@mauricioaniche (Owner)

Awesome. This PR is done! I'll just check it out and run some experiments soon!

@davisjam (Contributor, Author)

Hey @mauricioaniche, had any time for experiments on this?

@davisjam mentioned this pull request Nov 17, 2017
@mauricioaniche (Owner)

@davisjam Ah, I'll do the charts etc. Just need Travis to run there! :)

@davisjam (Contributor, Author)

Paper deadline on Dec 1; I should be able to look at this around Dec 5.

@mauricioaniche (Owner)

Sure! Good luck with your submission!

@mauricioaniche (Owner)

@davisjam See the latest build on master and the build for this one.

On master, it takes a couple of seconds to start processing the repo:

21:46:17.737 [main] INFO  org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails
21:46:19.369 [main] INFO  org.repodriller.RepositoryMining - 1001 ChangeSets to process

In this PR, I've been waiting for 7 minutes now, and it doesn't even get past the first line:

21:52:31.963 [main] INFO  org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails

I suppose that collecting all the information beforehand is just too much... Thoughts?

@davisjam (Contributor, Author) commented Dec 16, 2017 via email

@mauricioaniche (Owner)

A lot has changed since this (old) PR. Merging it will be challenging. What do you wanna do, @davisjam? Shall we just close it?

@davisjam (Contributor, Author)

I'm a bit torn. I think the additional fields are useful; I just don't need them at the moment. Open to thoughts from other RepoDriller users on whether reviving this PR is of interest.
