Investigate synchronization of repos into the VMR (git-wise) #10257

Closed
Tracked by #9359
premun opened this issue Aug 3, 2022 · 1 comment

premun commented Aug 3, 2022

Context

Based on the design document, we will need to synchronize changes from individual repos into a given path in the VMR. We also have further requirements on how the files are synchronized. There are several ways to approach this in git (submodule, subtree, subrepo, custom, ...) and we should investigate them.

Additional requirements

  • Synchronization should adhere to the file cloaking rules (described in the doc)
  • Commit messages in VMR should clearly link to the original commits
  • We should keep things as simple and git-native as possible
  • We should think about conflicts: how they will manifest and how we will deal with them

Goals

  • Find and compare various approaches for repo synchronization
  • Create some simple CLI and try to synchronize commits from various repositories to verify the process
  • Document the findings
@premun premun changed the title Figure out the synchronization of repos into the VMR (git-wise) Investigate synchronization of repos into the VMR (git-wise) Aug 3, 2022
@premun premun self-assigned this Aug 3, 2022

premun commented Aug 4, 2022

I have investigated various ways we could build the mono repo. This is a report of what I went through.

Submodules ℹ️

How it works
Keeps a reference (a pointer) to the other repo, which most git operations take into account.

Evaluation

This one already gets a bad rap online, but the reasons we care about, and why we will not go this way, are:

  • A submodule breaks if the referenced repo goes away (the same goes for a commit inside the referenced repo)
  • End users have to know about submodules and manage them; it's better to move this burden onto the repo owner / infra team
  • Some git commands account for submodules and some don't, and this differs a lot across git versions
  • Doesn't support cloaking

Subtree ℹ️

How it works
Synchronizes the whole git branch as-is into the mono repo (creating a multi-root graph) and merges it into the main branch.

Evaluation

  • Introduces a very messy and complex history which would not work for a repo of our size
  • Apparently the support in git is buggy and can lead to problems in the repo (Stack Overflow is full of examples)
  • Doesn't support cloaking

Subrepo ℹ️

How it works
A 3rd party bash script solution that mixes subtree ideas with vanilla git. Keeps info in .gitrepo files and pulls changes based on that.

Evaluation

  • Gets the most "good press" online, with some articles mentioning that it's superior to subtree and submodules
  • Doesn't support customization or cloaking

There is a comprehensive comparison of the benefits and shortcomings of these approaches here. Judging by those points, a custom solution seems to be the best way to go, and subrepo is not the right fit either. It is, however, very similar to what we already have with the source-mappings.json configuration file.

Custom solution

I went further and investigated whether we can achieve what we want with vanilla git and keep it simple. It was important that:

  • Cloaking is possible (inclusion/exclusion of files via path filters)
  • Cloaking combined with new changes doesn't break in cases such as a file being moved into or out of an exclusion zone
  • Synchronizing binaries, CRLF and other whitespace changes works
  • We are able to customize the commit messages in the VMR
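To illustrate the last requirement, here is a minimal sketch of a helper that produces a commit message pointing back at the source commits. The function name `vmr_commit_message` and the message layout are assumptions made for illustration only; the issue does not prescribe a format.

```shell
# Hypothetical helper (not part of the PoC): build a VMR commit message
# that links back to the original repository and commit range.
# Usage: vmr_commit_message <repo-url> <from-sha> <to-sha>
vmr_commit_message() {
  local repo_url="$1" from="$2" to="$3"
  # ${repo_url##*/} strips everything up to the last slash,
  # leaving the short repository name for the subject line.
  printf '[sync] %s\n\nSource: %s\nFrom: %s\nTo: %s\n' \
    "${repo_url##*/}" "$repo_url" "$from" "$to"
}

# Possible use once the patch has been applied to the index:
#   git commit -m "$(vmr_commit_message https://github.com/dotnet/runtime abc123 def456)"
```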

The investigations were done as part of #9604. The result is that we can combine git diff and git apply commands to get what we want, which seems superior to all the other options.

The result was a PoC including:

Using this, I ran a month-long synchronization test to see if everything works well.
Additionally, I assembled a repo with a 20k-commit history to see how git performs. I didn't observe any serious degradation (compared to dotnet/runtime, for instance), but I didn't spend more time on performance. There are many ways we can optimize once the need arises.

The commands we are going with are surprisingly simple. The patch is created with:

git diff --patch --binary --output .patch [SHA1]..[SHA2] -- ':(glob)included/**/files' ':(exclude,glob)**/excluded-files/*.*'
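The pathspecs at the end of that command are what implement cloaking. As a rough sketch, the include and exclude rules could be turned into pathspecs by a small helper like the one below. The name `build_pathspecs` and the `--` separator convention are assumptions for illustration; only the `:(glob)` and `:(exclude,glob)` pathspec magic comes from the command above.

```shell
# Hypothetical helper: print one git pathspec per line.
# Patterns before "--" become includes, patterns after it become excludes.
# Usage: build_pathspecs <include-glob>... -- <exclude-glob>...
build_pathspecs() {
  local magic='glob' pattern
  for pattern in "$@"; do
    if [ "$pattern" = "--" ]; then
      # Everything after the separator is an exclusion rule.
      magic='exclude,glob'
      continue
    fi
    printf ':(%s)%s\n' "$magic" "$pattern"
  done
}

# Possible use in bash (patterns must be quoted so the shell does not expand them):
#   mapfile -t specs < <(build_pathspecs 'src/**' -- '**/*.dll')
#   git diff --patch --binary --output .patch "$sha1..$sha2" -- "${specs[@]}"
```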

The patch is then applied:

git config apply.ignoreWhitespace change
git apply --cached --ignore-space-change --directory src/repo .patch
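Put together, the two steps could be wrapped roughly as follows. This is a sketch under assumptions, not the PoC itself: the function name `sync_range`, its argument order, the patch location and the `GIT` override (set `GIT=echo` to print the commands instead of running them) are all hypothetical.

```shell
# Hypothetical wrapper around the diff and apply commands shown above.
# Usage: sync_range <source-repo> <vmr-repo> <target-dir> <from-sha> <to-sha> [pathspec...]
GIT="${GIT:-git}"   # assumption: GIT=echo gives a dry run that only prints the commands

sync_range() {
  local source_repo="$1" vmr_repo="$2" target_dir="$3" from="$4" to="$5"
  shift 5
  local patch="${TMPDIR:-/tmp}/vmr-sync.patch"

  # Create a binary-safe patch of the commit range, limited to the given pathspecs.
  "$GIT" -C "$source_repo" diff --patch --binary --output "$patch" \
    "$from..$to" -- "$@" || return 1

  # Apply the patch to the VMR index, re-rooted under the target directory.
  # Note: git apply reports an error when the patch is empty.
  "$GIT" -C "$vmr_repo" apply --cached --ignore-space-change \
    --directory "$target_dir" "$patch" || return 1
}
```

Because the patch is applied with --cached, the changes land only in the index of the VMR, so the commit that follows can carry whatever message is wanted.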

It took some testing to figure out the right combination of all the different ways you can create and apply a patch, but this simple solution seems to work great. Even patches that are a few GB in size work without a problem.
