Large repo size #1403

Closed
mpizenberg opened this issue Feb 25, 2023 · 4 comments

@mpizenberg
Contributor

Hi, I just cloned the repo to try a few of the examples and noticed that it took quite a while. The current size is around 75MB, which is quite big for a git repo. Apart from being slightly annoying when cloning, this can also affect CI runtimes and costs, so I tried to understand where most of the size comes from.

When using the script from this gist, I get the following result:

4b74c766021a788ff5fc9b776302e0e72505b510 17248915  rerun_py/rerun_sdk/rerun_demo/demo.rrd
fab9f5ef89169d244a780932ef08abe1832a70ea 6849280  rerun_py/rerun_sdk/rerun_demo/demo.rrd
a8d6073bc42011e8e0f847d1a301fa9864c14942 3011563  docs/rust/head/search-index.js
c4aef6c11b44d50459f49a0fffe0066a524c4476 3011012  docs/rust/head/search-index.js
6c09003676e6c1f25518451a376f324107ac71c9 3004677  docs/rust/head/search-index.js
1202a6ba12763620b9ce44ff7ddc5ffa116aa117 3004357  docs/rust/head/search-index.js
8c35345865f8b448b7cf2d5167d8109e63444ad9 2998732  docs/rust/head/search-index.js
...

Full file: file_sizes.txt
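
For readers without the gist at hand, a listing in the same `<sha> <size> <path>` shape can be produced roughly as follows. This is a minimal sketch and not the gist's script: it assumes `git` is on the PATH, and the function names `parse_batch_check` and `largest_blobs` are made up for illustration. It walks every object reachable from any ref and asks git for each object's type and size.

```python
import subprocess


def parse_batch_check(output: str):
    """Parse '<type> <sha> <size> <path>' lines; keep blobs, largest first."""
    blobs = []
    for line in output.splitlines():
        # Split at most 3 times so that paths containing spaces stay intact.
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees and anything malformed
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((parts[1], int(parts[2]), path))
    blobs.sort(key=lambda blob: blob[1], reverse=True)
    return blobs


def largest_blobs(repo: str = ".", top: int = 20):
    """Return the `top` largest blobs in the history of `repo` (hypothetical helper)."""
    # List every object reachable from any ref as '<sha> <path>'.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for type and size; %(rest) echoes the path back unchanged.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:top]
```

Calling `largest_blobs(".")` inside a clone returns `(sha, size, path)` tuples. The sizes are uncompressed blob sizes, so they overstate what each file costs in the packed clone.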

Browsing these results, it seems there are two main sources for this weight:

  1. the demo.rrd files
  2. the build artifacts for the website on gh-pages

Regarding (1), are these demo files worth keeping checked into the repo? If their main purpose is to make it easy to try the Python APIs, would it make sense to load them into the wheel instead? Is the .rrd file format going to stay stable? If not, the repo may keep ballooning down the line. Maybe it would make more sense to store these files, versioned, somewhere else where they can easily be fetched with curl or downloaded via a web button?

Regarding (2), it seems to me from looking at your CI actions that the branch does not contain the docs-building logic and only stores the build output. If that's the case, I can see two rather easy changes. Either publish on something like Netlify instead, which is free for static content, even with your own DNS settings. Or change the publishing setup so that gh-pages is an orphan branch whose single commit is overwritten on every deploy. That way, objects do not accumulate in the git history.
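
The second option could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's deploy script: the helper names, the commit message and the `site_dir` argument are hypothetical, and the committer identity (`user.name`, `user.email`) is assumed to be configured already. The idea is to create a throwaway repository inside the built site, make one root commit, and force-push it over the branch so the remote branch always holds exactly one commit.

```python
import subprocess


def orphan_publish_commands(remote_url, branch="gh-pages", message="Deploy docs"):
    """Build the git commands that replace `branch` with a single root commit."""
    return [
        ["git", "init"],                  # fresh repo, so the commit has no parent
        ["git", "add", "-A"],             # stage the whole built site
        ["git", "commit", "-m", message],
        # Force-push the new root commit over whatever the branch pointed to.
        ["git", "push", "--force", remote_url, f"HEAD:refs/heads/{branch}"],
    ]


def publish(site_dir, remote_url, branch="gh-pages"):
    """Run the commands inside the built site directory (hypothetical helper)."""
    for cmd in orphan_publish_commands(remote_url, branch):
        subprocess.run(cmd, cwd=site_dir, check=True)
```

Previous deploy commits become unreachable on the remote after each push, so new clones no longer fetch them. Existing clones keep the old objects until they are garbage-collected locally.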

I'm mentioning this repo size issue because this is the kind of thing that requires a git history rewrite to fix, and you usually prefer doing that early on rather than when the number of contributors really starts growing.

@mpizenberg mpizenberg added other Generated by the "Other" issue template 👀 needs triage This issue needs to be triaged by the Rerun team labels Feb 25, 2023
@emilk emilk added the 🧑‍💻 dev experience developer experience (excluding CI) label Feb 25, 2023
@emilk
Member

emilk commented Feb 25, 2023

Good catch! I agree this is bad, and that we should fix it asap.

@jleibs
Member

jleibs commented Feb 25, 2023

Ouch -- I can't believe that not only did I let that rrd file slip through in #1085, but I edited it and uploaded a modified version in #1301 🤦‍♂️

We produce these files in the CI precisely so that they don't have to be checked in, so there's zero reason for them to be in the repo. It will be annoying, but a history rewrite is probably worth wrangling. Might as well take advantage and see if there are other things we can purge while we're at it -- scanning the list, I see a depth_image.pgm and some camera.glb files that I suspect should be able to go as well.

Definitely need to set up a github workflow and some precommit hooks to default reject anything larger than a certain threshold. I'm shocked this isn't just a stock configurable behavior in github.

Regarding (2), this should have been obvious, but it is a pretty big flaw of the gh-pages branch deployment model. Thanks for pointing it out! In addition to collapsing commits, we might as well move the entire gh-pages deploy to the rerun-docs repository instead.

@emilk
Member

emilk commented Mar 2, 2023

Let's:

  • Add a GitHub action stopping us from merging large files (>100kB)
  • Make gh-pages an orphan branch with no history
  • Remove the demo.rrd from our history with a force push to main 😬
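
A minimal sketch of the first item, which could run from a GitHub action or a pre-commit hook. The function names and the `origin/main` default are hypothetical, and ">100kB" is read here as 100,000 bytes; none of this is the check that was eventually adopted.

```python
import os
import subprocess

LIMIT_BYTES = 100_000  # assumption: ">100kB" with kB = 1000 bytes


def find_large_files(paths, limit_bytes=LIMIT_BYTES):
    """Return (path, size) for every existing regular file above the limit."""
    offenders = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # deleted or renamed away; nothing to measure
        size = os.path.getsize(path)
        if size > limit_bytes:
            offenders.append((path, size))
    return offenders


def changed_files(base="origin/main"):
    """Paths added or modified between the merge base with `base` and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "-z", "--diff-filter=AM", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    # -z separates paths with NUL bytes, which is safe for unusual file names.
    return [path for path in out.split("\0") if path]
```

A CI step would call `find_large_files(changed_files())` and fail the job if the result is non-empty. Note that this only inspects the files as they exist at HEAD: a large file that is added and then deleted within the same branch still ends up in history unless the branch is squashed on merge.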

@jleibs jleibs self-assigned this Mar 3, 2023
@jleibs
Member

jleibs commented Mar 3, 2023

History has been rewritten as of 03-03-2023. Fresh clone is down to 22MB:

$ git clone git@github.com:rerun-io/rerun.git
Cloning into 'rerun'...
remote: Enumerating objects: 29923, done.
remote: Counting objects: 100% (1735/1735), done.
remote: Compressing objects: 100% (701/701), done.
remote: Total 29923 (delta 1122), reused 1591 (delta 1023), pack-reused 28188
Receiving objects: 100% (29923/29923), 22.78 MiB | 42.33 MiB/s, done.
Resolving deltas: 100% (22418/22418), done.

@jleibs jleibs closed this as completed Mar 3, 2023