Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion repo got 40MB larger #10422

Open
alamb opened this issue May 8, 2024 · 15 comments
Open

DataFusion repo got 40MB larger #10422

alamb opened this issue May 8, 2024 · 15 comments
Labels
bug Something isn't working development-process Related to development process of DataFusion

Comments

@alamb
Copy link
Contributor

alamb commented May 8, 2024

Describe the bug

We accidentally have checked in a 40MB binary docs.tgz file in #10407

#10416 removed the file but it is still in git history

Thus the DataFusion repo is significantly larger than it used to be

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@alamb alamb added bug Something isn't working development-process Related to development process of DataFusion labels May 8, 2024
@findepi
Copy link
Member

findepi commented Jul 15, 2024

That's unfortunate, but fixing this requires rewriting git history.
Is this issue about such a rewrite? or preventing similar mistake in the future?

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

That's unfortunate, but fixing this requires rewriting git history. Is this issue about such a rewrite? or preventing similar mistake in the future?

I suggest we use this particular issue for fixing the history -- I recommend a seprate ticket for ways to prevent a similar mistake in the future

@findepi
Copy link
Member

findepi commented Jul 16, 2024

if we want to fix this, this will require a force push to main branch and also doing something about tags that reference the commit that includes docs/docs.tgz file (39.0.0, 39.0.0-rc1, 40.0.0, 40.0.0-rc1).

Fixing main is simple (but painful)

i prepared a main branch copy with the docs archive removed (based on main as of today, commit 1331288), someone would need to push that into the main

git fetch https://github.com/findepi/datafusion findepi/main-without-docs
main_fixed=$(git rev-parse FETCH_HEAD)
current_main=133128840ca3dbea200dcfe84050cb7b82bf94a8 # commit on `main`
fork_point=$(git merge-base $main_fixed $current_main)

# verify
git show $main_fixed
git diff $current_main $main_fixed # same content

# verify history -- this should show `remove docs archive` commit omitted
git diff <(git log $fork_point..$current_main --format=%s) <(git log $fork_point..$main_fixed --format=%s)

# apply
git checkout main
git reset --hard $main_fixed

# apply commits from real current main
git fetch https://github.com/apache/datafusion main
git cherry-pick $current_main..FETCH_HEAD

# publish
git push https://github.com/apache/datafusion main:main --force-with-lease

once we do the above, then we need to delete (replace?) the release tags mentioned above

@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2024

I think the branch protections would need to be updated temporarily too -- I don't think we can force push to main

I also think if we force push to main all outstanding PRs will become quite messed up until after a rebase

THank you @findepi for preparing this

@andygrove / @comphead / @jayzhan211 do you have any thoughts on this matter?

@comphead
Copy link
Contributor

comphead commented Jul 16, 2024

I think it was a precommit github hook checking large files, please allow me some time to investigate it.

https://github.com/pre-commit/pre-commit-hooks?tab=readme-ov-file#check-added-large-files

@findepi
Copy link
Member

findepi commented Jul 16, 2024

I also think if we force push to main all outstanding PRs will become quite messed up until after a rebase

that's correct.

also, editing release tags is something to think about.
if we re-use tag names, existing forks/clones will choke on this.

i'd be tempted not to solve this problem.

precommit github hook checking large files

is this something we can configure in the repo itself?
cause we should not rely on git commit hooks -- a contributor may not set them up.
we could have a CI check for this (like https://github.com/ppremk/lfs-warning , https://github.com/james-callahan/gha-prevent-big-files or something similar, perhaps our own)

@comphead
Copy link
Contributor

Git action looks even more promising as you right, the precommit checks are not reliable as they running locally.
Imho I like the way https://github.com/james-callahan/gha-prevent-big-files it seems simpler for now and we have much control of it.
Would you like to experiment with this @findepi ?

@findepi
Copy link
Member

findepi commented Jul 17, 2024

@comphead sure
created #11508 with the check

@ozankabak
Copy link
Contributor

Talking about the size and "cleanliness" of the repo, should we do a git gc? I find conflicting information on the Internet about whether GH does this automatically. In my experience GH doesn't seem to do this.

@findepi
Copy link
Member

findepi commented Jul 18, 2024

@ozankabak good question. git gc can remove only unreachable objects. Files/content that is part of repo history cannot be deleted without rewriting the history, so it's not something git gc could help with.

@ozankabak
Copy link
Contributor

Yep, git gc and cleaning up history are different things, but I wanted bring this up as we make efforts into cleaning up the repo in general

@alamb
Copy link
Contributor Author

alamb commented Jul 18, 2024

Talking about the size and "cleanliness" of the repo, should we do a git gc? I find conflicting information on the Internet about whether GH does this automatically. In my experience GH doesn't seem to do this.

My understanding is that what comes from github is automatically gcd. Your local checkout also periodically builds up garbage (e.g. from local commits, etc) that you can clean up by locally running git gc

So I don't think there is anything we need to do with gc on github but I ma be mistaken

@ozankabak
Copy link
Contributor

I keep seeing notifications on commits with the notice "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.", which basically says that this is an unreachable commit/garbage. I think these commits accumulate because people fork the repo to try things, and when the fork repo is deleted, we get into this state. This prompts me to think GH doesn't do git gc, but I'm not sure either.

@findepi
Copy link
Member

findepi commented Jul 18, 2024

It's more complicated. It seems that all forks' objects are stored together on github backend.
If i make a commit and push it to my fork, you can read it from this repo if you know the commit id.
I assume github does git gc internally (or equivalent), but it doesn't prevent the "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository." situation.

@ozankabak
Copy link
Contributor

If i make a commit and push it to my fork, you can read it from this repo if you know the commit id.

Yes, there is actually no problem until the fork is deleted. When the fork is then deleted, it seems like the commit still stays on the "mother" repo as an unreachable commit and these things accumulate over time. Googling it suggests gc would solve this, but that may be wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working development-process Related to development process of DataFusion
Projects
None yet
Development

No branches or pull requests

4 participants