
Prevent bigger files from being checked in #11508

Merged
merged 1 commit into from
Jul 19, 2024

Conversation

findepi (Member) commented Jul 17, 2024

See #10422 for rationale.

github-actions bot added the development-process label (Related to development process of DataFusion) Jul 17, 2024
findepi (Member, Author) commented Jul 17, 2024

The 1 MB file size limit is arbitrary, but in line with GitHub's suggestions.
There is already one larger file in the repo (./docs/logos/Datafusion_Branding_Guideline.pdf).
This is the kind of file for which git-lfs could be used.
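For context, routing such binary assets through Git LFS usually comes down to a `.gitattributes` rule like the one below (a sketch of what `git lfs track "*.pdf"` would write, not something this PR adds):

```text
# .gitattributes — store PDF assets as Git LFS pointers instead of plain blobs
*.pdf filter=lfs diff=lfs merge=lfs -text
```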

I experimented with this check in the following test PRs.

The error reporting isn't ideal: the offending file name is NOT printed to the console; it's only visible on the Summary page of the workflow. Hopefully this can be improved in the check itself (I offered a patch at james-callahan/gha-prevent-big-files#3), or the code could be inlined. I tried inlining it, but using the existing action looks more aesthetically pleasing...

findepi (Member, Author) commented Jul 17, 2024

cc @comphead @alamb @andygrove

lewiszlw (Member) commented:

We can't use third-party GitHub Actions; see #11259 (comment).

findepi (Member, Author) commented Jul 17, 2024

@lewiszlw thanks for pointing this out.
There was an intent from ASF Infra to allow third-party actions that can be trusted (once reviewed), as long as you specify the exact commit hash as the reference.

I will inline the action code in the check.

findepi (Member, Author) commented Jul 17, 2024

@lewiszlw inlined the action. Thanks again for pointing this out.

This is what the example error looks like:
https://github.com/findepi/datafusion/actions/runs/9972548232/job/27556028696?pr=1

(screenshot: the error annotation as rendered on the workflow Summary page)

steps:
  - uses: actions/checkout@v4
    with:
      fetch-depth: 0
A contributor commented:

Does this mean fetching all commits in the PR?

findepi (Member, Author) replied:

This means a non-shallow clone.
It is necessary for the references ${{ github.event.pull_request.base.sha }} and ${{ github.event.pull_request.head.sha }} to work.
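To illustrate why the full history matters, here is a hedged sketch (not this PR's exact script): once both SHAs resolve locally, `git rev-list --objects` can enumerate exactly the objects a PR introduces. A throwaway repo below stands in for a `fetch-depth: 0` checkout, and `BASE_SHA`/`HEAD_SHA` stand in for the pull_request event SHAs.

```shell
# Sketch: enumerate the Git objects a PR introduces, given base and head SHAs.
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "base"
BASE_SHA="$(git rev-parse HEAD)"
echo "hello" > small.txt
git add small.txt
git -c user.email=ci@example.com -c user.name=ci commit -q -m "add small.txt"
HEAD_SHA="$(git rev-parse HEAD)"

# Objects reachable from head but not from base: commits, trees, and blobs.
# Blob (file) entries carry their path in the second column.
git rev-list --objects "${BASE_SHA}..${HEAD_SHA}"
```

In a shallow clone, `${BASE_SHA}` may simply not exist locally, which is why the checkout step above disables the depth limit.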

- name: Check size of new Git objects
  env:
    # 1 MB ought to be enough for anybody.
    # TODO in case we want to consciously commit a bigger file to the repo without using Git LFS, we may disable the check, e.g. with a label
A contributor commented:

I think if this is a separate workflow, then committers can enable/disable the workflow check at the project level: https://github.com/apache/datafusion/actions/workflows

findepi (Member, Author) commented Jul 17, 2024

I don't know what branch protections are configured for this repo. Committers may also ignore such an error. Ignoring workflow errors is definitely a bad habit, but I don't think we risk creating any habit here. Unless things work incorrectly, false positives should be very rare.

if [ "${size}" -gt "${MAX_FILE_SIZE_BYTES}" ]; then
  exit_code=1
  echo "Object ${id} [${path}] has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit." >&2
  echo "::error file=${path}::File ${path} has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit."
A contributor commented:

Do we need to duplicate the file path?

findepi (Member, Author) commented Jul 17, 2024

The first file=${path} is necessary to create an annotation on the file itself. I suppose the GitHub UI would display it if the offending file were a text file. This doesn't show up on the build output log page.

The second File ${path} is necessary for the error to be easy to understand when looking at the build output log page.

See #11508 (comment) for a visualization.
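The duplication can be illustrated in isolation (values below are made up for the sketch): the stderr line serves readers of the raw log, while the `::error` workflow command on stdout is parsed by GitHub into a file annotation shown on the Summary page.

```shell
# Illustration only, with hypothetical values; not the workflow's real data.
path="docs/logos/example.pdf"   # hypothetical offending file
id="deadbeef"                   # hypothetical blob id
size=3700000
MAX_FILE_SIZE_BYTES=1048576

# Plain stderr line: shows up in the build output log page.
echo "Object ${id} [${path}] has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit." >&2

# Workflow command on stdout: GitHub renders it as a file annotation
# on the Summary page rather than in the raw log.
echo "::error file=${path}::File ${path} has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit."
```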

# Skip objects which are not files (commits, trees)
if [ ! -z "${path}" ]; then
  size="$(git cat-file -s "${id}")"
  if [ "${size}" -gt "${MAX_FILE_SIZE_BYTES}" ]; then
A contributor commented:

A nit: should we exit early with exit_code=1 when we find the first large file, or do we want to output all of them?

findepi (Member, Author) replied:

The intention was to output all of them.
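That accumulate-then-exit pattern can be sketched minimally as follows (ordinary files in a temp dir stand in for Git blobs; the tiny limit is chosen so the sketch trips on purpose):

```shell
# Sketch: report every oversized file before failing, instead of exiting
# at the first hit.
set -u
MAX_FILE_SIZE_BYTES=10   # deliberately tiny limit for the demonstration
exit_code=0
tmp="$(mktemp -d)"
printf 'x' > "$tmp/ok.txt"                    # 1 byte: fine
printf '0123456789ABCDEF' > "$tmp/big1.txt"   # 16 bytes: too big
printf '0123456789ABCDEF' > "$tmp/big2.txt"   # 16 bytes: too big
for path in "$tmp"/*; do
  size="$(wc -c < "$path" | tr -d ' ')"
  if [ "$size" -gt "$MAX_FILE_SIZE_BYTES" ]; then
    exit_code=1   # remember the failure, but keep scanning
    echo "File ${path} has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit."
  fi
done
# A real script would end with: exit "$exit_code"
```

Both offending files are reported in one run; the single deferred exit code is what eventually fails the workflow step.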

comphead (Contributor) left a comment:

Thanks @findepi, it looks promising IMHO.

findepi (Member, Author) commented Jul 17, 2024

Thanks @comphead for your review!
Updated. I also refreshed the test PRs (linked in #11508 (comment)) to verify I didn't break anything with the new changes.

comphead (Contributor) commented:

Thanks @findepi,
just 2 small things to double-check:

  • Does the script traverse only the changes for this particular PR? What I mean: if there is an already existing file in the repo exceeding the max size, the new PR check shouldn't complain about it.
  • If the user accidentally added a large file in a PR, then deleted it, would the PR pass the check, or does the user need to recreate the PR? (I suppose it might be an issue with depth: 0.)

findepi (Member, Author) commented Jul 17, 2024

  • Does the script traverse only the changes for this particular PR? What I mean: if there is an already existing file in the repo exceeding the max size, the new PR check shouldn't complain about it.

Correct.
And yes, there is one 3.7 MB file in the repo (./docs/logos/Datafusion_Branding_Guideline.pdf).

If the user accidentally added a large file in a PR, then deleted it, would the PR pass the check, or does the user need to recreate the PR? (I suppose it might be an issue with depth: 0.)

Good question:

  • if the deletion is a separate commit, then no, the check will not pass
  • if the deletion is done by amending the commit, then yes, the check will pass

Keeping the failure in the "deletion is a separate commit" case is right if the PR will be merged or rebase-merged, but unnecessary if the PR will be squash-merged. If PRs are always squash-merged, we could improve the script to take this into account. But that could be viewed as overengineering, on the optimistic assumption that the check will rarely trigger anyway.
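The distinction can be demonstrated with a toy repo (a sketch under the assumption, consistent with the workflow above, that the check walks the objects in base..head): a per-commit object walk still sees a blob that a later commit deleted, while the PR's net diff, which is what a squash merge would actually land, does not.

```shell
# Toy repo: a file added in one commit and deleted in the next is still an
# object in base..head, but absent from the PR's net diff.
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
gitc() { git -c user.email=ci@example.com -c user.name=ci "$@"; }
gitc commit -q --allow-empty -m "base"
base="$(git rev-parse HEAD)"
printf 'pretend this is huge' > big.bin
git add big.bin
gitc commit -q -m "accidentally add big file"
git rm -q big.bin
gitc commit -q -m "remove big file again"

# Per-commit object walk: the deleted blob is still among the new objects,
# so an object-based check keeps failing.
git rev-list --objects "${base}..HEAD" | grep "big.bin"

# Net diff of the whole PR: empty, so a squash merge would land nothing big.
git diff --name-only "${base}" HEAD
```

Amending or squashing the commits (e.g. via git rebase) rewrites history so the blob is no longer reachable from head at all, which is why the check passes in that case.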

comphead (Contributor) left a comment:

LGTM, thanks @findepi, I think this PR is good.

@alamb there is a small side effect described above which may give a false positive and basically may require recreating the PR. This doesn't seem like a big problem to me, but for large complex PRs the user might be distressed to recreate it after a hard rebase. But this is a pretty rare case.

@findepi
Copy link
Member Author

findepi commented Jul 18, 2024

may require recreating the PR

or squashing commits (e.g. via git rebase)

alamb (Contributor) commented Jul 19, 2024

@alamb there is a small side effect described above which may give a false positive and basically may require recreating the PR. This doesn't seem like a big problem to me, but for large complex PRs the user might be distressed to recreate it after a hard rebase. But this is a pretty rare case.

I agree with @findepi that there is a fairly easy workaround that doesn't need to recreate the PR (it would require rebasing it).

I think we should try this, and we can always adjust as necessary if we find something isn't ideal.

Thank you very much @findepi and @comphead -- I think this is a nice improvement

@alamb alamb merged commit 9189a1a into apache:main Jul 19, 2024
24 checks passed
@findepi findepi deleted the findepi/check-files branch July 20, 2024 09:27
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024