Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Begin unsigned long->size_t conversion to support large files on Windows #3533

Merged
merged 7 commits into from
Dec 15, 2021

Conversation

PhilipOakley
Copy link

This short series starts the code conversion to the more portable size_t type for object sizes.

A minimalist test is used as the initial test case.

The Windows LLP64 convention and Linux LP64 conventions differ in the sizeof(long),
meaning objects larger than 4GB (2GB in some case) are not handled correctly by the Linux LP64 code.

Git-LFS oversize object detection was resolved in Matt Cooper's series 596b5e7 (clean/smudge: allow clean filters to process extremely large files, 2021-11-02).

The strbuf's are already fully size_t in compatible.

Some RFC (request for comment) points:

  • Brian Carlson for checking the wider SHAxxx compatibility (e.g. the other sha libraries)
  • Dscho for the >big vs using a $() sub-shell for feeding the 5GB to stdin?
  • Should test be in t1007-hash-object.sh, or maybe in t1051-large-conversion.sh?
  • any other tidy-ups?

Future/next work: use -w write a loose object as a step-up in complexity:
This will need bigFileThreshold sorted, and zlib.

Strategically, avoid any pack work until loose object paths resolved.
Mark "/* TODO size_t <what> */" to avoid cascaded changes when call tree changes are deep and wide. The long/size_t name change has no effect on Linux, and on Windows it's already 'broken', so delaying the TODOs till later patches will reduce review loading.

2nd step work: the other hash-object options.

--
The PR can be passed (ggg) to the main list once dumb flaws are fixed.

Nevertheless, you can use GitGitGadget (https://gitgitgadget.github.io/)
to conveniently send your Pull Requests commits to our mailing list.

Philip

object-file.c Outdated Show resolved Hide resolved
Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

This PR is distinctly more pleasant to read and to review than #2179, it is really great to see.

There is a bit more work to be done to polish this to a point where we can hope for the patches to be accepted on the Git mailing list either without changes or with only minor adjustments. I am pretty optimistic at this stage that it won't take too much effort to get there so that we can upstream this soon after Git v2.34.0 is released.

@@ -49,6 +49,9 @@ test_expect_success 'setup' '

example sha1:ddd3f836d3e3fbb7ae289aa9ae83536f76956399
example sha256:b44fe1fe65589848253737db859bd490453510719d7424daab03daf0767b85ae

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting about the commit message: a verb is missing from the commit subject. It would read better as "hash-object: demonstrate failure with big files in an LLP64 data model".

@@ -2027,7 +2027,7 @@ int write_object_file_flags(const void *buf, unsigned long len,
return write_loose_object(oid, hdr, hdrlen, buf, len, 0, flags);
}

int hash_object_file_literally(const void *buf, unsigned long len,
int hash_object_file_literally(const void *buf, size_t len,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this looks like a good change. Still, simply changing the signature leaves open a rather big question (which the commit message does not answer): shouldn't the callees of this function be updated, either? I would expect at least a call to cast_size_t_to_ulong(). At least when my eyes look at the signature of write_object_file_prepare(), it still uses unsigned long as the data type for len.

About the commit message:

object-file.c: use size_t for large object sizes

Please drop the .c. Or maybe even better: use hash_object_file_literally() as prefix, because we sure do not take care of all of object-file.c in this commit.

This is the starting point for size_t conversion for the --literally
option.

That --literally comes out of left field a bit. Did you mean to talk about git hash-object? The commit message did not talk about that, it talked about object-file.c, and that file has no --literally option.

The simplest of function signature changes.

I would strongly disagree that this is the simplest of function signature changes. Maybe avoid the judgement and instead lead with the motivation for this change (namely, spell it out that files >=4GB cannot currently be hashed in an LLP64 data model), and then explain that this change is a minimal step toward that goal.

There are no other uses of the function.

We have not even described one user. Does that mean that there are no users of this function?

The string buffer type is already size_t compatible.

Why is this relevant in this context? Remember, commit messages should clarify, preempt questions a reader might have about the patch.

Lower level function signature changes are in subsequent commits.

What are "lower level functions"? Do you mean callers of hash_object_file_literally()? Or callees?

Future commits will extend the code paths of the test cases to real usages.

That is too vague, especially given the fact that this commit does not even touch test cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to do with this (far too much in the one comment).

'We' needed to start somewhere.

The first commit identified (and created) a failing test that has the tightest of focus. The changes then work from tat specific failure, working outward, until we have a passing test, then we can flip the test status. It only resolves that very narrowest of tests, leaving the rest still untested and failing (the future changes).

One factor this series did drive out (for me) is that the hash functions also needed updating, not just the zlib functions and the many internal Git length type usages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to do with this (far too much in the one comment).

You need to practice writing informative commit messages.

'We' needed to start somewhere.

Sure. And we now need to tell a story with the patch series. And like every good story, it needs to be edited. Nobody cares how we arrived at the end, as long as the story is fun to read.

The first commit identified (and created) a failing test that has the tightest of focus.

Yes, that's how you developed the patch series. And the last patch in the patch series changes something in the code that lets the test case pass. But it does not have to end there. The first commit can still stay in place, but the last commit should probably be put earlier in the patch series. Necessarily, the change from test_expect_failure to test_expect_success will have to be moved to the patch that will now be at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to do with this (far too much in the one comment).

You need to practice writing informative commit messages.

At my day job, this problem came up, and the best advice that came out of the session was: pretend that you are talking your future self, like, 6 months from now, when you don't recall any details. What would you tell your future self to catch them up?

I am usually loathe to suggest commit messages to native English speakers (because English is only a second language to me, or third, depending how you want to count). So I will need encouragement from your side if you want me to.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the commit messages, but retained the 'find, fix, find again' processing, and the half-stepping (splitting at signature, then do internals), though all false steps have been rebased away.

I have also added three additional tests at the end, because they passed (yay) which gives a bit of confidence there weren't other gotcha's in those other parallel paths.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't updated the PR cover note yet.

object-file.c Show resolved Hide resolved
object-file.c Show resolved Hide resolved
const char *type, struct object_id *oid,
unsigned flags)
{
char hdr[MAX_HEADER_LEN];
int hdrlen = sizeof(hdr);
size_t hdrlen = sizeof(hdr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike hash_object_file(), which only calls write_object_file_prepare() and therefore handles the increased data width of len and hdrlen correctly, the write_object_file_flags() function also calls write_loose_object(), which you have not prepared to accept the wider data type:

git/object-file.c

Lines 1919 to 1921 in 06d5a16

static int write_loose_object(const struct object_id *oid, char *hdr,
int hdrlen, const void *buf, unsigned long len,
time_t mtime, unsigned flags)

It would be better to change write_loose_object() in a preemptive patch before changing write_object_file_flags(): it is totally possible to call a function that accepts size_t parameters with unsigned long, as the latter will be always at most as wide as the former, at least in the data models we support in Git.

@@ -2032,7 +2032,8 @@ int hash_object_file_literally(const void *buf, size_t len,
unsigned flags)
{
char *header;
int hdrlen, status = 0;
size_t hdrlen;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for write_object_file_flags(): the hash_object_file_literally() function also calls write_loose_object() (which should be prepared for size_t before changing the data type of hdrlen here).

sha1dc_git.c Show resolved Hide resolved
@PhilipOakley PhilipOakley force-pushed the hashliteral_t branch 2 times, most recently from daeb6c8 to 437ec2a Compare December 7, 2021 10:00
On LLP64 systems, such as Windows, the size of `long`, `int`, etc. is
only 32 bits (for backward compatibility). Git's use of `unsigned long`
for file memory sizes in many places, rather than size_t, limits the
handling of large files on LLP64 systems (commonly given as `>4GB`).

Provide a minimum test for handling a >4GB file. The `hash-object`
command, with the  `--literally` and without `-w` option avoids
writing the object, either loose or packed. This avoids the code paths
hitting the `bigFileThreshold` config test code, the zlib code, and the
pack code.

Subsequent patches will walk the test's call chain, converting types to
`size_t` (which is larger in LLP64 data models) where appropriate.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
The previous commit adds a test that demonstrates a problem in the
`hash-object --literally` command, manifesting in an unnecessary file
size limit on systems using the LLP64 data model (which includes
Windows).

Walking the affected code path is `cmd_hash_object()` >> `hash_fd()` >>
`hash_literally()` >> `hash_object_file_literally()`.

The function `hash_object_file_literally()` is the first with a file
length parameter (via a mem buffer). This commit changes the type of
that parameter to the LLP64 compatible `size_t` type.

There are no other uses of the function. The `strbuf` type is already
`size_t` compatible.

Note: The hash-object test does not yet pass. Subsequent commits will
continue to walk the call tree's lower level functions to identify
further fixes.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Continue walking the code path for the >4GB `hash-object --literally`
test. The `hash_object_file_literally()` function internally uses both
`hash_object_file()` and `write_object_file_prepare()`. Both function
signatures use `unsigned long` rather than `size_t` for the mem buffer
sizes. Use `size_t` instead, for LLP64 compatibility.

While at it, convert those function's object's header buffer length to
`size_t` for consistency. The value is already upcast to `uintmax_t` for
print format compatibility.

Note: The hash-object test still does not pass. A subsequent commit
continues to walk the call tree's lower level hash functions to identify
further fixes.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Continue walking the code path for the >4GB `hash-object --literally`
test to the hash algorithm step for LLP64 systems.

This patch lets the SHA1DC code use `size_t`, making it compatible with
LLP64 data models (as used e.g. by Windows).

The interested reader of this patch will note that we adjust the
signature of the `git_SHA1DCUpdate()` function without updating _any_
call site. This certainly puzzled at least one reviewer already, so here
is an explanation:

This function is never called directly, but always via the macro
`platform_SHA1_Update`, which is usually called via the macro
`git_SHA1_Update`. However, we never call `git_SHA1_Update()` directly
in `struct git_hash_algo`. Instead, we call `git_hash_sha1_update()`,
which is defined thusly:

    static void git_hash_sha1_update(git_hash_ctx *ctx,
                                     const void *data, size_t len)
    {
        git_SHA1_Update(&ctx->sha1, data, len);
    }

i.e. it contains an implicit downcast from `size_t` to `unsigned long`
(before this here patch). With this patch, there is no downcast anymore.

With this patch, finally, the t1007-hash-object.sh "files over 4GB hash
literally" test case is fixed.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Just like the `hash-object --literally` code path, the `--stdin` code
path also needs to use `size_t` instead of `unsigned long` to represent
memory sizes, otherwise it would cause problems on platforms using the
LLP64 data model (such as Windows).

To limit the scope of the test case, the object is explicitly not
written to the object store, nor are any filters applied.

The `big` file from the previous test case is reused to save setup time;
To avoid relying on that side effect, it is generated if it does not
exist (e.g. when running via `sh t1007-*.sh --long --run=1,41`).

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
To complement the `--stdin` and `--literally` test cases that verify
that we can hash files larger than 4GB on 64-bit platforms using the
LLP64 data model, here is a test case that exercises `hash-object`
_without_ any options.

Just as before, we use the `big` file from the previous test case if it
exists to save on setup time, otherwise generate it.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
To verify that the `clean` side of the `clean`/`smudge` filter code is
correct with regards to LLP64 (read: to ensure that `size_t` is used
instead of `unsigned long`), here is a test case using a trivial filter,
specifically _not_ writing anything to the object store to limit the
scope of the test case.

As in previous commits, the `big` file from previous test cases is
reused if available, to save setup time, otherwise re-generated.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho changed the title [RFC] begin Windows large object conversion Begin unsigned long->size_t conversion to support large files on Windows Dec 15, 2021
@dscho
Copy link
Member

dscho commented Dec 15, 2021

The macos failures are due to the occasional update of Perforce (they overwrite the file when upgrading, so that Homebrew always ends up with outdated checksums for a while).

@dscho dscho merged commit 784d65b into git-for-windows:main Dec 15, 2021
@dscho dscho added this to the Next release milestone Dec 15, 2021
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 15, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 16, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Dec 16, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Dec 17, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
@PhilipOakley
Copy link
Author

Thank you for the updates and extra info in the commit messages.

I hadn't thought about that alternative (the test -f big ||) to the code comment.

I hadn't thought of the simple sha1 #defines as being macros, which I tend to think of as the functional form, but it's not important to the explanation. I'd agree that the renaming dance does get confusing. The second set of eyes does make for a better explanation.

Next, for me, is either the size_t bigFileThreshold configs (currently missing) leading to actually writing the big loose object, or maybe try the 'roll-up' of the size_t changes in Matt's git-lfs tree #3487 (#3487 (comment) bottom, up to do_oid_object_info_extended())

dscho added a commit that referenced this pull request Dec 21, 2021
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 7, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
@dscho dscho mentioned this pull request Jan 7, 2025
dscho added a commit to dscho/git that referenced this pull request Jan 7, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit to dscho/git that referenced this pull request Jan 7, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 7, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 7, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 8, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 8, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 8, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 8, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 9, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 9, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Jan 17, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Jan 27, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Jan 27, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Jan 27, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Jan 27, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 27, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 28, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Jan 28, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 3, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 5, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 5, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 5, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Feb 5, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 6, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit that referenced this pull request Feb 6, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
git-for-windows-ci pushed a commit that referenced this pull request Feb 6, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit to dscho/git that referenced this pull request Feb 6, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
dscho added a commit to dscho/git that referenced this pull request Feb 6, 2025
Begin `unsigned long`->`size_t` conversion to support large files on Windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants