Begin `unsigned long` -> `size_t` conversion to support large files on Windows #3533
Conversation
Thank you for working on this!
This PR is distinctly more pleasant to read and to review than #2179; it is really great to see.
There is a bit more work to be done to polish this to a point where we can hope for the patches to be accepted on the Git mailing list either without changes or with only minor adjustments. I am pretty optimistic at this stage that it won't take too much effort to get there so that we can upstream this soon after Git v2.34.0 is released.
@@ -49,6 +49,9 @@ test_expect_success 'setup' '
example sha1:ddd3f836d3e3fbb7ae289aa9ae83536f76956399
example sha256:b44fe1fe65589848253737db859bd490453510719d7424daab03daf0767b85ae
Commenting about the commit message: a verb is missing from the commit subject. It would read better as "hash-object: demonstrate failure with big files in an LLP64 data model".
@@ -2027,7 +2027,7 @@ int write_object_file_flags(const void *buf, unsigned long len,
 	return write_loose_object(oid, hdr, hdrlen, buf, len, 0, flags);
 }

-int hash_object_file_literally(const void *buf, unsigned long len,
+int hash_object_file_literally(const void *buf, size_t len,
Okay, this looks like a good change. Still, simply changing the signature leaves open a rather big question (which the commit message does not answer): shouldn't the callees of this function be updated as well? I would expect at least a call to `cast_size_t_to_ulong()`. At least when my eyes look at the signature of `write_object_file_prepare()`, it still uses `unsigned long` as the data type for `len`.
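To make that concrete, here is a minimal standalone sketch of the kind of checked downcast being suggested; the helper name and the error handling below are placeholders of mine, not Git's actual in-tree `cast_size_t_to_ulong()`:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Illustrative stand-in for a checked size_t -> unsigned long downcast:
 * refuse to silently truncate a value that does not fit into unsigned
 * long (which is only 32 bits wide on LLP64 systems such as Windows).
 */
unsigned long checked_size_t_to_ulong(size_t a)
{
	if (a != (unsigned long)a) {
		fprintf(stderr,
			"object too large for this platform: %" PRIuMAX " bytes\n",
			(uintmax_t)a);
		exit(128);
	}
	return (unsigned long)a;
}
```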
About the commit message:

> object-file.c: use size_t for large object sizes

Please drop the `.c`. Or maybe even better: use `hash_object_file_literally()` as prefix, because we sure do not take care of all of `object-file.c` in this commit.

> This is the starting point for size_t conversion for the --literally option.

That `--literally` comes out of left field a bit. Did you mean to talk about `git hash-object`? The commit message did not talk about that, it talked about `object-file.c`, and that file has no `--literally` option.
> The simplest of function signature changes.

I would strongly disagree that this is the simplest of function signature changes. Maybe avoid the judgement and instead lead with the motivation for this change (namely, spell it out that files >=4GB cannot currently be hashed in an LLP64 data model), and then explain that this change is a minimal step toward that goal.

> There are no other uses of the function.

We have not even described one user. Does that mean that there are no users of this function?

> The string buffer type is already size_t compatible.

Why is this relevant in this context? Remember, commit messages should clarify and preempt questions a reader might have about the patch.

> Lower level function signature changes are in subsequent commits.

What are "lower level functions"? Do you mean callers of `hash_object_file_literally()`? Or callees?

> Future commits will extend the code paths of the test cases to real usages.

That is too vague, especially given the fact that this commit does not even touch test cases.
Not sure what to do with this (far too much in the one comment).
'We' needed to start somewhere.
The first commit identified (and created) a failing test that has the tightest of focus. The changes then work outward from that specific failure until we have a passing test, and then we can flip the test status. It only resolves that very narrowest of tests, leaving the rest still untested and failing (the future changes).
One factor this series did drive out (for me) is that the hash functions also needed updating, not just the zlib functions and the many internal Git length type usages.
> Not sure what to do with this (far too much in the one comment).

You need to practice writing informative commit messages.

> 'We' needed to start somewhere.

Sure. And we now need to tell a story with the patch series. And like every good story, it needs to be edited. Nobody cares how we arrived at the end, as long as the story is fun to read.

> The first commit identified (and created) a failing test that has the tightest of focus.

Yes, that's how you developed the patch series. And the last patch in the patch series changes something in the code that lets the test case pass. But it does not have to end there. The first commit can still stay in place, but the last commit should probably be put earlier in the patch series. Necessarily, the change from `test_expect_failure` to `test_expect_success` will have to be moved to the patch that will now be at the end.
> Not sure what to do with this (far too much in the one comment).

You need to practice writing informative commit messages. At my day job, this problem came up, and the best advice that came out of the session was: pretend that you are talking to your future self, like, 6 months from now, when you don't recall any details. What would you tell your future self to catch them up?

I am usually loath to suggest commit messages to native English speakers (because English is only a second language to me, or third, depending on how you want to count). So I will need encouragement from your side if you want me to.
I've updated the commit messages, but retained the 'find, fix, find again' processing and the half-stepping (splitting at the signature change, then doing the internals), though all false steps have been rebased away.

I have also added three additional tests at the end, because they passed (yay), which gives a bit of confidence that there weren't other gotchas in those other parallel paths.
I haven't updated the PR cover note yet.
 		const char *type, struct object_id *oid,
 		unsigned flags)
 {
 	char hdr[MAX_HEADER_LEN];
-	int hdrlen = sizeof(hdr);
+	size_t hdrlen = sizeof(hdr);
Unlike `hash_object_file()`, which only calls `write_object_file_prepare()` and therefore handles the increased data width of `len` and `hdrlen` correctly, the `write_object_file_flags()` function also calls `write_loose_object()`, which you have not prepared to accept the wider data type:

Lines 1919 to 1921 in 06d5a16

static int write_loose_object(const struct object_id *oid, char *hdr,
			      int hdrlen, const void *buf, unsigned long len,
			      time_t mtime, unsigned flags)

It would be better to change `write_loose_object()` in a preemptive patch before changing `write_object_file_flags()`: it is totally possible to call a function that accepts `size_t` parameters with `unsigned long`, as the latter will always be at most as wide as the former, at least in the data models we support in Git.
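For illustration only (toy function names, not Git code), here is that direction argument in miniature: an `unsigned long` argument passed to a `size_t` parameter widens losslessly, while the opposite direction is the one that can truncate on LLP64:

```c
#include <stdio.h>

/* Callee already converted to the wider size_t parameter. */
static void takes_size_t(size_t len)
{
	printf("len = %zu\n", len);
}

int main(void)
{
	/*
	 * A caller that has not yet been converted still holds an
	 * unsigned long; passing it to a size_t parameter widens the
	 * value and never loses information in the data models Git
	 * supports. The reverse (size_t into unsigned long) is what
	 * can silently truncate on LLP64.
	 */
	unsigned long len = 123456789UL;
	takes_size_t(len);
	return 0;
}
```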
@@ -2032,7 +2032,8 @@ int hash_object_file_literally(const void *buf, size_t len,
 	unsigned flags)
 {
 	char *header;
-	int hdrlen, status = 0;
+	size_t hdrlen;
Same comment as for `write_object_file_flags()`: the `hash_object_file_literally()` function also calls `write_loose_object()` (which should be prepared for `size_t` before changing the data type of `hdrlen` here).
Force-pushed from daeb6c8 to 437ec2a. The updated commit messages:
On LLP64 systems, such as Windows, the size of `long`, `int`, etc. is only 32 bits (for backward compatibility). Git's use of `unsigned long` for file memory sizes in many places, rather than `size_t`, limits the handling of large files on LLP64 systems (commonly given as `>4GB`).

Provide a minimum test for handling a >4GB file. The `hash-object` command, with the `--literally` option and without the `-w` option, avoids writing the object, either loose or packed. This avoids the code paths hitting the `bigFileThreshold` config test code, the zlib code, and the pack code.

Subsequent patches will walk the test's call chain, converting types to `size_t` (which is larger in LLP64 data models) where appropriate.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
The previous commit adds a test that demonstrates a problem in the `hash-object --literally` command, manifesting in an unnecessary file size limit on systems using the LLP64 data model (which includes Windows).

The affected code path is `cmd_hash_object()` >> `hash_fd()` >> `hash_literally()` >> `hash_object_file_literally()`. The function `hash_object_file_literally()` is the first with a file length parameter (via a mem buffer). This commit changes the type of that parameter to the LLP64-compatible `size_t` type.

There are no other uses of the function. The `strbuf` type is already `size_t` compatible.

Note: The hash-object test does not yet pass. Subsequent commits will continue to walk the call tree's lower level functions to identify further fixes.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Continue walking the code path for the >4GB `hash-object --literally` test. The `hash_object_file_literally()` function internally uses both `hash_object_file()` and `write_object_file_prepare()`. Both function signatures use `unsigned long` rather than `size_t` for the mem buffer sizes. Use `size_t` instead, for LLP64 compatibility.

While at it, convert those functions' object header buffer length to `size_t` for consistency. The value is already upcast to `uintmax_t` for print format compatibility.

Note: The hash-object test still does not pass. A subsequent commit continues to walk the call tree's lower level hash functions to identify further fixes.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
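As a side note on the `uintmax_t` remark: printing a `size_t` (or `unsigned long`) length portably is commonly done by casting to `uintmax_t` and formatting with `PRIuMAX`. A small standalone sketch of that pattern, not the exact in-tree code:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* 5 GiB; assumes a 64-bit size_t, as on LP64 and LLP64 systems. */
	size_t len = (size_t)5 * 1024 * 1024 * 1024;
	char hdr[32];

	/* Casting to uintmax_t lets one format string cover every
	 * possible width of size_t and unsigned long. */
	snprintf(hdr, sizeof(hdr), "blob %" PRIuMAX, (uintmax_t)len);
	puts(hdr);
	return 0;
}
```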
Continue walking the code path for the >4GB `hash-object --literally` test to the hash algorithm step for LLP64 systems. This patch lets the SHA1DC code use `size_t`, making it compatible with LLP64 data models (as used e.g. by Windows).

The interested reader of this patch will note that we adjust the signature of the `git_SHA1DCUpdate()` function without updating _any_ call site. This certainly puzzled at least one reviewer already, so here is an explanation:

This function is never called directly, but always via the macro `platform_SHA1_Update`, which is usually called via the macro `git_SHA1_Update`. However, we never call `git_SHA1_Update()` directly in `struct git_hash_algo`. Instead, we call `git_hash_sha1_update()`, which is defined thusly:

    static void git_hash_sha1_update(git_hash_ctx *ctx, const void *data,
                                     size_t len)
    {
        git_SHA1_Update(&ctx->sha1, data, len);
    }

i.e. it contains an implicit downcast from `size_t` to `unsigned long` (before this here patch). With this patch, there is no downcast anymore.

With this patch, finally, the t1007-hash-object.sh "files over 4GB hash literally" test case is fixed.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
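To make the truncation concrete, here is a small standalone illustration (not Git code) of what the old implicit downcast would do to a >4GB length on an LLP64 system, where `unsigned long` is 32 bits wide:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	size_t len = (size_t)5 * 1024 * 1024 * 1024; /* a 5 GiB buffer length */

	/* Simulate the old wrapper's implicit size_t -> unsigned long
	 * downcast on LLP64 by narrowing to 32 bits explicitly. */
	uint32_t as_32bit_ulong = (uint32_t)len;

	printf("requested: %" PRIuMAX " bytes\n", (uintmax_t)len);
	printf("hashed:    %" PRIu32 " bytes\n", as_32bit_ulong); /* 1 GiB, silently */
	return 0;
}
```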
Just like the `hash-object --literally` code path, the `--stdin` code path also needs to use `size_t` instead of `unsigned long` to represent memory sizes, otherwise it would cause problems on platforms using the LLP64 data model (such as Windows).

To limit the scope of the test case, the object is explicitly not written to the object store, nor are any filters applied. The `big` file from the previous test case is reused to save setup time; to avoid relying on that side effect, it is generated if it does not exist (e.g. when running via `sh t1007-*.sh --long --run=1,41`).

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
To complement the `--stdin` and `--literally` test cases that verify that we can hash files larger than 4GB on 64-bit platforms using the LLP64 data model, here is a test case that exercises `hash-object` _without_ any options.

Just as before, we use the `big` file from the previous test case if it exists to save on setup time, otherwise generate it.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
To verify that the `clean` side of the `clean`/`smudge` filter code is correct with regards to LLP64 (read: to ensure that `size_t` is used instead of `unsigned long`), here is a test case using a trivial filter, specifically _not_ writing anything to the object store to limit the scope of the test case.

As in previous commits, the `big` file from previous test cases is reused if available, to save setup time, otherwise re-generated.

Signed-off-by: Philip Oakley <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
The macOS failures are due to the occasional update of Perforce (they overwrite the file when upgrading, so that Homebrew always ends up with outdated checksums for a while).
Thank you for the updates and extra info in the commit messages. I hadn't thought about that alternative (the ...); I hadn't thought of the simple sha1 ... Next, for me, is either the size_t bigFileThreshold configs (currently missing), leading to actually writing the big loose object, or maybe trying the 'roll-up' of the size_t changes in Matt's git-lfs tree #3487 (#3487 (comment), bottom, up to ...).
This short series starts the code conversion to the more portable `size_t` type for object sizes. A minimalist test is used as the initial test case.

The Windows LLP64 convention and the Linux LP64 convention differ in their `sizeof(long)`, meaning objects larger than 4GB (2GB in some cases) are not handled correctly on LLP64 systems by code that assumes the LP64 sizes.

Git-LFS oversize object detection was resolved in Matt Cooper's series 596b5e7 (clean/smudge: allow clean filters to process extremely large files, 2021-11-02).

The strbufs are already fully `size_t` compatible.
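For readers less familiar with the data-model names, here is a tiny standalone check of the widths involved (output varies by platform; the comment shows the typical values):

```c
#include <stdio.h>

int main(void)
{
	/* Typical widths:
	 *   LP64  (Linux, 64-bit macOS): sizeof(long) == 8, sizeof(size_t) == 8
	 *   LLP64 (64-bit Windows):      sizeof(long) == 4, sizeof(size_t) == 8
	 */
	printf("sizeof(long)   = %zu\n", sizeof(long));
	printf("sizeof(size_t) = %zu\n", sizeof(size_t));
	return 0;
}
```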
Some RFC (request for comment) points:

- Feeding the 5GB to stdin: `>big` redirection vs using a `$()` sub-shell?
- Should the tests live in `t1007-hash-object.sh`, or maybe in `t1051-large-conversion.sh`?
- Future/next work: use `-w` to write a loose object as a step-up in complexity. This will need `bigFileThreshold` sorted, and `zlib`.
- Strategically, avoid any pack work until the loose object paths are resolved.
- Mark "`/* TODO size_t <what> */`" to avoid cascaded changes when call tree changes are deep and wide. The long/size_t name change has no effect on Linux, and on Windows it's already 'broken', so delaying the TODOs till later patches will reduce review loading.
- 2nd step work: the other hash-object options.
--
The PR can be passed (via ggg) to the main list once dumb flaws are fixed. Nevertheless, you can use GitGitGadget (https://gitgitgadget.github.io/) to conveniently send your Pull Request's commits to our mailing list.

Philip