Skip to content

Commit

Permalink
repack: add --path-walk option
Browse files Browse the repository at this point in the history
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.40(0.47+0.04)
5313.3: thin pack size                                    1.2M
5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash             22.8K
5313.6: thin pack with --path-walk             0.08(0.06+0.02)
5313.7: thin pack size with --path-walk                  20.8K
5313.8: big pack                               2.16(8.43+0.23)
5313.9: big pack size                                    17.7M
5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash             18.0M
5313.12: big pack with --path-walk             2.21(8.39+0.24)
5313.13: big pack size with --path-walk                  17.8M
5313.14: repack                                98.05(662.37+2.64)
5313.15: repack size                                    449.1K
5313.16: repack with --full-name-hash          33.95(129.44+2.63)
5313.17: repack size with --full-name-hash              182.9K
5313.18: repack with --path-walk               106.21(121.58+0.82)
5313.19: repack size with --path-walk                   159.6K

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes in the thin packs:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.39(2.91+0.10)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.01(0.00+0.00)
5313.3: thin pack size                                     310
5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash              1.4K
5313.6: thin pack with --path-walk             0.00(0.00+0.00)
5313.7: thin pack size with --path-walk                    310

Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.

Signed-off-by: Derrick Stolee <[email protected]>
  • Loading branch information
derrickstolee authored and dscho committed Jan 11, 2025
1 parent 2d88d3f commit 1d621e1
Show file tree
Hide file tree
Showing 3 changed files with 44 additions and 2 deletions.
15 changes: 14 additions & 1 deletion Documentation/git-repack.txt
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ SYNOPSIS
[verse]
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
[--write-midx] [--full-name-hash]
[--write-midx] [--full-name-hash] [--path-walk]

DESCRIPTION
-----------
Expand Down Expand Up @@ -251,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.

--path-walk::
This option passes the `--path-walk` option to the underlying
`git pack-options` process (see linkgit:git-pack-objects[1]).
By default, `git pack-objects` walks objects in an order that
presents trees and blobs in an order unrelated to the path they
appear relative to a commit's root tree. The `--path-walk` option
enables a different walking algorithm that organizes trees and
blobs by path. This has the potential to improve delta compression
especially in the presence of filenames that cause collisions in
Git's default name-hash algorithm. Due to changing how the objects
are walked, this option is not compatible with `--delta-islands`
or `--filter`.

CONFIGURATION
-------------

Expand Down
7 changes: 6 additions & 1 deletion builtin/repack.c
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
"[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
"[--write-midx] [--full-name-hash]"),
"[--write-midx] [--full-name-hash] [--path-walk]"),
NULL
};

Expand All @@ -63,6 +63,7 @@ struct pack_objects_args {
int quiet;
int local;
int full_name_hash;
int path_walk;
struct list_objects_filter_options filter_options;
};

Expand Down Expand Up @@ -313,6 +314,8 @@ static void prepare_pack_objects(struct child_process *cmd,
strvec_pushf(&cmd->args, "--no-reuse-object");
if (args->full_name_hash)
strvec_pushf(&cmd->args, "--full-name-hash");
if (args->path_walk)
strvec_pushf(&cmd->args, "--path-walk");
if (args->local)
strvec_push(&cmd->args, "--local");
if (args->quiet)
Expand Down Expand Up @@ -1212,6 +1215,8 @@ int cmd_repack(int argc,
N_("pass --no-reuse-object to git-pack-objects")),
OPT_BOOL(0, "full-name-hash", &po_args.full_name_hash,
N_("(EXPERIMENTAL!) pass --full-name-hash to git-pack-objects")),
OPT_BOOL(0, "path-walk", &po_args.path_walk,
N_("(EXPERIMENTAL!) pass --path-walk to git-pack-objects")),
OPT_NEGBIT('n', NULL, &run_update_server_info,
N_("do not run git-update-server-info"), 1),
OPT__QUIET(&po_args.quiet, N_("be quiet")),
Expand Down
24 changes: 24 additions & 0 deletions t/perf/p5313-pack-objects.sh
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,14 @@ test_size 'thin pack size with --full-name-hash' '
test_file_size out
'

test_perf 'thin pack with --path-walk' '
git pack-objects --thin --stdout --revs --sparse --path-walk <in-thin >out
'

test_size 'thin pack size with --path-walk' '
wc -c <out
'

test_perf 'big pack' '
git pack-objects --stdout --revs --sparse <in-big >out
'
Expand All @@ -52,6 +60,14 @@ test_size 'big pack size with --full-name-hash' '
test_file_size out
'

test_perf 'big pack with --path-walk' '
git pack-objects --stdout --revs --sparse --path-walk <in-big >out
'

test_size 'big pack size with --path-walk' '
wc -c <out
'

test_perf 'repack' '
git repack -adf
'
Expand All @@ -70,4 +86,12 @@ test_size 'repack size with --full-name-hash' '
test_file_size "$pack"
'

test_perf 'repack with --path-walk' '
git repack -adf --path-walk
'

test_size 'repack size with --path-walk' '
wc -c <.git/objects/pack/pack-*.pack
'

test_done

0 comments on commit 1d621e1

Please sign in to comment.