Perf tracking #781
State of the world before we start optimizing things too much:
After #782
Mild speedup, but huge reduction in CPU usage. You can see the relative length of
After chainguard-dev/go-apk#74
Shaved off another ~3s.
After chainguard-dev/go-apk#75
Shaved off a little under a second.
Been a while since an update... Here's a cold cache:
Here's warm:
Notably, a cold-cache build is now faster than a warm-cache build was when we started this effort 🎉
I have a branch that gets us down to ~3s on the hot path, but it's a bit of a dead end because it mostly just makes the work we're already doing a little more concurrent, which doesn't actually help that much in a build-the-world scenario.

This is HEAD:

This is my branch:

At least in these two flamegraphs, the exact same

Looking at where we're spending that time... about a third of our CPU time is in pgzip compressing the final layer. Since we're doing a parallel compression, this only takes ~850ms, so that's about the speed of light for us on a hot path.

We spend ~1.5s serially writing things to disk and walking the filesystem to read them back from disk. Meanwhile, we are gunzipping the data section of each APK, so we are paying that time 2x (just concurrently).

A bit of a surprising result is that we spent a third of a second just cleaning up the temporary directory we created.

Then we spend a surprising amount of time pushing images, but that's mostly because Docker Desktop won't stop touching my config file 🙄
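For reference, a minimal sketch of the parallel gzip compression mentioned above using klauspost/pgzip; the block size and worker count here are illustrative, not apko's actual settings:

```go
// Sketch only: parallel gzip compression with klauspost/pgzip, roughly the
// shape of compressing the final layer. Settings below are made up.
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/pgzip"
)

func compressLayer(dst io.Writer, src io.Reader) error {
	zw := pgzip.NewWriter(dst)
	// Compress 1MB blocks on up to 8 goroutines; wall-clock time drops but
	// CPU usage scales with the worker count, which is why this phase
	// dominates the CPU profile.
	if err := zw.SetConcurrency(1<<20, 8); err != nil {
		return err
	}
	if _, err := io.Copy(zw, src); err != nil {
		return err
	}
	return zw.Close()
}

func main() {
	if err := compressLayer(os.Stdout, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```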
If we drop that, things look a little better here:

The SBOM generation is still pretty slow. I'm going to see if I can shift some of that left, but I managed to cut 1/3 of it in #801.

Anyway, looking back at where we're spending our time, very roughly: 1s writing a bunch of files to disk.

I have a plan to index the data section of APKs (really, just extract the tar headers) when we download them for the first time, then use that to avoid writing everything to disk. Instead, we can figure out what we would write to disk, which of those files would get overwritten by subsequent APKs, and which files would be affected by the apko config stuff (chmod and whatnot)... then we just (in parallel) compress the subset of files from each APK that would have ended up in the final layer, and at the very end append all these gzipped tarballs together with a bunch of metadata we compute along the way. It should look something like this:
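As a rough illustration of the indexing step in that plan (the `entry` and `indexAPK` names are made up, not go-apk's API): walk each APK's decompressed tar stream once at download time and record the headers plus approximate data offsets, so later builds can decide which files survive into the final layer without extracting everything to disk.

```go
// Sketch: build an index of an APK data section's tar headers.
package apkindex

import (
	"archive/tar"
	"io"
)

type entry struct {
	Header *tar.Header
	Offset int64 // approximate offset of the file data within the decompressed stream
}

func indexAPK(r io.Reader) ([]entry, error) {
	var (
		entries []entry
		cr      = &countingReader{r: r}
		tr      = tar.NewReader(cr)
	)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return entries, nil
		}
		if err != nil {
			return nil, err
		}
		entries = append(entries, entry{Header: hdr, Offset: cr.n})
	}
}

// countingReader tracks how many bytes the tar reader has consumed.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}
```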
That last bit of compressing stuff will now happen even more concurrently than with pgzip, so I'm guessing that will bring us well under a second (on a hot path).

The next step after that would be to write some fun software that takes advantage of some details in DEFLATE to modify/recompress the existing APKs' data sections much more efficiently, which would shrink that latency by ~4-5x and get us closer to 250ms, at which point it will make some sense to revisit where we are spending our time.
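For anyone wondering why appending independently compressed tarballs is valid at all: the gzip format allows multiple members back to back, and Go's compress/gzip reader consumes the concatenation as a single stream by default. A tiny demonstration (this ignores the separate detail of trimming tar end-of-archive trailers):

```go
// Demonstrates that concatenated gzip members decompress as one stream,
// which is what makes "compress each APK's surviving files in parallel,
// then append the results" work.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gz compresses a string into a standalone gzip member.
func gz(s string) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte(s))
	w.Close()
	return buf.Bytes()
}

func main() {
	// Two members compressed independently (in a real build, concurrently),
	// then appended byte-for-byte.
	combined := append(gz("contents of apk A\n"), gz("contents of apk B\n")...)

	r, err := gzip.NewReader(bytes.NewReader(combined))
	if err != nil {
		log.Fatal(err)
	}
	out, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // both payloads, read back as a single stream
}
```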
So I'm not sure that we have the same performance constraints here, but you may find pex-tool/pex#2175 interesting, especially the
This isn't exactly relevant except that I happened to be working on it at the same time as the above, but in pypa/pip#12184 (comment) I demonstrate the performance impact of creating a local index for pip, which gets lazily updated as it crawls dependencies. Since I believe we discussed one result of this being the publication of indices for positions referencing some other compressed targz stream, I wanted to note instead that in a related but different application, I was able to generate local indices for resources as they were crawled, amortizing that transformation per node. I would recommend trying that approach first here if you haven't solved the problem already by now.
It would also seem very much within the scope of something like

The following is mostly a note to self: I'll create a separate issue if I have further thoughts on any of this and stop derailing this thread.
Although, regarding this approach in particular:
In order to execute build processes in isolated chroots that can be cached and executed remotely via the bazel remexec api, pants maintains a virtual filesystem consisting of merkle trees stored in an LMDB content-addressed store, which can be efficiently synced against a remote database (since the db only contains a mapping of

Your problem here can be solved without the global deduplication that pants performs, but I wanted to mention how encoding directory contents into merkle trees is a useful general approach for performing (as you said) "figure out what we would write to disk, what files would get overwritten by what would be written to disk by subsequent APKs, and also what files would be affected by the apko config stuff (chmod and whatnot)...". This act of normalization into a db-friendly format (in pants's case, converting directory trees into protobufs referencing other entities by checksum) may be the link that lets us meaningfully generalize this into a library, one which:
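As a rough illustration of the merkle-tree idea (my sketch, not pants's actual implementation): a directory's digest is a hash over its entries' names, modes, and child digests, so unchanged subtrees can be recognized, deduplicated, or merged by digest alone without re-reading their contents.

```go
// Sketch: content-addressed digest of a directory tree.
package merkle

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
)

// DigestTree hashes a directory by hashing each entry's name, mode, and
// child digest (file content hash or recursive tree hash).
func DigestTree(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name() < entries[j].Name() })

	h := sha256.New()
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		var child string
		if e.IsDir() {
			if child, err = DigestTree(path); err != nil {
				return "", err
			}
		} else {
			f, err := os.Open(path)
			if err != nil {
				return "", err
			}
			fh := sha256.New()
			if _, err := io.Copy(fh, f); err != nil {
				f.Close()
				return "", err
			}
			f.Close()
			child = hex.EncodeToString(fh.Sum(nil))
		}
		info, err := e.Info()
		if err != nil {
			return "", err
		}
		fmt.Fprintf(h, "%s %o %s\n", e.Name(), info.Mode(), child)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```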
After chainguard-dev/go-apk#98

Cold: ~11s -> 4.9s

This came mostly from being able to fetch and decompress in parallel, which speeds up the installation phase.

Hot: ~4.2s -> 2.6s

We still have that faster install phase, but we get to skip the fetch phase entirely.
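For context, a hedged sketch of the fetch-and-decompress-in-parallel shape using errgroup; the URLs, the concurrency limit, and the discard-instead-of-untar step are placeholders rather than go-apk's actual code:

```go
// Sketch: fetch APKs concurrently and decompress each one as it arrives.
package main

import (
	"compress/gzip"
	"context"
	"io"
	"log"
	"net/http"

	"golang.org/x/sync/errgroup"
)

func fetchAll(ctx context.Context, urls []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // bounded concurrency (placeholder value)

	for _, url := range urls {
		url := url
		g.Go(func() error {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()

			zr, err := gzip.NewReader(resp.Body)
			if err != nil {
				return err
			}
			// Decompress while other APKs are still downloading; a real
			// implementation would untar into the build context here.
			_, err = io.Copy(io.Discard, zr)
			return err
		})
	}
	return g.Wait()
}

func main() {
	if err := fetchAll(context.Background(), []string{ /* APK URLs */ }); err != nil {
		log.Fatal(err)
	}
}
```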
After #860

Using
With #867

Building

cold: 4.9s -> 3.7s

We are mostly limited here by how quickly we can fetch and decompress each APK. We definitely leave some performance on the table by limiting our concurrency during that phase... maybe worth looking into.

hot: 2.6s -> 1.5s

We spend most of our time now in pgzip, with a bit of time burned doing TLS handshakes at the beginning and SBOM generation (giant JSON document rendering) at the end.

offline: 2.4s -> 1.3s
The next phase is to take this (CPU) hungry hungry pgzippopotamus and replace it with something that can go faster with less CPU. I'd even be fine with a slightly slower implementation that would use much less CPU.

There is a particularly ambitious optimization we could perform where we stitch together pre-existing DEFLATE streams when we know that their decompressed contents are identical, which would let us reuse the CPU-intensive parts of compressing all these files. Ensuring that the decompressed contents are identical is very difficult in the general case, but we can skip that difficulty by taking advantage of APK checksums, where we already do know that the contents are identical. This requires writing a custom DEFLATE encoder, which might be out of reach for the amount of time I have here, but I want to write it here for posterity in case I come back to it.
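This isn't the custom DEFLATE encoder described above, but a simplified sketch of the same reuse principle: key already-compressed gzip members by the checksum the APK index records, so content we already know is byte-identical never gets recompressed. The `Cache` type and method names are hypothetical.

```go
// Sketch: reuse compressed output for content known (by checksum) to be identical.
package reuse

import (
	"bytes"
	"compress/gzip"
	"sync"
)

type Cache struct {
	mu sync.Mutex
	m  map[string][]byte // APK-recorded checksum -> gzip member
}

func NewCache() *Cache { return &Cache{m: make(map[string][]byte)} }

// Compressed returns a gzip member for the given content, reusing a prior
// compression when the checksum has been seen before.
func (c *Cache) Compressed(checksum string, content []byte) ([]byte, error) {
	c.mu.Lock()
	if b, ok := c.m[checksum]; ok {
		c.mu.Unlock()
		return b, nil
	}
	c.mu.Unlock()

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.m[checksum] = buf.Bytes()
	c.mu.Unlock()
	return buf.Bytes(), nil
}
```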
Opening this as a meta-issue to track low-hanging fruit for perf wins.