Perf tracking #781
State of the world before we start optimizing things too much:
After #782
Mild speedup, but huge reduction in CPU usage. You can see the relative length of
After chainguard-dev/go-apk#74
Shaved off another ~3s.
After chainguard-dev/go-apk#75
Shaved off a little under a second.
Been a while since an update... Here's a cold cache:
Here's warm:
Notably, a cold-cache build is now faster than a warm-cache build was when we started this effort 🎉
I have a branch that gets us down to ~3s on the hot path, but it's a bit of a dead end because it mostly just makes the work we're already doing a little more concurrent, which doesn't actually help that much in a build-the-world scenario.

This is HEAD:

This is my branch:

At least in these two flamegraphs, the exact same

Looking at where we're spending that time... about a third of our CPU time is in pgzip compressing the final layer. Since we're doing a parallel compression, this only takes ~850ms, so that's about the speed of light for us on a hot path.

We spend ~1.5s serially writing things to disk and walking the filesystem to read them back from disk. Meanwhile, we are gunzipping the data section of each APK, so we are paying that time 2x (just concurrently).

A bit of a surprising result is that we spent a third of a second just cleaning up the temporary directory we created.

Then we spend a surprising amount of time pushing images, but that's mostly because Docker Desktop won't stop touching my config file 🙄
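For reference, a minimal sketch of the parallel gzip compression mentioned above using klauspost/pgzip; the block size and worker count here are illustrative, not apko's actual settings:

```go
// Sketch only: parallel gzip compression with klauspost/pgzip, roughly the
// shape of compressing the final layer. Settings below are made up.
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/pgzip"
)

func compressLayer(dst io.Writer, src io.Reader) error {
	zw := pgzip.NewWriter(dst)
	// Compress 1MB blocks on up to 8 goroutines; wall-clock time drops but
	// CPU usage scales with the worker count, which is why this phase
	// dominates the CPU profile.
	if err := zw.SetConcurrency(1<<20, 8); err != nil {
		return err
	}
	if _, err := io.Copy(zw, src); err != nil {
		return err
	}
	return zw.Close()
}

func main() {
	if err := compressLayer(os.Stdout, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```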
If we drop that, things look a little better here:

The SBOM generation is still pretty slow. I'm going to see if I can shift some of that left, but I managed to cut 1/3 of it in #801.

Anyway, looking back at where we're spending our time, very roughly: 1s writing a bunch of files to disk.

I have a plan to index the data section of APKs (really, just extract the tar headers) when we download them for the first time, then use that to avoid writing everything to disk. Instead, we can figure out what we would write to disk, which of those files would get overwritten by subsequent APKs, and which files would be affected by the apko config stuff (chmod and whatnot)... then we just (in parallel) compress the subset of files from each APK that would have ended up in the final layer, and at the very end append all these gzipped tarballs together with a bunch of metadata we compute along the way. It should look something like this:
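As a rough illustration of the indexing step in that plan (the `entry` and `indexAPK` names are made up, not go-apk's API): walk each APK's decompressed tar stream once at download time and record the headers plus approximate data offsets, so later builds can decide which files survive into the final layer without extracting everything to disk.

```go
// Sketch: build an index of an APK data section's tar headers.
package apkindex

import (
	"archive/tar"
	"io"
)

type entry struct {
	Header *tar.Header
	Offset int64 // approximate offset of the file data within the decompressed stream
}

func indexAPK(r io.Reader) ([]entry, error) {
	var (
		entries []entry
		cr      = &countingReader{r: r}
		tr      = tar.NewReader(cr)
	)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return entries, nil
		}
		if err != nil {
			return nil, err
		}
		entries = append(entries, entry{Header: hdr, Offset: cr.n})
	}
}

// countingReader tracks how many bytes the tar reader has consumed.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}
```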
That last bit of compressing stuff will now happen even more concurrently than with pgzip, so I'm guessing that will bring us well under a second (on a hot path).

The next step after that would be to write some fun software that takes advantage of some details in DEFLATE to modify/recompress the existing APKs' data sections much more efficiently, which would shrink that latency by ~4-5x and get us closer to 250ms, at which point it will make some sense to revisit where we are spending our time.
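For anyone wondering why appending independently compressed tarballs is valid at all: the gzip format allows multiple members back to back, and Go's compress/gzip reader consumes the concatenation as a single stream by default. A tiny demonstration (this ignores the separate detail of trimming tar end-of-archive trailers):

```go
// Demonstrates that concatenated gzip members decompress as one stream,
// which is what makes "compress each APK's surviving files in parallel,
// then append the results" work.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gz compresses a string into a standalone gzip member.
func gz(s string) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte(s))
	w.Close()
	return buf.Bytes()
}

func main() {
	// Two members compressed independently (in a real build, concurrently),
	// then appended byte-for-byte.
	combined := append(gz("contents of apk A\n"), gz("contents of apk B\n")...)

	r, err := gzip.NewReader(bytes.NewReader(combined))
	if err != nil {
		log.Fatal(err)
	}
	out, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // both payloads, read back as a single stream
}
```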
So I'm not sure that we have the same performance constraints here, but you may find pex-tool/pex#2175 interesting, especially the
This isn't exactly relevant except that I happened to be working on it at the same time as the above, but in pypa/pip#12184 (comment) I demonstrate the performance impact of creating a local index for pip, which gets lazily updated as it crawls dependencies. Since I believe we discussed one result of this being the publication of indices for positions referencing some other compressed targz stream, I wanted to note instead that in a related but different application, I was able to generate local indices for resources as they were crawled, amortizing that transformation per node. I would recommend trying that approach first here if you haven't solved the problem already by now.
It would also seem very much within the scope of something like

The following is mostly a note to self: I'll create a separate issue if I have further thoughts on any of this and stop derailing this thread.
Although, regarding this approach in particular:
In order to execute build processes in isolated chroots that can be cached and executed remotely via the bazel remexec api, pants maintains a virtual filesystem consisting of merkle trees stored in an LMDB content-addressed store, which can be efficiently synced against a remote database (since the db only contains a mapping of

Your problem here can be solved without the global deduplication that pants performs, but I wanted to mention how encoding directory contents into merkle trees is a useful general approach for performing (as you said) "figure out what we would write to disk, what files would get overwritten by what would be written to disk by subsequent APKs, and also what files would be affected by the apko config stuff (chmod and whatnot)...". This act of normalization into a db-friendly format (in pants's case, converting directory trees into protobufs referencing other entities by checksum) may be the link that lets us meaningfully generalize this into a library, one which:
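As a rough illustration of the merkle-tree idea (my sketch, not pants's actual implementation): a directory's digest is a hash over its entries' names, modes, and child digests, so unchanged subtrees can be recognized, deduplicated, or merged by digest alone without re-reading their contents.

```go
// Sketch: content-addressed digest of a directory tree.
package merkle

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
)

// DigestTree hashes a directory by hashing each entry's name, mode, and
// child digest (file content hash or recursive tree hash).
func DigestTree(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name() < entries[j].Name() })

	h := sha256.New()
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		var child string
		if e.IsDir() {
			if child, err = DigestTree(path); err != nil {
				return "", err
			}
		} else {
			f, err := os.Open(path)
			if err != nil {
				return "", err
			}
			fh := sha256.New()
			if _, err := io.Copy(fh, f); err != nil {
				f.Close()
				return "", err
			}
			f.Close()
			child = hex.EncodeToString(fh.Sum(nil))
		}
		info, err := e.Info()
		if err != nil {
			return "", err
		}
		fmt.Fprintf(h, "%s %o %s\n", e.Name(), info.Mode(), child)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```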
After chainguard-dev/go-apk#98

Cold: ~11s -> 4.9s

This came mostly from being able to fetch and decompress in parallel, which speeds up the installation phase.

Hot: ~4.2s -> 2.6s

We still have that faster install phase, but we get to skip the fetch phase entirely.
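For context, a hedged sketch of the fetch-and-decompress-in-parallel shape using errgroup; the URLs, the concurrency limit, and the discard-instead-of-untar step are placeholders rather than go-apk's actual code:

```go
// Sketch: fetch APKs concurrently and decompress each one as it arrives.
package main

import (
	"compress/gzip"
	"context"
	"io"
	"log"
	"net/http"

	"golang.org/x/sync/errgroup"
)

func fetchAll(ctx context.Context, urls []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // bounded concurrency (placeholder value)

	for _, url := range urls {
		url := url
		g.Go(func() error {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()

			zr, err := gzip.NewReader(resp.Body)
			if err != nil {
				return err
			}
			// Decompress while other APKs are still downloading; a real
			// implementation would untar into the build context here.
			_, err = io.Copy(io.Discard, zr)
			return err
		})
	}
	return g.Wait()
}

func main() {
	if err := fetchAll(context.Background(), []string{ /* APK URLs */ }); err != nil {
		log.Fatal(err)
	}
}
```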
After #860

Using
With #867

Building

cold: 4.9s -> 3.7s

We are mostly limited here by how quickly we can fetch and decompress each APK. We definitely leave some performance on the table by limiting our concurrency during that phase... maybe worth looking into.

hot: 2.6s -> 1.5s

We spend most of our time now in pgzip, with a bit of time burned doing TLS handshakes at the beginning and SBOM generation (giant JSON document rendering) at the end.

offline: 2.4s -> 1.3s
The next phase is to take this (CPU) hungry hungry pgzippopotamus and replace it with something that can go faster with less CPU. I'd even be fine with a slightly slower implementation that would use much less CPU.

There is a particularly ambitious optimization we could perform where we stitch together pre-existing DEFLATE streams when we know that their decompressed contents are identical, which would let us reuse the CPU-intensive parts of compressing all these files. Ensuring that the decompressed contents are identical is very difficult in the general case, but we can skip that difficulty by taking advantage of APK checksums, where we already do know that the contents are identical. This requires writing a custom DEFLATE encoder, which might be out of reach for the amount of time I have here, but I want to write it here for posterity in case I come back to it.
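This isn't the custom DEFLATE encoder described above, but a simplified sketch of the same reuse principle: key already-compressed gzip members by the checksum the APK index records, so content we already know is byte-identical never gets recompressed. The `Cache` type and method names are hypothetical.

```go
// Sketch: reuse compressed output for content known (by checksum) to be identical.
package reuse

import (
	"bytes"
	"compress/gzip"
	"sync"
)

type Cache struct {
	mu sync.Mutex
	m  map[string][]byte // APK-recorded checksum -> gzip member
}

func NewCache() *Cache { return &Cache{m: make(map[string][]byte)} }

// Compressed returns a gzip member for the given content, reusing a prior
// compression when the checksum has been seen before.
func (c *Cache) Compressed(checksum string, content []byte) ([]byte, error) {
	c.mu.Lock()
	if b, ok := c.m[checksum]; ok {
		c.mu.Unlock()
		return b, nil
	}
	c.mu.Unlock()

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.m[checksum] = buf.Bytes()
	c.mu.Unlock()
	return buf.Bytes(), nil
}
```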
Opening this as a meta-issue to track low-hanging fruit for perf wins.