[RFC] allow configuring a compression algorithm for logs #15159

CyberShadow wants to merge 3 commits into NixOS:master
Conversation
Centralizes the mapping from CompressionAlgo to file extension (e.g. bzip2 → ".bz2", zstd → ".zst") in a single function, replacing the ad-hoc inline mapping in binary-cache-store.cc. This will also be used in the next commit to support configurable build log compression algorithms.
Change the `compress-build-log` setting from a boolean to a compression algorithm name (e.g. `zstd`, `gzip`, `none`), using the existing `CompressionAlgo` enum. The default remains `bzip2`. For backward compatibility, `true` maps to `bzip2` and `false` maps to `none`. On the command line, `--compress-build-log <method>` now takes an algorithm argument, and `--no-compress-build-log` disables compression. This also updates `addBuildLog()` to respect the setting (it previously always hardcoded bzip2), and `getBuildLogExact()` to try all known compressed extensions when reading logs, so old logs remain readable after switching algorithms. Motivation: bzip2 uses large blocks (up to 900KB) which makes partially written build logs unreadable during builds. Streaming-friendly formats like zstd or gzip flush data in smaller frames, allowing tools like `nix log` to display in-progress build output.
Update the functional test to use the new `--compress-build-log <method>` syntax. Add test cases for zstd and no-compression modes alongside the existing bzip2 test. Add a release note documenting the change.
Rather than making it configurable, maybe we should just pick a better algorithm (e.g. zstd) and switch to it.
We can certainly do that. A benchmark/comparison from Claude, measuring recoverability/streamability vs. compression quality:

Build log compression benchmark

Benchmark of compression algorithms for Nix build logs, measuring compression ratio, speed, and streaming recoverability. Uses libarchive for both compression and decompression -- the same library Nix uses -- so the results reflect what Nix itself would see.

Test data

Three real build logs: gcc-cross (11.3 MB), linux-kernel (4.6 MB), vm-test (612 KB).

Compression ratio & speed

Best of 3 runs per algorithm, using libarchive default settings. (Full table omitted from this transcript.) Summary: zstd achieves nearly the same compression ratio as bzip2 while being much faster.

Streaming recoverability

The key metric: if the compressed file is truncated (build still in progress), how much of the original log can be recovered? Method: compress the full log, truncate the compressed output at X% of its size, and measure how many bytes decompress cleanly. (Per-sample tables for gcc-cross, linux-kernel, and vm-test omitted.) Notably, bzip2 recovers zero bytes from the 612 KB vm-test log at every truncation point.

First-byte latency

Minimum compressed bytes that must be written before libarchive can recover any output. Note: for the vm-test sample, bzip2's "min compressed" of 87.4 KB equals the entire compressed file -- the whole 612 KB log fits in a single bzip2 block.

Effective block/frame sizes

Size of each independently-decompressible unit, as seen by libarchive, measured by decompressing at increasing truncation offsets and recording each jump in recovered bytes. (Per-sample tables omitted.)

Conclusions

bzip2 is the worst choice for build log compression: its 896 KB block size means nothing can be read until a full block has been written. zstd is the recommended replacement.

gzip and xz also stream well through libarchive (64 KB blocks), but gzip compresses worse and xz compresses more slowly.

Benchmark source code

Compile with: cc -O2 -o bench-libarchive bench-libarchive.c -larchive

/*
* Benchmark compression algorithms for Nix build logs using libarchive.
*
* This uses the exact same library that Nix uses for compression/decompression,
* giving realistic results for streaming recoverability of truncated logs.
*
* Compile: cc -O2 -o bench-libarchive bench-libarchive.c -larchive
*/
#include <archive.h>
#include <archive_entry.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <errno.h>
/* ── Dynamic buffer ──────────────────────────────────────────────── */
typedef struct {
char *data;
size_t len;
size_t cap;
} Buffer;
static void buf_init(Buffer *b) {
b->data = NULL;
b->len = 0;
b->cap = 0;
}
static void buf_free(Buffer *b) {
free(b->data);
buf_init(b);
}
static void buf_append(Buffer *b, const void *data, size_t len) {
if (b->len + len > b->cap) {
size_t newcap = b->cap ? b->cap * 2 : 65536;
while (newcap < b->len + len) newcap *= 2;
        char *p = realloc(b->data, newcap);
        if (!p) { perror("realloc"); exit(1); }
        b->data = p;
        b->cap = newcap;
}
memcpy(b->data + b->len, data, len);
b->len += len;
}
/* ── Compression via libarchive ──────────────────────────────────── */
typedef int (*add_filter_fn)(struct archive *);
struct algo_info {
const char *name;
add_filter_fn add_filter;
const char *filter_name; /* for archive_write_set_filter_option */
const char *extension;
};
static struct algo_info ALGOS[] = {
{"bzip2", archive_write_add_filter_bzip2, "bzip2", ".bz2"},
{"gzip", archive_write_add_filter_gzip, "gzip", ".gz"},
{"zstd", archive_write_add_filter_zstd, "zstd", ".zst"},
{"xz", archive_write_add_filter_xz, "xz", ".xz"},
};
#define N_ALGOS (sizeof(ALGOS) / sizeof(ALGOS[0]))
/* Write callback for compression: appends to a Buffer */
struct write_ctx {
Buffer *buf;
};
static ssize_t write_cb(struct archive *a, void *client, const void *buffer, size_t length) {
struct write_ctx *ctx = client;
buf_append(ctx->buf, buffer, length);
return (ssize_t)length;
}
/* Compress `input` using `algo`, store result in `output` */
static int do_compress(struct algo_info *algo, const char *input, size_t input_len, Buffer *output) {
struct archive *a = archive_write_new();
if (!a) return -1;
struct write_ctx ctx = {.buf = output};
output->len = 0;
algo->add_filter(a);
archive_write_set_format_raw(a);
archive_write_set_bytes_per_block(a, 0);
archive_write_set_bytes_in_last_block(a, 1);
archive_write_open(a, &ctx, NULL, write_cb, NULL);
struct archive_entry *ae = archive_entry_new();
archive_entry_set_filetype(ae, AE_IFREG);
archive_write_header(a, ae);
archive_entry_free(ae);
archive_write_data(a, input, input_len);
archive_write_close(a);
archive_write_free(a);
return 0;
}
/* Read callback for decompression: reads from a buffer */
struct read_ctx {
const char *data;
size_t len;
size_t pos;
};
static ssize_t read_cb(struct archive *a, void *client, const void **buffer) {
struct read_ctx *ctx = client;
if (ctx->pos >= ctx->len) return 0;
*buffer = ctx->data + ctx->pos;
size_t avail = ctx->len - ctx->pos;
size_t chunk = avail < 65536 ? avail : 65536;
ctx->pos += chunk;
return (ssize_t)chunk;
}
/*
* Decompress `input[0..input_len)` using libarchive auto-detection.
* Returns number of decompressed bytes, or -1 on total failure.
* Tolerant of truncated streams.
*/
static ssize_t do_decompress(const char *input, size_t input_len, Buffer *output) {
if (!input || input_len == 0) return 0;
struct archive *a = archive_read_new();
if (!a) return -1;
output->len = 0;
archive_read_support_filter_all(a);
archive_read_support_format_raw(a);
struct read_ctx ctx = {.data = input, .len = input_len, .pos = 0};
int r = archive_read_open(a, &ctx, NULL, read_cb, NULL);
if (r != ARCHIVE_OK) {
archive_read_free(a);
return -1;
}
struct archive_entry *ae;
r = archive_read_next_header(a, &ae);
if (r != ARCHIVE_OK) {
archive_read_free(a);
return 0; /* can't even read header — 0 bytes recovered */
}
/* Read as much decompressed data as possible */
char buf[65536];
for (;;) {
ssize_t n = archive_read_data(a, buf, sizeof(buf));
if (n > 0) {
buf_append(output, buf, (size_t)n);
} else {
break; /* EOF, error, or truncated — stop */
}
}
archive_read_free(a);
return (ssize_t)output->len;
}
/* ── File I/O ────────────────────────────────────────────────────── */
static char *read_file(const char *path, size_t *out_len) {
FILE *f = fopen(path, "rb");
if (!f) { perror(path); return NULL; }
fseek(f, 0, SEEK_END);
long len = ftell(f);
fseek(f, 0, SEEK_SET);
    char *data = malloc(len);
    if (!data) { fclose(f); return NULL; }
    if (fread(data, 1, len, f) != (size_t)len) { free(data); fclose(f); return NULL; }
fclose(f);
*out_len = (size_t)len;
return data;
}
/* Decompress a .bz2 file to get the raw log */
static char *decompress_bz2_file(const char *path, size_t *out_len) {
size_t comp_len;
char *comp = read_file(path, &comp_len);
if (!comp) return NULL;
Buffer output;
buf_init(&output);
ssize_t r = do_decompress(comp, comp_len, &output);
free(comp);
if (r < 0) { buf_free(&output); return NULL; }
*out_len = output.len;
return output.data;
}
/* ── Formatting helpers ──────────────────────────────────────────── */
static void human_size(char *buf, size_t buflen, size_t n) {
if (n < 1024)
snprintf(buf, buflen, "%zu B", n);
else if (n < 1024 * 1024)
snprintf(buf, buflen, "%.1f KB", n / 1024.0);
else
snprintf(buf, buflen, "%.1f MB", n / (1024.0 * 1024.0));
}
static double now_sec(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
/* ── Benchmark ───────────────────────────────────────────────────── */
struct sample {
const char *name;
char *data;
size_t len;
};
int main(int argc, char **argv) {
/*
* Usage: bench-libarchive [logfile.bz2 ...]
*
* If no arguments are given, uses hardcoded paths to sample logs.
* If arguments are given, each should be a .bz2-compressed log file.
*/
struct sample samples[16];
int loaded = 0;
printf("=== Loading sample build logs ===\n\n");
if (argc > 1) {
for (int i = 1; i < argc && loaded < 16; i++) {
size_t len;
char *data = decompress_bz2_file(argv[i], &len);
if (data) {
/* Use basename as sample name */
const char *name = strrchr(argv[i], '/');
samples[loaded].name = name ? name + 1 : argv[i];
samples[loaded].data = data;
samples[loaded].len = len;
char hs[32];
human_size(hs, sizeof(hs), len);
printf(" %s: %s\n", samples[loaded].name, hs);
loaded++;
}
}
} else {
struct { const char *name; const char *path; } defaults[] = {
{"gcc-cross", "/nix/var/log/nix/drvs/cq/6vldgqkcbka93kbmxy1ji9arnsxni2-armv7l-unknown-linux-musleabihf-gcc-14.3.0.drv.bz2"},
{"linux-kernel", "/nix/var/log/nix/drvs/pi/fxyb7a6z4pblflrbrqnlh9rhznyffh-linux-6.15.11.drv.bz2"},
{"vm-test", "/nix/var/log/nix/drvs/16/7iyxzvppyagla72yz0rj94l5744ar8-vm-test-run-k4-21a-health-checks.drv.bz2"},
};
for (int i = 0; i < 3; i++) {
size_t len;
char *data = decompress_bz2_file(defaults[i].path, &len);
if (data) {
samples[loaded].name = defaults[i].name;
samples[loaded].data = data;
samples[loaded].len = len;
char hs[32];
human_size(hs, sizeof(hs), len);
printf(" %s: %s\n", samples[loaded].name, hs);
loaded++;
}
}
}
printf("\n");
if (loaded == 0) {
fprintf(stderr, "No sample logs found. Pass .bz2 log files as arguments.\n");
return 1;
}
/* Storage for compressed data per (sample, algo) */
Buffer compressed[16][N_ALGOS];
for (int s = 0; s < loaded; s++)
for (size_t a = 0; a < N_ALGOS; a++)
buf_init(&compressed[s][a]);
/* ── 1. Compression ratio & speed ─────────────────────────── */
printf("=== Compression ratio & speed ===\n\n");
printf("%-10s %-16s %10s %12s %8s %12s %12s\n",
"Algorithm", "Sample", "Original", "Compressed", "Ratio", "Comp MB/s", "Decomp MB/s");
for (int i = 0; i < 82; i++) putchar('-');
putchar('\n');
for (int s = 0; s < loaded; s++) {
for (size_t a = 0; a < N_ALGOS; a++) {
Buffer *comp = &compressed[s][a];
Buffer decomp_buf;
buf_init(&decomp_buf);
double best_comp = 1e9;
for (int run = 0; run < 3; run++) {
comp->len = 0;
double t0 = now_sec();
do_compress(&ALGOS[a], samples[s].data, samples[s].len, comp);
double dt = now_sec() - t0;
if (dt < best_comp) best_comp = dt;
}
double best_decomp = 1e9;
for (int run = 0; run < 3; run++) {
decomp_buf.len = 0;
double t0 = now_sec();
do_decompress(comp->data, comp->len, &decomp_buf);
double dt = now_sec() - t0;
if (dt < best_decomp) best_decomp = dt;
}
double ratio = (double)comp->len / samples[s].len * 100.0;
double comp_speed = best_comp > 0.001 ? samples[s].len / 1048576.0 / best_comp : 0;
double decomp_speed = best_decomp > 0.001 ? samples[s].len / 1048576.0 / best_decomp : 0;
char hs_orig[32], hs_comp[32];
human_size(hs_orig, sizeof(hs_orig), samples[s].len);
human_size(hs_comp, sizeof(hs_comp), comp->len);
printf("%-10s %-16s %10s %12s %7.1f%% %11.1f %11.1f\n",
ALGOS[a].name, samples[s].name, hs_orig, hs_comp,
ratio, comp_speed, decomp_speed);
buf_free(&decomp_buf);
}
}
printf("\n");
/* ── 2. Streaming recoverability ──────────────────────────── */
printf("=== Streaming recoverability ===\n\n");
int trunc_pcts[] = {5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95};
int n_pcts = sizeof(trunc_pcts) / sizeof(trunc_pcts[0]);
Buffer decomp_buf;
buf_init(&decomp_buf);
for (int s = 0; s < loaded; s++) {
char hs[32];
human_size(hs, sizeof(hs), samples[s].len);
printf("-- %s (%s uncompressed) --\n", samples[s].name, hs);
printf("%8s", "Written");
for (size_t a = 0; a < N_ALGOS; a++)
printf(" %18s", ALGOS[a].name);
printf("\n");
for (int i = 0; i < 8 + (int)N_ALGOS * 20; i++) putchar('-');
putchar('\n');
for (int p = 0; p < n_pcts; p++) {
printf("%6d%% ", trunc_pcts[p]);
for (size_t a = 0; a < N_ALGOS; a++) {
Buffer *comp = &compressed[s][a];
size_t trunc_bytes = comp->len * trunc_pcts[p] / 100;
decomp_buf.len = 0;
ssize_t recovered = do_decompress(comp->data, trunc_bytes, &decomp_buf);
if (recovered < 0) recovered = 0;
double recov_pct = (double)recovered / samples[s].len * 100.0;
char hs2[32];
human_size(hs2, sizeof(hs2), (size_t)recovered);
printf(" %6.1f%% (%8s)", recov_pct, hs2);
}
printf("\n");
}
printf("%6d%% (all algorithms: 100%%)\n\n", 100);
}
/* ── 3. First-byte latency ────────────────────────────────── */
printf("=== First-byte latency ===\n\n");
printf("%-10s %-16s %16s %16s\n",
"Algorithm", "Sample", "Min compressed", "First recovery");
for (int i = 0; i < 62; i++) putchar('-');
putchar('\n');
for (int s = 0; s < loaded; s++) {
for (size_t a = 0; a < N_ALGOS; a++) {
Buffer *comp = &compressed[s][a];
size_t lo = 1, hi = comp->len;
while (lo < hi) {
size_t mid = (lo + hi) / 2;
decomp_buf.len = 0;
ssize_t r = do_decompress(comp->data, mid, &decomp_buf);
if (r > 0) hi = mid; else lo = mid + 1;
}
decomp_buf.len = 0;
ssize_t first_recovered = do_decompress(comp->data, lo, &decomp_buf);
if (first_recovered < 0) first_recovered = 0;
char hs_min[32], hs_rec[32];
human_size(hs_min, sizeof(hs_min), lo);
human_size(hs_rec, sizeof(hs_rec), (size_t)first_recovered);
printf("%-10s %-16s %16s %16s\n",
ALGOS[a].name, samples[s].name, hs_min, hs_rec);
}
}
printf("\n");
/* ── 4. Effective block/frame sizes ───────────────────────── */
printf("=== Effective block/frame sizes ===\n\n");
for (int s = 0; s < loaded; s++) {
printf("-- %s --\n", samples[s].name);
printf("%-10s %8s %12s %12s %12s %12s\n",
"Algorithm", "Blocks", "Min block", "Max block", "Avg block", "Median");
for (int i = 0; i < 70; i++) putchar('-');
putchar('\n');
for (size_t a = 0; a < N_ALGOS; a++) {
Buffer *comp = &compressed[s][a];
size_t step = comp->len / 500;
if (step < 1) step = 1;
size_t prev_recovered = 0;
size_t *block_sizes = NULL;
int n_blocks = 0, block_cap = 0;
for (size_t offset = step; offset <= comp->len; offset += step) {
decomp_buf.len = 0;
ssize_t r = do_decompress(comp->data, offset, &decomp_buf);
size_t recovered = r > 0 ? (size_t)r : 0;
if (recovered > prev_recovered) {
size_t block = recovered - prev_recovered;
if (n_blocks >= block_cap) {
block_cap = block_cap ? block_cap * 2 : 64;
block_sizes = realloc(block_sizes, sizeof(size_t) * block_cap);
}
block_sizes[n_blocks++] = block;
prev_recovered = recovered;
}
}
if (prev_recovered < samples[s].len) {
if (n_blocks >= block_cap) {
block_cap = block_cap ? block_cap * 2 : 64;
block_sizes = realloc(block_sizes, sizeof(size_t) * block_cap);
}
block_sizes[n_blocks++] = samples[s].len - prev_recovered;
}
if (n_blocks > 0) {
for (int i = 0; i < n_blocks - 1; i++)
for (int j = i + 1; j < n_blocks; j++)
if (block_sizes[i] > block_sizes[j]) {
size_t tmp = block_sizes[i];
block_sizes[i] = block_sizes[j];
block_sizes[j] = tmp;
}
size_t total = 0, min_b = block_sizes[0], max_b = block_sizes[0];
for (int i = 0; i < n_blocks; i++) {
total += block_sizes[i];
if (block_sizes[i] < min_b) min_b = block_sizes[i];
if (block_sizes[i] > max_b) max_b = block_sizes[i];
}
char hs_min[32], hs_max[32], hs_avg[32], hs_med[32];
human_size(hs_min, sizeof(hs_min), min_b);
human_size(hs_max, sizeof(hs_max), max_b);
human_size(hs_avg, sizeof(hs_avg), total / n_blocks);
human_size(hs_med, sizeof(hs_med), block_sizes[n_blocks / 2]);
printf("%-10s %8d %12s %12s %12s %12s\n",
ALGOS[a].name, n_blocks, hs_min, hs_max, hs_avg, hs_med);
}
free(block_sizes);
}
printf("\n");
}
/* Cleanup */
buf_free(&decomp_buf);
for (int s = 0; s < loaded; s++) {
free(samples[s].data);
for (size_t a = 0; a < N_ALGOS; a++)
buf_free(&compressed[s][a]);
}
return 0;
}

The numbers point at zstd as the current champion, and it does seem to be universally well-liked today. Granted, 128K is still a lot of text, so it's not a qualitative improvement, but it does seem to beat bzip2 in every aspect quantitatively. Shall I amend this to switch compression of new logs to zstd?
This is a low risk strategy but also an incomplete solution. A complete solution would return everything at byte or line granularity, but this requires either an uncompressed buffer file on disk (and a good plan for sync) or an architectural change like daemon threads that share memory. For sync I imagine you could use fresh buffer files for each e.g. 1 MB block and give them a sequence number. Only delete the uncompressed block when the compressed stream includes it in whole. Doesn't seem too complicated after all, but then I haven't implemented it. This is not a review - just an idea.
So even at level 15 (which is still fast), … (Fun fact: at levels 16-18, the output becomes bigger... Level 19 is a lot smaller (335623 bytes) but also much slower.)
Motivation
The `compress-build-log` setting currently only accepts `true` (bzip2) or `false` (none). This is an unfortunate choice of compression format, because bzip2 uses large blocks (up to 900KB) that cannot be decompressed until fully written. This means partially-written build logs are unreadable -- if you start `nix build` without `-L` and later want to check progress, there is no way to see the in-progress log.

The only workaround today is `compress-build-log = false`, which gives up compression entirely.

This is frustrating because most modern compression formats (zstd, gzip, lz4) use small frames that can be decompressed independently. A partially-written zstd file, for example, is decompressible up to the last complete frame -- you get everything except the last few KB. This would make `nix log` work on in-progress builds (returning whatever has been flushed so far), and would also let users `zstdcat` log files from another terminal while a build is running.

The problem in detail

1. You start `nix build nixpkgs#something` without `-L`.
2. Later, you want to see what's happening.
3. `nix log` returns nothing -- the log file exists on disk but is a partial bzip2 stream that cannot be decompressed.
4. Your only option is to restart the build with `-L`.

With a streaming-friendly format like zstd, step 3 would return all log output up to the last flushed frame.
Proposed solution
Make the compression format configurable by extending `compress-build-log` to accept an algorithm name. This gives the best of both worlds: compressed logs that are also partially readable during builds.
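For example, an illustrative nix.conf fragment using the values described in this RFC:

```
# /etc/nix/nix.conf (illustrative)
compress-build-log = zstd   # or gzip, bzip2, none; true/false still accepted
```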
While this does not enable real-time streaming, it does allow you to see all but the last few KB of the log file. Changing the default to `zstd` could be considered separately in the future.

Attached patch
A machine-generated patch is attached that implements this feature. Summary of what it does:
- Extends `compress-build-log` to accept a compression algorithm name (e.g. `zstd`, `gzip`, `none`) in addition to the existing `true`/`false` values
- The default remains `bzip2`; `true` maps to `bzip2` and `false` maps to `none` for backward compatibility
- `--no-compress-build-log` on the command line still works; `--compress-build-log` now takes a `<method>` argument
- Log reading (`nix log`, `nix-store -l`) tries all known compressed extensions (`.bz2`, `.zst`, `.xz`, `.gz`, `.lz4`, `.br`), so old logs remain readable after switching algorithms
- `addBuildLog()` (used when receiving logs from remote builders) now also respects the setting -- it previously hardcoded bzip2 regardless of the `compress-build-log` value
- Adds `compressionAlgoExtension()` to centralize the algo-to-file-extension mapping that was previously duplicated inline in `binary-cache-store.cc`

Builds on #15020 which introduced the `CompressionAlgo` enum and `Setting<CompressionAlgo>` infrastructure.