perf(task): cache per-file content hashes for source_freshness_hash_contents #9819

Merged: jdx merged 6 commits into main from claude/blissful-sutherland-8392e4 on May 12, 2026
jdx (Owner) commented May 12, 2026

Summary

  • When task.source_freshness_hash_contents = true, every freshness check currently re-reads and re-blake3-hashes every source file (discussion #9802).
  • This caches each file's content hash keyed by (size, mtime_secs, mtime_nanos), git's stat-info trick, in a per-task file under `STATE/task-sources/<key>-content-cache`. Unchanged files are skipped on subsequent runs; a file is re-hashed only when its size or mtime changes.
  • The cache is rebuilt from scratch each run, so entries for files no longer in `sources` are pruned and the cache file stays bounded by the current source set.
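
The stat-keyed reuse and the rebuild-from-scratch pruning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `StatKey`, `CacheEntry`, `rebuild_cache`, and `stand_in_hash` are hypothetical names, file reads go through a caller-supplied function, and `DefaultHasher` stands in for blake3 so the sketch needs only the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Stat info used as the cache-hit predicate (hypothetical struct name).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StatKey {
    pub size: u64,
    pub mtime_secs: i64,
    pub mtime_nanos: u32,
}

/// The stat info seen when the file was last hashed, plus that hash.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CacheEntry {
    pub stat: StatKey,
    pub hash: String,
}

pub type ContentCache = BTreeMap<PathBuf, CacheEntry>;

/// Stand-in for the content hash. The PR uses blake3; DefaultHasher is an
/// assumption made here only to keep the sketch dependency-free.
pub fn stand_in_hash(contents: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Build a fresh cache for `sources`, reusing entries from `old` whose stat
/// info is unchanged. Returns the new cache and how many files were re-hashed.
pub fn rebuild_cache(
    old: &ContentCache,
    sources: &[(PathBuf, StatKey)],
    read: impl Fn(&Path) -> Vec<u8>,
) -> (ContentCache, usize) {
    let mut fresh = ContentCache::new();
    let mut rehashed = 0;
    for (path, stat) in sources {
        let entry = match old.get(path) {
            // Cache hit: size and mtime are identical, so skip reading the file.
            Some(cached) if cached.stat == *stat => cached.clone(),
            // Miss or stale entry: read and hash the contents again.
            _ => {
                rehashed += 1;
                CacheEntry {
                    stat: stat.clone(),
                    hash: stand_in_hash(&read(path.as_path())),
                }
            }
        };
        fresh.insert(path.clone(), entry);
    }
    (fresh, rehashed)
}
```

Because the new map is filled only from `sources`, an entry for a dropped file is never copied over, which is what keeps the cache bounded by the current source set.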

Design notes

  • Serialization: `rmp_serde + zlib`, matching src/cache.rs and src/env_diff.rs. Considered `rkyv` for zero-copy reads, but the bottleneck this PR addresses is blake3 hashing of file contents (ms–s); parsing a `BTreeMap` of ≤1000s of entries with rmp_serde is microseconds and not on the critical path. Adding rkyv would mean a new heavy dep and a brittle byte format inconsistent with the rest of the codebase.
  • Per-task scoping: cache key reuses `sources_hash_path`'s identity hash (task + config_source + root), so any change to the task definition invalidates the cache automatically.
  • Atomic writes: `*.part-XXXX` + rename, same pattern as src/cache.rs:262. Two concurrent runs may lose each other's writes but cannot corrupt the cache.
  • Failure handling: corrupt/unreadable cache → silently fall back to empty (`unwrap_or_default`); failed write → trace-level log. Correctness is preserved either way; only speed is at stake.
  • Cache hit predicate: `size + mtime_secs + mtime_nanos`. False positives only if a file is rewritten with identical size and mtime preserved to the nanosecond — vanishingly rare and the same trick git uses safely.
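
The atomic-write and failure-handling notes above follow a common pattern that can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's implementation: `write_atomic` and `load_or_default` are hypothetical helpers, the temporary suffix uses the process id where the PR uses a random `*.part-XXXX` suffix, and the `parse` callback stands in for the `rmp_serde + zlib` decoding step.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write the bytes to the temporary file, close it, then rename it into place.
fn write_then_rename(partial: &Path, path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = File::create(partial)?;
    file.write_all(bytes)?;
    drop(file); // close the handle before renaming
    fs::rename(partial, path)
}

/// Replace `path` with `bytes` via a sibling temporary file, so a reader sees
/// either the old file or the complete new one, never a partial write.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut name = path.as_os_str().to_owned();
    name.push(format!(".part-{}", std::process::id()));
    let partial = PathBuf::from(name);
    let result = write_then_rename(&partial, path, bytes);
    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = fs::remove_file(&partial);
    }
    result
}

/// Load and parse a cache file. A missing, unreadable, or unparsable file
/// yields the default (empty) value instead of an error.
pub fn load_or_default<T: Default>(path: &Path, parse: impl Fn(&[u8]) -> Option<T>) -> T {
    fs::read(path)
        .ok()
        .and_then(|bytes| parse(bytes.as_slice()))
        .unwrap_or_default()
}
```

A caller would treat an `Err` from `write_atomic` as non-fatal and log it, since a missing cache only costs a full re-hash on the next run.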

Test plan

  • New unit tests in src/task/task_source_checker.rs:
    • `content_hash_cache_reuses_unchanged_files` — second call reuses cached hashes; mutating a file invalidates.
    • `content_hash_cache_prunes_dropped_files` — files removed from `sources` drop out of the cache.
    • `content_hash_cache_round_trips_through_disk` — save/load round-trips; corrupt file falls back to empty.
  • Existing e2e tests still pass: `test_task_source_freshness`, `test_task_source_freshness_with_cwd`, `test_task_run_sources`, `test_task_run_sources_negation`, `test_task_dep_invalidates_sources`.

🤖 Generated with Claude Code


Note

Medium Risk
Touches task invalidation/freshness logic by persisting and reusing per-file content hashes; bugs could cause tasks to incorrectly be treated as fresh or stale, though it falls back safely on cache errors.

Overview
When task.source_freshness_hash_contents is enabled, freshness checks now persist a per-task, per-file blake3 hash cache (keyed by file size + mtime secs/nanos) so unchanged sources aren’t re-read/re-hashed on every run.

This introduces a stable task_state_key to scope state under STATE/task-sources/, adds msgpack+zlib load/save with atomic *.part-* writes (and silent fallback on corrupt cache / trace-only write failures), and expands unit tests to cover cache reuse, pruning, and disk round-tripping.

Reviewed by Cursor Bugbot for commit f1f039b.

…ontents

When `task.source_freshness_hash_contents = true`, every freshness check
re-reads and re-blake3-hashes every source file. This caches each file's
content hash keyed by `(size, mtime_secs, mtime_nanos)` (git stat-info
style) in a per-task file under `STATE/task-sources/<key>-content-cache`,
so unchanged files are skipped on subsequent runs.

The cache is rebuilt from scratch each run so entries for files no longer
in `sources` are pruned; on disk it uses `rmp_serde + zlib` matching the
existing cache convention. A corrupt or unreadable cache file falls back
to empty, so correctness is preserved either way — only speed is at stake.

Closes #9802

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps Bot (Contributor) commented May 12, 2026

Greptile Summary

Adds an on-disk per-file content-hash cache for task.source_freshness_hash_contents, keyed by (size, mtime_secs, mtime_nanos) (git's stat-info trick), stored as msgpack+zlib under STATE/task-sources/*-content-cache. Files whose stat info is unchanged between runs are skipped without re-reading their contents.

  • Refactors the shared filename stem into task_state_key so both the checksum file and the new cache file are invalidated together when the task definition changes.
  • Writes the cache atomically via a random *.part-XXXX temporary file and rename; the previous concern about ZlibEncoder's Drop silently discarding finalization errors is fixed by calling zlib.finish()? explicitly before the rename.
  • Adds three unit tests covering cache reuse, pruning of removed sources, and corrupt-file fallback.

Confidence Score: 5/5

Safe to merge — the cache is keyed correctly, writes are atomic, and every error path falls back to a full re-hash so task freshness decisions remain correct even when the cache is absent or corrupt.

The ZlibEncoder::finish() concern from the prior review round is explicitly fixed. The mtime cast (as_secs() as i64) is symmetric between make_cache_entry and cached_entry_matches, so cache comparisons are always consistent. Pre-epoch mtimes and unreadable cache files both fall back conservatively to a full re-hash. Atomic rename prevents partial-file reads.
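
The symmetric cast and the pre-epoch fallback mentioned above could look roughly like the sketch below. The function names `mtime_parts` and `stat_matches` are hypothetical (the review refers to `make_cache_entry` and `cached_entry_matches`, whose bodies are not shown here); the point is only that both the store side and the compare side go through one conversion, and that a pre-epoch mtime never produces a cache hit.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Split an mtime into (secs, nanos) since the Unix epoch.
/// Returns None for pre-epoch times so the caller falls back to re-hashing.
pub fn mtime_parts(mtime: SystemTime) -> Option<(i64, u32)> {
    let since_epoch: Duration = mtime.duration_since(UNIX_EPOCH).ok()?;
    Some((since_epoch.as_secs() as i64, since_epoch.subsec_nanos()))
}

/// Cache-hit predicate: the cached (size, secs, nanos) triple must equal the
/// file's current stat info. Uses the same conversion as the store side.
pub fn stat_matches(cached: (u64, i64, u32), size: u64, mtime: SystemTime) -> bool {
    match mtime_parts(mtime) {
        Some((secs, nanos)) => cached == (size, secs, nanos),
        None => false, // pre-epoch mtime: treat as a miss
    }
}
```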

No files require special attention.

Important Files Changed

| Filename | Overview |
|:---|:---|
| src/task/task_source_checker.rs | Introduces on-disk per-file content-hash cache with atomic writes, correct explicit ZlibEncoder finalization, consistent mtime cast on both store and compare sides, and conservative fallback (always re-hashes) for pre-epoch mtimes and unreadable caches. |

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."

Comment thread src/task/task_source_checker.rs

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request implements a content hash cache for task sources, using file metadata to avoid redundant hashing and persisting the cache to disk with zlib compression and MessagePack. Feedback recommends optimizing memory usage by streaming data directly through the zlib encoder and decoder during serialization and deserialization instead of using intermediate buffers.

Comment on lines +434 to +437
let mut zlib = ZlibDecoder::new(File::open(path)?);
let mut bytes = Vec::new();
zlib.read_to_end(&mut bytes)?;
Ok(rmp_serde::from_slice(&bytes)?)

medium

Instead of reading the entire decompressed stream into a Vec before deserializing, you can deserialize directly from the ZlibDecoder stream. This is more memory-efficient, especially if the cache grows large.

        let zlib = ZlibDecoder::new(File::open(path)?);
        Ok(rmp_serde::from_read(zlib)?)

Comment on lines +448 to +449
let mut zlib = ZlibEncoder::new(File::create(&partial)?, Compression::fast());
zlib.write_all(&rmp_serde::to_vec_named(cache)?)?;

medium

Similarly to the loading logic, you can serialize directly to the ZlibEncoder instead of creating an intermediate Vec via to_vec_named. This avoids an unnecessary allocation.

        let mut zlib = ZlibEncoder::new(File::create(&partial)?, Compression::fast());
        rmp_serde::encode::write_named(&mut zlib, cache)?;

… cache

ZlibEncoder's Drop impl calls finish() but silently discards the Result.
If the final flush failed, the partial file would still be renamed into
place as the live cache, leaving a truncated zlib stream on disk. The
next load would fall back to empty (so correctness held), but the cache
was effectively poisoned for that run.

Call finish() explicitly so a failed finalization bubbles up as an Err
and the partial file is never installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
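
The same hazard exists in the standard library, which makes it easy to demonstrate without the compression crate. The sketch below is an analogy, not the PR's code: `BufWriter` also flushes on drop and ignores any error, so an explicit `flush()` plays the role that the explicit `finish()` plays for `ZlibEncoder`. `FailingSink` and both function names are hypothetical.

```rust
use std::io::{self, BufWriter, Write};

/// A sink whose writes always fail, used to simulate a failing final flush.
pub struct FailingSink;

impl Write for FailingSink {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "simulated write failure"))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Relies on Drop to flush: small writes only fill the in-memory buffer, and
/// the error raised while flushing on drop is silently discarded.
pub fn write_relying_on_drop<W: Write>(sink: W, bytes: &[u8]) -> io::Result<()> {
    let mut writer = BufWriter::new(sink);
    writer.write_all(bytes)?;
    Ok(()) // `writer` is dropped here; a failed flush goes unnoticed
}

/// Flushes explicitly so a failed finalization is returned as an Err, and the
/// caller can avoid installing the partial file.
pub fn write_with_explicit_flush<W: Write>(sink: W, bytes: &[u8]) -> io::Result<()> {
    let mut writer = BufWriter::new(sink);
    writer.write_all(bytes)?;
    writer.flush()?;
    Ok(())
}
```

With `FailingSink`, the first function reports success even though nothing reached the sink, while the second returns the error before any rename would happen.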

jdx (Owner, Author) commented May 12, 2026

Thanks @gemini-code-assist, but I'm going to pass on both of these.

The buffered `read_to_end` → `from_slice` and `to_vec_named` → `write_all` pattern is the established convention everywhere else in this codebase that does `rmp_serde + zlib` (src/cache.rs:248-251, src/env_diff.rs:136, src/hook_env.rs:363). Diverging just here adds inconsistency for a theoretical memory win that doesn't materialize at this scale — a cache entry is ~100 bytes, so even a task with 10k source files is ~1MB uncompressed. Streaming msgpack also produces less precise error messages than `from_slice` on a flat byte buffer.

If we ever want to switch the convention across the codebase, that should be a separate PR touching all four call sites at once.

This comment was generated by an AI coding assistant.

@gemini-code-assist (Contributor)

Thanks for the clarification, @jdx. I completely understand the need for consistency across the codebase, especially when the current pattern is well-established and the performance impact is negligible at this scale. I agree that a codebase-wide refactor would be the better approach if we decide to move away from this pattern later. I'll keep this in mind for future reviews.

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1ef7b76.

Comment thread src/task/task_source_checker.rs
jdx and others added 2 commits May 12, 2026 13:54
… filenames

sources_hash_path and content_hash_cache_path both derived a filename stem
from the same (task, config_source, root) triple via DefaultHasher. Pulling
that into a single helper means future changes to the identity hash apply
to both files in lock-step — a divergence would silently desync the
content cache from the sources hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
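
A shared helper of the kind this commit describes might look like the sketch below. It is an assumption-laden illustration: the signatures are hypothetical, the task is reduced to a name string, and the exact filename of the sources hash file is not stated in this thread, so only the `-content-cache` suffix is taken from the PR description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Derive one filename stem from the (task, config_source, root) triple.
/// Both state files use it, so they are always invalidated together.
pub fn task_state_key(task: &str, config_source: &Path, root: &Path) -> String {
    let mut hasher = DefaultHasher::new();
    task.hash(&mut hasher);
    config_source.hash(&mut hasher);
    root.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}

/// Path of the sources hash file (the bare-key filename is an assumption).
pub fn sources_hash_path(state_dir: &Path, key: &str) -> PathBuf {
    state_dir.join("task-sources").join(key)
}

/// Path of the per-file content-hash cache for the same task.
pub fn content_hash_cache_path(state_dir: &Path, key: &str) -> PathBuf {
    state_dir
        .join("task-sources")
        .join(format!("{key}-content-cache"))
}
```

Computing the key once and passing it to both path helpers removes the chance that the two stems drift apart after a later change to the identity hash.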
Pure formatting drift from mise run render — multi-line array layout. The
committed files predate the current schema generator's output style; CI's
"render produced changes" check was failing on this. No semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 12, 2026

Hyperfine Performance

mise x -- echo

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `mise-2026.5.6 x -- echo` | 19.4 ± 0.9 | 17.5 | 24.6 | 1.00 |
| `mise x -- echo` | 20.1 ± 2.6 | 17.6 | 36.9 | 1.04 ± 0.14 |

mise env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `mise-2026.5.6 env` | 20.2 ± 1.9 | 16.8 | 27.5 | 1.00 |
| `mise env` | 21.7 ± 1.5 | 18.0 | 27.4 | 1.08 ± 0.13 |

mise hook-env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `mise-2026.5.6 hook-env` | 20.7 ± 1.1 | 18.7 | 24.9 | 1.01 ± 0.07 |
| `mise hook-env` | 20.5 ± 1.0 | 18.9 | 25.6 | 1.00 |

mise ls

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `mise-2026.5.6 ls` | 16.6 ± 0.8 | 15.2 | 22.7 | 1.00 |
| `mise ls` | 17.0 ± 0.9 | 15.2 | 22.7 | 1.03 ± 0.07 |

xtasks/test/perf

| Command | mise-2026.5.6 | mise | Variance |
|:---|---:|---:|---:|
| install (cached) | 128ms | 129ms | +0% |
| ls (cached) | 61ms | 61ms | +0% |
| bin-paths (cached) | 65ms | 65ms | +0% |
| task-ls (cached) | 503ms | 503ms | +0% |

jdx and others added 2 commits May 12, 2026 14:08
JSON.stringify always splits arrays across lines, but the repo's prettier
config inlines short arrays. Without a follow-up prettier pass, every
mise run render produced schema drift that the hk prettier check then
flagged — leaving CI wedged where render-drift and prettier-drift could
not both pass.

Run prettier --write on each output file inside writeFormattedJson so the
on-disk format matches what the lint pipeline expects. Also reverts the
multi-line schema state from the previous commit, which only existed to
make the render-drift check pass and would have re-broken prettier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rland-8392e4

# Conflicts:
#	xtasks/render/schema.ts
jdx merged commit e1abd7c into main May 12, 2026
33 checks passed
jdx deleted the claude/blissful-sutherland-8392e4 branch May 12, 2026 19:49
mise-en-dev added a commit that referenced this pull request May 13, 2026
### 🐛 Bug Fixes

- **(backend)** use runtime paths for backend bin dirs by @risu729 in
[#9606](#9606)
- **(ci)** preserve vendor/aqua-registry/ in PPA publish workflow by
@jdx in [#9782](#9782)
- **(ci)** set UTF-8 locale in e2e Docker image by @jdx in
[#9820](#9820)
- **(ci)** pass UTF-8 locale through to e2e tests by @jdx in
[#9823](#9823)
- **(conda)** dedup repodata by archive identifier instead of URL by
@jdx in [#9831](#9831)
- **(github)** use default shell for credential command by @risu729 in
[#9664](#9664)
- **(settings)** distinguish unset known settings from unknown ones by
@jdx in [#9818](#9818)
- **(upgrade)** remove completed progress jobs to prevent duplicate
output by @jdx in [#9779](#9779)
- **(vfox)** resolve GitHub token lazily inside Lua plugins by @jdx in
[#9816](#9816)

### 🚜 Refactor

- **(config)** separate core and backend tool options by @risu729 in
[#9753](#9753)
- **(schema)** reuse env directive property schemas by @risu729 in
[#9651](#9651)

### 📚 Documentation

- **(aliases)** fix Aliased Versions example and drop stale asdf callout
by @jdx in [#9830](#9830)

### ⚡ Performance

- **(aqua)** use phf for baked registry lookups by @risu729 in
[#9763](#9763)
- **(task)** cache per-file content hashes for
source_freshness_hash_contents by @jdx in
[#9819](#9819)

### 🧪 Testing

- **(e2e)** pin aube to known-good version in npm package_manager test
by @jdx in [#9794](#9794)

### 📦 Registry

- replace unsupported exe options by @risu729 in
[#9587](#9587)
- update pi by @garysassano in
[#9792](#9792)

### Chore

- **(ci)** use non-large runners for release builds by @jdx in
[#9786](#9786)
- **(ci)** compare registry PRs from fork point by @risu729 in
[#9643](#9643)
- **(ci)** make build-copr.sh the single source of truth for COPR
chroots by @jdx in [#9788](#9788)
- **(ci)** use crates.io trusted publishing in release-plz by @jdx in
[#9793](#9793)
- **(ci)** remove autofix.ci workflow by @jdx in
[#9801](#9801)
- **(ci)** restore -large runner for Linux release builds by @jdx in
[#9815](#9815)
- **(ci)** add zizmor workflow for github actions security analysis by
@jdx in [#9804](#9804)
- **(ci)** assert mise run render produces no diff by @jdx in
[#9803](#9803)
- **(copr)** publish EL9 builds via centos-stream+epel-next-9 chroot by
@jdx in [#9787](#9787)

### Ci

- remove pull_request_target workflow by @jdx in
[#9799](#9799)
- remove caching from publishing workflows by @jdx in
[#9800](#9800)

### Security

- reject shell metacharacters in version strings and CI inputs by @jdx
in [#9814](#9814)

## 📦 Aqua Registry Updates

### New Packages (11)

- [`Code-Hex/Neo-cowsay`](https://github.com/Code-Hex/Neo-cowsay)
- [`SonarSource/sonarqube-cli`](https://github.com/SonarSource/sonarqube-cli)
- [`earendil-works/pi`](https://github.com/earendil-works/pi)
- [`hylo-lang/hylo-new`](https://github.com/hylo-lang/hylo-new)
- [`jfernandez/bpftop`](https://github.com/jfernandez/bpftop)
- [`modem-dev/hunk`](https://github.com/modem-dev/hunk)
- [`npm/cli`](https://github.com/npm/cli)
- [`racket/racket/minimal`](https://github.com/racket/racket)
- [`slackapi/slack-cli`](https://github.com/slackapi/slack-cli)
- [`vectordotdev/vector`](https://github.com/vectordotdev/vector)
- [`wasilibs/go-yamllint`](https://github.com/wasilibs/go-yamllint)

### Updated Packages (10)

- [`DataDog/pup`](https://github.com/DataDog/pup)
- [`aquasecurity/trivy`](https://github.com/aquasecurity/trivy)
- [`astral-sh/uv`](https://github.com/astral-sh/uv)
- [`caarlos0/svu`](https://github.com/caarlos0/svu)
- [`cargo-bins/cargo-binstall`](https://github.com/cargo-bins/cargo-binstall)
- [`foundry-rs/foundry`](https://github.com/foundry-rs/foundry)
- [`gastownhall/beads`](https://github.com/gastownhall/beads)
- [`gruntwork-io/terragrunt`](https://github.com/gruntwork-io/terragrunt)
- [`pnpm/pnpm`](https://github.com/pnpm/pnpm)
- [`santosr2/TerraTidy`](https://github.com/santosr2/TerraTidy)
3PeatVR pushed a commit to 3PeatVR/mise that referenced this pull request May 14, 2026