perf(task): cache per-file content hashes for source_freshness_hash_contents #9819
…ontents When `task.source_freshness_hash_contents = true`, every freshness check re-reads and re-blake3-hashes every source file. This caches each file's content hash keyed by `(size, mtime_secs, mtime_nanos)` (git stat-info style) in a per-task file under `STATE/task-sources/<key>-content-cache`, so unchanged files are skipped on subsequent runs. The cache is rebuilt from scratch each run so entries for files no longer in `sources` are pruned; on disk it uses `rmp_serde + zlib` matching the existing cache convention. A corrupt or unreadable cache file falls back to empty, so correctness is preserved either way — only speed is at stake. Closes #9802 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
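The stat-keyed reuse and the rebuild-from-scratch pruning described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `CacheEntry`, `stat_key`, `hash_contents`, and `refresh_cache` are hypothetical names, and std's `DefaultHasher` stands in for blake3 so the sketch needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};
use std::time::UNIX_EPOCH;

/// Hypothetical cache entry: the stat info plus the content hash it vouches for.
#[derive(Clone, Debug, PartialEq)]
struct CacheEntry {
    size: u64,
    mtime_secs: i64,
    mtime_nanos: u32,
    content_hash: u64, // stand-in; the PR stores a blake3 hash
}

type ContentCache = HashMap<PathBuf, CacheEntry>;

/// Read (size, mtime_secs, mtime_nanos). Pre-epoch mtimes yield None so the
/// caller always re-hashes such files instead of trusting a cache entry.
fn stat_key(path: &Path) -> io::Result<Option<(u64, i64, u32)>> {
    let meta = fs::metadata(path)?;
    let key = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .ok()
        .map(|d| (meta.len(), d.as_secs() as i64, d.subsec_nanos()));
    Ok(key)
}

/// Stand-in content hash (assumption: DefaultHasher instead of blake3).
fn hash_contents(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish())
}

/// Build a fresh cache containing only `sources`, reusing hashes from `old`
/// when the stat info matches. Returns the new cache, the per-file hashes in
/// source order, and the number of files that had to be re-hashed.
fn refresh_cache(
    old: &ContentCache,
    sources: &[PathBuf],
) -> io::Result<(ContentCache, Vec<u64>, usize)> {
    let mut fresh = ContentCache::new();
    let mut hashes = Vec::with_capacity(sources.len());
    let mut rehashed = 0;
    for path in sources {
        let key = stat_key(path)?;
        let hit = key.and_then(|(size, secs, nanos)| {
            old.get(path)
                .filter(|e| (e.size, e.mtime_secs, e.mtime_nanos) == (size, secs, nanos))
                .map(|e| e.content_hash)
        });
        let content_hash = match hit {
            Some(h) => h,
            None => {
                rehashed += 1;
                hash_contents(path)?
            }
        };
        if let Some((size, mtime_secs, mtime_nanos)) = key {
            fresh.insert(
                path.clone(),
                CacheEntry { size, mtime_secs, mtime_nanos, content_hash },
            );
        }
        hashes.push(content_hash);
    }
    Ok((fresh, hashes, rehashed))
}
```

Because the new map is populated only from the current `sources`, entries for files that have left the list disappear without a separate pruning pass.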
Greptile Summary
Adds an on-disk per-file content-hash cache for

Confidence Score: 5/5
Safe to merge: the cache is keyed correctly, writes are atomic, and every error path falls back to a full re-hash so task freshness decisions remain correct even when the cache is absent or corrupt. The ZlibEncoder::finish() concern from the prior review round is explicitly fixed. The mtime cast (as_secs() as i64) is symmetric between make_cache_entry and cached_entry_matches, so cache comparisons are always consistent. Pre-epoch mtimes and unreadable cache files both fall back conservatively to a full re-hash. Atomic rename prevents partial-file reads. No files require special attention.

Important Files Changed

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."
Code Review
This pull request implements a content hash cache for task sources, using file metadata to avoid redundant hashing and persisting the cache to disk with zlib compression and MessagePack. Feedback recommends optimizing memory usage by streaming data directly through the zlib encoder and decoder during serialization and deserialization instead of using intermediate buffers.
let mut zlib = ZlibDecoder::new(File::open(path)?);
let mut bytes = Vec::new();
zlib.read_to_end(&mut bytes)?;
Ok(rmp_serde::from_slice(&bytes)?)
let mut zlib = ZlibEncoder::new(File::create(&partial)?, Compression::fast());
zlib.write_all(&rmp_serde::to_vec_named(cache)?)?;
Similarly to the loading logic, you can serialize directly to the ZlibEncoder instead of creating an intermediate Vec via to_vec_named. This avoids an unnecessary allocation.
let mut zlib = ZlibEncoder::new(File::create(&partial)?, Compression::fast());
rmp_serde::encode::write_named(&mut zlib, cache)?;

… cache
ZlibEncoder's Drop impl calls finish() but silently discards the Result. If the final flush failed, the partial file would still be renamed into place as the live cache, leaving a truncated zlib stream on disk. The next load would fall back to empty (so correctness held), but the cache was effectively poisoned for that run. Call finish() explicitly so a failed finalization bubbles up as an Err and the partial file is never installed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
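The pattern this commit describes (finalize the writer explicitly, and only then rename the partial file into place) can be illustrated with std alone. The sketch below swaps in `BufWriter` for `ZlibEncoder`: it has the same hazard, since its `Drop` flushes but ignores errors. `write_partial`, `write_atomic`, and the partial-file naming are hypothetical, not the PR's code.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Write all bytes to `partial` and finalize the writer explicitly.
fn write_partial(partial: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut writer = BufWriter::new(File::create(partial)?);
    writer.write_all(bytes)?;
    // Relying on Drop would swallow a failed final flush. Flushing here
    // surfaces the error, analogous to calling ZlibEncoder::finish().
    writer.flush()?;
    Ok(())
}

/// Install `bytes` at `dest` atomically: the rename happens only after the
/// partial file has been written and finalized without error.
fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Hypothetical partial-file name; the PR uses a `*.part-*` pattern.
    let partial = dest.with_extension(format!("part-{}", std::process::id()));
    match write_partial(&partial, bytes) {
        Ok(()) => fs::rename(&partial, dest),
        Err(e) => {
            // Best-effort cleanup; the live cache file is left untouched.
            let _ = fs::remove_file(&partial);
            Err(e)
        }
    }
}
```

With this ordering, a reader sees either the previous complete file or the new complete file, never a truncated stream.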
Thanks @gemini-code-assist, but I'm going to pass on both of these. The buffered `read_to_end` → `from_slice` and `to_vec_named` → `write_all` pattern is the established convention everywhere else in this codebase that does `rmp_serde + zlib` (src/cache.rs:248-251, src/env_diff.rs:136, src/hook_env.rs:363). Diverging just here adds inconsistency for a theoretical memory win that doesn't materialize at this scale: a cache entry is ~100 bytes, so even a task with 10k source files is ~1MB uncompressed. Streaming msgpack also produces less precise error messages than `from_slice` on a flat byte buffer. If we ever want to switch the convention across the codebase, that should be a separate PR touching all four call sites at once. This comment was generated by an AI coding assistant.

Thanks for the clarification, @jdx. I completely understand the need for consistency across the codebase, especially when the current pattern is well-established and the performance impact is negligible at this scale. I agree that a codebase-wide refactor would be the better approach if we decide to move away from this pattern later. I'll keep this in mind for future reviews.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 1ef7b76.
… filenames sources_hash_path and content_hash_cache_path both derived a filename stem from the same (task, config_source, root) triple via DefaultHasher. Pulling that into a single helper means future changes to the identity hash apply to both files in lock-step — a divergence would silently desync the content cache from the sources hash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
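A shared helper of the kind this commit describes might look like the sketch below. The function names `sources_hash_path`, `content_hash_cache_path`, and `task_state_key` appear in this PR, but the signatures, the hex formatting, and the sources-hash filename are assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// One identity hash over (task, config_source, root). Both state files
/// derive their names from this, so they cannot drift apart.
/// (Assumed signature; the PR's actual helper may differ.)
fn task_state_key(task: &str, config_source: &Path, root: &Path) -> String {
    let mut hasher = DefaultHasher::new();
    (task, config_source, root).hash(&mut hasher);
    format!("{:x}", hasher.finish())
}

/// Hypothetical layout: STATE/task-sources/<key>
fn sources_hash_path(state_dir: &Path, key: &str) -> PathBuf {
    state_dir.join("task-sources").join(key)
}

/// Layout stated in the PR description: STATE/task-sources/<key>-content-cache
fn content_hash_cache_path(state_dir: &Path, key: &str) -> PathBuf {
    state_dir
        .join("task-sources")
        .join(format!("{}-content-cache", key))
}
```

Any future change to what goes into the identity hash then happens in one place and applies to both paths.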
Pure formatting drift from mise run render — multi-line array layout. The committed files predate the current schema generator's output style; CI's "render produced changes" check was failing on this. No semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hyperfine Performance

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.6 x -- echo | 19.4 ± 0.9 | 17.5 | 24.6 | 1.00 |
| mise x -- echo | 20.1 ± 2.6 | 17.6 | 36.9 | 1.04 ± 0.14 |

mise env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.6 env | 20.2 ± 1.9 | 16.8 | 27.5 | 1.00 |
| mise env | 21.7 ± 1.5 | 18.0 | 27.4 | 1.08 ± 0.13 |

mise hook-env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.6 hook-env | 20.7 ± 1.1 | 18.7 | 24.9 | 1.01 ± 0.07 |
| mise hook-env | 20.5 ± 1.0 | 18.9 | 25.6 | 1.00 |

mise ls

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.6 ls | 16.6 ± 0.8 | 15.2 | 22.7 | 1.00 |
| mise ls | 17.0 ± 0.9 | 15.2 | 22.7 | 1.03 ± 0.07 |

xtasks/test/perf
| Command | mise-2026.5.6 | mise | Variance |
|---|---|---|---|
| install (cached) | 128ms | 129ms | +0% |
| ls (cached) | 61ms | 61ms | +0% |
| bin-paths (cached) | 65ms | 65ms | +0% |
| task-ls (cached) | 503ms | 503ms | +0% |
JSON.stringify always splits arrays across lines, but the repo's prettier config inlines short arrays. Without a follow-up prettier pass, every mise run render produced schema drift that the hk prettier check then flagged — leaving CI wedged where render-drift and prettier-drift could not both pass. Run prettier --write on each output file inside writeFormattedJson so the on-disk format matches what the lint pipeline expects. Also reverts the multi-line schema state from the previous commit, which only existed to make the render-drift check pass and would have re-broken prettier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rland-8392e4 # Conflicts: # xtasks/render/schema.ts
### 🐛 Bug Fixes

- **(backend)** use runtime paths for backend bin dirs by @risu729 in [#9606](#9606)
- **(ci)** preserve vendor/aqua-registry/ in PPA publish workflow by @jdx in [#9782](#9782)
- **(ci)** set UTF-8 locale in e2e Docker image by @jdx in [#9820](#9820)
- **(ci)** pass UTF-8 locale through to e2e tests by @jdx in [#9823](#9823)
- **(conda)** dedup repodata by archive identifier instead of URL by @jdx in [#9831](#9831)
- **(github)** use default shell for credential command by @risu729 in [#9664](#9664)
- **(settings)** distinguish unset known settings from unknown ones by @jdx in [#9818](#9818)
- **(upgrade)** remove completed progress jobs to prevent duplicate output by @jdx in [#9779](#9779)
- **(vfox)** resolve GitHub token lazily inside Lua plugins by @jdx in [#9816](#9816)

### 🚜 Refactor

- **(config)** separate core and backend tool options by @risu729 in [#9753](#9753)
- **(schema)** reuse env directive property schemas by @risu729 in [#9651](#9651)

### 📚 Documentation

- **(aliases)** fix Aliased Versions example and drop stale asdf callout by @jdx in [#9830](#9830)

### ⚡ Performance

- **(aqua)** use phf for baked registry lookups by @risu729 in [#9763](#9763)
- **(task)** cache per-file content hashes for source_freshness_hash_contents by @jdx in [#9819](#9819)

### 🧪 Testing

- **(e2e)** pin aube to known-good version in npm package_manager test by @jdx in [#9794](#9794)

### 📦 Registry

- replace unsupported exe options by @risu729 in [#9587](#9587)
- update pi by @garysassano in [#9792](#9792)

### Chore

- **(ci)** use non-large runners for release builds by @jdx in [#9786](#9786)
- **(ci)** compare registry PRs from fork point by @risu729 in [#9643](#9643)
- **(ci)** make build-copr.sh the single source of truth for COPR chroots by @jdx in [#9788](#9788)
- **(ci)** use crates.io trusted publishing in release-plz by @jdx in [#9793](#9793)
- **(ci)** remove autofix.ci workflow by @jdx in [#9801](#9801)
- **(ci)** restore -large runner for Linux release builds by @jdx in [#9815](#9815)
- **(ci)** add zizmor workflow for github actions security analysis by @jdx in [#9804](#9804)
- **(ci)** assert mise run render produces no diff by @jdx in [#9803](#9803)
- **(copr)** publish EL9 builds via centos-stream+epel-next-9 chroot by @jdx in [#9787](#9787)

### Ci

- remove pull_request_target workflow by @jdx in [#9799](#9799)
- remove caching from publishing workflows by @jdx in [#9800](#9800)

### Security

- reject shell metacharacters in version strings and CI inputs by @jdx in [#9814](#9814)

## 📦 Aqua Registry Updates

### New Packages (11)

- [`Code-Hex/Neo-cowsay`](https://github.com/Code-Hex/Neo-cowsay)
- [`SonarSource/sonarqube-cli`](https://github.com/SonarSource/sonarqube-cli)
- [`earendil-works/pi`](https://github.com/earendil-works/pi)
- [`hylo-lang/hylo-new`](https://github.com/hylo-lang/hylo-new)
- [`jfernandez/bpftop`](https://github.com/jfernandez/bpftop)
- [`modem-dev/hunk`](https://github.com/modem-dev/hunk)
- [`npm/cli`](https://github.com/npm/cli)
- [`racket/racket/minimal`](https://github.com/racket/racket)
- [`slackapi/slack-cli`](https://github.com/slackapi/slack-cli)
- [`vectordotdev/vector`](https://github.com/vectordotdev/vector)
- [`wasilibs/go-yamllint`](https://github.com/wasilibs/go-yamllint)

### Updated Packages (10)

- [`DataDog/pup`](https://github.com/DataDog/pup)
- [`aquasecurity/trivy`](https://github.com/aquasecurity/trivy)
- [`astral-sh/uv`](https://github.com/astral-sh/uv)
- [`caarlos0/svu`](https://github.com/caarlos0/svu)
- [`cargo-bins/cargo-binstall`](https://github.com/cargo-bins/cargo-binstall)
- [`foundry-rs/foundry`](https://github.com/foundry-rs/foundry)
- [`gastownhall/beads`](https://github.com/gastownhall/beads)
- [`gruntwork-io/terragrunt`](https://github.com/gruntwork-io/terragrunt)
- [`pnpm/pnpm`](https://github.com/pnpm/pnpm)
- [`santosr2/TerraTidy`](https://github.com/santosr2/TerraTidy)

Summary

- When `task.source_freshness_hash_contents = true`, every freshness check currently re-reads and re-blake3-hashes every source file (discussion #9802).
- This caches each file's content hash keyed by `(size, mtime_secs, mtime_nanos)` (git's stat-info trick) in a per-task file under `STATE/task-sources/<key>-content-cache`. Unchanged files are skipped on subsequent runs; a file is only re-hashed when its size or mtime changes.

Design notes

Test plan

🤖 Generated with Claude Code
Note
Medium Risk
Touches task invalidation/freshness logic by persisting and reusing per-file content hashes; bugs could cause tasks to incorrectly be treated as fresh or stale, though it falls back safely on cache errors.
Overview
When `task.source_freshness_hash_contents` is enabled, freshness checks now persist a per-task, per-file blake3 hash cache (keyed by file size + mtime secs/nanos) so unchanged sources aren’t re-read/re-hashed on every run.

This introduces a stable `task_state_key` to scope state under `STATE/task-sources/`, adds msgpack+zlib load/save with atomic `*.part-*` writes (and silent fallback on corrupt cache / trace-only write failures), and expands unit tests to cover cache reuse, pruning, and disk round-tripping.

Reviewed by Cursor Bugbot for commit f1f039b. Bugbot is set up for automated code reviews on this repo.
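
The silent fallback on a corrupt or missing cache mentioned in the overview amounts to wrapping a strict loader and defaulting to an empty map. The sketch below shows that shape only: it uses a made-up tab-separated text format in place of msgpack+zlib, and `load_cache` and `load_cache_or_empty` are hypothetical names.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Simplified cache: path string -> content hash.
type Cache = HashMap<String, u64>;

/// Strict loader for a stand-in format with one "path<TAB>hash" per line.
/// Any malformed line is reported as an InvalidData error.
fn load_cache(path: &Path) -> io::Result<Cache> {
    let text = fs::read_to_string(path)?;
    let mut cache = Cache::new();
    for line in text.lines() {
        let (name, hash) = line
            .split_once('\t')
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "missing tab"))?;
        let hash = hash
            .parse::<u64>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        cache.insert(name.to_string(), hash);
    }
    Ok(cache)
}

/// Lenient wrapper: a missing, unreadable, or corrupt file yields an empty
/// cache, so the only cost of a bad cache is re-hashing every source file.
fn load_cache_or_empty(path: &Path) -> Cache {
    load_cache(path).unwrap_or_default()
}
```

Keeping the strict loader separate from the lenient wrapper lets tests assert on the exact failure while callers never have to handle it.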