Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix caching behavior to always use hashed URLs for the cache filenames #503

Merged
merged 2 commits into from
Nov 29, 2024

Conversation

sirosen
Copy link
Member

@sirosen sirosen commented Nov 29, 2024

This change fixes a significant bug in caching behavior.
Cross-link: resolves GHSA-q6mv-284r-mp36

Why

Under the pre-existing default behavior, two different remote (HTTPS) schemas with the same filename would be cached under the same local filename. e.g., https://example.com/foo/schema.json and https://example.com/bar/schema.json both cache as schema.json.
As a result, cache confusion / poisoning issues would easily occur for users interacting with these schemas.

Prior to this change, users would either have to disable caching or use --cache-filename to get correct caching behavior.

What

To fix, caching is changed here to always use URL hashes, as was developed for the refs/ cache.
This behavior is refined and relocated to be used by both caching paths in the codebase.

The --cache-filename option is no longer important or useful to support, so it is made non-functional and deprecated in this release.
It will be removed in a future release; timeline TBD.

Additional Notes

Both caching codepaths -- initial schema and refs -- now use sha256 hashing. This results in a mild (read: negligible) slowdown in exchange for driving the risk of a collision even lower, to effectively 0.

No effort is made to "clean up" stale files in the cache. As documented, users can always blow the whole cache, as in rm -r ~/.cache/check_jsonschema, if they want to remove these files.
Cache cleanup would also play poorly with users who have different tool versions installed simultaneously in different projects.

I tried some implementation options which required a more significant refactor of the caching layer, but discarded them as

  • not offering enough benefit
  • posing too high of a risk of breakage / getting something wrong
  • requiring more thought and care

which evaluate negatively against the criterion of "ship a bugfix confidently and expeditiously so that users can upgrade".

The two cache dirs are currently "refs" and "downloads". Make these
always explicit, rather than `"downloads"` being an internal default
baked into the downloader.
1. When caching schemas in the `downloads/` dir, hash their URLs, just
   as is done in the `refs/` cache. This fixes buggy behaviors due to
   cache confusion on matching filenames.

2. Deprecate (but do not, at this time, remove) the `--cache-filename`
   option, making it non-functional. This also removes the
   `cache_filename` internal arg from several components.

Some minimal test and doc cleanup is needed, to appropriately handle
these changes.
@sirosen sirosen merged commit c52714b into main Nov 29, 2024
45 checks passed
@sirosen sirosen deleted the fix-caching branch November 29, 2024 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant