Caching of file-set hashes by local path and mtimes #700

tclose · 2023-09-11T22:39:28Z

Types of changes

New feature (non-breaking change which adds functionality)

Summary

Will address #683 by:

- overloading bytes_repr() for FileSets so that the first item the generator yields is a "time-stamp" object consisting of a key (the file path) and the modification time
- having hash_single() check for these key/mtime pairs in a global "local hashes cache" dict
  - it returns the cached hash, if present
  - otherwise it proceeds through the remaining byte chunks and calculates the hash
- saving calculated hashes into, and loading them from, the cache directory using a hash of the key/mtime

Checklist

I have added tests to cover my changes (if necessary)
I have updated documentation (if necessary)

Notes

I'm pretty happy with how this turned out, with the exception of a few wrinkles. Any suggestions would be most welcome.

- There isn't a clean way to specify the location of the persistent hash cache in the top-level code, or to put it in the cache_location path (my original plan), given that checksum and hash are object properties instead of methods
  - I have therefore just dumped it in ~/.pydra-hash-cache
  - It is not ideal to have to (effectively) hard-code this and to have it be user-dependent
  - Having it in the user directory does mean that if the same cache directory is accessed from different machines (e.g. a shared network drive) then local paths won't clash (although the chance of mtimes being the same would be vanishingly small)
- The keys of the caches themselves are hashed and stored as files in the persistent cache directory, and will therefore never be cleaned up
  - This could be OK, as they are very small files, so if the directory grows over time it probably won't have much impact
  - This is how pydra caches work in general
- The resolution of mtime differs depending on OS, with Ubuntu and Windows having quite low resolution, i.e. on the order of seconds
  - I have had to put a sleep call inside the hash test to ensure the mtimes are different
  - This could cause issues if you were updating the contents of files within a loop and spinning off different workflows (not sure whether this would ever happen in practice)
  - There is no easy way to disable the mtime caching behaviour if this does become a problem

codecov · 2023-09-11T22:44:37Z

Codecov Report

Attention: Patch coverage is 99.04762%, with 1 line in your changes missing coverage. Please review.

Project coverage is 84.22%. Comparing base (ff01e4c) to head (921979c).

Files	Patch %	Lines
pydra/utils/hash.py	98.97%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #700      +/-   ##
==========================================
+ Coverage   83.93%   84.22%   +0.29%     
==========================================
  Files          24       25       +1     
  Lines        5029     5123      +94     
  Branches     1429     1449      +20     
==========================================
+ Hits         4221     4315      +94     
  Misses        802      802              
  Partials        6        6

Flag	Coverage Δ
unittests	`84.22% <99.04%> (+0.29%)`	⬆️

ghisvail · 2023-09-14T16:04:14Z

I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.

Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be re-used, if the name and hash values did not change, and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

tclose · 2023-09-17T22:46:06Z

> I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

No worries, I just put you all down as reviewers so you were notified. Don't feel like you need to contribute if the area isn't familiar

> AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.
>
> Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be re-used, if the name and hash values did not change, and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.
>
> Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed.

It does this by hashing the path and mtime of the file/directory (not to be confused with the hashing of the file/directory contents), and using it as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.
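
As a rough sketch of that lookup, assuming one small file per entry in the hash cache directory: the helper names, the `path:mtime` key encoding and the digest size below are illustrative assumptions, not the code in this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_entry_path(cache_dir: Path, file_path: str, mtime_ns: int) -> Path:
    """Name the cache entry after a hash of the (path, mtime) key (hypothetical layout)."""
    key = f"{file_path}:{mtime_ns}".encode()
    return cache_dir / hashlib.blake2b(key, digest_size=16).hexdigest()


def cached_content_hash(cache_dir: Path, path: Path) -> str:
    """Return the content hash of `path`, rehashing only on a cache miss."""
    entry = cache_entry_path(cache_dir, str(path.resolve()), path.stat().st_mtime_ns)
    if entry.exists():
        return entry.read_text()  # hit: the file contents are never read
    digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(digest)
    return digest


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.bin"
    data.write_bytes(b"payload")
    cache_dir = Path(tmp) / "hash-cache"
    miss = cached_content_hash(cache_dir, data)  # computes and stores the hash
    hit = cached_content_hash(cache_dir, data)   # served from the cache file
    n_entries = len(list(cache_dir.iterdir()))
```

Note the two separate hashes: one over the path and mtime (the cache key) and one over the contents (the cached value). Modifying the file changes its mtime, which produces a new key and therefore a miss.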

effigies · 2023-09-19T12:32:10Z

> I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time():
https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

pydra/utils/hash.py

tclose · 2023-09-19T23:29:36Z

> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

Interesting, I assumed that the value for the mtime was controlled by the file-system not Python.
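
For reference, a different technique from the mocking suggested above is to set the mtime explicitly from Python with `os.utime`, which accepts nanosecond values through its `ns` argument. The sketch below is only an illustration under that assumption; the helper name and timestamps are made up, and nothing in this thread says this approach was tried in the PR or that it behaves the same on every filesystem.

```python
import os
import tempfile
from pathlib import Path


def set_mtime_ns(path: Path, mtime_ns: int) -> None:
    """Set the modification time explicitly, leaving the access time as it is."""
    atime_ns = path.stat().st_atime_ns
    os.utime(path, ns=(atime_ns, mtime_ns))


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("first")
    b.write_text("second")
    # Arbitrary timestamps two whole seconds apart, so that the difference
    # survives filesystems that store mtimes with coarse resolution.
    base_ns = 1_700_000_000 * 10**9
    set_mtime_ns(a, base_ns)
    set_mtime_ns(b, base_ns + 2 * 10**9)
    mtimes_differ = a.stat().st_mtime_ns != b.stat().st_mtime_ns
```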

ghisvail

Some suggestions and a question mark regarding the cache path in the home folder. It may also be worth having a look at mtime mocking instead of sleep steps, as suggested by Chris.

Looks solid otherwise 👍

pydra/utils/hash.py

djarecka · 2023-09-24T22:03:45Z

@tclose - you can try with new slurm testing workflow, should work now!

djarecka

lgtm! just left comments regarding the location of the cache directory

tclose · 2023-09-25T06:43:21Z

> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures

tclose · 2023-09-26T23:31:48Z

@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.

djarecka · 2024-02-28T02:01:04Z

@tclose - the recent commit to this PR comes from some rebasing?

tclose · 2024-02-28T11:49:43Z

> @tclose - the recent commit to this PR comes from some rebasing?

I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes

djarecka · 2024-02-28T14:29:39Z

ok, trying to figure out if I should review it again, since I already accepted

Changes addressed

pydra/utils/hash.py

effigies

Sorry, I've completely lost the plot on this PR. I don't understand what's going on, but there are some things that look wrong to me.

pydra/utils/hash.py

djarecka · 2024-03-07T22:47:19Z

Sorry, I'm sure we discussed it at some point, but I'm not sure about one important thing: it looks like if I move a file around my filesystem, I now have no way of reusing the previous tasks. Is that right?

djarecka · 2024-03-08T00:31:20Z

Sorry, I was wrong: the task has the correct hash when the file is the same but at a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to open a PR against your branch, but I must have merged more things into my branch)
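
The property being checked can be illustrated with a small standalone sketch. It is not the linked test: `content_hash` is a hypothetical helper, and the point is only that a hash computed from file contents does not depend on where the file lives.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash only the file contents, so the location of the file does not matter."""
    return hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "dir1" / "data.txt"
    moved = Path(tmp) / "dir2" / "data.txt"
    original.parent.mkdir()
    moved.parent.mkdir()
    original.write_text("same contents")
    shutil.copy(original, moved)
    # The path differs (and so would any path/mtime cache key),
    # but the content hash is identical.
    same_hash = content_hash(original) == content_hash(moved)
```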

tclose · 2024-03-08T06:30:00Z

> sorry, I was wrong, the task has correct hash when the file is the same with a different path.
>
> Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568
>
> (I was planning to create a PR to you, but I must have merged to the branch more things to my branch)

Ok, nice addition

djarecka · 2024-03-08T20:34:10Z

I've just realized that we should also have a similar test for when the persistent cache is used, and that it's not enough to just set PYDRA_HASH_CACHE. Sorry for asking for more work, but can we have a test in which the persistent cache is used together with running a task or workflow?

tclose · 2024-03-16T00:55:10Z

@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do

djarecka · 2024-03-16T22:22:22Z

pydra/engine/tests/test_node_task.py

+        return super().contents
+
+
+def test_task_files_persistentcache(tmp_path):


@tclose - where are you setting the persistent cache path? i.e. PYDRA_HASH_CACHE

tclose · 2024-03-17

At @ghisvail's suggestion, the hash cache is stored in a system-dependent user cache directory using platformdirs.user_cache_dir by default (e.g. /Users/<username>/Library/Caches/pydra/<version-number> on macOS).

I have just tweaked the code so that it is now put in a hashes subdirectory of that user cache dir (accessible in the pydra.utils.user_cache_dir variable) just in case any other cache data needs to be stored at some point in the future.

djarecka · 2024-03-17

Oh, got it! Sorry I missed that. I've just realized that I made a typo when I was setting PYDRA_HASH_CACHE, and that's why it was not saving the hashes there and why I was confused about where they were being saved.
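
A minimal sketch of how an environment-variable override with a default location might resolve. `resolve_hash_cache_dir` and the example paths are hypothetical and this is not the PR's code; in the PR the default is described as coming from platformdirs.user_cache_dir, whereas here it is simply passed in.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hash_cache_dir(
    default_dir: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Use PYDRA_HASH_CACHE when it is set, otherwise fall back to the default."""
    env = os.environ if env is None else env
    override = env.get("PYDRA_HASH_CACHE")
    return Path(override) if override else default_dir


default = Path("user-cache") / "hashes"  # placeholder default location
with_override = resolve_hash_cache_dir(default, {"PYDRA_HASH_CACHE": "my-hash-cache"})
# A misspelt variable name is not picked up, so the default is used without any warning.
with_typo = resolve_hash_cache_dir(default, {"PYDRA_HASH_CACH": "my-hash-cache"})
```

This also shows why a typo in the variable name goes unnoticed: the lookup quietly falls back to the default directory.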

djarecka · 2024-03-17T14:35:23Z

@tclose - thanks so much for the work!

tclose added the enhancement New feature or request label Sep 11, 2023

tclose changed the title ~~added code to handle "locally-persistent-ids"~~ Caching of file-set hashes by local path and mtimes Sep 11, 2023

tclose marked this pull request as ready for review September 12, 2023 23:58

tclose requested review from effigies, djarecka and ghisvail September 13, 2023 11:59

ghisvail reviewed Sep 19, 2023

ghisvail previously requested changes Sep 20, 2023

djarecka approved these changes Sep 24, 2023

tclose force-pushed the local-cache-ids branch from 0940383 to 1c1c309 Compare September 25, 2023 04:21

tclose force-pushed the local-cache-ids branch 2 times, most recently from 21540ad to 96dcc48 Compare September 26, 2023 03:24

tclose added 10 commits February 24, 2024 22:05

added code to handle "locally-persistent-ids"

45117ef

implemented persistent hash cache to avoid rehashing files

2b7ca50

touched up persistent_hash_cache test

04b95ff

replaced Cache({}) with Cache() to match new proper class

0c865f4

upped resolution of mtime to nanoseconds

3b3fdb7

added sleep to various tests to ensure file mtimes are different

81a5108

added more sleeps to ensure mtimes of input files are different in tests

0c4b179

debugged setting hash cache via env var and added clean up of directory

615d590

mock mtime writing instead of adding sleeps to ensure mtimes are different

55b660e

undid overzealous black

5d51736

effigies reviewed Feb 28, 2024

effigies reviewed Feb 28, 2024

tclose added 5 commits February 29, 2024 19:11

implementing @effigies suggestions

a031ea5

added comments and doc strings to explain the use of the persistent cache

f2f70a6

touched up comments

191aa9c

another comment touch up

3076fea

touch up comments again

a094fbc

tclose and others added 5 commits March 8, 2024 17:12

Merge branch 'master' into local-cache-ids

d27201f

[pre-commit.ci] auto fixes from pre-commit.com hooks

291f29f

for more information, see https://pre-commit.ci

added in @djarecka's test for moving file cache locations

0a10f6c

updated cache initialisation

311e3dd

switched to use blake2b isntead of blake2s

4827365

tclose added 2 commits March 8, 2024 19:54

[skip ci] deleted already commented-out code

b6799b6

additional doc strings for hash cache objects

2bb86fe

added test to see that persistent cache is used in the running of tasks

1f601e1

djarecka reviewed Mar 16, 2024

tclose added 2 commits March 17, 2024 11:04

moved persistent hash cache within "hash_cache" subdirectory of the pydra user cache dir

7e60c41

fixed import issue

921979c

djarecka merged commit 811dc45 into nipype:master Mar 17, 2024
43 checks passed

tclose deleted the local-cache-ids branch March 17, 2024 20:54
