Implement file management on the cache and source dirs #17

jrdh · 2022-05-14T18:11:18Z

This PR contains a few bits and bobs:

- Cache management of the source and cache dirs
- Remove MSS requirement for media to be associated with collection record
- Remove a lot of complexity around the image processing
- Sort out error flows and logging
- Add content-length header to /original endpoint

Closes #8

This includes files in the cache and source dirs. There are also a number of other changes in this commit; it got away from me a bit. Changes include:

- cache management for processed images
- cache management for the source images
- the concept of a source file being in use so that it remains on disk throughout processing
- processing is simplified to use a standard process pool, in the hope that this reduces the number of zombie processes we keep getting
- exception handling is revamped
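To illustrate the "in use" idea, here is a minimal sketch of how a source file could be pinned on disk while it is being processed. It is not the code from this PR: `SourceFileTracker`, `use`, `evict` and `demo` are hypothetical names, and the real cache will have more going on (size limits, locking, and so on).

```python
import asyncio
import tempfile
from contextlib import asynccontextmanager
from pathlib import Path


class SourceFileTracker:
    """Hypothetical helper that counts the active users of each source file."""

    def __init__(self):
        # Maps a path to the number of tasks currently using it.
        self._in_use = {}

    @asynccontextmanager
    async def use(self, path):
        # Mark the file as in use for the duration of the with-block.
        self._in_use[path] = self._in_use.get(path, 0) + 1
        try:
            yield path
        finally:
            remaining = self._in_use[path] - 1
            if remaining:
                self._in_use[path] = remaining
            else:
                del self._in_use[path]

    def evict(self, path):
        """Delete the file unless something is still using it."""
        if path in self._in_use:
            return False
        path.unlink(missing_ok=True)
        return True


async def demo():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.jpg"
        source.write_bytes(b"fake image data")
        tracker = SourceFileTracker()
        async with tracker.use(source):
            # Eviction is refused while the file is being processed.
            evicted_during = tracker.evict(source)
        # Once processing has finished the file can be removed.
        evicted_after = tracker.evict(source)
        return evicted_during, evicted_after, source.exists()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the counter (rather than a simple flag) is that several requests can be processing the same source file at once, and the file should only become evictable when the last of them is done.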

This commit removes the requirement for an MSS profile image to have an associated specimen/index lot/artefact record in Elasticsearch. We can drop it because we rely on the APS part of the MSS to do this check for us, and because the IDs used for the images (specifically GUIDs) are not guessable, so a user can't accidentally stumble upon images they shouldn't be able to get to by incrementing the ID. This has two side effects: requests take less time to process because we no longer have to ask Elasticsearch about the collections, and we can now serve images not associated with collections through this interface, which is very useful!

It's better if we don't use an AsyncExitStack for this, as it means we're leaving connections open after we're really done with them. For example, if we end up needing to go to the dams to get the data, we're leaving two connections open to the MSS even though we're done with them. Instead, we just open each URL in turn and close the connection out when we're done with it. Additionally, this commit adds exception handling so that we can track the URL we were attempting to access up the stack.
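As a rough sketch of the pattern (not the actual code from this commit), each URL gets its own `async with` so the connection is closed before the next one is opened, and any failure is wrapped in an exception that carries the URL. `FetchError`, `read_url`, `FakeBackend` and `FakeResponse` are hypothetical names; the fake backend stands in for a real HTTP client so that the example runs on its own.

```python
import asyncio
from contextlib import asynccontextmanager


class FetchError(Exception):
    """Hypothetical error type that records which URL failed."""

    def __init__(self, url, cause):
        super().__init__(f"failed to fetch {url}: {cause}")
        self.url = url
        self.cause = cause


async def read_url(open_url, url):
    """Open one URL, read it fully and close the connection before returning."""
    try:
        async with open_url(url) as response:
            return await response.read()
    except Exception as cause:
        # Attach the URL so that it can be tracked further up the stack.
        raise FetchError(url, cause) from cause


class FakeResponse:
    def __init__(self, body):
        self._body = body

    async def read(self):
        return self._body


class FakeBackend:
    """Stand-in for an HTTP client that counts concurrently open connections."""

    def __init__(self, bodies):
        self.bodies = bodies
        self.open_now = 0
        self.max_open = 0

    @asynccontextmanager
    async def open(self, url):
        self.open_now += 1
        self.max_open = max(self.max_open, self.open_now)
        try:
            if url not in self.bodies:
                raise ConnectionError("no such resource")
            yield FakeResponse(self.bodies[url])
        finally:
            # The connection is released as soon as the with-block exits.
            self.open_now -= 1


async def demo(backend):
    # Two lookups against one service followed by one against another.
    return [await read_url(backend.open, u) for u in ("mss/a", "mss/b", "dams/c")]


if __name__ == "__main__":
    backend = FakeBackend({"mss/a": b"1", "mss/b": b"2", "dams/c": b"3"})
    print(asyncio.run(demo(backend)), backend.max_open)
```

With an exit stack, all three connections would stay open until the stack unwinds; here at most one is open at any time.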

Once we've started streaming the body of the response, we can't go back and change the headers we've already sent to show a nice error; we basically just have to hang up the phone and stop sending bits. So there's no point in having various options for handling this. There's only really one choice: raise the exception anyway, but log the error before we do (because otherwise it will disappear into the ether). Now when you stream an original and the backend (MSS or dams) fails, we get a log of what happened and the user gets the same experience they had before. Neater and simpler to understand/manage/maintain.
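The log-then-raise approach can be sketched roughly like this. It is an illustration only, not the code from this commit: `stream_with_logging`, `failing_backend`, `consume` and the logger name are all hypothetical, and the failing async generator stands in for a backend that dies part way through a download.

```python
import asyncio
import logging

logger = logging.getLogger("stream_sketch")


async def stream_with_logging(chunks, name):
    """Yield chunks from an async iterator, logging any failure before re-raising it."""
    try:
        async for chunk in chunks:
            yield chunk
    except Exception:
        # The headers have already gone out, so all we can do is record
        # what happened and let the exception end the response.
        logger.exception("Streaming %s failed part way through", name)
        raise


async def failing_backend():
    # Stand-in for a backend that sends some data and then breaks.
    yield b"abc"
    raise ConnectionError("backend went away")


async def consume():
    received = []
    try:
        async for chunk in stream_with_logging(failing_backend(), "original"):
            received.append(chunk)
    except ConnectionError:
        return received, True
    return received, False


if __name__ == "__main__":
    print(asyncio.run(consume()))
```

The client still sees a truncated response, exactly as before; the only difference is that the failure now leaves a trace in the logs.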

jrdh added 30 commits May 11, 2022 18:55

Pass int instead of str for emu_irn

e81d67e

Catch image errors and try again with pillow

3b5dbb2

This is useful for large images for example where the size exceeds the limits specified in the ImageMagick policy.xml file.

Throw fetch errors up from the FetchCache.use method

9f96134

We don't know how they will be used so don't wrap them.

Check again once the lock is acquired before fetching

c6995b8

Update the status on mss profile to provide cache stats

b3ba8e6

Provide stats straight from the FetchCache instead of copying the log…

0c87980

…ic around

Clear up get_path_stats usages and tests

3ddbc1f

Await the processor status

619071f

Add doc for favicon endpoint

190c6b4

Don't log IIIF request format errors

987ec4f

Add some exception handler tests

d5e8823

Catch all kinds of wand exceptions before trying pillow

207daec

Remove old import

3c9a427

Fix exception usage breaking ops tests

3907020

Make base _fetch async

30558fa

Make test _fetch async

819e5f8

Fix test typo failing test

0fd026f

Fix disk profile tests

54f7fff

Update abstract resolve_filename method's return type

4504474

We don't use None returns anymore, we use exceptions.

Add content-length to the /original endpoint

4fb12eb

Now it's not a mystery how big the file is.

Split the error log handling out so that it can be used elsewhere

2f873e3

Remove old import

27b6658

Upgrade comment that is out of date

4a15711

Tidy up some comments and abstract method defs

46194a4

DRY up the disk profile

a8b1165

Alter disk filename test to use real file

aa4d493

jrdh added 3 commits May 13, 2022 23:54

Put in a proper error for no length responses

74af6be

Remove spurious e

53df5a5

Add an update into the test workflow

47eaf06

jrdh merged commit ab62a7c into dev May 14, 2022

jrdh deleted the josh/file_management branch May 14, 2022 18:15

jrdh mentioned this pull request May 16, 2022

v0.12.0 release #22

Merged
