Implement file management on the cache and source dirs #17

Merged
merged 33 commits into from
May 14, 2022

@jrdh jrdh commented May 14, 2022

This PR contains a few bits and bobs:

  • Cache management of the source and cache dirs
  • Remove MSS requirement for media to be associated with collection record
  • Remove a lot of complexity around the image processing
  • Sort out error flows and logging
  • Add content-length header to /original endpoint

Closes #8

jrdh added 30 commits May 11, 2022 18:55
This includes files in the cache and source dirs.
There are also a number of other changes in this commit; it got
away from me a bit. Changes include:

  - cache management for processed images
  - cache management for the source images
  - the concept of a source file being in use so that it remains
    on disk throughout processing
  - processing is simplified to use a standard process pool in the
    hope that this reduces the number of zombie processes we keep
    getting
  - exception handling is revamped
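
To illustrate the "in use" idea described above, here is a minimal sketch of a size-bounded tracker that skips files currently being processed when it evicts. It is not the code from this PR: `TrackedCache`, `max_bytes`, `delete` and the least-recently-used eviction order are all assumptions made for the example.

```python
from collections import OrderedDict
from contextlib import contextmanager


class TrackedCache:
    """Hypothetical tracker for the files in a cache or source dir."""

    def __init__(self, max_bytes, delete=lambda name: None):
        self.max_bytes = max_bytes
        self.delete = delete        # hypothetical hook that removes the file from disk
        self.sizes = OrderedDict()  # name -> size in bytes, least recently used first
        self.in_use = {}            # name -> number of active users

    @property
    def total(self):
        return sum(self.sizes.values())

    def add(self, name, size):
        """Register a file, then evict idle files if we are over the limit."""
        self.sizes[name] = size
        self.sizes.move_to_end(name)
        self.evict()

    @contextmanager
    def use(self, name):
        """Mark a file as in use so that eviction leaves it on disk."""
        self.in_use[name] = self.in_use.get(name, 0) + 1
        if name in self.sizes:
            self.sizes.move_to_end(name)
        try:
            yield name
        finally:
            self.in_use[name] -= 1
            if self.in_use[name] == 0:
                del self.in_use[name]
            self.evict()

    def evict(self):
        """Drop the least recently used idle files until under the limit."""
        for name in list(self.sizes):
            if self.total <= self.max_bytes:
                break
            if name in self.in_use:
                continue  # still being processed, so keep it
            del self.sizes[name]
            self.delete(name)
```

A source file wrapped in `with cache.use(name):` stays registered for the whole of processing, even if the directory goes over its limit in the meantime.
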
This commit removes the requirement for an MSS profile image to have
an associated specimen/index lot/artefact record in Elasticsearch.
This is because we rely on the APS part of the MSS to do this check
for us and because the IDs used for the images (specifically GUIDs)
are not guessable, so a user can't accidentally stumble upon images
they shouldn't be able to get to by incrementing the ID.

The side effects of this are faster request processing, because we no
longer have to ask Elasticsearch about the collections, and the
ability to serve images not associated with collections through this
interface, which is very useful!
This is useful for large images, for example where the size exceeds the limits specified in the ImageMagick policy.xml file.
We don't know how they will be used so don't wrap them.
We don't use None returns anymore; we use exceptions.
Now it's not a mystery how big the file is.
It's better if we don't use an AsyncExitStack for this, as it means we're leaving connections open after we're really done with them. For example, if we end up needing to go to the DAMS to get the data, we're leaving two connections open to the MSS even though we're done with them.
Instead, we just open each URL in turn and close the connection out when we're done with it.
Additionally, this commit adds exception handling so that we can track the URL we were attempting to access up the stack.
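
A rough sketch of the "open each URL in turn" approach, with the failing URL attached to the exception, might look like this. It is not the code from this PR: `SourceFetchError`, `open_url`, `fetch_first` and the dict-based `backend` standing in for real HTTP requests are all hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager


class SourceFetchError(Exception):
    """Hypothetical error recording which URL we were trying to access."""

    def __init__(self, url, cause):
        super().__init__(f"failed to fetch {url}: {cause}")
        self.url = url


@asynccontextmanager
async def open_url(url, backend, log):
    """Stand-in for a real HTTP request; `log` records open/close events."""
    log.append(("open", url))
    try:
        result = backend[url]  # bytes, None (no data), or an Exception to raise
        if isinstance(result, Exception):
            raise result
        yield result
    finally:
        log.append(("close", url))


async def fetch_first(urls, backend, log):
    """Try each URL in turn, closing each connection before opening the next."""
    for url in urls:
        try:
            async with open_url(url, backend, log) as body:
                if body is not None:
                    return body
        except Exception as e:
            # attach the URL so that it can be tracked up the stack
            raise SourceFetchError(url, e) from e
    raise LookupError("no URL returned any data")
```

Because each `async with` block exits before the next URL is tried, at most one connection is open at any time, which is the difference from entering everything on a single AsyncExitStack.
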
Once we've started streaming the body of the response, we can't go back and change the headers we've already sent to show a nice error; we basically just have to hang up the phone and stop sending bits. To this end, there's no point in having various options for handling this, as there's only really one choice: log the error, then raise the exception anyway (the log comes first because the exception will disappear into the ether). So now when you stream an original and the backend (MSS or DAMS) fails, we get a log of what happened and the user gets the same experience they had before. Neater and simpler to understand, manage and maintain.
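
The log-then-raise behaviour described above can be sketched as an async generator wrapped around the backend stream. This is an illustration only, not the code from this PR: `stream_original`, the `chunks` iterator and the logger name are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("stream_sketch")  # hypothetical logger name


async def stream_original(chunks, url):
    """Relay chunks from the backend, logging any failure before re-raising."""
    try:
        async for chunk in chunks:
            yield chunk
    except Exception:
        # The headers have already been sent, so all we can do is record
        # what happened and let the exception end the response.
        logger.exception("streaming %s failed part-way through", url)
        raise
```

The caller still sees the original exception, so the user's experience is unchanged; the only addition is the log entry.
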
@jrdh jrdh merged commit ab62a7c into dev May 14, 2022
@jrdh jrdh deleted the josh/file_management branch May 14, 2022 18:15
@jrdh jrdh mentioned this pull request May 16, 2022