This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

feat/on-the-fly inference #87

Open
wants to merge 8 commits into
base: main

Conversation

@YoniSchirris YoniSchirris commented Jun 6, 2024

Fixes #73.

The commit contains some minor comments that need quick fixing.

This PR implements generating an in-memory database on-the-fly.

This is a useful feature if you want to, e.g., run inference using a segmentation model on a set of slides that you do not wish to generate a complete database for.

That is exactly the use case it is designed for: running inference with a segmentation model on a glob of slides from a directory, taking only the slide as input (no masks, annotations, labels, or patient information).

To achieve this, I have

  • Implemented an OnTheFlyDataDescription class, which contains fewer arguments than DataDescription
  • The most important difference in OnTheFlyDataDescription is that the data_dir is used together with the glob_pattern to search for WSIs. The matches populate the in-memory DB on-the-fly, and the inference pipeline then reads from that DB.
  • To achieve this, the DataManager gets an _on_the_fly property that is set by checking the data_description class.
  • During initialization, it populates the DB and sets an engine, in contrast to the saved DB where a uri from the data_description is used to load the DB
  • In the session, it opens the db either from an engine or a uri depending on the use-case
  • The on-the-fly population utilizes the DataManager's get_all_images function, which is only implemented for a MinimalImage table
  • The MinimalImage table is the sole, and very minimal, part of the on-the-fly DB
  • Creating the dataset from the data_description now goes through a wrapper function, create_datasets_from_data_description. Based on the (OnTheFly)DataDescription class, it generates a dataset with either datasets_from_data_description_with_uri, which assumes a fully populated DB, or datasets_from_on_the_fly_data_description, which assumes a minimally populated DB with only the MinimalImage table
  • Three small test svs files are added. These come from openslide, but I could not find them in the installed openslide-python directories and decided to add them explicitly here, which is likely easier and may be used for any test
  • populate_minimal_db_for_inference.py provides a very simple test for DB population
  • tests/test_run_segmentation_inference_with_on_the_fly_in_memory_database.sh is a very detailed example / documentation on how to run the inference, including configs and required env variables.

Possible limitations

  • Can't give a mask yet, while you may want to run segmentation on only a part of the image, e.g. when you have a tumor bed mask
  • I didn't know exactly which config variables to keep/remove. LMK if the choice makes sense
  • Some naming can be improved
  • May want to refactor DataManager to use an engine for both DataDescriptions, so that the session is always opened from the same kind of input. This may be easier to read and may make it easier to add new features or refactors to the DataManager. In one case the engine is created by creating a DB and populating it; in the other case it is created by reading the DB from the uri. The session is then simply opened from the engine, instead of first creating an engine from the uri and then returning the session to which that engine is bound
  • Current DLUP version has a bug concerning the pyramidal format of the generated segmentation map
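
The engine-based refactor mentioned in the limitations could look roughly like the sketch below. It is an assumption-laden illustration, not the DataManager in this repository: sqlite3 connections play the role of the "engine", and the names engine_from_saved_db, engine_from_slides and DataManagerSketch are hypothetical. The point is only that both cases end in the same object, so the session code has a single path.

```python
import sqlite3
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def engine_from_saved_db(database_path: str) -> sqlite3.Connection:
    """Saved-DB case: the 'engine' is a connection to an existing file."""
    return sqlite3.connect(database_path)


def engine_from_slides(data_dir: Path, glob_pattern: str) -> sqlite3.Connection:
    """On-the-fly case: create an in-memory DB and populate it from a glob."""
    engine = sqlite3.connect(":memory:")
    engine.execute("CREATE TABLE minimal_image (filename TEXT PRIMARY KEY)")
    engine.executemany(
        "INSERT INTO minimal_image (filename) VALUES (?)",
        [(str(path),) for path in sorted(data_dir.glob(glob_pattern))],
    )
    engine.commit()
    return engine


class DataManagerSketch:
    """Hypothetical manager that only ever receives an engine, never a uri."""

    def __init__(self, engine: sqlite3.Connection) -> None:
        self._engine = engine

    @contextmanager
    def session(self) -> Iterator[sqlite3.Cursor]:
        # One code path for both cases: the session comes from the engine.
        cursor = self._engine.cursor()
        try:
            yield cursor
        finally:
            cursor.close()
```

The caller picks the factory that matches its data description and hands the result to the manager, which no longer needs an _on_the_fly switch inside its session logic.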

@YoniSchirris

in practice i've noticed that populating the db can take 10 minutes for 1500 wsis. this happens when initializing the datamodule, which initializes the datamanager, which immediately populates the DB

the biggest problem here was opening each slide with dlup to extract the mpp, width, and height, which is completely irrelevant for our task here. the image.mpp was only used in the overwrite_mpp when constructing a dataset, which is also not interesting because if the image has no mpp, there's nothing to overwrite it with.

for now i've made the minimal image even more minimal; it only contains the fp to the slide.

we may also want to think about when to populate the db; doing it during datamodule initialization may be a bad design choice.

Honestly, we can even forego the entire DB generation, and just within datasets_from_on_the_fly_data_description do for image in data_description.image_dir.glob(data_description.glob_pattern): and do the rest.

no database models, no engine, no session required.
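
A minimal sketch of that DB-free variant, under stated assumptions: build_dataset is a hypothetical caller-supplied callable standing in for whatever constructs the per-slide dataset, and the function name datasets_from_glob is made up for illustration. Because it is a generator, no slide is touched until the caller iterates, which would also sidestep the slow start-up described above.

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple, TypeVar

DatasetT = TypeVar("DatasetT")


def datasets_from_glob(
    image_dir: Path,
    glob_pattern: str,
    build_dataset: Callable[[Path], DatasetT],
) -> Iterator[Tuple[Path, DatasetT]]:
    """Yield (slide path, dataset) pairs straight from the file system.

    No database models, engine or session are involved. `build_dataset`
    is an assumed hook; it is called lazily, once per slide, only when
    the caller actually consumes the generator.
    """
    for image_path in sorted(image_dir.glob(glob_pattern)):
        yield image_path, build_dataset(image_path)
```

Usage would be along the lines of `for path, dataset in datasets_from_glob(data_description.image_dir, data_description.glob_pattern, make_dataset): ...`, where make_dataset is whatever builds the tiled dataset for one slide.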

If we do want to keep the database, because it might add some more functionality later (e.g. when doing feature extraction w/ a mask?), we may want to open it, populate it, and close it, all during the call of datasets_from_on_the_fly_data_description.
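
If the database is kept, scoping its lifetime to a single call could be sketched as follows. This is only an illustration with sqlite3 as a stand-in for the real DB layer; temporary_image_db and slide_paths_via_scoped_db are hypothetical names, not functions from this PR.

```python
import sqlite3
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, List


@contextmanager
def temporary_image_db(image_dir: Path, glob_pattern: str) -> Iterator[sqlite3.Connection]:
    """Open an in-memory DB, populate it from a glob, and always close it."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE minimal_image (filename TEXT PRIMARY KEY)")
        connection.executemany(
            "INSERT INTO minimal_image (filename) VALUES (?)",
            [(str(path),) for path in sorted(image_dir.glob(glob_pattern))],
        )
        connection.commit()
        yield connection
    finally:
        # Runs on normal exit and on exceptions, so the DB never outlives the call.
        connection.close()


def slide_paths_via_scoped_db(image_dir: Path, glob_pattern: str) -> List[Path]:
    """Read the slide paths while the DB is open; it is closed on return."""
    with temporary_image_db(image_dir, glob_pattern) as connection:
        rows = connection.execute(
            "SELECT filename FROM minimal_image ORDER BY filename"
        ).fetchall()
    return [Path(row[0]) for row in rows]
```

Nothing is populated at datamodule initialization in this arrangement; the cost is paid only when the datasets are actually requested.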

assert current_dataset.slide_image.identifier
self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)
curr_filename = current_dataset._path
assert curr_filename
The assertion is redundant though, since a TiledWsiDataset always has a path.

Successfully merging this pull request may close these issues.

Create database on the fly for inference