feat/on-the-fly inference #87
base: main
In practice I've noticed that populating the DB can take 10 minutes for 1500 WSIs. This happens when initializing the datamodule, which initializes the DataManager, which immediately populates the DB. The biggest problem here was opening each slide with dlup to extract the mpp, width, and height, which is completely irrelevant for our task here. For now I've made the minimal image even more minimal; it only contains the file path to the slide.

We may also want to think about when to populate the DB; doing it during datamodule initialization may be a bad design choice. Honestly, we can even forego the entire DB generation, and just […] within […]: no database models, no engine, no session required.

If we do want to keep the database, because it might add some more functionality later (e.g. when doing feature extraction w/ a mask?), we may want to open it, populate it, and close it, all during the call of […]
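A minimal sketch of what the DB-free variant could look like, assuming the slides are found by applying a glob pattern to a data directory. The function name `iter_slide_paths` is hypothetical and not part of this PR; `data_dir` and `glob_pattern` mirror the names used in the PR description. No slide is opened, so nothing like mpp, width, or height is read.

```python
from pathlib import Path
from typing import Iterator


def iter_slide_paths(data_dir: Path, glob_pattern: str) -> Iterator[Path]:
    """Yield files under data_dir that match glob_pattern, in sorted order.

    Hypothetical helper: it only touches the file system and never opens a slide.
    """
    # Sorting keeps the order stable between runs, which helps when results
    # are written per slide and compared later.
    yield from sorted(p for p in data_dir.glob(glob_pattern) if p.is_file())
```

With a pattern such as `"**/*.svs"` this also picks up slides in subdirectories; the cost is one directory walk rather than one slide open per file.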
assert current_dataset.slide_image.identifier
self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)
curr_filename = current_dataset._path
assert curr_filename
The assertion is redundant, though, since a `TiledWsiDataset` always has a path.
Fixes #73.
The commit contains some minor comments that need quick fixing.
This PR implements generating an in-memory database on-the-fly.
This is a useful feature if you want to, e.g., run inference using a segmentation model on a set of slides that you do not wish to generate a complete database for.
That is exactly the use-case it is designed for: running inference of a segmentation model on a glob of slides from a directory, taking only the slide as input (no masks, annotations, labels, or patient information).
To achieve this, I have

- […] `OnTheFlyDataDescription` class, which contains fewer arguments than `DataDescription`
- […] `OnTheFlyDataDescription` is that the `data_dir` is used together with the `glob_pattern`. This searches for WSIs, populates the in-memory DB with this on-the-fly, which is used by the inference pipeline.
- […] `DataManager` gets an `_on_the_fly` property that is set by checking the `data_description` class.
- […] `engine`, in contrast to the saved DB where a `uri` from the `data_description` is used to load the DB
- […] `engine` or a `uri` depending on the use-case
- […] `DataManager`'s `get_all_images` function, which is only implemented for a `MinimalImage` table
- […] `MinimalImage` table is the sole, and very minimal, part of the on-the-fly DB
- […] `data_description` now uses a wrapper function `create_datasets_from_data_description`, which, based on the (OnTheFly)`DataDescription` class, generates a dataset with `datasets_from_data_description_with_uri`, which assumes a fully populated DB, and `datasets_from_on_the_fly_data_description`, which assumes a minimally populated DB with only the `MinimalImage` table
- […] `openslide-python` directories and decided to add them explicitly here, which is likely easier and may be used for any test

Possible limitations
- […] `DataManager` to use an engine for both `DataDescription`s to open a session from the same input, which may be easier to read and may make it easier to add new features/refactors to the `DataManager`. In one case it creates an engine by creating a DB and populating it. In the other case it creates an engine by reading it from the uri. The session is then simply opened from the engine, instead of first creating an engine from the uri and then returning the session to which the engine is bound
- […] `DLUP` version has a bug concerning the pyramidal format of the generated segmentation map
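
The single-entry-point idea from the first limitation can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it uses Python's standard-library `sqlite3` as a stand-in for the project's engine/session layer, and the names `OnTheFlyDescription`, `StoredDescription`, `connect`, and the `minimal_image` table are hypothetical. Only `get_all_images`, `data_dir`, and `glob_pattern` echo names from the description above, and their behaviour here is assumed.

```python
import sqlite3
from dataclasses import dataclass
from pathlib import Path
from typing import List, Union


@dataclass
class OnTheFlyDescription:  # hypothetical stand-in for OnTheFlyDataDescription
    data_dir: Path
    glob_pattern: str


@dataclass
class StoredDescription:  # hypothetical stand-in for DataDescription
    db_path: Path  # plays the role of the uri pointing at a saved DB


def connect(description: Union[OnTheFlyDescription, StoredDescription]) -> sqlite3.Connection:
    """Return a connection; build the DB in memory when running on the fly."""
    if isinstance(description, StoredDescription):
        # Saved DB: open what already exists on disk.
        return sqlite3.connect(str(description.db_path))

    # On the fly: create an in-memory DB with one minimal table that stores
    # nothing but the file path of each slide.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE minimal_image (filename TEXT PRIMARY KEY)")
    paths = sorted(description.data_dir.glob(description.glob_pattern))
    conn.executemany(
        "INSERT INTO minimal_image (filename) VALUES (?)",
        [(str(p),) for p in paths if p.is_file()],
    )
    conn.commit()
    return conn


def get_all_images(conn: sqlite3.Connection) -> List[str]:
    """List every slide path stored in the minimal table."""
    rows = conn.execute("SELECT filename FROM minimal_image ORDER BY filename")
    return [row[0] for row in rows]
```

Callers always receive a connection and never need to know which branch produced it, which is the readability gain the limitation describes.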