Better handling of custom dataset loaders #2

jproctor · 2022-10-17T15:13:49Z

I’m not 100% sold on the MHC-specific loaders being baked into the repo the way they are. They could move into a subdirectory/module and stay in this repo, or they could move to a separate repo.

In custom_datasets.py, load_mhc_libs() relies on the existence of prepro_channels.npy in the data directory. It looks like load_mars_big() needs ccs_channels.npy too. Regardless of whether they stay in this repo or move, they shouldn’t rely on the existence of otherwise undocumented data files. The question is whether they belong with the data (in a documented way) or the loader.
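Whichever way that question goes, the dependency could at least be made explicit at load time. A minimal sketch, assuming a hypothetical helper named require_support_file() (not something in the repo today) that a loader would call before touching its .npy file:

```python
from pathlib import Path


def require_support_file(data_dir, filename, needed_by):
    """Return the path to a supporting file, or fail with a clear message.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{needed_by} requires supporting file '{filename}' "
            f"in '{Path(data_dir)}', but it was not found."
        )
    return path
```

A loader such as load_mhc_libs() could call this for prepro_channels.npy so a missing file fails with a message that names both the loader and the file, rather than an unexplained error further down.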

Architecturally, a plugin pattern definitely fits both public datasets and their custom loader functions (and possibly a way to bundle them together), but I think separate repos are overkill without a better use case. Focusing on the loaders:

- Let’s define a place to add loader modules and some code to slurp in everything it finds there. backend/loaders/ makes sense to me but I could argue for other places.
- Move the MHC loaders into a module there. Be clever with .gitignore so other things in the loaders directory are excluded from the repo.
- Move the .npy files into that module (subdir for supporting data?).
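The "slurp in everything it finds" step could look roughly like this. This is a sketch, not a proposal for the final API: the function name discover_loaders() and the convention that loader functions are named load_* are both assumptions made for illustration.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Dict


def discover_loaders(loader_dir) -> Dict[str, Callable]:
    """Import every public .py file in loader_dir and collect load_* callables.

    Assumption: loader functions follow a load_* naming convention.
    """
    loaders: Dict[str, Callable] = {}
    for path in sorted(Path(loader_dir).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helper modules
        spec = importlib.util.spec_from_file_location(f"loaders_{path.stem}", path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        for name, obj in vars(module).items():
            if name.startswith("load_") and callable(obj):
                loaders[name] = obj
    return loaders
```

Something like discover_loaders("backend/loaders") would then return a name-to-function mapping that the rest of the backend can look loaders up in. Loading by file path rather than by package import means untracked, locally added modules in that directory are picked up too, which fits the .gitignore idea above.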

There’s also room for a change in datasets.yml to make the files: key a little clearer about what’s going on, instead of baking those architectural decisions into the loaders as well (some types of data require two lines and the second is always the metadata file, others require one line and assume the metadata filename, &c.). It would make sense to at least consider that before we declare this task done, though it could easily end up being a major API overhaul and be split out to its own project.
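To make the files: idea concrete, here is one possible shape, sketched in Python. Everything here is an assumption for discussion: the FileSpec class, the parse_files() function, and the data/metadata key names are hypothetical and not the current datasets.yml schema. The sketch accepts the existing positional form (one or two entries) and an explicit mapping form, and normalizes both.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileSpec:
    data: str
    metadata: Optional[str] = None  # None means the loader assumes a filename


def parse_files(entry) -> FileSpec:
    """Normalize a files: entry into an explicit FileSpec.

    Hypothetical schema: either a mapping with 'data' and optional 'metadata',
    or the positional list form described above.
    """
    if isinstance(entry, Mapping):
        unknown = set(entry) - {"data", "metadata"}
        if unknown:
            raise ValueError(f"unknown files: keys: {sorted(unknown)}")
        if "data" not in entry:
            raise ValueError("files: mapping needs a 'data' key")
        return FileSpec(data=entry["data"], metadata=entry.get("metadata"))
    if isinstance(entry, str):
        return FileSpec(data=entry)
    items = list(entry)
    if len(items) == 1:
        return FileSpec(data=items[0])
    if len(items) == 2:
        # Positional form: the second line is always the metadata file.
        return FileSpec(data=items[0], metadata=items[1])
    raise ValueError(f"files: expects one or two entries, got {len(items)}")
```

With something like this, loaders would receive a FileSpec and would no longer need to know which position holds the metadata file.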
