Better handling of custom dataset loaders #2

Open
jproctor opened this issue Oct 17, 2022 · 0 comments

I’m not 100% sold on the MHC-specific loaders being baked into the repo the way they are. They could move into a subdirectory/module and stay in this repo, or they could move to a separate repo.

In custom_datasets.py, load_mhc_libs() relies on the existence of prepro_channels.npy in the data directory. It looks like load_mars_big() needs ccs_channels.npy too. Regardless of whether they stay in this repo or move, they shouldn’t rely on the existence of otherwise undocumented data files. The question is whether they belong with the data (in a documented way) or the loader.
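As a stopgap wherever the files end up, a loader could at least fail with a clear message when a supporting file is missing. This is a minimal sketch, not what custom_datasets.py currently does: `require_support_files` is a hypothetical helper, and only the two filenames come from the existing loaders.

```python
from pathlib import Path


def require_support_files(directory, filenames):
    """Return {filename: path} for each file, or raise naming every missing one.

    Hypothetical helper: `directory` is wherever the supporting files are
    expected to live (the data directory today, possibly the loader module later).
    """
    directory = Path(directory)
    paths = {name: directory / name for name in filenames}
    missing = sorted(name for name, path in paths.items() if not path.is_file())
    if missing:
        raise FileNotFoundError(
            f"Loader needs supporting file(s) {', '.join(missing)} in {directory}"
        )
    return paths
```

A loader such as load_mars_big() could call this once up front, for example `require_support_files(data_dir, ["prepro_channels.npy", "ccs_channels.npy"])`, assuming it really does need both files.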

Architecturally, a plugin pattern definitely fits both public datasets and their custom loader functions (and possibly a way to bundle them together), but I think separate repos are overkill without a better use case. Focusing on the loaders:

  1. Let’s define a place to add loader modules and some code to slurp in everything it finds there. backend/loaders/ makes sense to me but I could argue for other places.
  2. Move the MHC loaders into a module there. Be clever with .gitignore so other things in the loaders directory are excluded from the repo.
  3. Move the .npy files into that module (subdir for supporting data?).
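
To make step 1 concrete, here is a rough sketch of the "slurp in everything" code. It assumes loader modules are plain .py files in one directory and that loader functions follow the existing `load_*` naming; `discover_loaders` and the module-naming scheme are made up for illustration.

```python
import importlib.util
from pathlib import Path


def discover_loaders(directory):
    """Import every .py file in `directory` and collect its load_* callables.

    Assumption: anything named load_* at module level is a dataset loader
    (this also picks up load_* names a module imports from elsewhere).
    """
    loaders = {}
    for path in sorted(Path(directory).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helper modules
        spec = importlib.util.spec_from_file_location(f"loaders_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        for name, obj in vars(module).items():
            if name.startswith("load_") and callable(obj):
                loaders[name] = obj
    return loaders
```

With the MHC loaders moved into a module under backend/loaders/, `discover_loaders("backend/loaders")` would return a dict containing `load_mhc_libs` and `load_mars_big`. Later files win on name collisions here; a real version would probably want to raise instead.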

There’s also room for a change in datasets.yml to make the files: key clearer about what’s going on, instead of baking those architectural decisions into the loaders as well (some types of data require two lines, and the second is always the metadata file; others require one line and assume the metadata filename; &c.). It would make sense to at least consider that before we declare this task done, though it could easily end up being a major API overhaul that gets split out into its own project.
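
One possible direction, sketched under heavy assumptions: accept both the current positional form and an explicit mapping for files:, and normalize them in one place so loaders stop encoding the convention themselves. The `data` and `metadata` key names and `normalize_files` are hypothetical; nothing like this exists in datasets.yml today. The function takes the already-parsed YAML value.

```python
def normalize_files(entry):
    """Turn a parsed `files:` value into {'data': ..., 'metadata': ...}.

    Accepted shapes (the mapping form is a proposal, not the current schema):
      - ["data.ext"]               one line, loader assumes the metadata filename
      - ["data.ext", "meta.ext"]   two lines, second is the metadata file
      - {"data": ..., "metadata": ...}   explicit mapping, metadata optional
    A metadata value of None means "use the loader's default filename".
    """
    if isinstance(entry, dict):
        unknown = set(entry) - {"data", "metadata"}
        if unknown or "data" not in entry:
            raise ValueError(f"files: mapping needs 'data' and allows 'metadata', got {sorted(entry)}")
        return {"data": entry["data"], "metadata": entry.get("metadata")}
    if isinstance(entry, (list, tuple)) and 1 <= len(entry) <= 2:
        metadata = entry[1] if len(entry) == 2 else None
        return {"data": entry[0], "metadata": metadata}
    raise ValueError(f"files: must be a mapping or a list of 1-2 filenames, got {entry!r}")
```

This keeps existing datasets.yml entries working while letting new ones spell out which file is which.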
