
Writing an intermediate dataset to ensure proper shuffling/splitting/etc. #252

s-kganz commented Nov 30, 2024

Is your feature request related to a problem?

In my work I find myself writing custom code to do three main ETL tasks:

  • Shuffling data
  • Splitting data into train/test/valid sets
  • Dropping NAs

(The NA handling can be significant for me: 50% or more of my data can be missing, more so when I'm constructing windowed data. A rough sketch of all three tasks follows.)
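For concreteness, here is a minimal sketch of the three tasks with plain xarray. The `sample` dimension name, the 70/15/15 split, and the function name are all assumptions for illustration, not anything xbatcher provides today:

```python
import numpy as np
import xarray as xr

def prepare(ds: xr.Dataset, sample_dim: str = "sample", seed: int = 0):
    # Drop any sample that contains missing values.
    ds = ds.dropna(dim=sample_dim, how="any")

    # Full shuffle: fancy indexing along the sample dimension. On a
    # dask-backed dataset this forces an out-of-order read across every
    # chunk, which is exactly the cost discussed in this issue.
    rng = np.random.default_rng(seed)
    ds = ds.isel({sample_dim: rng.permutation(ds.sizes[sample_dim])})

    # 70/15/15 train/valid/test split (the fractions are an assumption).
    n = ds.sizes[sample_dim]
    i1, i2 = int(0.70 * n), int(0.85 * n)
    return (
        ds.isel({sample_dim: slice(None, i1)}),
        ds.isel({sample_dim: slice(i1, i2)}),
        ds.isel({sample_dim: slice(i2, None)}),
    )
```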

xbatcher's roadmap alludes to a few of these tasks. However, I don't currently use xbatcher because I find that writing an intermediate, model-ready dataset is easier and gives me better results.

This preference mainly arises from the shuffling task; the other two bullets above are entirely feasible with xbatcher and a little extra code. Assume we are working with an xarray dataset backed by chunked dask arrays. I see two ways to shuffle:

  1. Assign each batch an index and randomly access batches according to that index (this is my understanding of what an xbatcher + torch dataloader workflow does).
  2. Load one chunk into memory, shuffle batches within the chunk, and iterate over all chunks.

I find neither of these approaches entirely satisfactory. Approach 1 is very slow because we load a whole chunk into memory, make one batch, and then drop the rest of the data. Approach 2 does not yield model performance as good as a full shuffle (notebook w/ a public toy dataset here).
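For concreteness, approach 2 looks roughly like the following. This is a sketch only; it assumes the dataset is chunked along a `sample` dimension, and the function name and defaults are illustrative:

```python
import numpy as np
import xarray as xr

def chunkwise_shuffled_batches(ds: xr.Dataset, sample_dim: str = "sample",
                               batch_size: int = 256, seed: int = 0):
    rng = np.random.default_rng(seed)
    start = 0
    for chunk_len in ds.chunksizes[sample_dim]:
        # Load exactly one chunk into memory.
        chunk = ds.isel({sample_dim: slice(start, start + chunk_len)}).load()
        start += chunk_len
        # Shuffle samples within this chunk only, never across chunks.
        chunk = chunk.isel({sample_dim: rng.permutation(chunk_len)})
        # Yield fixed-size batches from the shuffled chunk.
        for i in range(0, chunk_len, batch_size):
            yield chunk.isel({sample_dim: slice(i, i + batch_size)})
```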

Describe the solution you'd like

The goal of xbatcher is to link xarray datasets to machine learning frameworks. Since shuffling, train/valid/test splits, and dropping NAs are necessary preprocessing tasks for machine learning, xbatcher should provide an API or at least recommend good practices for doing these tasks.
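Purely as an illustration of what such an API might look like (the commented keyword arguments are hypothetical and do not exist in xbatcher today; only the `input_dims` call is real):

```python
import xbatcher

# Hypothetical sketch -- the commented keywords are not part of xbatcher.
bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"sample": 256},
    # shuffle=True,               # full shuffle before batching
    # drop_na=True,               # skip samples with missing values
    # split=(0.70, 0.15, 0.15),   # train/valid/test fractions
)
```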

My opinion is that creating an intermediate, cleaned dataset before batching is an acceptable, if suboptimal, way to prepare an xarray dataset for machine learning. At least one other person and I have come to this conclusion. Therefore I think the community would benefit from having code in xbatcher (or somewhere else?) for the ETL tasks above.
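The intermediate-dataset workflow amounts to something like the following sketch. The paths and chunk size are assumptions, and `prepare` refers to the earlier sketch in this issue:

```python
# Clean, shuffle, and split once; persist to zarr; then batch from the
# model-ready stores on every training run.
train, valid, test = prepare(ds, sample_dim="sample")
for name, split in {"train": train, "valid": valid, "test": test}.items():
    # Rechunk so each write task covers a contiguous block of samples.
    split.chunk({"sample": 4096}).to_zarr(f"model_ready_{name}.zarr", mode="w")
```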

Describe alternatives you've considered

At first I thought about writing a custom batcher that would hold some number of chunks in memory and interleave batches from each chunk (sketched below). The user would set the number of chunks to balance shuffle quality against memory usage. I think this would be pretty cool, but it's much easier to just write intermediate data. Plus, the underlying data stays available for colleagues instead of disappearing once your model is fit.
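For the record, that interleaving batcher might look roughly like this (untested; assumes chunking along a `sample` dimension, and all names are illustrative):

```python
import numpy as np
import xarray as xr

def interleaved_batches(ds: xr.Dataset, sample_dim: str = "sample",
                        n_chunks: int = 4, batch_size: int = 256, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Chunk boundaries along the sample dimension.
    bounds = np.cumsum((0,) + ds.chunksizes[sample_dim])
    slices = [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    for i in range(0, len(slices), n_chunks):
        # Pool n_chunks chunks in memory and shuffle across the pool,
        # trading extra memory for better mixing than a per-chunk shuffle.
        pool = xr.concat(
            [ds.isel({sample_dim: s}).load() for s in slices[i : i + n_chunks]],
            dim=sample_dim,
        )
        pool = pool.isel({sample_dim: rng.permutation(pool.sizes[sample_dim])})
        for j in range(0, pool.sizes[sample_dim], batch_size):
            yield pool.isel({sample_dim: slice(j, j + batch_size)})
```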

Additional context

No response
