
Writing an intermediate dataset to ensure proper shuffling/splitting/etc. #252

s-kganz commented Nov 30, 2024

Is your feature request related to a problem?

In my work I find myself writing custom code to do three main ETL tasks:

  • Shuffling data
  • Splitting data into train/test/valid sets
  • Dropping NAs

(The NA handling can be significant for me: 50% or more of my data can be missing, more so when I'm constructing windowed data. A rough sketch of all three tasks follows.)
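For concreteness, here is a minimal sketch of the three tasks with plain xarray. The `sample` dimension name, the 70/15/15 split, and the function name are all assumptions for illustration, not anything xbatcher provides today:

```python
import numpy as np
import xarray as xr

def prepare(ds: xr.Dataset, sample_dim: str = "sample", seed: int = 0):
    # Drop any sample that contains missing values.
    ds = ds.dropna(dim=sample_dim, how="any")

    # Full shuffle: fancy indexing along the sample dimension. On a
    # dask-backed dataset this forces an out-of-order read across every
    # chunk, which is exactly the cost discussed in this issue.
    rng = np.random.default_rng(seed)
    ds = ds.isel({sample_dim: rng.permutation(ds.sizes[sample_dim])})

    # 70/15/15 train/valid/test split (the fractions are an assumption).
    n = ds.sizes[sample_dim]
    i1, i2 = int(0.70 * n), int(0.85 * n)
    return (
        ds.isel({sample_dim: slice(None, i1)}),
        ds.isel({sample_dim: slice(i1, i2)}),
        ds.isel({sample_dim: slice(i2, None)}),
    )
```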

xbatcher's roadmap alludes to a few of these tasks. However, I don't currently use xbatcher because I find that writing an intermediate, model-ready dataset is easier and gives me better results.

This preference mainly arises from the shuffling task; the other two bullets above are entirely feasible with xbatcher and a little extra code. Assume we are working with an xarray dataset backed by chunked dask arrays. I see two ways to shuffle:

  1. Assign each batch an index and randomly access batches according to that index (this is my understanding of what an xbatcher + torch dataloader workflow does).
  2. Load one chunk into memory, shuffle batches within the chunk, and iterate over all chunks.

I find neither of these approaches entirely satisfactory. Approach 1 is very slow because we load a whole chunk into memory, make one batch, and then drop the rest of the data. Approach 2 does not yield model performance as good as a full shuffle (notebook w/ a public toy dataset here).
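For concreteness, approach 2 looks roughly like the following. This is a sketch only; it assumes the dataset is chunked along a `sample` dimension, and the function name and defaults are illustrative:

```python
import numpy as np
import xarray as xr

def chunkwise_shuffled_batches(ds: xr.Dataset, sample_dim: str = "sample",
                               batch_size: int = 256, seed: int = 0):
    rng = np.random.default_rng(seed)
    start = 0
    for chunk_len in ds.chunksizes[sample_dim]:
        # Load exactly one chunk into memory.
        chunk = ds.isel({sample_dim: slice(start, start + chunk_len)}).load()
        start += chunk_len
        # Shuffle samples within this chunk only, never across chunks.
        chunk = chunk.isel({sample_dim: rng.permutation(chunk_len)})
        # Yield fixed-size batches from the shuffled chunk.
        for i in range(0, chunk_len, batch_size):
            yield chunk.isel({sample_dim: slice(i, i + batch_size)})
```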

Describe the solution you'd like

The goal of xbatcher is to link xarray datasets to machine learning frameworks. Since shuffling, train/valid/test splits, and dropping NAs are necessary preprocessing tasks for machine learning, xbatcher should provide an API or at least recommend good practices for doing these tasks.
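Purely as an illustration of what such an API might look like (the commented keyword arguments are hypothetical and do not exist in xbatcher today; only the `input_dims` call is real):

```python
import xbatcher

# Hypothetical sketch -- the commented keywords are not part of xbatcher.
bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"sample": 256},
    # shuffle=True,               # full shuffle before batching
    # drop_na=True,               # skip samples with missing values
    # split=(0.70, 0.15, 0.15),   # train/valid/test fractions
)
```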

My opinion is that creating an intermediate, cleaned dataset before batching is an acceptable, if suboptimal, way to prepare an xarray dataset for machine learning. At least one other person and I have come to this conclusion. Therefore I think the community would benefit from having code in xbatcher (or somewhere else?) for the ETL tasks above.
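The intermediate-dataset workflow amounts to something like the following sketch. The paths and chunk size are assumptions, and `prepare` refers to the earlier sketch in this issue:

```python
# Clean, shuffle, and split once; persist to zarr; then batch from the
# model-ready stores on every training run.
train, valid, test = prepare(ds, sample_dim="sample")
for name, split in {"train": train, "valid": valid, "test": test}.items():
    # Rechunk so each write task covers a contiguous block of samples.
    split.chunk({"sample": 4096}).to_zarr(f"model_ready_{name}.zarr", mode="w")
```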

Describe alternatives you've considered

At first I thought about writing a custom batcher that would hold some number of chunks in memory and interleave batches from each chunk (sketched below). The user would set the number of chunks to balance shuffle quality against memory usage. I think this would be pretty cool, but it's much easier to just write intermediate data. Plus, the underlying data stays available for colleagues instead of disappearing once your model is fit.
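For the record, that interleaving batcher might look roughly like this (untested; assumes chunking along a `sample` dimension, and all names are illustrative):

```python
import numpy as np
import xarray as xr

def interleaved_batches(ds: xr.Dataset, sample_dim: str = "sample",
                        n_chunks: int = 4, batch_size: int = 256, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Chunk boundaries along the sample dimension.
    bounds = np.cumsum((0,) + ds.chunksizes[sample_dim])
    slices = [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    for i in range(0, len(slices), n_chunks):
        # Pool n_chunks chunks in memory and shuffle across the pool,
        # trading extra memory for better mixing than a per-chunk shuffle.
        pool = xr.concat(
            [ds.isel({sample_dim: s}).load() for s in slices[i : i + n_chunks]],
            dim=sample_dim,
        )
        pool = pool.isel({sample_dim: rng.permutation(pool.sizes[sample_dim])})
        for j in range(0, pool.sizes[sample_dim], batch_size):
            yield pool.isel({sample_dim: slice(j, j + batch_size)})
```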

Additional context

No response
