Is your feature request related to a problem?
In my work I find myself writing custom code to do three main ETL tasks (sketched below):
- Shuffling data
- Splitting data into train/test/valid sets
- Dropping NAs
(The NAs part can be significant for me: 50% or more of my data can be missing, more so if I'm constructing windowed data.)
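Concretely, the custom code amounts to something like the following. This is a minimal sketch with made-up variable and dimension names, not anyone's API:

```python
import numpy as np
import xarray as xr

# Toy dask-backed dataset with ~50% missing values along a "sample"
# dimension; all names here are illustrative.
x = np.random.default_rng(0).random(1_000)
x[x < 0.5] = np.nan
ds = xr.Dataset({"x": ("sample", x)}).chunk({"sample": 100})

# Dropping NAs (done first, so the shuffle and split see clean data)
ds = ds.dropna(dim="sample")

# Shuffling: a full shuffle via a random permutation of the sample index
rng = np.random.default_rng(1)
ds = ds.isel(sample=rng.permutation(ds.sizes["sample"]))

# Splitting into train/valid/test by position (80/10/10)
n = ds.sizes["sample"]
train = ds.isel(sample=slice(0, int(0.8 * n)))
valid = ds.isel(sample=slice(int(0.8 * n), int(0.9 * n)))
test = ds.isel(sample=slice(int(0.9 * n), None))
```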
xbatcher's roadmap alludes to a few of these tasks. However, I don't currently use xbatcher because I find that writing an intermediate, model-ready dataset is easier and gives me better results.
This preference mainly arises from the shuffling task; the other two bullets above are plenty feasible with xbatcher and a little extra code. Assume that we are working with an xarray dataset backed by chunked dask arrays. I see two ways to shuffle (both sketched below):
1. Assign each batch an index and randomly access batches according to that index (this is my understanding of what an xbatcher + torch DataLoader workflow does).
2. Load one chunk into memory, shuffle batches within the chunk, and iterate over all chunks.
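Rough sketches of both, for concreteness. The dataset, variable, and dimension names are made up, and I'm using `xbatcher.BatchGenerator` indexing as I understand the current API, so treat this as illustrative rather than canonical:

```python
import numpy as np
import torch
import xarray as xr
import xbatcher

# Toy dask-backed dataset; variable and dimension names are illustrative.
ds = xr.Dataset({"x": ("time", np.arange(4_096.0))}).chunk({"time": 512})

# --- Approach 1: random access to indexed batches via a DataLoader ----
bgen = xbatcher.BatchGenerator(ds, input_dims={"time": 32})

class BatchDataset(torch.utils.data.Dataset):
    """Minimal adapter; xbatcher also ships torch loaders of its own."""

    def __init__(self, bgen):
        self.bgen = bgen

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        # Random access: each lookup may pull a whole dask chunk into
        # memory just to materialize a single batch.
        return torch.as_tensor(self.bgen[idx]["x"].values)

loader = torch.utils.data.DataLoader(
    BatchDataset(bgen), batch_size=None, shuffle=True
)

# --- Approach 2: load one chunk, shuffle within it, move on -----------
def chunkwise_shuffled(ds, dim="time", batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    bounds = np.cumsum((0,) + ds.chunksizes[dim])
    for a, b in zip(bounds[:-1], bounds[1:]):
        chunk = ds.isel({dim: slice(int(a), int(b))}).load()
        order = rng.permutation(chunk.sizes[dim])
        for i in range(0, len(order), batch_size):
            yield chunk.isel({dim: order[i : i + batch_size]})
```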
I find neither approach entirely satisfactory. Approach 1 is very slow because we load a chunk into memory, make one batch, and then drop the rest of the data. Approach 2 does not yield as good model performance as a full shuffle (notebook with a public toy dataset here).
Describe the solution you'd like
The goal of xbatcher is to link xarray datasets to machine learning frameworks. Since shuffling, train/valid/test splits, and dropping NAs are necessary preprocessing tasks for machine learning, xbatcher should provide an API or at least recommend good practices for doing these tasks.
My opinion is that creating an intermediate, cleaned dataset before batching is an acceptable, if suboptimal, way to prepare an xarray dataset for machine learning. At least one other person and I have come to this conclusion. Therefore I think the community would benefit from having code in xbatcher (or somewhere else?) to do the ETL tasks above.
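To make the ask concrete, such helpers could look something like this. The names and signatures below are invented for illustration; nothing like this exists in xbatcher today:

```python
import numpy as np
import xarray as xr

# Hypothetical utilities that could live in xbatcher; the names and
# signatures are invented here to make the feature request concrete.

def dropna_samples(ds: xr.Dataset, dim: str) -> xr.Dataset:
    """Drop positions along `dim` where any variable is NaN."""
    return ds.dropna(dim=dim)

def shuffled_split(ds: xr.Dataset, dim: str, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle along `dim`, then split into len(fractions) datasets."""
    rng = np.random.default_rng(seed)
    n = ds.sizes[dim]
    ds = ds.isel({dim: rng.permutation(n)})
    edges = np.cumsum((0,) + tuple(int(f * n) for f in fractions))
    return [
        ds.isel({dim: slice(int(a), int(b))})
        for a, b in zip(edges[:-1], edges[1:])
    ]
```

Even a documented recipe along these lines would cover the "recommend good practices" half of the ask.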
Describe alternatives you've considered
At first I thought about writing a custom batcher that would hold some number of chunks in memory and interleave batches from each chunk. The user would set the number of chunks to balance good shuffling against memory usage. I think this would be pretty cool, but it's much easier to just write intermediate data. Plus, you get the benefit of having the underlying data available for colleagues instead of it disappearing once your model is fit.
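A sketch of that idea: hold `n_chunks` chunks in memory at once and draw batches from a shuffled pool spanning all of them. Illustrative only; this is not an existing xbatcher class:

```python
import numpy as np
import xarray as xr

def interleaved_batches(ds, dim="sample", batch_size=32, n_chunks=4, seed=0):
    rng = np.random.default_rng(seed)
    bounds = np.cumsum((0,) + ds.chunksizes[dim])
    chunk_slices = list(zip(bounds[:-1], bounds[1:]))
    rng.shuffle(chunk_slices)  # visit chunks in random order
    for i in range(0, len(chunk_slices), n_chunks):
        group = chunk_slices[i : i + n_chunks]
        # Load a group of chunks and pool their samples together
        pool = xr.concat(
            [ds.isel({dim: slice(int(a), int(b))}).load() for a, b in group],
            dim=dim,
        )
        order = rng.permutation(pool.sizes[dim])
        for j in range(0, len(order), batch_size):
            yield pool.isel({dim: order[j : j + batch_size]})
```

Here `n_chunks` trades shuffle quality against memory: at `n_chunks=1` this degenerates to approach 2 above, and with all chunks loaded it becomes a full shuffle.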
Additional context
No response