-
Are you asking about this issue? #2778 I believe you said that it worked in the […]. I am hoping the […]
-
To train on a subset of data efficiently is a challenge. True random access will be limited to thousands of rows per second; if your rows are small (around 100 bytes each), this is going to be woefully slow. I have been thinking about this problem in the background for a while and am hoping to take a good look at the dataset loader soon. I am generally thinking something like this: […]
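In the meantime, one way to stay inside that random-access budget (a minimal sketch of my own, not the loader design referenced above, assuming the current pylance `take` API; the URI and column names are placeholders) is to sample row offsets, sort them, and fetch them with one batched `take`:

```python
import random

import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# One take() per row pays the full random-access cost every time.
# Sorting a sampled batch of offsets and issuing a single take()
# amortizes that cost across the whole batch.
num_rows = ds.count_rows()
sample = sorted(random.sample(range(num_rows), k=1024))
table = ds.take(sample, columns=["key", "jpg_512"])  # placeholder columns
```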
-
Hey @westonpace, here's some additional context and observations based on my experience with a large dataset:

My source dataset consists of 10M rows (~3TB), each relatively small, with columns (key, source, split, jpg_512). I've set up bitmap indices on […]. Originally, these were separate datasets in […].

Currently, for reading the data, I'm using a modified version of the torch dataset that retrieves data using lance/python/python/lance/sampler.py (line 256 at eb87bfa).

Re: #2778, this was about filtering with […].

Regarding filtering performance: currently, filtering by […]. I can try the […]. It seems like implementing filtering on […]
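To make the filtering discussion concrete, here is a minimal sketch of a filtered scan over a schema like the one above (the URI and filter values are hypothetical, and how much the bitmap index prunes is an assumption rather than a measured result):

```python
import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# Push the filter down into the scanner rather than filtering per
# fragment. With a bitmap index on a low-cardinality column like
# `split`, the filter may prune data instead of scanning every row.
scanner = ds.scanner(
    columns=["key", "jpg_512"],
    filter="split = 'train' AND source = 'laion'",  # hypothetical values
    batch_size=256,
)
for batch in scanner.to_batches():
    ...  # feed the batch to the training pipeline
```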
-
Another option seems like […]
-
Implemented a row_id based sampler: https://gist.github.com/tonyf/d512e26183d97eb4fbae9c0b6abe5072. Decent performance.
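I can't speak for the gist's exact contents, but a sampler along those lines would have roughly this shape (a minimal sketch using row offsets and `take`; the class and parameter names are my own):

```python
import random
from typing import Iterator

import lance
import pyarrow as pa
from torch.utils.data import IterableDataset


class RandomTakeSampler(IterableDataset):
    """Shuffle row offsets, then fetch each chunk with one sorted take()."""

    def __init__(self, uri: str, columns: list[str], batch_size: int = 256):
        self.ds = lance.dataset(uri)
        self.columns = columns
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[pa.Table]:
        offsets = list(range(self.ds.count_rows()))
        random.shuffle(offsets)
        for i in range(0, len(offsets), self.batch_size):
            # Sorting each chunk turns scattered point reads into a
            # mostly-sequential batched read.
            chunk = sorted(offsets[i : i + self.batch_size])
            yield self.ds.take(chunk, columns=self.columns)
```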
-
Wrote a new sampler using the scanner API that seems to max out the throughput available on my machine: https://gist.github.com/tonyf/3941c6176527c88c4f19446c45b17d9d. Can submit a PR for this.
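Again only a guess at the gist's internals, but a scanner-based sampler usually trades perfect shuffling for sequential I/O plus a small shuffle buffer; a sketch under those assumptions (the column names, filter, and buffer depth are placeholders):

```python
import random
from typing import Iterator

import lance
import pyarrow as pa


def shuffled_scan(uri: str, buffer_depth: int = 32) -> Iterator[pa.RecordBatch]:
    """Stream batches sequentially, shuffling within a small buffer."""
    ds = lance.dataset(uri)
    scanner = ds.scanner(
        columns=["key", "jpg_512"],  # columns from the schema above
        filter="split = 'train'",    # hypothetical filter
        batch_size=256,
        batch_readahead=16,          # overlap decode with I/O
    )
    buffer: list[pa.RecordBatch] = []
    for batch in scanner.to_batches():
        buffer.append(batch)
        if len(buffer) >= buffer_depth:
            random.shuffle(buffer)
            while buffer:
                yield buffer.pop()
    random.shuffle(buffer)
    yield from buffer  # drain the tail
```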
-
One of the big advantages of having a columnar data store for training datasets is the ability to keep one common dataset for all of your training data. However, at different stages of training it's common to want to train on only a subset of this data (e.g. some rows have properties that are useful in pretraining but not in fine-tuning or instruction tuning).

For the lance torch dataset to be useful here, `dataset.get_fragments` needs to be able to filter data. Doing this at the fragment level (`fragment.to_batches(filter=...)`) is inefficient when working on the order of millions of rows.

Any timeline for supporting `filter` in `dataset.get_fragments` (currently it's marked as `NotImplemented`)?

Thanks!
Tony
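P.S. For concreteness, the fragment-level pattern described above looks roughly like this (a minimal sketch; the URI and filter value are made up):

```python
import lance
import pyarrow.dataset as pads

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# Every fragment is opened and the filter is applied during the scan,
# so even a highly selective filter still pays for touching all fragments.
for fragment in ds.get_fragments():
    for batch in fragment.to_batches(filter=pads.field("split") == "train"):
        ...  # train on batch
```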