-
Are you asking about this issue? #2778 I believe you said that it worked in the […]. I am hoping the […]
-
To train on a subset of data efficiently is a challenge. True random access will be limited to thousands of rows per second; if your rows are small (around 100 bytes each), this is going to be woefully slow. I have been thinking about this problem in the background for a while and am hoping to take a good look at the dataset loader soon. I am generally thinking something like this: […]
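In the meantime, one way to stay inside that random-access budget (a minimal sketch of my own, not the loader design referenced above, assuming the current pylance `take` API; the URI and column names are placeholders) is to sample row offsets, sort them, and fetch them with one batched `take`:

```python
import random

import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# One take() per row pays the full random-access cost every time.
# Sorting a sampled batch of offsets and issuing a single take()
# amortizes that cost across the whole batch.
num_rows = ds.count_rows()
sample = sorted(random.sample(range(num_rows), k=1024))
table = ds.take(sample, columns=["key", "jpg_512"])  # placeholder columns
```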
-
Hey @westonpace, here's some additional context and observations based on my experience with a large dataset:

My source dataset consists of 10M rows (~3TB), each relatively small, with columns (key, source, split, jpg_512). I've set up bitmap indices on […]. Originally, these were separate datasets in […].

Currently, for reading the data, I'm using a modified version of the torch dataset that retrieves data using lance/python/python/lance/sampler.py (line 256 at eb87bfa).

Re: #2778, this was about filtering with […].

Regarding filtering performance: currently, filtering by […]. I can try the […]. It seems like implementing filtering on […]
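To make the filtering discussion concrete, here is a minimal sketch of a filtered scan over a schema like the one above (the URI and filter values are hypothetical, and how much the bitmap index prunes is an assumption rather than a measured result):

```python
import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# Push the filter down into the scanner rather than filtering per
# fragment. With a bitmap index on a low-cardinality column like
# `split`, the filter may prune data instead of scanning every row.
scanner = ds.scanner(
    columns=["key", "jpg_512"],
    filter="split = 'train' AND source = 'laion'",  # hypothetical values
    batch_size=256,
)
for batch in scanner.to_batches():
    ...  # feed the batch to the training pipeline
```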
-
Another option seems like […]
-
Implemented a row_id based sampler: https://gist.github.com/tonyf/d512e26183d97eb4fbae9c0b6abe5072. Decent performance.
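I can't speak for the gist's exact contents, but a sampler along those lines would have roughly this shape (a minimal sketch using row offsets and `take`; the class and parameter names are my own):

```python
import random
from typing import Iterator

import lance
import pyarrow as pa
from torch.utils.data import IterableDataset


class RandomTakeSampler(IterableDataset):
    """Shuffle row offsets, then fetch each chunk with one sorted take()."""

    def __init__(self, uri: str, columns: list[str], batch_size: int = 256):
        self.ds = lance.dataset(uri)
        self.columns = columns
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[pa.Table]:
        offsets = list(range(self.ds.count_rows()))
        random.shuffle(offsets)
        for i in range(0, len(offsets), self.batch_size):
            # Sorting each chunk turns scattered point reads into a
            # mostly-sequential batched read.
            chunk = sorted(offsets[i : i + self.batch_size])
            yield self.ds.take(chunk, columns=self.columns)
```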
-
Wrote a new sampler using the scanner API that seems to max out the throughput available on my machine: https://gist.github.com/tonyf/3941c6176527c88c4f19446c45b17d9d. Can submit a PR for this.
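Again only a guess at the gist's internals, but a scanner-based sampler usually trades perfect shuffling for sequential I/O plus a small shuffle buffer; a sketch under those assumptions (the column names, filter, and buffer depth are placeholders):

```python
import random
from typing import Iterator

import lance
import pyarrow as pa


def shuffled_scan(uri: str, buffer_depth: int = 32) -> Iterator[pa.RecordBatch]:
    """Stream batches sequentially, shuffling within a small buffer."""
    ds = lance.dataset(uri)
    scanner = ds.scanner(
        columns=["key", "jpg_512"],  # columns from the schema above
        filter="split = 'train'",    # hypothetical filter
        batch_size=256,
        batch_readahead=16,          # overlap decode with I/O
    )
    buffer: list[pa.RecordBatch] = []
    for batch in scanner.to_batches():
        buffer.append(batch)
        if len(buffer) >= buffer_depth:
            random.shuffle(buffer)
            while buffer:
                yield buffer.pop()
    random.shuffle(buffer)
    yield from buffer  # drain the tail
```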
-
One of the big advantages of having a columnar data store for training datasets is the ability to keep one common dataset for all of your training data. However, at different stages of training it's common to want to train on only a subset of this data (e.g. some rows have properties that are useful in pretraining but not in fine-tuning or instruction tuning).

For the lance torch dataset to be useful here, `dataset.get_fragments` needs to be able to filter data. Doing this at the fragment level (`fragment.to_batches(filter=...)`) is inefficient when working on the order of millions of rows.

Any timeline for supporting `filter` in `dataset.get_fragments` (currently it's marked as `NotImplemented`)?

Thanks!
Tony
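P.S. For concreteness, the fragment-level pattern described above looks roughly like this (a minimal sketch; the URI and filter value are made up):

```python
import lance
import pyarrow.dataset as pads

ds = lance.dataset("s3://bucket/my_dataset.lance")  # placeholder URI

# Every fragment is opened and the filter is applied during the scan,
# so even a highly selective filter still pays for touching all fragments.
for fragment in ds.get_fragments():
    for batch in fragment.to_batches(filter=pads.field("split") == "train"):
        ...  # train on batch
```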