Best way to generate large Datasets #2363

Closed
caseytomlin opened this issue Oct 11, 2022 · 13 comments

Comments

@caseytomlin

I have a long pandas DataFrame with a very large number of item_ids (~3 million) - is there a recommended way to process it into a Dataset that can be passed to models?

PandasDataset.from_long_dataframe works, but is rather compute-intensive (~60 minutes and ~70 GB of RAM). It can be sped up quite a bit with multiprocessing, but given the new arrow functionality (and issues like this one), I feel I am missing a simpler solution that is perhaps just not yet extensively documented.
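
For reference, the baseline call being described would look roughly like this (a sketch; the column names and monthly frequency are assumptions, and the exact signature may differ between gluonts versions):

from gluonts.dataset.pandas import PandasDataset

# df is the long dataframe: one row per (item_id, timestamp) observation.
dataset = PandasDataset.from_long_dataframe(
    df,
    item_id="item_id",       # column identifying each of the ~3M series
    timestamp="timestamp",   # column holding the observation time
    freq="M",                # assumed monthly frequency
)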

@jaheba
Contributor

jaheba commented Oct 12, 2022

I've done some experimenting.

What should work is to sort the long dataframe by item_id and index.

Then we can do something like this:

import numpy as np

# Find the boundaries between consecutive item_ids in the sorted dataframe.
idx = np.concatenate([[0], np.where(df["item_id"][:-1].values != df["item_id"][1:].values)[0] + 1, [len(df)]])

for start, stop in zip(idx, idx[1:]):
    slice_ = df.iloc[start:stop]
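
Each slice could then be turned into a dataset entry; a minimal sketch (the target column name and the start/target dict layout are assumptions, not code from this thread):

import numpy as np
import pandas as pd

def to_entry(slice_: pd.DataFrame, freq: str = "M") -> dict:
    # One slice holds all rows of a single item_id, already sorted by timestamp.
    return {
        "start": pd.Period(slice_["timestamp"].iloc[0], freq=freq),
        "target": slice_["target"].to_numpy(dtype=np.float32),
        "item_id": slice_["item_id"].iloc[0],
    }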

@caseytomlin Can you share a simplified example of your dataframe?

@caseytomlin
Author

Thanks @jaheba - I don't immediately understand how to use your suggestion, but I will think about it. Please find a sample below.

I was rather thinking one might parallelize the creation of multiple PandasDatasets, save them to arrow/parquet, and then load them again as a single dataset using the new arrow support, but that seems a bit convoluted, and again it feels like I'm missing something.

long_df_sample.zip

@jaheba
Contributor

jaheba commented Oct 14, 2022

Thanks @caseytomlin for sharing the sample.

What actually takes a lot of time is checking that the timestamp column/index is correct (i.e. that the delta between every two consecutive timestamps equals the frequency). In the sample you've shared this appears to be the case, so removing these checks would speed up the process by a lot.

Maybe we can add an unchecked flag or something similar, which would allow skipping these checks.
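
For context, the check in question boils down to something like the following (an illustrative sketch, not gluonts' actual implementation):

import numpy as np
import pandas as pd

def index_is_uniform(timestamps: pd.Series, freq: str) -> bool:
    # A uniform index means consecutive period ordinals increase by exactly 1,
    # i.e. there are no gaps or duplicates at the given frequency.
    ordinals = pd.PeriodIndex(timestamps, freq=freq).asi8
    return bool(np.all(np.diff(ordinals) == 1))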

@lostella
Contributor

Maybe we can add an unchecked flag or something similar, which would allow skipping these checks.

I think it makes sense. What kind of speedup are we talking about?

@jaheba
Contributor

jaheba commented Oct 14, 2022

Running on the provided sample data:

[Screenshot: timing results on the provided sample data]

@jaheba
Contributor

jaheba commented Oct 14, 2022

I don't think this would change the memory-usage problem, though.

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

@lostella
Contributor

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

Yes, eventually that's the way to go for these use cases, and what @caseytomlin was hinting at.
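
One way to picture the iterate-then-serialise idea, sketched here with plain pyarrow rather than gluonts' own Arrow support (the schema and field names are assumptions):

import pyarrow as pa

SCHEMA = pa.schema([
    ("item_id", pa.string()),
    ("start", pa.string()),
    ("target", pa.list_(pa.float32())),
])

def write_entries(entries, path, batch_size=1024):
    # entries: a lazy iterator of dicts with "item_id", "start" and "target"
    # keys, so the full dataset never has to sit in memory at once.
    with pa.OSFile(path, "wb") as sink, pa.ipc.new_file(sink, SCHEMA) as writer:
        batch = []
        for entry in entries:
            batch.append(entry)
            if len(batch) == batch_size:
                writer.write_batch(pa.RecordBatch.from_pylist(batch, schema=SCHEMA))
                batch = []
        if batch:
            writer.write_batch(pa.RecordBatch.from_pylist(batch, schema=SCHEMA))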

@jaheba
Contributor

jaheba commented Oct 14, 2022

Here is my nightmare scenario:

There is a huge parquet file in long format, which is sorted by date: all values for a given timestamp are grouped on top of each other.

One solution would be to load the entire file into memory, which would simplify things. However, if the file is too big, that is no longer an option. The above sample of 25k time series is around 140 MB when loaded into a table; multiplied by ~120 (to reach 3 million time series), we are looking at a base memory consumption of more than 16 GB.

If we want to extract all time series, we would need to iterate over the entire file once per time series, which sounds very slow.

In the case of the provided sample, though, things are much simpler, since the data is already ordered.

For the provided sample, this code runs in ~7.5s on my laptop:

import pandas as pd

# is_uniform checks that the PeriodIndex has no gaps at the given frequency.
for name, group in sample.groupby("item_id"):
    idx = pd.PeriodIndex(group["timestamp"], freq="M")
    is_uniform(idx)

Scaled up to 3MM time series, it would run in roughly 15 minutes.

@caseytomlin
Author

Thanks @jaheba and @lostella for the attention so far - it looks like the proposed check_index would help quite a bit already.

I don't think this would change the memory-usage problem, though.

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

Once the dataset/metadata is created, I suppose it's not critical for it to fit in memory if one can properly override the __iter__ method (similar to PyTorch's IterableDataset usage) - or am I way off here?
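
A bare-bones version of what that could look like (a sketch with assumed column names, not gluonts' actual API):

import pandas as pd

class LazyLongDataset:
    """Yields one entry per item_id from a long dataframe, lazily."""

    def __init__(self, df: pd.DataFrame, freq: str = "M"):
        self.df = df
        self.freq = freq

    def __len__(self):
        return self.df["item_id"].nunique()

    def __iter__(self):
        # groupby iterates the groups without materialising all entries up front.
        for item_id, group in self.df.groupby("item_id", sort=False):
            yield {
                "start": pd.Period(group["timestamp"].iloc[0], freq=self.freq),
                "target": group["target"].to_numpy("float32"),
                "item_id": item_id,
            }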

@jaheba
Contributor

jaheba commented Oct 18, 2022

In #2377 I'm basically introducing iterable versions for long datasets.

What takes time is generating the dictionaries for each slice. Iterating over the groups is fairly fast, but just calling to_dict on them takes ~11s using pandas and ~9s using polars on my machine. Polars can be a lot faster using partition_by instead of groupby, but that allocates additional memory, since it returns the partitions as a Python list and may well produce copies (that might also be true for groupby in general, because it can't select ranges).

What still takes time is the timestamp checks. We can speed up creating periods by using lru_cache, but I'm not sure that would also work for period_ranges.
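
The lru_cache idea could look roughly like this (illustrative only, not the code in #2377):

from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=None)
def to_period(timestamp: str, freq: str) -> pd.Period:
    # Many series share the same start timestamp, so caching the
    # string-to-Period conversion avoids parsing it repeatedly.
    return pd.Period(timestamp, freq=freq)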

@jaheba
Contributor

jaheba commented Oct 18, 2022

I've added use_partition and unchecked, which makes this much faster. unchecked assumes that the index is correct and just takes the first value and turns it into a period.

This:

from gluonts.dataset.polars import LongDataset

ds = LongDataset(
    df,
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
    assume_sorted=True,
    use_partition=True,
    unchecked=True,
)

runs in under a second on my machine, but does not save on memory.

Adding

    translate={"static_cat": [f"static_cat_{i}" for i in range(8)]},

increases the runtime to ~3.8s.

@lostella
Contributor

@caseytomlin #2435 speeds up dataset construction quite significantly, and it's released as part of 0.11.2. Do you think that works for your use case?

There will be other improvements in the future, like #2441, or using polars as proposed by @jaheba in #2377, and the schema story in general.

@caseytomlin
Author

@lostella yes! many thanks to you and @jaheba for the quick engagement and effort!
