Best way to generate large Datasets #2363

Closed
caseytomlin opened this issue Oct 11, 2022 · 13 comments

Comments

@caseytomlin

I have a long pandas DataFrame with a very large number of item_ids (~3 million) - is there a recommended way to process it into a Dataset that can be passed to models?

PandasDataset.from_long_dataframe works, but is rather compute-intensive (~60 minutes and ~70 GB of RAM). It can be sped up quite a bit with multiprocessing, but given the new arrow functionality (and issues like this one), I feel I am missing a simpler solution that is perhaps just not yet extensively documented.
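
For reference, the baseline call being described would look roughly like this (a sketch; the column names and monthly frequency are assumptions, and the exact signature may differ between gluonts versions):

from gluonts.dataset.pandas import PandasDataset

# df is the long dataframe: one row per (item_id, timestamp) observation.
dataset = PandasDataset.from_long_dataframe(
    df,
    item_id="item_id",       # column identifying each of the ~3M series
    timestamp="timestamp",   # column holding the observation time
    freq="M",                # assumed monthly frequency
)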

@jaheba
Contributor

jaheba commented Oct 12, 2022

I've done some experimenting.

What should work is to sort the long dataframe by item_id and index.

Then we can do something like this:

import numpy as np

# Find the boundaries between consecutive item_ids in the sorted dataframe.
idx = np.concatenate([[0], np.where(df["item_id"][:-1].values != df["item_id"][1:].values)[0] + 1, [len(df)]])

for start, stop in zip(idx, idx[1:]):
    slice_ = df.iloc[start:stop]
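
Each slice could then be turned into a dataset entry; a minimal sketch (the target column name and the start/target dict layout are assumptions, not code from this thread):

import numpy as np
import pandas as pd

def to_entry(slice_: pd.DataFrame, freq: str = "M") -> dict:
    # One slice holds all rows of a single item_id, already sorted by timestamp.
    return {
        "start": pd.Period(slice_["timestamp"].iloc[0], freq=freq),
        "target": slice_["target"].to_numpy(dtype=np.float32),
        "item_id": slice_["item_id"].iloc[0],
    }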

@caseytomlin Can you share a simplified example of your dataframe?

@caseytomlin
Author

Thanks @jaheba - I don't immediately understand how to use your suggestion, but I will think about it. Please find a sample below.

I was rather thinking one might parallelize the creation of multiple PandasDatasets, save them to arrow/parquet, and then load them again as a single dataset using the new arrow support, but that seems a bit convoluted, and again it feels like I'm missing something.

long_df_sample.zip

@jaheba
Contributor

jaheba commented Oct 14, 2022

Thanks @caseytomlin for sharing the sample.

What actually takes a lot of time is checking that the timestamp column/index is correct (i.e. that the delta between every two consecutive timestamps equals the frequency). In the sample you've shared this appears to be the case, so removing these checks would speed up the process by a lot.

Maybe we can add an unchecked flag or something similar, which would allow skipping these checks.
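
For context, the check in question boils down to something like the following (an illustrative sketch, not gluonts' actual implementation):

import numpy as np
import pandas as pd

def index_is_uniform(timestamps: pd.Series, freq: str) -> bool:
    # A uniform index means consecutive period ordinals increase by exactly 1,
    # i.e. there are no gaps or duplicates at the given frequency.
    ordinals = pd.PeriodIndex(timestamps, freq=freq).asi8
    return bool(np.all(np.diff(ordinals) == 1))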

@lostella
Contributor

Maybe we can add an unchecked flag or something similar, which would allow skipping these checks.

I think it makes sense. What kind of speedup are we talking about?

@jaheba
Contributor

jaheba commented Oct 14, 2022

Running on the provided sample data:

[Screenshot: timing results on the provided sample data]

@jaheba
Contributor

jaheba commented Oct 14, 2022

I don't think this would change the memory-usage problem, though.

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

@lostella
Contributor

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

Yes, eventually that's the way to go for these use cases, and what @caseytomlin was hinting at.
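
One way to picture the iterate-then-serialise idea, sketched here with plain pyarrow rather than gluonts' own Arrow support (the schema and field names are assumptions):

import pyarrow as pa

SCHEMA = pa.schema([
    ("item_id", pa.string()),
    ("start", pa.string()),
    ("target", pa.list_(pa.float32())),
])

def write_entries(entries, path, batch_size=1024):
    # entries: a lazy iterator of dicts with "item_id", "start" and "target"
    # keys, so the full dataset never has to sit in memory at once.
    with pa.OSFile(path, "wb") as sink, pa.ipc.new_file(sink, SCHEMA) as writer:
        batch = []
        for entry in entries:
            batch.append(entry)
            if len(batch) == batch_size:
                writer.write_batch(pa.RecordBatch.from_pylist(batch, schema=SCHEMA))
                batch = []
        if batch:
            writer.write_batch(pa.RecordBatch.from_pylist(batch, schema=SCHEMA))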

@jaheba
Contributor

jaheba commented Oct 14, 2022

Here is my nightmare scenario:

There is a huge parquet file in long format, which is sorted by date: all values for a given timestamp are grouped on top of each other.

One solution would be to load the entire file into memory, which would simplify things. However, if the file is too big, that is no longer an option. The above sample of 25k time series is around 140 MB when loaded into a table; multiplied by ~120 (to reach 3 million time series), we are looking at a base memory consumption of more than 16 GB.

If we want to extract all time series, we would need to iterate over the entire file once per time series, which sounds very slow.

In the case of the provided sample, though, things are much simpler, since the data is already ordered.

For the provided sample, this code runs in ~7.5s on my laptop:

import pandas as pd

# is_uniform checks that the PeriodIndex has no gaps at the given frequency.
for name, group in sample.groupby("item_id"):
    idx = pd.PeriodIndex(group["timestamp"], freq="M")
    is_uniform(idx)

Scaled up to 3MM time series, it would run in roughly 15 minutes.

@caseytomlin
Author

Thanks @jaheba and @lostella for the attention so far - it looks like the proposed check_index would help quite a bit already.

I don't think this would change the memory-usage problem, though.

One thing we could do is have an iterative version, which could then be serialised using Arrow or something similar.

Once the dataset/metadata is created, I suppose it's not critical for it to fit in memory if one can properly override the __iter__ method (similar to PyTorch's IterableDataset usage) - or am I way off here?
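
A bare-bones version of what that could look like (a sketch with assumed column names, not gluonts' actual API):

import pandas as pd

class LazyLongDataset:
    """Yields one entry per item_id from a long dataframe, lazily."""

    def __init__(self, df: pd.DataFrame, freq: str = "M"):
        self.df = df
        self.freq = freq

    def __len__(self):
        return self.df["item_id"].nunique()

    def __iter__(self):
        # groupby iterates the groups without materialising all entries up front.
        for item_id, group in self.df.groupby("item_id", sort=False):
            yield {
                "start": pd.Period(group["timestamp"].iloc[0], freq=self.freq),
                "target": group["target"].to_numpy("float32"),
                "item_id": item_id,
            }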

@jaheba
Contributor

jaheba commented Oct 18, 2022

In #2377 I'm basically introducing iterable versions for long datasets.

What takes time is generating the dictionaries for each slice. Iterating over the groups is fairly fast, but just calling to_dict on them takes ~11s using pandas and ~9s using polars on my machine. Polars can be a lot faster using partition_by instead of groupby, but that allocates additional memory, since it returns the partitions as a Python list and may well produce copies (that might also be true for groupby in general, because it can't select ranges).

What still takes time is the timestamp checks. We can speed up creating periods by using lru_cache, but I'm not sure that would also work for period_ranges.
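
The lru_cache idea could look roughly like this (illustrative only, not the code in #2377):

from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=None)
def to_period(timestamp: str, freq: str) -> pd.Period:
    # Many series share the same start timestamp, so caching the
    # string-to-Period conversion avoids parsing it repeatedly.
    return pd.Period(timestamp, freq=freq)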

@jaheba
Contributor

jaheba commented Oct 18, 2022

I've added use_partition and unchecked, which makes this much faster. unchecked assumes that the index is correct and just takes the first value and turns it into a period.

This:

from gluonts.dataset.polars import LongDataset

ds = LongDataset(
    df,
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
    assume_sorted=True,
    use_partition=True,
    unchecked=True,
)

runs in under a second on my machine, but does not save on memory.

Adding

    translate={"static_cat": [f"static_cat_{i}" for i in range(8)]},

increases the runtime to ~3.8s.

@lostella
Contributor

@caseytomlin #2435 speeds up dataset construction quite significantly, and it's released as part of 0.11.2. Do you think that works for your use case?

There will be other improvements in the future, like #2441, or using polars as proposed by @jaheba in #2377, and the schema story in general.

@caseytomlin
Author

@lostella yes! many thanks to you and @jaheba for the quick engagement and effort!
