Best way to generate large Datasets #2363
I've done some trying around. What should work is to sort the long dataframe by `item_id`. Then we can do something like this:

```python
import numpy as np

# Boundaries between consecutive item_id values mark the start of each series.
idx = np.concatenate([[0], np.where(df["item_id"][:-1].values != df["item_id"][1:].values)[0] + 1, [len(df)]])
for start, stop in zip(idx, idx[1:]):
    slice_ = df.iloc[start:stop]
```

@caseytomlin Can you share a simplified example of your dataframe?
Thanks @jaheba - I don't immediately understand how to use your suggestion, but I will think about it. Please find a sample below. I was rather thinking one might parallelize the creation of multiple datasets.
Thanks @caseytomlin for sharing the sample. What actually takes a lot of time is checking that the timestamp column/index is correct, i.e. that the delta between every two consecutive timestamps equals the frequency. In the sample you've shared this appears to be the case, so removing these checks would speed up the process by a lot. Maybe we could add an option to skip them.
I think it makes sense. What kind of speedup are we talking about?
I don't think this would solve the memory problem, though. One thing we could do is have an iterative version, which one can then serialise using Arrow or something similar.
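As a rough illustration of that idea (not GluonTS code; the entry layout with `item_id`/`start`/`target` and the file path are assumptions), an iterator of per-series dictionaries could be streamed to an Arrow file so that the full dataset never has to sit in memory:

```python
import pyarrow as pa

# Assumed entry layout: {"item_id": str, "start": str, "target": list of floats}.
schema = pa.schema([
    ("item_id", pa.string()),
    ("start", pa.string()),
    ("target", pa.list_(pa.float32())),
])

def write_entries(entries, path):
    """Stream an iterator of entry dicts to an Arrow IPC file, one record batch per entry."""
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            for entry in entries:
                batch = pa.record_batch(
                    [[entry["item_id"]], [str(entry["start"])], [list(entry["target"])]],
                    schema=schema,
                )
                writer.write_batch(batch)
```

Reading the file back with `pa.ipc.open_file(path)` then gives random access to individual record batches without loading everything at once.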
Yes, eventually that's the way to go for these use cases, and what @caseytomlin was hinting at.
Here is my nightmare scenario: there is a huge parquet file in long format which is sorted by date, so all values for a given timestamp are grouped on top of each other. One solution would be to load the entire file into memory, which would simplify things; however, if the file is too big, things get worse. The above sample of 25k time series is around 140 MB when loaded into a table. If we want to extract all time series, we would need to iterate the entire file once per time series, which sounds very slow.

For the provided sample things should be much simpler, since the data is already ordered. This code runs in ~7.5 s on my laptop:

```python
for name, group in sample.groupby("item_id"):
    idx = pd.PeriodIndex(group["timestamp"], freq="M")
    is_uniform(idx)
```

Scaled up to 3 million time series, it would run in roughly 15 minutes.
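The `is_uniform` helper called above isn't defined in the comment; a minimal sketch of such a check, assuming it only verifies that the periods form one contiguous, regularly spaced range, might look like this:

```python
import pandas as pd

def is_uniform(index: pd.PeriodIndex) -> bool:
    # Uniform means the index equals the contiguous range starting at its first period.
    expected = pd.period_range(start=index[0], periods=len(index), freq=index.freq)
    return bool((index == expected).all())
```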
Thanks @jaheba and @lostella for the attention so far - it looks like the proposed …

Once the dataset/metadata is created, I suppose it's not critical for it to fit in memory if one can properly overwrite the …
In #2377 I'm basically introducing iterable versions for long datasets. What takes time is generating dictionaries for each slice. Iterating over the groups is fairly fast, but just calling … What still takes time is doing the timestamp checks. We can speed up creating periods by using …
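One way to make period creation cheaper (a sketch of the general idea, not necessarily what #2377 does) is to parse only the first timestamp of each group and generate the rest of the index arithmetically, instead of converting every timestamp string:

```python
import pandas as pd

def fast_period_index(first_timestamp, length, freq="M"):
    # Parse a single timestamp; the remaining periods follow by construction,
    # which also sidesteps the per-row uniformity check.
    start = pd.Period(first_timestamp, freq=freq)
    return pd.period_range(start=start, periods=length, freq=freq)
```

For the sample above this would be called as `fast_period_index(group["timestamp"].iloc[0], len(group), "M")`; it of course assumes the data is already known to be regular.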
I've added a polars-based `LongDataset` prototype. This:

```python
from gluonts.dataset.polars import LongDataset

ds = LongDataset(
    df,
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
    assume_sorted=True,
    use_partition=True,
    unchecked=True,
)
```

runs in under a second on my machine, but does not save on memory. Adding `translate={"static_cat": [f"static_cat_{i}" for i in range(8)]},` increases the runtime to ~3.8 s.
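For reference, the underlying trick is presumably cheap partitioning of the long frame; a plain-polars sketch of producing GluonTS-style entries (the column names `item_id`, `timestamp`, `target` are assumptions about the sample, and this is not the `LongDataset` implementation) could be:

```python
import polars as pl

def iter_entries(df: pl.DataFrame):
    # partition_by splits the frame into one sub-frame per item_id without a Python-level loop.
    for group in df.partition_by("item_id", maintain_order=True):
        yield {
            "item_id": group["item_id"][0],
            "start": str(group["timestamp"][0]),
            "target": group["target"].to_numpy(),
        }
```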
@caseytomlin #2435 speeds up dataset construction quite significantly, and it's released as part of 0.11.2. Do you think that works for your use case? There will be other improvements in the future, like #2441, or using …
I have a long pandas DataFrame with a very large number of `item_id`s (~3 million) - is there a recommended way to process it into a `Dataset` that can be passed to models? `PandasDataset.from_long_dataframe` works, but is rather compute-intensive (~60 minutes + ~70 GB RAM). It can be sped up quite a bit with multiprocessing, but with the new arrow functionality (and issues like this one) I feel I am missing a simpler solution that is perhaps just not yet extensively documented.
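For completeness, this is roughly the call being discussed (a sketch only; the column names and the parquet path are placeholders, and argument names may vary slightly between GluonTS versions):

```python
import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# Hypothetical long frame with one row per (item_id, timestamp) pair.
df = pd.read_parquet("long_data.parquet")

ds = PandasDataset.from_long_dataframe(
    df,
    item_id="item_id",
    timestamp="timestamp",
    target="target",
    freq="M",
)
```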