Looking for a way to shuffle sampling from multiple time series #2919

yuvalarbel · 2023-06-13T16:28:19Z

Hi!
I have a dataset with multiple time series of varying lengths and start times.

I am using an InstanceSplitter and InstanceSampler to get sliding windows from these time series.
I then group these windows into batches for my train/val/test sets.
The problem is that the batches are not shuffled, i.e., all of the first batches contain samples from the first time series only. This is problematic for training.
I wanted to find a way to shuffle the sampling across time series.

I created a new ShuffledInstanceSplitter to do this (code below).

Is there a better way to do this, built into GluonTS?
There are several very ugly parts of this implementation:
a. I'm saving the time series in a list so that I can access them by index, because the PandasDataset exposes them through a starmap iterator
b. I am permuting the access indices, which means I am holding an array of all possible sample indices in memory
c. The flatmap_transform method of the InstanceSplitter class is neither called nor overridden. I didn't change its logic at all; I just needed to call its inner loop body separately, so I copied it into a helper.

Anyway, I hope this helps someone:

from typing import Iterator, Iterable
import numpy as np

from gluonts.transform import InstanceSplitter
from gluonts.dataset.common import DataEntry


class ShuffledInstanceSplitter(InstanceSplitter):
    def __call__(
        self, data_it: Iterable[DataEntry], is_train: bool
    ) -> Iterator[DataEntry]:
        # Materialize every entry so it can be looked up by position later.
        data_lst = []
        ts_indices_to_sample = []
        ts_ids = []

        for i, data_entry in enumerate(data_it):
            data = data_entry.copy()
            data_lst.append(data)

            target = data[self.target_field]

            # Candidate split points for this series, tagged with its position.
            sampled_indices = self.instance_sampler(target)
            ts_indices_to_sample.append(sampled_indices)
            ts_ids.append(np.full(len(sampled_indices), i))

        # Nothing to yield for an empty dataset (np.concatenate would raise).
        if not data_lst:
            return

        # Rows are (series position, split index); permutation shuffles rows.
        shuffled_sample_idxs = np.random.permutation(
            np.stack([np.concatenate(ts_ids),
                      np.concatenate(ts_indices_to_sample)]).T)

        del ts_indices_to_sample
        del ts_ids

        for time_series_index, ts_sample_idx in shuffled_sample_idxs:
            data = data_lst[time_series_index]
            yield self._flatmap_transform_helper(data, ts_sample_idx)

    def _flatmap_transform_helper(self, data: DataEntry, i):
        # Body of the loop in InstanceSplitter.flatmap_transform, applied to a
        # single split index i.
        pl = self.future_length
        lt = self.lead_time
        slice_cols = self.ts_fields + [self.target_field]

        pad_length = max(self.past_length - i, 0)
        d = data.copy()
        for ts_field in slice_cols:
            if i > self.past_length:
                # truncate to past_length
                past_piece = d[ts_field][..., i - self.past_length: i]
            elif i < self.past_length:
                pad_block = (
                    np.ones(
                        d[ts_field].shape[:-1] + (pad_length,),
                        dtype=d[ts_field].dtype,
                    )
                    * self.dummy_value
                )
                past_piece = np.concatenate(
                    [pad_block, d[ts_field][..., :i]], axis=-1
                )
            else:
                past_piece = d[ts_field][..., :i]
            d[self._past(ts_field)] = past_piece
            d[self._future(ts_field)] = d[ts_field][
                ..., i + lt: i + lt + pl
            ]
            del d[ts_field]
        pad_indicator = np.zeros(self.past_length, dtype=data[self.target_field].dtype)
        if pad_length > 0:
            pad_indicator[:pad_length] = 1

        if self.output_NTC:
            for ts_field in slice_cols:
                d[self._past(ts_field)] = d[
                    self._past(ts_field)
                ].transpose()
                d[self._future(ts_field)] = d[
                    self._future(ts_field)
                ].transpose()

        d[self._past(self.is_pad_field)] = pad_indicator
        d[self.forecast_start_field] = d[self.start_field] + i + lt
        return d
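
Regarding point (b) above, one possible way to hold less in memory is to permute a single flat range of positions and decode each position back into a series with cumulative counts, instead of stacking a two-column array of pairs. The sketch below is plain NumPy and is not part of GluonTS; `shuffled_pairs` and `counts` are hypothetical names, and it assumes you know how many candidate windows each series has. It still allocates one array with an entry per window, so it reduces the footprint rather than removing it.

```python
import numpy as np


def shuffled_pairs(counts, rng):
    """Yield (series_id, local_position) pairs in uniformly random order.

    counts[k] is the number of candidate windows in series k (hypothetical
    input; in the class above it would be len(sampled_indices) per series).
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.size == 0:
        return
    # ends[k] = total number of windows in series 0..k
    ends = np.cumsum(counts)
    total = int(ends[-1])
    for flat in rng.permutation(total):
        # First series whose cumulative end exceeds the flat position.
        series_id = int(np.searchsorted(ends, flat, side="right"))
        start = int(ends[series_id - 1]) if series_id > 0 else 0
        yield series_id, int(flat) - start


rng = np.random.default_rng(0)
pairs = list(shuffled_pairs([3, 0, 2], rng))
```

Here `local_position` is a position within one series' array of sampled indices, not a time index. If the sampler returns a contiguous range of split points, the time index would be the first split point plus `local_position`; otherwise the per-series index arrays still need to be kept.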

abdulfatir · 2023-06-13T16:36:33Z

Collaborator

Have you tried setting the shuffle buffer length:

gluonts/src/gluonts/torch/model/deepar/estimator.py

Lines 339 to 345 in 1cc17c1

    
    def create_training_data_loader(
        self,
        data: Dataset,
        module: DeepARLightningModule,
        shuffle_buffer_length: Optional[int] = None,
        **kwargs,
    ) -> Iterable:

6 replies

yuvalarbel Jun 13, 2023
Author

Yep, I forgot to mention: no pseudo-shuffling is good enough for my case (for example, using the PseudoShuffle transform). I have ~6K time series with ~6K sample points each, so even a very big (inefficient) buffer won't shuffle well enough, i.e., the batches offered to my training epoch won't be diverse enough.

abdulfatir Jun 13, 2023
Collaborator

What buffer sizes did you try?

yuvalarbel Jun 14, 2023
Author

1,000, 10,000, and 100,000. Is this not enough? Raising the buffer to a million would hold a huge amount of data in memory and still wouldn't sample uniformly (or fairly uniformly) from the different time series. But maybe I'm mistaken.

abdulfatir Jun 19, 2023
Collaborator

For the number of time series that you mentioned, 10K/100K should be enough. What problems did you face with 100K?

yuvalarbel Jun 22, 2023
Author

For 10K: most of the examples in the first batches were instances from only the first few time series.
For 100K: improved, but the effect was still noticeable. And this was already quite a big buffer to hold in RAM.
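
To make the effect described in this thread concrete, here is a minimal sketch of a generic shuffle buffer fed by a stream that emits every window of one series before moving to the next, as the original post describes. It is written for illustration only and is not GluonTS's actual implementation; `buffer_shuffle` and `ordered_windows` are hypothetical names, and the sizes are toy values. By the same counting, with roughly 6K windows per series, a 100K buffer initially holds windows from only about 17 series, and a 10K buffer from about 2.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def buffer_shuffle(items: Iterable, buffer_size: int, rng: random.Random) -> Iterator:
    """Generic shuffle buffer: fill up to buffer_size, then emit random picks."""
    buffer: List = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and pop it (O(1) removal).
            pick = rng.randrange(len(buffer))
            buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
            yield buffer.pop()
    # Drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def ordered_windows(n_series: int, windows_per_series: int) -> Iterator[Tuple[int, int]]:
    """Emit every window of series 0, then every window of series 1, and so on."""
    for series_id in range(n_series):
        for window in range(windows_per_series):
            yield series_id, window


n_series, windows_per_series = 50, 40  # toy sizes, not the thread's ~6K x ~6K
buffer_size, batch_size = 100, 32

stream = buffer_shuffle(
    ordered_windows(n_series, windows_per_series), buffer_size, random.Random(0)
)
first_batch = [next(stream) for _ in range(batch_size)]
series_in_first_batch = sorted({series_id for series_id, _ in first_batch})

# The j-th emitted item (0-based) can only come from the first
# buffer_size + j inputs, so the first batch is confined to the earliest series.
max_reachable = (buffer_size + batch_size - 2) // windows_per_series
```

In this toy run the first batch can only contain series 0 through `max_reachable`, out of 50. The buffer mixes windows locally, but it cannot reach series that the stream has not produced yet, which is why a full permutation of (series, window) pairs behaves differently.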
