StreamingDataset state dict not correct #19122

Closed
awaelchli opened this issue Dec 7, 2023 · 0 comments · Fixed by #19123
Labels
bug (Something isn't working), data (external) (litdata package), ver: 2.2.x

Comments

@awaelchli
Contributor

Bug description

The state dict returned by the streaming dataset always reports index 0, so it does not reflect the actual state of the dataset. The following code sample iterates over the streaming dataset for 60 batches and sleeps 2 seconds after each one to simulate enough time passing before the state dict is printed.

What version are you seeing the problem on?

master

How to reproduce the bug

import torch
import time
from lightning.data import StreamingDataset
from lightning.data.streaming.item_loader import TokensLoader
from torch.utils.data import DataLoader
from lightning.fabric import Fabric


def main():
    fabric = Fabric(accelerator="cuda", devices=4)
    fabric.launch()

    dataset = StreamingDataset(
        input_dir="lit-gpt/data/slimpajama/val",
        item_loader=TokensLoader(block_size=2048),
        shuffle=True,
        drop_last=True,
    )
    dataloader = DataLoader(dataset, batch_size=4, pin_memory=True, num_workers=8, drop_last=True)
    dataloader = fabric.setup_dataloaders(dataloader)
    iterator = iter(dataloader)

    for i in range(60):
        next(iterator)
        time.sleep(2)
        fabric.print(i)

    print(dataset.state_dict())


if __name__ == "__main__":
    main()

Error messages and logs

No errors. The output I get is:

{'0': {'rank': 0, 'current_epoch': 1, 'input_dir_path': '/cache/chunks/5587867a69455f09b8d5564b7b4ff7b5/0', 'input_dir_url': 's3://tiny-llama-template/slimpajama/val', 'item_loader': {'block_size': 2048}, 'drop_last': True, 'seed': 42, 'checkpoint_interval': 60, 'chunk_index': 0, 'global_index': 0, 'index': 0, 'world_size': 4, 'num_workers': 8, 'shuffle': True}}
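
To make the symptom easier to check, here is a minimal sketch that pulls the progress counters out of a state dict shaped like the output above and tests whether any of them advanced. The helpers `progress_counters` and `has_progress` and the constant `PROGRESS_KEYS` are hypothetical names used only for illustration and are not part of `lightning.data`. The `reported` dict is a subset of the printed output, and the exact meaning of each counter is an assumption based on its name.

```python
# Hypothetical helpers for inspecting a state dict shaped like the one above.
# They are NOT part of lightning.data; the names are made up for illustration.

# Counters that, judging by their names, should advance during iteration
# (assumption: the issue does not define them).
PROGRESS_KEYS = ("chunk_index", "global_index", "index")


def progress_counters(state_dict):
    """Return the progress counters for each top-level entry of the state dict."""
    return {
        top_key: {key: entry[key] for key in PROGRESS_KEYS}
        for top_key, entry in state_dict.items()
    }


def has_progress(state_dict):
    """Return True if any progress counter in any entry is greater than zero."""
    return any(
        value > 0
        for counters in progress_counters(state_dict).values()
        for value in counters.values()
    )


# Subset of the output reported in this issue.
reported = {
    "0": {
        "rank": 0,
        "current_epoch": 1,
        "checkpoint_interval": 60,
        "chunk_index": 0,
        "global_index": 0,
        "index": 0,
        "world_size": 4,
        "num_workers": 8,
    }
}

# The reproduction loop pulls 60 batches of 4 samples on each rank.
samples_consumed = 60 * 4

print(samples_consumed)             # 240
print(progress_counters(reported))  # every counter is 0
print(has_progress(reported))       # False
```

After 240 samples have been drawn on the rank, one would expect at least one of these counters to be non-zero, yet the check reports no progress at all.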

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): Master (commit 4d154685557881b4ff47083267b5a6328b465d61)
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0): 
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

awaelchli added the "bug" (Something isn't working) and "needs triage" (Waiting to be triaged by maintainers) labels on Dec 7, 2023
awaelchli changed the title from "StreamingDataset state dict" to "StreamingDataset state dict not correct" on Dec 7, 2023
awaelchli added the "data (external)" (litdata package) label and removed the "needs triage" (Waiting to be triaged by maintainers) label on Dec 7, 2023