Has the data been shuffled? #127

Open
Lisennlp opened this issue Nov 2, 2023 · 2 comments

Comments

@Lisennlp
Lisennlp commented Nov 2, 2023

Hello, I looked at your batch_view.py and found that it does not shuffle the data, whereas the gpt-neox library does shuffle the data.
So I want to confirm: was the data shuffled during training or not? I hope to get your answer, thank you!

@pietrolesci

I think this might provide an answer #123 (comment)

@itsnamgyu

The data is shuffled at the document level. The dataset repo ID given in https://github.com/EleutherAI/pythia#exploring-the-dataset says "preshuffled", i.e., EleutherAI/pile-standard-pythia-preshuffled.

I'm actually not sure about https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps, which is mentioned in https://github.com/EleutherAI/pythia#reproducing-training. I will add a question about this on #123.
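
To make the "preshuffled" idea concrete, here is a minimal toy sketch of how a dataset can be shuffled once at the document level and then read strictly in order, so that a viewer script needs no shuffle of its own. This is only an illustration under stated assumptions, not the actual Pythia or gpt-neox pipeline: the function names (`preshuffle_documents`, `pack_into_sequences`, `iterate_batches`), the seed, and the toy corpus are all hypothetical.

```python
import random


def preshuffle_documents(documents, seed=0):
    """Return a copy of `documents` in a seeded random order (done once, offline)."""
    rng = random.Random(seed)      # private RNG, so the order is reproducible
    shuffled = list(documents)     # copy: leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled


def pack_into_sequences(documents, seq_len):
    """Concatenate token lists in the given order and cut fixed-length sequences."""
    stream = [token for doc in documents for token in doc]
    n_full = len(stream) // seq_len  # drop the incomplete tail, if any
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def iterate_batches(sequences, batch_size):
    """Yield batches strictly in stored order; no shuffling at load time."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


# Hypothetical toy corpus: 6 "documents" of 4 token ids each.
docs = [[i] * 4 for i in range(6)]

shuffled_docs = preshuffle_documents(docs, seed=1234)     # shuffle happens here, once
sequences = pack_into_sequences(shuffled_docs, seq_len=8)
batches = list(iterate_batches(sequences, batch_size=2))  # sequential read afterwards
```

In this sketch the loader is deterministic and order-preserving, yet the batches still see documents in a random order because the shuffle was applied before the data was stored. Whether the Pythia index maps work the same way is exactly the open question raised above.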
