-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix train data script #156
Conversation
scripts/prepare_train_data.sh
Outdated
|
||
echo "Downloading WizardLM dataset..." | ||
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json | ||
# original data removed wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json | ||
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped/resolve/main/data/train-00000-of-00001-004cd1ba9dc05e6c.parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you verified this is the same data?
LGTM so long as you've verified the data is the same as our released mixture. |
Code for comparing diffs:
|
There are minor differences, mostly via wizardlm taking data down and maybe slight changes to the dataset processing script from the v2 version on huggingface. New backups made:
|
So now the script points to this backup wizardlm version that yields some small differences when you run the train prep script? |
@hamishivi, yes. |
I found a typo at line 86 reformat_dataset.py : "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0" |
Okay, I think this is okay to merge. I ran the script through without errors. I'll add a note saying a samples have shifted, pointing to this PR. Since we have the mixture we actually trained on uploaded already, I think this is okay. |
Needs the special approval button :) |
Closes #153, makes it so token isnt needed (can use huggingface-cli), I tested this to retrain OLMo 1.7