Fix train data script #156

Merged: 5 commits into main, Jul 6, 2024
Conversation

@natolambert (Collaborator) commented Apr 29, 2024

Closes #153. Makes it so a token isn't needed (you can use huggingface-cli instead). I tested this by retraining OLMo 1.7.
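As a quick illustration of the auth flow (my own sketch, not code from this PR): once huggingface-cli login has cached a token locally, the datasets library picks it up by default, so the prep script no longer needs a token passed in explicitly.

# Sketch, not from the PR: assumes `huggingface-cli login` has been run once,
# so the locally cached token is used automatically when none is passed.
from datasets import load_dataset

# Public datasets need no token at all; gated/private ones fall back to the
# cached credential when the `token` argument is left unspecified.
ds = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
print(len(ds))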


echo "Downloading WizardLM dataset..."
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json
# original data removed wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped/resolve/main/data/train-00000-of-00001-004cd1ba9dc05e6c.parquet
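One way to sanity-check the backup before retraining (my own sketch, not part of the PR): load it from the Hub and confirm the row count and schema match the original 143k-example JSON it replaces.

# Sketch: load the backup WizardLM data referenced above and inspect its size
# and columns; if it really is the same data, the row count should match the
# original WizardLM_evol_instruct_V2_143k file.
from datasets import load_dataset

backup = load_dataset("Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped", split="train")
print(len(backup))
print(backup.column_names)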
@hamishivi (Collaborator)

Have you verified this is the same data?

@hamishivi (Collaborator)

LGTM so long as you've verified the data is the same as our released mixture.

@natolambert (Collaborator, Author)

Code for comparing the two datasets:

from datasets import load_dataset

def load_and_compare_datasets(dataset_name1, dataset_name2, split='train'):
    # Load both datasets from the Hugging Face Hub
    dataset1 = load_dataset(dataset_name1, split=split)
    dataset2 = load_dataset(dataset_name2, split=split)

    # Index each dataset by its 'id' column
    dict1 = {row['id']: row for row in dataset1}
    dict2 = {row['id']: row for row in dataset2}

    # Find ids that appear in only one of the two datasets
    unique_ids_in_set1 = set(dict1.keys()) - set(dict2.keys())
    unique_ids_in_set2 = set(dict2.keys()) - set(dict1.keys())

    # Print the entries unique to each dataset
    print("Entries unique to dataset 1:")
    for example_id in unique_ids_in_set1:
        print(dict1[example_id])

    print("\nEntries unique to dataset 2:")
    for example_id in unique_ids_in_set2:
        print(dict2[example_id])

# Example usage with dataset names and optional split specification
load_and_compare_datasets('ai2-adapt-dev/tulu2-tmp', 'allenai/tulu-v2-sft-mixture', split='train')

@natolambert (Collaborator, Author)

There are minor differences, mostly from WizardLM taking data down, and possibly from slight changes to the dataset processing script since the v2 version on Hugging Face. New backups made:

@hamishivi (Collaborator)

So the script now points to this backup WizardLM version, which yields some small differences when you run the train prep script?

@natolambert (Collaborator, Author)

@hamishivi, yes.
The backup was taken from your NFS raw train files. I can get the date it was last modified, but I think multiple things have potentially changed since we uploaded the Tulu v2 SFT mix.

@cbfcbf commented Jun 30, 2024

I found a typo at line 86 of reformat_dataset.py: "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0".
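For context, a minimal sketch of the corrected branch; everything here except the condition itself is hypothetical, since the surrounding code isn't shown in this thread.

# Hypothetical sketch; only the corrected condition comes from the report above.
def choose_prompt_style(num_zero_shot_examples, num_few_shot_examples):
    # before the fix, this branch mistakenly tested num_few_shot_examples
    if num_zero_shot_examples > 0:
        return "zero-shot"
    if num_few_shot_examples > 0:
        return "few-shot"
    return "no examples"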

@hamishivi (Collaborator)

Okay, I think this is fine to merge. I ran the script through without errors. I'll add a note saying some samples have shifted, pointing to this PR. Since we've already uploaded the mixture we actually trained on, I think this is okay.
Also fixed up the bug, thanks!

@natolambert (Collaborator, Author)

Needs the special approval button :)

@hamishivi merged commit f1df6b0 into main on Jul 6, 2024. 1 check passed.
Successfully merging this pull request may close these issues: WizardLM Data Gone (prep data script error).