Fix train data script #156

Merged: 5 commits into main, Jul 6, 2024
Conversation

@natolambert (Collaborator) commented Apr 29, 2024

Closes #153. Makes it so a token isn't needed (you can use huggingface-cli instead). I tested this by retraining OLMo 1.7.
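As a quick illustration of the auth flow (my own sketch, not code from this PR): once huggingface-cli login has cached a token locally, the datasets library picks it up by default, so the prep script no longer needs a token passed in explicitly.

# Sketch, not from the PR: assumes `huggingface-cli login` has been run once,
# so the locally cached token is used automatically when none is passed.
from datasets import load_dataset

# Public datasets need no token at all; gated/private ones fall back to the
# cached credential when the `token` argument is left unspecified.
ds = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
print(len(ds))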


echo "Downloading WizardLM dataset..."
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json
# original data removed wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped/resolve/main/data/train-00000-of-00001-004cd1ba9dc05e6c.parquet
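One way to sanity-check the backup before retraining (my own sketch, not part of the PR): load it from the Hub and confirm the row count and schema match the original 143k-example JSON it replaces.

# Sketch: load the backup WizardLM data referenced above and inspect its size
# and columns; if it really is the same data, the row count should match the
# original WizardLM_evol_instruct_V2_143k file.
from datasets import load_dataset

backup = load_dataset("Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped", split="train")
print(len(backup))
print(backup.column_names)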
@hamishivi (Collaborator)

Have you verified this is the same data?

@hamishivi (Collaborator)

LGTM so long as you've verified the data is the same as our released mixture.

@natolambert (Collaborator, Author)

Code for comparing the two datasets:

from datasets import load_dataset

def load_and_compare_datasets(dataset_name1, dataset_name2, split='train'):
    # Load both datasets from the Hugging Face Hub
    dataset1 = load_dataset(dataset_name1, split=split)
    dataset2 = load_dataset(dataset_name2, split=split)

    # Index each dataset by its 'id' column
    dict1 = {row['id']: row for row in dataset1}
    dict2 = {row['id']: row for row in dataset2}

    # Find ids that appear in only one of the two datasets
    unique_ids_in_set1 = set(dict1.keys()) - set(dict2.keys())
    unique_ids_in_set2 = set(dict2.keys()) - set(dict1.keys())

    # Print the entries unique to each dataset
    print("Entries unique to dataset 1:")
    for example_id in unique_ids_in_set1:
        print(dict1[example_id])

    print("\nEntries unique to dataset 2:")
    for example_id in unique_ids_in_set2:
        print(dict2[example_id])

# Example usage with dataset names and optional split specification
load_and_compare_datasets('ai2-adapt-dev/tulu2-tmp', 'allenai/tulu-v2-sft-mixture', split='train')

@natolambert (Collaborator, Author)

There are minor differences, mostly from WizardLM taking data down, and possibly from slight changes to the dataset processing script since the v2 version on Hugging Face. New backups made:

@hamishivi (Collaborator)

So the script now points to this backup WizardLM version, which yields some small differences when you run the train prep script?

@natolambert (Collaborator, Author)

@hamishivi, yes.
The backup was taken from your NFS raw train files. I can get the date it was last modified, but I think multiple things have potentially changed since we uploaded the Tulu v2 SFT mix.

@cbfcbf commented Jun 30, 2024

I found a typo at line 86 of reformat_dataset.py: "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0".
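For context, a minimal sketch of the corrected branch; everything here except the condition itself is hypothetical, since the surrounding code isn't shown in this thread.

# Hypothetical sketch; only the corrected condition comes from the report above.
def choose_prompt_style(num_zero_shot_examples, num_few_shot_examples):
    # before the fix, this branch mistakenly tested num_few_shot_examples
    if num_zero_shot_examples > 0:
        return "zero-shot"
    if num_few_shot_examples > 0:
        return "few-shot"
    return "no examples"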

@hamishivi (Collaborator)

Okay, I think this is fine to merge. I ran the script through without errors. I'll add a note saying some samples have shifted, pointing to this PR. Since we've already uploaded the mixture we actually trained on, I think this is okay.
Also fixed up the bug, thanks!

@natolambert (Collaborator, Author)

Needs the special approval button :)

@hamishivi merged commit f1df6b0 into main on Jul 6, 2024. 1 check passed.
Successfully merging this pull request may close these issues: WizardLM Data Gone (prep data script error).