CodeParrot data pretokenization #16932
New file (+49 lines):

```python
import multiprocessing
import time

from datasets import load_dataset

from arguments import PretokenizationArguments
from transformers import AutoTokenizer, HfArgumentParser


def tokenize(example):
    # Tokenize the raw file content without truncation and keep the
    # character-to-token ratio, useful for later filtering and statistics.
    output = dict()
    output["input_ids"] = tokenizer(example["content"], truncation=False)["input_ids"]
    output["ratio_char_token"] = len(example["content"]) / len(output["input_ids"])
    return output


parser = HfArgumentParser(PretokenizationArguments)
args = parser.parse_args()
if args.num_workers is None:
    args.num_workers = multiprocessing.cpu_count()
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)

t_start = time.time()
ds = load_dataset(args.dataset_name, split="train")
print(f"Dataset loaded in {time.time()-t_start:.2f}s")

t_start = time.time()
ds = ds.map(
    tokenize,
    num_proc=args.num_workers,
    # Drop the raw text and metadata columns so only the tokenized
    # columns (input_ids, ratio_char_token) remain.
    remove_columns=[
        "repo_name",
        "path",
        "copies",
        "size",
        "content",
        "license",
        "hash",
        "line_mean",
        "line_max",
        "alpha_frac",
        "autogenerated",
    ],
)
print(f"Dataset tokenized in {time.time()-t_start:.2f}s")

t_start = time.time()
ds.push_to_hub(args.tokenized_data_repo)
print(f"Data pushed to the hub in {time.time()-t_start:.2f}s")
```
Review comment: Can you also add a note to the training section on how to leverage the pretokenized data?
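Regarding the requested note, here is a minimal sketch of how training could consume the pretokenized data. Each example already carries an `input_ids` column and the raw `content` column has been removed, so the training side can skip the tokenizer entirely and pack the cached ids into fixed-length sequences. The repo name and `SEQ_LENGTH` below are illustrative assumptions, not values from this PR:

```python
import torch
from datasets import load_dataset

SEQ_LENGTH = 1024  # illustrative context size, not a value from this PR


def packed_sequences(dataset, seq_length=SEQ_LENGTH):
    """Pack pretokenized examples into fixed-length sequences for LM training."""
    buffer = []
    for example in dataset:
        # "input_ids" was produced by the pretokenization script, so no
        # tokenizer call is needed at training time.
        buffer.extend(example["input_ids"])
        while len(buffer) >= seq_length:
            yield torch.tensor(buffer[:seq_length])
            buffer = buffer[seq_length:]


# Placeholder repo: use whatever --tokenized_data_repo was pushed above.
ds = load_dataset("<username>/<tokenized_data_repo>", split="train", streaming=True)
for batch in packed_sequences(ds):
    # ... feed `batch` to the model ...
    break
```

This yields the same fixed-length sequences an on-the-fly tokenizing loader would produce, minus the per-step tokenization cost, which is the main point of pretokenizing.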