Describe the bug

Since the tiktoken tokenizer is not picklable, it cannot be used inside dataset.map() with multiprocessing enabled. However, tiktoken's tokenizers were made picklable in datasets==2.10.0 for caching. For some reason, this logic does not apply during dataset processing, and map() raises TypeError: cannot pickle 'builtins.CoreBPE' object.

Steps to reproduce the bug
from datasets import load_dataset
import tiktoken

dataset = load_dataset("stas/openwebtext-10k")
enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode(example['text'])
    ids.append(enc.eot_token)
    out = {'ids': ids, 'len': len(ids)}
    return out

tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=2,
)
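For context, the caching fix in datasets==2.10.0 works by registering a custom reduction for tiktoken's Encoding with copyreg, so pickling rebuilds the encoder from its constructor arguments instead of trying to serialize the Rust-backed CoreBPE. The sketch below demonstrates that mechanism with a stand-in class (FakeCoreBPE is hypothetical, used here only so the example runs without tiktoken installed):

```python
import copyreg
import pickle

class FakeCoreBPE:
    """Stand-in for tiktoken's Rust-backed CoreBPE, which pickle rejects."""
    def __init__(self, ranks):
        self.ranks = ranks

    def __reduce__(self):
        # Mimic the real object: refuse default pickling.
        raise TypeError("cannot pickle 'FakeCoreBPE' object")

def _reduce_fake_core_bpe(obj):
    # Rebuild from constructor arguments rather than serializing internals.
    return FakeCoreBPE, (obj.ranks,)

# copyreg's dispatch table is consulted before __reduce__, so this
# registration overrides the failing default behavior.
copyreg.pickle(FakeCoreBPE, _reduce_fake_core_bpe)

restored = pickle.loads(pickle.dumps(FakeCoreBPE({b"a": 0})))
```

The question in this issue is why map()'s multiprocessing path does not pick up that registration the way the caching path does.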
Expected behavior

Dataset processing starts and completes without errors.
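Until the pickling logic works inside map(), a common workaround (a sketch, not from the issue) is to build the encoder lazily inside each worker so the non-picklable object never crosses a process boundary. Shown here with a hypothetical StubEncoding standing in for tiktoken.get_encoding("gpt2"):

```python
import multiprocessing as mp

class StubEncoding:
    """Hypothetical stand-in for a tiktoken Encoding; only .encode() is assumed."""
    def encode(self, text):
        return [ord(c) for c in text]

_ENC = None  # per-process cache; each worker constructs its own encoder

def get_encoder():
    global _ENC
    if _ENC is None:
        # In the real script this would be tiktoken.get_encoding("gpt2").
        _ENC = StubEncoding()
    return _ENC

def process(text):
    # Only `text` and the returned dict travel between processes;
    # the encoder itself is never pickled.
    ids = get_encoder().encode(text)
    return {'ids': ids, 'len': len(ids)}

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(process, ["ab", "abc"]))
```

The same pattern works with dataset.map(num_proc=2): keep the encoder out of the mapped function's closure and construct it on first use in each worker.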
Environment info
datasets version: 2.11.0