Tiktoken tokenizers are not picklable #5769

markovalexander · 2023-04-18T16:07:40Z

Describe the bug

Since tiktoken tokenizers are not picklable, it is not possible to use one inside dataset.map() with multiprocessing enabled. However, you made tiktoken's tokenizers picklable in datasets==2.10.0 for caching. For some reason, this logic does not work during dataset processing, and the call raises TypeError: cannot pickle 'builtins.CoreBPE' object

Steps to reproduce the bug

from datasets import load_dataset
import tiktoken

dataset = load_dataset("stas/openwebtext-10k")

enc = tiktoken.get_encoding("gpt2")

# process must be defined before it is passed to map()
def process(example):
    ids = enc.encode(example['text'])
    ids.append(enc.eot_token)
    out = {'ids': ids, 'len': len(ids)}
    return out

tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
    num_proc=2,
)
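
If the error persists in a given setup, one possible workaround is to avoid capturing the tokenizer in the mapped function at all and to build it lazily inside each worker process. The following is only a sketch of that pattern, not something confirmed in this thread: FakeEncoding, is_picklable, and get_enc are hypothetical names, and FakeEncoding is a standard-library stand-in for an unpicklable tokenizer so the snippet runs without tiktoken installed.

```python
import pickle
import threading


class FakeEncoding:
    """Hypothetical stand-in for an unpicklable tokenizer (not tiktoken itself)."""

    def __init__(self):
        # A lock cannot be pickled, which mimics the CoreBPE failure in the report.
        self._lock = threading.Lock()
        self.eot_token = 0

    def encode(self, text):
        # Toy encoding: one code point per character.
        return [ord(ch) for ch in text]


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


_enc = None  # filled in lazily, once per process


def get_enc():
    """Build the tokenizer on first use so it never has to be pickled."""
    global _enc
    if _enc is None:
        # With tiktoken, this line would be: _enc = tiktoken.get_encoding("gpt2")
        _enc = FakeEncoding()
    return _enc


def process(example):
    enc = get_enc()  # looked up at call time, not captured when the function is serialized
    ids = enc.encode(example["text"])
    ids.append(enc.eot_token)
    return {"ids": ids, "len": len(ids)}
```

With this shape, process can be passed to dataset.map(..., num_proc=2) as in the steps above. The assumption is that get_enc() is not called in the parent process before map(), so no tokenizer instance exists at the time the function is serialized.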

Expected behavior

The dataset starts processing.

Environment info

datasets version: 2.11.0
Platform: Linux-5.15.0-1021-oracle-x86_64-with-glibc2.29
Python version: 3.8.10
Huggingface_hub version: 0.13.4
PyArrow version: 9.0.0
Pandas version: 2.0.0

albertvillanova · 2023-04-20T06:03:03Z

Thanks for reporting, @markovalexander.

Unfortunately, I'm not able to reproduce the issue: the tiktoken tokenizer can be used within Dataset.map, both on my local machine and in a Colab notebook: https://colab.research.google.com/drive/1DhJroZgk0sNFJ2Mrz-jYgrmh9jblXaCG?usp=sharing

Are you sure you are using datasets version 2.11.0?

mariosasko closed this as completed May 4, 2023

hiyouga mentioned this issue Aug 3, 2023

preprocess_dataset dataset.map crashed with TypeError: cannot pickle 'builtins.CoreBPE' object hiyouga/LLaMA-Factory#328

Closed

jklj077 mentioned this issue Aug 4, 2023

Does tiktoken not support multithreaded tokenization? QwenLM/Qwen#36

Closed

This was referenced Nov 28, 2023

[text] refine tokenizer wenet-e2e/wenet#2165

Merged

[text] fix whisper tokens and others wenet-e2e/wenet#2179

Merged
