Failure to hash function when using .map() #5536

Closed
venzen opened this issue Feb 16, 2023 · 14 comments

@venzen

venzen commented Feb 16, 2023

Describe the bug

Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

This issue with .map() happens for me consistently, as also described in closed issue #4506.

Individual dataset examples can be serialized using dill and pickle without any errors. I'm using tiktoken to encode text in the function passed to map(); individual examples can likewise be encoded without error.

Steps to reproduce the bug

from datasets import load_dataset
import tiktoken

dataset = load_dataset("stas/openwebtext-10k")

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode(example['text'])
    ids.append(enc.eot_token)
    out = {'ids': ids, 'len': len(ids)}
    return out

tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
)

Expected behavior

map() should encode the text examples, with the transform hashed correctly so caching works and no warning is shown.

Environment info

Python versions tried: both 3.8 and 3.10.10
PYTHONUTF8=1 as env variable

Datasets tried:

  • stas/openwebtext-10k
  • rotten_tomatoes
  • local text file

OS: Ubuntu Linux 20.04

Package versions:

  • torch 1.13.1
  • dill 0.3.4 (same issue with 0.3.6)
  • datasets 2.9.0
  • tiktoken 0.2.0
@lhoestq
Member

lhoestq commented Feb 16, 2023

Hi! enc is not hashable:

import tiktoken
from datasets.fingerprint import Hasher

enc = tiktoken.get_encoding("gpt2")
Hasher.hash(enc)
# raises TypeError: cannot pickle 'builtins.CoreBPE' object

It happens because enc is not picklable, so the result of map can't be cached, hence the warning message.

You can find more details about caching here: https://huggingface.co/docs/datasets/about_cache

You can also provide your own unique hash in map if you want, with the new_fingerprint argument.
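A minimal sketch (new_fingerprint is an argument of Dataset.map, so this assumes you map a single split; the string itself is an arbitrary cache label you choose and bump whenever the transform changes):

tokenized = dataset["train"].map(
    process,
    remove_columns=['text'],
    new_fingerprint="tiktoken-gpt2-v1",
)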
Or disable caching using

import datasets
datasets.disable_caching()

venzen changed the title from "Failure to hash when using the .map() function" to "Failure to hash function when using .map()" on Feb 16, 2023
@venzen
Author

venzen commented Feb 16, 2023

@lhoestq Thank you for the explanation and advice. Will relay all of this to the repo where this (non)issue arose.

Great job with huggingface!

venzen closed this as completed on Feb 16, 2023
@lhoestq
Member

lhoestq commented Feb 22, 2023

We made tiktoken tokenizers hashable in #5552, which is included in today's release, datasets==2.10.0.

@edhenry

edhenry commented Feb 28, 2023

Just a heads up that when I try to use tiktoken with a given Dataset's .map() method, I am still met with the following error:

  File "/opt/conda/lib/python3.8/site-packages/dill/_dill.py", line 388, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/opt/conda/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'builtins.CoreBPE' object

My current environment is running datasets v2.10.0.

@lhoestq
Member

lhoestq commented Feb 28, 2023

cc @mariosasko

@moinnadeem

@lhoestq @edhenry I am also seeing this. Do you have any suggested solution?

@lhoestq
Member

lhoestq commented May 19, 2023

With which datasets version? Can you try to update?

@moinnadeem

@lhoestq @edhenry I am on datasets version 2.12.0. I see the same TypeError: cannot pickle 'builtins.CoreBPE' object that others are seeing.

@sytelus

sytelus commented Aug 3, 2023

I am able to reproduce this on datasets 2.14.2. datasets.disable_caching() doesn't work around it.

@lhoestq - you might want to reopen this issue. Because of this issue, folks won't be able to run Karpathy's nanoGPT :(

@mengban

mengban commented Aug 10, 2023

Update: I temporarily solved the problem by setting

--preprocess_num_workers  1
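
For reference, in the plain datasets .map() API used elsewhere in this thread, the assumed equivalent of that flag (which belongs to a separate training script) is num_proc=1, which avoids pickling the encoder for worker processes:

tokenized = dataset.map(process, remove_columns=['text'], num_proc=1)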

I have met the same problem; here is my env:

datasets                      2.14.4
transformers                  4.31.0
tiktoken                      0.4.0
torch                         1.13.1

@mariosasko
Collaborator

@mengban I cannot reproduce the issue even with these versions installed. It would help if you could provide info about your system and the pip list output.

@maxwellzh

@mariosasko Please take a look at this:

from typing import Any
from datasets import Dataset
import tiktoken

dataset = Dataset.from_list([{"n": str(i)} for i in range(20)])
enc = tiktoken.get_encoding("gpt2")


class A:
    tokenizer = enc  # tiktoken.get_encoding("gpt2")

    def __call__(self, example) -> Any:
        ids = self.tokenizer.encode(example["n"])
        example["len"] = len(ids)
        return example

a = A()

def process(example):
    ids = a.tokenizer.encode(example["n"])
    example["len"] = len(ids)
    return example

# success
tokenized = dataset.map(process, desc="tiktoken", num_proc=2)

# raise TypeError: cannot pickle 'builtins.CoreBPE' object
tokenized = dataset.map(a, desc="tiktoken", num_proc=2)

pip list

datasets                      2.14.4
tiktoken                      0.4.0

@mariosasko
Collaborator

Thanks @maxwellzh! Our Hasher works with this snippet, but the problem is running multiprocessing with a non-serializable tiktoken.Encoding object.

Inserting the following code before the map should fix this:

import copyreg
import functools
import tiktoken

def pickle_Encoding(enc):
    return (functools.partial(tiktoken.core.Encoding, enc.name, pat_str=enc._pat_str, mergeable_ranks=enc._mergeable_ranks, special_tokens=enc._special_tokens), ())

copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)

But the best fix would be implementing __reduce__ for tiktoken.Encoding or tiktoken.CoreBPE. If I find time, I'll try to fix this in the tiktoken repo.
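
For illustration, a hypothetical sketch of what such a __reduce__ could look like if patched on from the outside (this is not part of tiktoken's actual API; it relies on the same private Encoding attributes as the copyreg workaround above):

import tiktoken

def _rebuild_encoding(name, pat_str, mergeable_ranks, special_tokens):
    return tiktoken.core.Encoding(name, pat_str=pat_str, mergeable_ranks=mergeable_ranks, special_tokens=special_tokens)

def _encoding_reduce(self):
    # Rebuild the Encoding from its constructor arguments instead of trying to
    # pickle the underlying Rust CoreBPE object.
    return (_rebuild_encoding, (self.name, self._pat_str, self._mergeable_ranks, self._special_tokens))

tiktoken.core.Encoding.__reduce__ = _encoding_reduce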

@sytelus

sytelus commented Sep 8, 2023

I think the right way to fix this would be to have a new tokenizer instance for each process. This applies to many other tokenizers that don't support multiprocessing or have bugs. To do this, first define a tokenizer factory class like this:

    class TikTokenFactory:
        def __init__(self):
            self._enc = None
            self.eot_token = None

        def encode_ordinary(self, text):
            if self._enc is None:
                self._enc = tiktoken.get_encoding("gpt2")
                self.eot_token = self._enc.eot_token
            return self._enc.encode_ordinary(text)

Now use this in .map() like this:

    from functools import partial
    from multiprocessing import cpu_count

    # tokenize the dataset
    tokenized = dataset.map(
        partial(process, TikTokenFactory()),
        remove_columns=['text'],
        desc="tokenizing the splits",
        num_proc=max(1, cpu_count()//2),
    )
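
For completeness, a minimal sketch of the process helper this snippet assumes (hypothetical; it mirrors the function from the original report, with the factory passed in as the first argument via partial):

    def process(tokenizer, example):
        # encode_ordinary lazily creates the per-process tiktoken encoding on first use
        ids = tokenizer.encode_ordinary(example['text'])
        ids.append(tokenizer.eot_token)  # eot_token is populated by that first encode call
        return {'ids': ids, 'len': len(ids)}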

A full working example is here: https://github.com/sytelus/nanoGPT/blob/refactor/nanogpt_common/hf_data_prepare.py
