Failure to hash function when using .map() #5536
Hi !

```python
import tiktoken
from datasets.fingerprint import Hasher

enc = tiktoken.get_encoding("gpt2")
Hasher.hash(enc)
# raises TypeError: cannot pickle 'builtins.CoreBPE' object
```

It happens because it's not picklable, and because of that it's not possible to cache the result of `map`. You can find more details about caching here: https://huggingface.co/docs/datasets/about_cache

You can also provide your own unique hash in `map`, or disable caching:

```python
import datasets

datasets.disable_caching()
```
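For the "own unique hash" route, `Dataset.map` exposes a `new_fingerprint` argument; a minimal sketch (the toy dataset and the fingerprint string are illustrative):

```python
import tiktoken
from datasets import Dataset

enc = tiktoken.get_encoding("gpt2")
dataset = Dataset.from_list([{"text": "hello world"}])

# Supplying an explicit fingerprint means `map` doesn't need to hash
# the (unpicklable) transform to build its cache key.
tokenized = dataset.map(
    lambda example: {"ids": enc.encode(example["text"])},
    new_fingerprint="gpt2-tiktoken-v1",
)
```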
@lhoestq Thank you for the explanation and advice. Will relay all of this to the repo where this (non)issue arose. Great job with huggingface!
We made tiktoken tokenizers hashable in #5552, which is included in today's release.
Just a heads up that I'm still hitting this when trying to use TikToken along with a given Dataset. My current environment is running datasets v2.10.0.
cc @mariosasko
With which version?
I am able to reproduce this on datasets 2.14.2. @lhoestq - you might want to reopen this issue. Because of this issue folks won't be able to run Karpathy's NanoGPT :(.
update: temporarily solved the problem by setting
I have run into the same problem; here is my env:
@mengban I cannot reproduce the issue even with these versions installed. It would help if you could provide info about your system and the
@mariosasko Please take a look at this:

```python
from typing import Any

import tiktoken
from datasets import Dataset

dataset = Dataset.from_list([{"n": str(i)} for i in range(20)])
enc = tiktoken.get_encoding("gpt2")


class A:
    tokenizer = enc  # tiktoken.get_encoding("gpt2")

    def __call__(self, example) -> Any:
        ids = self.tokenizer.encode(example["n"])
        example["len"] = len(ids)
        return example


a = A()


def process(example):
    ids = a.tokenizer.encode(example["n"])
    example["len"] = len(ids)
    return example


# success
tokenized = dataset.map(process, desc="tiktoken", num_proc=2)

# raise TypeError: cannot pickle 'builtins.CoreBPE' object
tokenized = dataset.map(a, desc="tiktoken", num_proc=2)
```

pip list:
Thanks @maxwellzh! Our hashing relies on pickle, so the unpicklable `CoreBPE` inside the `Encoding` is the problem here. Inserting the following code before the `map` call works around it by telling pickle how to rebuild an `Encoding` from its constructor arguments:

```python
import copyreg
import functools

import tiktoken


def pickle_Encoding(enc):
    return (
        functools.partial(
            tiktoken.core.Encoding,
            enc.name,
            pat_str=enc._pat_str,
            mergeable_ranks=enc._mergeable_ranks,
            special_tokens=enc._special_tokens,
        ),
        (),
    )


copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)
```

But the best fix would be implementing pickling support in `tiktoken` itself.
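With the reducer registered, the previously failing call from the snippet above should go through, since each worker process can rebuild the encoding from those constructor arguments:

```python
# Reusing `dataset` and `a` from the earlier reproduction: `a.tokenizer`
# now pickles as "reconstruct from name/pat_str/ranks/special_tokens",
# so spawning workers no longer trips over CoreBPE.
tokenized = dataset.map(a, desc="tiktoken", num_proc=2)
```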
I think the right way to fix this would be to have a new tokenizer instance for each process. This applies to many other tokenizers that don't support multiprocessing or have bugs there. To do this, first define a tokenizer factory class like this:
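A minimal sketch of such a factory, assuming lazy per-process construction (names are illustrative; the full version is in the example linked below):

```python
import tiktoken


class TokenizerFactory:
    """Builds the tiktoken encoding lazily, once per process.

    The factory itself pickles fine because the unpicklable
    CoreBPE-backed Encoding is dropped from its pickled state
    and recreated on first use inside each worker.
    """

    def __init__(self, encoding_name: str = "gpt2"):
        self.encoding_name = encoding_name
        self._enc = None  # created on first use, per process

    @property
    def enc(self):
        if self._enc is None:
            self._enc = tiktoken.get_encoding(self.encoding_name)
        return self._enc

    def __getstate__(self):
        # Never ship the Encoding across process boundaries.
        return {"encoding_name": self.encoding_name, "_enc": None}
```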
Now use this in the function passed to `map`:
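Again a sketch under the same assumptions, with a toy column name:

```python
from datasets import Dataset

factory = TokenizerFactory("gpt2")


def process(example):
    # `factory` pickles cleanly; each worker builds its own Encoding on first call.
    return {"len": len(factory.enc.encode(example["n"]))}


dataset = Dataset.from_list([{"n": str(i)} for i in range(20)])
tokenized = dataset.map(process, desc="tiktoken", num_proc=2)
```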
A full working example is here: https://github.com/sytelus/nanoGPT/blob/refactor/nanogpt_common/hf_data_prepare.py
Describe the bug

```
Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```

This issue with `.map()` happens for me consistently, as also described in closed issue #4506. Dataset indices can be individually serialized using dill and pickle without any errors. I'm using tiktoken to encode in the function passed to `map()`. Similarly, indices can be individually encoded without error.
Steps to reproduce the bug
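A minimal reproduction consistent with the rest of the thread (the column name and sample text are illustrative):

```python
import tiktoken
from datasets import Dataset

enc = tiktoken.get_encoding("gpt2")


def process(example):
    example["ids"] = enc.encode(example["text"])
    return example


dataset = Dataset.from_list([{"text": "hello world"}])
# Emits the hashing warning above: `enc` wraps a CoreBPE object
# that pickle/dill cannot serialize.
tokenized = dataset.map(process)
```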
Expected behavior
Should encode simple text objects.
Environment info
- Python versions tried: both 3.8 and 3.10.10
- `PYTHONUTF8=1` as env variable
- Datasets tried:
- OS: Ubuntu Linux 20.04
- Package versions: