Failure to hash function when using .map() #5536

Closed
venzen opened this issue Feb 16, 2023 · 14 comments

@venzen

venzen commented Feb 16, 2023

Describe the bug

Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

This issue with .map() happens for me consistently, as also described in closed issue #4506.

Individual dataset examples can be serialized using dill and pickle without any errors. I'm using tiktoken to encode text in the function passed to map(); individual examples can likewise be encoded without error.

Steps to reproduce the bug

from datasets import load_dataset
import tiktoken

dataset = load_dataset("stas/openwebtext-10k")

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode(example['text'])
    ids.append(enc.eot_token)
    out = {'ids': ids, 'len': len(ids)}
    return out

tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the OWT splits",
)

Expected behavior

map() should encode the text examples, with the transform hashed correctly so caching works and no warning is shown.

Environment info

Python versions tried: both 3.8 and 3.10.10
PYTHONUTF8=1 as env variable

Datasets tried:

  • stas/openwebtext-10k
  • rotten_tomatoes
  • local text file

OS: Ubuntu Linux 20.04

Package versions:

  • torch 1.13.1
  • dill 0.3.4 (same issue with 0.3.6)
  • datasets 2.9.0
  • tiktoken 0.2.0
@lhoestq
Member

lhoestq commented Feb 16, 2023

Hi! enc is not hashable:

import tiktoken
from datasets.fingerprint import Hasher

enc = tiktoken.get_encoding("gpt2")
Hasher.hash(enc)
# raises TypeError: cannot pickle 'builtins.CoreBPE' object

It happens because enc is not picklable, so the result of map can't be cached, hence the warning message.

You can find more details about caching here: https://huggingface.co/docs/datasets/about_cache

You can also provide your own unique hash in map if you want, with the new_fingerprint argument.
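A minimal sketch (new_fingerprint is an argument of Dataset.map, so this assumes you map a single split; the string itself is an arbitrary cache label you choose and bump whenever the transform changes):

tokenized = dataset["train"].map(
    process,
    remove_columns=['text'],
    new_fingerprint="tiktoken-gpt2-v1",
)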
Or disable caching using

import datasets
datasets.disable_caching()

venzen changed the title from "Failure to hash when using the .map() function" to "Failure to hash function when using .map()" on Feb 16, 2023
@venzen
Author

venzen commented Feb 16, 2023

@lhoestq Thank you for the explanation and advice. Will relay all of this to the repo where this (non)issue arose.

Great job with huggingface!

venzen closed this as completed on Feb 16, 2023
@lhoestq
Member

lhoestq commented Feb 22, 2023

We made tiktoken tokenizers hashable in #5552, which is included in today's release, datasets==2.10.0.

@edhenry

edhenry commented Feb 28, 2023

Just a heads up that when I try to use tiktoken with a given Dataset's .map() method, I am still met with the following error:

  File "/opt/conda/lib/python3.8/site-packages/dill/_dill.py", line 388, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/opt/conda/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'builtins.CoreBPE' object

My current environment is running datasets v2.10.0.

@lhoestq
Member

lhoestq commented Feb 28, 2023

cc @mariosasko

@moinnadeem

@lhoestq @edhenry I am also seeing this. Do you have any suggested solution?

@lhoestq
Member

lhoestq commented May 19, 2023

With which datasets version? Can you try to update?

@moinnadeem

@lhoestq @edhenry I am on datasets version 2.12.0. I see the same TypeError: cannot pickle 'builtins.CoreBPE' object that others are seeing.

@sytelus

sytelus commented Aug 3, 2023

I am able to reproduce this on datasets 2.14.2. datasets.disable_caching() doesn't work around it.

@lhoestq - you might want to reopen this issue. Because of this issue, folks won't be able to run Karpathy's nanoGPT :(

@mengban

mengban commented Aug 10, 2023

Update: I temporarily solved the problem by setting

--preprocess_num_workers  1
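
For reference, in the plain datasets .map() API used elsewhere in this thread, the assumed equivalent of that flag (which belongs to a separate training script) is num_proc=1, which avoids pickling the encoder for worker processes:

tokenized = dataset.map(process, remove_columns=['text'], num_proc=1)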

I have met the same problem; here is my env:

datasets                      2.14.4
transformers                  4.31.0
tiktoken                      0.4.0
torch                         1.13.1

@mariosasko
Collaborator

@mengban I cannot reproduce the issue even with these versions installed. It would help if you could provide info about your system and the pip list output.

@maxwellzh

@mariosasko Please take a look at this:

from typing import Any
from datasets import Dataset
import tiktoken

dataset = Dataset.from_list([{"n": str(i)} for i in range(20)])
enc = tiktoken.get_encoding("gpt2")


class A:
    tokenizer = enc  # tiktoken.get_encoding("gpt2")

    def __call__(self, example) -> Any:
        ids = self.tokenizer.encode(example["n"])
        example["len"] = len(ids)
        return example

a = A()

def process(example):
    ids = a.tokenizer.encode(example["n"])
    example["len"] = len(ids)
    return example

# success
tokenized = dataset.map(process, desc="tiktoken", num_proc=2)

# raise TypeError: cannot pickle 'builtins.CoreBPE' object
tokenized = dataset.map(a, desc="tiktoken", num_proc=2)

pip list

datasets                      2.14.4
tiktoken                      0.4.0

@mariosasko
Collaborator

Thanks @maxwellzh! Our Hasher works with this snippet, but the problem is running multiprocessing with a non-serializable tiktoken.Encoding object.

Inserting the following code before the map should fix this:

import copyreg
import functools
import tiktoken

def pickle_Encoding(enc):
    return (functools.partial(tiktoken.core.Encoding, enc.name, pat_str=enc._pat_str, mergeable_ranks=enc._mergeable_ranks, special_tokens=enc._special_tokens), ())

copyreg.pickle(tiktoken.core.Encoding, pickle_Encoding)

But the best fix would be implementing __reduce__ for tiktoken.Encoding or tiktoken.CoreBPE. If I find time, I'll try to fix this in the tiktoken repo.
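
For illustration, a hypothetical sketch of what such a __reduce__ could look like if patched on from the outside (this is not part of tiktoken's actual API; it relies on the same private Encoding attributes as the copyreg workaround above):

import tiktoken

def _rebuild_encoding(name, pat_str, mergeable_ranks, special_tokens):
    return tiktoken.core.Encoding(name, pat_str=pat_str, mergeable_ranks=mergeable_ranks, special_tokens=special_tokens)

def _encoding_reduce(self):
    # Rebuild the Encoding from its constructor arguments instead of trying to
    # pickle the underlying Rust CoreBPE object.
    return (_rebuild_encoding, (self.name, self._pat_str, self._mergeable_ranks, self._special_tokens))

tiktoken.core.Encoding.__reduce__ = _encoding_reduce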

@sytelus

sytelus commented Sep 8, 2023

I think the right way to fix this would be to have a new tokenizer instance for each process. This applies to many other tokenizers that don't support multiprocessing or have bugs. To do this, first define a tokenizer factory class like this:

    class TikTokenFactory:
        def __init__(self):
            self._enc = None
            self.eot_token = None

        def encode_ordinary(self, text):
            if self._enc is None:
                self._enc = tiktoken.get_encoding("gpt2")
                self.eot_token = self._enc.eot_token
            return self._enc.encode_ordinary(text)

Now use this in .map() like this:

    from functools import partial
    from multiprocessing import cpu_count

    # tokenize the dataset
    tokenized = dataset.map(
        partial(process, TikTokenFactory()),
        remove_columns=['text'],
        desc="tokenizing the splits",
        num_proc=max(1, cpu_count()//2),
    )
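
For completeness, a minimal sketch of the process helper this snippet assumes (hypothetical; it mirrors the function from the original report, with the factory passed in as the first argument via partial):

    def process(tokenizer, example):
        # encode_ordinary lazily creates the per-process tiktoken encoding on first use
        ids = tokenizer.encode_ordinary(example['text'])
        ids.append(tokenizer.eot_token)  # eot_token is populated by that first encode call
        return {'ids': ids, 'len': len(ids)}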

A full working example is here: https://github.com/sytelus/nanoGPT/blob/refactor/nanogpt_common/hf_data_prepare.py
