huggingface dataset issue in data/openwebtext/prepare.py #155

Closed
venzen opened this issue Feb 16, 2023 · 1 comment
venzen commented Feb 16, 2023

When running the script data/openwebtext/prepare.py, some people might get the following function-hashing warning reported for the function process(example):

Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

This warning relates to the huggingface Dataset.map() function used in the script:

tokenized = split_dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the splits",
)
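For reference, the process function being mapped closes over a module-level tiktoken encoder, roughly like this (paraphrased from prepare.py; it is the reference to enc, not the map() call itself, that cannot be fingerprinted):

import tiktoken

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['text'])  # encode, ignoring any special tokens
    ids.append(enc.eot_token)                   # append the end-of-text token
    return {'ids': ids, 'len': len(ids)}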

I've reported the issue here: huggingface/datasets#5536

For a smaller dataset, shakespeare/prepare.py seems adequate (there is no need to leverage the memory efficiency of a Dataset), but large datasets require it. However, if you get the above warning when processing with the .map() function, you might suspect that some of the original data is missing from your encoded set.


venzen commented Feb 16, 2023

A dev at huggingface explained that enc is not serializable and, therefore, not hashable, hence the (legitimate) warning that the transform does not have a reusable hash stored in the cache. However, the tiktoken encoding still runs, and there is no negative effect on the data being processed.
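You can reproduce the hashing failure outside of .map() with datasets' own fingerprinting helper (a sketch using the internal Hasher class, so the exact import may change between versions):

from datasets.fingerprint import Hasher
import tiktoken

enc = tiktoken.get_encoding("gpt2")

try:
    Hasher.hash(enc)  # datasets uses this to fingerprint transforms and their arguments
except Exception as e:
    print("enc is not hashable:", e)  # dill cannot pickle the Rust-backed Encoding object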

As the warning message explains, the negative effect is on the dataset cache: huggingface datasets keeps track of dataset transforms via their hashes. Because the encoding transform could not be hashed correctly, subsequent transforms on the dataset will be recomputed from scratch instead of building on the previously cached transforms, i.e. you lose datasets' efficiency when processing big data.

In the case of data/openwebtext/prepare.py this is not a concern, because @karpathy's script only runs a single transform (encoding with tiktoken) on the cached OpenWebText dataset. The warning does not imply dataset mangling or encoding errors.

If you plan to do additional transforms on the same dataset then be aware that you are losing processing efficiency.
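If that matters for your pipeline, one possible workaround (a sketch, not something the script adopts) is to construct the encoder inside the mapped function, so that no unpicklable object is captured from the enclosing scope:

import tiktoken

def process(example):
    # get_encoding() is cached by tiktoken, so this is cheap after the first call
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode_ordinary(example['text'])
    ids.append(enc.eot_token)
    return {'ids': ids, 'len': len(ids)}

With nothing unpicklable in the closure, the function should hash deterministically and subsequent .map() calls can build on the cache.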

More about dataset caching here: https://huggingface.co/docs/datasets/about_cache

venzen closed this as completed Feb 23, 2023