huggingface dataset issue in data/openwebtext/prepare.py #155

Closed
venzen opened this issue Feb 16, 2023 · 1 comment
venzen commented Feb 16, 2023

When running the script data/openwebtext/prepare.py, some people might get the following function-hashing warning reported for the function process(example):

Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

This warning relates to the huggingface Dataset.map() function used in the script:

tokenized = split_dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the splits",
)
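For reference, the process function being mapped closes over a module-level tiktoken encoder, roughly like this (paraphrased from prepare.py; it is the reference to enc, not the map() call itself, that cannot be fingerprinted):

import tiktoken

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['text'])  # encode, ignoring any special tokens
    ids.append(enc.eot_token)                   # append the end-of-text token
    return {'ids': ids, 'len': len(ids)}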

I've reported the issue here: huggingface/datasets#5536

For a smaller dataset, shakespeare/prepare.py seems adequate (there is no need to leverage the memory efficiency of a Dataset), but large datasets require it. However, if you get the above warning when processing with the .map() function, you might suspect that some of the original data is missing from your encoded set.


venzen commented Feb 16, 2023

A dev at huggingface explained that enc is not serializable and, therefore, not hashable, hence the (legitimate) warning that the transform does not have a reusable hash stored in the cache. However, the tiktoken encoding still runs, and there is no negative effect on the data being processed.
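You can reproduce the hashing failure outside of .map() with datasets' own fingerprinting helper (a sketch using the internal Hasher class, so the exact import may change between versions):

from datasets.fingerprint import Hasher
import tiktoken

enc = tiktoken.get_encoding("gpt2")

try:
    Hasher.hash(enc)  # datasets uses this to fingerprint transforms and their arguments
except Exception as e:
    print("enc is not hashable:", e)  # dill cannot pickle the Rust-backed Encoding object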

As the warning message explains, the negative effect is on the dataset cache: huggingface datasets keeps track of dataset transforms via their hashes. Because the encoding transform could not be hashed correctly, subsequent transforms on the dataset will be recomputed from scratch instead of building on the previously cached transforms, i.e. you lose datasets' efficiency when processing big data.

In the case of data/openwebtext/prepare.py this is not a concern, because @karpathy's script only runs a single transform (encoding with tiktoken) on the cached OpenWebText dataset. The warning does not imply dataset mangling or encoding errors.

If you plan to do additional transforms on the same dataset then be aware that you are losing processing efficiency.
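If that matters for your pipeline, one possible workaround (a sketch, not something the script adopts) is to construct the encoder inside the mapped function, so that no unpicklable object is captured from the enclosing scope:

import tiktoken

def process(example):
    # get_encoding() is cached by tiktoken, so this is cheap after the first call
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode_ordinary(example['text'])
    ids.append(enc.eot_token)
    return {'ids': ids, 'len': len(ids)}

With nothing unpicklable in the closure, the function should hash deterministically and subsequent .map() calls can build on the cache.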

More about dataset caching here: https://huggingface.co/docs/datasets/about_cache

venzen closed this as completed Feb 23, 2023