huggingface dataset issue in data/openwebtext/prepare.py #155
A dev at huggingface explained that, as the warning message says, the negative effect is only on the dataset cache, because huggingface datasets uses a hash of the transform function as a fingerprint for caching. In the case of data/openwebtext/prepare.py this is not a concern, because @karpathy's script only runs a single transform (encoding with tiktoken) on the cached OpenWebText dataset. The warning does not imply dataset mangling or encoding errors. If you plan to run additional transforms on the same dataset, be aware that you lose processing efficiency. More about dataset caching here: https://huggingface.co/docs/datasets/about_cache
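If you want to check up front whether a transform will hash deterministically (and therefore hit the cache when reused), a minimal sketch using the fingerprinting helper from the datasets library might look like the following; the `process` function here is just an illustrative stand-in, and the `Hasher` import path is assumed from the library's fingerprinting module:

```python
from datasets.fingerprint import Hasher

# Illustrative stand-in for a .map() transform, analogous to process() in prepare.py.
def process(example):
    return {"length": len(example["text"])}

try:
    # If this succeeds, .map() can fingerprint the function and reuse its cache
    # across runs instead of recomputing the transform.
    print("fingerprint:", Hasher.hash(process))
except Exception as e:
    # If hashing fails, .map() falls back to a random fingerprint, which is the
    # situation the warning below describes: the output is still correct, but
    # the cache will not be reused on subsequent calls.
    print("function could not be hashed:", e)
```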
When running the script data/openwebtext/prepare.py, some people might get the following function hashing warning reported for the function `process(example)`:

Parameter 'function'=<function process at 0x7f1ec4388af0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

This is related to the huggingface `dataset.map()` function used in the script. I've reported the issue here: huggingface/datasets#5536
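For context, the call that triggers the warning is the tokenization step in prepare.py. The sketch below reproduces it from memory of the nanoGPT script, with a tiny in-memory dataset standing in for OpenWebText, so the variable names and exact keyword arguments should be treated as approximations:

```python
import tiktoken
from datasets import Dataset, DatasetDict

# Tiny stand-in for the cached OpenWebText splits used in prepare.py.
split_dataset = DatasetDict({
    "train": Dataset.from_dict({"text": ["hello world", "a second document"]}),
    "val": Dataset.from_dict({"text": ["a held-out document", "another one"]}),
})

enc = tiktoken.get_encoding("gpt2")

def process(example):
    # Encode the raw text with the GPT-2 BPE tokenizer, ignoring special tokens.
    ids = enc.encode_ordinary(example["text"])
    ids.append(enc.eot_token)  # end-of-text token delimits documents
    return {"ids": ids, "len": len(ids)}

# If `process` cannot be pickled/hashed for fingerprinting, the warning above is
# emitted here; the mapping itself still runs and produces correct output.
tokenized = split_dataset.map(
    process,
    remove_columns=["text"],
    desc="tokenizing the splits",
    num_proc=2,
)
```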
For a smaller dataset, shakespeare/prepare.py seems adequate (no need to leverage the memory efficiency of a `Dataset`), but large datasets require it. However, if you get the above warning when processing with the `.map()` function, you may suspect that some of the original data is missing from your encoded set.
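If you are worried about dropped data, a quick sanity check is to compare the number of examples and the total token count after the `.map()` call, along the lines of what prepare.py already does with the 'len' column when sizing the output .bin files. The column names `ids`/`len` below match the nanoGPT script but are assumptions if your version differs:

```python
import numpy as np

# `tokenized` comes from the .map() call shown in the sketch above.
for split, dset in tokenized.items():
    # One row per original document: if this matches the raw split size,
    # .map() has not silently dropped any examples.
    print(split, "examples:", len(dset))
    # Total number of tokens that would be written to this split's .bin file.
    print(split, "tokens:", int(np.sum(dset["len"], dtype=np.uint64)))
```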