On Windows 11 with Python 3.9.0, running prepare.py fails while tokenizing the splits. The outer traceback points at line 50 (the split_dataset.map call), but the error is actually raised inside the process() function (line 43 in the traceback below). The "enc" defined at line 39 is not visible when process() is called in the worker processes.
An easy (and verified) workaround: copy line 39 into the first line of process().
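A minimal sketch of that workaround follows. Only the `enc.encode_ordinary(...)` line is taken from the traceback; the `tiktoken.get_encoding("gpt2")` call (presumed to be what line 39 contains), the end-of-text append, and the returned dictionary are assumptions about the rest of prepare.py. The local import is added here only to keep the sketch self-contained.

```python
def process(example):
    # Workaround: create the encoder inside the function so that each
    # worker process has its own `enc` instead of relying on a global.
    import tiktoken  # local import keeps this sketch self-contained
    enc = tiktoken.get_encoding("gpt2")  # assumed content of line 39

    # This line appears in the traceback; encode_ordinary ignores special tokens.
    ids = enc.encode_ordinary(example['text'])

    # Assumption: the script appends the end-of-text token and returns
    # the token ids together with their count.
    ids.append(enc.eot_token)
    return {'ids': ids, 'len': len(ids)}
```

With this change, `split_dataset.map(process, ...)` no longer depends on a module-level `enc` being present in the worker. Creating the encoder on every call may add some overhead, so this is a workaround rather than a tuned fix.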
FYI, here's the full callstack:
(tpx) d:\github\nanoGPT>python data/openwebtext/prepare.py
tokenizing the splits (num_proc=8): 0%| | 0/8009762 [00:07<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3450, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 43, in process
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
NameError: name 'enc' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 50, in <module>
    tokenized = split_dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 853, in map
    {
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 854, in <dictcomp>
    k: dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
NameError: name 'enc' is not defined