running data/openwebtext/prepare.py gives "enc is not defined" error #371

Closed
rfernand2 opened this issue Sep 6, 2023 · 2 comments

Comments

@rfernand2

On Windows 11 with Python 3.9.0, running prepare.py fails while tokenizing the splits. The call stack shows the error at line 50, but it is actually raised inside the process() function (reported as line 43 in the remote traceback below). The "enc" defined at line 39 is not visible when process() is called.

An easy (and verified) workaround: copy line 39 into the first line of process().
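
For anyone without the file open, here is a rough sketch of what that workaround could look like. It assumes line 39 is the `tiktoken.get_encoding("gpt2")` assignment, which this report does not quote, and it assumes the rest of process() appends the end-of-text token and returns the ids with their length. `get_enc` is a hypothetical helper added only for illustration; it is not part of prepare.py. The point is just that the encoder is created inside the worker instead of being read from a global the worker may not have.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_enc():
    # Hypothetical helper, not in prepare.py. The import is done lazily so
    # that each worker process builds (and then caches) its own encoder.
    import tiktoken
    return tiktoken.get_encoding("gpt2")  # assumed content of "line 39"


def process(example):
    # Obtain the encoder inside the function rather than relying on a
    # global named `enc` being visible in the worker process.
    enc = get_enc()
    ids = enc.encode_ordinary(example['text'])  # encode_ordinary ignores any special tokens
    ids.append(enc.eot_token)  # assumed: append the end-of-text token
    return {'ids': ids, 'len': len(ids)}
```

Copying the assignment directly into the first line of process(), as described above, should behave the same way; the cached helper only avoids repeating the lookup on every example.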

FYI, here's the full callstack:

(tpx) d:\github\nanoGPT>python data/openwebtext/prepare.py
tokenizing the splits (num_proc=8):   0%|                                                                                  | 0/8009762 [00:07<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3450, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 43, in process
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
NameError: name 'enc' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "d:\github\nanoGPT\data\openwebtext\prepare.py", line 50, in <module>
    tokenized = split_dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 853, in map
    {
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\dataset_dict.py", line 854, in <dictcomp>
    k: dataset.map(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\datasets\utils\py_utils.py", line 1394, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "d:\Users\rfernand\AppData\Local\anaconda3\envs\tpx\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
NameError: name 'enc' is not defined
@jdietzChina

Did you ever sort this one out?

@calmitchell617

The PR from @vinjn above ^^ worked for me.

muxitox pushed a commit to muxitox/nanoGPT that referenced this issue Jan 9, 2025