Investigate error occurring for ray #354

Closed
kosmitive opened this issue May 16, 2023 · 2 comments
Labels
bug Something isn't working invalid Issue doesn't seem right or is no longer valid

Comments

@kosmitive
Contributor

kosmitive commented May 16, 2023

On a DGX the following error occurs:

  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/tqdm/std.py", line 1492, in display                                                                             
    self.sp(self.__str__() if msg is None else msg)                                                                                                                                           
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/tqdm/std.py", line 347, in print_status
    fp_write('\r' + s + (' ' * max(last_len[0] - len_s, 0)))
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/tqdm/std.py", line 340, in fp_write
    fp.write(str(s))
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/tqdm/utils.py", line 127, in inner
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device
 69%|██████▉   | 68.8/100 [03:59<01:48,  3.48s/%]
Exception ignored in atexit callback: <function shutdown at 0x7fc2278cc5e0>                                                                                                                   
Traceback (most recent call last):                                                                                                                                                            
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper                                                         
    return func(*args, **kwargs)                                                                                                                                                              
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/ray/_private/worker.py", line 1640, in shutdown                                                                 
    _global_node.destroy_external_storage()                                                                                                                                                   
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/ray/_private/node.py", line 1490, in destroy_external_storage                                                   
    storage = external_storage.setup_external_storage(                                                                                                                                        
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/ray/_private/external_storage.py", line 628, in setup_external_storage                                          
    _external_storage = FileSystemStorage(**config["params"])                                                                                                                                 
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/site-packages/ray/_private/external_storage.py", line 280, in __init__                                                        
    os.makedirs(full_dir_path, exist_ok=True)                                                                                                                                                 
  File "/home/markus/miniconda3/envs/csshapley22/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)                          
OSError: [Errno 28] No space left on device: '/tmp/ray/session_2023-05-15_09-51-45_689904_3852705/ray_spilled_objects'

We need to investigate this further: it affects users, and ray is currently our only way of parallelising computations.
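As a starting point, here is a minimal sketch of two things that could be checked. First, how much space is left on the filesystem that holds `/tmp`. Second, whether pointing ray's object spilling at a larger volume avoids the error. The traceback shows ray building a `FileSystemStorage` from a config's `params`, so the sketch assumes ray's documented `object_spilling_config` system option. The helper names and the `/mnt/nfs/ray_spill` path are hypothetical, and the `ray.init` call is left commented out because it should be checked against the installed ray version.

```python
import json
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def spilling_config(directory: str) -> dict:
    """Build a `_system_config` dict that points object spilling at `directory`.

    Assumption: ray accepts a JSON-encoded `object_spilling_config` of type
    "filesystem" with a `directory_path` parameter.
    """
    return {
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": directory}}
        )
    }


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    print(f"free space on {tmp}: {free_gib(tmp):.1f} GiB")

    spill_dir = "/mnt/nfs/ray_spill"  # hypothetical path on a larger volume
    print(spilling_config(spill_dir))
    # Assumed usage, to be verified against the installed ray version:
    # import ray
    # ray.init(_system_config=spilling_config(spill_dir))
```

If `/tmp` is already close to full before the run starts, the problem probably lies with the machine rather than with our code.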

@kosmitive kosmitive added the bug Something isn't working label May 16, 2023
@mdbenito
Collaborator

This might be related to #292, but that is unclear. The DGX really had run out of space because the NFS mounts were not being used, but why was the storage used up? Was it related to pydvl / ray?
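One way to start answering that question would be to measure how much space each ray session directory occupies. The sketch below uses only the standard library. It assumes the default `/tmp/ray` location seen in the traceback, and the function names are hypothetical. Symlinked entries are skipped so that no session is counted twice.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all files below `root`, ignoring unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is not accessible; skip it.
                pass
    return total


def session_sizes(ray_tmp: str = "/tmp/ray") -> list[tuple[str, int]]:
    """Return (directory name, size in bytes) pairs, largest first."""
    base = Path(ray_tmp)
    if not base.is_dir():
        return []
    sizes = [
        (entry.name, dir_size_bytes(entry))
        for entry in base.iterdir()
        if entry.is_dir() and not entry.is_symlink()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, size in session_sizes():
        print(f"{size / 2**30:8.2f} GiB  {name}")
```

If stale sessions or their `ray_spilled_objects` subdirectories dominate the output, that would point at ray, or at how we use it, rather than at unrelated data on the machine.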

@mdbenito mdbenito changed the title from "Investigate error occurring for ray." to "Investigate error occurring for ray" May 19, 2023
@mdbenito mdbenito added the invalid Issue doesn't seem right or is no longer valid label May 19, 2023
@mdbenito
Collaborator

This is impossible to debug without more information. If it's a problem with ray spawning too many processes, please open a new issue.
