
Run with DeepSpeed on Windows (partially implemented) #1225

Closed
janvarev opened this issue Apr 15, 2023 · 8 comments
Labels
enhancement (New feature or request), stale

Comments

@janvarev

Description

It would be great to have DeepSpeed running on Windows.

Additional Context

I've tried to solve some of the problems, so I'll provide the details:

  1. Installing DeepSpeed on Windows. It's not easy (I couldn't compile it manually), but I finally found ready-made wheels for Windows here: [REQUEST] Hey, Microsoft...Could you PLEASE Support Your Own OS? microsoft/DeepSpeed#2427 (comment) . So, I've installed DeepSpeed successfully.

  2. You can't run "deepspeed" on Windows (like the docs say to) because there is no EXE file on Windows. I've prepared the file https://gist.github.com/janvarev/8b6563b5da269533f9ec4e92e0327451 in the main folder (see the sketch below) and can start DeepSpeed this way: call python deepspeedrun.py --num_gpus=1 server.py --auto-devices

  3. I finally get the error RuntimeError: Distributed package doesn't have NCCL built in and can't solve it. (Setting the backend to gloo via os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" has no effect for me; that variable appears to be read only by PyTorch Lightning, not DeepSpeed.)

If someone can solve the last problem, we can run DeepSpeed on Windows, and that would be very cool!
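For reference, a launcher like the one in item 2 can be tiny. The following is only a minimal sketch, not necessarily the gist's actual contents; it assumes that DeepSpeed exposes its console entry point as deepspeed.launcher.runner.main, which is what the missing "deepspeed" executable wraps on Linux.

# deepspeedrun.py -- a Windows-friendly stand-in for the missing "deepspeed"
# executable: it calls the same launcher entry point the real script wraps.
from deepspeed.launcher.runner import main

if __name__ == "__main__":
    # sys.argv is forwarded untouched, e.g.:
    #   python deepspeedrun.py --num_gpus=1 server.py --auto-devices
    main()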

@janvarev added the enhancement (New feature or request) label Apr 15, 2023
@janvarev changed the title from "Run with DeepSpeed on Windows" to "Run with DeepSpeed on Windows (partially implemented)" Apr 15, 2023
@jllllll
Contributor

jllllll commented Apr 15, 2023

Got DeepSpeed to run. However, it looks like there isn't much gloo support, if any; either that, or gloo needs to be wired into the webui more directly.

G:\F\Projects\AI\text-generation-webui\text-generation-webui>python %INSTALL_ENV_DIR%\scripts\ds --num_gpus=1 server.py --deepspeed --chat --model alpaca-native
[2023-04-15 14:16:41,031] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 14:16:41,042] [INFO] [runner.py:527:main] cmd = G:\F\Projects\AI\text-generation-webui\installer_files\env\python.exe -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None server.py --deepspeed --chat --model alpaca-native
[2023-04-15 14:16:43,280] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-15 14:16:43,280] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-15 14:16:43,281] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-15 14:16:43,281] [INFO] [launch.py:151:main] dist_world_size=1
[2023-04-15 14:16:43,281] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-15 14:16:48,685] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
bin G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
Loading alpaca-native...
[2023-04-15 14:16:49,680] [INFO] [partition_parameters.py:437:__exit__] finished initializing model with 0.13B parameters
Traceback (most recent call last):
  File "G:\F\Projects\AI\text-generation-webui\text-generation-webui\server.py", line 471, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "G:\F\Projects\AI\text-generation-webui\text-generation-webui\modules\models.py", line 85, in load_model
    model = AutoModelForCausalLM.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 2629, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 383, in wrapper
    f(module, *args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
    self.model = LlamaModel(config)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 383, in wrapper
    f(module, *args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 444, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 390, in wrapper
    self._post_init_method(module)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 782, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\torch.py", line 81, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1559, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

In \lib\site-packages\deepspeed\comm\comm.py, on line 526, change dist_backend=None to dist_backend='gloo'.
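An alternative to patching site-packages is to pass the backend explicitly from the webui's own code. This is a sketch with a hypothetical call site (wherever the webui initializes DeepSpeed); deepspeed.init_distributed does accept a dist_backend argument in the versions discussed here.

# In the webui's DeepSpeed setup code, force gloo instead of the NCCL default,
# since stock Windows PyTorch wheels ship without NCCL.
import deepspeed

deepspeed.init_distributed(dist_backend="gloo")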

@jllllll
Contributor

jllllll commented Apr 15, 2023

DeepSpeed can be run as-is with this command from the text-generation-webui folder:
python ..\installer_files\env\scripts\ds --num_gpus=1 server.py --deepspeed --chat --model modelname

@janvarev
Author

janvarev commented Apr 15, 2023

@jllllll Yes, I get the same problem after your change to the gloo backend:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

I've also tried the 'mpi' backend, and it doesn't work either (a custom MPI-enabled PyTorch build is required).

Also, some links:
https://pytorch.org/docs/stable/distributed.html#backends
https://issuehint.com/issue/pytorch/pytorch/89688
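For anyone debugging this, a quick way to confirm which backends a given PyTorch build actually supports is the stock torch.distributed availability checks; a minimal sketch:

# Print the distributed backends compiled into this PyTorch build.
import torch.distributed as dist

print("distributed:", dist.is_available())
print("nccl:", dist.is_nccl_available())  # False in Windows wheels, hence the NCCL error
print("gloo:", dist.is_gloo_available())  # the only realistic option on Windows
print("mpi:", dist.is_mpi_available())    # requires a PyTorch build compiled with MPI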

@ThatOneShortGuy

ThatOneShortGuy commented Apr 16, 2023

I was somehow able to build the wheel (.whl) file for DeepSpeed 0.8.3 on Python 3.9 on Windows 10. I don't know much about how pip installs it, or whether it would be even slightly cross-compatible. If you would like, I can share the .whl file. Let me know if you want it.

@bubbabug

I have followed all the steps in this thread and run up against the same leaf-Variable error.

@AngelTs

AngelTs commented May 11, 2023

I also get the NCCL error:
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

@AngelTs

AngelTs commented May 12, 2023

Finally, after changing dist_backend=None to dist_backend='gloo' on line 562 in "C:\oobabooga_windows\installer_files\env\Lib\site-packages\deepspeed\comm\comm.py",
I end up with the same error:
"RuntimeError: a leaf Variable that requires grad is being used in an in-place operation."
Game over ...
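For context, that final error is not DeepSpeed-specific: PyTorch forbids in-place operations on leaf tensors that require grad, and the gloo broadcast in the traceback above writes into the parameter in place. A minimal sketch reproducing the same failure mode without DeepSpeed:

import torch

# A freshly created Parameter is a leaf tensor with requires_grad=True.
p = torch.nn.Parameter(torch.zeros(4))

# Any in-place op on it raises the error seen above:
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
p.add_(1.0)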

@github-actions bot added the stale label Sep 1, 2023
@github-actions

github-actions bot commented Sep 1, 2023

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
