
Run with DeepSpeed on Windows (partially implemented) #1225

Closed
janvarev opened this issue Apr 15, 2023 · 8 comments
Labels
enhancement (New feature or request), stale

Comments

@janvarev

Description

It would be great to have DeepSpeed running on Windows.

Additional Context

I've tried to solve some of the problems, so I'll provide the details:

  1. Installing DeepSpeed on Windows. It's not easy (I couldn't compile it manually), but I finally found ready-made wheels for Windows here: [REQUEST] Hey, Microsoft...Could you PLEASE Support Your Own OS? microsoft/DeepSpeed#2427 (comment) . So, I've installed DeepSpeed successfully.

  2. You can't run "deepspeed" on Windows (like the docs say to) because there is no EXE file on Windows. I've prepared the file https://gist.github.com/janvarev/8b6563b5da269533f9ec4e92e0327451 in the main folder (see the sketch below) and can start DeepSpeed this way: call python deepspeedrun.py --num_gpus=1 server.py --auto-devices

  3. I finally get the error RuntimeError: Distributed package doesn't have NCCL built in and can't solve it. (Setting the backend to gloo via os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" has no effect for me; that variable appears to be read only by PyTorch Lightning, not DeepSpeed.)

If someone can solve the last problem, we can run DeepSpeed on Windows, and that would be very cool!
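For reference, a launcher like the one in item 2 can be tiny. The following is only a minimal sketch, not necessarily the gist's actual contents; it assumes that DeepSpeed exposes its console entry point as deepspeed.launcher.runner.main, which is what the missing "deepspeed" executable wraps on Linux.

# deepspeedrun.py -- a Windows-friendly stand-in for the missing "deepspeed"
# executable: it calls the same launcher entry point the real script wraps.
from deepspeed.launcher.runner import main

if __name__ == "__main__":
    # sys.argv is forwarded untouched, e.g.:
    #   python deepspeedrun.py --num_gpus=1 server.py --auto-devices
    main()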

@janvarev added the enhancement (New feature or request) label Apr 15, 2023
@janvarev changed the title from "Run with DeepSpeed on Windows" to "Run with DeepSpeed on Windows (partially implemented)" Apr 15, 2023
@jllllll
Contributor

jllllll commented Apr 15, 2023

Got DeepSpeed to run. However, it looks like there isn't much gloo support, if any; either that, or gloo needs to be wired into the webui more directly.

G:\F\Projects\AI\text-generation-webui\text-generation-webui>python %INSTALL_ENV_DIR%\scripts\ds --num_gpus=1 server.py --deepspeed --chat --model alpaca-native
[2023-04-15 14:16:41,031] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 14:16:41,042] [INFO] [runner.py:527:main] cmd = G:\F\Projects\AI\text-generation-webui\installer_files\env\python.exe -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None server.py --deepspeed --chat --model alpaca-native
[2023-04-15 14:16:43,280] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-15 14:16:43,280] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-15 14:16:43,281] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-15 14:16:43,281] [INFO] [launch.py:151:main] dist_world_size=1
[2023-04-15 14:16:43,281] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-15 14:16:48,685] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
bin G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll
Loading alpaca-native...
[2023-04-15 14:16:49,680] [INFO] [partition_parameters.py:437:__exit__] finished initializing model with 0.13B parameters
Traceback (most recent call last):
  File "G:\F\Projects\AI\text-generation-webui\text-generation-webui\server.py", line 471, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "G:\F\Projects\AI\text-generation-webui\text-generation-webui\modules\models.py", line 85, in load_model
    model = AutoModelForCausalLM.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 2629, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 383, in wrapper
    f(module, *args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
    self.model = LlamaModel(config)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 383, in wrapper
    f(module, *args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 444, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 390, in wrapper
    self._post_init_method(module)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\runtime\zero\partition_parameters.py", line 782, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\deepspeed\comm\torch.py", line 81, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "G:\F\Projects\AI\text-generation-webui\installer_files\env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1559, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

In \lib\site-packages\deepspeed\comm\comm.py, on line 526, change dist_backend=None to dist_backend='gloo'.
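An alternative to patching site-packages is to pass the backend explicitly from the webui's own code. This is a sketch with a hypothetical call site (wherever the webui initializes DeepSpeed); deepspeed.init_distributed does accept a dist_backend argument in the versions discussed here.

# In the webui's DeepSpeed setup code, force gloo instead of the NCCL default,
# since stock Windows PyTorch wheels ship without NCCL.
import deepspeed

deepspeed.init_distributed(dist_backend="gloo")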

@jllllll
Contributor

jllllll commented Apr 15, 2023

DeepSpeed can be run as-is with this command from the text-generation-webui folder:
python ..\installer_files\env\scripts\ds --num_gpus=1 server.py --deepspeed --chat --model modelname

@janvarev
Author

janvarev commented Apr 15, 2023

@jllllll Yes, I get the same problem after your change to the gloo backend:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

I've also tried the 'mpi' backend, and it doesn't work either (a custom MPI-enabled PyTorch build is required).

Also, some links:
https://pytorch.org/docs/stable/distributed.html#backends
https://issuehint.com/issue/pytorch/pytorch/89688
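For anyone debugging this, a quick way to confirm which backends a given PyTorch build actually supports is the stock torch.distributed availability checks; a minimal sketch:

# Print the distributed backends compiled into this PyTorch build.
import torch.distributed as dist

print("distributed:", dist.is_available())
print("nccl:", dist.is_nccl_available())  # False in Windows wheels, hence the NCCL error
print("gloo:", dist.is_gloo_available())  # the only realistic option on Windows
print("mpi:", dist.is_mpi_available())    # requires a PyTorch build compiled with MPI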

@ThatOneShortGuy

ThatOneShortGuy commented Apr 16, 2023

I was somehow able to build the wheel (.whl) file for DeepSpeed 0.8.3 on Python 3.9 on Windows 10. I don't know much about how pip installs it, or whether it would be even slightly cross-compatible. If you would like, I can share the .whl file. Let me know if you want it.

@bubbabug

I have followed all the steps in this thread and run up against the same leaf-Variable error.

@AngelTs

AngelTs commented May 11, 2023

I also get the NCCL error:
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

@AngelTs

AngelTs commented May 12, 2023

Finally, after changing dist_backend=None to dist_backend='gloo' on line 562 in "C:\oobabooga_windows\installer_files\env\Lib\site-packages\deepspeed\comm\comm.py",
I end up with the same error:
"RuntimeError: a leaf Variable that requires grad is being used in an in-place operation."
Game over ...
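For context, that final error is not DeepSpeed-specific: PyTorch forbids in-place operations on leaf tensors that require grad, and the gloo broadcast in the traceback above writes into the parameter in place. A minimal sketch reproducing the same failure mode without DeepSpeed:

import torch

# A freshly created Parameter is a leaf tensor with requires_grad=True.
p = torch.nn.Parameter(torch.zeros(4))

# Any in-place op on it raises the error seen above:
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
p.add_(1.0)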

@github-actions bot added the stale label Sep 1, 2023
@github-actions

github-actions bot commented Sep 1, 2023

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
