Might be a solution to get Flash Attention 2 built/compiled on Windows #595
I did try replacing your .h files in my venv, with
And the build failed fairly quickly. I have uninstalled ninja, but it seems to be importing it anyway? How did you manage to not use ninja? Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it. EDIT: Tried with CUDA 12.2, no luck either. EDIT2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D with 64GB RAM. It seems that for some reason Windows tries to use/import those variables even when they aren't declared, but at the same time, if they're only used some lines below, it doesn't work. EDIT3: I can confirm it works for exllamav2 + FA v2. Without FA
With FA
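For reference, a minimal sketch of the kind of pattern involved, assuming the trouble is a constexpr local being forwarded as a template argument from inside a lambda-style dispatch (the names below are illustrative, not the actual flash-attention source):

```cpp
// Minimal sketch (not the actual flash-attention source) of the pattern that
// some MSVC versions reject with C2975: a constexpr local forwarded as a
// template argument from inside a lambda. All names here are illustrative.
#include <cstdio>

template <int kHeadDim>
void launch_kernel_stub() { std::printf("Headdim = %d\n", kHeadDim); }

template <typename F>
void dispatch(F f) { f(); }

void run_mha_bwd_hdim128_sketch() {
    constexpr int Headdim = 128;  // keeping this declaration in place (rather
                                  // than commented out) is the workaround
                                  // described in EDIT2 above
    dispatch([&] {
        // gcc/clang accept this; MSVC has been known to refuse to treat
        // Headdim as a constant expression here and fail with C2975.
        launch_kernel_stub<Headdim>();
    });
}

int main() { run_mha_bwd_hdim128_sketch(); }
```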
|
This is very helpful, thanks @Akatsuki030 and @Panchovix. |
@tridao just tested the compilation with your latest push, and now it works. I did use
|
Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later. |
Great! I built a whl with |
@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch build to compile flash attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well. Keep in mind that ooba webui ships by default with torch+cu118, so on Windows with that CUDA version it won't compile. |
I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows. |
Just built on cuda 12.1 and tested with exllama_v2 on oobabooga's webui. I can confirm what @Panchovix said above: cuda 12.x is required for Cutlass (12.1 if you want pytorch v2.1). https://github.com/bdashore3/flash-attention/releases/tag/2.3.2 |
Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version. |
Right now github actions only build for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc) but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible). If you (or anyone) have experience with setting up github actions for Windows I'd love to get help there. |
You are truly a miracle worker! |
Works like a charm. I used:
I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (I had previously set it to 6, but I got an out-of-memory error); remember to restart your computer after you set it. It took roughly 3 hours to compile everything with 16GB of RAM. If you get a "ninja: build stopped: subcommand failed" error, do this: |
Hey, I finally got the wheels built (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install Flash-attention... Anyone got a solution? |
@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires Cuda 12.1 while ooba is still stuck on 11.8. |
I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. For flash attention I used py3.10, vs2019, cuda12.1 |
You don't have to use layer_norm. |
However, I made it work. The trouble is in ln_bwd_kernels.cuh, line 54. For some unknown reason, BOOL_SWITCH did not work when turning bool has_colscale into constexpr bool HasColscaleConst, which caused error C2975. I just made it as
That's a stupid way, but it works, and it's compiling now. |
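Roughly, the situation and the hardcoding workaround look like the sketch below (stub names are made up, and the switch shown is only an approximation of the BOOL_SWITCH macro in the real code):

```cpp
#include <cstdio>

template <bool HasColscale>
void ln_bwd_kernel_stub() { std::printf("HasColscale = %d\n", HasColscale); }

// Approximation of a BOOL_SWITCH-style dispatch: each branch declares a
// constexpr bool and passes it on as a template argument.
void dispatch_with_switch(bool has_colscale) {
    [&] {
        if (has_colscale) {
            constexpr bool HasColscaleConst = true;
            ln_bwd_kernel_stub<HasColscaleConst>();  // some MSVC versions emit C2975 here
        } else {
            constexpr bool HasColscaleConst = false;
            ln_bwd_kernel_stub<HasColscaleConst>();
        }
    }();
}

// One reading of the "just hardcode it" workaround: skip the switch and pin
// the value by hand (losing the has_colscale == false path).
void dispatch_hardcoded(bool /*has_colscale*/) {
    constexpr bool HasColscaleConst = true;
    ln_bwd_kernel_stub<HasColscaleConst>();
}

int main() {
    dispatch_with_switch(true);
    dispatch_hardcoded(true);
}
```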
Does this mean I can use FA2 on Windows if I build it from source? |
Any compiled wheel for Windows 11? I keep getting: "note: This error originates from a subprocess, and is likely not a problem with pip." |
I am trying to install Flash Attention 2 on Windows 11 with Python 3.12.3, and here is my setup: I have set up MSVC Build Tools 2022 alongside MS VS Community 2022. Once I cloned the Flash Attention git repo, I ran
I'm pretty new to this, so I was hoping someone could point me in the right direction. I couldn't find any way to fix my issue elsewhere online. Any help would be appreciated. Thanks! |
Seems like you are missing the CUDA Toolkit; download it from Nvidia's website. I recently recompiled mine with the following: If you want to use my batch file, it's hosted here: |
Oh sorry, I forgot to mention, I do have Cuda toolkit installed. Below is my nvcc -V
And below is my nvidia-smi
|
Have you tried installing Visual Studio 2022? |
Yes, I had installed Visual Studio 2022 along with the Build Tools 2022. But the issue seemed to be stemming from Visual Studio itself, since I managed to build Flash Attention 2 after modifying the Visual Studio Community 2022 installation and adding the Windows 11 SDK (available under Desktop Development with C++ >> Optional). Thanks! |
Just sharing: I was able to build this repo on Windows, without needing the changes above, with these settings:
|
Seems like CUDA 12.4 and 12.5 are not yet supported? |
I was able to compile and build from the source repository on Windows 11 with: CUDA 12.5. I have a Visual Studio 2019 installation that came with Windows, and I've never used it.
|
Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via
Python 3.11.5 & PIP 24.1.1
PIP dependencies:
System Specs: Intel Core i9 13900KF |
Windows took roughly 1 hour; Ubuntu (Linux) takes from a few seconds to a few minutes.... |
Thanks for the information. I compiled it as you said and it was successful. I set MAX_JOBS=8; the other parameters are the same as yours. Compilation information: |
I've been installing flash attention on multiple systems and made some batch files to clone and compile for convenience. |
I have tried all kinds of things, but I still cannot get Flash Attention to compile on my Windows laptop. These are my settings; I do not know if I have to upgrade CUDA to 12.x. Any advice? Python 3.10.8, ninja 1.11.1 |
I ran set MAX_JOBS=4 and restarted my computer.
It worked, but it took hours to install on Windows. |
It does not work in my case :(
Settings (Package / Version): python 3.10.8
VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\
Commands:
Errors:
Building wheels for collected packages: flash-attn
× python setup.py bdist_wheel did not run successfully.
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
FAILED: C:/Users/15023/AppData/Local/Temp/pip-install-dfkun1cn/flash-attn_b24e1ea8cfd04a7980b436f7faaf577f/build/temp.win-amd64-cpython-310/Release/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.obj
RuntimeError: Error compiling objects for extension
note: This error originates from a subprocess, and is likely not a problem with pip. |
The key line is: This occurs because the setup.py script for flash-attention is trying to run a Git command to update submodules. Clone the flash-attn git repo and run the |
I did that before, with no good results. I am not sure if I need to upgrade CUDA from 11.8 to 12.4.
PS C:\Users\15023\Documents\Models\Tiny> cd flash-attention
PS C:\Users\15023\Documents\Models\Tiny\flash-attention> pip install . --no-build-isolation
PS C:\Users\15023\Documents\Models\Tiny\flash-attention> pip install . --no-build-isolation
× python setup.py bdist_wheel did not run successfully.
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
File "C:\Users\15023\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
RuntimeError: Error compiling objects for extension
note: This error originates from a subprocess, and is likely not a problem with pip. |
@Julianvaldesv mate you need to start reading those error messages! The git issue has been resolved and the error has changed so there's progress. It's screaming at you to upgrade PIP:
It's even giving you the command to use there and if that doesn't work, simply Google how to upgrade PIP! It's also telling you your version of MSVS is unsupported: Upgrade pip, then refer to the instructions in my repo to install VisualStudio Build Tools and try again: https://github.com/abgulati/LARS?tab=readme-ov-file#1-build-tools |
@abgulati my friend, thanks for your help. Something else is going on. I upgraded PIP days ago.
PS C:\Users\15023\Documents\Models\Tiny\flash-attention> python -m pip install --upgrade pip
Requirement already satisfied: pip in c:\users\15023\documents\models\tiny.venv\lib\site-packages (24.1.2)
Also, I have installed the VisualStudio Build Tools 2022. |
@Julianvaldesv In that case, try pasting this error in GPT-4/o or any other good LLM you have access to, describe the problem and background and see what it says |
@Julianvaldesv You are upgrading pip in that tiny.venv. Seems like your system is a mess. Much easier and faster to nuke your system from orbit and start from scratch. Sometimes that's the only way. |
What Torch version did you install that is compatible with CUDA 12.5? According to the PyTorch site, only 12.1 is fully supported (or 12.4 from source). |
Looks like oobabooga has Windows wheels for cu122, but sadly, no CU118 wheels. https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11" https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10") |
If pip isn't working for you, you may need more RAM. I was not able to compile at all with 16GB of RAM; pip worked fine after upgrading to 64GB. It took a few hours. |
Windows 10 Pro x64 |
As a Windows user, I tried to compile this and found the problem was in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". Whenever a template tried to reference the variable "Headdim", it caused error C2975. I think this might be the reason we always get compile errors on Windows. Below is how I solved the problem.

First, in "flash_bwd_launch_template.h" you can find many functions like "run_mha_bwd_hdimXX", each with a constant declaration "Headdim == XX" and templates like this: run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did is replace every "Headdim" in these templates with the literal value. For example, if the function is called run_mha_bwd_hdim128 and has the constant declaration "Headdim == 128", you change "Headdim" to 128 in the templates, which gives run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing for the "run_mha_fwd_hdimXX" functions and their templates.

Second, another error comes from "flash_fwd_launch_template.h", line 107: again it is a problem of referencing the constant "kBlockM" in the if-else statement below that line, and I rewrote it to

Third, for the function "run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to change "kBlockM" in the template to 64. Then you can try to compile it.

These solutions look stupid but really solved my problem: I successfully compiled flash_attn_2 on Windows, and I still need to take some time to test it on other computers. I put the files I rewrote here: link. I think there might be a better solution, but for me it at least works. Oh, I didn't use Ninja and compiled from source code; maybe someone can try to compile it with Ninja?
EDIT: I used
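A tiny before/after sketch of the Headdim substitution described above, using made-up stand-ins (Kernel_traits_stub, run_flash_bwd_stub) rather than the real flash-attention templates:

```cpp
// Before/after sketch of the "replace Headdim with the literal" workaround.
// Kernel_traits_stub and run_flash_bwd_stub are made-up stand-ins; the real
// templates take many more parameters.
#include <cstdio>

template <int kHeadDim, int kBlockM, int kBlockN>
struct Kernel_traits_stub {
    static constexpr int kHeadDimValue = kHeadDim;
};

template <typename Traits>
void run_flash_bwd_stub() { std::printf("Headdim = %d\n", Traits::kHeadDimValue); }

void run_mha_bwd_hdim128_before() {
    constexpr int Headdim = 128;
    // Original style: the named constant is forwarded into the template.
    // In the real headers this sits inside macro-generated lambdas, which is
    // reportedly where MSVC trips with C2975.
    run_flash_bwd_stub<Kernel_traits_stub<Headdim, 64, 128>>();
}

void run_mha_bwd_hdim128_after() {
    // Workaround style: spell the literal out at the point of use.
    run_flash_bwd_stub<Kernel_traits_stub<128, 64, 128>>();
}

int main() {
    run_mha_bwd_hdim128_before();
    run_mha_bwd_hdim128_after();
}
```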