
Might be a solution to get Flash Attention 2 built/compiled on Windows #595

Open · Akatsuki030 opened this issue Oct 8, 2023 · 53 comments

Akatsuki030 commented Oct 8, 2023

As a Windows user, I tried to compile this and found that the problem was in the two files "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h" under "./flash-attention/csrc/flash_attn/src". Whenever a template tried to reference the variable "Headdim", it caused error C2975 (invalid template argument: expected a compile-time constant expression). I think this is why we always get compile errors on Windows. Below is how I solved the problem:

First, in the file "flash_bwd_launch_template.h" you can find many functions named "run_mha_bwd_hdimXX", each with a constant declaration "Headdim = XX" and template instantiations like run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did was replace every "Headdim" in those template argument lists with its literal value. For example, in the function run_mha_bwd_hdim128, which declares "Headdim = 128", you have to change Headdim to 128 in the templates, giving run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing to the "run_mha_fwd_hdimXX" functions and their templates.

Second, another error comes from "flash_fwd_launch_template.h", line 107: the same problem of referencing the constant "kBlockM" in the if-else chain below it. I rewrote it as:

		if constexpr(Kernel_traits::kHeadDim % 128 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 4 - 1) / 4);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else if constexpr(Kernel_traits::kHeadDim % 64 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 8 - 1) / 8);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else{
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 16 - 1) / 16);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}
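
Each branch hardcodes the kBlockM value (4, 8, or 16) both in the grid computation and in the kernel's template arguments, and the second template argument of flash_fwd_splitkv_combine_kernel steps through log2 of the num_splits bound handled by that branch (2→1, 4→2, ..., 128→7). For reference, the single declaration this chain replaces looked roughly like this (reconstructed from memory, not verbatim source):

    // Original: kBlockM is used both in the grid computation and as a
    // template argument, and MSVC rejects the latter with C2975.
    constexpr int kBlockM = Kernel_traits::kHeadDim % 128 == 0
        ? 4
        : (Kernel_traits::kHeadDim % 64 == 0 ? 8 : 16);
    dim3 grid_combine((params.b * params.h * params.seqlen_q + kBlockM - 1) / kBlockM);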

Third, for the function"run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to change "kBlockM" in the template as 64. And then you can try to compile it.
These solutions look stupid but really solved my problem: I successfully compiled flash_attn_2 on Windows, though I still need some time to test it on other computers.
I put the files I rewrote here: link.
I think there might be a better solution, but for me this at least works.
Oh, I didn't use Ninja and compiled it from source code; perhaps someone can try compiling it with Ninja?
EDIT: I used

  • python 3.11
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Anaconda
  • Windows 11 22H2

Panchovix commented Oct 8, 2023

I did try replacing your .h files in my venv, with

  • Python 3.10
  • Pytorch 2.2 Nightly
  • CUDA 12.1
  • Visual Studio 2022
  • Ninja

And the build failed fairly quickly. I have uninstalled Ninja, but it seems to be importing it anyway? How did you manage not to use Ninja?

Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it.

EDIT: Tried with CUDA 12.2, no luck either.

EDIT2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D and 64GB RAM.

It seems that for some reason Windows tries to use/import those variables even when not declared, but at the same time, if they are used some lines below, it doesn't work.

[screenshot]

EDIT3: I can confirm it works for exllamav2 + FA v2

Without FA

-- Measuring token speed...
 ** Position     1 + 127 tokens:   13.5848 t/s
 ** Position   128 + 128 tokens:   13.8594 t/s
 ** Position   256 + 128 tokens:   14.1394 t/s
 ** Position   384 + 128 tokens:   13.8138 t/s
 ** Position   512 + 128 tokens:   13.4949 t/s
 ** Position   640 + 128 tokens:   13.6474 t/s
 ** Position   768 + 128 tokens:   13.7073 t/s
 ** Position   896 + 128 tokens:   12.3254 t/s
 ** Position  1024 + 128 tokens:   13.8960 t/s
 ** Position  1152 + 128 tokens:   13.7677 t/s
 ** Position  1280 + 128 tokens:   12.9869 t/s
 ** Position  1408 + 128 tokens:   12.1336 t/s
 ** Position  1536 + 128 tokens:   13.0463 t/s
 ** Position  1664 + 128 tokens:   13.2463 t/s
 ** Position  1792 + 128 tokens:   12.6211 t/s
 ** Position  1920 + 128 tokens:   13.1429 t/s
 ** Position  2048 + 128 tokens:   12.5674 t/s
 ** Position  2176 + 128 tokens:   12.5847 t/s
 ** Position  2304 + 128 tokens:   13.3471 t/s
 ** Position  2432 + 128 tokens:   12.9135 t/s
 ** Position  2560 + 128 tokens:   12.2195 t/s
 ** Position  2688 + 128 tokens:   11.6120 t/s
 ** Position  2816 + 128 tokens:   11.2545 t/s
 ** Position  2944 + 128 tokens:   11.5304 t/s
 ** Position  3072 + 128 tokens:   11.7982 t/s
 ** Position  3200 + 128 tokens:   11.8041 t/s
 ** Position  3328 + 128 tokens:   12.8038 t/s
 ** Position  3456 + 128 tokens:   12.7324 t/s
 ** Position  3584 + 128 tokens:   11.7733 t/s
 ** Position  3712 + 128 tokens:   10.7961 t/s
 ** Position  3840 + 128 tokens:   11.1014 t/s
 ** Position  3968 + 128 tokens:   10.8474 t/s

With FA

-- Measuring token speed...
** Position     1 + 127 tokens:   22.6606 t/s
** Position   128 + 128 tokens:   22.5140 t/s
** Position   256 + 128 tokens:   22.6111 t/s
** Position   384 + 128 tokens:   22.6027 t/s
** Position   512 + 128 tokens:   22.3392 t/s
** Position   640 + 128 tokens:   22.0570 t/s
** Position   768 + 128 tokens:   22.3728 t/s
** Position   896 + 128 tokens:   22.4983 t/s
** Position  1024 + 128 tokens:   21.9384 t/s
** Position  1152 + 128 tokens:   22.3509 t/s
** Position  1280 + 128 tokens:   22.3189 t/s
** Position  1408 + 128 tokens:   22.2739 t/s
** Position  1536 + 128 tokens:   22.4145 t/s
** Position  1664 + 128 tokens:   21.9608 t/s
** Position  1792 + 128 tokens:   21.7645 t/s
** Position  1920 + 128 tokens:   22.1468 t/s
** Position  2048 + 128 tokens:   22.3400 t/s
** Position  2176 + 128 tokens:   21.9830 t/s
** Position  2304 + 128 tokens:   21.8387 t/s
** Position  2432 + 128 tokens:   20.2306 t/s
** Position  2560 + 128 tokens:   21.0056 t/s
** Position  2688 + 128 tokens:   22.2157 t/s
** Position  2816 + 128 tokens:   22.1912 t/s
** Position  2944 + 128 tokens:   22.1835 t/s
** Position  3072 + 128 tokens:   22.1393 t/s
** Position  3200 + 128 tokens:   22.1182 t/s
** Position  3328 + 128 tokens:   22.0821 t/s
** Position  3456 + 128 tokens:   22.0308 t/s
** Position  3584 + 128 tokens:   22.0060 t/s
** Position  3712 + 128 tokens:   21.9909 t/s
** Position  3840 + 128 tokens:   21.9816 t/s
** Position  3968 + 128 tokens:   21.9757 t/s


tridao commented Oct 8, 2023

This is very helpful, thanks @Akatsuki030 and @Panchovix.
@Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with constexpr static int instead of constexpr int? I've just pushed a commit that does that. Can you check if it compiles on Windows?
A while back someone (I think it was Daniel Haziza from the xformers team) told me that they need constexpr static int for Windows compilation.
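
(For anyone skimming: the change is one word per declaration. A sketch of the idea, assuming the MSVC quirk is that a plain constexpr local loses its constant-expression status inside the nested switch lambdas, which static storage sidesteps:)

    // Before: MSVC reports C2975 when Headdim is used as a template argument.
    constexpr int Headdim = 128;

    // After: a static constexpr needs no lambda capture, so it stays usable
    // as a compile-time constant inside the nested lambdas.
    constexpr static int Headdim = 128;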


Panchovix commented Oct 9, 2023

@tridao just tested the compilation with your latest push, and now it works.

I did use

  • Python 3.10
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Visual Studio 2022
  • Ninja


tridao commented Oct 9, 2023

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

@Panchovix

> Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

Great! I built a wheel with python setup.py bdist_wheel, but it seems some people have issues with it; it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.


Panchovix commented Oct 9, 2023

@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch version to build flash-attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well.

We have to be aware that ooba's webui comes with torch+cu118 by default, so on Windows with that CUDA version it won't compile.


tridao commented Oct 9, 2023

I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows.


bdashore3 commented Oct 9, 2023

Just built on CUDA 12.1 and tested with exllama_v2 on oobabooga's webui. I can confirm what @Panchovix said above: CUDA 12.x is required for Cutlass (12.1 if you want PyTorch v2.1).

https://github.com/bdashore3/flash-attention/releases/tag/2.3.2

@bdashore3

Another note: it may be a good idea to build wheels for cu121 as well, since GitHub Actions currently doesn't build for that version.


tridao commented Oct 9, 2023

> Another note: it may be a good idea to build wheels for cu121 as well, since GitHub Actions currently doesn't build for that version.

Right now GitHub Actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible).

If you (or anyone) have experience with setting up GitHub Actions for Windows, I'd love to get help there.


dunbin commented Oct 9, 2023

> Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

> Great! I built a wheel with python setup.py bdist_wheel, but it seems some people have issues with it; it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.

You truly are a godsend!


mattiamazzari commented Oct 11, 2023

Works like a charm. I used:

  • CUDA 12.2
  • PyTorch 2.2.0.dev20231011+cu121 (installed with the command pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121). Be sure to install this CUDA version and not the CPU version.

I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (I had previously set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took roughly 3 hours to compile everything with 16GB of RAM.
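
(For what it's worth, set MAX_JOBS=4 applies only to the current prompt, which is the form used later in this thread, while setx MAX_JOBS 4 persists it for new shells; as far as I know a full reboot is only needed for programs that don't re-read the environment.)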

If you get a "ninja: build stopped: subcommand failed" error, do this:
git clean -xdf
python setup.py clean
git submodule sync
git submodule deinit -f .
git submodule update --init --recursive
python setup.py install

@YuehChuan

GOOD🎶
RTX 4090 24GB, AMD 7950X, 64GB RAM
Python 3.8 and Python 3.10 both work

Python 3.10:
https://www.python.org/downloads/release/python-3100/
Win11

python -m venv venv

cd venv/Scripts
activate
-----------------------

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention

pip install packaging
pip install wheel

set MAX_JOBS=4
python setup.py install
flashattention2

@Nicoolodion2

Hey, got the wheels built finally (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install Flash Attention... Anyone got a solution?

@bdashore3

@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires CUDA 12.1 while ooba is still stuck on 11.8.


neocao123 commented Oct 18, 2023

I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. Flash attention and rotary have now been built from @bdashore3's branch, while layer_norm errors out.

I used Python 3.10, VS 2019, CUDA 12.1


tridao commented Oct 18, 2023

You don't have to use layer_norm.


neocao123 commented Oct 18, 2023

> You don't have to use layer_norm.

However, I made it work.

The trouble is in ln_bwd_kernels.cuh, line 54.

For some unknown reason, BOOL_SWITCH did not work when turning bool has_colscale into constexpr bool HasColscaleConst, which caused error C2975. I just rewrote it as:

if (HasColscaleConst) {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               true,     // HasColscale
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;
    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
} else {
    // Identical launch, but with the column-scale flag baked in as false.
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               false,    // HasColscale
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;
    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
}

That's a stupid way to do it, but it works, and it's compiling now.
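
For context, BOOL_SWITCH in this codebase is a macro that turns a runtime bool into a compile-time constant by dispatching to two template instantiations through an immediately-invoked lambda. A sketch of the pattern (assumed shape, not the exact source; post-fix it uses constexpr static, which is what MSVC needs here):

    #define BOOL_SWITCH(COND, CONST_NAME, ...)            \
        [&] {                                             \
            if (COND) {                                   \
                constexpr static bool CONST_NAME = true;  \
                return __VA_ARGS__();                     \
            } else {                                      \
                constexpr static bool CONST_NAME = false; \
                return __VA_ARGS__();                     \
            }                                             \
        }()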

@havietisov

Does it mean I can use FA2 on Windows if I build it from source?


dunbin commented Dec 14, 2023 via email


Piscabo commented Jan 10, 2024

Any compiled wheel for Windows 11, Python 3.11, CUDA 12.2, Torch 2.1.2?

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Could not build wheels for flash_attn, which is required to install pyproject.toml-based projects


dunbin commented Jan 10, 2024 via email


C0D3-BR3AK3R commented Jun 12, 2024

I am trying to install Flash Attention 2 on Windows 11, with Python 3.12.3, and here is my setup:
RTX 3050 Laptop
16 GB RAM
Core i7 12650H

I have set up MSVC Build Tools 2022 alongside MS VS Community 2022. Once I cloned the Flash Attention git repo, I ran python setup.py install and it gives the error below:

running build_ext
D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py:384: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'flash_attn_2_cuda' extension
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src      
Emitting ninja build file D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\build.ninja...
Compiling objects...
Using envvar MAX_JOBS (1) as the number of workers...
[1/49] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
FAILED: D:/Github/Deep-Learning-Basics/LLM Testing/MultiModalAI/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/flash_api.obj
cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++17'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "C:\Python312\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\setup.py", line 311, in <module>
    setup(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 184, in setup     
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 200, in run_commands
    dist.run_commands()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 87, in run        
    self.do_egg_install()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 139, in do_egg_install
    self.run_command('bdist_egg')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 167, in run     
    cmd = self.call_command('install_lib', warn_dir=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install_lib.py", line 11, in run    
    self.build()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\install_lib.py", line 110, in build
    self.run_command('build_ext')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 91, in run      
    _build_ext.run(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 359, in run
    self.build_extensions()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 870, in build_extensions
    build_ext.build_extensions(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 479, in build_extensions
    self._build_extensions_serial()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 505, in _build_extensions_serial
    self.build_extension(ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 252, in build_extension
    _build_ext.build_extension(self, ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 560, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 842, in win_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 1783, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2123, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

I'm pretty new to this, so I was hoping someone could point me in the right direction. I couldn't find any way to fix my issue elsewhere online. Any help would be appreciated. Thanks!


dunbin commented Jun 12, 2024 via email


dicksondickson commented Jun 12, 2024

Seems like you are missing the CUDA Toolkit.

Download it from Nvidia's website: cuda

I recently recompiled mine with the following:
Windows 11
Python 3.12.4
PyTorch Nightly 2.4.0.dev20240606+cu124
CUDA 12.5.0_555.85
Nvidia v555.99 drivers

If you want to use my batch file, it's hosted here: batch file

@C0D3-BR3AK3R

> Seems like you are missing the CUDA Toolkit.
>
> Download it from Nvidia's website: cuda
>
> I recently recompiled mine with the following: Windows 11, Python 3.12.4, PyTorch Nightly 2.4.0.dev20240606+cu124, CUDA 12.5.0_555.85, Nvidia v555.99 drivers
>
> If you want to use my batch file, it's hosted here: batch file

Oh sorry, I forgot to mention: I do have the CUDA toolkit installed. Below is my nvcc -V

 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

And below is my nvidia-smi

nvidia-smi
Wed Jun 12 13:05:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8              3W /   72W |      32MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     26140    C+G   ...8bbwe\SnippingTool\SnippingTool.exe      N/A      |
+-----------------------------------------------------------------------------------------+

@dicksondickson

"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?


C0D3-BR3AK3R commented Jun 12, 2024

"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?

Yes, I had installed Visual Studio 2022 along with the Build Tools 2022. But the issue seemed to be stemming from Visual Studio itself, since I managed to build Flash Attention 2 after modifying the Visual Studio Community 2022 installation and adding the Windows 11 SDK (available under Desktop Development with C++ >> Optional).

Thanks!


konan009 commented Jun 12, 2024

Just sharing: I was able to build this repo on Windows without needing the changes above, with these settings:

  1. Python 3.11
  2. VS 2022 C++ (v14.38-17.9)
  3. CUDA 12.2

@d-kleine

Seems like CUDA 12.4 and 12.5 are not yet supported?

@fangyizhu

I was able to compile and build from the source repository on Windows 11 with:

CUDA 12.5
Python 3.12

I have a Visual Studio 2019 that came with Windows and I've never used it.

pip install never worked for me.


abgulati commented Jun 26, 2024

Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):

pip install flash-attn --no-build-isolation

Python 3.11.5 & PIP 24.1.1
CUDA 12.4
PyTorch installed via:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

PIP dependencies:

pip install wheel==0.43.0
pip install ninja==1.11.1
pip install packaging==23.2

System Specs:

Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)

@d-kleine

> took about an hour

Windows: roughly an hour; Ubuntu (Linux): a few seconds to a few minutes....


NovaYear commented Jul 4, 2024

> Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):
>
> pip install flash-attn --no-build-isolation
>
> Python 3.11.5 & PIP 24.1.1 CUDA 12.4 PyTorch installed via:
>
> pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
>
> PIP dependencies:
>
> pip install wheel==0.43.0
> pip install ninja==1.11.1
> pip install packaging==23.2
>
> System Specs:
>
> Intel Core i9 13900KF Nvidia RTX 3090FE 32GB DDR5 5600MT/s (16x2)

Thanks for the information. I compiled it as you said and it was successful. I set MAX_JOBS=8; the other parameters are the same as yours. Compilation information:
winver: W11 24H2 26100.836
RAM: 32GB DDR4 4000MHz
CPU: 5700G
GPU: RTX 3090 24GB
running: 8 compile threads
CPU usage: ~70%
RAM usage: ~31GB
time: ~50 mins

@dicksondickson

I've been installing flash attention on multiple systems and made some batch files to clone and compile for convenience.
You can get them here: https://github.com/dicksondickson/ComfyUI-Clean-Install


Julianvaldesv commented Jul 9, 2024

I have tried all kinds of things but still cannot get Flash Attention to compile on my Windows laptop. These are my settings; I don't know if I have to upgrade CUDA to 12.x. Any advice?
C:\Users\15023>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Python 3.10.8
Intel(R) Core(TM) i9-14900HX 2.20 GHz
64-bit operating system, x64-based processor
Windows 11 Pro
Nvidia RTX 4080
Package Version


ninja 1.11.1
numpy 1.26.4
packaging 23.2
pillow 10.4.0
pip 24.1.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
requests 2.32.3
safetensors 0.4.3
setuptools 70.2.0
tokenizers 0.19.1
torch 2.3.1+cu118
torchaudio 2.3.1+cu118
torchvision 0.18.1+cu118
tqdm 4.66.4
urllib3 2.2.2
wheel 0.43.0


Boubou78000 commented Jul 10, 2024

I ran

set MAX_JOBS=4

And restarted my computer.
Then I ran the pip command and it worked


jhj0517 commented Jul 10, 2024

set MAX_JOBS=1
pip install flash-attn

It worked, but it took hours to install on Windows.
(It was stuck at "Building wheel for flash-attn (setup.py)..."; building the wheel was super slow.)

@Julianvaldesv

It does not work in my case :(
PC Specs:
Intel(R) Core(TM) i9-14900HX 2.20 GHz
64-bit operating system, x64-based processor
Windows 11 Pro
Nvidia RTX 4080

Settings:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Package Version


python 3.10.8
ninja 1.11.1
numpy 1.26.4
packaging 23.2
pillow 10.4.0
pip 24.1.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
requests 2.32.3
setuptools 70.2.0
tokenizers 0.19.1
torch 2.3.1+cu118
torchaudio 2.3.1+cu118
torchvision 0.18.1+cu118
tqdm 4.66.4
urllib3 2.2.2
wheel 0.43.0

VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\

Commands :
set MAX_JOBS=1
pip install flash-attn --no-build-isolation

Errors:

Building wheels for collected packages: flash-attn
Building wheel for flash-attn (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [271 lines of output]
fatal: not a git repository (or any of the parent directories): .git

  torch.__version__  = 2.3.1+cu118

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

FAILED: C:/Users/15023/AppData/Local/Temp/pip-install-dfkun1cn/flash-attn_b24e1ea8cfd04a7980b436f7faaf577f/build/temp.win-amd64-cpython-310/Release/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.obj

RuntimeError: Error compiling objects for extension
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)

@abgulati

@Julianvaldesv

The key line is: fatal: not a git repository (or any of the parent directories): .git

This occurs because the setup.py script for flash-attention is trying to run a Git command to update submodules.

Clone the flash-attn git repo and run the pip install command from within it. If you encounter errors stating no flash-attn or something, try running pip install . --no-build-isolation

@Julianvaldesv

> pip install . --no-build-isolation

I did that before, with no good results. I am not sure if I need to upgrade CUDA from 11.8 to 12.4.
Run from the git repo:

PS C:\Users\15023\Documents\Models\Tiny> cd flash-attention

set MAX_JOBS=4

PS C:\Users\15023\Documents\Models\Tiny\flash-attention> pip install . --no-build-isolation
Processing c:\users\15023\documents\models\tiny\flash-attention
Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash_attn
Building wheel for flash_attn (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [274 lines of output]

  torch.__version__  = 2.3.1+cu118


  C:\Users\15023\Documents\Models\Tiny\.venv\lib\site-packages\setuptools\__init__.py:80: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  !!

          ********************************************************************************
          Requirements should be satisfied by a PEP 517 installer.
          If you are using pip, you can try `pip install --use-pep517`.
          ********************************************************************************

  !!
    dist.fetch_build_eggs(dist.setup_requires)
  running bdist_wheel
  Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu118torch2.3cxx11abiFALSE-cp310-cp310-win_amd64.whl

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

File "C:\Users\15023\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

RuntimeError: Error compiling objects for extension
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash_attn)


abgulati commented Jul 10, 2024

@Julianvaldesv mate you need to start reading those error messages!

The git issue has been resolved and the error has changed so there's progress. It's screaming at you to upgrade PIP:

********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
********************************************************************************

It's even giving you the command to use there and if that doesn't work, simply Google how to upgrade PIP!

It's also telling you your version of MSVS is unsupported: fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

Upgrade pip, then refer to the instructions in my repo to install Visual Studio Build Tools and try again: https://github.com/abgulati/LARS?tab=readme-ov-file#1-build-tools

@Julianvaldesv

> @Julianvaldesv mate you need to start reading those error messages!
>
> The git issue has been resolved and the error has changed so there's progress. It's screaming at you to upgrade PIP:
>
> ********************************************************************************
> Requirements should be satisfied by a PEP 517 installer.
> If you are using pip, you can try `pip install --use-pep517`.
> ********************************************************************************
>
> It's even giving you the command to use there and if that doesn't work, simply Google how to upgrade PIP!
>
> It's also telling you your version of MSVS is unsupported: fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
>
> Upgrade pip, then refer to the instructions in my repo to install Visual Studio Build Tools and try again: https://github.com/abgulati/LARS?tab=readme-ov-file#1-build-tools

@abgulati my friend, thanks for your help. Something else is going on. I upgraded PIP days ago.

PS C:\Users\15023\Documents\Models\Tiny\flash-attention> python -m pip install --upgrade pip

Requirement already satisfied: pip in c:\users\15023\documents\models\tiny.venv\lib\site-packages (24.1.2)

Also I have installed the VisualStudio Build Tools 2022.
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\Bin\MSBuild


abgulati commented Jul 10, 2024

@Julianvaldesv In that case, try pasting this error into GPT-4/o or any other good LLM you have access to, describe the problem and background, and see what it says.

@dicksondickson

@Julianvaldesv You are upgrading pip in that tiny.venv. Seems like your system is a mess. Much easier and faster to nuke your system from orbit and start from scratch. Sometimes that's the only way.

@Julianvaldesv

> I was able to compile and build from the source repository on Windows 11 with:
>
> CUDA 12.5 Python 3.12
>
> I have a Visual Studio 2019 that came with Windows and I've never used it.
>
> pip install never worked for me.

What Torch version did you install that is compatible with CUDA 12.5? According to the PyTorch site, only 12.1 is fully supported (or 12.4 from source).


i486 commented Jul 18, 2024

@pwillia7

If pip isn't working for you, you may need more RAM. I was not able to compile in any way on 16GB of RAM; pip worked fine after upgrading to 64GB -- took a few hours.


SGrebenkin commented Sep 9, 2024

Windows 10 Pro x64
CUDA 12.5
torch 2.4.1
RTX 4070 12GB, Core i5 14400F, 16GB RAM
Python 3.9 works


dunbin commented Sep 9, 2024 via email


kairin commented Sep 14, 2024

[Screenshot 2024-09-14 212805]

[Screenshot 2024-09-14 213505]

It took me an hour and 15 minutes or so.

Initially I had an issue whereby the installation couldn't figure out where lcuda is located.

I installed PyTorch nightly (cu124)
CUDA 12.6
Windows 11 - but using Ubuntu 24.04 in WSL2
Nvidia 4080 16GB
