
Quantized model is not working properly when CUBLAS is ON #1661

Closed
bobqianic opened this issue Dec 20, 2023 Discussed in #1656 · 12 comments · Fixed by #1667
Labels
bug Something isn't working help wanted Extra attention is needed high priority Very important issue

Comments

@bobqianic
Collaborator

Discussed in #1656

Originally posted by Sing303 December 19, 2023
Now when I try to use quantization models with FULL GPU CUBLAS, instead of recognizing text it writes nonsense, all kinds of signs instead of words.

@bobqianic bobqianic added bug Something isn't working high priority Very important issue labels Dec 20, 2023
@Sing303

Sing303 commented Dec 20, 2023

Looks like we filed this at the same time :)
#1662

@bobqianic
Collaborator Author

bobqianic commented Dec 20, 2023

Tried syncing the latest ggml from llama.cpp, but no luck. Both talk-llama and whisper are behaving identically.

tinyllama-1.1b-chat-v0.3.Q4_0.gguf / ggml-model-whisper-base.bin

[screenshot]

ggml-model-whisper-base-q5_1.bin

[screenshot]

I've found that when I use main in llama.cpp, tinyllama-1.1b-chat-v0.3.Q4_0.gguf works properly.

[screenshot]

@bobqianic
Collaborator Author

bobqianic commented Dec 20, 2023

The output logits of the quantized models are near zero, which is not correct. I dumped the embd_enc values and noticed that with the CUDA backend the results differ significantly from those obtained with the CPU backend.

wstate_embd_enc.zip
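For reference, a quick way to quantify the backend mismatch from dumps like the one attached (assuming raw float32 buffers of equal length; the actual dump format may differ) is:

```python
import numpy as np

def max_abs_diff(cpu_path, cuda_path):
    """Load two raw float32 tensor dumps and return their largest
    element-wise absolute difference."""
    a = np.fromfile(cpu_path, dtype=np.float32)
    b = np.fromfile(cuda_path, dtype=np.float32)
    assert a.shape == b.shape, "dumps must have the same number of elements"
    return float(np.max(np.abs(a - b)))
```

A small mismatch (on the order of the quantization step) is expected between backends; values differing by orders of magnitude point to a broken kernel.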

@bobqianic
Collaborator Author

Do you know how to fix this bug? @slaren @ggerganov

@bobqianic bobqianic added the help wanted Extra attention is needed label Dec 20, 2023
@bobqianic
Collaborator Author

I find it strange: if the error occurred during the build_graph process, then all backends should have the problem, yet currently only the CUDA backend is known to misbehave. I have already synced the latest ggml from llama.cpp. And if the CUDA backend itself were broken, llama.cpp should have discovered the issue much earlier. Maybe it's because few people use it? I will try other quantization modes besides Q5_1.

@bobqianic
Collaborator Author

Okay, I've tested all the quantization modes, and none of them work properly on CUDA. I wonder if our quantization method has changed? All the wstate_embd_enc dumps have one thing in common: there is a significant difference between the CUDA and CPU outputs, which does not exist in the properly functioning FP16 mode.
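For context on what these modes represent: ggml's block quantization formats (Q4_0, Q5_1, etc.) store small blocks of weights as low-bit integer levels plus per-block scaling factors. A simplified Q5_1-style round trip (32 values per block, 5-bit levels with a per-block scale and minimum; illustrative only, not the exact ggml bit layout) looks like:

```python
import numpy as np

def quantize_q5_1_like(x):
    """Quantize a float array (length divisible by 32) into 5-bit levels
    with a per-block scale d and minimum m, so x ~= q * d + m."""
    x = x.reshape(-1, 32)
    m = x.min(axis=1, keepdims=True)
    d = (x.max(axis=1, keepdims=True) - m) / 31.0
    d = np.where(d == 0, 1.0, d)  # avoid division by zero in flat blocks
    q = np.clip(np.round((x - m) / d), 0, 31).astype(np.uint8)
    return q, d.astype(np.float32), m.astype(np.float32)

def dequantize_q5_1_like(q, d, m):
    """Reconstruct the approximate float values from levels, scale, min."""
    return (q.astype(np.float32) * d + m).reshape(-1)
```

The reconstruction error per element is at most half the block's quantization step; if a backend produces values wildly outside that bound (e.g. near-zero logits), the dequantization kernel, not the format, is the likely culprit.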

@slaren
Collaborator

slaren commented Dec 20, 2023

As far as I can tell, quantized models work correctly both with the gpt-2 example in ggml and in llama.cpp. I am not sure what is different in whisper.cpp.

@bobqianic
Collaborator Author

FP16: [screenshot]
Q8_0: [screenshot]
Q6_K: [screenshot]
Q5_K: [screenshot]
Q5_1: [screenshot]
Q5_0: [screenshot]
Q4_K: [screenshot]
Q4_1: [screenshot]
Q4_0: [screenshot]
Q3_K: [screenshot]
Q2_K: [screenshot]

@slaren
Collaborator

slaren commented Dec 20, 2023

Can you test changing the GGML_CUDA_SOURCES section in CMakeLists.txt to this (from ggml)? It may be an issue with the CUDA architectures.

if (GGML_CUDA_SOURCES)
    message(STATUS "GGML CUDA sources found")
    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        # Only configure ggml CUDA architectures if not globally set
        if (NOT DEFINED GGML_CUDA_ARCHITECTURES)
            # Not overridden by user, so set defaults
            set(GGML_CUDA_ARCHITECTURES 52 61 70)
        endif()
        message(STATUS "GGML Configuring CUDA architectures ${GGML_CUDA_ARCHITECTURES}")
        set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES ${GGML_CUDA_ARCHITECTURES})
    endif()
    set_property(TARGET ggml PROPERTY CUDA_SELECT_NVCC_ARCH_FLAGS "Auto")
    if (NOT MSVC)
        target_link_libraries(ggml PUBLIC stdc++)
    endif()
endif()

Maybe only the part that sets CMAKE_CUDA_ARCHITECTURES needs to be changed though.

@bobqianic
Collaborator Author

Is this the section you're referring to?

whisper.cpp/CMakeLists.txt

Lines 522 to 526 in 9286d3f

if (GGML_SOURCES_CUDA)
    message(STATUS "GGML CUDA sources found, configuring CUDA architecture")
    set_property(TARGET whisper PROPERTY CUDA_ARCHITECTURES OFF)
    set_property(TARGET whisper PROPERTY CUDA_SELECT_NVCC_ARCH_FLAGS "Auto")
endif()

@slaren
Collaborator

slaren commented Dec 20, 2023

Yes. Replace the first set_property with this fragment:

        # Only configure ggml CUDA architectures if not globally set
        if (NOT DEFINED GGML_CUDA_ARCHITECTURES)
            # Not overridden by user, so set defaults
            set(GGML_CUDA_ARCHITECTURES 52 61 70)
        endif()
        message(STATUS "GGML Configuring CUDA architectures ${GGML_CUDA_ARCHITECTURES}")
        set_property(TARGET whisper PROPERTY CUDA_ARCHITECTURES ${GGML_CUDA_ARCHITECTURES})
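With a fragment like this in place, the target architectures can also be overridden at configure time instead of relying on the defaults. A hypothetical invocation (the WHISPER_CUBLAS option and the GGML_CUDA_ARCHITECTURES cache variable names are assumed from the snippet above and may differ in your tree):

```shell
# Use the default architectures (52 61 70) from the fragment above
cmake -B build -DWHISPER_CUBLAS=ON

# Or target specific compute capabilities explicitly
cmake -B build -DWHISPER_CUBLAS=ON -DGGML_CUDA_ARCHITECTURES="61;75"

cmake --build build --config Release
```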

@bobqianic
Collaborator Author

Wow, it works!
