
Quantized model is not working properly when CUBLAS is ON #1661

Closed
bobqianic opened this issue Dec 20, 2023 Discussed in #1656 · 12 comments · Fixed by #1667
Labels
bug Something isn't working help wanted Extra attention is needed high priority Very important issue

Comments

@bobqianic
Collaborator

Discussed in #1656

Originally posted by Sing303 December 19, 2023
Now when I try to use quantization models with FULL GPU CUBLAS, instead of recognizing text it writes nonsense, all kinds of signs instead of words.

@bobqianic bobqianic added bug Something isn't working high priority Very important issue labels Dec 20, 2023
@Sing303

Sing303 commented Dec 20, 2023

Looks like we filed this at the same time :)
#1662

@bobqianic
Collaborator Author

bobqianic commented Dec 20, 2023

Tried syncing the latest ggml from llama.cpp, but no luck. Both talk-llama and whisper are behaving identically.

tinyllama-1.1b-chat-v0.3.Q4_0.gguf / ggml-model-whisper-base.bin

[screenshot]

ggml-model-whisper-base-q5_1.bin

[screenshot]

I've found that when I use main in llama.cpp, tinyllama-1.1b-chat-v0.3.Q4_0.gguf works properly.

[screenshot]

@bobqianic
Collaborator Author

bobqianic commented Dec 20, 2023

The output logits of the quantized models are near zero, which is not correct. I dumped the embd_enc values and noticed that with the CUDA backend the results differ significantly from those obtained with the CPU backend.

wstate_embd_enc.zip
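For reference, a quick way to quantify the backend mismatch from dumps like the one attached (assuming raw float32 buffers of equal length; the actual dump format may differ) is:

```python
import numpy as np

def max_abs_diff(cpu_path, cuda_path):
    """Load two raw float32 tensor dumps and return their largest
    element-wise absolute difference."""
    a = np.fromfile(cpu_path, dtype=np.float32)
    b = np.fromfile(cuda_path, dtype=np.float32)
    assert a.shape == b.shape, "dumps must have the same number of elements"
    return float(np.max(np.abs(a - b)))
```

A small mismatch (on the order of the quantization step) is expected between backends; values differing by orders of magnitude point to a broken kernel.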

@bobqianic
Collaborator Author

Do you know how to fix this bug? @slaren @ggerganov

@bobqianic bobqianic added the help wanted Extra attention is needed label Dec 20, 2023
@bobqianic
Collaborator Author

I find it strange: if the error occurred during the build_graph process, then all backends should have the problem, yet currently only the CUDA backend is known to misbehave. I have already synced the latest ggml from llama.cpp. And if the CUDA backend itself were broken, llama.cpp should have discovered the issue much earlier. Maybe it's because few people use it? I will try other quantization modes besides Q5_1.

@bobqianic
Collaborator Author

Okay, I've tested all the quantization modes, and none of them work properly on CUDA. I wonder if our quantization method has changed? All the wstate_embd_enc dumps have one thing in common: there is a significant difference between the CUDA and CPU outputs, which does not exist in the properly functioning FP16 mode.
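For context on what these modes represent: ggml's block quantization formats (Q4_0, Q5_1, etc.) store small blocks of weights as low-bit integer levels plus per-block scaling factors. A simplified Q5_1-style round trip (32 values per block, 5-bit levels with a per-block scale and minimum; illustrative only, not the exact ggml bit layout) looks like:

```python
import numpy as np

def quantize_q5_1_like(x):
    """Quantize a float array (length divisible by 32) into 5-bit levels
    with a per-block scale d and minimum m, so x ~= q * d + m."""
    x = x.reshape(-1, 32)
    m = x.min(axis=1, keepdims=True)
    d = (x.max(axis=1, keepdims=True) - m) / 31.0
    d = np.where(d == 0, 1.0, d)  # avoid division by zero in flat blocks
    q = np.clip(np.round((x - m) / d), 0, 31).astype(np.uint8)
    return q, d.astype(np.float32), m.astype(np.float32)

def dequantize_q5_1_like(q, d, m):
    """Reconstruct the approximate float values from levels, scale, min."""
    return (q.astype(np.float32) * d + m).reshape(-1)
```

The reconstruction error per element is at most half the block's quantization step; if a backend produces values wildly outside that bound (e.g. near-zero logits), the dequantization kernel, not the format, is the likely culprit.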

@slaren
Collaborator

slaren commented Dec 20, 2023

As far as I can tell, quantized models work correctly both with the gpt-2 example in ggml and in llama.cpp. I am not sure what is different in whisper.cpp.

@bobqianic
Collaborator Author

FP16: [screenshot]
Q8_0: [screenshot]
Q6_K: [screenshot]
Q5_K: [screenshot]
Q5_1: [screenshot]
Q5_0: [screenshot]
Q4_K: [screenshot]
Q4_1: [screenshot]
Q4_0: [screenshot]
Q3_K: [screenshot]
Q2_K: [screenshot]

@slaren
Collaborator

slaren commented Dec 20, 2023

Can you test changing the GGML_CUDA_SOURCES section in CMakeLists.txt to this (from ggml)? It may be an issue with the CUDA architectures.

if (GGML_CUDA_SOURCES)
    message(STATUS "GGML CUDA sources found")
    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        # Only configure ggml CUDA architectures if not globally set
        if (NOT DEFINED GGML_CUDA_ARCHITECTURES)
            # Not overridden by user, so set defaults
            set(GGML_CUDA_ARCHITECTURES 52 61 70)
        endif()
        message(STATUS "GGML Configuring CUDA architectures ${GGML_CUDA_ARCHITECTURES}")
        set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES ${GGML_CUDA_ARCHITECTURES})
    endif()
    set_property(TARGET ggml PROPERTY CUDA_SELECT_NVCC_ARCH_FLAGS "Auto")
    if (NOT MSVC)
        target_link_libraries(ggml PUBLIC stdc++)
    endif()
endif()

Maybe only the part that sets CMAKE_CUDA_ARCHITECTURES needs to be changed though.

@bobqianic
Collaborator Author

Is this the section you're referring to?

whisper.cpp/CMakeLists.txt

Lines 522 to 526 in 9286d3f

if (GGML_SOURCES_CUDA)
    message(STATUS "GGML CUDA sources found, configuring CUDA architecture")
    set_property(TARGET whisper PROPERTY CUDA_ARCHITECTURES OFF)
    set_property(TARGET whisper PROPERTY CUDA_SELECT_NVCC_ARCH_FLAGS "Auto")
endif()

@slaren
Collaborator

slaren commented Dec 20, 2023

Yes. Replace the first set_property with this fragment:

        # Only configure ggml CUDA architectures if not globally set
        if (NOT DEFINED GGML_CUDA_ARCHITECTURES)
            # Not overridden by user, so set defaults
            set(GGML_CUDA_ARCHITECTURES 52 61 70)
        endif()
        message(STATUS "GGML Configuring CUDA architectures ${GGML_CUDA_ARCHITECTURES}")
        set_property(TARGET whisper PROPERTY CUDA_ARCHITECTURES ${GGML_CUDA_ARCHITECTURES})
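With a fragment like this in place, the target architectures can also be overridden at configure time instead of relying on the defaults. A hypothetical invocation (the WHISPER_CUBLAS option and the GGML_CUDA_ARCHITECTURES cache variable names are assumed from the snippet above and may differ in your tree):

```shell
# Use the default architectures (52 61 70) from the fragment above
cmake -B build -DWHISPER_CUBLAS=ON

# Or target specific compute capabilities explicitly
cmake -B build -DWHISPER_CUBLAS=ON -DGGML_CUDA_ARCHITECTURES="61;75"

cmake --build build --config Release
```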

@bobqianic
Collaborator Author

Wow, it works!
