Generating garbage output on CUDA when GGML_CUDA_FORCE_DMMV is set to false #2136
I ran the latest master:
$ make -j && ./bin/main -m ../models/7B/ggml-model-q4_0.bin -ngl 66 -p "Hello, my name is" -n 128
[ 2%] Built target BUILD_INFO
[ 8%] Built target ggml
[ 10%] Built target ggml_static
[ 15%] Built target llama
[ 19%] Built target test-quantize-fns
[ 23%] Built target test-sampling
[ 32%] Built target quantize-stats
[ 32%] Built target test-tokenizer-0
[ 34%] Built target common
[ 39%] Built target test-quantize-perf
[ 43%] Built target quantize
[ 47%] Built target baby-llama
[ 52%] Built target perplexity
[ 56%] Built target benchmark
[ 60%] Built target embedding
[ 65%] Built target train-text-from-scratch
[ 73%] Built target q8dot
[ 78%] Built target main
[ 78%] Built target vdot
[ 84%] Built target server
[ 86%] Built target save-load-state
[ 91%] Built target simple
[ 95%] Built target embdinput
[100%] Built target embd-input-test
main: build = 802 (7242140)
main: seed = 1688746045
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5
llama.cpp: loading model from ../models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0,08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1862,39 MB (+ 1026,00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4892 MB
llama_new_context_with_model: kv self size = 256,00 MB
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello, my name is Katrina (sometimes known as “Jar”). I’m a full-time writer and an aspiring author. I started writing stories at a very young age and it was always my dream to write a book!
I believe every writer has their own unique style. Some writers like to use a pen, others prefer the keyboard. I like to use both when it comes to writing my books! I have a great passion for reading and I hope that this will influence my future as an author. My current interests are science fiction, urban fantasy and horror genres.
I write in my free time on week
llama_print_timings: load time = 1184,95 ms
llama_print_timings: sample time = 85,22 ms / 128 runs ( 0,67 ms per token, 1501,98 tokens per second)
llama_print_timings: prompt eval time = 225,13 ms / 6 tokens ( 37,52 ms per token, 26,65 tokens per second)
llama_print_timings: eval time = 2228,14 ms / 127 runs ( 17,54 ms per token, 57,00 tokens per second)
llama_print_timings: total time = 2569,25 ms

Maybe it is related to the 3B model. Does it work for you with the 7B model?
Much better results. I think you are right, something with the 3B model doesn't work.
I also noticed bad results with the 3B model on a 3080.
My intuition is that it's an issue with padding when converting the vector to q8_1.
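For illustration, here is a minimal C sketch of why padding can matter when quantizing the activation vector into q8_1-style blocks. The block size of 32, the struct layout, and the function name are assumptions made for this sketch, not the actual ggml implementation:

#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK8_1 32  /* assumed q8_1 block size: 32 values per block */

typedef struct {
    float  d;            /* scale */
    float  s;            /* sum carried along for q8_1-style dot products */
    int8_t qs[QK8_1];    /* quantized values (illustrative layout, not ggml's) */
} block_q8_1_sketch;

/* Quantize n floats into q8_1-style blocks. If n is not a multiple of
 * QK8_1, the tail of the last block must be zero-filled; otherwise the
 * mat-vec kernel reads whatever happens to be in that memory and the
 * dot products (and the generated text) turn into garbage. */
static void quantize_q8_1_padded(const float * x, block_q8_1_sketch * y, int n) {
    const int nb = (n + QK8_1 - 1) / QK8_1;   /* number of blocks, rounded up */
    for (int ib = 0; ib < nb; ++ib) {
        const int base  = ib * QK8_1;
        const int count = (n - base) < QK8_1 ? (n - base) : QK8_1;

        float amax = 0.0f;
        for (int j = 0; j < count; ++j) {
            const float v = fabsf(x[base + j]);
            if (v > amax) amax = v;
        }

        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        memset(y[ib].qs, 0, sizeof(y[ib].qs));   /* zero the padding explicitly */

        float sum = 0.0f;
        for (int j = 0; j < count; ++j) {
            y[ib].qs[j] = (int8_t) roundf(x[base + j] * id);
            sum += x[base + j];
        }
        y[ib].d = d;
        y[ib].s = sum;
    }
}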
The issue seems to be in the mul mat. Running it with
OS: Windows 10 LTSC 1809, using the CUDA 11.7.1 runtimes provided by this repo's workflow.
I have an RTX 2060 card, and ever since #2067 was merged, my system generates garbage output with cuBLAS if any GPU layers are offloaded. This does not happen if GGML_CUDA_FORCE_DMMV is set to true, or if 0 layers are offloaded.
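For reference, a rough sketch of what the flag is understood to toggle. The function name and the compute-capability check below are simplified stand-ins, not the actual ggml-cuda.cu logic:

/* Returns whether the quantized mat-vec path (the mul_mat_vec_q kernels
 * added in #2067, which quantize activations to q8_1 first) should be used.
 * Defining GGML_CUDA_FORCE_DMMV at build time forces the older
 * dequantize-mul-mat-vec path instead, which matches the observation that
 * the garbage output disappears when the flag is set. */
static int use_mul_mat_vec_q_sketch(int compute_capability) {
#ifdef GGML_CUDA_FORCE_DMMV
    (void) compute_capability;
    return 0;                           /* always fall back to dequantize + mat-vec */
#else
    return compute_capability >= 610;   /* int8 DP4A needs compute capability 6.1+ */
#endif
}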
Example output:
Another attempt, with fewer layers: