I reviewed the Discussions and have a new bug to share.
Expected Behavior
Prioritize use of VRAM, and start using shared memory when memory is exceeded
and
Fast inference
Current Behavior
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
With this option set, system RAM is used first instead of VRAM, and the specified GPU is not prioritized.
llama_print_timings: total time = 56361.73 ms / 45 tokens
With the option unset, inference is fast again:
llama_print_timings: total time = 40.95 ms / 143 tokens
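For reference, a minimal shell sketch of toggling the variable between the two behaviors described above (only the variable name comes from this report; the echo lines are just for inspection):

```shell
# Enable unified memory: VRAM can spill over, but as reported above,
# system RAM is then used first and inference slows down dramatically.
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
echo "unified memory: ${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"

# Unset it to restore the fast, VRAM-first behavior.
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY
echo "unified memory: ${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"
```

Note that `export` only affects the current shell and its children, so the setting does not persist across sessions unless added to a profile script.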
Environment and Context
Windows 11, WSL2, Ubuntu 22.04.4 LTS
CUDA 12.1
Python 3.10.11
GNU Make 4.3 x86_64-pc-linux-gnu
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0