Getting out of memory error running mistral on 3090 #44

Closed
ggilley opened this issue Mar 30, 2024 · 9 comments
Labels: backend (Backend work), bug (Something isn't working), processing (Processing related to the model)

Comments

ggilley commented Mar 30, 2024

I can see others having success running mistral on a 3090. Am I doing something wrong?

Request at 2024-03-29 20:33:44.758325187 -07:00: {"messages":[{"content":"What is the capital of France?","role":"user","name":null}],"model":"mistral","logit_bias":null,"logprobs":false,"top_logprobs":null,"max_tokens":256,"n":1,"presence_penalty":null,"frequency_penalty":null,"stop":null,"temperature":0.1,"top_p":0.1,"top_k":1}
$ mistralrs-server --port 1234 --log output.log mistral
Loading model on Cuda(CudaDevice(DeviceId(1)))...
100%|########################################################################| 88/88 [00:03<00:00, 20.72it/s]100%|######################################################################| 203/203 [00:06<00:00, 23.31it/s]
Model loaded.
Serving on http://0.0.0.0:1234.
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mistral.rs:409:17:
Model failed with error `DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")`. Please raise an issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError
EricLBuehler (Owner) commented

This is likely due to the KV cache growing with the sequence length. Future work on PagedAttention will hopefully resolve this and make the maximum sequence length explicit.
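
For rough intuition, here is a back-of-envelope sketch (not mistral.rs code; the shapes are the usual Mistral-7B values of 32 layers, 8 KV heads via GQA, head dim 128, and an f16 cache, and sliding-window effects are ignored) of why the KV cache grows linearly with sequence length:

```rust
// Hypothetical estimate of KV-cache size as a function of sequence length.
// Assumes Mistral-7B-style shapes; real usage also depends on batch size,
// activations, and the weights themselves (~14 GB in f16 for a 7B model).
fn kv_cache_bytes(seq_len: usize, batch: usize) -> usize {
    let layers = 32;
    let kv_heads = 8; // grouped-query attention
    let head_dim = 128;
    let bytes_per_elem = 2; // f16
    // Factor of 2 accounts for both keys and values.
    2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch
}

fn main() {
    for &len in &[512usize, 2048, 8192, 32768] {
        let gib = kv_cache_bytes(len, 1) as f64 / (1024.0 * 1024.0 * 1024.0);
        println!("seq_len {len:>6}: ~{gib:.2} GiB of KV cache");
    }
}
```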

ivanbaldo commented

I guess it's related, but on an Nvidia A10G with 24 GB of VRAM I got the following when running with `-port 8080 llama -m meta-llama/Llama-2-7b-chat-hf` on master as of today:

avx: false, neon: false, simd128: false, f16c: false
Loading model `meta-llama/Llama-2-7b-chat-hf` on Cuda(CudaDevice(DeviceId(1)))...
100%|####################################################################################################################| 82/82 [00:53<00:00, 707.70it/s]100%|##################################################################################################################| 241/241 [01:44<00:00, 690.23it/s]
Model loaded.
Serving on http://0.0.0.0:8080.
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/llama.rs:422:17:
Model failed with error `DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")`. Please raise an issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError

EricLBuehler added the backend (Backend work) and processing (Processing related to the model) labels Apr 3, 2024
EricLBuehler (Owner) commented

#47 will allocate the maximum possible KV cache statically and then reject any sequence whose length is too large. For now, because the KV cache is allocated dynamically, the only way to deal with this is to reduce the maximum sequence length, since the sequence length is what drives the growth in KV cache size.
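
A minimal sketch of that idea (the names are hypothetical, not the actual #47 code): size the cache for the configured maximum sequence length once at load time, then reject requests that would overflow it rather than growing the cache and OOMing mid-generation.

```rust
// Hypothetical illustration only; the real implementation lives in mistralrs-core.
struct StaticKvCache {
    max_seq_len: usize,
}

impl StaticKvCache {
    fn new(max_seq_len: usize) -> Self {
        // In the real backend, the device tensors covering `max_seq_len` tokens
        // would be allocated here, once, so memory use is known up front.
        Self { max_seq_len }
    }

    /// Reject a request early instead of letting it OOM partway through generation.
    fn admit(&self, prompt_len: usize, max_new_tokens: usize) -> Result<(), String> {
        let needed = prompt_len + max_new_tokens;
        if needed > self.max_seq_len {
            Err(format!(
                "sequence needs {needed} tokens but the cache was sized for {}",
                self.max_seq_len
            ))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let cache = StaticKvCache::new(4096);
    assert!(cache.admit(100, 256).is_ok());
    assert!(cache.admit(4000, 256).is_err());
}
```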

EricLBuehler (Owner) commented

An interesting development is #98, which will return the response generated so far if the model fails with any error, including OOM. Additionally, the server will stay alive when this happens.
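
Roughly, that behaviour looks like the following (a sketch under assumed names, not the #98 implementation): catch the error from the generation step, return whatever was produced so far, and keep serving instead of panicking.

```rust
// Hypothetical sketch: a generation loop that survives a mid-stream model error.
#[derive(Debug)]
enum StepError {
    OutOfMemory,
}

// Stand-in for one decoding step of the model.
fn generate_step(step: usize) -> Result<String, StepError> {
    if step == 3 {
        // Simulate CUDA_ERROR_OUT_OF_MEMORY partway through a completion.
        Err(StepError::OutOfMemory)
    } else {
        Ok(format!("tok{step} "))
    }
}

// Return the partial completion plus the error instead of unwrapping and
// panicking, so the server process stays alive for other requests.
fn run_request(max_tokens: usize) -> (String, Option<StepError>) {
    let mut out = String::new();
    for step in 0..max_tokens {
        match generate_step(step) {
            Ok(tok) => out.push_str(&tok),
            Err(e) => return (out, Some(e)),
        }
    }
    (out, None)
}

fn main() {
    let (partial, err) = run_request(8);
    println!("partial response: {partial:?}, error: {err:?}");
}
```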

EricLBuehler mentioned this issue Apr 12, 2024
EricLBuehler added the bug (Something isn't working) label and removed the maybe-bug label Apr 12, 2024
EricLBuehler (Owner) commented

Refs #49.

EricLBuehler (Owner) commented

@ggilley, @ivanbaldo, this should be fixed now. Can you please try it again?

EricLBuehler (Owner) commented

Closing as it is fixed.

ivanbaldo commented

I don't have the GPU server at hand, @EricLBuehler; I'm currently working on unrelated Web3 stuff.
I will continue working on LLMs in about two weeks, so let me know if testing is needed then.
Sorry about that, and thanks for all your work!!!

EricLBuehler (Owner) commented

@ivanbaldo sounds good. We fixed this issue for good and there will be lots of exciting developments coming soon!
