Getting out of memory error running mistral on 3090 #44

Closed
ggilley opened this issue Mar 30, 2024 · 9 comments
Labels: backend (Backend work), bug (Something isn't working), processing (Processing related to the model)

Comments

ggilley commented Mar 30, 2024

I can see others having success running mistral on a 3090. Am I doing something wrong?

Request at 2024-03-29 20:33:44.758325187 -07:00: {"messages":[{"content":"What is the capital of France?","role":"user","name":null}],"model":"mistral","logit_bias":null,"logprobs":false,"top_logprobs":null,"max_tokens":256,"n":1,"presence_penalty":null,"frequency_penalty":null,"stop":null,"temperature":0.1,"top_p":0.1,"top_k":1}
$ mistralrs-server --port 1234 --log output.log mistral
Loading model on Cuda(CudaDevice(DeviceId(1)))...
100%|########################################################################| 88/88 [00:03<00:00, 20.72it/s]100%|######################################################################| 203/203 [00:06<00:00, 23.31it/s]
Model loaded.
Serving on http://0.0.0.0:1234.
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mistral.rs:409:17:
Model failed with error `DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")`. Please raise an issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError
EricLBuehler (Owner) commented

This is likely due to the KV cache growing with the sequence length. Future work on PagedAttention will hopefully resolve this and make the maximum sequence length explicit.
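
For rough intuition, here is a back-of-envelope sketch (not mistral.rs code; the shapes are the usual Mistral-7B values of 32 layers, 8 KV heads via GQA, head dim 128, and an f16 cache, and sliding-window effects are ignored) of why the KV cache grows linearly with sequence length:

```rust
// Hypothetical estimate of KV-cache size as a function of sequence length.
// Assumes Mistral-7B-style shapes; real usage also depends on batch size,
// activations, and the weights themselves (~14 GB in f16 for a 7B model).
fn kv_cache_bytes(seq_len: usize, batch: usize) -> usize {
    let layers = 32;
    let kv_heads = 8; // grouped-query attention
    let head_dim = 128;
    let bytes_per_elem = 2; // f16
    // Factor of 2 accounts for both keys and values.
    2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch
}

fn main() {
    for &len in &[512usize, 2048, 8192, 32768] {
        let gib = kv_cache_bytes(len, 1) as f64 / (1024.0 * 1024.0 * 1024.0);
        println!("seq_len {len:>6}: ~{gib:.2} GiB of KV cache");
    }
}
```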

ivanbaldo commented

I guess it's related, but on an Nvidia A10G with 24 GB of VRAM I got the following when running with `-port 8080 llama -m meta-llama/Llama-2-7b-chat-hf` on master as of today:

avx: false, neon: false, simd128: false, f16c: false
Loading model `meta-llama/Llama-2-7b-chat-hf` on Cuda(CudaDevice(DeviceId(1)))...
100%|####################################################################################################################| 82/82 [00:53<00:00, 707.70it/s]100%|##################################################################################################################| 241/241 [01:44<00:00, 690.23it/s]
Model loaded.
Serving on http://0.0.0.0:8080.
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/llama.rs:422:17:
Model failed with error `DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")`. Please raise an issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError

EricLBuehler added the backend (Backend work) and processing (Processing related to the model) labels Apr 3, 2024
EricLBuehler (Owner) commented

#47 will allocate the maximum possible KV cache statically and then reject any sequence whose length is too large. For now, because the KV cache is allocated dynamically, the only way to deal with this is to reduce the maximum sequence length, since the sequence length is what drives the growth in KV cache size.
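
A minimal sketch of that idea (the names are hypothetical, not the actual #47 code): size the cache for the configured maximum sequence length once at load time, then reject requests that would overflow it rather than growing the cache and OOMing mid-generation.

```rust
// Hypothetical illustration only; the real implementation lives in mistralrs-core.
struct StaticKvCache {
    max_seq_len: usize,
}

impl StaticKvCache {
    fn new(max_seq_len: usize) -> Self {
        // In the real backend, the device tensors covering `max_seq_len` tokens
        // would be allocated here, once, so memory use is known up front.
        Self { max_seq_len }
    }

    /// Reject a request early instead of letting it OOM partway through generation.
    fn admit(&self, prompt_len: usize, max_new_tokens: usize) -> Result<(), String> {
        let needed = prompt_len + max_new_tokens;
        if needed > self.max_seq_len {
            Err(format!(
                "sequence needs {needed} tokens but the cache was sized for {}",
                self.max_seq_len
            ))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let cache = StaticKvCache::new(4096);
    assert!(cache.admit(100, 256).is_ok());
    assert!(cache.admit(4000, 256).is_err());
}
```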

EricLBuehler (Owner) commented

An interesting development is #98, which will return the response generated so far if the model fails with any error, including OOM. Additionally, the server will stay alive when this happens.
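
Roughly, that behaviour looks like the following (a sketch under assumed names, not the #98 implementation): catch the error from the generation step, return whatever was produced so far, and keep serving instead of panicking.

```rust
// Hypothetical sketch: a generation loop that survives a mid-stream model error.
#[derive(Debug)]
enum StepError {
    OutOfMemory,
}

// Stand-in for one decoding step of the model.
fn generate_step(step: usize) -> Result<String, StepError> {
    if step == 3 {
        // Simulate CUDA_ERROR_OUT_OF_MEMORY partway through a completion.
        Err(StepError::OutOfMemory)
    } else {
        Ok(format!("tok{step} "))
    }
}

// Return the partial completion plus the error instead of unwrapping and
// panicking, so the server process stays alive for other requests.
fn run_request(max_tokens: usize) -> (String, Option<StepError>) {
    let mut out = String::new();
    for step in 0..max_tokens {
        match generate_step(step) {
            Ok(tok) => out.push_str(&tok),
            Err(e) => return (out, Some(e)),
        }
    }
    (out, None)
}

fn main() {
    let (partial, err) = run_request(8);
    println!("partial response: {partial:?}, error: {err:?}");
}
```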

EricLBuehler mentioned this issue Apr 12, 2024
EricLBuehler added the bug (Something isn't working) label and removed the maybe-bug label Apr 12, 2024
EricLBuehler (Owner) commented

Refs #49.

EricLBuehler (Owner) commented

@ggilley, @ivanbaldo, this should be fixed now. Can you please try it again?

EricLBuehler (Owner) commented

Closing as it is fixed.

ivanbaldo commented

I don't have the GPU server at hand, @EricLBuehler; I'm currently working on unrelated Web3 stuff.
I will continue working on LLMs in about two weeks, so let me know if testing is needed then.
Sorry about that, and thanks for all your work!!!

EricLBuehler (Owner) commented

@ivanbaldo sounds good. We fixed this issue for good and there will be lots of exciting developments coming soon!
