Getting out of memory error running mistral on 3090 #44
I can see others having success running mistral on a 3090. Am I doing something wrong?

Comments
This is likely due to the KV cache growing with the sequence length. Future efforts on PagedAttention will hopefully resolve this and make the maximum sequence length explicit.
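For intuition, here is a rough back-of-the-envelope estimate of how the KV cache grows with sequence length. This is a hypothetical sketch using illustrative Mistral-7B-like shape parameters (32 layers, 8 KV heads under grouped-query attention, head dim 128), not values read from this repository:

```rust
// Rough estimate of KV cache size as a function of sequence length.
// All shape parameters are illustrative assumptions, not mistral.rs internals.
fn kv_cache_bytes(
    num_layers: usize,   // e.g. 32 for a Mistral-7B-like model
    num_kv_heads: usize, // e.g. 8 with grouped-query attention
    head_dim: usize,     // e.g. 128
    seq_len: usize,      // grows by one on every generated token
    dtype_bytes: usize,  // 2 for f16/bf16
) -> usize {
    // One K and one V tensor per layer, each of shape [num_kv_heads, seq_len, head_dim].
    2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
}

fn main() {
    for seq_len in [1024usize, 4096, 16384, 32768] {
        let bytes = kv_cache_bytes(32, 8, 128, seq_len, 2);
        println!("seq_len {:>6}: {:.2} GiB", seq_len, bytes as f64 / (1u64 << 30) as f64);
    }
}
```

Under these assumptions the cache reaches about 4 GiB at 32k tokens, on top of roughly 14 GB of fp16 weights for a 7B model, which is how a 24 GB card can run out of memory mid-generation.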
I guess it's related, but on an Nvidia A10G with 24 GB VRAM I got the following when running with:
#47 will statically allocate the maximum possible KV cache and then reject any sequences whose length is too large. As of now, because the KV cache is allocated dynamically, the only way to deal with this is to reduce the maximum sequence length, since sequence length is what drives the KV cache's growth.
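A minimal sketch of that static-allocation strategy, assuming hypothetical names (`StaticKvCache`, `check_fits`) that are not the actual mistral.rs API:

```rust
// Sketch of "allocate statically, reject oversized sequences".
struct StaticKvCache {
    max_seq_len: usize,
    // Backing storage sized for max_seq_len up front, so memory use is
    // fixed no matter how long generation runs.
    storage: Vec<f32>,
}

impl StaticKvCache {
    fn new(max_seq_len: usize, per_token_elems: usize) -> Self {
        Self {
            max_seq_len,
            storage: vec![0.0; max_seq_len * per_token_elems],
        }
    }

    /// Reject sequences that would overflow the fixed cache instead of
    /// growing it (and eventually OOMing) at runtime.
    fn check_fits(&self, prompt_len: usize, max_new_tokens: usize) -> Result<(), String> {
        let needed = prompt_len + max_new_tokens;
        if needed > self.max_seq_len {
            Err(format!(
                "sequence of length {needed} exceeds max_seq_len {}",
                self.max_seq_len
            ))
        } else {
            Ok(())
        }
    }
}

fn main() {
    // Tiny per-token size just to keep the demo allocation small.
    let cache = StaticKvCache::new(4096, 4);
    println!("{:?}", cache.check_fits(3000, 500)); // Ok(())
    println!("{:?}", cache.check_fits(4000, 500)); // Err(...)
}
```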
An interesting development is #98, which will return the partial response generated so far if generation crashes due to any model error, including OOM. Additionally, the server will stay alive through this event.
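A sketch of that behavior, where `generate_next_token` is a hypothetical stand-in for the real decode step; the actual implementation in #98 may differ:

```rust
// On any per-token model error (including OOM during decode), stop and
// hand back the partial completion instead of tearing down the server.
fn generate(prompt_tokens: Vec<u32>, max_new_tokens: usize) -> (Vec<u32>, Option<String>) {
    let mut output = Vec::new();
    for _ in 0..max_new_tokens {
        match generate_next_token(&prompt_tokens, &output) {
            Ok(tok) => output.push(tok),
            Err(e) => return (output, Some(e.to_string())),
        }
    }
    (output, None)
}

// Hypothetical decode step; a real implementation would run the model forward.
fn generate_next_token(
    _prompt: &[u32],
    _generated: &[u32],
) -> Result<u32, Box<dyn std::error::Error>> {
    Ok(0)
}

fn main() {
    let (tokens, err) = generate(vec![1, 2, 3], 8);
    println!("generated {} tokens, error: {:?}", tokens.len(), err);
}
```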
Refs #49.
@ggilley, @ivanbaldo, this should be fixed now. Can you please try it again?
Closing as it is fixed.
I don't have the GPU server at hand, @EricLBuehler; I'm currently working on unrelated Web3 stuff.
@ivanbaldo sounds good. We fixed this issue for good and there will be lots of exciting developments coming soon!