Code Llama 70b triton crashes with XQA #1256
I was looking into the XQA prepare and dispatch functions here https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQARunner.h#L185
@jdemouth-nvidia @byshiue @ncomly-nvidia @nekorobov This is a major issue impeding our ability to make progress on serving our models more efficiently. Do you have any updates on when XQA will actually be working? Thanks!
I will create an MR to remove the mPreparedCalled check for now.
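For readers following along, here is a minimal sketch (hypothetical class and method names, modeled only on the `mPreparedCalled` flag mentioned above, not on the actual `decoderXQARunner.h` internals) of the prepare/dispatch contract being discussed: `dispatch` guards on a flag that `prepare` sets, so callers that skip `prepare` trip the check.

```cpp
#include <cassert>

// Hypothetical sketch of the prepare-before-dispatch contract.
// Real code lives in decoderXQARunner.h and asserts/aborts rather
// than returning a status; names here are illustrative only.
class XQARunnerSketch
{
public:
    void prepare()
    {
        // The real prepare() would set up workspace/tuning state;
        // for the affected path it is effectively a no-op, so only
        // the flag matters.
        mPreparedCalled = true;
    }

    bool dispatch()
    {
        // The guard that fires when callers skip prepare():
        if (!mPreparedCalled)
        {
            return false;
        }
        mPreparedCalled = false; // consume the preparation
        return true;
    }

private:
    bool mPreparedCalled = false;
};
```

This illustrates why removing the check is a safe stopgap when `prepare` does no real work: the flag is the only state the guard protects.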
Thanks @ming-wei. June |
Thanks @ming-wei and @juney-nvidia. Since it is a no-op, I tried uncommenting the lines below and rebuilt 0.8.0 from source.
The model (CodeLlama-70B with XQA, compiled as above) seems to run and serve requests! However, there are now repeats in the output. Here are samples of the outputs given
To clarify I used the instructions here: https://github.com/triton-inference-server/tensorrtllm_backend
Even with only 4 concurrent requests and
This is resolved: it looks like enabling strongly-typed mode and disabling multi-block mode worked.
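For anyone hitting the same crash, the fix described above would look roughly like the following at engine-build time. Note this is a sketch: the exact flag names (`--strongly_typed`, `--multi_block_mode`) and their spellings vary across TensorRT-LLM versions, so verify them against `trtllm-build --help` for your installed release; the checkpoint and output paths are placeholders.

```shell
# Hypothetical build invocation (flag names assumed from the 0.8-era CLI;
# check `trtllm-build --help` on your version before copying):
trtllm-build \
    --checkpoint_dir ./codellama_70b_ckpt \
    --output_dir ./codellama_70b_engine \
    --strongly_typed \
    --multi_block_mode disable
```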
System Info
- x86
- 8x H100 80GB
- v0.8.0
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I then use Triton to serve the model and send a few inference requests.
It works for the first few inferences, especially when there are only 1-2 at a time. But after a few inferences, or if we send 4 at a time, it stops working.
Expected behavior
It runs successfully and does not crash for batch sizes up to 32!
Actual behavior
It crashes with an XQA error after a few inferences
codellama_70b_xqa.log
Additional notes
.