I plan to run llama-server locally with the CUDA backend to serve autocomplete requests, and I only care about the most recently sent request. Is there a way to configure llama-server to cancel all outstanding requests when a new one is received? Perhaps I could set up a drop-head queue with a maximum size of 1. Best of all would be to actually cancel the CUDA work before decoding finishes.

Replies: 1 comment

Unfortunately, there is no cancellation support implemented yet.
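
Since there is no server-side cancellation, the closest client-side approximation of "only the last request matters" is to abort the previous in-flight HTTP request before issuing a new one. Below is a minimal TypeScript sketch of that pattern, assuming llama-server is listening on its usual default of http://localhost:8080 and that the plain /completion endpoint is used; the payload fields and response shape shown are assumptions to adapt to your setup. Note that aborting the connection only frees the client; it is not guaranteed to stop decoding on the GPU.

```ts
// Minimal "last request wins" sketch for an autocomplete client talking to
// llama-server. Assumes the server is at http://localhost:8080 and exposes
// the /completion endpoint; adjust the URL and payload for your setup.

let inflight: AbortController | null = null;

export async function complete(prompt: string): Promise<string | null> {
  // Abort the previous request before sending a new one. This only releases
  // the client; the server may still finish decoding the superseded request,
  // since llama-server has no cancellation support.
  inflight?.abort();
  const controller = new AbortController();
  inflight = controller;

  try {
    const res = await fetch("http://localhost:8080/completion", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt, n_predict: 64, temperature: 0 }),
      signal: controller.signal,
    });
    const data = await res.json();
    return data.content ?? null; // non-streaming /completion returns the text in "content"
  } catch (err) {
    if ((err as Error).name === "AbortError") return null; // superseded by a newer request
    throw err;
  } finally {
    if (inflight === controller) inflight = null;
  }
}
```

In practice this is usually paired with a short debounce on keystrokes, which limits both the number of requests and the amount of superseded work the server grinds through.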