I plan to run llama-server locally with the CUDA backend to serve autocomplete requests, and I only care about the most recently sent request. Is there a way to configure llama-server to cancel all outstanding requests when a new one is received? Perhaps I could set up a drop-head queue with a maximum size of 1. Best of all would be to actually cancel the CUDA work before decoding finishes.

Replies: 1 comment

Unfortunately, there is no cancellation support implemented yet.
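
Since there is no server-side cancellation, the closest client-side approximation of "only the last request matters" is to abort the previous in-flight HTTP request before issuing a new one. Below is a minimal TypeScript sketch of that pattern, assuming llama-server is listening on its usual default of http://localhost:8080 and that the plain /completion endpoint is used; the payload fields and response shape shown are assumptions to adapt to your setup. Note that aborting the connection only frees the client; it is not guaranteed to stop decoding on the GPU.

```ts
// Minimal "last request wins" sketch for an autocomplete client talking to
// llama-server. Assumes the server is at http://localhost:8080 and exposes
// the /completion endpoint; adjust the URL and payload for your setup.

let inflight: AbortController | null = null;

export async function complete(prompt: string): Promise<string | null> {
  // Abort the previous request before sending a new one. This only releases
  // the client; the server may still finish decoding the superseded request,
  // since llama-server has no cancellation support.
  inflight?.abort();
  const controller = new AbortController();
  inflight = controller;

  try {
    const res = await fetch("http://localhost:8080/completion", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt, n_predict: 64, temperature: 0 }),
      signal: controller.signal,
    });
    const data = await res.json();
    return data.content ?? null; // non-streaming /completion returns the text in "content"
  } catch (err) {
    if ((err as Error).name === "AbortError") return null; // superseded by a newer request
    throw err;
  } finally {
    if (inflight === controller) inflight = null;
  }
}
```

In practice this is usually paired with a short debounce on keystrokes, which limits both the number of requests and the amount of superseded work the server grinds through.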