-
I've tried to have a closer look at that but I'm getting stuck with the following:
-
I just noticed #1498.
-
I'm currently trying to build tools using llama-cpp-python as the computing platform for several models.
In the end I would like the platform to host multiple open-source models, but also to be able to handle commercial networks.
To do this, and to keep track of the costs, I need to determine the number of tokens in both the prompt and the completion. For normal requests this is trivial, as the information is returned directly with the response.
However, when the response is streamed, this information is essentially lost. Sure, you can easily count the tokens returned in the stream and derive the number of completion tokens from that, but for prompt tokens this gets problematic.
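For completeness, counting completion tokens from the stream looks roughly like this (a minimal sketch, assuming a local llama-cpp-python server at http://localhost:8000 and that each content-bearing chunk corresponds to roughly one token):

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

def stream_and_count(messages):
    """Stream a chat completion and count completion tokens by counting
    content-bearing chunks (assumes roughly one token per chunk, which is
    how the stream tends to arrive, but is not guaranteed in general)."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={"messages": messages, "stream": True},
        stream=True,
    )
    completion_tokens = 0
    parts = []
    for raw in resp.iter_lines():
        # Server-sent events: only lines of the form "data: {...}" matter.
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            completion_tokens += 1
            parts.append(delta["content"])
    return "".join(parts), completion_tokens
```

This gives a usable completion count, but nothing equivalent exists for the prompt side of a streamed chat request.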
I noticed that there is an extras endpoint for token counts, which I thought would be perfect. It works fine for the completions endpoint, but unfortunately it does not handle chat requests properly.
I can of course just stringify the chat request, as sketched below, but that leads to a discrepancy between those token counts and the token counts reported by the chat/completions endpoint.
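The workaround looks roughly like this (a sketch; the exact request and response shapes of the count endpoint are assumptions here). The discrepancy comes from the server applying the model's chat template (role markers, special tokens) to build the real prompt, which the flattened string does not reflect:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

def count_prompt_tokens_naive(messages):
    """Flatten the chat messages to one string and ask the extras endpoint
    for its token count. The result differs from the prompt tokens the
    chat/completions endpoint reports, because no chat template is applied."""
    flattened = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    resp = requests.post(
        f"{BASE_URL}/extras/tokenize/count",
        json={"input": flattened},  # assumption: the endpoint takes {"input": str}
    )
    return resp.json()["count"]     # assumption: response looks like {"count": int}
```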
So my question essentially is:
Would it be possible to allow a ChatCompletionRequest object to be handed to the extras/tokenize/count endpoint instead of just an input string?
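Purely as an illustration of the desired behavior (hypothetical, this is the proposed usage, not something the endpoint supports today):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

# Hypothetical: POST the same body that /v1/chat/completions accepts, so the
# server can apply the chat template before counting and the result matches
# the prompt token count the chat endpoint itself would report.
chat_request = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}
resp = requests.post(f"{BASE_URL}/extras/tokenize/count", json=chat_request)
print(resp.json())  # hoped-for result: {"count": <prompt token count>}
```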