-
I've tried to have a closer look at that but I'm getting stuck with the following:
-
I just noticed #1498.
-
I'm currently trying to build tools using llama-cpp-python as the computing platform for several models.
In the end I would like the platform to host multiple open-source models, but also to be able to handle commercial networks.
To do this, and to keep track of the costs, I need to determine the number of tokens in both the prompt and the completion. For normal requests this is trivial, as the information is returned directly with the response.
However, when the response is streamed, this information is essentially lost. Sure, you can easily count the tokens returned in the stream and derive the number of completion tokens from that, but for prompt tokens this gets problematic.
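For completeness, counting completion tokens from the stream looks roughly like this (a minimal sketch, assuming a local llama-cpp-python server at http://localhost:8000 and that each content-bearing chunk corresponds to roughly one token):

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

def stream_and_count(messages):
    """Stream a chat completion and count completion tokens by counting
    content-bearing chunks (assumes roughly one token per chunk, which is
    how the stream tends to arrive, but is not guaranteed in general)."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={"messages": messages, "stream": True},
        stream=True,
    )
    completion_tokens = 0
    parts = []
    for raw in resp.iter_lines():
        # Server-sent events: only lines of the form "data: {...}" matter.
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            completion_tokens += 1
            parts.append(delta["content"])
    return "".join(parts), completion_tokens
```

This gives a usable completion count, but nothing equivalent exists for the prompt side of a streamed chat request.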
I noticed that there is an extras endpoint for token counts, which I thought would be perfect. It works fine for the completions endpoint, but unfortunately it does not handle chat requests properly.
I can of course just stringify the chat request, as sketched below, but that leads to a discrepancy between those token counts and the token counts reported by the chat/completions endpoint.
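The workaround looks roughly like this (a sketch; the exact request and response shapes of the count endpoint are assumptions here). The discrepancy comes from the server applying the model's chat template (role markers, special tokens) to build the real prompt, which the flattened string does not reflect:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

def count_prompt_tokens_naive(messages):
    """Flatten the chat messages to one string and ask the extras endpoint
    for its token count. The result differs from the prompt tokens the
    chat/completions endpoint reports, because no chat template is applied."""
    flattened = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    resp = requests.post(
        f"{BASE_URL}/extras/tokenize/count",
        json={"input": flattened},  # assumption: the endpoint takes {"input": str}
    )
    return resp.json()["count"]     # assumption: response looks like {"count": int}
```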
So my question essentially is:
Would it be possible to allow a ChatCompletionRequest object to be handed to the extras/tokenize/count endpoint instead of just an input string?
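Purely as an illustration of the desired behavior (hypothetical, this is the proposed usage, not something the endpoint supports today):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: a local llama-cpp-python server

# Hypothetical: POST the same body that /v1/chat/completions accepts, so the
# server can apply the chat template before counting and the result matches
# the prompt token count the chat endpoint itself would report.
chat_request = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}
resp = requests.post(f"{BASE_URL}/extras/tokenize/count", json=chat_request)
print(resp.json())  # hoped-for result: {"count": <prompt token count>}
```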