server : Add option to return token pieces in /tokenize endpoint #9108
Also, this will fail if one of the pieces has incomplete unicode bytes. For example:
Response:
For the case where a piece is not valid UTF-8, I have added a fallback: the piece is sent as a list of bytes instead.
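A minimal sketch of how such a fallback could look, not the code from this PR. The names `is_valid_utf8_sketch`, `piece_value`, and `piece_or_bytes` are hypothetical and chosen for illustration only. The validity check is deliberately simplified: it verifies lead and continuation byte structure and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical helper: returns true if `s` consists only of complete,
// structurally well-formed UTF-8 sequences. Simplified on purpose: it does
// not reject overlong encodings or surrogate code points.
bool is_valid_utf8_sketch(const std::string & s) {
    const auto * bytes = reinterpret_cast<const unsigned char *>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        std::size_t len = 0;
        if (bytes[i] <= 0x7F) {
            len = 1;                      // ASCII
        } else if ((bytes[i] & 0xE0) == 0xC0) {
            len = 2;                      // 110xxxxx
        } else if ((bytes[i] & 0xF0) == 0xE0) {
            len = 3;                      // 1110xxxx
        } else if ((bytes[i] & 0xF8) == 0xF0) {
            len = 4;                      // 11110xxx
        } else {
            return false;                 // stray continuation or invalid lead byte
        }
        if (i + len > n) {
            return false;                 // sequence is cut off at the end of the piece
        }
        for (std::size_t k = 1; k < len; ++k) {
            if ((bytes[i + k] & 0xC0) != 0x80) {
                return false;             // expected a 10xxxxxx continuation byte
            }
        }
        i += len;
    }
    return true;
}

// Hypothetical result type: either the piece as text, or its raw byte values.
using piece_value = std::variant<std::string, std::vector<int>>;

// Returns the piece unchanged when it is valid UTF-8, otherwise its bytes.
piece_value piece_or_bytes(const std::string & piece) {
    if (is_valid_utf8_sketch(piece)) {
        return piece_value{piece};
    }
    std::vector<int> raw;
    raw.reserve(piece.size());
    for (unsigned char c : piece) {
        raw.push_back(c);
    }
    return piece_value{std::move(raw)};
}
```

With this sketch, a piece that holds only the first byte of a two-byte character (for example the single byte `0xC3`) comes back as a byte list, while a complete character comes back as a string.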
ping @ngxson
LGTM. Thanks. Let's merge after the CI passes
Small suggestion: we actually already have a version of is_valid_utf8 in process_token; it would be nice to deduplicate the code by reusing that. It can be done in a follow-up PR though.
@mathijshenquet Sorry for the delay. Let's resolve the conflict and merge.
I merged with upstream master. That should fix the CI problem. (Let's merge once CI passes)
…rganov#9108)
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fix windows ci?
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
Description
This PR enhances the /tokenize endpoint by adding an option to return token pieces along with their IDs. This feature allows users to easily understand what each token represents without needing to make additional API calls or perform client-side lookups.
Motivation
Often, users want to know not just the token IDs but also what these tokens represent in the original text. Previously, this required either inefficient additional API calls or complex client-side logic. This change makes it much easier and more efficient to get this information in a single request.
Changes
* Added a with_pieces boolean parameter to the /tokenize endpoint.
* When with_pieces is true, the response includes both token IDs and their corresponding pieces.
* When with_pieces is not specified or is false, the endpoint behaves as before.
Testing
Documentation
Updated the API documentation to include the new with_pieces parameter and to show example responses for both cases (with and without pieces).
NB: I wasn't able to run the CI pipeline locally as I'm currently on Windows.
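The two response shapes described above (with and without pieces) can be sketched roughly as follows. This is an illustration under stated assumptions, not the server's implementation: the "tokens" key, the "id" and "piece" field names, the `token_piece` struct, the `tokenize_response_sketch` function, and the token IDs used in any call are all assumptions or hypothetical names that the text above does not confirm. The sketch also skips JSON escaping, so it only handles pieces without quotes, backslashes, or control characters.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical pairing of a token ID with its text piece.
struct token_piece {
    int         id;
    std::string piece;
};

// Builds a JSON-like response body for a tokenize request.
// ASSUMPTION: the "tokens", "id", and "piece" names are illustrative only.
// No JSON escaping is performed, so pieces must be plain text without
// quotes, backslashes, or control characters.
std::string tokenize_response_sketch(const std::vector<token_piece> & toks,
                                     bool with_pieces) {
    std::ostringstream out;
    out << "{\"tokens\":[";
    for (std::size_t i = 0; i < toks.size(); ++i) {
        if (i > 0) {
            out << ",";
        }
        if (with_pieces) {
            // with_pieces == true: each entry carries both the ID and the piece
            out << "{\"id\":" << toks[i].id
                << ",\"piece\":\"" << toks[i].piece << "\"}";
        } else {
            // with_pieces false or absent: plain list of IDs, as before
            out << toks[i].id;
        }
    }
    out << "]}";
    return out.str();
}
```

Keeping the default output as a plain list of IDs means existing clients would not need to change, which matches the statement that the endpoint behaves as before when with_pieces is false or omitted.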