@Ubospica Ubospica commented Dec 13, 2023

This PR adds the following methods to the Tokenizer class so that callers can query the tokenizer's vocabulary. They support downstream uses such as stop-string checking and grammar checking.

  /*!
   * \brief Returns the vocabulary size. Special tokens are considered.
   */
  virtual size_t GetVocabSize() = 0;
  /*!
   * \brief Convert the given id to its corresponding token if it exists. If not, return an
   * empty string.
   */
  virtual std::string IdToToken(int32_t token_id) = 0;
  /*!
   * \brief Convert the given token to its corresponding id if it exists. If not, return -1.
   */
  virtual int32_t TokenToId(const std::string& token) = 0;
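
As a rough illustration of how the three methods fit together, here is a minimal sketch built on a toy in-memory tokenizer. `VocabQuery`, `ToyTokenizer`, and `TokensContaining` are hypothetical names made up for this example and are not part of the PR; only the three method signatures are taken from the declarations above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the abstract interface; only the three
// methods introduced in this PR are reproduced here.
class VocabQuery {
 public:
  virtual ~VocabQuery() = default;
  virtual size_t GetVocabSize() = 0;
  virtual std::string IdToToken(int32_t token_id) = 0;
  virtual int32_t TokenToId(const std::string& token) = 0;
};

// Toy tokenizer backed by an in-memory table (illustration only).
class ToyTokenizer : public VocabQuery {
 public:
  explicit ToyTokenizer(std::vector<std::string> tokens)
      : id_to_token_(std::move(tokens)) {
    for (size_t i = 0; i < id_to_token_.size(); ++i) {
      token_to_id_[id_to_token_[i]] = static_cast<int32_t>(i);
    }
  }

  size_t GetVocabSize() override { return id_to_token_.size(); }

  // Returns an empty string when the id is out of range.
  std::string IdToToken(int32_t token_id) override {
    if (token_id < 0 || static_cast<size_t>(token_id) >= id_to_token_.size()) {
      return "";
    }
    return id_to_token_[token_id];
  }

  // Returns -1 when the token is not in the vocabulary.
  int32_t TokenToId(const std::string& token) override {
    auto it = token_to_id_.find(token);
    return it == token_to_id_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> id_to_token_;
  std::unordered_map<std::string, int32_t> token_to_id_;
};

// Example downstream use: collect every token id whose text contains
// a given (non-empty) stop string.
std::vector<int32_t> TokensContaining(VocabQuery& tok, const std::string& stop) {
  assert(!stop.empty());
  std::vector<int32_t> ids;
  for (size_t i = 0; i < tok.GetVocabSize(); ++i) {
    const int32_t id = static_cast<int32_t>(i);
    if (tok.IdToToken(id).find(stop) != std::string::npos) {
      ids.push_back(id);
    }
  }
  return ids;
}
```

A stop-string checker could precompute such an id list once per stop string and then test each generated token id against it, rather than decoding text at every step.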

Tokenizer build time:

Tokenizer: SentencePiece
Load time: 5 ms

Tokenizer: Huggingface
Load time: 30 ms

Tokenizer: RWKVWorld
Load time: 113 ms

@Ubospica force-pushed the main-dev/2023-12-13-vocab branch from 7abe788 to 89b15c2 on December 13, 2023 07:14
  cd dist
  if [ ! -f "tokenizer.model" ]; then
-   wget https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model
+   wget https://huggingface.co/lmsys/vicuna-7b-v1.5/resolve/main/tokenizer.model

wget cannot download from decapoda-research/llama-7b-hf without logging in, whereas vicuna-7b-v1.5 has no such restriction.

tqchen commented Dec 13, 2023

Let us call id_to_token directly; see the related APIs.

This would avoid the post-processing done by the decode pipeline.
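
To make the difference concrete, here is a small sketch of how a decode step can alter a token's text. It assumes a SentencePiece-style vocabulary in which U+2581 marks a leading space; `kVocab`, `ToyIdToToken`, and `ToyDecode` are hypothetical and greatly simplified, not the actual decode pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy vocabulary. "\xE2\x96\x81" is the UTF-8 encoding of
// U+2581, which SentencePiece-style vocabularies use to mark a space.
const std::vector<std::string> kVocab = {
    "<unk>", "\xE2\x96\x81" "Hello", "\xE2\x96\x81" "world"};

// Raw lookup: returns the stored token unchanged, or "" for an unknown id.
std::string ToyIdToToken(int id) {
  if (id < 0 || static_cast<std::size_t>(id) >= kVocab.size()) return "";
  return kVocab[id];
}

// Toy decode: concatenates the tokens, then post-processes the result by
// replacing each U+2581 marker with ' ' and dropping one leading space.
std::string ToyDecode(const std::vector<int>& ids) {
  const std::string marker = "\xE2\x96\x81";
  std::string out;
  for (int id : ids) out += ToyIdToToken(id);
  std::size_t pos = 0;
  while ((pos = out.find(marker, pos)) != std::string::npos) {
    out.replace(pos, marker.size(), " ");
    pos += 1;
  }
  if (!out.empty() && out[0] == ' ') out.erase(0, 1);
  return out;
}
```

In this sketch, decoding the single id 1 yields "Hello", while the raw lookup keeps the marker in front of it. Vocabulary queries for grammar or stop-string checks generally want the raw form.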

tqchen commented Dec 13, 2023

For the Rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31
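
The "store the result in the wrapper" pattern could look roughly like the sketch below. Everything here (`ToyWrapper`, `toy_id_to_token`, `toy_get_result_str`, `IdToTokenViaC`) is a hypothetical stand-in written in C++ for illustration; it is not the actual Rust wrapper or the function declared in `tokenizers_c.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wrapper: owns the vocabulary plus a buffer holding the most
// recent result, so the pointer handed to the caller stays valid until the
// next call overwrites it.
struct ToyWrapper {
  std::vector<std::string> vocab;
  std::string last_result;
};

// Hypothetical C-style entry point: looks up the token and stores it in the
// wrapper instead of returning it. Unknown ids store an empty string.
void toy_id_to_token(ToyWrapper* w, int32_t id) {
  if (id < 0 || static_cast<size_t>(id) >= w->vocab.size()) {
    w->last_result.clear();
  } else {
    w->last_result = w->vocab[id];
  }
}

// Hypothetical getter shared by all string-returning calls: exposes the
// stored result as a pointer and a length.
void toy_get_result_str(ToyWrapper* w, const char** data, size_t* len) {
  *data = w->last_result.data();
  *len = w->last_result.size();
}

// C++ side: run the lookup, then copy the result out of the wrapper.
std::string IdToTokenViaC(ToyWrapper* w, int32_t id) {
  toy_id_to_token(w, id);
  const char* data = nullptr;
  size_t len = 0;
  toy_get_result_str(w, &data, &len);
  return std::string(data, len);
}
```

The appeal of this design is that one getter serves every call that produces a string, so the C interface does not need a separate allocate-and-free pair for each one.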

tqchen commented Dec 13, 2023

std::string IdToToken(int32_t token_id);

@Ubospica force-pushed the main-dev/2023-12-13-vocab branch from f7b2248 to e0901e6 on December 18, 2023 07:48
@Ubospica

cc @tqchen

@Ubospica force-pushed the main-dev/2023-12-13-vocab branch from e0901e6 to 159e8e8 on December 19, 2023 06:47
@Ubospica

cc @tqchen

@Ubospica force-pushed the main-dev/2023-12-13-vocab branch from 17a82d6 to 8d8a323 on December 19, 2023 20:01
@tqchen merged commit 27dbe17 into mlc-ai:main on Dec 19, 2023