The generated text has a leading space because of the vocab #41

Zardinality · 2023-07-24T16:57:44Z

I debugged this down to the first token that is emitted:

And I confirmed it by using the Python model directly:

Clearly the tokenizer model counts ▁Once, rather than Once, as a token. Note that sentencepiece uses ▁ to represent a space.
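
To make the effect concrete, here is a minimal sketch of what happens when pieces are printed one by one with ▁ mapped back to a space. The piece list and the `naive_detokenize` helper are hypothetical, made up for illustration; they are not read from the real vocab or taken from this repo.

```python
# Hypothetical example: these pieces are illustrative, not read from the real vocab.
WS = "\u2581"  # "▁", the meta symbol sentencepiece uses to mark a space


def naive_detokenize(pieces):
    """Concatenate pieces and map the meta symbol back to a plain space."""
    return "".join(pieces).replace(WS, " ")


pieces = [WS + "Once", WS + "upon", WS + "a", WS + "time"]
text = naive_detokenize(pieces)
print(repr(text))  # the result starts with a space because the first piece is "▁Once"
```

Since the very first piece already carries the ▁ marker, any decoder that prints pieces verbatim will start the output with a space.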


karpathy · 2023-07-24T21:17:50Z

Yes, the vocab itself has this space. But somehow when you ask sentencepiece to decode a token sequence, it doesn't print this leading space.

Zardinality · 2023-07-25T16:20:28Z

I am not familiar with common tokenizer practices, but it seems sentencepiece has a patch for this scenario in its decoding: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.cc#L788 . There might be a good reason for it.
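
A rough sketch of the same idea for a decoder that prints one piece at a time: drop the leading space of the first piece only. This is just an illustration under my own assumptions (the `stream_pieces` helper and the piece list are hypothetical), not sentencepiece's actual implementation and not necessarily the fix applied in this repo.

```python
WS = "\u2581"  # "▁", the meta symbol sentencepiece uses to mark a space


def stream_pieces(pieces):
    """Yield printable text per piece, dropping the leading space of the first one.

    Hypothetical helper: it only mimics the observable behaviour of a full
    decode, which does not print the leading space.
    """
    for i, piece in enumerate(pieces):
        text = piece.replace(WS, " ")
        if i == 0 and text.startswith(" "):
            text = text[1:]  # strip only the single leading space
        yield text


# Hypothetical pieces, not read from the real vocab.
out = "".join(stream_pieces([WS + "Once", WS + "upon", WS + "a", WS + "time"]))
print(repr(out))
```

Only the first piece is touched, so the spaces between later words are preserved.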

karpathy · 2023-08-14T15:08:17Z

This was fixed a while back.

kroggen mentioned this issue Jul 26, 2023

omit the leading space on the first token #89

Closed

karpathy closed this as completed Aug 14, 2023
