SentencePieceProcessor vs C [Enhancement] #9

kris-jusiak · 2023-07-23T19:37:35Z

To run inference in plain C, without the SentencePieceProcessor, wouldn't one easy way be to save the id-to-token mapping in model.bin?

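from sentencepiece import SentencePieceProcessor

# tokenizer_model is the path to the SentencePiece tokenizer .model file
tokenizer = SentencePieceProcessor(model_file=tokenizer_model)
vocab = [tokenizer.id_to_piece(id) for id in range(tokenizer.get_piece_size())]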

and then use a plain array to look up the token for a given id:

/* vocab[id] holds the token string for each id, loaded at startup */
const char *decode(int id) {
  return vocab[id];
}

That would make run_wrap.py unnecessary, and everything would be in pure C (kinda).
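
To make the idea concrete, here is a minimal sketch of what loading such a table in C could look like. The file layout is an assumption made up for illustration (an int32 token count, then for each token an int32 byte length followed by the raw bytes, all in native byte order); it is not the actual model.bin format. The names Vocab, vocab_write, vocab_read, vocab_free and vocab_decode are hypothetical as well. The writer is included only so the sketch can round-trip on its own; in practice the export would happen on the Python side from the vocab list above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical on-disk layout (NOT the real model.bin format):
 *   int32 n, then n records of: int32 len, len raw bytes (no terminator). */

typedef struct {
    int32_t size;   /* number of tokens */
    char **pieces;  /* pieces[id] is a NUL-terminated token string */
} Vocab;

/* Write the table in the assumed layout. Returns 0 on success, -1 on error. */
int vocab_write(const char *path, const char *const *pieces, int32_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&n, sizeof n, 1, f) == 1;
    for (int32_t i = 0; ok && i < n; i++) {
        int32_t len = (int32_t)strlen(pieces[i]);
        ok = fwrite(&len, sizeof len, 1, f) == 1 &&
             (len == 0 || fwrite(pieces[i], 1, (size_t)len, f) == (size_t)len);
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Release a table returned by vocab_read. Safe to call with NULL. */
void vocab_free(Vocab *v) {
    if (!v) return;
    if (v->pieces) {
        for (int32_t i = 0; i < v->size; i++) free(v->pieces[i]);
        free(v->pieces);
    }
    free(v);
}

/* Read the table back. Returns NULL on any I/O or allocation error. */
Vocab *vocab_read(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    Vocab *v = calloc(1, sizeof *v);
    int32_t n = 0;
    if (!v || fread(&n, sizeof n, 1, f) != 1 || n < 0) goto fail;
    /* +1 so that an empty table does not request a zero-sized block */
    v->pieces = calloc((size_t)n + 1, sizeof *v->pieces);
    if (!v->pieces) goto fail;
    v->size = n;
    for (int32_t i = 0; i < n; i++) {
        int32_t len = 0;
        if (fread(&len, sizeof len, 1, f) != 1 || len < 0) goto fail;
        v->pieces[i] = malloc((size_t)len + 1);
        if (!v->pieces[i]) goto fail;
        if (len > 0 && fread(v->pieces[i], 1, (size_t)len, f) != (size_t)len)
            goto fail;
        v->pieces[i][len] = '\0';
    }
    fclose(f);
    return v;
fail:
    fclose(f);
    vocab_free(v);
    return NULL;
}

/* The lookup itself: an array index with a bounds check.
 * Returns NULL for an id outside the table. */
const char *vocab_decode(const Vocab *v, int32_t id) {
    assert(v != NULL);
    if (id < 0 || id >= v->size) return NULL;
    return v->pieces[id];
}
```

With a table loaded once at startup, decoding a sampled token is just vocab_decode(v, id) followed by a printf of the returned string. Length-prefixed records are used here instead of NUL-separated strings so that the reader knows each allocation size up front; that is a design choice of this sketch, not something the proposal requires.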

trholding · 2023-07-23T20:06:56Z

karpathy · 2023-08-14T15:04:29Z

Closing old issue. This has been implemented in run.c for a while with ASCII, and UTF-8 support is about to merge in #226.

ggerganov mentioned this issue Jul 23, 2023

Emscripten build (demo, quick and dirty) #12

Draft

karpathy closed this as completed Aug 14, 2023
