Tokenizer in C #15

python273 · 2023-07-23T21:32:32Z

Very simple. It doesn't remove leading spaces. Might be a good idea to test with weird Unicode?

karpathy · 2023-07-23T22:40:19Z

Very interesting!! Will take a look tonight thank you!

python273 · 2023-07-23T22:48:31Z

Looks like #12 also has a tokenizer implementation 😅

karpathy · 2023-07-24T01:30:24Z

@python273 do you understand where #12 gets the vocab.bin file from? presumably a similar export script?

karpathy · 2023-07-24T01:36:15Z

export_tokenizer.py

+        t = '\n<s>\n'
+    elif i == eos_id:
+        t = '\n</s>\n'
+    elif len(t) == 6 and t.startswith('<0x') and t.endswith('>'):


??? some comments around here could be nice. i haven't dug into sp too much but this looks odd
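
For context on the `<0x` check: SentencePiece vocabularies with byte fallback contain pieces such as `<0x0A>`, each standing for one raw byte, which is why the condition tests for exactly six characters. The diff above is truncated, so what the branch actually does is not shown here. The following is only a sketch of one plausible reading, and `piece_to_bytes` is a hypothetical helper name, not something from the PR.

```python
def piece_to_bytes(piece: str) -> bytes:
    """Hypothetical helper: map a vocabulary piece to the bytes it stands for."""
    # Byte-fallback pieces are exactly 6 characters: '<0x', two hex digits, '>'.
    if len(piece) == 6 and piece.startswith('<0x') and piece.endswith('>'):
        try:
            return bytes([int(piece[3:5], 16)])
        except ValueError:
            pass  # not valid hex, so treat it as an ordinary piece below
    # Ordinary pieces: SentencePiece marks a space with U+2581 ('▁').
    return piece.replace('\u2581', ' ').encode('utf-8')


print(piece_to_bytes('<0x0A>'))
print(piece_to_bytes('\u2581hello'))
```

Returning bytes rather than `str` matters because a single fallback byte may be only part of a multi-byte UTF-8 character.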

karpathy · 2023-07-24T03:49:05Z

@python273 I got it working. What are the leading spaces? How does SentencePiece handle this 🤔

karpathy · 2023-07-24T04:04:15Z

Solved here I think.
3bfa566

ty!!

karpathy · 2023-07-24T04:04:44Z

(I didn't solve the leading space issue... leaving it there for now, will look later into what sentencepiece does with this)
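
On the leading space: as far as I understand it, SentencePiece by default prepends a dummy whitespace to the input (the `add_dummy_prefix` normalizer option), so the first piece usually starts with `▁` (U+2581) and naive concatenation yields a leading space. A rough sketch of a decoder that undoes this, assuming that default is in effect; `decode_pieces` is a hypothetical name and this is not the code from this PR:

```python
def decode_pieces(pieces):
    """Hypothetical decoder: join pieces and undo the assumed dummy prefix."""
    # U+2581 is the SentencePiece whitespace marker.
    text = ''.join(pieces).replace('\u2581', ' ')
    # Drop only the single space that the dummy prefix would have introduced.
    return text[1:] if text.startswith(' ') else text


print(decode_pieces(['\u2581Hello', '\u2581world']))
```

Stripping exactly one space, rather than calling `lstrip()`, keeps any whitespace that was really in the original text.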

Tokenizer in C

f40b4e9

yolo shorter

a2d056e

karpathy reviewed Jul 24, 2023


karpathy closed this Jul 24, 2023
