omit the leading space on the first token #89

Closed · wants to merge 1 commit

Conversation

@kroggen (Contributor) commented Jul 26, 2023

No description provided.

@abnerguzman

Various tokens in the vocabulary begin with a 0x20 (space) character, so some custom decoding logic is probably needed, e.g. at the beginning of a paragraph. Here is the token commonly sampled first (' Once') from tokenizer.bin:

[Screenshot: the token ' Once' as stored in tokenizer.bin, with its leading space]
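
For reference, a quick way to confirm this from Python (a sketch, assuming the Llama tokenizer.model is in the working directory and the sentencepiece package is installed):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# '▁' (U+2581) is sentencepiece's visible marker for a leading space.
n_space = sum(sp.id_to_piece(i).startswith('▁') for i in range(sp.vocab_size()))
print(n_space, 'of', sp.vocab_size(), 'pieces start with the space marker')
print(sp.id_to_piece(9038))  # '▁Once', the piece commonly sampled first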

@kroggen (Contributor, Author) commented Jul 26, 2023

Fixes #41
Fixes #76

@abnerguzman

It's not only position zero that would need the removal. The sentencepiece logic is here:
https://github.com/google/sentencepiece/blob/635fe8423a249b6e081aacd290d8aef7476c6a28/src/sentencepiece_processor.cc#L786

@karpathy (Owner)

Still a bit confused about why sentencepiece even needs to do this, or how it works. In the GPT-style BPE world I'm used to, there is no need for special postprocessing like this, stripping whitespace in special cases.

@kroggen (Contributor, Author) commented Jul 26, 2023

It appears that BPE vocabularies include tokens without a leading space for frequent words. So when decoding the first word, the transformer itself will choose the token without the space, because it was trained to do so.

But models that use WordPiece do not "see" the spaces, because there is only one token for each subword, so they cannot learn to emit the first word without a leading space.
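
To illustrate the GPT-style BPE behavior, a small sketch using tiktoken (not part of this repo; installed separately):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
# GPT-2's BPE keeps separate tokens for a word with and without the leading space,
# so a model can pick the space-less form at the start of a text.
print(enc.encode("Once"))
print(enc.encode(" Once"))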

@abnerguzman

By default, the SentencePiece implementation adds whitespace to the beginning of the text during preprocessing, besides removing leading, trailing, and duplicate internal whitespace. See --add_dummy_prefix and --remove_extra_whitespaces here:
https://github.com/google/sentencepiece/blob/master/doc/options.md

The decoding code removes the whitespace (if present) from the piece that follows BOS.

I'm not certain about this choice either. The SentencePiece implementation uses whitespace to differentiate between subwords that are a continuation of a word and subwords that are not (at least for some languages). It seems someone found it advantageous to use the same id for a subword at the beginning of the text as for the same subword elsewhere in the text.
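
The dummy prefix is easy to see directly (a sketch, assuming tokenizer.model and a recent sentencepiece Python package):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
print(sp.encode("Once", out_type=str))  # ['▁Once']: a space marker was added even though the input has none
print(sp.decode(sp.encode("Once")))     # 'Once': the decoder strips it again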

@karpathy (Owner)

Sigh, sentencepiece 🤦. Let's not worry about this whitespace for now; it just confuses everything. Maybe we'll come back around to it later.

@karpathy closed this Jul 27, 2023
@kroggen (Contributor, Author) commented Jul 27, 2023

It is not a bug; it is a property of Unigram-based tokenizers.

If the word "Once" is tokenized as " Once" and there is no other version of it, then we need to remove the space when outputting the first word.

This applies to all word prefixes and small words.

The tokenizer library in Python just does this automatically, so we do not see it.

Look at the example at the very end here:
https://huggingface.co/learn/nlp-course/chapter6/7

@kroggen (Contributor, Author) commented Jul 27, 2023

This shows how the tokenizer works:

$ python3
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
>>>
>>> sp.encode("Hello world!")
[15043, 3186, 29991]
>>> sp.id_to_piece([15043])
['▁Hello']
>>> sp.id_to_piece([3186])
['▁world']
>>> sp.id_to_piece([29991])
['!']
>>> sp.decode([15043, 3186, 29991])
'Hello world!'
>>>
>>> sp.encode("Once upon a time")
[9038, 2501, 263, 931]
>>> sp.id_to_piece([9038])
['▁Once']
>>> sp.id_to_piece([2501])
['▁upon']
>>> sp.id_to_piece([263])
['▁a']
>>> sp.id_to_piece([931])
['▁time']
>>> sp.decode([9038, 2501, 263, 931])
'Once upon a time'

@kroggen (Contributor, Author) commented Jul 27, 2023

So we must remove the space.

It can be done in either of two places:

  1. On the first output token (sufficient when generation stops on EOS or BOS)
  2. On the token right after BOS (needed if the code does not stop on BOS or EOS; see the sketch below)
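
A minimal sketch of option 2 in Python (the actual change belongs in run.c; this only mirrors the logic, assuming tokenizer.model is available):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
BOS = sp.bos_id()  # 1 for the Llama tokenizer

def decode_stream(token_ids):
    # Emit pieces one by one, dropping the leading space of the piece
    # that immediately follows BOS (option 2 above).
    out = []
    prev = None
    for tid in token_ids:
        if sp.is_control(tid):   # skip BOS/EOS themselves
            prev = tid
            continue
        piece = sp.id_to_piece(tid).replace('▁', ' ')
        if prev == BOS and piece.startswith(' '):
            piece = piece[1:]
        out.append(piece)
        prev = tid
    return ''.join(out)

print(decode_stream([1, 9038, 2501, 263, 931]))  # 'Once upon a time'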

@karpathy (Owner)

Thanks for the example. Any idea why the preprocessing even adds these spaces?
Alternatively, is it known somewhere how Meta trained its sentencepiece model? i.e. the launch command.

@karpathy reopened this Jul 27, 2023
@atamurad (Contributor) commented Jul 27, 2023

Alternatively, is it known somewhere how Meta trained its sentencepiece model? i.e. the launch command.

I was able to extract the training and normalizer flags/parameters from tokenizer.model (by decoding the protobuf message).

trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}

Snippet to decode/print the above params from the model file:

import sentencepiece.sentencepiece_model_pb2

# Parse the raw protobuf stored in tokenizer.model and print its training/normalizer specs.
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)

@karpathy (Owner) commented Jul 27, 2023

!!! @atamurad super helpful

So, yeah, in particular:
add_dummy_prefix: true

@karpathy (Owner)

I found this string on the internet:

The vocabulary size is set to 32,000. A add_dummy_prefix option is set to True because words are not separated by whitespaces in Japanese.

I don't really understand this sentence, how this option fixes Japanese, or why it exists.

@karpathy (Owner)

I pushed a fix for this, and also fixed a bug in the current PR, which would only have done the removal at pos=0 instead of right after BOS.
e5752e1

@karpathy closed this Jul 27, 2023
@BeBornTo commented Jul 30, 2023

>>> sp.decode([9038, 2501, 263, 931])
'Once upon a time'

I would add one more example to your explanation:

>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time")]
['▁Once', '▁upon', '▁a', '▁time']
>>> sp.encode("Once upon a time ")
[9038, 2501, 263, 931, 29871]
>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time ")]
['▁Once', '▁upon', '▁a', '▁time', '▁']
>>> sp.decode(sp.encode("Once upon a time "))
'Once upon a time '
>>> sp.decode(sp.encode(" Once upon a time"))
'Once upon a time'
