omit the leading space on the first token #89

Closed · wants to merge 1 commit

Conversation

@kroggen (Contributor) commented Jul 26, 2023

No description provided.

@abnerguzman

Various tokens in the vocabulary begin with a 0x20 (space) character, so some custom decoding logic is probably needed, e.g. at the beginning of a paragraph. Here is the token commonly sampled first (' Once') from tokenizer.bin:

[Screenshot: the token ' Once' as stored in tokenizer.bin, with its leading space]
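
For reference, a quick way to confirm this from Python (a sketch, assuming the Llama tokenizer.model is in the working directory and the sentencepiece package is installed):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# '▁' (U+2581) is sentencepiece's visible marker for a leading space.
n_space = sum(sp.id_to_piece(i).startswith('▁') for i in range(sp.vocab_size()))
print(n_space, 'of', sp.vocab_size(), 'pieces start with the space marker')
print(sp.id_to_piece(9038))  # '▁Once', the piece commonly sampled first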

@kroggen (Contributor, Author) commented Jul 26, 2023

Fixes #41
Fixes #76

@abnerguzman

It's not only position zero that would need the removal. The sentencepiece logic is here:
https://github.com/google/sentencepiece/blob/635fe8423a249b6e081aacd290d8aef7476c6a28/src/sentencepiece_processor.cc#L786

@karpathy (Owner)

Still a bit confused about why sentencepiece even needs to do this, or how it works. In the GPT-style BPE world I'm used to, there is no need for special postprocessing like this, stripping whitespace in special cases.

@kroggen (Contributor, Author) commented Jul 26, 2023

It appears that BPE vocabularies include tokens without a leading space for frequent words. So when decoding the first word, the transformer itself will choose the token without the space, because it was trained to do so.

But models that use WordPiece do not "see" the spaces, because there is only one token for each subword, so they cannot learn to emit the first word without a leading space.
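
To illustrate the GPT-style BPE behavior, a small sketch using tiktoken (not part of this repo; installed separately):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
# GPT-2's BPE keeps separate tokens for a word with and without the leading space,
# so a model can pick the space-less form at the start of a text.
print(enc.encode("Once"))
print(enc.encode(" Once"))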

@abnerguzman

By default, the SentencePiece implementation adds whitespace to the beginning of the text during preprocessing, besides removing leading, trailing, and duplicate internal whitespace. See --add_dummy_prefix and --remove_extra_whitespaces here:
https://github.com/google/sentencepiece/blob/master/doc/options.md

The decoding code removes the whitespace (if present) from the piece that follows BOS.

I'm not certain about this choice either. The SentencePiece implementation uses whitespace to differentiate between subwords that are a continuation of a word and subwords that are not (at least for some languages). It seems someone found it advantageous to use the same id for a subword at the beginning of the text as for the same subword elsewhere in the text.
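
The dummy prefix is easy to see directly (a sketch, assuming tokenizer.model and a recent sentencepiece Python package):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
print(sp.encode("Once", out_type=str))  # ['▁Once']: a space marker was added even though the input has none
print(sp.decode(sp.encode("Once")))     # 'Once': the decoder strips it again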

@karpathy (Owner)

Sigh, sentencepiece 🤦. Let's not worry about this whitespace for now; it just confuses everything. Maybe we'll come back around to it later.

@karpathy closed this Jul 27, 2023
@kroggen (Contributor, Author) commented Jul 27, 2023

It is not a bug; it is a property of Unigram-based tokenizers.

If the word "Once" is tokenized as " Once" and there is no other version of it, then we need to remove the space when outputting the first word.

This applies to all word prefixes and small words.

The tokenizer library in Python just does this automatically, so we do not see it.

Look at the example at the very end here:
https://huggingface.co/learn/nlp-course/chapter6/7

@kroggen (Contributor, Author) commented Jul 27, 2023

This shows how the tokenizer works:

$ python3
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
>>>
>>> sp.encode("Hello world!")
[15043, 3186, 29991]
>>> sp.id_to_piece([15043])
['▁Hello']
>>> sp.id_to_piece([3186])
['▁world']
>>> sp.id_to_piece([29991])
['!']
>>> sp.decode([15043, 3186, 29991])
'Hello world!'
>>>
>>> sp.encode("Once upon a time")
[9038, 2501, 263, 931]
>>> sp.id_to_piece([9038])
['▁Once']
>>> sp.id_to_piece([2501])
['▁upon']
>>> sp.id_to_piece([263])
['▁a']
>>> sp.id_to_piece([931])
['▁time']
>>> sp.decode([9038, 2501, 263, 931])
'Once upon a time'

@kroggen (Contributor, Author) commented Jul 27, 2023

So we must remove the space.

It can be done in either of two places:

  1. On the first output token (sufficient when generation stops on EOS or BOS)
  2. On the token right after BOS (needed if the code does not stop on BOS or EOS; see the sketch below)
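
A minimal sketch of option 2 in Python (the actual change belongs in run.c; this only mirrors the logic, assuming tokenizer.model is available):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
BOS = sp.bos_id()  # 1 for the Llama tokenizer

def decode_stream(token_ids):
    # Emit pieces one by one, dropping the leading space of the piece
    # that immediately follows BOS (option 2 above).
    out = []
    prev = None
    for tid in token_ids:
        if sp.is_control(tid):   # skip BOS/EOS themselves
            prev = tid
            continue
        piece = sp.id_to_piece(tid).replace('▁', ' ')
        if prev == BOS and piece.startswith(' '):
            piece = piece[1:]
        out.append(piece)
        prev = tid
    return ''.join(out)

print(decode_stream([1, 9038, 2501, 263, 931]))  # 'Once upon a time'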

@karpathy (Owner)

Thanks for the example. Any idea why the preprocessing even adds these spaces?
Alternatively, is it known somewhere how Meta trained its sentencepiece model? i.e. the launch command.

@karpathy reopened this Jul 27, 2023
@atamurad (Contributor) commented Jul 27, 2023

Alternatively, is it known somewhere how Meta trained its sentencepiece model? i.e. the launch command.

I was able to extract the training and normalizer flags/parameters from tokenizer.model (by decoding the protobuf message).

trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}

Snippet to decode/print the above params from the model file:

import sentencepiece.sentencepiece_model_pb2

# Parse the raw protobuf stored in tokenizer.model and print its training/normalizer specs.
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)

@karpathy (Owner) commented Jul 27, 2023

!!! @atamurad super helpful

So, yeah, in particular:
add_dummy_prefix: true

@karpathy (Owner)

I found this string on the internet:

The vocabulary size is set to 32,000. A add_dummy_prefix option is set to True because words are not separated by whitespaces in Japanese.

I don't really understand this sentence, how this option fixes Japanese, or why it exists.

@karpathy (Owner)

I pushed a fix for this, and also fixed a bug in the current PR, which would only have done the removal at pos=0 instead of right after BOS.
e5752e1

@karpathy closed this Jul 27, 2023
@BeBornTo commented Jul 30, 2023

>>> sp.decode([9038, 2501, 263, 931])
'Once upon a time'

I would add one more example to your explanation:

>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time")]
['▁Once', '▁upon', '▁a', '▁time']
>>> sp.encode("Once upon a time ")
[9038, 2501, 263, 931, 29871]
>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time ")]
['▁Once', '▁upon', '▁a', '▁time', '▁']
>>> sp.decode(sp.encode("Once upon a time "))
'Once upon a time '
>>> sp.decode(sp.encode(" Once upon a time"))
'Once upon a time'
