ggerganov
Member

@ggerganov ggerganov commented Jan 16, 2024

fix: #4958 #4925

@ggerganov ggerganov force-pushed the gg/fix-spm-added-tokens-dict-4958 branch from 9aefd14 to a137273 Compare January 16, 2024 12:08
@ggerganov ggerganov changed the title py : fix missing added_tokens_dict for SPM vocab py : fix missing added_tokens_dict for SPM and BPE vocabs Jan 16, 2024
@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Jan 16, 2024
@TheBloke
Contributor

Confirming this now works, as per my comment: #4958 (comment)

Many thanks

@ggerganov ggerganov merged commit 4f4bf35 into master Jan 17, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024

* py : fix missing added_tokens_dict for SPM vocab

* py : pad with unknown tokens when data is missing

ggml-ci

* py : fix BPE vocab conversion

ggml-ci

* py : fix padded dummy tokens (I hope)
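
The commit notes above mention keeping an `added_tokens_dict` available on the vocab and padding with placeholder tokens when data is missing. As a rough illustration only, here is a minimal sketch of that idea. `ToyVocab`, `pad_vocab`, and the `<dummyNNNNN>` placeholder naming are hypothetical and are not taken from the actual convert.py code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVocab:
    """Hypothetical stand-in for a converter vocab class (not the real convert.py)."""
    base_tokens: list
    # Keeping the dict as the source of truth means both views always exist,
    # so code that looks up `added_tokens_dict` never hits an AttributeError.
    added_tokens_dict: dict = field(default_factory=dict)

    @property
    def added_tokens_list(self) -> list:
        # Added tokens ordered by their token id.
        pairs = sorted(self.added_tokens_dict.items(), key=lambda kv: kv[1])
        return [tok for tok, _ in pairs]

    def __len__(self) -> int:
        return len(self.base_tokens) + len(self.added_tokens_dict)


def pad_vocab(vocab: ToyVocab, target_size: int) -> None:
    """Append placeholder tokens until the vocab reaches target_size."""
    start = len(vocab)
    if target_size < start:
        raise ValueError("vocab is already larger than the target size")
    for idx in range(start, target_size):
        # Placeholder name is an assumption made for this sketch.
        vocab.added_tokens_dict[f"<dummy{idx:05}>"] = idx


vocab = ToyVocab(base_tokens=["a", "b", "c"], added_tokens_dict={"<x>": 3})
pad_vocab(vocab, 6)
print(len(vocab), vocab.added_tokens_list)
```

In this sketch the list is derived from the dict, so the two cannot drift apart. Whether the real classes are structured that way is not stated in this conversation.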
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Development

Successfully merging this pull request may close these issues.

convert.py: --pad-vocab not working with SPM, 'SentencePieceVocab' object has no attribute 'added_tokens_dict'. Did you mean: 'added_tokens_list'?
