
Conversation

@DOGEwbx commented Nov 8, 2023

Thanks for your efforts to support converting the DeepSeek Coder model, as discussed in
#3633
I just found that the tokenization output of the quantized model is slightly different from the original, because DeepSeek Coder uses a different pretokenizer than the GPT-2-style preprocessing implemented in your code. So I added a DeepSeek Coder processing pipeline to your work.
The convert script is modified from strutive07's work.

```sh
python convert-deepseek-coder-hf-to-gguf.py <MODEL_PATH> --outfile <OUTPUT_NAME> --vocab-dir <MODEL_PATH> --padvocab
./quantize <OUTPUT_NAME> <OUTPUT_NAME_q4.0> q4_0
```

I ran the above commands and did some tests to make sure the tokenization results are the same.
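
For context, a check along those lines could look like the sketch below. This is a minimal illustration rather than the exact test that was run: the model id, the prompt, and the use of llama.cpp's `tokenize` example binary are all assumptions here (any equivalent dump of token ids from the converted model works).

```python
# Hypothetical sketch: compare the original HF tokenizer's ids against the
# converted GGUF model's tokenization (via llama.cpp's `tokenize` example,
# assumed to be built alongside `quantize`).
import subprocess
from transformers import AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"  # placeholder model id
prompt = "def quick_sort(arr):"

# Reference ids from the original Hugging Face tokenizer.
hf_ids = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).encode(prompt)

# Tokenization of the same prompt by the converted GGUF model.
cpp_out = subprocess.run(
    ["./tokenize", "output.gguf", prompt],
    capture_output=True, text=True, check=True,
).stdout

print("HF ids:   ", hf_ids)
print("llama.cpp:\n", cpp_out)
```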

@DOGEwbx (Author) commented Nov 8, 2023

I'm still working on porting the whole pretokenizer pipeline to C++; hopefully we can read the pretokenization steps directly from tokenizer.json instead of using the current hard-coded solution.
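
As a rough sketch of that direction: the pre-tokenizer configuration is already stored declaratively in tokenizer.json (the Hugging Face tokenizers schema), so reading it could start from something like the following; the file path is a placeholder, and the exact structure varies by tokenizer.

```python
# Minimal sketch: inspect the pre_tokenizer section of a Hugging Face
# tokenizer.json. The path is a placeholder; schema details vary by model.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

pre = tok.get("pre_tokenizer")
if pre is None:
    print("no pre_tokenizer section")
elif pre.get("type") == "Sequence":
    # A Sequence pre-tokenizer chains several steps (e.g. Split regexes, ByteLevel).
    for step in pre["pretokenizers"]:
        print(step["type"], "->", step)
else:
    print(pre["type"], "->", pre)
```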

llama.cpp (outdated)
```cpp
// std::cout<<utf_char<<std::endl;
// forward backward lookups
const std::string & utf_char_next = (i + 1 < (int)text_utf.size()) ? text_utf[i + 1] : "";
const std::string & utf_char_next_next = (i + 2 < (int)text_utf.size()) ? text_utf[i + 2] : "";
```
@cebtenzzre (Collaborator) commented Nov 8, 2023

I don't think C++ allows you to bind a reference to a temporary std::string (converted from "") like this, and then store it in a variable.
edit: I guess we're doing that too, so it must work well enough. (Binding a temporary to a const reference extends the temporary's lifetime to that of the reference, so this is legal.) There's quite a lot of copy-pasting going on here...

@cebtenzzre (Collaborator) commented Nov 8, 2023

I'm sure you don't need to copy-paste the whole convert.py for this. What version of convert.py did you base this on? It's not a very recent one.

@DOGEwbx (Author) commented

I referenced #3633 for the convert script.

@cebtenzzre (Collaborator) left a comment

This is way too much extra (mostly duplicate) code to maintain for a single model (with only tokenizer changes). You need to find a better way to do this.

@Galunid (Contributor) commented Nov 8, 2023

I'd suggest you take a look at #3838, which introduces convert-hf-to-gguf.py. It should be relatively easy to create a new class and override the set_vocab and maybe set_gguf_parameters methods to do what you need.
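
For illustration, the suggested structure might look roughly like the skeleton below. The Model base class and the two method names come from the comment above and #3838's convert-hf-to-gguf.py; the class name and method bodies here are assumptions, not verified against that script.

```python
# Hypothetical skeleton of the suggestion above: subclass the generic Model
# class from convert-hf-to-gguf.py and override only what DeepSeek Coder needs.
class DeepseekCoderModel(Model):
    def set_vocab(self):
        # Build the vocab using DeepSeek Coder's tokenizer.json / pretokenizer
        # rules instead of the default GPT-2-style handling.
        ...

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Write any DeepSeek-specific hyperparameters to the GGUF file here.
```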
