Skip to content

Conversation

@jpodivin
Copy link
Contributor

@jpodivin jpodivin commented Jul 1, 2024

The <filename> token used by Refact doesn't serve the same purpose as the <file_separator> from CodeGemma.

I've been going through docs for both Refact and CodeGemma, and came to conclusion that <filename> isn't a good fit for the file separator token. The biggest issue is the lack of documentation on the matter.
Only mention of the <filename> token on Huggingface and other sources comes from description of Starcoder training[0] where it is used to mark start of filename for prompt. This is substantially different from the way it is used by CodeGemma[1].

Repos of Small Magellanic Cloud AI on github and HF don't provide more information.

With that in mind, I believe it's better not to set it as fsep token.

[0] https://huggingface.co/blog/starcoder#evaluation
[1] https://ai.google.dev/gemma/docs/formatting

@github-actions github-actions bot added the python python script changes label Jul 1, 2024
@mofosyne mofosyne added medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable) Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level and removed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable) labels Jul 3, 2024
special_vocab._set_special_token("prefix", 1)
special_vocab._set_special_token("suffix", 3)
special_vocab._set_special_token("middle", 2)
special_vocab._set_special_token("fsep", 4) # is this correct?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token id 4 in Refact's tokenizer seems to be <fim_pad> anyway, which doesn't look like a separator, so it was not correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've rebased the PR and resolved conflicts.

The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <[email protected]>
@ggerganov ggerganov merged commit 5aefbce into ggml-org:master Jul 12, 2024
@jpodivin jpodivin deleted the filetoken branch July 13, 2024 08:31
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python python script changes Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants