pass stop words to openai api #887
Conversation
Please note: this PR touches the following files, where merge conflicts were resolved:

- lmdeploy/serve/async_engine.py
- lmdeploy/serve/openai/api_server.py
- lmdeploy/tokenizer.py
- lmdeploy/turbomind/turbomind.py
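For context, what the PR enables is forwarding the OpenAI-style `stop` field through the api_server to the engine. A hypothetical request against a locally running server (host, port, and model name below are assumptions, not taken from this PR) might look like:

```python
# Hypothetical usage sketch: send a `stop` list through the
# OpenAI-compatible endpoint. Host/port/model name are assumptions.
import requests

resp = requests.post(
    'http://localhost:23333/v1/chat/completions',
    json={
        'model': 'internlm-chat-20b',
        'messages': [{'role': 'user', 'content': 'Count to ten.'}],
        'stop': ['seven'],  # generation should halt before 'seven'
    },
)
print(resp.json())
```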
```
@@ -53,6 +63,27 @@ def _maybe_add_prefix_space(self, tokens, decoded):
        else:
            return decoded

    def indexes_containing_token(self, token: str):
```
Is `indexes_containing_token` time-consuming?
I used maps to get the indexes. The time consumed should be acceptable.
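A minimal sketch of how such a cached lookup could work, assuming the `max_indexes_num` cap and `_indexes_tokens_deque` cache that appear later in this diff (class name, signatures, and sizes here are illustrative, not the PR's actual code):

```python
from collections import deque


class TokenIndexLookup:
    """Sketch: find vocab indexes whose pieces contain `token`,
    with a small FIFO cache so repeated stop words skip the scan."""

    def __init__(self, vocab, max_indexes_num=5, cache_size=10):
        self.vocab = vocab  # index -> piece string
        self.max_indexes_num = max_indexes_num
        self._indexes_tokens_deque = deque(maxlen=cache_size)

    def indexes_containing_token(self, token):
        # serve repeated queries from the cache
        for cached_token, indexes in self._indexes_tokens_deque:
            if cached_token == token:
                return indexes
        # one O(len(vocab)) substring scan on a cache miss
        indexes = [i for i, piece in enumerate(self.vocab)
                   if token in piece]
        self._indexes_tokens_deque.append((token, indexes))
        return indexes


lookup = TokenIndexLookup(['▁the', 'bucket', 'ucket', '▁a'])
print(lookup.indexes_containing_token('ucke'))  # -> [1, 2]
```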
```python
            f'There are too many(>{self.max_indexes_num}) possible '
            f'indexes may decoding {token}, we will use {indexes} only')
        self._indexes_tokens_deque.append((token, indexes))
        return indexes
```
What is special about this one?
Special in what way?
```python
if token == ' ':  # ' ' is special
```
In the tokenizer, space characters are all converted to '▁'.
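A toy illustration of that point (the pieces below are hypothetical, not the real vocabulary): SentencePiece-style tokenizers replace the space before a word with the metasymbol '▁' rather than keeping a literal ' ' inside pieces.

```python
# Hypothetical pieces: SentencePiece turns a word-leading space
# into the metasymbol '▁' instead of keeping ' ' in the piece.
pieces = ['▁a', '▁bucket', 'ucket', '▁Bucket']

# A literal space therefore never appears inside a piece ...
assert not any(' ' in p for p in pieces)

# ... so a stop word of ' ' must be matched via '▁' instead.
space_like = [i for i, p in enumerate(pieces) if '▁' in p]
print(space_like)  # -> [0, 1, 3]
```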
```python
vocab = self.model.IdToPiece(list(range(self.vocab_size)))
indexes = [i for i, voc in enumerate(vocab) if token in voc]
if len(indexes) > self.max_indexes_num:
    indexes = self.encode(token, add_bos=False)[-1:]
```
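Restated as a self-contained toy (the vocab, cap, and `encode` stub are all hypothetical), the cap-and-fallback logic above behaves like this:

```python
def scan_with_cap(token, vocab, encode, max_indexes_num=3):
    """Toy restatement of the diff above: scan every piece for the
    token; if too many pieces match, fall back to the token's own
    encoding and keep only the last id."""
    indexes = [i for i, piece in enumerate(vocab) if token in piece]
    if len(indexes) > max_indexes_num:
        indexes = encode(token)[-1:]
    return indexes


def encode_stub(token):
    # stand-in for self.encode(token, add_bos=False); ids are made up
    return [101, 7]


vocab = ['ucket', 'bucket', 'Bucket', 'uckets', 'ucked', '▁sucked']
print(scan_with_cap('ucke', vocab, encode_stub))  # cap exceeded -> [7]
```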
Can the length here exceed 1? If it can, taking the last id isn't quite right, since it may be a common token.
That case shouldn't arise: if multiple single indexes could each decode to a string containing the token, the token itself should be encoded into just one index.
For example, with token = 'ucke', internlm-chat-20b's vocabulary contains
ucket
▁bucket
ucker
▁fucked
bucket
Bucket
uckets
_bucket
ucked
▁buckets
▁Bucket
▁sucked
▁Zucker
▁Tucker
(bucket
▁tucked
but there is no ucke in the vocabulary itself.
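To make the concern concrete: since 'ucke' is not itself a piece, its encoding has to span several ids, and the `[-1:]` fallback keeps only the last one. The split below is hypothetical, used only to show the failure mode:

```python
# Hypothetical split for a token absent from the vocab; the real
# pieces depend on the model and may differ.
encoded_pieces = ['u', 'ck', 'e']   # stand-in for self.encode('ucke')
kept = encoded_pieces[-1:]          # what the fallback keeps
print(kept)  # -> ['e'], a very common piece
# Stopping on 'e' would trigger on almost any generation, which is
# exactly the concern raised above.
```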