Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][text] add tokens #2201

Closed
wants to merge 6 commits into from
Closed

Conversation

Mddct
Copy link
Collaborator

@Mddct Mddct commented Dec 7, 2023

after pr: #2186
hf 上 add token 和add special tokens 的联系和区别
1 都会加载到tokenizer里边,并且tokenzie的时候 新加的token不会被“split”
2 add special tokens 会把新的token变成tokenizer的property

比如:

tokenizer = ....
tokenizer.add_special_tokens(['cls'])

# then:
print(tokenizer.cls )

之前讨论

我们只需要add_token就可以了,让模型own id就够了

TODO:
add tokens work with
- [x] tries (see: #2201 (comment))

  • char
  • bpe model
  • hg tokenizer
  • whisper
  • unit test

next pr

  • resize model embedding

@Mddct
Copy link
Collaborator Author

Mddct commented Dec 7, 2023

这里有个小问题:

  • 对于char tokenizer , 如果我们add_tokens(["你好"]) 怎么办?

solution:
检查tokens是否为char list

@Mddct
Copy link
Collaborator Author

Mddct commented Dec 7, 2023

第二个问题

solution:

@Mddct
Copy link
Collaborator Author

Mddct commented Dec 8, 2023

第二个问题

solution:

优先选择第二种,原因如下:

  • 修改proto的时候,比较繁琐,需要给相应的分数,这块可以留个有需要的人在最外边自己扩充词典和bpe model用
  • 对于其他tokenizer,比如 使用了tiktoke等,会涉及到更复杂的操作

所以直接按照方案2,简洁 更通用

@Mddct
Copy link
Collaborator Author

Mddct commented Dec 8, 2023

Screenshot 2023-12-08 at 17 09 42

add_tokens 需要考虑到“边界”

@Mddct Mddct marked this pull request as draft December 11, 2023 11:50
@Mddct Mddct closed this Dec 12, 2023
@Mddct Mddct deleted the Mddct-tokenizer-add_special_tokens branch December 12, 2023 02:47
@Mddct
Copy link
Collaborator Author

Mddct commented Dec 12, 2023

过度封装
可以丢到最外边做 每种底层实现都放到这里太臃肿了

@Mddct
Copy link
Collaborator Author

Mddct commented Dec 12, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant