fix: multiprocessing pickle error with tokenizer #111
I think we should add tests to show that pickling works. @patrickvonplaten, any thoughts about exposing the file_path? Should I make it private?
@juliendenize, I added a test, but unfortunately your changes broke it. With the first commit checked out, the test passes.
@NanoCode012 you're right: it works for tekken but, weirdly, not for sentencepiece. Edit: not weird at all, fixing the issue right now.
    Returns:
        A tuple of the factory function and the arguments to reconstruct the object from its source file.
    """
    return MistralTokenizer.from_file, (self.instruct_tokenizer.tokenizer.file_path,)

Should we also pass mode? Otherwise the tokenizer rebuilt in the subprocess may misbehave, since test mode rejects some things that fine-tune mode lets through.
This should probably be covered by the test too.
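
To illustrate the concern, here is a minimal sketch of a reduce tuple that also carries the mode. `Mode`, `ModeAwareTokenizer`, and the `"tokenizer.model"` path are hypothetical stand-ins invented for this example; they are not the library's actual classes or signatures.

```python
import pickle
from enum import Enum


class Mode(Enum):
    """Hypothetical validation modes, for illustration only."""
    test = "test"
    finetuning = "finetuning"


class ModeAwareTokenizer:
    """Hypothetical tokenizer that remembers the mode it was built with."""

    def __init__(self, file_path: str, mode: Mode = Mode.test) -> None:
        self.file_path = file_path
        self.mode = mode

    @classmethod
    def from_file(cls, file_path: str, mode: Mode = Mode.test) -> "ModeAwareTokenizer":
        return cls(file_path, mode)

    def __reduce__(self):
        # Pass the mode alongside the path so the unpickled copy
        # does not silently fall back to the default mode.
        return ModeAwareTokenizer.from_file, (self.file_path, self.mode)


def restored_mode(mode: Mode) -> Mode:
    """Pickle round-trip a tokenizer and report the mode of the copy."""
    tok = ModeAwareTokenizer.from_file("tokenizer.model", mode=mode)
    return pickle.loads(pickle.dumps(tok)).mode
```

If the tuple held only the path, the copy in this sketch would come back in the default `Mode.test` regardless of how the original was built.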
    if isinstance(tokenizer_filename, Path):
        tokenizer_filename = str(tokenizer_filename)

Suggested change:

    - if isinstance(tokenizer_filename, Path):
    -     tokenizer_filename = str(tokenizer_filename)
    + tokenizer_filename = str(tokenizer_filename)

    if isinstance(tokenizer_filename, Path):
        tokenizer_filename = str(tokenizer_filename)

Suggested change:

    - if isinstance(tokenizer_filename, Path):
    -     tokenizer_filename = str(tokenizer_filename)
    + tokenizer_filename = str(tokenizer_filename)

patrickvonplaten left a comment:
Makes sense to me - left some nits
juliendenize left a comment:
Thanks for opening the PR!
@juliendenize
This was discussed on Slack. An error occurs because dill does not know how to serialize MistralTokenizer. The solution is to provide a factory function and its arguments so that the tokenizer can be recreated from its source file on deserialization.
Error:
Repro script:
Tested that the above script now works fine with these changes.
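
For readers unfamiliar with the factory-function approach described above, here is a minimal sketch using only the standard library. `ToyTokenizer` and its vocabulary file are hypothetical stand-ins, not the real `MistralTokenizer`; the lambda attribute merely imitates a member that the pickler cannot serialize.

```python
import pickle
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical tokenizer holding a member that cannot be pickled."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path
        self.vocab = Path(file_path).read_text().split()
        # A lambda attribute makes default pickling of the instance fail,
        # standing in for a native handle owned by a real tokenizer.
        self._encode_one = lambda tok: self.vocab.index(tok)

    @classmethod
    def from_file(cls, file_path) -> "ToyTokenizer":
        return cls(str(file_path))

    def encode(self, text: str) -> list:
        return [self._encode_one(tok) for tok in text.split()]

    def __reduce__(self):
        # Tell pickle to rebuild the object by calling the factory with the
        # source path, instead of serializing the instance attributes.
        return ToyTokenizer.from_file, (self.file_path,)


def roundtrip_demo():
    """Encode with the original and with a pickled copy; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        vocab_file = Path(tmp) / "vocab.txt"
        vocab_file.write_text("hello world foo")
        tok = ToyTokenizer.from_file(vocab_file)
        clone = pickle.loads(pickle.dumps(tok))
        return tok.encode("foo hello"), clone.encode("foo hello")
```

One consequence of this design is that the source file must still be readable at the same path wherever the object is unpickled, which is typically true for worker processes on the same machine.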