Skip to content

fix: multiprocessing pickle error with tokenizer#111

Merged
juliendenize merged 8 commits into
mistralai:mainfrom
NanoCode012:fix/multiprocessing
Jun 30, 2025
Merged

fix: multiprocessing pickle error with tokenizer#111
juliendenize merged 8 commits into
mistralai:mainfrom
NanoCode012:fix/multiprocessing

Conversation

@NanoCode012

Copy link
Copy Markdown
Contributor

@juliendenize

This was discussed on Slack. An error occurs as dill does not know how to serialize MistralTokenizer. The solution is to provide a factory function and args to recreate the tokenizer.

Error:

  File "/opt/miniconda3/envs/py311/lib/python3.11/site-packages/dill/_dill.py", line 1217, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/opt/miniconda3/envs/py311/lib/python3.11/pickle.py", line 972, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/miniconda3/envs/py311/lib/python3.11/pickle.py", line 998, in _batch_setitems
    save(v)
  File "/opt/miniconda3/envs/py311/lib/python3.11/site-packages/dill/_dill.py", line 414, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/opt/miniconda3/envs/py311/lib/python3.11/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
    ^^^^^^^^^^^^
  File "/opt/miniconda3/envs/py311/lib/python3.11/site-packages/dill/_dill.py", line 1869, in save_type
    StockPickler.save_global(pickler, obj, name=obj_name)
  File "/opt/miniconda3/envs/py311/lib/python3.11/pickle.py", line 1071, in save_global
    raise PicklingError(
_pickle.PicklingError: Can't pickle <class 'mistral_common.tokens.instruct.request.InstructRequest[Union[~UserMessageType, ~AssistantMessageType, ~ToolMessageType, ~SystemMessageType], Tool]'>: it's not found as mistral_common.tokens.instruct.request.InstructRequest[Union[~UserMessageType, ~AssistantMessageType, ~ToolMessageType, ~SystemMessageType], Tool]

Repro script:

from datasets import load_dataset
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

def main():

    dataset = load_dataset("mhenrichsen/alpaca_2k_test", split="train").select(
        range(100)
    )

    mistral_tokenizer = MistralTokenizer.from_hf_hub("mistralai/magistral-small")

    def tokenize_example(example):
        # Create a prompt from the dataset columns
        if example.get("input"):
            prompt = f"{example['instruction']}\n\n{example['input']}"
        else:
            prompt = example["instruction"]

        # Use the mistral-common protocol to create a chat request
        chat_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

        tokenized_output = mistral_tokenizer.encode_chat_completion(chat_request)

        # map function expects a dict
        return {"input_ids": tokenized_output.tokens}

    tokenized_dataset = dataset.map(
        tokenize_example,
        num_proc=2,  # This is the crucial part that triggers the PicklingError
    )

    print(tokenized_dataset[0])


if __name__ == "__main__":
    main()

Tested that the above script now works fine with these changes.

@juliendenize

juliendenize commented Jun 27, 2025

Copy link
Copy Markdown
Contributor

Think we should add tests to show pickle works on the MistralTokenizerand we're good to go !

@patrickvonplaten thoughts about exposing the file_path ? Should I make it private ?

@NanoCode012

NanoCode012 commented Jun 27, 2025

Copy link
Copy Markdown
Contributor Author

@juliendenize , I added a test, but unfortunately your changes broke it. Checking back to the first commit passes.

@juliendenize

juliendenize commented Jun 27, 2025

Copy link
Copy Markdown
Contributor

@NanoCode012 you're right, it works for tekken but not for sentencepiece weirdly.

Edit: not weird at all, fixing the issue rn

Returns:
A tuple of the factory function and the arguments to reconstruct the object from its source file.
"""
return MistralTokenizer.from_file, (self.instruct_tokenizer.tokenizer.file_path,)

@NanoCode012 NanoCode012 Jun 27, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also pass mode? The subprocess tokenizer may have some issues where the test mode doesn't allow certain things through which the fine-tune mode does.

Probably should be added to test too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good idea

Comment on lines +55 to +56
if isinstance(tokenizer_filename, Path):
tokenizer_filename = str(tokenizer_filename)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(tokenizer_filename, Path):
tokenizer_filename = str(tokenizer_filename)
tokenizer_filename = str(tokenizer_filename)

Comment on lines +33 to +34
if isinstance(tokenizer_filename, Path):
tokenizer_filename = str(tokenizer_filename)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(tokenizer_filename, Path):
tokenizer_filename = str(tokenizer_filename)
tokenizer_filename = str(tokenizer_filename)

@patrickvonplaten patrickvonplaten left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me - left some nits

@juliendenize juliendenize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for issuing the PR !

@juliendenize juliendenize merged commit 8968083 into mistralai:main Jun 30, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants