bring back our demons: clean_up_tokenization_spaces#44035

Merged
ArthurZucker merged 3 commits into main from fix-remote-cleanup
Feb 20, 2026

@ArthurZucker
Collaborator

What does this PR do?

We already brought it back with:

        if clean_up_tokenization_spaces:
            # Call custom cleanup method if it exists (e.g., for CLVP's [SPACE] token replacement)
            if hasattr(self, "clean_up_tokenization") and callable(self.clean_up_tokenization):
                text = self.clean_up_tokenization(text)
            else:
                # Otherwise apply standard cleanup
                text = (
                    text.replace(" .", ".")
                    .replace(" ?", "?")
                    .replace(" !", "!")
                    .replace(" ,", ",")
                    .replace(" ' ", "'")
                    .replace(" n't", "n't")
                    .replace(" 'm", "'m")
                    .replace(" 's", "'s")
                    .replace(" 've", "'ve")
                    .replace(" 're", "'re")
                )

which basically always calls it.
We are doing this now because it lets us fix custom remote models that expect this to still exist.
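For reviewers who want to poke at the behavior outside the library, here is a minimal standalone sketch of the dispatch quoted above. The names `standard_cleanup`, `SketchDecoder`, `finish_decode`, and `SpaceTokenDecoder` are hypothetical and only for illustration; they are not the transformers API. Only the replacement chain and the `clean_up_tokenization` hook name come from the snippet in this PR.

```python
def standard_cleanup(text: str) -> str:
    """Apply the same chain of replacements quoted in the PR description."""
    return (
        text.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )


class SketchDecoder:
    """Hypothetical stand-in for a tokenizer; not the real transformers class."""

    def finish_decode(self, text: str, clean_up_tokenization_spaces: bool = True) -> str:
        if not clean_up_tokenization_spaces:
            return text
        # Prefer a custom hook if the (possibly remote) tokenizer defines one.
        hook = getattr(self, "clean_up_tokenization", None)
        if callable(hook):
            return hook(text)
        # Otherwise fall back to the standard cleanup.
        return standard_cleanup(text)


class SpaceTokenDecoder(SketchDecoder):
    """Hypothetical remote-code tokenizer that ships its own cleanup hook."""

    def clean_up_tokenization(self, text: str) -> str:
        return text.replace("[SPACE]", " ")


print(SketchDecoder().finish_decode("Hello , world ! I do n't know ."))
# Hello, world! I don't know.
print(SpaceTokenDecoder().finish_decode("hello[SPACE]world ."))
# hello world .
```

Note that in this sketch, as in the quoted snippet, a custom hook replaces the standard cleanup rather than running in addition to it, which is why the second example keeps the space before the period.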

@ArthurZucker ArthurZucker marked this pull request as ready for review February 16, 2026 09:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap
Collaborator

itazap commented Feb 17, 2026

nice, thanks - we can then also rm this code from tokenization_utils_tokenizers.py, tokenization_python.py, and tokenization_mistral_common.py!

Member

@hmellor hmellor left a comment

This fixes the issues with remote models in vLLM.

Let's take @itazap's suggestion to deduplicate this BC code.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: luke, plbart, wav2vec2

@ArthurZucker ArthurZucker merged commit d233970 into main Feb 20, 2026
24 of 26 checks passed
@ArthurZucker ArthurZucker deleted the fix-remote-cleanup branch February 20, 2026 14:50

4 participants