OpusCleaner supports only a limited set of languages #649

Open
Tracked by #311
eu9ene opened this issue May 29, 2024 · 4 comments
Labels
language-coverage Issues related to covering specific languages

Comments

@eu9ene
Collaborator

eu9ene commented May 29, 2024

I ran into an issue with Turkish:
https://firefox-ci-tc.services.mozilla.com/tasks/Ip5AUlOmRU2hu2yP8RfS0w/runs/0/logs/public/logs/live.log

Specifically, the alpha_ratio filter: https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/clean_common.py

@eu9ene eu9ene added the language-coverage Issues related to covering specific languages label May 29, 2024
@eu9ene
Collaborator Author

eu9ene commented May 29, 2024

From our list of configs, I also don't see: tr, bs, id, vi, sr

@eu9ene eu9ene changed the title OpusCleaner does not support Turkish OpusCleaner supports only a limited set of languages May 29, 2024
@gregtatum
Member

I wonder as a mitigation if we can just skip the feature when it's not supported, and add a note in the training config generation. Maybe the config generator generates a suppression of this feature so we know.
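
The mitigation above could look roughly like the sketch below: decide per language pair whether the filter can run, and return a note that a config generator could write next to the suppression. The function name `plan_alpha_ratio`, the `EXAMPLE_SUPPORTED` set, and the returned dictionary shape are all hypothetical and are not taken from OpusCleaner or the training pipeline.

```python
# Placeholder values only; NOT the real list of languages the filter supports.
EXAMPLE_SUPPORTED = frozenset({"en", "fr"})


def plan_alpha_ratio(src, trg, supported=EXAMPLE_SUPPORTED):
    """Decide whether to enable alpha_ratio for a language pair.

    Returns a dict with an "enabled" flag and a list of human-readable
    notes explaining any suppression.
    """
    # Collect every side of the pair that has no alphabet defined.
    missing = [lang for lang in (src, trg) if lang not in supported]
    if missing:
        note = "alpha_ratio skipped: no alphabet defined for " + ", ".join(missing)
        return {"enabled": False, "notes": [note]}
    return {"enabled": True, "notes": []}


if __name__ == "__main__":
    print(plan_alpha_ratio("en", "tr"))
```

Returning the note with the decision keeps the suppression visible in the generated config instead of silently dropping the filter.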

@eu9ene eu9ene self-assigned this Jun 11, 2024
@eu9ene
Collaborator Author

eu9ene commented Jun 11, 2024

Based on my experiment, alphabet ratio filtering can be very efficient, so we should just add support for those languages to OpusCleaner.
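
For readers unfamiliar with the idea, here is a minimal sketch of an alphabet ratio check. It is an illustration only and does not reproduce OpusCleaner's `alpha_ratio` implementation, which may count characters differently. The names `alpha_ratio`, `keep_sentence`, `TURKISH`, and the `min_ratio` default are assumptions made for this example.

```python
from typing import AbstractSet

# Illustrative Turkish alphabet. Both cases are listed explicitly because
# Python's str.lower() maps "İ" to "i" plus a combining dot (two code points),
# so case folding a single character is not reliable here.
TURKISH = frozenset(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)


def alpha_ratio(text: str, alphabet: AbstractSet[str]) -> float:
    """Fraction of non-whitespace characters that belong to `alphabet`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in alphabet for c in chars) / len(chars)


def keep_sentence(text: str, alphabet: AbstractSet[str], min_ratio: float = 0.75) -> bool:
    """Keep a sentence when enough of it is written in the expected alphabet.

    The 0.75 threshold is an arbitrary example value.
    """
    return alpha_ratio(text, alphabet) >= min_ratio


if __name__ == "__main__":
    print(alpha_ratio("İstanbul çok güzel", TURKISH))
    print(keep_sentence("12345 !!!", TURKISH))
```

A language without a defined alphabet has nothing to compare against, which is why the filter cannot run for it.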

@eu9ene
Collaborator Author

eu9ene commented Jun 17, 2024

@gregtatum here's my attempt to extend alphabets but I'm not confident in Vietnamese: eu9ene/OpusCleaner@7aefb5b
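
One way to build confidence in a candidate alphabet is to run it over sample text in that language and list the letters it does not cover. The sketch below assumes the alphabet is given as a plain string of characters; the helper name `uncovered_letters` and the sample data are hypothetical and unrelated to the linked commit. Both inputs are normalized to NFC so that precomposed and decomposed forms of accented letters compare equal.

```python
import unicodedata
from collections import Counter


def uncovered_letters(sample: str, alphabet: str) -> Counter:
    """Count letters in `sample` that are missing from `alphabet`.

    Both strings are NFC-normalized first, so a letter typed as
    base + combining marks matches its precomposed form.
    """
    text = unicodedata.normalize("NFC", sample)
    known = set(unicodedata.normalize("NFC", alphabet))
    # Only alphabetic characters are checked; digits, spaces, and
    # punctuation are ignored.
    return Counter(ch for ch in text if ch.isalpha() and ch not in known)


if __name__ == "__main__":
    # A deliberately incomplete candidate alphabet for demonstration.
    print(uncovered_letters("Tiếng Việt", "TingVt"))
```

Frequent letters in the result point to gaps in the alphabet, while rare ones are more likely loanwords or noise in the sample.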

2 participants