Disable use of bicleaner-hardrules #888

Closed
Tracked by #216
ZJaume opened this issue Oct 21, 2024 · 3 comments · Fixed by #892
Assignees
Labels
quality Improving robustness and translation quality

Comments

@ZJaume
Collaborator

ZJaume commented Oct 21, 2024

I think the safest option, especially if you already have a previous rule-based cleaning step, is to disable bicleaner-hardrules with --disable_hardrules for all language pairs.
https://github.com/mozilla/firefox-translations-training/blob/c588cdfe3b397a02e411a0a5168ff720b7df4770/pipeline/bicleaner/bicleaner.sh#L53-L58
Even though hardrules are meant to detect the most obvious noise, some rules may be too aggressive for corpora that come from sources cleaner than web crawls.

Right now, if not disabled, it discards every sentence with fewer than 3 words. So I think #215 is entirely the fault of hardrules.

If you still want to keep hardrules, the options most harmful for non-Paracrawl corpora can be disabled with: --disable_minimal_length --disable_lm_filter.
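To make the two choices above concrete, here is a minimal sketch of how a pipeline might pick the extra flags. Only the three flag names come from this comment; the helper `hardrules_flags` and its mode names ("off", "relaxed", "full") are hypothetical and are not part of bicleaner or of the training pipeline.

```python
from typing import List


def hardrules_flags(mode: str) -> List[str]:
    """Return extra command-line flags for the cleaning step.

    Hypothetical helper: the mode names are made up for illustration.
    Only the flag strings themselves are taken from the issue text.
    """
    if mode == "off":
        # Skip hardrules entirely (the option recommended in this issue).
        return ["--disable_hardrules"]
    if mode == "relaxed":
        # Keep hardrules but turn off the two rules described as most
        # harmful for non-Paracrawl corpora.
        return ["--disable_minimal_length", "--disable_lm_filter"]
    if mode == "full":
        # Keep every rule: no extra flags.
        return []
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("off", "relaxed", "full"):
        print(mode, " ".join(hardrules_flags(mode)))
```

The returned list would be appended to whatever command the pipeline already builds; how that command is assembled is left out here on purpose.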

@marco-c
Collaborator

marco-c commented Oct 21, 2024

@ZJaume should we consider enabling them for web-crawl data and disabling them for cleaner data sources?

@ZJaume
Collaborator Author

ZJaume commented Oct 21, 2024

I don't think so. The main issue was with Paracrawl because, despite being decently clean in the end, all the data we ran bitext extraction on was very noisy. Other web-crawled corpora like NLLB or CCMatrix, despite using a worse cleaning scorer (LASER), came from cleaner sources. Also, bicleaner has improved a lot since we introduced the rule. Although bicleaner is less accurate on short sentences than on long ones, I think its quality is reasonable now.

eu9ene added the quality (Improving robustness and translation quality) label Oct 21, 2024
eu9ene self-assigned this Oct 21, 2024
@eu9ene
Collaborator

eu9ene commented Oct 21, 2024

I'm disabling it, but we can compare the translation of short sentences between the languages where we used a custom model with hard rules enabled and the languages where we used the multilingual model with them disabled.
