Right now, unless it is disabled, hardrules discards every sentence with fewer than 3 words. So I think #215 is entirely hardrules' fault.
If you still want to keep hardrules, the options most harmful to non-Paracrawl corpora can be disabled with: --disable_minimal_length --disable_lm_filter
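To make the effect of the minimal-length rule concrete, here is a small Python sketch. It is not the actual bicleaner-hardrules implementation: the function name `drop_short_pairs`, the whitespace tokenization, and the choice to check both sides of the pair are assumptions made for illustration. Only the 3-word threshold comes from the description above.

```python
def drop_short_pairs(pairs, min_words=3):
    """Keep only pairs where both sides have at least `min_words` tokens.

    Hypothetical illustration: tokens are split on whitespace, which is an
    assumption and may differ from what hardrules does internally.
    """
    kept = []
    for src, trg in pairs:
        if len(src.split()) >= min_words and len(trg.split()) >= min_words:
            kept.append((src, trg))
    return kept


corpus = [
    ("Thank you.", "Gracias."),
    ("Good morning!", "¡Buenos días!"),
    ("The cat sat on the mat.", "El gato se sentó en la alfombra."),
]
filtered = drop_short_pairs(corpus)
print(len(corpus), "->", len(filtered))
```

In this toy corpus only the last pair survives, even though the two short pairs are perfectly valid translations. That is the kind of loss that matters for corpora coming from clean sources.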
I don't think so. The main issue was with Paracrawl because, despite being decently clean in the end, all the data we ran bitext extraction on was very noisy. Other web-crawled corpora like NLLB or CCMatrix came from cleaner sources, despite having a worse cleaning scorer (LASER). Also, bicleaner has improved a lot since we introduced the rule. Although bicleaner is less accurate on short sentences than on long ones, I think its quality is reasonable now.
I'm disabling it, but we can compare the translation of short sentences between the languages where we used a custom model with hard rules enabled and the languages where we used the multilingual model with them disabled.
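One way to set up that comparison is to evaluate short and long sentences separately. The sketch below is only an assumption about how such a split could look: `split_by_length` and the `max_short_words` threshold are hypothetical names, and scoring each bucket with a metric of your choice is left out.

```python
def split_by_length(pairs, max_short_words=2):
    """Split (source, reference) pairs into short and long buckets.

    A pair counts as short when its source has at most `max_short_words`
    whitespace-separated tokens. The default is a hypothetical threshold,
    chosen to match sentences with fewer than 3 words.
    """
    short, long_ = [], []
    for src, ref in pairs:
        bucket = short if len(src.split()) <= max_short_words else long_
        bucket.append((src, ref))
    return short, long_


test_set = [
    ("Hello", "Hola"),
    ("See you soon", "Hasta pronto"),
    ("Where is the station?", "¿Dónde está la estación?"),
]
short, long_ = split_by_length(test_set)
print(len(short), "short /", len(long_), "long")
```

Scoring the short bucket on its own, for a language trained with hard rules and one trained without them, would show whether dropping short sentences from training actually hurts their translation.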
I think the safest option, especially if you already have a previous rule-based cleaning step, is to disable bicleaner-hardrules with --disable_hardrules for all language pairs.
https://github.com/mozilla/firefox-translations-training/blob/c588cdfe3b397a02e411a0a5168ff720b7df4770/pipeline/bicleaner/bicleaner.sh#L53-L58
Even though hardrules is meant to detect the most obvious noise, some of its rules may be too aggressive for corpora that come from sources cleaner than web crawls.