Improve translation of URLs #736

Open
Tracked by #238
eu9ene opened this issue Jul 12, 2024 · 4 comments
Labels
quality Improving robustness and translation quality

Comments

@eu9ene
Collaborator

eu9ene commented Jul 12, 2024

Sometimes URLs are written in the visible text rather than hidden behind an HTML element. In this case the URL should be copied to the translation as is.

There are two ways to fix this:

  1. Maybe an easier way: identify a URL with a regex on the translation engine side and copy it without passing it to the model
  2. Add data augmentation to insert URLs in some training examples and retrain the models
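
A minimal sketch of option 1, not the engine's actual code: split the text on URLs, pass only the non-URL segments to the model, and copy the URLs through unchanged. The pattern `URL_RE`, the helpers `split_urls` and `translate_preserving_urls`, and the `translate` callback are all hypothetical names introduced for illustration. The regex is deliberately simple (only `http`/`https`, up to the next whitespace) and will miss or over-match some real-world URLs.

```python
import re
from typing import Callable

# Hypothetical, deliberately simple pattern: http(s) URLs up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Trailing characters that usually belong to the sentence rather than the URL.
TRAILING_PUNCT = ".,;:!?)]}\"'"


def split_urls(text: str) -> list[tuple[str, bool]]:
    """Split text into ordered (segment, is_url) pieces."""
    pieces = []
    pos = 0
    for match in URL_RE.finditer(text):
        url = match.group().rstrip(TRAILING_PUNCT)
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((url, True))
        pos = match.start() + len(url)
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


def translate_preserving_urls(text: str, translate: Callable[[str], str]) -> str:
    """Translate only the non-URL segments; copy URLs through verbatim."""
    out = []
    for segment, is_url in split_urls(text):
        if is_url or not segment.strip():
            out.append(segment)
            continue
        # Keep surrounding whitespace ourselves, since a model may drop it.
        lead = segment[: len(segment) - len(segment.lstrip())]
        trail = segment[len(segment.rstrip()):]
        out.append(lead + translate(segment.strip()) + trail)
    return "".join(out)


# Usage with a stand-in "model" that just upper-cases its input.
print(translate_preserving_urls("See https://example.com/a?b=1, then reply.", str.upper))
```

One trade-off of this sketch is that the text around a URL is translated as separate fragments, so the model loses sentence context across the URL.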
@eu9ene added the quality label (Improving robustness and translation quality) on Jul 12, 2024
@marco-c
Collaborator

marco-c commented Jul 17, 2024

Could also be a data cleaning problem, like num_mismatch.
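
As a rough illustration of the data cleaning angle, and assuming a number-mismatch style filter works by dropping sentence pairs whose numbers differ between source and target, the same idea can be sketched for URLs. The names `extract_urls`, `urls_match`, and `filter_pairs` are hypothetical and are not part of any existing cleaning tool; the URL pattern is intentionally naive.

```python
import re
from collections import Counter

# Hypothetical, naive pattern: http(s) URLs up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Trailing characters that usually belong to the sentence rather than the URL.
TRAILING_PUNCT = ".,;:!?)]}\"'"


def extract_urls(text: str) -> Counter:
    """Count each URL in the text, ignoring trailing sentence punctuation."""
    return Counter(m.group().rstrip(TRAILING_PUNCT) for m in URL_RE.finditer(text))


def urls_match(source: str, target: str) -> bool:
    """True when both sides contain exactly the same URLs (with multiplicity)."""
    return extract_urls(source) == extract_urls(target)


def filter_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the training pairs whose URLs were copied unchanged."""
    return [(src, trg) for src, trg in pairs if urls_match(src, trg)]


pairs = [
    ("Besök https://example.com/sv nu.", "Visit https://example.com/sv now."),
    ("Besök https://example.com/sv nu.", "Visit https://example.com/en now."),
    ("Hej", "Hello"),
]
# The second pair is dropped because the URL differs between the two sides.
print(filter_pairs(pairs))
```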

@gregtatum
Member

gregtatum commented Jul 19, 2024

I would push back against implementing option 1, which would happen on the Gecko side for every translation. That regex seems risky and error-prone to write. I would at least start with data augmentation. There is hplt-project/OpusTrainer#43 already on file.
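
To make the data augmentation idea concrete, here is a minimal sketch that inserts the same URL into both sides of some training pairs so the model can learn to copy it. It is an assumption for illustration only and does not use the OpusTrainer API; `EXAMPLE_URLS`, `augment_with_url`, and the `probability` parameter are hypothetical. Because word alignment is not known here, the URL is only prepended or appended.

```python
import random

# Hypothetical pool of URLs; a real setup would sample from a much larger set.
EXAMPLE_URLS = (
    "https://example.com",
    "https://example.org/docs?id=42",
    "http://example.net/a/b.html",
)


def augment_with_url(
    source: str,
    target: str,
    rng: random.Random,
    probability: float = 0.1,
    urls: tuple[str, ...] = EXAMPLE_URLS,
) -> tuple[str, str]:
    """With the given probability, add one identical URL to both sides of a pair."""
    if rng.random() >= probability:
        return source, target
    url = rng.choice(urls)
    # Put the URL at the same end of both sentences so the pair stays parallel.
    if rng.random() < 0.5:
        return f"{url} {source}", f"{url} {target}"
    return f"{source} {url}", f"{target} {url}"


rng = random.Random(0)  # seeded so the augmentation is reproducible
print(augment_with_url("Hej världen", "Hello world", rng, probability=1.0))
```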

@eu9ene
Collaborator Author

eu9ene commented Jul 19, 2024

See also hplt-project/OpusTrainer#43

@jeremiahlee

I'm seeing this happen often with Swedish-to-English translations.
