Investigate monolingual cleaning #476

Closed
Tracked by #216
eu9ene opened this issue Mar 12, 2024 · 3 comments · Fixed by #991

Comments

@eu9ene
Collaborator

eu9ene commented Mar 12, 2024

Based on @marco-c's feedback, we should investigate how the HPLT project cleans monolingual data and whether we should adjust our own cleaning procedure.

https://hplt-project.org/HPLT_D3_1___Software_for_cleaning_data_sets.pdf

eu9ene added the quality label (Improving robustness and translation quality) on Mar 12, 2024
@marco-c
Collaborator

marco-c commented Apr 11, 2024

@marco-c
Collaborator

marco-c commented May 17, 2024

See also #247.

@ZJaume
Collaborator

ZJaume commented Oct 30, 2024

We are now using https://github.com/pablop16n/web-docs-scorer, which gives us much better results when inspecting cleaned data. However, it requires cleaning at the document level.
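
To illustrate what cleaning at the document level implies, here is a minimal sketch. It assumes each document is a list of sentences and uses a hypothetical `toy_score` function as a stand-in scorer. The names `filter_documents`, `toy_score`, and the threshold are illustrative assumptions, not the web-docs-scorer API.

```python
from typing import Callable, Iterable, Iterator


def filter_documents(
    documents: Iterable[list[str]],
    score_document: Callable[[list[str]], float],
    threshold: float,
) -> Iterator[str]:
    """Yield sentences only from documents whose score meets the threshold."""
    for sentences in documents:
        # One decision per document: it is kept or dropped as a whole.
        if score_document(sentences) >= threshold:
            yield from sentences


def toy_score(sentences: list[str]) -> float:
    # Hypothetical stand-in scorer: fraction of lines with more than three words.
    if not sentences:
        return 0.0
    return sum(len(s.split()) > 3 for s in sentences) / len(sentences)


docs = [
    ["This is a well formed sentence.", "Here is another complete sentence."],
    ["Home", "Login", "Click here to read more now"],
]
kept = list(filter_documents(docs, toy_score, threshold=0.5))
```

In this toy example the second document is dropped entirely, including its last line, which a purely sentence-level filter with the same word-count rule would have kept. That is why the monolingual data needs to retain document boundaries for this approach to work.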
