Based on @marco-c's feedback we should investigate how the HPLT project cleans monolingual data and whether we should adjust our cleaning procedure.
https://hplt-project.org/HPLT_D3_1___Software_for_cleaning_data_sets.pdf
Worth looking at: https://hplt-project.org/HPLT_D3_1___Software_for_cleaning_data_sets.pdf
See also #247.
We are now using https://github.com/pablop16n/web-docs-scorer, which gives us much better results when we inspect the cleaned data. However, it requires cleaning at the document level.
eu9ene