[Feature] New transform to annotate with readability scores #923

Open
1 of 2 tasks
Harmedox opened this issue Jan 8, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@Harmedox

Harmedox commented Jan 8, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The proposal is a transform that annotates each document with metrics that help determine the readability, complexity, and grade level of its text. textstat is an easy-to-use library for calculating these statistics from text.

Mode of operation for this transform:

  • Curriculum learning: to speed up annotation, a document in this mode is annotated only with the flesch_kincaid, gunning_fog, and automated_readability_index readability scores, plus the average of these 3 grade-level scores. When this mode is off, the document is annotated with all of the readability scores available in textstat.
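
A minimal sketch of the curriculum learning mode described above, not the transform's actual implementation. The function names `curriculum_scorers` and `annotate` and the output key `avg_grade_level` are hypothetical. The three textstat calls are assumed to be the ones that back the scores named in the proposal. Scorers are passed in as plain callables so the annotation logic can be exercised without textstat installed.

```python
from statistics import mean
from typing import Callable, Dict

# A scorer maps raw text to a single numeric readability score.
Scorer = Callable[[str], float]


def curriculum_scorers() -> Dict[str, Scorer]:
    """Return the three grade-level scorers used in curriculum learning mode.

    Assumption: the proposal's score names map to these textstat functions.
    textstat is imported lazily so the rest of the sketch has no hard dependency.
    """
    import textstat  # third-party: pip install textstat

    return {
        "flesch_kincaid": textstat.flesch_kincaid_grade,
        "gunning_fog": textstat.gunning_fog,
        "automated_readability_index": textstat.automated_readability_index,
    }


def annotate(text: str, scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Score `text` with every scorer and append the average of those scores."""
    scores = {name: float(fn(text)) for name, fn in scorers.items()}
    # "avg_grade_level" is a hypothetical column name for the averaged score.
    scores["avg_grade_level"] = mean(list(scores.values()))
    return scores
```

With textstat installed, `annotate(document_text, curriculum_scorers())` would return the three scores plus their average. Supporting the full mode would then only require a larger scorer dictionary.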

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Harmedox Harmedox added the enhancement New feature or request label Jan 8, 2025
@touma-I
Collaborator

touma-I commented Jan 9, 2025

@Harmedox How is this being used for pre-training, fine-tuning, SR, RAG, etc.? Can you elaborate on the use case? Thanks.

@Harmedox
Author

Harmedox commented Jan 9, 2025

> @Harmedox How is this being used for pre-training, fine-tuning, SR, RAG, etc? Can you elaborate on the use case. Thanks

Readability scores help identify and filter out hard-to-read, low-quality documents. We combine readability scores with other quality metrics to identify high-quality documents suitable for pretraining.

For instance, the McAlpine EFLAW readability score is computed as the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. Lower scores indicate that the document is easier to read for someone with English as a foreign language.
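
The formula above can be sketched in plain Python. This is an illustrative implementation that assumes a simple regex tokenizer and treats `.`, `!`, and `?` as sentence terminators. The name `mcalpine_eflaw_score` is hypothetical, and textstat's own word and sentence counting may give different values on real text.

```python
import re


def mcalpine_eflaw_score(text: str) -> float:
    """(words + mini-words) / sentences, where a mini-word has <= 3 characters."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    mini_words = [w for w in words if len(w) <= 3]
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0  # empty input: avoid dividing by zero
    return (len(words) + len(mini_words)) / len(sentences)


sample = "The cat sat on the mat. It was a very comfortable mat."
# 12 words + 10 mini-words over 2 sentences
print(mcalpine_eflaw_score(sample))  # 11.0
```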
