[Feature] New transform to annotate with readability scores #923

Harmedox · 2025-01-08T01:35:49Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The proposal is to add a transform that annotates each document with metrics describing the readability, complexity, and grade level of its text. textstat is an easy-to-use library for calculating these statistics.

Mode of operation for this transform:

Curriculum learning: in this mode, a document is annotated only with the flesch_kincaid, gunning_fog, and automated_readability_index scores, plus the average of these three grade-level scores, to speed up annotation. Outside this mode, the document is annotated with all of the readability scores available in textstat.
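
A minimal sketch of how the curriculum-learning mode could look, assuming textstat's `flesch_kincaid_grade`, `gunning_fog`, and `automated_readability_index` functions. The helper names `curriculum_scorers` and `annotate` and the output key `avg_grade_level` are hypothetical, chosen for illustration only, and are not part of any existing transform.

```python
from statistics import mean
from typing import Callable, Dict

Scorer = Callable[[str], float]


def curriculum_scorers() -> Dict[str, Scorer]:
    """Return the three grade-level scorers named in the proposal.

    textstat is imported lazily so the rest of this sketch has no hard
    dependency on it. The dictionary keys are illustrative column names.
    """
    import textstat

    return {
        "flesch_kincaid": textstat.flesch_kincaid_grade,
        "gunning_fog": textstat.gunning_fog,
        "automated_readability_index": textstat.automated_readability_index,
    }


def annotate(text: str, scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Score one document and add the average of the individual scores."""
    scores = {name: scorer(text) for name, scorer in scorers.items()}
    # The average is computed over the individual scores only,
    # before the new key is added to the dictionary.
    scores["avg_grade_level"] = mean(scores.values())
    return scores
```

With textstat installed, `annotate(document_text, curriculum_scorers())` would return the three scores and their average for one document. Passing the scorers in as a dictionary keeps the full mode simple: it would only need a larger dictionary.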

Are you willing to submit a PR?

Yes I am willing to submit a PR!

touma-I · 2025-01-09T00:07:40Z

@Harmedox How is this being used for pre-training, fine-tuning, SR, RAG, etc.? Can you elaborate on the use case? Thanks.

Harmedox · 2025-01-09T00:32:43Z

Readability scores help identify and filter out hard-to-read, low-quality documents. We use a combination of readability scores and other quality metrics to identify high-quality documents suitable for pretraining.

For instance, the McAlpine EFLAW readability score is the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. Lower scores indicate that the document is easier to read for someone with English as a foreign language.
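
The formula above can be sketched in plain Python. This is an illustration only: the regex-based word and sentence splitting is a simplifying assumption, and a library such as textstat may tokenize differently and so return slightly different values.

```python
import re


def mcalpine_eflaw(text: str) -> float:
    """(words + mini-words) / sentences, with naive tokenization."""
    # Treat runs of '.', '!' or '?' as sentence boundaries (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes (an assumption).
    words = re.findall(r"[A-Za-z']+", text)
    # Mini-words have three characters or fewer.
    mini_words = [w for w in words if len(w) <= 3]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)
```

For example, "The cat sat on the mat. It was a big cat." has 11 words, all of them mini-words, across 2 sentences, so the sketch returns (11 + 11) / 2 = 11.0.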

Harmedox added the enhancement New feature or request label Jan 8, 2025
