I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
The proposal is a transform that annotates each document with metrics that help determine the readability, complexity, and grade level of its text. textstat is an easy-to-use library for calculating these statistics from text.
Mode of operation for this transform:
Curriculum learning: in this mode, a document is annotated only with the flesch_kincaid, gunning_fog, and automated_readability_index readability scores, plus the average of these three grade-level scores, to speed up the annotation process. Otherwise, the document is annotated with all of the readability scores available in textstat.
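A minimal sketch of what the curriculum-learning mode could look like, not the proposed transform itself. The function names `default_scorers` and `annotate_curriculum` and the output key `avg_grade_level` are hypothetical. The sketch assumes that the issue's `flesch_kincaid` corresponds to textstat's `flesch_kincaid_grade` function.

```python
from statistics import mean
from typing import Callable, Dict, Optional

Scorer = Callable[[str], float]


def default_scorers() -> Dict[str, Scorer]:
    """Return the three grade-level scorers used in curriculum-learning mode."""
    # Imported lazily so the module loads even where textstat is not installed.
    import textstat  # third-party: pip install textstat

    return {
        # Assumption: "flesch_kincaid" in the proposal means flesch_kincaid_grade.
        "flesch_kincaid": textstat.flesch_kincaid_grade,
        "gunning_fog": textstat.gunning_fog,
        "automated_readability_index": textstat.automated_readability_index,
    }


def annotate_curriculum(
    text: str, scorers: Optional[Dict[str, Scorer]] = None
) -> Dict[str, float]:
    """Score `text` with each scorer and add the mean of all scores."""
    if scorers is None:
        scorers = default_scorers()
    scores = {name: float(fn(text)) for name, fn in scorers.items()}
    # Hypothetical key name for the average of the grade-level scores.
    scores["avg_grade_level"] = mean(scores.values())
    return scores
```

Calling `annotate_curriculum(document_text)` with textstat installed would return the three scores plus their average. Passing a custom `scorers` mapping makes it easy to swap in other metrics or to test without the library.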
Are you willing to submit a PR?
Yes I am willing to submit a PR!
@Harmedox How is this being used for pre-training, fine-tuning, SR, RAG, etc.? Can you elaborate on the use case? Thanks.
Readability scores facilitate the identification and filtering of hard-to-read low-quality documents. We use a combination of readability scores and other quality metrics to identify high-quality documents suitable for pretraining.
For instance, the McAlpine EFLAW readability score is computed by adding the number of words in the document to the number of mini-words (words of ≤ 3 characters) and dividing the sum by the number of sentences. Lower scores indicate that the document is easier for a reader with English as a foreign language.
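As a rough illustration of the formula described above, here is a dependency-free sketch. The function name is illustrative, and the regex-based word and sentence splitting is a simplifying assumption, so results may differ from textstat's own `mcalpine_eflaw` implementation.

```python
import re


def mcalpine_eflaw_sketch(text: str) -> float:
    """Approximate EFLAW: (words + mini-words) / sentences."""
    # Naive sentence split on runs of terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Mini-words are words of three characters or fewer.
    mini_words = [w for w in words if len(w) <= 3]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)


score = mcalpine_eflaw_sketch("The cat sat on the mat. It was a sunny day.")
print(score)  # 11 words + 10 mini-words over 2 sentences -> 10.5
```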