[Feature] New transform to remove repeating text sequences from documents #921

Harmedox · 2025-01-08T00:57:31Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The current dedupe process identifies duplicate documents as a whole. In some cases, however, boilerplate such as menu items appears in the extracted text and repeats across many documents. In those cases we only want to remove the repeated substrings, not the entire documents.

What I propose is:

a new transform, e.g., sequence-deduplication
identify and remove all substrings of a given length (length_threshold) that occur more than a set number of times (frequency_threshold)
option to retain the first copy of the duplicated sequence
modify the content in place rather than annotating it, for efficient memory usage
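
To make the proposal concrete, here is a rough sketch of the behaviour I have in mind. It is not an implementation of the transform: the function name `remove_repeated_sequences` and the `keep_first` flag are hypothetical, and it assumes that "length" is measured in whitespace-separated tokens (it could equally be characters or lines). It also returns new strings and keeps every n-gram count in memory, which a real transform would need to avoid.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Return all contiguous token windows of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def remove_repeated_sequences(docs, length_threshold, frequency_threshold,
                              keep_first=False):
    """Drop token windows of size length_threshold that occur more than
    frequency_threshold times across all docs (hypothetical helper).

    Assumption: sequences are measured in whitespace-separated tokens.
    """
    tokenized = [doc.split() for doc in docs]

    # Pass 1: count every window across the whole collection.
    counts = Counter(
        gram for toks in tokenized for gram in _ngrams(toks, length_threshold)
    )
    frequent = {g for g, c in counts.items() if c > frequency_threshold}

    # Pass 2: mark tokens covered by a frequent window and drop them.
    seen = set()  # windows whose first copy has already been kept
    cleaned = []
    for toks in tokenized:
        drop = [False] * len(toks)
        for i, gram in enumerate(_ngrams(toks, length_threshold)):
            if gram not in frequent:
                continue
            if keep_first and gram not in seen:
                seen.add(gram)  # retain the first copy only
                continue
            for j in range(i, i + length_threshold):
                drop[j] = True
        cleaned.append(" ".join(t for t, d in zip(toks, drop) if not d))
    return cleaned


docs = [
    "Home About Contact Welcome to page one",
    "Home About Contact Second page text",
    "Home About Contact Third page here",
]
stripped = remove_repeated_sequences(docs, length_threshold=3,
                                     frequency_threshold=2)
kept_first = remove_repeated_sequences(docs, length_threshold=3,
                                       frequency_threshold=2, keep_first=True)
```

In this example the menu text "Home About Contact" appears three times, which exceeds the frequency threshold of 2, so it is removed from every document, or from all but the first when `keep_first=True`.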

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Harmedox added the enhancement New feature or request label Jan 8, 2025
