[Feature] New transform to remove repeating text sequences from documents #921

Open
1 of 2 tasks
Harmedox opened this issue Jan 8, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

Harmedox commented Jan 8, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The current dedupe process identifies duplicate documents. In some cases, however, boilerplate such as menu items appears in the extracted text and repeats across multiple documents. In those cases, we only want to remove the repeated substrings rather than the documents that contain them.

What I propose is:

  • a new transform, e.g., sequence-deduplication
  • identify and remove all substrings of a given length (length_threshold) that are repeated more than a set number of times (frequency_threshold)
  • an option to retain the first copy of each duplicated sequence
  • modify the content in place rather than annotating it, for efficient memory usage
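
To make the proposal concrete, here is a minimal Python sketch of the behavior described above. It is not an existing transform in this project. The function name `remove_repeated_sequences` and the `keep_first` flag are hypothetical, and it assumes that "substring" means a character window of exactly `length_threshold` characters and that "repeated more times than" means a strictly greater total count across all documents. For simplicity it returns new strings instead of modifying content in place.

```python
from collections import Counter


def remove_repeated_sequences(docs, length_threshold, frequency_threshold,
                              keep_first=False):
    """Hypothetical sketch: drop character windows of `length_threshold`
    that occur more than `frequency_threshold` times across `docs`."""
    if length_threshold <= 0:
        raise ValueError("length_threshold must be positive")

    # Pass 1: count every window of the given length across all documents.
    counts = Counter()
    for text in docs:
        for i in range(len(text) - length_threshold + 1):
            counts[text[i:i + length_threshold]] += 1

    # Pass 2: mark the characters covered by over-frequent windows and
    # rebuild each document without them.
    seen = set()  # windows already retained once (only used with keep_first)
    cleaned = []
    for text in docs:
        drop = [False] * len(text)
        for i in range(len(text) - length_threshold + 1):
            window = text[i:i + length_threshold]
            if counts[window] <= frequency_threshold:
                continue  # not frequent enough to count as boilerplate
            if keep_first and window not in seen:
                seen.add(window)  # retain the first copy of this window
                continue
            for j in range(i, i + length_threshold):
                drop[j] = True
        cleaned.append("".join(ch for ch, d in zip(text, drop) if not d))
    return cleaned


docs = [
    "Home | About | Contact\nAlpha release notes.",
    "Home | About | Contact\nBeta pricing table.",
    "Home | About | Contact\nGamma contact form.",
]
print(remove_repeated_sequences(docs, length_threshold=10, frequency_threshold=2))
```

In this toy input the shared menu line occurs three times, so with `frequency_threshold=2` it is stripped from every document, while `keep_first=True` leaves it intact in the first document only. Counting every window in memory is only practical for small inputs, so a real transform would likely need a more scalable approach, such as hashing the windows.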

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
Harmedox added the enhancement label on Jan 8, 2025