Search before asking
I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
In the current dedupe process, duplicate documents are identified. However, in certain situations, boilerplate such as menu items appears in the extracted text and repeats across multiple documents. In this case, we only care about removing the repeated substrings, not the documents that contain them.
What I propose is:
a new transform - e.g., sequence-deduplication
identify and remove all substrings of a given length (length_threshold) that are repeated more times than a set number (frequency_threshold)
option to retain the first copy of the duplicated sequence
modify the content in place rather than annotating it, to keep memory usage low
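To make the proposal concrete, here is a minimal Python sketch of how such a transform might behave. It is only an illustration under several assumptions: the function name `sequence_dedup` and the `keep_first` flag are hypothetical, substrings are treated as character windows of exactly `length_threshold` characters, occurrences are counted across the whole corpus, and the corpus fits in memory. For readability it returns new strings instead of modifying the content in place.

```python
from collections import Counter


def sequence_dedup(docs, length_threshold, frequency_threshold, keep_first=False):
    """Remove character windows that repeat too often across `docs`.

    Hypothetical helper for illustration only, not an existing transform.
    """
    # Pass 1: count every window of `length_threshold` characters in the corpus.
    counts = Counter()
    for text in docs:
        for i in range(len(text) - length_threshold + 1):
            counts[text[i:i + length_threshold]] += 1

    # A window is "frequent" if it occurs more than `frequency_threshold` times.
    frequent = {w for w, c in counts.items() if c > frequency_threshold}

    # Pass 2: mark the characters covered by frequent windows and drop them.
    seen = set()  # frequent windows already encountered (used by keep_first)
    cleaned = []
    for text in docs:
        drop = [False] * len(text)
        for i in range(len(text) - length_threshold + 1):
            window = text[i:i + length_threshold]
            if window not in frequent:
                continue
            if keep_first and window not in seen:
                seen.add(window)  # retain the first copy of this window
                continue
            for j in range(i, i + length_threshold):
                drop[j] = True
        cleaned.append("".join(ch for ch, d in zip(text, drop) if not d))
    return cleaned


docs = [
    "MENU: Home | About\nFirst article.",
    "MENU: Home | About\nSecond post!",
    "MENU: Home | About\nThird note..",
]
print(sequence_dedup(docs, length_threshold=10, frequency_threshold=2))
print(sequence_dedup(docs, length_threshold=10, frequency_threshold=2, keep_first=True))
```

In this example the shared menu line is stripped from every document, or from all but the first when `keep_first=True`. A real transform would likely need a more scalable structure than an in-memory counter (for example hashing windows or a suffix array), which is left open here.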
Are you willing to submit a PR?
Yes I am willing to submit a PR!