Add Option to Settings to Suppress Hidden Elements #1268

Open
Nitrousoxide opened this issue Mar 28, 2024 · 3 comments

Comments


Nitrousoxide commented Mar 28, 2024

Is your feature request related to a problem? Please describe.
More sites are using hidden text watermarks in an attempt to poison LLM harvesting. While some site parsers have been updated, a general setting to suppress hidden elements like these would be helpful in two cases: for sites whose parser is 4 years old (example which suffers from hidden watermark), and for pages where a user has to make their own parser because the plugin can't identify how to handle the page (like this one; it doesn't use a watermark, but the same would apply to any such site that does, or that uses hidden elements that hurt readability).

Describe the solution you'd like
A checkbox to suppress hidden elements from being rendered into the epub in the settings

Describe alternatives you've considered
Updating all the parsers as they implement watermarks is a potential option and it probably should still be done even if this option is implemented. But it wouldn't protect sites like azaleaellis's above example which have no parser at all and require a user-defined one.

Owner

dteviot commented Mar 28, 2024

Unfortunately, it's not that simple.
The way the watermarking is done differs from site to site.
Conceptually, the usual approach is that a watermark is marked with some sort of tag, and JavaScript hides or removes the marked element(s) when the page is viewed normally. However, as WebToEpub does not run the JavaScript, the watermark elements remain.

And the tagging differs from site to site.
e.g. https://re-library.com/ seems to tag the watermarks with a number of different classes, although it looks like "code-block" is the key element.

Anyway, I've provided EpubEditor (dteviot/EpubEditor#4) to make it pretty easy to fix up the epubs after they are collected.

For example, to fix this for re:library above, the following script can be used.

            // Remove every element carrying the class that appears to mark re:library's watermarks.
            for (const p of [...dom.querySelectorAll("div.code-block")]) {
                p.remove();
            }
            return true;
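
Since re:library seems to tag its watermarks with several different classes, the same idea could be extended to a list of selectors. The following is only a sketch: `removeMatching` is a hypothetical helper name, and it assumes nothing about the `dom` object beyond the `querySelectorAll` call already used in the script above. Any selector other than `div.code-block` would have to be found by inspecting the site.

```javascript
// Hypothetical helper (not part of WebToEpub or EpubEditor).
// Removes every element that matches any of the given CSS selectors
// and returns how many elements were removed.
function removeMatching(dom, selectors) {
    // A comma-separated selector list matches elements that satisfy any entry.
    const combined = selectors.join(", ");
    // Copy into an array first so removal cannot disturb the iteration.
    const matches = [...dom.querySelectorAll(combined)];
    for (const element of matches) {
        element.remove();
    }
    return matches.length;
}
```

In an EpubEditor script this might be called as `removeMatching(dom, ["div.code-block"]); return true;`, adding further selectors to the array as they are identified.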

@Nitrousoxide
Author

Oh for sure, there are a multitude of ways one could try to hide watermarking, and I do think updating the parsers or editing the finished product would give the best tailored response. But a checkbox to block or remove common known watermarking techniques might be a good addition to WebToEpub, with a note that it's a generic solution and may not work for every hidden element.

Totally understand if you think this is out of scope though, so if so please feel free to close.
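
To make the "generic solution" idea concrete, here is a rough sketch of what such a check might look for. Everything in it is an assumption for illustration: the function names and the list of patterns are hypothetical, not anything WebToEpub currently does. It only inspects the `hidden` attribute and inline `style` values, so it would miss watermarks hidden by a stylesheet class or by script, which is the site-specific case described earlier in the thread.

```javascript
// Illustrative patterns only; not an exhaustive or agreed list of hiding techniques.
const HIDING_STYLE_PATTERNS = [
    /display\s*:\s*none/i,
    /visibility\s*:\s*hidden/i,
    /font-size\s*:\s*0(px|em|rem|pt|%)?\s*(;|$)/i,
    /opacity\s*:\s*0(\.0+)?\s*(;|$)/i,
];

// Hypothetical check: true if the element is hidden via the `hidden`
// attribute or via one of the inline-style patterns above.
function isLikelyHidden(element) {
    if (element.hasAttribute("hidden")) {
        return true;
    }
    const style = element.getAttribute("style") || "";
    return HIDING_STYLE_PATTERNS.some((pattern) => pattern.test(style));
}

// Hypothetical clean-up pass: removes every element that isLikelyHidden
// flags and returns the number of elements removed.
function removeLikelyHidden(dom) {
    const candidates = [...dom.querySelectorAll("[hidden], [style]")];
    const hidden = candidates.filter(isLikelyHidden);
    for (const element of hidden) {
        element.remove();
    }
    return hidden.length;
}
```

A setting like the proposed checkbox could run something along the lines of `removeLikelyHidden` on each chapter before packing, while anything it misses would still need a parser update or an EpubEditor script.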

@kevin01523

There's an option to remove tags on an unknown site. I forgot how to access it, but I did use it to remove ads, the comment section, etc. lol

It's probably easy to add handling for the most common watermarks or hidden elements, and leave the unusual ones for EpubEditor to work on.
