[BUG]: issue with image parsing with css2rss #1477

Closed
seventyiris83 opened this issue Aug 14, 2024 · 3 comments

@seventyiris83

Brief description of the issue

related to #8

when i use css2rss to get rumble videos as articles in rss guard using this as post-process script:

python css2rss.py "[role='listitem']" "h3" "img"

in 4.3.3 lite it correctly appends video thumbnails to the articles:

[screenshot: 000000001]

in 4.7.2 lite and 4.7.3 lite it doesn't:

[screenshot: 000001]

How to reproduce the bug?

1. have 4.3.3 lite and 4.7.2 lite or 4.7.3 lite on your system
2. have css2rss installed
3. add a new feed with this url: https://rumble.com/c/LofiGirl
4. add python css2rss.py "[role='listitem']" "h3" "img" as post-process script
5. fetch the feed articles
6. see the difference between 4.3.3 lite and 4.7.2 lite/4.7.3 lite

What was the expected result?

4.7.2 lite to append the video thumbnails to articles (jpg urls)

What actually happened?

4.7.2 lite appends an unwanted section of the img item resulting in no video thumbnail appended

Debug log

N/A

Operating system and version

  • OS: Windows 10
  • RSS Guard version: 4.7.2 lite/4.7.3 lite
@martinrotter
Owner

will look into this

martinrotter added a commit that referenced this issue Aug 20, 2024
@martinrotter
Owner

Check it.

Sadly, the internal regular expression that extracts image information from an article's HTML gets confused because your HTML excerpts contain this:

<img alt="Lofi Girl's red scarf 🧣" class="thumbnail__image" draggable="false" height="270" loading="lazy" onerror="this.onerror=null;this.src="data:image/svg+xml,%3Csvg width='480' height='270' xmlns='http://www.w3.org/2000/svg'/%3E"" src="https://1a-1791.com/s/s8/1/L/7/z/k/L7zko.oq1b.2-small-Lofi-Girls-red-scarf-.jpg" width="480"/>

Specifically, the HTML contains a "src=" part which is in fact not the real source of the picture; that piece of code is embedded JavaScript.

I tweaked the logic a bit and it all seems to work now. Sadly, the regular expression approach will not work for 100% of websites. If another problem arises in the future, I will have to rethink the approach and maybe use full XML/HTML parsing to extract the source URLs of images.
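To illustrate the difference between the two approaches mentioned above, here is a minimal Python sketch. It is not RSS Guard's code (RSS Guard's actual regular expression is not shown in this thread); the pattern in `naive_src`, the `ImgSrcCollector` class, and the simplified `SNIPPET` with an example.com URL are all hypothetical and chosen only to reproduce the same kind of confusion: a `src=` inside an `onerror` handler that appears before the real `src` attribute.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the excerpt in this thread:
# the onerror handler contains its own "src=" before the real attribute.
SNIPPET = (
    '<img alt="thumb" class="thumbnail__image" '
    "onerror=\"this.onerror=null;this.src='data:image/gif;base64,AAAA'\" "
    'src="https://example.com/thumb.jpg" width="480"/>'
)


def naive_src(html):
    """Return whatever follows the first 'src=' (assumed naive pattern)."""
    match = re.search(r"""src=["']([^"']+)""", html)
    return match.group(1) if match else None


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag using a real parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img .../>.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def parsed_srcs(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    print("regex :", naive_src(SNIPPET))
    print("parser:", parsed_srcs(SNIPPET))
```

The regular expression stops at the first `src=` it sees, which here belongs to the JavaScript in `onerror`, while the parser splits the tag into attributes first and only then looks up `src`. The trade-off is that a parser is heavier and has to cope with malformed markup, which is presumably why a regular expression was used in the first place.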

@martinrotter martinrotter added this to the 4.7.4 milestone Aug 20, 2024
@seventyiris83
Author

thanks for looking into this 😄

i tested the build containing the change and it does now work! the images are appended to articles.

but it's the same result as before when "use legacy article formatting" is enabled:

[screenshot: 000001]

could your fix also be applied to the legacy article formatting?
