Web pages cannot correctly identify and download image links. #174

Closed
jinshuqishi2019 opened this issue May 27, 2024 · 6 comments · Fixed by #178
Comments

@jinshuqishi2019

Environment

  • Operating System: debian 10
  • node --version: v20.12.2
  • npm --version: 10.7.0
  • yarn --version, if using Yarn:
  • percollate --version: v4.2.1

Description

Thank you for answering my earlier question, "Images requiring a Referer header are not fetched".

Another minor issue mentioned in that question has not yet been resolved.

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate html --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - --inline
chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate epub --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - 

The EPUB and HTML outputs do not download the images. Even with the --inline option, the HTML output still references the remote image URLs, whereas the PDF output displays the images correctly.

Thanks.

@jinshuqishi2019
Author

I am currently using a workaround to get the images downloaded: I use sed to rewrite the image URLs in the HTML so that they end in an image file extension (such as .png).
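A minimal sketch of that kind of rewrite, for illustration only. The function name `add_image_extensions` and the sample URL in the usage note are made up, and the `I` flag (case-insensitive matching) assumes GNU sed:

```shell
# Rewrite image URLs carrying a "?wx_fmt=<type>" query string so that they
# end in a plain file extension instead. Reads HTML on stdin, writes to stdout.
# Assumption: GNU sed (-E for extended regexes, I for case-insensitive match).
add_image_extensions() {
  sed -E \
    -e 's/\?wx_fmt=png[^"]*/.png/gI' \
    -e 's/\?wx_fmt=jpe?g[^"]*/.jpg/gI' \
    -e 's/\?wx_fmt=gif[^"]*/.gif/gI'
}
```

Given an input line such as `<img src="https://example.com/a/640?wx_fmt=png&from=appmsg">`, the query string up to the closing quote is replaced, leaving `<img src="https://example.com/a/640.png">`. Whether the server still serves the image at the rewritten URL is a separate question that this sketch does not address.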

I asked ChatGPT how to solve this problem, and it suggested installing the Cheerio library to extract the image links. For images whose URLs have no file extension, it also recommended using the mime-types library to read the MIME type from the response headers and derive the file extension from it.
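The idea of deriving an extension from the response's Content-Type can be sketched in plain shell as follows. This is not how percollate or the mime-types library does it; the function name `ext_for_mime`, the short list of types, and the `bin` fallback are all assumptions made for the example:

```shell
# Map a Content-Type header value to a file extension.
# Only a few common image types are covered; anything else falls back to
# "bin" (an arbitrary choice for this sketch).
ext_for_mime() {
  # Drop parameters such as "; charset=binary", strip whitespace, lowercase.
  mime=$(printf '%s' "$1" | cut -d';' -f1 | tr -d ' \t\r' | tr '[:upper:]' '[:lower:]')
  case "$mime" in
    image/png)     echo png ;;
    image/jpeg)    echo jpg ;;
    image/gif)     echo gif ;;
    image/webp)    echo webp ;;
    image/svg+xml) echo svg ;;
    *)             echo bin ;;
  esac
}

# Possible usage with curl, which can print the response's Content-Type:
#   type=$(curl -s -o /dev/null -w '%{content_type}' "$image_url")
#   ext=$(ext_for_mime "$type")
```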

@danburzo
Owner

Thanks for the report. It seems that something is not hooked up correctly when the HTML content comes in via standard input. Will investigate!

@jinshuqishi2019
Author

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw|sed -e 's/\(\?wx_fmt=png\)[^"]*/.png/gI' -e 's/\(\?wx_fmt=jpe\?g\)[^"]*/.jpg/gI' -e 's/\(\?wx_fmt=gif\)[^"]*/.gif/gI'| percollate epub https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw -  

@danburzo
Owner

In the example from the first message, the image URLs don’t have a file extension that would let us identify the image format. In #178 I’ve made a change so that such images receive a generic image MIME media type. Similarly, images collected as resource files for bundling inside the EPUB will be saved with an .image file extension and an image MIME type.

The extent to which the EPUB reader can render images with such a MIME type depends on the application, but it’s the best we can do at the moment.

@danburzo
Owner

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

To bundle several pages fetched with an external tool (in this case chromium), you can try saving them to files on the disk and then using something like:

percollate epub file1.html --url=url1 file2.html --url=url2 ...
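As a rough sketch, the save-then-bundle approach could be scripted like this. The function name `bundle_pages`, the `pageN.html` file naming, and the `DRY_RUN` switch are made up for the example; the `chromium` binary name is taken from the commands earlier in this thread and may differ on other systems:

```shell
# Save each URL's rendered DOM to a file, then pass all files to percollate
# as "file1.html --url=url1 file2.html --url=url2 ...".
# Usage: bundle_pages OUTDIR URL...
# Set DRY_RUN=1 to only print the percollate command (nothing is fetched).
bundle_pages() {
  outdir=$1
  shift
  if [ -z "${DRY_RUN:-}" ]; then
    mkdir -p "$outdir" || return 1
  fi
  n=0
  # The list in "for" is expanded once, so the positional parameters can be
  # rewritten inside the loop: each pass appends a file/--url pair and drops
  # the URL that was just handled from the front.
  for url in "$@"; do
    n=$((n + 1))
    file="$outdir/page$n.html"
    if [ -z "${DRY_RUN:-}" ]; then
      chromium --headless --incognito --dump-dom "$url" > "$file" || return 1
    fi
    set -- "$@" "$file" "--url=$url"
    shift
  done
  if [ -n "${DRY_RUN:-}" ]; then
    echo percollate epub "$@"
  else
    percollate epub "$@"
  fi
}
```

For instance, after `DRY_RUN=1`, calling `bundle_pages /tmp/pages https://example.com/a https://example.com/b` prints `percollate epub /tmp/pages/page1.html --url=https://example.com/a /tmp/pages/page2.html --url=https://example.com/b`. A sed rewrite of the image URLs could be added between the chromium call and the redirect if needed.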

@danburzo
Owner

Released as [email protected]
