Web pages cannot correctly identify and download image links. #174

jinshuqishi2019 · 2024-05-27T12:27:07Z

Environment

Operating System: debian 10
node --version: v20.12.2
npm --version: 10.7.0
yarn --version, if using Yarn:
percollate --version: v4.2.1

Description

Thank you for answering my earlier question, "Images requiring a Referer header are not fetched".

Another minor issue mentioned in that question has not yet been resolved.

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate html --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - --inline

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate epub --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ -

The EPUB and HTML outputs do not include the downloaded images; even with the --inline option, the HTML output still references the remote image URLs, whereas the PDF output displays the images correctly.

Thanks.


jinshuqishi2019 · 2024-05-29T12:01:53Z

I am currently using a workaround to get the images downloaded: I use the sed command to rewrite the image URLs in the HTML so that they end in a recognizable image file extension (such as .png).

I also asked ChatGPT how to solve this problem, and it suggested installing the Cheerio library to extract the image links. For images without a file extension, it recommended using the mime-types library to read the MIME type from the response headers and derive the file extension from it.
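The idea of deriving an extension from the response's MIME type can be sketched in plain shell, without Cheerio or mime-types. This is only an illustration: ext_for_mime is a hypothetical helper name, the list of types is an assumption rather than anything percollate does, and the curl call in the comment assumes the server reports a Content-Type for the image.

```shell
#!/bin/sh
# Hypothetical helper: map a Content-Type value to a file extension.
# Prints the extension and returns 0, or returns 1 for unknown types.
ext_for_mime() {
  # Drop parameters such as "; charset=binary", then normalize case and spaces.
  mime=$(printf '%s' "${1%%;*}" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
  case "$mime" in
    image/png)     echo png ;;
    image/jpeg)    echo jpg ;;
    image/gif)     echo gif ;;
    image/webp)    echo webp ;;
    image/svg+xml) echo svg ;;
    *)             return 1 ;;
  esac
}

# Possible use with curl, which can print the Content-Type it received:
#   ext_for_mime "$(curl -s -o /dev/null -w '%{content_type}' "$image_url")"
```

Unknown types return a non-zero status instead of a guess, so a caller can decide whether to skip the image or fall back to a generic extension.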

danburzo · 2024-05-29T12:03:40Z

Thanks for the report, it seems that something is not hooked up correctly when the HTML content comes via the standard input. Will investigate!

jinshuqishi2019 · 2024-06-07T13:39:06Z

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw | sed -e 's/\(\?wx_fmt=png\)[^"]*/.png/gI' -e 's/\(\?wx_fmt=jpe\?g\)[^"]*/.jpg/gI' -e 's/\(\?wx_fmt=gif\)[^"]*/.gif/gI' | percollate epub https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw -

danburzo · 2024-08-11T17:59:35Z

In the example from the first message, image URLs don’t use a file extension that allows us to identify the image format. In #178 I’ve made a change so that such images receive a generic image MIME media type. Similarly, images collected as resource files for bundling inside an EPUB will be saved with an .image file extension and an image MIME type.

The extent to which the EPUB reader can render images with such a MIME type depends on the application, but it’s the best we can do at the moment.

danburzo · 2024-08-11T18:01:13Z

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

To bundle several pages fetched with an external tool (in this case chromium), you can try saving them to files on the disk and then using something like:

percollate epub file1.html --url=url1 file2.html --url=url2 ...
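As a rough sketch of that suggestion, the script below saves each page with chromium (using the same flags as earlier in this thread) and then passes every file, paired with its --url, to a single percollate run. The bundle_epub function name and the page1.html, page2.html, ... file names are assumptions chosen for illustration, and the sed rewriting from the earlier command is left out for brevity.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: save each URL's rendered DOM as page1.html, page2.html, ...
# and then hand all files to one percollate run, each followed by its --url.
bundle_epub() {
  count=$#
  i=1
  for url in "$@"; do
    chromium --headless --incognito --dump-dom "$url" > "page${i}.html"
    # Append the saved file and its source URL after the original arguments.
    set -- "$@" "page${i}.html" "--url=${url}"
    i=$((i + 1))
  done
  # Drop the original URLs, keeping only the appended file/--url pairs.
  shift "$count"
  percollate epub "$@"
}

# Run only when URLs are given on the command line.
if [ "$#" -gt 0 ]; then
  bundle_epub "$@"
fi
```

Saved under a name such as bundle.sh (also hypothetical), it could be called as sh bundle.sh url1 url2, which ends up running percollate epub page1.html --url=url1 page2.html --url=url2.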

danburzo · 2024-08-11T18:48:32Z

Released as [email protected]

danburzo mentioned this issue Aug 11, 2024

Handle images without an extension as 'image' MIME and '.image' extension #178

Merged

danburzo closed this as completed in #178 Aug 11, 2024
