Server-side image caching of images being used in articles #2532

Strubbl · 2016-11-03T19:13:43Z

As requested by @tcitworld in the gitter chat, this is the ticket for the feature request of server-side caching of images.

Summary

When downloading an article with wallabag, all images used in the article shall be downloaded, too. How they are saved is an implementation detail and not clear to me.

Issue details

Suppose I add an article to my wallabag. The article content is then saved to the database, but only the text is saved. Whenever I open an article in wallabag core, all of its images need to be requested again.

This is inspired by ttrss, where the images are cached in the folder cache/images/somehash.jpg. I do not know how the hash is calculated. But to avoid name clashes between articles, maybe articleURL+imageURL should be hashed. Or a separate subfolder for each article id? But even then name clashes can occur.
Or one could save the images as blobs in the database.
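
To make the articleURL+imageURL idea concrete, here is a minimal Python sketch (wallabag itself is written in PHP, so this is only an illustration). The function name cached_image_path, the choice of SHA-1, and the ".img" fallback extension are assumptions made for this example. It is not how ttrss or wallabag actually compute their hashes.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cached_image_path(article_url: str, image_url: str,
                      cache_dir: str = "cache/images") -> str:
    """Build a cache path from a hash of article URL + image URL (hypothetical helper)."""
    # A newline separates the two URLs so that ("ab", "c") and ("a", "bc")
    # can never produce the same hash input.
    key = f"{article_url}\n{image_url}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    # Keep the original file extension if the URL path has one.
    ext = PurePosixPath(urlsplit(image_url).path).suffix.lower() or ".img"
    return f"{cache_dir}/{digest}{ext}"


print(cached_image_path("http://example.org/article", "http://example.org/pic.jpg"))
```

With this scheme the same image used in two different articles is stored twice, which avoids name clashes at the cost of some disk space.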

Once the images of an article have been downloaded, all of the article's image tags need to be rewritten so that their src attributes point to the cached images.
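
A rough sketch of that rewriting step, in Python for illustration only. The helper name rewrite_img_srcs and the mapping from original URL to cached path are assumptions. A regular expression is used here to keep the example short; a real implementation would more likely use a proper HTML parser.

```python
import re

# Matches an <img ...> tag up to and including its quoted src value.
# Requiring whitespace before "src" skips attributes such as data-src.
IMG_SRC = re.compile(
    r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def rewrite_img_srcs(html: str, mapping: dict) -> str:
    """Replace img src values found in `mapping` (original URL -> cached path)."""
    def swap(match):
        prefix, quote, src = match.group(1), match.group(2), match.group(3)
        # Images that were not cached keep their original src.
        return f"{prefix}{quote}{mapping.get(src, src)}{quote}"

    return IMG_SRC.sub(swap, html)


article = '<p><img alt="x" src="http://example.org/b.jpg"></p>'
print(rewrite_img_srcs(article, {"http://example.org/b.jpg": "cache/images/hash1.jpg"}))
```

Images missing from the mapping are left untouched, so an article stays readable even if some downloads failed.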

API

The cached images should be exposed by the API somehow, so that 3rd-party apps like Android can download images directly from wallabag without contacting all the article sites.

Q: So the core API should give the list of pictures inside the articles?
A: If the wallabag core already caches images in a certain format, maybe yes: the links to the corresponding files (e.g. cache/image/hash1.jpg, cache/image/hash2.jpg, etc.)

Reloading of article

If I click on the refetch button in the article view, do the images need to be refetched? For performance, maybe not, because chances are high that a subset of the images is still in the article. For stability, maybe yes, so that the cache is always consistent.

Why

reuse in the wallabag android app for a real offline mode with no internet at all
if a site goes down in the meantime (between add and reading), i still have an intact article
less user tracking --> best case: download article once and never contact the original site without my request to do so
no mixed mode when using SSL anymore (e.g. wallabag served via ssl, but images from articles are not)

gitter chat log

left in this ticket just for reference:

Simon Alberny
@Simounet
16:34
Will the image downloading thing work on the Android App too? :fingercrossed:
ping @tcitworld
Thomas Citharel
@tcitworld
16:36
nope, they're not related
Nicolas Lœuillet
@nicosomb
16:37
And I don’t think it’s a good idea
Thomas Citharel
@tcitworld
16:37
I think it is if we have some kind of a cache system but it's tricky
Simon Alberny
@Simounet
16:38
Offline is the idea of these kind of apps.
I understand why you don't want to. But it is very handy.
I loved this feature on the Pocket App.
Nicolas Lœuillet
@nicosomb
16:39
pocket downloads pictures on your phone?
Simon Alberny
@Simounet
16:39
Yep.
Nicolas Lœuillet
@nicosomb
16:39
and you have an access to them in your gallery?
Simon Alberny
@Simounet
16:39
No.
It was only into the app.
It's great not to have to download pictures every time you load a page, especially on a bad network connection.
I'm using Wallabag to store content I want to read later, maybe without network.
Strubbl
@Strubbl
19:26
@Simounet @nicosomb @tcitworld I support the image caching idea. I often use wallabag when i have no internet connection at all. Is there some chance, that it gets implemented in the wallabag core and is going to be exposed to the API? Or should we think about image caching without the core's help? The pocket app did image caching, that's why it ate up to 1 GB space on my phone. But that is okay if i know i have full offline support.
Does anybody know tt-rss? This open source feed reader also supports downloading images from articles to the local server, where ttrss runs. It works well. Maybe we can get some inspiration from that project?
Thomas Citharel
@tcitworld
19:27
so the core api should give the list of pictures inside the articles ?
Strubbl
@Strubbl
19:29
if the wallabag core already caches images in a certain format, maybe yes, the links to the according files (e.g. cache/image/hash1.jpg cache/image/hash2.jpg etc)
Nicolas Lœuillet
@nicosomb
19:29
I know tt-rss, there is a wallabag plugin for this software :stuck_out_tongue_winking_eye:
Strubbl
@Strubbl
19:30
yes, yes. :grinning:
Thomas Citharel
@tcitworld
19:31
it requires saving the list of pictures in the database, or the pictures saved in a specific folder and named specifically
Strubbl
@Strubbl
19:31
and tt-rss core is caching already images if you want it to. but i do not know if the android app reuses that already from the server fetched images
yes, maybe a hash of the "articleurl+imageurl". then there is no need for extra database fields. but the image src's fields need to be replaced in the wallabag article then.
Thomas Citharel
@tcitworld
19:33
not sure listing the contents of a folder is faster than looking into database
interesting challenge, though :stuck_out_tongue:
@Strubbl @Simounet could someone open an issue about this ?
I think there's one in the android app repo, but we'll need one on the server repo
Strubbl
@Strubbl
20:10
a nice side effect of server side image caching is that i do not have to contact any other (at least less) internet servers for opening an article --> less tracking


j0k3r · 2016-11-04T09:33:49Z

Well, the downloading image feature just landed in the 2.2 branch.
This is how it works:

parse the html content
extract all images
download all images on the server (in web/assets/images)
store them in a folder generated using a hash of the id (using crc32 with hash). Something like assets/images/9/b/9b0ead26/, and the image name is also hashed (c638b4c2.png)
finally it replaces the img src in the html with that new link (prefixed with the wallabag instance url if defined)
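
The folder layout described above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual wallabag (PHP) code: the helper names crc32_hex and image_relative_path are made up, and what exactly is fed into the hash (here the entry id as a string, and the image URL) is a guess based on the description.

```python
import zlib


def crc32_hex(text: str) -> str:
    """Return the CRC-32 of `text` as 8 lowercase hex digits."""
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")


def image_relative_path(entry_id: int, image_url: str, extension: str) -> str:
    """Build a path shaped like assets/images/9/b/9b0ead26/c638b4c2.png."""
    folder = crc32_hex(str(entry_id))   # assumption: hash of the entry id
    name = crc32_hex(image_url)         # assumption: hash of the image URL
    # The first two hex digits become nested folders, which keeps any single
    # directory from collecting too many entries.
    return f"assets/images/{folder[0]}/{folder[1]}/{folder}/{name}.{extension}"


print(image_relative_path(42, "http://example.org/pic.png", "png"))
```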

Which means, if downloading images is enabled on the wallabag instance, the Android app will receive all img src targeting the wallabag instance.

This is for the downloading part on the server side.
I also love being able to read articles completely offline (that's why I built http://f43.me/) and I love the Reeder feature of downloading images locally on my iPhone.

What does Reeder do?

download all items from my RSS
once downloaded, parse them to extract all images to be downloaded
display a progression while downloading images (Caching 7 of 45)
replace img src after each image is downloaded with the local path
I guess it also stores that information somewhere, because if you clear your cache it can still display the original image

If you don't enable that feature, images are downloaded on the fly when you open an article.

Why not implement the same feature in the Android / iOS app?

Strubbl · 2016-11-05T09:05:31Z

What happens to the JSON Export with that feature in wallabag core? Does it still contain original links or links to my wallabag instance for the used pictures in the articles?

It is now WIP in the Android app: wallabag/android-app#343

j0k3r · 2016-11-06T22:32:16Z

Since the behavior is to replace the original link in the html with the one to the wallabag instance, everything related to the content will get image links pointing to the wallabag instance.

Good job on the Android part 👍

gerroon · 2016-11-07T03:53:16Z

This was supported in 1.9 if I remember right. I was able to cache the images locally. Is there any reason why that feature was not carried over to the new version?

Strubbl · 2016-11-07T20:15:47Z

@j0k3r Thank you.

So there is no possibility to just get my json export from my instance 1, go to my wallabag instance 2 and restore all articles? Because I would also have to carry over my cached images, since the articles in the json export do not contain the original image links anymore?

j0k3r · 2016-11-08T15:33:19Z

@gerroon this is off topic. v2 is a complete rewrite from scratch. To be able to ship v2 faster we postponed some features like image downloading, the search engine, etc.

@Strubbl you won't be able to re-import images from v1 to v2 (like moving them thru folders). Images will have to be re-downloaded.

Strubbl · 2017-02-07T16:44:04Z

This issue is closed since we merged PR #2180

tcitworld added the Meta label Nov 4, 2016

Strubbl closed this as completed Feb 7, 2017

j0k3r added this to the 2.2.0 milestone Feb 7, 2017
