
Question: Use directly ZIM files? #42

Closed
kelson42 opened this issue Jun 11, 2017 · 11 comments

Comments

@kelson42

Would that be possible without having to unpack it into millions of small single files?
Would software like kiwix-serve (or any other ZIM reader) be able to serve content through IPFS?

More information about kiwix-serve:
http://wiki.kiwix.org/wiki/Kiwix-serve

@singpolyma

This could be very useful, because (with an appropriate client) I think local ZIM files (fetched from IPFS) can be searched, etc.?

@derhuerst

cross-post from derhuerst/build-wikipedia-feed#2 (comment) :

Have you looked into getting the images from articles as well? From what I understand, Kiwix has distributions where they bundle images in (for example wikipedia_en_all_2016-12.zim is 62GB)... I actually wrote a parser for their format recently https://www.npmjs.com/package/zimmer but haven't finished doing the actual import-to-dat part :P

@kelson42
Author

kelson42 commented Sep 9, 2019

@lidel I would be really interested in refreshing this ticket. Let me know if you want to discuss this. I'm pretty motivated to help this project move forward.

@lidel
Member

lidel commented Sep 9, 2019

Potential path to using ZIM directly

Publishing ZIM on IPFS

Something we could start doing today is publishing .zim files on IPFS. The IPFS CID could be listed along with the HTTP and BitTorrent ones. It would act as a distributed CDN: files could be accessed via a local node (ipfs-desktop / go-ipfs) or via one of the public HTTP Gateways.

@kelson42 Is this something Kiwix would be interested in trying to do?
FYI there are two experimental ways of adding data to IPFS without duplicating it on disk: ipfs-filestore and ipfs-urlstore – these may be useful when introducing IPFS to existing infrastructure.
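For reference, a command sketch of the filestore approach with go-ipfs (experimental flags; the filename is just an example, not a real published archive):

```shell
# Experimental: reference the ZIM's bytes in place instead of copying them
# into the IPFS datastore. Requires enabling the filestore first.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy --cid-version 1 wikipedia_en_all_2016-12.zim
```

The trade-off is that the node must keep the original file at the same path, since blocks are served directly from it.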

Web-based reader

This one is a long shot, but it opens exciting possibilities: if a ZIM reader in pure JS existed, a web browser would be the only software required to browse a distributed mirror. The reader would be just a set of static HTML+CSS+JS files published on IPFS along with the ZIM archives, making it a self-contained proposition.

While it would be possible to read ZIM from an HTTP Gateway via range requests, a more decentralized option would be to run an embedded JS IPFS node on the page and request specific byte ranges via something like

```js
ipfs.cat('/ipfs/QmHash/wiki.zim', { offset: x, length: y })
```

Question: how feasible is this from the perspective of existing ZIM readers?
Is there any prior art for a JS one, apart from zimmer?
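As a concrete sketch of the range-request fallback mentioned above (hypothetical helper names; the gateway URL and CID would be placeholders, not a real published archive):

```javascript
// Build an HTTP Range header covering the byte window [offset, offset + length).
function rangeHeader(offset, length) {
  return `bytes=${offset}-${offset + length - 1}`;
}

// Hypothetical reader helper: fetch an arbitrary byte range of a ZIM published
// on IPFS through any HTTP gateway (gatewayUrl and cid are placeholder values).
async function readZimRange(gatewayUrl, cid, offset, length) {
  const res = await fetch(`${gatewayUrl}/ipfs/${cid}`, {
    headers: { Range: rangeHeader(offset, length) },
  });
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

A browser-based reader could use the same code path for both options, swapping the gateway fetch for `ipfs.cat` with `offset`/`length` when an embedded node is available.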

Concerns / Unknowns

I am new to this endeavor, but from quick eyeballing it looks like ZIM is a flat file optimized for random seeks within it. At some level this is similar to files put on IPFS: they get chunked, and the produced blocks are assembled into balanced trees (Merkle DAGs) optimized for random access. Not sure what performance would look like if we put ZIM on IPFS and try to fetch it over the network, but we can experiment with this.

Data deduplication (dedicated chunker for ZIM?)

A potential problem with using ZIM directly is deduplication. When we put an unpacked mirror on IPFS, a lot of data does not change between snapshots. All media assets such as images, audio files etc. get deduplicated across the entire IPFS swarm (all snapshots, and all websites using the same image, are cohosting it).

IIUC ZIM does internal compression of Clusters (>1MB) of data, which means each ZIM file is a different stream of bytes, defeating the deduplication provided by IPFS.

My understanding is that good deduplication is not possible unless ZIM Cluster compression is deterministic across snapshots (always compresses the same assets, and compressing the same assets produces exactly the same array of bytes) AND/OR we add ZIM to IPFS using a custom chunker that is aware of its internal structure, enabling deduplication of the same content across snapshots. This could also be a neat demo of what is possible with https://ipld.io

Update: created #71 to benchmark the level of deduplication we can get with regular ipfs add + some custom parameters

Please let me know if I missed something here.

@kelson42
Author

kelson42 commented Sep 9, 2019

@lidel Distributing ZIM files via IPFS would be interesting, and I would volunteer to make it happen if the process is not too complex.

We have two ZIM readers in JavaScript:

But so far Kiwix-JS is not able to read a ZIM file online, see kiwix/kiwix-js#356

But what I had in mind originally was to provide a server-side service (so not just HTML files) able to read the ZIM files on demand and provide the content via IPFS. This would simplify the publication process by avoiding the data extraction step from the ZIM files. Not sure this is technically possible.

@eminence

eminence commented Sep 9, 2019

Another possibility would be to compile https://github.com/dignifiedquire/zim to WebAssembly for use in a browser

@kelson42
Author

kelson42 commented Sep 9, 2019

@eminence This is basically possible with libzim/libkiwix as well... but it looks like the result is not able to handle files over 4GB :(

@lidel
Member

lidel commented Apr 29, 2020

Hi friends, I've published a draft of a devgrant for adding IPFS support to kiwix-js: ipfs/devgrants#49
Readable version: https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md

It tries to define the steps needed to have kiwix-js read Wikipedia .zim archives from IPFS. Right now we are looking for people with the bandwidth and interest to create a PoC to test the feasibility of that approach. Feel free to comment on ipfs/devgrants#49

@lidel lidel pinned this issue Jan 25, 2021
@lidel
Member

lidel commented Jan 25, 2021

(quick update on the state of things for drive-by readers)

The current process of unpacking ZIM and tweaking HTML on a per-case basis is very wasteful, impossible to automate across different languages, and not sustainable. Every time something breaks, we need to sink a lot of time into fixing the build scripts – if we had allocated that time to a web-based ZIM reader, we would already be there.

I believe our effort should go into putting ZIM archives on IPFS and then reading them from IPFS via web browser (as I elaborated in the original idea draft, plus the research tracked in the past in kiwix/kiwix-js#595 and continued now in kiwix/kiwix-js#659).

@lidel
Member

lidel commented Apr 12, 2021

Related: a low-hanging optimization for future builds may be adding the ZIM file to IPFS using a trickle-dag (ipfs add --trickle), which is optimized for random seeking (the default layout is optimized for reducing link count).

IIUC this should improve the use case where ZIM is read over HTTP range requests (if we talk to a public gateway or use preload servers). Edit: nah, not really. We just need to use ZIMs directly.

tl;dr we need web-based reader for ZIM archives:
https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md

NOTE: the link above is old devgrant, we now can do it better with

@lidel
Member

lidel commented Oct 12, 2023

A lot has changed since we last looked into this. Many new opportunities and protocols exist now that did not before.
Direct use of ZIMs continues in the fresh issue #140

@lidel lidel closed this as completed Oct 12, 2023
@lidel lidel unpinned this issue Oct 12, 2023