Question: Use directly ZIM files? #42

kelson42 · 2017-06-11T08:22:57Z

Would it be possible to use ZIM files directly, without having to unpack them into millions of small files?
Could software like kiwix-serve (or any other ZIM reader) serve content through IPFS?

More information about kiwix-serve:
http://wiki.kiwix.org/wiki/Kiwix-serve

singpolyma · 2017-10-02T16:13:59Z

This could be very useful: with an appropriate client, I think local ZIM files (fetched from IPFS) could be searched, etc.

derhuerst · 2017-10-02T16:18:57Z

cross-post from derhuerst/build-wikipedia-feed#2 (comment) :

Have you looked into getting the images from articles as well? From what I understand Kiwix has distributions where they bundle images in (for example wikipedia_en_all_2016-12.zim is 62GB)... I actually wrote a parser for their format recently https://www.npmjs.com/package/zimmer but haven't finished doing the actual import-to-dat part :P

kelson42 · 2019-09-09T11:28:04Z

@lidel I would be really interested in reviving this ticket. Let me know if you want to discuss this. I am pretty motivated to help this project move forward.

lidel · 2019-09-09T13:11:03Z

Potential path to using ZIM directly

Publishing ZIM on IPFS

Something we could start doing today is publishing .zim files on IPFS. IPFS CIDs could be listed alongside the HTTP and BitTorrent links. IPFS would act as a distributed CDN: files could be accessed via a local node (ipfs-desktop / go-ipfs) or one of the public HTTP gateways.

@kelson42 Is this something Kiwix would be interested in trying to do?
FYI, there are two experimental ways of adding data to IPFS without duplicating it on disk: ipfs-filestore and ipfs-urlstore. They may be useful when introducing IPFS to existing infrastructure.

Web-based reader

This one is a long shot, but it opens exciting possibilities: if a ZIM reader in pure JS existed, a web browser would be the only software required to browse the distributed mirror. The reader would be just a set of static HTML+CSS+JS files published on IPFS along with the ZIM archives, making it a self-contained proposition.

While it would be possible to read ZIM from an HTTP gateway via range requests, a more decentralized option would be to run an embedded JS IPFS node on the page and request specific byte ranges via something like:

ipfs.cat('/ipfs/QmHash/wiki.zim', { offset: x, length: y })
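For the HTTP gateway variant mentioned above, a minimal sketch of how an offset/length pair maps onto an HTTP Range header. The helper names `rangeHeader` and `isZimMagic` are hypothetical, and the magic-byte check reflects my understanding of the openZIM header (first four bytes `5A 49 4D 04`), so treat it as an assumption to verify against the format spec:

```javascript
// Build an HTTP Range header value for `length` bytes starting at `offset`.
// HTTP byte ranges are inclusive on both ends, hence the "- 1".
function rangeHeader(offset, length) {
  if (!Number.isInteger(offset) || offset < 0) {
    throw new RangeError("offset must be a non-negative integer");
  }
  if (!Number.isInteger(length) || length <= 0) {
    throw new RangeError("length must be a positive integer");
  }
  return `bytes=${offset}-${offset + length - 1}`;
}

// Assumption: a ZIM file starts with the bytes 5A 49 4D 04 ("ZIM" + 0x04).
function isZimMagic(bytes) {
  const magic = [0x5a, 0x49, 0x4d, 0x04];
  return bytes.length >= 4 && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical usage (gateway host and CID are placeholders):
// const res = await fetch("https://gateway.example/ipfs/<cid>/wiki.zim", {
//   headers: { Range: rangeHeader(0, 80) },
// });
// const ok = isZimMagic(new Uint8Array(await res.arrayBuffer()));
```

The same offset/length pair could be passed unchanged to the `ipfs.cat` call when an embedded node is used instead of a gateway.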

Question: how feasible is this from the perspective of existing ZIM readers?
Is there any prior art for a JS one, apart from zimmer?

Concerns / Unknowns

I am new to this endeavor, but from quick eyeballing it looks like ZIM is a flat file optimized for random seeks within it. At some level it is similar to files put on IPFS: they get chunked, and the resulting blocks are assembled into balanced trees (Merkle-DAGs) optimized for random access. I am not sure what performance would look like if we put ZIM on IPFS and tried to fetch it over the network, but we can experiment with this.
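To make the random-access point concrete, here is a rough sketch of which leaf chunks a byte-range read would touch, assuming a fixed-size chunker. The 256 KiB chunk size is my recollection of the go-ipfs default and should be treated as an assumption; `chunksForRange` is a hypothetical helper, not part of any IPFS API:

```javascript
// Assumed fixed chunk size (256 KiB); the real value depends on how the file was added.
const DEFAULT_CHUNK_SIZE = 262144;

// Return the indices of the fixed-size chunks that overlap
// the byte range [offset, offset + length).
function chunksForRange(offset, length, chunkSize = DEFAULT_CHUNK_SIZE) {
  if (length <= 0) return [];
  const first = Math.floor(offset / chunkSize);
  const last = Math.floor((offset + length - 1) / chunkSize);
  const indices = [];
  for (let i = first; i <= last; i++) indices.push(i);
  return indices;
}
```

Under these assumptions a small read such as a ZIM header touches a single leaf block, while a read that straddles a chunk boundary needs two, plus whatever intermediate DAG nodes lead to them.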

Data deduplication (dedicated chunker for ZIM?)

A potential problem with using ZIM directly is deduplication. When we put an unpacked mirror on IPFS, a lot of data does not change between snapshots. All media assets such as images, audio files etc. get deduplicated across the entire IPFS swarm (all snapshots and all websites using the same image are cohosting it).

IIUC ZIM does internal compression of Clusters (>1MB) of data, which means each ZIM file is a different stream of bytes, defeating the deduplication provided by IPFS.

My understanding is that good deduplication is not possible unless ZIM Cluster compression is deterministic across snapshots (it always groups the same assets, and compressing the same assets produces exactly the same array of bytes) AND/OR we add ZIM to IPFS using a custom chunker that is aware of its internal structure, enabling deduplication of the same content across snapshots. This could also be a neat demo of what is possible with https://ipld.io
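A rough way to reason about this is to chunk two snapshots and count how many chunks of the newer one already exist in the older one. The sketch below assumes a fixed-size chunker and uses plain SHA-256 hex digests as a stand-in for real CIDs; `chunkHashes` and `sharedChunkRatio` are hypothetical helpers:

```javascript
const { createHash } = require("node:crypto");

// Hash every fixed-size chunk of `bytes` (Buffer or Uint8Array).
function chunkHashes(bytes, chunkSize) {
  const hashes = [];
  for (let i = 0; i < bytes.length; i += chunkSize) {
    const chunk = bytes.subarray(i, i + chunkSize);
    hashes.push(createHash("sha256").update(chunk).digest("hex"));
  }
  return hashes;
}

// Fraction of chunks in `next` whose hash already appears in `prev`.
function sharedChunkRatio(prev, next, chunkSize) {
  const known = new Set(chunkHashes(prev, chunkSize));
  const hashes = chunkHashes(next, chunkSize);
  if (hashes.length === 0) return 0;
  const shared = hashes.filter((h) => known.has(h)).length;
  return shared / hashes.length;
}
```

With a fixed-size chunker, a single inserted byte shifts every later chunk boundary, so the ratio can drop to zero even when almost all content is unchanged. That is the motivation for a chunker that aligns boundaries with ZIM's internal structure.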

Update: created #71 to benchmark the level of deduplication we can get with regular ipfs add + some custom parameters

Please let me know if I missed something here.

kelson42 · 2019-09-09T14:47:21Z

@lidel Distributing ZIM files via IPFS would be interesting, and I would volunteer to do it if the process is not too complex.

We have two ZIM readers in JavaScript:

One binding library: https://github.com/openzim/libzim
One pure JavaScript reader (mostly distributed as an extension for Chrome and Firefox): https://github.com/kiwix/kiwix-js

But so far Kiwix-JS is not able to read a ZIM file online, see kiwix/kiwix-js#356

But what I had in mind originally was to provide a server-side service (so not just HTML files) able to read the ZIM files on demand and provide the content via IPFS. This would simplify the publication process by avoiding the step of extracting data from the ZIM files. I am not sure this is technically possible.

eminence · 2019-09-09T15:38:10Z

Another possibility would be to compile https://github.com/dignifiedquire/zim to WebAssembly for use in a browser

kelson42 · 2019-09-09T15:39:39Z

@eminence This is basically possible with libzim/libkiwix as well... but it looks like the result is not able to handle files over 4GB :(

lidel · 2020-04-29T11:49:49Z

Hi friends, I've published a draft of a devgrant for adding IPFS support to kiwix-js: ipfs/devgrants#49
Readable version: https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md

It tries to define the steps needed for kiwix-js to read Wikipedia .zim archives from IPFS. Right now we are looking for people with the bandwidth and interest to create a PoC to test the feasibility of that approach. Feel free to comment on ipfs/devgrants#49

lidel · 2021-01-25T17:28:35Z

(quick update on the state of things for drive-by readers)

The current process of unpacking ZIM and tweaking HTML on a per-case basis is very, very wasteful, impossible to automate across different languages, and not sustainable. Every time something breaks, we need to sink a lot of time into fixing the build scripts. If we had put that time into a web-based ZIM reader, we would already be there.

I believe our effort should go into putting ZIM files on IPFS and then reading them from IPFS via a web browser (as I elaborated in the original idea draft; see also the latest research, previously tracked in kiwix/kiwix-js#595 and now continued in kiwix/kiwix-js#659).

lidel · 2021-04-12T22:47:51Z

Related: a low-hanging optimization for future builds may be adding the ZIM file to IPFS using trickle-dag (ipfs add --trickle), which is optimized for random seeking (the default is optimized for reduction of link count).

~~IIUC this should improve use case when ZIM is read over HTTP range requests (if we talk to a public gateway or use preload servers).~~ Nah, not really. We just need to use ZIMs directly.

tl;dr we need a web-based reader for ZIM archives:
https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md

NOTE: the link above is an old devgrant; we can now do it better with modern verifiable block/CAR responses on gateways and IPIP-402 (partial CARs containing only the blocks for specific byte ranges).
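As a rough illustration of the partial CAR idea, the sketch below builds a gateway URL asking for only the blocks that cover a byte range. The parameter names `format=car`, `dag-scope=entity` and `entity-bytes=from:to` (with an inclusive `to`) reflect my reading of IPIP-402 and should be checked against the spec; `partialCarUrl`, the gateway host and the CID are placeholders:

```javascript
// Build a trustless-gateway URL requesting a CAR with only the blocks
// needed to verify `length` bytes starting at `offset`.
// Assumption: entity-bytes takes inclusive "from:to" byte offsets.
function partialCarUrl(gateway, cid, path, offset, length) {
  if (offset < 0 || length <= 0) {
    throw new RangeError("offset must be >= 0 and length must be > 0");
  }
  const last = offset + length - 1;
  return (
    `${gateway}/ipfs/${cid}${path}` +
    `?format=car&dag-scope=entity&entity-bytes=${offset}:${last}`
  );
}

// Hypothetical usage:
// partialCarUrl("https://gateway.example", "<cid>", "/wiki.zim", 0, 80)
```

A reader would still need to parse the returned CAR and verify each block against its CID before trusting the bytes; that part is out of scope for this sketch.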

lidel · 2023-10-12T19:27:01Z

A lot has changed since we last looked into this. Many new opportunities and protocols exist now that did not before.
Direct use of ZIMs continues in fresh issue #140

kelson42 mentioned this issue Jun 11, 2017

Hackathon Wikimania 2017 #41

Closed

lidel mentioned this issue Jan 21, 2020

ZIM mirror at download.wikipedia-on-ipfs.org #69

Open

4 tasks

lidel mentioned this issue Apr 16, 2020

ZIM on IPFS: maximizing deduplication #71

Open

lidel mentioned this issue Apr 23, 2020

kiwix-js + IPFS = unstoppable wikipedia mirror ipfs/devgrants#49

Closed

3 tasks

lidel pinned this issue Jan 25, 2021

This was referenced Jan 25, 2021

Update en.wikipedia-on-ipfs.org #61

Closed

Add Chinese Version distributed wikipedia mirror. #74

Closed

Improve preload mechanism to support random access via ipfs.cat ipfs/js-ipfs#3510

Closed

This was referenced Feb 15, 2021

Update tr.wikipedia-on-ipfs.org #60

Closed

Rendering pages from XML sources instead of kiwix HTML dumps #9

Closed

Set up collaborative pinning clusters #68

Closed

lidel mentioned this issue Aug 18, 2021

Wikipedia Mirror for Afghanistan #100

Closed

lidel added the help wanted label Nov 29, 2021

lidel mentioned this issue Feb 22, 2023

Minify output files #117

Closed

lidel closed this as completed Oct 12, 2023

lidel unpinned this issue Oct 12, 2023
