Fix mirror preparation script #70
Note this is using kanej/zim until the jpeg fix has been merged upstream.
The node script currently modifies files in the out folder in place, fixing links and appending a footer.
Mainly to get a progress bar
…nder article
In Turkish Wikipedia, articles can be nested 3 folders deep. We now walk the directory tree looking for html files, and process each one.
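A minimal sketch of that kind of directory walk, assuming a synchronous recursive traversal is acceptable. The function name `findHtmlFiles` is hypothetical and not taken from the PR; the actual script may walk the tree differently.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect every .html file under `dir`, however deeply nested.
// Hypothetical helper for illustration, not the PR's actual code.
function findHtmlFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into sub-folders (Turkish articles can sit 3 levels down).
      found.push(...findHtmlFiles(fullPath));
    } else if (entry.isFile() && entry.name.endsWith('.html')) {
      found.push(fullPath);
    }
  }
  return found;
}
```

Each returned path can then be handed to whatever per-file processing step follows.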
…ect rather than kiwix
…titue it in based on version
…ter #) Fixes a bug where Justice#Another was being changed to Justice#Another.html rather than Justice.html#Another.
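The fix described here can be sketched as follows. This is an illustration under the assumption that links are rewritten as plain strings; `addHtmlExtension` is a hypothetical name, not the function used in the PR.

```javascript
// Append ".html" to the page part of a relative wiki link, keeping any
// "#fragment" after the extension. Hypothetical helper for illustration.
function addHtmlExtension(href) {
  const hashIndex = href.indexOf('#');
  const page = hashIndex === -1 ? href : href.slice(0, hashIndex);
  const fragment = hashIndex === -1 ? '' : href.slice(hashIndex);
  // Leave pure in-page anchors ("#Another") and already-suffixed links alone.
  if (page === '' || page.endsWith('.html')) {
    return href;
  }
  return page + '.html' + fragment;
}
```

With this split, `Justice#Another` becomes `Justice.html#Another` rather than `Justice#Another.html`.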
The complete url is now passed through from the command line.
…omepage used
The Turkish Wikipedia homepage referred to in the zim file appears to be a version set to the user's page. You can override it with the `--mainpageversion` flag.
…kipedia-on-ipfs` logo
When deployed to ipfs the src and href attributes were one directory too high.
@lidel is there text for the takedown policy / contact information? Currently the footer has a placeholder: |
@kanej This is amazing! ❤️ Quick feedback / questions:
Let me know your thoughts 🚀 |
…cessing
Defaults to 6 threads, but can be overridden with the `--numberofworkerthreads=12` flag. Sped up the Turkish Wikipedia snapshot from 30 mins to 10 mins on a Mac mini (2018).
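A rough sketch of how the flag and its default might be handled, and how a file list could be shared among workers. Only the flag name and the default of 6 come from the commit message; `parseWorkerCount`, `partition`, and the round-robin split are assumptions for illustration, and the real script may distribute work differently.

```javascript
const DEFAULT_WORKER_THREADS = 6;

// Read "--numberofworkerthreads=N" from an argv-style array, falling back
// to the default when the flag is absent or not a positive integer.
function parseWorkerCount(argv) {
  const prefix = '--numberofworkerthreads=';
  const flag = argv.find((arg) => arg.startsWith(prefix));
  if (flag === undefined) return DEFAULT_WORKER_THREADS;
  const value = Number.parseInt(flag.slice(prefix.length), 10);
  return Number.isInteger(value) && value > 0 ? value : DEFAULT_WORKER_THREADS;
}

// Deal files out round-robin so each worker gets a similar share.
// (Assumed strategy; a shared work queue would also do the job.)
function partition(files, workerCount) {
  const batches = Array.from({ length: workerCount }, () => []);
  files.forEach((file, i) => batches[i % workerCount].push(file));
  return batches;
}
```

Each batch could then be passed to one worker thread, for example `partition(htmlFiles, parseWorkerCount(process.argv))`.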
Included in eslint config as well.
If serving from ipfs.io, show https://ipfs.io/legal as the link; otherwise point the user at domaintools.
@lidel thanks for the feedback! Taking the points in turn:
1. Preferring not to touch every article due to build speed
Ah ... good point, I mean if practicality is your thing. Yesterday, before I read your comment, I implemented a thread worker pool, which does speed things up, but I take the point about wanting to minimise the intervention on the unpacked directory. I think I just assumed that
2. Legal notice
I have pushed a commit with the approach you suggested.
Demo: https://ipfs.io/ipfs/bafybeicxrdsibx4mwqwmavsocdww5u5k5duqbwtgtiy6oz26sci4thd7yy/wiki/Main_Page.html
On ipfs.io you will see:
Everywhere else you will get a domaintools link:
I would be tempted to leave out the second one entirely, putting in the guard only for ipfs.io, but I will leave that as a call for someone else; right now it is in. Any text changes let me know (the text for both are in
3. Turkish homepage version number weirdness
Agreed.
4. Using local already downloaded zim
Sorry, this is poor communication in the README.md, or none in the case of the parameter. The script flag
You specify the unpacked zim directory to be processed by the node script as the first argument.
Other
The first task on the todo list is to eliminate js errors on the page. After fixing up the links from pages to js, the errors are still these. I ran
I think this is an upstream problem, maybe 'mwoffliner'? We could remove links to the non-working files to clean up the console, but that would be a per article thing. Put another way, can we take getting rid of those js errors off the todo list? I will concede to removing any js errors that I introduce.
Reusing .js injection instead of rewriting each .html
👍 Make sure we pick a JS file that is always present (it would be good to check a few ZIMs just to be sure). Also see my notes on JS errors at the end of this comment.
zimfiledownloadurl
Perhaps just rename it to
Legal notice
I know this part is tedious, but we really need to keep it and show something on third party gateways, just to ensure the IPFS project does not need to deal with bogus takedowns aimed at other people's gateways :-) That being said, there is no point in showing it on localhost; perhaps you could update the logic to do something like this?
if (window.location.hostname === 'ipfs.io' || window.location.hostname.indexOf('dweb.link') !== -1) {
  // show link to https://ipfs.io/legal/
} else if (['localhost', '127.0.0.1', '[::1]' ].indexOf(window.location.hostname) !== -1 || window.location.protocol.indexOf('http') !== -1) {
  // This content is provided by a third party IPFS gateway.
  // To report copyright infringement please contact the owner of ${window.location.hostname}
}
JS errors in console
I agree, this looks like an upstream issue. I filed openzim/mwoffliner#1034, let's wait for a response there. If JS errors are blocking your work on injecting the footer, see if you could use one of the earlier scripts. If that is not possible, check if
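One possible reading of the legal-notice suggestion, written as a pure function so it can be exercised outside a browser. Note that the condition as quoted would also take the third-party branch on localhost; this sketch assumes the intent was to show nothing there, and it leaves out the protocol check. The function name and return values are hypothetical.

```javascript
const LOCAL_HOSTS = ['localhost', '127.0.0.1', '[::1]'];

// Decide which notice to show for a given hostname.
// Returns 'ipfs-legal', 'third-party-gateway', or null for no notice.
function legalNoticeFor(hostname) {
  if (hostname === 'ipfs.io' || hostname.indexOf('dweb.link') !== -1) {
    return 'ipfs-legal'; // link to https://ipfs.io/legal/
  }
  if (LOCAL_HOSTS.indexOf(hostname) !== -1) {
    return null; // assumption: no notice needed on a local node
  }
  // Any other host is treated as a third party gateway.
  return 'third-party-gateway';
}
```

In the page script this would be called as `legalNoticeFor(window.location.hostname)`.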
Thank you for super quick responses @kanej :) |
Thank you! I tested it with a small ZIM and build was pretty smooth 👍
I will continue testing, but initial feedback below.
Co-Authored-By: Marcin Rataj <[email protected]>
… link and snapshot link
To match other pages, the canonical link does not have a version number and the snapshot link does.
Hi @lidel,
I have separated out the canonical and snapshot links, and grouped everything into one badge:
I have also added some basic styling to support mobile devices (it looks like mediawiki has updated to be responsive since the last snapshot).
Turkish Wikipedia
./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W 19869765
Demo: https://ipfs.io/ipfs/bafybeiawhknrpld5u6e5mdduklbi6yuuhzee7txeiezml4eopt5nrzfkxy
English Wikiquote
./mirrorzim.sh en wikiquote
Demo: https://ipfs.io/ipfs/bafybeia22ol6qgent3ul37ilvkg6gji25qo3sjvxqus2kwouix7jzelniu
…ignifiedquire/zim
@kanej Hi, thank you for addressing feedback, we are pretty close to the state in which we can merge this 🙌
I am running a local build of Turkish wiki today (will add more feedback when it's done), but for now some quick notes:
- Main page from your Turkish example has some broken links starting with /w/index.php that produce HTTP 404:
  <a href="/w/index.php.html?title=Anasayfa/Karde%C5%9F_projeler&action=edit&redlink=1" class="new" title="Anasayfa/Kardeş projeler (sayfa mevcut değil)">Kardeş projeler</a>
  Those seem to be dynamic in nature, so I think we should just fix those with JS, so they point at the working script at the original instance:
  /w/index.php.html? → https://tr.wikipedia.org/w/index.php?
- Turkish Main page continues to point at resources at upload.wikimedia.org/wikipedia/commons/ in the srcset attribute:
  <img alt="HILLBLU w.png" src="../I/m/19px-HILLBLU_w.png" decoding="async" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/HILLBLU_w.png/29px-HILLBLU_w.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/HILLBLU_w.png/38px-HILLBLU_w.png 2x" data-file-width="149" data-file-height="149" width="19" height="19">
  I think we should either do for srcset the same as is already done for src, or remove the attribute.
- mirrorzim.sh: it is impossible to pass custom OID without passing IPNS.
  Switching parameter passing to explicit key:val pairs (--ipns <val> etc) would make the script much more useful, as we may have sites which use only one of them.
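The link rewrite proposed in the first note could look roughly like this. It is only a sketch of the suggestion: `fixIndexPhpLink` and the `wikiOrigin` parameter are hypothetical, and the PR may handle these links differently.

```javascript
// Point unpacked "/w/index.php.html?..." links back at the live wiki.
// `wikiOrigin` is an assumed parameter, e.g. "https://tr.wikipedia.org".
function fixIndexPhpLink(href, wikiOrigin) {
  const brokenPrefix = '/w/index.php.html?';
  if (!href.startsWith(brokenPrefix)) {
    return href; // not one of the dynamic links, leave untouched
  }
  return wikiOrigin + '/w/index.php?' + href.slice(brokenPrefix.length);
}
```

In a page script this would be applied to the `href` attribute of each anchor, keeping the query string intact.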
@kanej Ok, I did a local build with wikipedia_tr_all_maxi_2019-12.zim via ./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W 19869765 and it went fine, apart from a few 404s on the Main page, but that may be due to the Main page OLDID being out of sync with the snapshot, so I ignored errors on the Main page for now and focused on regular articles.
(I tested the Turkish build with the original main page (./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org): bafybeieoya74422ovlmx23i5bxpuw2szsdrhsjwenfxkqoknw34jigcoua. Without search this landing page may work a bit better for now, as it encourages exploration instead of a static set of articles.)
Remaining issues:
1. missing images
  1.1. /wiki/Amerika_Birle%C5%9Fik_Devletleri_Hava_Kuvvetleri.html is unable to find /I/m/Military_service_mark_of_the_United_States_Air_Force.svg.png (double extension looks suspicious, perhaps SVG requires custom handling?)
  1.2. /wiki/Lockheed_Martin_F-22_Raptor.html (at the bottom) fails to load /I/m/F-22_underside.jpeg (JS did not fix it to .jpg for some reason)
  1.3. /wiki/Britanya_%C4%B0mparatorlu%C4%9Fu.html fails to load /I/m/British_colonies_1763-76_shepherd1923.PNG
2. broken links
  2.1. /wiki/Ha%C3%A7l%C4%B1_Seferleri.html links to /wiki/VII._Gregorius.html but the latter produces an error, probably due to the double . in the filename?
    - same problem with /wiki/I._D%C3%BCnya_Sava%C5%9F%C4%B1.html at /wiki/1920.html
3. other
  3.1. /wiki/ is a broken page (it should redirect to Main page instead, just like / - reuse /index.html?)
This is no longer needed as we append js to do the equivalent.
The srcset attribute was pointing at non-local images.
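For illustration, removing `srcset` from image tags in an HTML string might look like the sketch below. It uses a regular expression for brevity, which is a simplification that assumes quoted attribute values without `>` inside them; `stripSrcset` is a hypothetical name, and the PR may instead do this in the browser or with an HTML parser.

```javascript
// Drop srcset attributes from <img> tags in an HTML string so the page
// falls back to the local file referenced by src.
function stripSrcset(html) {
  return html.replace(/<img\b[^>]*>/gi, (tag) =>
    // Requires whitespace before the name, so "data-srcset" is not touched.
    tag.replace(/\s+srcset\s*=\s*("[^"]*"|'[^']*')/gi, '')
  );
}
```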
Usage updated in readme and help section of the script.
Updates the site.js to fix casing issues in jpg and png links.
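A guess at what such a fix might involve, based on the `.jpeg` and `.PNG` failures reported in the review. It assumes the unpacked files use lower-case `.jpg` and `.png` extensions, which this thread does not confirm; `normaliseImageExtension` is a hypothetical name.

```javascript
// Lower-case a trailing image extension and map ".jpeg" to ".jpg".
// Assumption: files on disk are named with lower-case .jpg / .png.
function normaliseImageExtension(src) {
  return src.replace(/\.(jpe?g|png)$/i, (match, ext) => {
    const lower = ext.toLowerCase();
    return '.' + (lower === 'jpeg' ? 'jpg' : lower);
  });
}
```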
Hi @lidel,
Turkish Wikipedia
./mirrorzim.sh \
  --languagecode=tr \
  --wikitype=wikipedia \
  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \
  --hostingipnshash=QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W \
  --mainpageversion=19869765
Demo: https://ipfs.io/ipfs/bafybeihtoa6mcsgkotg3rd3xlym7pulcj7hifv4begsajfqad7bndqbpq4
Issues
Main page from your Turkish example has some broken links starting with /w/index.php
I removed the categories section of the home page; that is where these links were, and it was broken with the kiwix js problems.
Turkish Main page continues to point at resources at upload.wikimedia.org/wikipedia/commons/ in srcset
srcset is now removed.
mirrorzim.sh: it is impossible to pass custom OID without passing IPNS
I have switched to key value pair args as you suggest. The README.md gives the Turkish wikipedia example.
missing images
1. /wiki/Amerika_Birle%C5%9Fik_Devletleri_Hava_Kuvvetleri.html
This image is missing in the zim file. Kiwix-serve doesn't have it either.
2. /wiki/Lockheed_Martin_F-22_Raptor.html
Fixed
3. /wiki/Britanya_%C4%B0mparatorlu%C4%9Fu.html
Fixed
broken links
/wiki/VII._Gregorius.html is not being extracted from the zim file, but it is in there (it appears in kiwix-serve). I think this is one for the rust extract zim program.
/wiki/ is a broken page
As suggested this now redirects to the homepage e.g.
Thanks! Probably good to merge already, but I'll run this again against bigger snapshots and report back, just in case we missed something. It may take some time, thank you for your patience.
Apologies for the radio silence! I hoped to test this against the latest version of English wiki, but I had a sync call yesterday and the Kiwix project is still struggling to produce one; it won't be available for at least the next two weeks (possibly more, it takes 6 days to generate, and the last build failed at 99%, ouch). So instead, to unblock this, I am testing this PR against the newly published Turkish one.
TL;DR I am pretty confident we can merge this, unblock the bounty and produce a new Turkish snapshot this week. I just need to double check the Turkish build with main page override. Details below.
Quick feedback:
@lidel thank you for the update! +1 on merging this, fixing the version issue would be good but I don't think it's a blocker @kanej would you be open to using gitcoin to get payment for this bounty (in DAI)? we're looking into trying gitcoin as a bounty platform and I would like to manually run one through to get a feel for the system |
@parkan Happy to go the gitcoin route. |
I did one last pass and believe the current state of this PR is meeting Acceptance Criteria of the bounty listed in #64, namely:
Once again, thank you for your patience @kanej and thank you @parkan for helping with finalizing this bounty.
Demo: wikipedia_tr_all_maxi_2020-04
I've built an updated Turkish snapshot from
Digression on what this PR means for the distributed mirror project
Note that the goal here was to produce an updated snapshot in best-effort fashion by fixing blockers listed in #64. Rough edges in the process of producing a mirror from an unpacked ZIM remain because ZIMs were never designed to be used in unpacked form and existing tooling is lacking. There will be broken links, but I don't believe investing more time in fixing them makes sense: we are approaching the space of diminishing returns and the time will be better spent if we look into alternative ways of mirroring Wikipedia, decreasing the complexity of this project. Expect updates related to research into putting ZIMs on IPFS and reading them without the need for unpacking. Ideally, with a regular web browser.
Work in progress, do not merge yet
This PR replaces the execute-changes.sh script with a nodejs script that works with the updated Kiwix Wikipedia format changes.
The previous execute-changes.sh script post processed the unpacked zim Wikipedia directory to make it a usable website and append attribution information. The node script has taken over both these functions.
This PR is trying to solve the issues summarised in #64.
TODO
A more detailed version of the todo list is in the linked issue: #64
Status
Demo: https://ipfs.io/ipfs/bafybeick7xx6s6mxnstco7z4t4lzjqldesgidm6afbpllqm23xivqqmrgi
The replacement node version of execute-changes.sh produces a website with appended footers and a version of the original Main page (rather than the kiwix version). The README.md has been updated to reflect the new steps.
To run through the Turkish snapshot:
Issues
https://en.wikipedia.org/wiki/Aquinas redirects to https://en.wikipedia.org/wiki/Thomas_Aquinas), this may be an option on extract_zim
Notes
getzim.sh are now downloaded to the ./snapshots folder
oldid method resulted in a version of main page that had been overridden with the Kiwix version e.g. https://tr.wikipedia.org/wiki/Anasayfa?oldid=21118304. I am not sure what is going on here.