
Fix mirror preparation script #70

Merged — 58 commits merged into ipfs:master on Apr 9, 2020

Conversation

@kanej (Member) commented Feb 27, 2020

closes #46, #64

Work in progress, do not merge yet

This PR replaces the execute-changes.sh script with a Node.js script that works with the updated Kiwix Wikipedia format.

The previous execute-changes.sh script post-processed the unpacked ZIM Wikipedia directory to make it a usable website and to append attribution information. The Node script has taken over both of these functions.

This PR is trying to solve the issues summarised in #64.

TODO

A more detailed version of the todo list is in the linked issue: #64

  • Ensure there are no JS errors when pages are loaded
  • Make it possible to navigate to other articles
  • Custom footer needs to be appended to every page
  • Update footer contents
    • add link to article snapshot at original Wikipedia
    • add link to the source .zim file
    • remove logos/buttons of centralized services
    • include information on takedown policy / contact (e.g. if the latest snapshot includes information removed from upstream Wikipedia)
  • Restore original Main Page

Status

Demo: https://ipfs.io/ipfs/bafybeick7xx6s6mxnstco7z4t4lzjqldesgidm6afbpllqm23xivqqmrgi

The replacement Node version of execute-changes.sh produces a website with appended footers and a version of the original Main page (rather than the Kiwix version). The README.md has been updated to reflect the new steps.

To run through the Turkish snapshot:

# Install the dependencies, this will build the extract_zim rust utility as well
yarn

# Pull down the latest Turkish snapshot and put it in ./snapshots
bash ./getzim.sh download wikipedia wikipedia tr all maxi latest

# Run the extract_zim unpack utility
./extract_zim/extract_zim --skip-link ./snapshots/wikipedia_tr_all_maxi_2019-12.zim  --out ./tmp

# Run the replacement execute-changes script
node ./bin/run ./tmp \
  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \
  --hostingipnshash=QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W \
  --zimfiledownloadurl=https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_maxi_2019-12.zim \
  --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \
  --mainpage=Anasayfa.html \
  --mainpageversion=19869765

# Push the converted website into IPFS (see README.md for setting up sharding + badgerds)
ipfs add -r --cid-version 1 --offline ./tmp

Issues

  1. There is no search bar or function
  2. Redirects are not in place (https://en.wikipedia.org/wiki/Aquinas redirects to https://en.wikipedia.org/wiki/Thomas_Aquinas), this may be an option on extract_zim
  3. There are no category pages (not sure if this is an existing issue)

Notes

  • Snapshots through getzim.sh are now downloaded to the ./snapshots folder
  • The Node app processes all the articles; it is currently single-threaded and slow (30 mins for Turkish Wikipedia on a Mac mini)
  • The extract_zim used is a fork with a bug fix around JPEG file extensions; this will be swapped back once merged upstream
  • A Dockerfile attempts to capture the build tools and requirements
  • An override on the version of the main page had to be provided for Turkish Wikipedia; the oldid method resulted in a version of the main page that had been overridden with the Kiwix version, e.g. https://tr.wikipedia.org/wiki/Anasayfa?oldid=21118304. I am not sure what is going on here.

Note this is using kanej/zim until the JPEG fix has been merged upstream.
The Node script currently modifies files in the out folder in place, fixing links and appending a
footer.
…nder article

In the Turkish Wikipedia, articles can be nested 3 folders deep. We now walk the directory tree looking
for HTML files, and process each one.
…ter #)

Fixes a bug where Justice#Another was being changed to Justice#Another.html rather than
Justice.html#Another.
The complete URL is now passed through from the command line.
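The fragment bug above boils down to appending `.html` before the `#`, not after it; a hypothetical helper illustrating the fix (not the script's actual code):

```javascript
// Insert ".html" before any "#fragment" so the fragment survives,
// e.g. "Justice#Another" becomes "Justice.html#Another".
function rewriteLink (href) {
  const hashIndex = href.indexOf('#')
  if (hashIndex === -1) return href + '.html'
  return href.slice(0, hashIndex) + '.html' + href.slice(hashIndex)
}
```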
…omepage used

The Turkish Wikipedia homepage referred to in the ZIM file appears to be a version set to a user's
page. You can override it with the `--mainpageversion` flag.
When deployed to IPFS the src and href attributes were one directory too high.
@kanej (Member, Author) commented Feb 27, 2020

@lidel is there text for the takedown policy / contact information? Currently the footer has a placeholder:

[screenshot: footer with placeholder text]

@lidel (Member) commented Feb 27, 2020

@kanej This is amazing! ❤️

Quick feedback / questions:

  1. Did the HTML in ZIMs change so much that we need to modify every article,
    or is one of the requirements in [BOUNTY] Fix script responsible for preparing IPFS mirror #64 forcing you to do so?

    IIUC you add footer HTML to every article. It takes 30 minutes for Turkish, but will be extremely expensive for bigger wikis, such as the English one. Our changes to the unpacked snapshot should be surgical and minimal if possible.

    Perhaps we could continue doing what we did before: modify JS already present in the ZIM and append code responsible for dynamically adding the footer on the fly? Some JS scripts are already loaded by the existing HTML (e.g. /-/j/js_modules/script.js), so if we go that route there is no need to touch every article, reducing build time significantly.

  2. Linking to the takedown policy (https://ipfs.io/legal/) is tricky, because content can be loaded from canonical gateways at ipfs.io (path) and dweb.link (subdomain), but also from localhost or someone else's public gateway.

    The Distributed Wikipedia Project wants to avoid a situation where the IPFS project's legal team gets takedown requests aimed at someone else's HTTP gateway.

    Perhaps something like this would be good enough?

    if (window.location.hostname === 'ipfs.io' || window.location.hostname.indexOf('dweb.link') !== -1) {
      // show link to https://ipfs.io/legal/
    } else {
      // show link to https://whois.domaintools.com/${window.location.hostname} (probably best we can do)
    }
  3. https://tr.wikipedia.org/wiki/Anasayfa?oldid=21118304 is really odd; it seems to be a human error (perhaps someone edited the upstream Anasayfa article and then it got fixed by moving it to https://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1:The_other_Kiwix_guy/Landing?). I think it is ok to keep the manual override for now, until we test automatic mode on some other language. If the issue is not specific to that one snapshot, we will reach out to Kiwix for help.

  4. There should be a way to use a local, already downloaded ZIM instead of fetching it every time via zimfiledownloadurl.
    Initial idea: change it to zimfilelocation, and if the value starts with http then do what it does now, but if it does not start with http, expect the file to be at the provided path (one can expose a local file via Docker's -v or --mount).
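The zimfilelocation idea from point 4 could be sketched as follows (names and the return shape are illustrative, not the project's actual API):

```javascript
// Treat the value as a remote URL if it starts with "http",
// otherwise as a path to an already-downloaded local file.
function resolveZimSource (zimfilelocation) {
  if (zimfilelocation.startsWith('http')) {
    return { kind: 'download', url: zimfilelocation }
  }
  return { kind: 'local', path: zimfilelocation }
}
```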

Let me know your thoughts 🚀

…cessing

Defaults to 6 threads, but can be overridden with the `--numberofworkerthreads=12` flag. Sped up
the Turkish Wikipedia snapshot from 30 mins to 10 mins on a Mac mini (2018).
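One way such a worker pool can distribute work is by splitting the article list into one batch per thread; a simplified sketch of the partitioning step (the actual script's scheduling may differ):

```javascript
// Split the article list into roughly equal batches, one per worker
// thread; batch count comes from the --numberofworkerthreads flag
// (default 6 per the note above).
function splitIntoBatches (items, numberOfWorkers) {
  const batches = Array.from({ length: numberOfWorkers }, () => [])
  items.forEach((item, i) => batches[i % numberOfWorkers].push(item))
  return batches
}
```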
Included in eslint config as well.
If serving from ipfs.io, show https://ipfs.io/legal as the link; otherwise point the user at
domaintools.
@kanej (Member, Author) commented Feb 28, 2020

@lidel thanks for the feedback!

Taking the points in turn:

1. Preferring not to touch every article due to build speed

Ah ... good point, I mean, if practicality is your thing. Yesterday, before I read your comment, I implemented a thread worker pool, which does speed things up, but I take the point about wanting to minimise the intervention on the unpacked directory. I think I just assumed that execute-changes.sh was touching each article, as the unpacked files seemed to have broken links. I will take a look this afternoon at bringing back the body.js approach (though body.js is no longer a referenced JS file, so appending to site.js might be best).

2. Legal notice

I have pushed a commit with the approach you suggested. Demo:

https://ipfs.io/ipfs/bafybeicxrdsibx4mwqwmavsocdww5u5k5duqbwtgtiy6oz26sci4thd7yy/wiki/Main_Page.html

On ipfs.io you will see:

[screenshot: footer with ipfs.io/legal link]

Everywhere else you will get a domaintools link:

[screenshot: footer with domaintools link]

I would be tempted to leave out the second one entirely, putting in the guard only for ipfs.io, but I will leave that as a call for someone else; right now it is in.

Let me know of any text changes (the text for both is in ./src/footer_fragment.handlebars).

3. Turkish homepage version number weirdness

Agreed.

4. Using local already downloaded zim

Sorry, this is poorly communicated in the README.md, and not at all in the case of the parameter. The script flag --zimfiledownloadurl is for substitution into the footer; it doesn't download anything, it just gets added to the page as the ZIM link. For downloading, the getzim choose command is still used, though it now outputs to the ./snapshots folder. If you rerun getzim it will not re-download an existing file; it will just verify it.

You specify the unpacked zim directory to be processed by the node script as the first argument.

Other

The first task on the todo list is to eliminate JS errors on the page. After fixing up the links from pages to JS, the errors remain. I ran kiwix-serve, the Kiwix project's HTTP server for ZIM files, and the same set of errors appears there (including the malformed jsConfigVars.js file you mentioned).

[screenshot: browser console showing JS errors]

I think this is an upstream problem, maybe in mwoffliner? We could remove links to the non-working files to clean up the console, but that would be a per-article change.

Put another way, can we take getting rid of those JS errors off the todo list? I will concede to removing any JS errors that I introduce.

@lidel (Member) commented Feb 28, 2020

Reusing .js injection instead of rewriting each .html

I will take a look this afternoon at bringing back the body.js approach (though body.js is no longer a referenced js file, so an append to site.js might be the best).

👍

Make sure we pick a JS file that is always present (it would be good to check a few ZIMs just to be sure). Also see my notes on the JS errors at the end of this comment.

zimfiledownloadurl

Perhaps just rename it to zimfilesourceurl ? :)

Legal notice

I know this part is tedious, but we really need to keep it and show something on third-party gateways, just to ensure the IPFS project does not need to deal with bogus takedowns aimed at other people's gateways :-)

That being said, there is no point in showing it on localhost; perhaps you could update the logic to do something like this?
It would not show the copyright notice on localhost gateways or when a non-HTTP protocol is used (future-proofing, for the time when we have native support in some browsers):

if (window.location.hostname === 'ipfs.io' || window.location.hostname.indexOf('dweb.link') !== -1) {
  // show link to https://ipfs.io/legal/
} else if (['localhost', '127.0.0.1', '[::1]'].indexOf(window.location.hostname) === -1 && window.location.protocol.indexOf('http') === 0) {
  // This content is provided by a third party IPFS gateway.
  // To report copyright infringement please contact the owner of ${window.location.hostname}
}

JS errors in console

I agree, this looks like an upstream issue – filed openzim/mwoffliner#1034, let's wait for a response there.

If the JS errors are blocking your work on injecting the footer, see if you could use one of the earlier scripts. If that is not possible, check if the jsConfigVars.js contents are "(" and replace it with some stub that fixes the errors. Hopefully that rabbit hole won't be too deep 🤞

@lidel (Member) commented Mar 5, 2020

Thank you for the super quick responses @kanej :)
The hour is a bit late for me today to do this justice, but I will do my best to block some time and review on Friday, worst case Monday.

@lidel (Member) left a review:

Thank you! I tested it with a small ZIM and the build was pretty smooth 👍
I will continue testing, but initial feedback is below.

Review comments on src/templates/footer_fragment.handlebars and extract_zim/Makefile (outdated, resolved)
@kanej (Member, Author) commented Mar 10, 2020

Hi @lidel

I have separated out the canonical and snapshot links, and grouped everything into one badge:

[screenshot: footer links grouped into one badge]

I have also added some basic styling to support mobile devices (it looks like MediaWiki has been updated to be responsive since the last snapshot).

Turkish Wikipedia

./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W 19869765

Demo: https://ipfs.io/ipfs/bafybeiawhknrpld5u6e5mdduklbi6yuuhzee7txeiezml4eopt5nrzfkxy

English Wikiquote

./mirrorzim.sh en wikiquote

Demo: https://ipfs.io/ipfs/bafybeia22ol6qgent3ul37ilvkg6gji25qo3sjvxqus2kwouix7jzelniu

@lidel (Member) left a review:

@kanej Hi, thank you for addressing the feedback, we are pretty close to the state in which we can merge this 🙌

I am running a local build of the Turkish wiki today (will add more feedback when it's done), but for now some quick notes:

  1. The Main page from your Turkish example has some broken links starting with /w/index.php that produce HTTP 404:

    <a href="/w/index.php.html?title=Anasayfa/Karde%C5%9F_projeler&amp;action=edit&amp;redlink=1" class="new" title="Anasayfa/Kardeş projeler (sayfa mevcut değil)">Kardeş projeler</a>

    Those seem to be dynamic in nature, so I think we should just fix them with JS, so they point at the working script at the original instance:
    /w/index.php.html? → https://tr.wikipedia.org/w/index.php?

  2. Turkish Main page continues to point at resources at upload.wikimedia.org/wikipedia/commons/ in srcset attribute:

    <img alt="HILLBLU w.png" src="../I/m/19px-HILLBLU_w.png" decoding="async" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/HILLBLU_w.png/29px-HILLBLU_w.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/HILLBLU_w.png/38px-HILLBLU_w.png 2x" data-file-width="149" data-file-height="149" width="19" height="19"> 

    I think we should either do for srcset the same as is already done for src, or remove the attribute.

  3. mirrorzim.sh: it is impossible to pass a custom oldid without passing the IPNS hash.
    Switching parameter passing to explicit key:val pairs (--ipns <val> etc.) would make the script much more useful, as we may have sites which use only one of them.
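The srcset removal suggested in point 2 could be done with a simple string replacement during HTML processing; a regex-based sketch (not the project's actual code):

```javascript
// Strip srcset attributes so images never fall back to loading from
// upload.wikimedia.org; the local src attribute is left untouched.
function removeSrcset (html) {
  return html.replace(/\s+srcset="[^"]*"/g, '')
}
```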

Review comments on src/process-article.ts and src/utils/download-file.ts (outdated, resolved)
@lidel (Member) left a review:

@kanej Ok, I did a local build with wikipedia_tr_all_maxi_2019-12.zim
via ./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W 19869765 and it went fine – apart from a few 404s on the Main page, but that may be due to the Main page oldid being out of sync with the snapshot, so I ignored errors on the Main page for now and focused on regular articles.

(I tested a Turkish build with the original main page (./mirrorzim.sh tr wikipedia tr.wikipedia-on-ipfs.org): bafybeieoya74422ovlmx23i5bxpuw2szsdrhsjwenfxkqoknw34jigcoua – without search this landing page may work a bit better for now, as it encourages exploration instead of a static set of articles.)

Remaining issues:

  1. missing images
    1.1. /wiki/Amerika_Birle%C5%9Fik_Devletleri_Hava_Kuvvetleri.html is unable to find /I/m/Military_service_mark_of_the_United_States_Air_Force.svg.png (double extension looks suspicious – perhaps SVG requires custom handling?)
    1.2. /wiki/Lockheed_Martin_F-22_Raptor.html (at the bottom) fails to load /I/m/F-22_underside.jpeg (JS did not fix it to .jpg for some reason)
    1.3. /wiki/Britanya_%C4%B0mparatorlu%C4%9Fu.html fails to load /I/m/British_colonies_1763-76_shepherd1923.PNG
  2. broken links
    2.1. /wiki/Ha%C3%A7l%C4%B1_Seferleri.html links to /wiki/VII._Gregorius.html but the latter produces error – probably due to double . in filename?
    • same problem with /wiki/I._D%C3%BCnya_Sava%C5%9F%C4%B1.html at /wiki/1920.html
  3. other
    3.1. /wiki/ is a broken page (it should redirect to Main page instead, just like / - reuse /index.html?)
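For issue 3.1, a generated /wiki/index.html with a meta refresh would cover the redirect; a hypothetical helper (the actual implementation may differ):

```javascript
// Build a minimal HTML page that redirects /wiki/ to the main page,
// with a plain link as a fallback for clients that ignore meta refresh.
function makeRedirectPage (mainPage) {
  return `<!DOCTYPE html>
<html>
  <head><meta http-equiv="refresh" content="0; url=${mainPage}"></head>
  <body><a href="${mainPage}">${mainPage}</a></body>
</html>`
}
```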

@kanej (Member, Author) commented Mar 18, 2020

Hi @lidel,

Turkish Wikipedia

./mirrorzim.sh \
  --languagecode=tr \
  --wikitype=wikipedia \
  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \
  --hostingipnshash=QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W \
  --mainpageversion=19869765

Demo: https://ipfs.io/ipfs/bafybeihtoa6mcsgkotg3rd3xlym7pulcj7hifv4begsajfqad7bndqbpq4

Issues

Main page from your Turkish example has some broken links starting with /w/index.php

I removed the categories section of the home page; that is where these links were, and it was broken by the Kiwix JS problems.

Turkish Main page continues to point at resources at upload.wikimedia.org/wikipedia/commons/ in srcset

srcset is now removed

mirrorzim.sh: it is impossible to pass custom OID without passing IPNS

I have switched to key value pair args as you suggest. The README.md gives the Turkish wikipedia example.

missing images

1. /wiki/Amerika_Birle%C5%9Fik_Devletleri_Hava_Kuvvetleri.html

This image is missing from the ZIM file. kiwix-serve doesn't have it either.

2. /wiki/Lockheed_Martin_F-22_Raptor.html

Fixed

3. /wiki/Britanya_%C4%B0mparatorlu%C4%9Fu.html

Fixed

broken links

/wiki/VII._Gregorius.html is not being extracted from the ZIM file, but it is in there (it appears in kiwix-serve). I think this is one for the Rust extract_zim program.

/wiki/ is a broken page

As suggested, this now redirects to the homepage, e.g. /wiki/Anasayfa.html

@lidel (Member) left a review:

Thanks! Probably good to merge already, but I'll run this again against a bigger snapshot and report back, just in case we missed something: it may take some time, thank you for your patience.

@parkan commented Apr 6, 2020

@kanej @lidel let me know if I can help get this over the finish line!

@lidel (Member) commented Apr 7, 2020

Apologies for the radio silence!

I hoped to test this against the latest version of the English wiki, but I had a sync call yesterday: the Kiwix project is still struggling to produce one, and it won't be available for at least the next two weeks (possibly more; it takes 6 days to generate, and the last build failed at 99%, ouch).

So instead, to unblock this, I am testing this PR against the newly published Turkish wikipedia_tr_all_maxi_2020-04.zim (March snapshot).

TL;DR I am pretty confident we can merge this, unblock the bounty, and produce a new Turkish snapshot this week. I just need to double-check the Turkish build with the main page override. Details below.


Quick feedback:

  • As far as I was able to tell, the issues that could be fixed in this project are fixed. The remaining bugs are caused by data already present (or missing) in the ZIM file, which is outside the scope of this work.

  • When original "Main page" from ZIM is used, everything looks good: https://bafybeihyk5dwbk56nao4v2ks4m7rvod6x4puxh2gzxet3h2kexrviynere.ipfs.dweb.link/wiki/Anasayfa.html

    • The only thing I would fix is the page title:
      - <title>Kullan&#x131;c&#x131;:The other Kiwix guy/Landing - Vikipedi</title>
      + <title>Vikipedi: Özgür Ansiklopedi</title>
      
      But it is cosmetic and can be done manually.
  • When we pass --mainpageversion=19869765 to mirrorzim.sh it restores the original "main page", but it does not seem to respect the actual version – instead, the latest version as of TODAY is fetched and used.

    • Example: https://bafybeih3bnqhgyqyiwl3ggclq7hsb2exz77fw7lrnus547yhnqnowjv3tq.ipfs.dweb.link/wiki/Anasayfa.html – the header has the date 7th of April.
      It was built with:
      ./mirrorzim.sh --languagecode=tr --wikitype=wikipedia --hostingdnsdomain=tr.wikipedia-on-ipfs.org --mainpageversion=19869765

    • It looks like the version is not passed properly to the main script, but it's a bit late for me and it's possible I am doing something wrong – will try again tomorrow.
      • Note to self: I should find the new version number, as I'm using the new snapshot.

@parkan commented Apr 7, 2020

@lidel thank you for the update! +1 on merging this; fixing the version issue would be good, but I don't think it's a blocker.

@kanej would you be open to using Gitcoin to get payment for this bounty (in DAI)? We're looking into trying Gitcoin as a bounty platform, and I would like to manually run one through to get a feel for the system.

@kanej (Member, Author) commented Apr 8, 2020

@parkan Happy to go the gitcoin route.

@lidel marked this pull request as ready for review April 9, 2020 22:44

@lidel (Member) commented Apr 9, 2020

I did one last pass and believe the current state of this PR meets the Acceptance Criteria of the bounty listed in #64, namely:

  • PR with necessary changes is submitted and merged to this repo
  • Script works and enables us to produce updated IPFS mirror of the latest Turkish snapshot (tested with wikipedia_tr_all_maxi_2020-04.zim)
  • CID of a demo output is provided

Once again, thank you for your patience @kanej and thank you @parkan for helping with finalizing this bounty.

Demo: wikipedia_tr_all_maxi_2020-04

I've built an updated Turkish snapshot from wikipedia_tr_all_maxi_2020-04.zim:
https://bafybeih3pdooy7tghdiuibphro6cwlyxp4henmmap7f2qm6fyitgpqafle.ipfs.dweb.link

Digression on what this PR means for the distributed mirror project

Note that the goal here was to produce an updated snapshot in a best-effort fashion by fixing the blockers listed in #64. Rough edges remain in the process of producing a mirror from an unpacked ZIM, because ZIMs were never designed to be used in unpacked form and the existing tooling is lacking.

There will be broken links, but I don't believe investing more time in fixing them makes sense: we are approaching the space of diminishing returns, and the time will be better spent looking into alternative ways of mirroring Wikipedia, decreasing the complexity of this project.

Expect updates related to research into putting ZIMs on IPFS and reading them without the need for unpacking. Ideally, with a regular web browser.

@lidel merged commit 3992e0d into ipfs:master on Apr 9, 2020
@lidel mentioned this pull request on Apr 9, 2020
Linked issue that may be closed by merging: "Error: file does not exist" when trying to add CA wikipedia

4 participants