Update tr.wikipedia-on-ipfs.org #60

Closed
8 of 9 tasks
lidel opened this issue Sep 9, 2019 · 16 comments
Labels
Epic language language-specific issues P1 High: Likely tackled by core team if no one steps up

Comments

@lidel
Member

lidel commented Sep 9, 2019

This could be done manually or as a part of #58

@lidel lidel added the language language-specific issues label Sep 9, 2019
@lidel lidel added the P1 High: Likely tackled by core team if no one steps up label Oct 28, 2019
@lidel
Member Author

lidel commented Oct 28, 2019

Couldn't extract it fully; some files fail with the "other os error" described in dignifiedquire/zim#3:

$  extract_zim --skip-link wikipedia_tr_all_maxi_2019-10.zim --out distributed-wikipedia-mirror/out2
Mon 28 Oct 12:26:11 CET 2019                                                                                                                       
Extracting file: wikipedia_tr_all_maxi_2019-10.zim to distributed-wikipedia-mirror/out2

  Creating map
  Extracting entries: 4808
  Spawning 4808 threads
couldn't create distributed-wikipedia-mirror/out2/A/Eternity: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Eternity: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Eternity: other os error
thread '<unnamed>' panicked at 'failed retry: couldn't create distributed-wikipedia-mirror/out2/A/Eternity: other os error', extract_zim.rs:133:17
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
couldn't create distributed-wikipedia-mirror/out2/A/Karacaoğlan: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Karacaoğlan: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Karacaoğlan: other os error
thread '<unnamed>' panicked at 'failed retry: couldn't create distributed-wikipedia-mirror/out2/A/Karacaoğlan: other os error', extract_zim.rs:133:17
couldn't create distributed-wikipedia-mirror/out2/A/Dört/Mazi_Kalbimde: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Dört/Mazi_Kalbimde: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Dört/Mazi_Kalbimde: other os error
thread '<unnamed>' panicked at 'failed retry: couldn't create distributed-wikipedia-mirror/out2/A/Dört/Mazi_Kalbimde: other os error', extract_zim.rs:133:17
couldn't create distributed-wikipedia-mirror/out2/A/HIV: other os error
couldn't create distributed-wikipedia-mirror/out2/A/HIV: other os error
couldn't create distributed-wikipedia-mirror/out2/A/HIV: other os error
thread '<unnamed>' panicked at 'failed retry: couldn't create distributed-wikipedia-mirror/out2/A/HIV: other os error', extract_zim.rs:133:17
couldn't create distributed-wikipedia-mirror/out2/A/Fredrikstad/Sarpsborg: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Fredrikstad/Sarpsborg: other os error
couldn't create distributed-wikipedia-mirror/out2/A/Fredrikstad/Sarpsborg: other os error
thread '<unnamed>' panicked at 'failed retry: couldn't create distributed-wikipedia-mirror/out2/A/Fredrikstad/Sarpsborg: other os error', extract_zim.rs:133:17
...

Fixing this error would be the first step toward unblocking this update.
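Since every failing path is retried three times and then repeated in a panic message, the raw log is noisy. A minimal sketch for condensing it into one line per failing path follows; the regular expression is inferred only from the log lines quoted above, not from the extract_zim source, and `summarize_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the quoted log output (an assumption, not taken
# from extract_zim itself): "couldn't create <path>: <reason>".
FAILURE_RE = re.compile(r"^couldn't create (?P<path>.+): (?P<reason>.+)$")


def summarize_failures(log_text: str) -> Counter:
    """Count failed creation attempts per output path.

    Lines that start with "thread '<unnamed>' panicked at ..." are skipped
    because the pattern is anchored at the start of the line.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            counts[match.group("path")] += 1
    return counts
```

Feeding the saved extractor output to `summarize_failures` gives a short list of distinct paths to attach to a bug report.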

@kelson42

kelson42 commented Nov 7, 2019

I want to make a side remark here. It is a pity that openZIM does not seem to provide an official tool you can just use to extract the ZIM content. I'm not sure about your exact requirements, but if I get them, I will seriously consider doing something to fix that problem for you. I really want to encourage you to create a feature request here: https://github.com/openzim/zim-tools

@lidel
Member Author

lidel commented Nov 7, 2019

@kelson42 thank you for bringing zim-tools to my attention!
I was not around when the tweaked extract_zim was created, but I've been told it was created either because zimdump was simply not around yet or because the original one was missing some features.
I'll run some tests with zimdump from zim-tools and report back.

Update: I think we could switch, but some things need to be fixed first. See #66 :)

@dignifiedquire
Member

This error should now be fixed in the latest version of dignifiedquire/zim.

@dignifiedquire
Member

Just ran the extraction, all fixed now

$ time ./target/release/extract_zim --skip-link ~/Downloads/wikipedia_tr_all_maxi_2019-10.zim --out ./out
Extracting file: /Users/dignifiedquire/Downloads/wikipedia_tr_all_maxi_2019-10.zim to ./out

  Creating map
  Extracting entries: 4808
  Spawning 4808 tasks across 16 threads
  Extraction done in 47453ms
  Main page is Kullanıcı:The_other_Kiwix_guy/Landing
./target/release/extract_zim --skip-link  --out ./out  81.81s user 218.73s system 631% cpu 47.560 total
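The only change visible in the two transcripts is that the work went from "4808 threads" to "4808 tasks across 16 threads". The actual fix in dignifiedquire/zim is not shown in this thread, so the following is only a Python illustration of that general pattern (many small write tasks on a bounded worker pool), not the Rust implementation. `write_entry` and `extract_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Mapping


def write_entry(out_dir: Path, rel_path: str, data: bytes) -> Path:
    """Write one entry, creating parent directories for titles with slashes."""
    target = out_dir / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


def extract_all(entries: Mapping[str, bytes], out_dir: Path, workers: int = 16) -> int:
    """Run one task per entry on a fixed-size pool instead of one thread per entry."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_entry, out_dir, rel_path, data)
            for rel_path, data in entries.items()
        ]
        for future in futures:
            future.result()  # re-raises any exception from the worker
    return len(futures)
```

Capping the pool keeps the number of files being written at once bounded, whatever the number of entries in the archive.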

@lidel
Member Author

lidel commented Jan 20, 2020

Thank you @dignifiedquire, this is great!

I took it for a spin and initial results are pretty good (wip in #67):

  • Unpacking wikipedia_tr_all_maxi_2019-10.zim took less than two minutes.
  • Adding ~10GB to IPFS (0.5.0-dev with sharding+badgerds) in --offline mode took under 10 minutes.

The next step is to figure out #64 and the landing page for execute-changes.sh.

@kelson42 do you know why the .zim file states that the Main page is Kullanıcı:The_other_Kiwix_guy/Landing?
Is providing a custom page a new convention in the Kiwix project?

@lidel
Member Author

lidel commented Jan 22, 2020

  • The original landing page at ./out/A/Anasayfa.html seems to be truncated: it includes only the "article of the week" section, which makes it a pretty bad landing page overall.

  • Tried to finish snapshot creation, but the scripts no longer work.
    The JS and the directory structure changed so much that the entire execute-changes.sh needs to be redone.

  • I also noticed that ./out/-/j/js_modules/jsConfigVars.js is invalid: its entire contents are the single character (. The file has the same contents when unpacked with zimdump, so it's not a bug in extract_zim.

  • Retested with wikipedia_tr_all_maxi_2019-12.zim, same results
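A file like jsConfigVars.js that contains a single character is easy to miss by eye. A minimal sketch for flagging such files in an unpacked tree is below; the 16-byte threshold and the `find_suspicious_scripts` name are assumptions for illustration, and a small file is only a hint, not proof, that the script is broken.

```python
from pathlib import Path
from typing import List


def find_suspicious_scripts(root: Path, min_bytes: int = 16) -> List[str]:
    """Return .js files under root that are smaller than min_bytes.

    min_bytes is an arbitrary cutoff chosen for this sketch.
    """
    suspicious = []
    for path in sorted(root.rglob("*.js")):
        if path.is_file() and path.stat().st_size < min_bytes:
            suspicious.append(path.relative_to(root).as_posix())
    return suspicious
```

Running it against ./out after each unpack would surface the same file regardless of which extractor produced the tree.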

Update: I created a bounty for remaining work: #64

@lidel
Member Author

lidel commented Apr 9, 2020

#70 fixed the most painful blockers, and I was able to produce a new snapshot from wikipedia_tr_all_maxi_2020-04.zim:

It has a pretty nice footer with useful information about the mirror and its sources.

Two cosmetic issues remain:

  1. The Main page at /wiki/Anasayfa.html is fetched for the date the ZIM was unpacked, instead of the day the ZIM was created. (This seems to be specific to the Turkish wiki.)
  2. Some links on the Main page are broken, for example the ones in the header (/wiki/Portal:Matematik.html etc.).
    • Not a problem, as the old version did not have links there anyway

Both can be fixed manually by patching HTML in /wiki/Anasayfa.html, but if anyone has time to fix them programmatically, that would be useful for other languages.
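As a possible first step toward a programmatic fix for the broken links, here is a minimal sketch that lists root-relative links in a page whose targets are missing from the snapshot directory. It assumes the snapshot directory mirrors URL paths (so /wiki/Anasayfa.html lives at <root>/wiki/Anasayfa.html), which this thread does not confirm, and `find_broken_local_links` is a hypothetical helper, not part of execute-changes.sh.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List
from urllib.parse import unquote, urlparse


class _HrefCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_broken_local_links(html: str, site_root: Path) -> List[str]:
    """Return root-relative hrefs whose target file is missing under site_root."""
    collector = _HrefCollector()
    collector.feed(html)
    broken = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        # Skip external links and anything that is not root-relative.
        if parsed.scheme or parsed.netloc or not parsed.path.startswith("/"):
            continue
        target = site_root / unquote(parsed.path).lstrip("/")
        if not target.is_file():
            broken.append(href)
    return broken
```

The resulting list could then drive either link removal or a fetch of the missing pages, depending on what the mirror should contain.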

@momack2
Contributor

momack2 commented Apr 9, 2020

The view on mobile isn’t great - is that expected or a regression?
[attached image: 2E167E64-3635-4A38-8B5B-B6D3CD4DBB71]

@lidel
Member Author

lidel commented Apr 10, 2020

@momack2 Yes, if we want the "original" landing page, this is something we need to fix manually (I believe the one in the old snapshot was also crafted by hand).

Context:

The original Main page is truncated or not included in many ZIM files, so we have no "mobile friendly" version.
We download the original HTML from Wikipedia itself and add it to the unpacked ZIM snapshot, which, as seen above, requires some work.

FYI, ZIM files often use a custom page provided by a contributor that makes more sense for offline use (example). You can see it does not have a "topic of the day"; instead, it's a dry list of wide topics ready to explore.

If we keep it, very little or no manual fixes may be needed because it is already simple enough to be mobile friendly – see this build where I left the original landing page from ZIM archive: https://bafybeieoya74422ovlmx23i5bxpuw2szsdrhsjwenfxkqoknw34jigcoua.ipfs.dweb.link/wiki/Anasayfa.html (added only the landing page, links may not resolve)

@momack2
Contributor

momack2 commented Apr 10, 2020

Oh yeah - much better. Not perfect, but "good enough" for smooth browsing.

@lidel
Member Author

lidel commented Feb 15, 2021

Ok friends, I've picked up the ball in #77 and produced a brand new snapshot from wikipedia_tr_all_maxi_2021-02.zim. If this goes well we will do the same for English (#61)

Highlights:

👉 take it for a spin and comment if you find any issues:

@kelson42

kelson42 commented Feb 16, 2021

"Kiwix" in place of "kiwix" in the footer would be better. With a link to https://kiwix.org, even better :)

@lidel
Member Author

lidel commented Feb 16, 2021

@kelson42 added this in #82 and created an updated version:

@mburns mind switching https://tr.dev.wikipedia-on-ipfs.org to the above CID?

@mburns

mburns commented Feb 17, 2021

done. :)

@lidel
Member Author

lidel commented Feb 18, 2021

Ok, this should be good enough for now.

DNSLink is updated and https://tr.wikipedia-on-ipfs.org now points at:
bafybeieuutdavvf55sh3jktq2dpi2hkle6dtmebe7uklod3ramihyf3xa4 (generated from wikipedia_tr_all_maxi_2021-02)

I'm now shifting focus to the English one, tracking that in #61.
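For readers unfamiliar with DNSLink: it maps a domain to content through a TXT record on the _dnslink subdomain whose value has the form dnslink=/ipfs/<cid>. The sketch below only builds and parses that record value for the CID mentioned above; how the DNS zone for wikipedia-on-ipfs.org is actually managed is not described in this thread, and both function names are hypothetical.

```python
from typing import Tuple


def dnslink_record(domain: str, cid: str) -> Tuple[str, str]:
    """Return the (record name, TXT value) pair for a DNSLink to an IPFS CID."""
    return (f"_dnslink.{domain}", f"dnslink=/ipfs/{cid}")


def parse_dnslink(txt_value: str) -> str:
    """Extract the content path (for example /ipfs/<cid>) from a TXT value."""
    prefix = "dnslink="
    if not txt_value.startswith(prefix):
        raise ValueError(f"not a DNSLink TXT value: {txt_value!r}")
    return txt_value[len(prefix):]
```

For example, `dnslink_record("tr.wikipedia-on-ipfs.org", "bafybeieuutdavvf55sh3jktq2dpi2hkle6dtmebe7uklod3ramihyf3xa4")` yields the record name `_dnslink.tr.wikipedia-on-ipfs.org` and the matching `dnslink=/ipfs/...` value.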

@lidel lidel closed this as completed Feb 18, 2021
@lidel lidel unpinned this issue Feb 18, 2021
lidel added a commit that referenced this issue Feb 19, 2021
@ipfs ipfs deleted a comment from VIP0000fa Aug 21, 2024
5 participants