Generator directory format #6
Good approach )
src/main.rs (outdated):

        Ok(title) => title,
    };

    // NOTE: Some wikipedia titles have '/' in them.
How are they processed in the generator?
The generator only works with complete wikipedia urls (and wikidata QIDs). OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process.
It dumps the urls to a file for the descriptions scraper; then, when it adds them to the mwm files, it strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location.
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.
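Roughly, the path lookup I'm describing works like this; a quick illustrative sketch (the function and base-directory names are mine, not the generator's actual code):

```rust
use std::path::{Path, PathBuf};

/// Map a dumped article URL to the directory that is scanned for
/// per-language HTML files: strip the protocol, then append the rest
/// of the URL to the base directory.
fn article_dir(base: &Path, url: &str) -> PathBuf {
    let without_protocol = url
        .trim_start_matches("https://")
        .trim_start_matches("http://");
    base.join(without_protocol)
}

fn main() {
    let dir = article_dir(
        Path::new("descriptions"),
        "https://de.wikipedia.org/wiki/Breil/Brigels",
    );
    // A '/' in the title just becomes another subdirectory level.
    assert_eq!(
        dir,
        PathBuf::from("descriptions/de.wikipedia.org/wiki/Breil/Brigels")
    );
    // The generator would then look for de.html, en.html, ... inside `dir`.
}
```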
Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.
I'm working on a list of changes that would be helpful
src/main.rs (outdated):

    // NOTE: Some wikipedia titles have '/' in them.
    let wikipedia_dir = title.get_dir(base.as_ref().to_owned());
    // TODO: handle incorrect links, directories
For example?
The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
    /// Write selected article to disk.
    ///
    /// - Write page contents to wikidata page (`wikidata.org/wiki/QXXX/lang.html`).
    /// - If the page has no wikidata qid, write contents to wikipedia location (`lang.wikipedia.org/wiki/article_title/lang.html`).
Lang is used twice here in the path, but only one file is ever stored in the directory, right?
The behavior that the generator/scraper expects is to write all available translations in each directory.
So for the article for Berlin, if there are OSM tags for `wikipedia:en=Berlin`, `wikipedia:de=Berlin`, `wikipedia:fr=Berlin`, and `wikidata=Q64`, and the generator keeps them all, then there will be four folders with duplicates of all language copies:

    en.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    de.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    fr.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    wikidata/Q64/{en.html, de.html, fr.html, ...}
Now, I don't understand exactly how the generator picks which tags to use yet, but just from looking at the Canada Yukon region map there are duplicated copies of wikipedia items there.
For this program, we only see one language at a time, so we write that copy to the master wikidata directory. When we later get the same article in a different language, we write it to the same wikidata directory.
Once all the languages have been processed, it would look like:
    en.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    de.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    fr.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    wikidata/Q64/{en.html, de.html, fr.html, ...}
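A minimal sketch of that write-once-then-link flow (Unix-only; the function and path handling here are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Write one translation into the shared wikidata directory and make the
/// wikipedia title directory a symlink pointing at it.
fn write_translation(
    base: &Path,     // e.g. "descriptions"
    qid: &str,       // e.g. "Q64"
    wiki_path: &str, // e.g. "en.wikipedia.org/wiki/Berlin"
    lang: &str,      // e.g. "en"
    html: &str,
) -> io::Result<()> {
    // All language copies live under wikidata/QXXX/.
    let qid_dir = base.join("wikidata").join(qid);
    fs::create_dir_all(&qid_dir)?;
    fs::write(qid_dir.join(format!("{lang}.html")), html)?;

    // en.wikipedia.org/wiki/Berlin -> wikidata/Q64 (absolute target so the
    // link stays valid regardless of how deep the title directory is).
    let link = base.join(wiki_path);
    fs::create_dir_all(link.parent().unwrap())?;
    if !link.exists() {
        symlink(fs::canonicalize(&qid_dir)?, &link)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    write_translation(
        Path::new("descriptions"),
        "Q64",
        "en.wikipedia.org/wiki/Berlin",
        "en",
        "<html>…</html>",
    )
}
```

The real code would probably want relative link targets so the tree can be moved as a whole, but the idea is the same.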
    @@ -132,6 +152,11 @@ impl FromStr for WikidataQid {
    ///
    /// assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
    /// assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
    ///
    /// assert!(
    ///     WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil/Brigels").unwrap() !=
Can `/` be percent-escaped in such cases? How does the generator handle it now?
I guess it could be; I haven't looked for that. Wikipedia works with either.
See below for more details, but the generator should decode those before dumping the urls.
It looks like a handful of encoded titles still slip through, but none with `%2F` (`/`).
I made an issue with some notes about this in #7.
From my read of when it first adds a wikipedia tag and later writes it as a url:
- If the tag looks like a url instead of the expected `lang:Article Title` format, take what's after `.wikipedia.org/wiki/`, url-decode it, replace underscores with spaces, then concat that with the lang at the beginning of the url and store it.
- Otherwise attempt to check if it's a url, replace underscores with spaces, and store it.
- To transform it back into a url, replace spaces with underscores in the title, escape any `%`s, and append it to `https://lang.wikipedia.org/wiki/`.

Glancing at the url decoding, I don't think there's anything wrong with it: it should handle arbitrary characters, although neither the encoding nor the decoding looks unicode-aware.
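To make that round trip concrete, here's a rough Rust sketch of the behavior as I read it; the `urlencoding` crate and the function names are stand-ins of mine, not the generator's code:

```rust
// Cargo.toml (assumed): urlencoding = "2"
use urlencoding::decode;

/// Normalize a wikipedia tag value that is a full URL into "lang:Title With
/// Spaces"; anything else is kept, with underscores replaced by spaces.
fn normalize_tag(value: &str) -> Option<String> {
    if let Some(title) = value.split(".wikipedia.org/wiki/").nth(1) {
        // URL form: percent-decode the part after /wiki/ and swap
        // underscores back to spaces. Assumes a full https:// URL.
        let lang = value.split("//").nth(1)?.split('.').next()?;
        let title = decode(title).ok()?.replace('_', " ");
        Some(format!("{lang}:{title}"))
    } else {
        // "lang:Article Title" form.
        Some(value.replace('_', " "))
    }
}

/// Turn the stored "lang:Title" form back into a URL: spaces become
/// underscores and bare '%' characters are escaped, nothing else.
fn to_url(normalized: &str) -> Option<String> {
    let (lang, title) = normalized.split_once(':')?;
    let title = title.replace(' ', "_").replace('%', "%25");
    Some(format!("https://{lang}.wikipedia.org/wiki/{title}"))
}

fn main() {
    let norm = normalize_tag("https://de.wikipedia.org/wiki/Breil%2FBrigels").unwrap();
    assert_eq!(norm, "de:Breil/Brigels");
    assert_eq!(
        to_url(&norm).unwrap(),
        "https://de.wikipedia.org/wiki/Breil/Brigels"
    );
}
```

Note that a decoded `/` comes back out literally on re-encoding, which matches the literal slashes in the dumped urls.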
    @@ -145,7 +170,7 @@ impl WikipediaTitleNorm {
            title.trim().replace(' ', "_")
        }

        // https://en.wikipedia.org/wiki/Article_Title
        // https://en.wikipedia.org/wiki/Article_Title/More_Title
Is more than one slash in the title possible?
Yes, there are a handful, for example https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower.
There are 39 present in the generator urls:

    $ grep -E '^https://\w+\.wikipedia\.org/wiki/.+/.+/' /tmp/wikipedia_urls.txt | sort | uniq
    https://de.wikipedia.org/wiki/Darum/Gretesch/Lüstringen
    https://de.wikipedia.org/wiki/Kienhorst/Köllnseen/Eichheide
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Erlangen/A#Altstädter_Friedhof_2/3,_Altstädter_Friedhof
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001-1/099)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)#Evang._Christuskirche
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/100–1/199)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/200–1/299)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/300–1/399)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/400–1/499)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)#Schulgeb.C3.A4ude
    https://de.wikipedia.org/wiki/Rhumeaue/Ellerniederung/Gillersheimer_Bachtal
    https://de.wikipedia.org/wiki/Speck_/_Wehl_/_Helpenstein
    https://de.wikipedia.org/wiki/Veldrom/Feldrom/Kempen
    https://de.wikipedia.org/wiki/VHS_Witten/Wetter/Herdecke
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Bach
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Judenberg
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Kramerberg
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Loasleiten
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Pelzereck
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Theresienberg
    https://de.wikipedia.org/wiki/Wohnanlage_Arzbacher_Straße/Thalkirchner_Straße/Wackersberger_Straße/Würzstraße
    https://en.wikipedia.org/wiki/Abura/Asebu/Kwamankese_District
    https://en.wikipedia.org/wiki/Ajumako/Enyan/Essiam_District
    https://en.wikipedia.org/wiki/Bibiani/Anhwiaso/Bekwai_Municipal_District
    https://en.wikipedia.org/wiki/Clapp/Langley/Crawford_Complex
    https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower
    https://en.wikipedia.org/wiki/SAIT/AUArts/Jubilee_station
    https://en.wikipedia.org/wiki/Santa_Cruz/Graciosa_Bay/Luova_Airport
    https://fr.wikipedia.org/wiki/Landunvez#/media/Fichier:10_Samson_C.jpg
    https://gl.wikipedia.org/wiki/Moaña#/media/Ficheiro:Plano_de_Moaña.png
    https://it.wikipedia.org/wiki/Tswagare/Lothoje/Lokalana
    https://lb.wikipedia.org/wiki/Lëscht_vun_den_nationale_Monumenter_an_der_Gemeng_Betzder#/media/Fichier:Roodt-sur-Syre,_14_rue_d'Olingen.jpg
    https://pt.wikipedia.org/wiki/Wikipédia:Wikipédia_na_Universidade/Cursos/Rurtugal/Gontães
    https://ru.wikipedia.org/wiki/Алажиде#/maplink/0
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Волинська_область/Старовижівський_район
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Київська_область/Броварський_район
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Полтавська_область/Семенівський_район
I ran them with all languages on my machine. I only have 4 cores, so more than two instances didn't show much of an improvement. Speaking of which, after investigating pgzip further, my understanding is that it can only parallelize decompressing files that it compressed in a specific way. I'll make another issue for investigating other gunzip implementations.

Parallelism is the next step; it can be done using existing tools. Let's lower its priority. Why is there a race condition with QIDs? Aren't they created from a separate pass over the OSM dump?

When running multiple instances in parallel, they could process different translations of an article at the same time, and interleave between checking that the QID folder doesn't exist and creating it. The same thing could hypothetically happen with article title folders, but since each dump is in a different language it shouldn't occur. Either way it's unlikely, and it won't take down the entire program.

Aren't file system operations atomic? Adding a handler for the case "tried to create it but it was already created by another process" is a good idea.

Yes, individual syscalls should be atomic, but I don't think there are any guarantees between the call that checks whether the directory exists and the call that creates it.
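For reference, a minimal sketch of the handling I have in mind (not code from this PR): treat `AlreadyExists` as success when creating the directory, so two racing instances can't fail each other. I believe `std::fs::create_dir_all` already tolerates concurrent creation, but matching the error kind makes the intent explicit.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Create a directory (and any missing parents), treating "already exists"
/// as success so that two instances racing to create the same QID folder
/// both proceed.
fn ensure_dir(path: &Path) -> io::Result<()> {
    match fs::create_dir_all(path) {
        Ok(()) => Ok(()),
        // Another process won the race between our existence check and this
        // call; the directory is there either way.
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    ensure_dir(Path::new("descriptions/wikidata/Q64"))
}
```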
README.md (outdated):

    To serve as a drop-in replacement for the descriptions scraper:
    - Install this tool to `$PATH` as `om-wikiparser`.
    - Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
    - Set `WIKIPEDIA_ENTERPRISE_DUMPS` to the list of the dump files to process
List? Delimited by what? Any example? Is specifying a directory with dumps better?
I meant a shell list/array(?), separated by spaces.
One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?
It's better to mention list item separators explicitly and provide some example for clarity.
README.md (outdated):

    @@ -4,5 +4,24 @@ _Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki

    ## Usage

    To serve as a drop-in replacement for the descriptions scraper:
    - Install this tool to `$PATH` as `om-wikiparser`.
Why should it be on `$PATH`? Can it be run from any directory?
It doesn't need to be; the example script reads more clearly to me if it's in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.
...then why suggest installing the tool on `$PATH`?
So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you or copying it into your working directory.
I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.
Maybe writing a shell script to use on the maps server instead would be helpful?
Would you prefer:
    # Transform intermediate files from generator.
    cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
    tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
    # Begin extraction.
    for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
    do
      tar xzf $dump --to-stdout | $WIKIPARSER_DIR/target/release/om-wikiparser \
        --wikidata-ids wikidata_ids.txt \
        --wikipedia-urls wikipedia_urls.txt \
        descriptions/
    done
or
    # Transform intermediate files from generator.
    maps_build=~/maps_build/$BUILD_DATE/intermediate_data
    cut -f 2 $maps_build/id_to_wikidata.csv > $maps_build/wikidata_ids.txt
    tail -n +2 $maps_build/wiki_urls.txt | cut -f 3 > $maps_build/wikipedia_urls.txt
    # Begin extraction.
    for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
    do
      tar xzf $dump --to-stdout | ./target/release/om-wikiparser \
        --wikidata-ids $maps_build/wikidata_ids.txt \
        --wikipedia-urls $maps_build/wikipedia_urls.txt \
        $maps_build/descriptions/
    done
- Can it be wrapped in a helper script that can be easily customized and run on the generator, maybe directly from the wikiparser repo? :)
- `cargo run -r` may be even better than a path to the binary :) But it's also ok to hard-code the path or use a `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Fewer surprises = less stress ;-)
Btw, it may make sense to also print/measure time taken to execute some commands after the first run on the whole planet, to have some reference starting values.
I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.
README.md (outdated):

    - Set `WIKIPEDIA_ENTERPRISE_DUMPS` to the list of the dump files to process
    - Run the following from within the `intermediate_data` subdirectory of the maps build directory:
      ```shell
      # transform intermediate files from generator
Is extracting ids directly from the osm pbf planet dump better than relying on the intermediate generator files? What are pros and cons?
Pros:
- Independent of the generator process. Can be run as soon as planet file is updated.
Cons:
- Need to keep osm query in sync with generator's own multi-step filtering and transformation process.
- Need to match generator's multi-step processing of urls exactly.
When I did this earlier, it was with the `osmfilter` tool; I only tested it on the Yukon region, and it output more entries than the generator did.
I can create an issue for this, but the rough steps to get that working are:
- Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly.
- Dig into generator map processing to try to improve querying.
- Compare processing of a complete planet with generator output.
- Write conversion of `osmium` output for `wikiparser` to use.
- Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
- What's wrong with outputting more URLs? I assume that the generator may now filter out OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right? Do you remember how big the percentage of "unnecessary" articles is?
- osmfilter can work with o5m, and osmconvert can process pbf. There is also https://docs.rs/osmpbf/latest/osmpbf/ for direct pbf processing if it makes the approach simpler. How good is the osmium tool compared to other options?
It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?
> It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term.
Absolutely agree!
> Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
I think so. Do you mean the wikipedia/wikidata files or the mwm format in general?
As for transformations: when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the Wikidata ids, they didn't match up with what I got from `osmfilter`, even when the urls were the same. That's not a problem for the wikiparser as long as the QIDs/articles are all caught, but it made it harder to tell if they were doing the same thing.
As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on `ftypes_matcher.cpp`).
> What's wrong with outputting more URLs? I assume that the generator may now filter out OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right?
As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.
> Do you remember how big the percentage of "unnecessary" articles is?
That was around 25%, but that was in the Yukon territory, so there weren't very many nodes, and I would guess it's not comparable to the planet.
> How good is the osmium tool compared to other options?
I haven't looked into `osmium` much, but my understanding is that it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.
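For reference, a rough sketch of what direct pbf processing could look like with the osmpbf crate; the tag filtering here is just the naive superset case discussed above, and the crate usage is from memory, so treat it as a starting point rather than a tested pipeline:

```rust
// Cargo.toml (assumed): osmpbf = "0.3"
use osmpbf::{Element, ElementReader};

/// Print every wikipedia/wikidata tag value found in a planet extract,
/// with no amenity/type filtering at all.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = ElementReader::from_path("planet-latest.osm.pbf")?;
    reader.for_each(|element| {
        let tags: Vec<(&str, &str)> = match &element {
            Element::Node(n) => n.tags().collect(),
            Element::DenseNode(n) => n.tags().collect(),
            Element::Way(w) => w.tags().collect(),
            Element::Relation(r) => r.tags().collect(),
        };
        for (key, value) in tags {
            if key == "wikidata" || key == "wikipedia" || key.starts_with("wikipedia:") {
                println!("{key}={value}");
            }
        }
    })?;
    Ok(())
}
```

Matching the generator's amenity/type filtering on top of this is exactly the comparison work described above.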
> I think so. Do you mean the wikipedia/wikidata files or the mwm format in general?
I meant those files that are required for wikiparser to work. It actually may make sense to keep it in README or some other doc, not in an issue.
The map generator expects a certain folder structure created by the current scraper to add the article content into the mwm files.
- Article html is written to wikidata directory.
- Directories are created for any matched titles and symlinked to the wikidata directory.
- Articles without a QID are written to article title directory.
- Article titles containing `/` are not escaped, so multiple subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │   └── wiki
    │       ├── Coal_River_Springs_Territorial_Park
    │       │   ├── de.html
    │       │   └── ru.html
    │       ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │       │   ├── de.html
    │       │   └── en.html
    │       ...
    ├── en.wikipedia.org
    │   └── wiki
    │       ├── Arctic_National_Wildlife_Refuge
    │       │   ├── de.html
    │       │   ├── en.html
    │       │   ├── es.html
    │       │   ├── fr.html
    │       │   └── ru.html
    │       ├── Baltimore
    │       │   └── Washington_International_Airport
    │       │       ├── de.html
    │       │       ├── en.html
    │       │       ├── es.html
    │       │       ├── fr.html
    │       │       └── ru.html
    │       ...
    └── wikidata
        ├── Q59320
        │   ├── de.html
        │   ├── en.html
        │   ├── es.html
        │   ├── fr.html
        │   └── ru.html
        ├── Q120306
        │   ├── de.html
        │   ├── en.html
        │   ├── es.html
        │   ├── fr.html
        │   └── ru.html
        ...

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
I decided to break up the next steps into smaller PRs compared to the last one.
This PR updates the program to create the folder structure that the map generator expects, as shown in the hierarchy above.
While the old description scraper would write duplicates for the same article's title and qid, this implementation writes symlinks in the wikipedia tree that point to the wikidata files.
I know I can change what the generator looks for, but I figured it would be easier to have this working and then change them together instead of debugging both at the same time while neither works.
The goal is that with this PR, the parser will be a drop-in replacement for the current scraper, even if the speed and html size is not what we'd like.
Remaining work for this PR:
- Skipping articles that haven't changed between dumps (e.g. timestamps): moved to #9, "Skip articles that haven't changed between dumps".