Generator directory format #6
Good approach )
src/main.rs (outdated):

        Ok(title) => title,
    };

    // NOTE: Some wikipedia titles have '/' in them.
How are they processed in the generator?
The generator only works with complete wikipedia urls (and wikidata QIDs). OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process.
It dumps the urls to a file for the descriptions scraper; then, when it adds them to the mwm files, it strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location.
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.
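Roughly, the path lookup I'm describing works like this; a quick illustrative sketch (the function and base-directory names are mine, not the generator's actual code):

```rust
use std::path::{Path, PathBuf};

/// Map a dumped article URL to the directory that is scanned for
/// per-language HTML files: strip the protocol, then append the rest
/// of the URL to the base directory.
fn article_dir(base: &Path, url: &str) -> PathBuf {
    let without_protocol = url
        .trim_start_matches("https://")
        .trim_start_matches("http://");
    base.join(without_protocol)
}

fn main() {
    let dir = article_dir(
        Path::new("descriptions"),
        "https://de.wikipedia.org/wiki/Breil/Brigels",
    );
    // A '/' in the title just becomes another subdirectory level.
    assert_eq!(
        dir,
        PathBuf::from("descriptions/de.wikipedia.org/wiki/Breil/Brigels")
    );
    // The generator would then look for de.html, en.html, ... inside `dir`.
}
```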
Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.
I'm working on a list of changes that would be helpful
src/main.rs (outdated):

    // NOTE: Some wikipedia titles have '/' in them.
    let wikipedia_dir = title.get_dir(base.as_ref().to_owned());
    // TODO: handle incorrect links, directories
For example?
The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
    /// Write selected article to disk.
    ///
    /// - Write page contents to wikidata page (`wikidata.org/wiki/QXXX/lang.html`).
    /// - If the page has no wikidata qid, write contents to wikipedia location (`lang.wikipedia.org/wiki/article_title/lang.html`).
Lang is used twice here in the path, but only one file is ever stored in the directory, right?
The behavior that the generator/scraper expects is to write all available translations in each directory.
So for the article for Berlin, if there are OSM tags for `wikipedia:en=Berlin`, `wikipedia:de=Berlin`, `wikipedia:fr=Berlin`, and `wikidata=Q64`, and the generator keeps them all, then there will be four folders with duplicates of all language copies:

    en.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    de.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    fr.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
    wikidata/Q64/{en.html, de.html, fr.html, ...}
Now, I don't understand exactly how the generator picks which tags to use yet, but just from looking at the Canada Yukon region map there are duplicated copies of wikipedia items there.
For this program, we only see one language at a time, so we write that copy to the master wikidata directory. When we later get the same article in a different language, we write it to the same wikidata directory.
Once all the languages have been processed, it would look like:
    en.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    de.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    fr.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
    wikidata/Q64/{en.html, de.html, fr.html, ...}
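A minimal sketch of that write-once-then-link flow (Unix-only; the function and path handling here are illustrative, not the PR's actual code):

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Write one translation into the shared wikidata directory and make the
/// wikipedia title directory a symlink pointing at it.
fn write_translation(
    base: &Path,     // e.g. "descriptions"
    qid: &str,       // e.g. "Q64"
    wiki_path: &str, // e.g. "en.wikipedia.org/wiki/Berlin"
    lang: &str,      // e.g. "en"
    html: &str,
) -> io::Result<()> {
    // All language copies live under wikidata/QXXX/.
    let qid_dir = base.join("wikidata").join(qid);
    fs::create_dir_all(&qid_dir)?;
    fs::write(qid_dir.join(format!("{lang}.html")), html)?;

    // en.wikipedia.org/wiki/Berlin -> wikidata/Q64 (absolute target so the
    // link stays valid regardless of how deep the title directory is).
    let link = base.join(wiki_path);
    fs::create_dir_all(link.parent().unwrap())?;
    if !link.exists() {
        symlink(fs::canonicalize(&qid_dir)?, &link)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    write_translation(
        Path::new("descriptions"),
        "Q64",
        "en.wikipedia.org/wiki/Berlin",
        "en",
        "<html>…</html>",
    )
}
```

The real code would probably want relative link targets so the tree can be moved as a whole, but the idea is the same.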
    @@ -132,6 +152,11 @@ impl FromStr for WikidataQid {
    ///
    /// assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
    /// assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
    ///
    /// assert!(
    ///     WikipediaTitleNorm::from_url("https://de.wikipedia.org/wiki/Breil/Brigels").unwrap() !=
Can `/` be percent-escaped in such cases? How does the generator handle it now?
I guess it could be; I haven't looked for that. Wikipedia works with either.
See below for more details, but the generator should decode those before dumping the urls.
It looks like a handful of encoded titles still slip through, but none with `%2F` (`/`).
I made an issue with some notes about this in #7.
From my read of when it first adds a wikipedia tag and later writes it as a url:
- If the tag looks like a url instead of the expected `lang:Article Title` format, take what's after `.wikipedia.org/wiki/`, url-decode it, replace underscores with spaces, then concat that with the lang at the beginning of the url and store it.
- Otherwise attempt to check if it's a url, replace underscores with spaces, and store it.
- To transform it back into a url, replace spaces with underscores in the title, escape any `%`s, and append it to `https://lang.wikipedia.org/wiki/`.

Glancing at the url decoding, I don't think there's anything wrong with it: it should handle arbitrary characters, although neither the encoding nor the decoding looks unicode-aware.
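To make that round trip concrete, here's a rough Rust sketch of the behavior as I read it; the `urlencoding` crate and the function names are stand-ins of mine, not the generator's code:

```rust
// Cargo.toml (assumed): urlencoding = "2"
use urlencoding::decode;

/// Normalize a wikipedia tag value that is a full URL into "lang:Title With
/// Spaces"; anything else is kept, with underscores replaced by spaces.
fn normalize_tag(value: &str) -> Option<String> {
    if let Some(title) = value.split(".wikipedia.org/wiki/").nth(1) {
        // URL form: percent-decode the part after /wiki/ and swap
        // underscores back to spaces. Assumes a full https:// URL.
        let lang = value.split("//").nth(1)?.split('.').next()?;
        let title = decode(title).ok()?.replace('_', " ");
        Some(format!("{lang}:{title}"))
    } else {
        // "lang:Article Title" form.
        Some(value.replace('_', " "))
    }
}

/// Turn the stored "lang:Title" form back into a URL: spaces become
/// underscores and bare '%' characters are escaped, nothing else.
fn to_url(normalized: &str) -> Option<String> {
    let (lang, title) = normalized.split_once(':')?;
    let title = title.replace(' ', "_").replace('%', "%25");
    Some(format!("https://{lang}.wikipedia.org/wiki/{title}"))
}

fn main() {
    let norm = normalize_tag("https://de.wikipedia.org/wiki/Breil%2FBrigels").unwrap();
    assert_eq!(norm, "de:Breil/Brigels");
    assert_eq!(
        to_url(&norm).unwrap(),
        "https://de.wikipedia.org/wiki/Breil/Brigels"
    );
}
```

Note that a decoded `/` comes back out literally on re-encoding, which matches the literal slashes in the dumped urls.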
    @@ -145,7 +170,7 @@ impl WikipediaTitleNorm {
            title.trim().replace(' ', "_")
        }

        // https://en.wikipedia.org/wiki/Article_Title
        // https://en.wikipedia.org/wiki/Article_Title/More_Title
Is more than one slash in the title possible?
Yes, there are a handful, for example https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower.
There are 39 present in the generator urls:

    $ grep -E '^https://\w+\.wikipedia\.org/wiki/.+/.+/' /tmp/wikipedia_urls.txt | sort | uniq
    https://de.wikipedia.org/wiki/Darum/Gretesch/Lüstringen
    https://de.wikipedia.org/wiki/Kienhorst/Köllnseen/Eichheide
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Erlangen/A#Altstädter_Friedhof_2/3,_Altstädter_Friedhof
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001-1/099)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)#Evang._Christuskirche
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/100–1/199)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/200–1/299)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/300–1/399)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/400–1/499)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)
    https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)#Schulgeb.C3.A4ude
    https://de.wikipedia.org/wiki/Rhumeaue/Ellerniederung/Gillersheimer_Bachtal
    https://de.wikipedia.org/wiki/Speck_/_Wehl_/_Helpenstein
    https://de.wikipedia.org/wiki/Veldrom/Feldrom/Kempen
    https://de.wikipedia.org/wiki/VHS_Witten/Wetter/Herdecke
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Bach
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Judenberg
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Kramerberg
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Loasleiten
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Pelzereck
    https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Theresienberg
    https://de.wikipedia.org/wiki/Wohnanlage_Arzbacher_Straße/Thalkirchner_Straße/Wackersberger_Straße/Würzstraße
    https://en.wikipedia.org/wiki/Abura/Asebu/Kwamankese_District
    https://en.wikipedia.org/wiki/Ajumako/Enyan/Essiam_District
    https://en.wikipedia.org/wiki/Bibiani/Anhwiaso/Bekwai_Municipal_District
    https://en.wikipedia.org/wiki/Clapp/Langley/Crawford_Complex
    https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower
    https://en.wikipedia.org/wiki/SAIT/AUArts/Jubilee_station
    https://en.wikipedia.org/wiki/Santa_Cruz/Graciosa_Bay/Luova_Airport
    https://fr.wikipedia.org/wiki/Landunvez#/media/Fichier:10_Samson_C.jpg
    https://gl.wikipedia.org/wiki/Moaña#/media/Ficheiro:Plano_de_Moaña.png
    https://it.wikipedia.org/wiki/Tswagare/Lothoje/Lokalana
    https://lb.wikipedia.org/wiki/Lëscht_vun_den_nationale_Monumenter_an_der_Gemeng_Betzder#/media/Fichier:Roodt-sur-Syre,_14_rue_d'Olingen.jpg
    https://pt.wikipedia.org/wiki/Wikipédia:Wikipédia_na_Universidade/Cursos/Rurtugal/Gontães
    https://ru.wikipedia.org/wiki/Алажиде#/maplink/0
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Волинська_область/Старовижівський_район
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Київська_область/Броварський_район
    https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Полтавська_область/Семенівський_район
I ran them with all languages on my machine. I only have 4 cores, so more than two instances didn't show much of an improvement. Speaking of which, after investigating pgzip further, my understanding is that it can only parallelize decompressing files that it compressed in a specific way. I'll make another issue for investigating other gunzip implementations.

Parallelism is the next step; it can be done using existing tools. Let's lower its priority. Why is there a race condition with QIDs? Aren't they created from a separate pass over the OSM dump?

When running multiple instances in parallel, they could process different translations of an article at the same time, and interleave between checking that the QID folder doesn't exist and creating it. The same thing could hypothetically happen with article title folders, but since each dump is in a different language it shouldn't occur. Either way it's unlikely, and it won't take down the entire program.

Aren't file system operations atomic? Adding a handler for the case "tried to create it but it was already created by another process" is a good idea.

Yes, individual syscalls should be atomic, but I don't think there are any guarantees between the call that checks whether the directory exists and the call that creates it.
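For reference, a minimal sketch of the handling I have in mind (not code from this PR): treat `AlreadyExists` as success when creating the directory, so two racing instances can't fail each other. I believe `std::fs::create_dir_all` already tolerates concurrent creation, but matching the error kind makes the intent explicit.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Create a directory (and any missing parents), treating "already exists"
/// as success so that two instances racing to create the same QID folder
/// both proceed.
fn ensure_dir(path: &Path) -> io::Result<()> {
    match fs::create_dir_all(path) {
        Ok(()) => Ok(()),
        // Another process won the race between our existence check and this
        // call; the directory is there either way.
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    ensure_dir(Path::new("descriptions/wikidata/Q64"))
}
```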
README.md (outdated):

    To serve as a drop-in replacement for the descriptions scraper:
    - Install this tool to `$PATH` as `om-wikiparser`.
    - Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
    - Set `WIKIPEDIA_ENTERPRISE_DUMPS` to the list of the dump files to process
List? Delimited by what? Any example? Is specifying a directory with dumps better?
I meant a shell list/array(?), separated by spaces.
One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?
It's better to mention list item separators explicitly and provide some example for clarity.
README.md (outdated):

    @@ -4,5 +4,24 @@ _Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki

    ## Usage

    To serve as a drop-in replacement for the descriptions scraper:
    - Install this tool to `$PATH` as `om-wikiparser`.
Why should it be on `$PATH`? Can it be run from any directory?
It doesn't need to be; the example script reads more clearly to me if it's in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.
...then why suggest installing the tool on `$PATH`?
So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you or copying it into your working directory.
I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.
Maybe writing a shell script to use on the maps server instead would be helpful?
Would you prefer:
    # Transform intermediate files from generator.
    cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
    tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
    # Begin extraction.
    for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
    do
      tar xzf $dump --to-stdout | $WIKIPARSER_DIR/target/release/om-wikiparser \
        --wikidata-ids wikidata_ids.txt \
        --wikipedia-urls wikipedia_urls.txt \
        descriptions/
    done
or
    # Transform intermediate files from generator.
    maps_build=~/maps_build/$BUILD_DATE/intermediate_data
    cut -f 2 $maps_build/id_to_wikidata.csv > $maps_build/wikidata_ids.txt
    tail -n +2 $maps_build/wiki_urls.txt | cut -f 3 > $maps_build/wikipedia_urls.txt
    # Begin extraction.
    for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
    do
      tar xzf $dump --to-stdout | ./target/release/om-wikiparser \
        --wikidata-ids $maps_build/wikidata_ids.txt \
        --wikipedia-urls $maps_build/wikipedia_urls.txt \
        $maps_build/descriptions/
    done
- Can it be wrapped in a helper script that can be easily customized and run on the generator, maybe directly from the wikiparser repo? :)
- `cargo run -r` may be even better than a path to the binary :) But it's also ok to hard-code the path or use a `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Fewer surprises = less stress ;-)
Btw, it may make sense to also print/measure time taken to execute some commands after the first run on the whole planet, to have some reference starting values.
I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.
README.md (outdated):

    - Set `WIKIPEDIA_ENTERPRISE_DUMPS` to the list of the dump files to process
    - Run the following from within the `intermediate_data` subdirectory of the maps build directory:
      ```shell
      # transform intermediate files from generator
Is extracting ids directly from the osm pbf planet dump better than relying on the intermediate generator files? What are pros and cons?
Pros:
- Independent of the generator process. Can be run as soon as planet file is updated.
Cons:
- Need to keep osm query in sync with generator's own multi-step filtering and transformation process.
- Need to match generator's multi-step processing of urls exactly.
When I did this earlier, it was with the `osmfilter` tool; I only tested it on the Yukon region, and it output more entries than the generator did.
I can create an issue for this, but the rough steps to get that working are:
- Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly.
- Dig into generator map processing to try to improve querying.
- Compare processing of a complete planet with generator output.
- Write conversion of `osmium` output for `wikiparser` to use.
- Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
- What's wrong with outputting more URLs? I assume that the generator may now filter out OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right? Do you remember how big the percentage of "unnecessary" articles is?
- osmfilter can work with o5m, and osmconvert can process pbf. There is also https://docs.rs/osmpbf/latest/osmpbf/ for direct pbf processing if it makes the approach simpler. How good is the osmium tool compared to other options?
It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?
> It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term.
Absolutely agree!
> Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
I think so. Do you mean the wikipedia/wikidata files or the mwm format in general?
As for transformations: when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the Wikidata ids, they didn't match up with what I got from `osmfilter`, even when the urls were the same. That's not a problem for the wikiparser as long as the QIDs/articles are all caught, but it made it harder to tell if they were doing the same thing.
As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on `ftypes_matcher.cpp`).
> What's wrong with outputting more URLs? I assume that the generator may now filter out OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right?
As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.
> Do you remember how big the percentage of "unnecessary" articles is?
That was around 25%, but that was in the Yukon territory, so there weren't very many nodes, and I would guess it's not comparable to the planet.
> How good is the osmium tool compared to other options?
I haven't looked into `osmium` much, but my understanding is that it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.
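For reference, a rough sketch of what direct pbf processing could look like with the osmpbf crate; the tag filtering here is just the naive superset case discussed above, and the crate usage is from memory, so treat it as a starting point rather than a tested pipeline:

```rust
// Cargo.toml (assumed): osmpbf = "0.3"
use osmpbf::{Element, ElementReader};

/// Print every wikipedia/wikidata tag value found in a planet extract,
/// with no amenity/type filtering at all.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = ElementReader::from_path("planet-latest.osm.pbf")?;
    reader.for_each(|element| {
        let tags: Vec<(&str, &str)> = match &element {
            Element::Node(n) => n.tags().collect(),
            Element::DenseNode(n) => n.tags().collect(),
            Element::Way(w) => w.tags().collect(),
            Element::Relation(r) => r.tags().collect(),
        };
        for (key, value) in tags {
            if key == "wikidata" || key == "wikipedia" || key.starts_with("wikipedia:") {
                println!("{key}={value}");
            }
        }
    })?;
    Ok(())
}
```

Matching the generator's amenity/type filtering on top of this is exactly the comparison work described above.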
> I think so. Do you mean the wikipedia/wikidata files or the mwm format in general?
I meant those files that are required for wikiparser to work. It actually may make sense to keep it in README or some other doc, not in an issue.
The map generator expects a certain folder structure created by the current scraper to add the article content into the mwm files.
- Article html is written to wikidata directory.
- Directories are created for any matched titles and symlinked to the wikidata directory.
- Articles without a QID are written to article title directory.
- Article titles containing `/` are not escaped, so multiple subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │   └── wiki
    │       ├── Coal_River_Springs_Territorial_Park
    │       │   ├── de.html
    │       │   └── ru.html
    │       ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │       │   ├── de.html
    │       │   └── en.html
    │       ...
    ├── en.wikipedia.org
    │   └── wiki
    │       ├── Arctic_National_Wildlife_Refuge
    │       │   ├── de.html
    │       │   ├── en.html
    │       │   ├── es.html
    │       │   ├── fr.html
    │       │   └── ru.html
    │       ├── Baltimore
    │       │   └── Washington_International_Airport
    │       │       ├── de.html
    │       │       ├── en.html
    │       │       ├── es.html
    │       │       ├── fr.html
    │       │       └── ru.html
    │       ...
    └── wikidata
        ├── Q59320
        │   ├── de.html
        │   ├── en.html
        │   ├── es.html
        │   ├── fr.html
        │   └── ru.html
        ├── Q120306
        │   ├── de.html
        │   ├── en.html
        │   ├── es.html
        │   ├── fr.html
        │   └── ru.html
        ...

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
I decided to break up the next steps into smaller PRs compared to the last one.
This PR updates the program to create the folder structure that the map generator expects, as shown in the hierarchy above.
While the old description scraper would write duplicates for the same article's title and qid, this implementation writes symlinks in the wikipedia tree that point to the wikidata files.
I know I can change what the generator looks for, but I figured it would be easier to have this working and then change them together instead of debugging both at the same time while neither works.
The goal is that with this PR, the parser will be a drop-in replacement for the current scraper, even if the speed and html size is not what we'd like.
Remaining work for this PR:
- Skipping articles that haven't changed between dumps (e.g. timestamps): moved to #9, "Skip articles that haven't changed between dumps".