Investigate escaping in article titles and urls #7

Open
newsch opened this issue Jun 24, 2023 · 0 comments

Collaborator

newsch commented Jun 24, 2023

Wikipedia article titles can contain slashes (/). Wikipedia accepts them in urls either escaped or unescaped, e.g.
https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport
and
https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport
return the same page, and neither redirects to the other.
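
To make the two forms concrete, here is a minimal Python sketch using the standard library's `urllib.parse`. It is only an illustration, not the generator's code, and it shows that the escaped and unescaped paths decode to the same title.

```python
from urllib.parse import quote, unquote

title = "Baltimore/Washington_International_Airport"

# quote() leaves "/" alone by default; safe="" forces it to be escaped.
unescaped = quote(title)
escaped = quote(title, safe="")

print(unescaped)  # Baltimore/Washington_International_Airport
print(escaped)    # Baltimore%2FWashington_International_Airport

# Both forms decode back to the same article title.
assert unquote(unescaped) == unquote(escaped) == title
```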

The generator attempts to decode urls from OSM tags, and then encodes '%' again when it converts them back into urls.
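
A minimal Python sketch of how this could produce a doubly encoded url, assuming the tag value is already percent-encoded and is not decoded before the encoding step. `to_url` is a hypothetical stand-in, not the generator's actual function.

```python
from urllib.parse import quote

def to_url(lang: str, title: str) -> str:
    """Hypothetical stand-in for the generator's title-to-url step."""
    # "%" is not in quote()'s safe set, so any "%" in the title becomes "%25".
    return f"https://{lang}.wikipedia.org/wiki/{quote(title)}"

# A tag value that was already percent-encoded when it was entered in OSM.
already_encoded = to_url("sv", "Kanngjutarm%C3%A4starens_hus")
print(already_encoded)
# https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus

# A title with a literal "%" is encoded correctly by the same step.
literal_percent = to_url("es", "100%_Banco")
print(literal_percent)
# https://es.wikipedia.org/wiki/100%25_Banco
```

The same encoding step is right for the second input and wrong for the first, so the step itself cannot tell the two cases apart.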

My guess is that some of the tags that are not urls still have url encoding in them, but determining which are actually url-encoded and which just have % in them is a little tricky, and the generator doesn't do that.
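
One possible heuristic for telling the two cases apart, sketched in Python. This is an assumption about how the check could work, not something the generator does today, and `looks_percent_encoded` is a hypothetical name. It treats a value as url-encoded only if every `%` starts a two-digit hex escape and the decoded bytes are valid UTF-8.

```python
import re
from urllib.parse import unquote_to_bytes

# A well-formed escape is "%" followed by exactly two hex digits.
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def looks_percent_encoded(value: str) -> bool:
    """Heuristic guess at whether `value` is url-encoded (hypothetical helper)."""
    escapes = ESCAPE.findall(value)
    # No escapes, or a stray "%" that is not part of an escape: treat as literal text.
    if not escapes or len(escapes) != value.count("%"):
        return False
    try:
        # The decoded bytes must form valid UTF-8 to count as an encoded title.
        unquote_to_bytes(value).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_percent_encoded("Kanngjutarm%C3%A4starens_hus"))  # True
print(looks_percent_encoded("100%_Banco"))                    # False
print(looks_percent_encoded("Sight_%26_Sound_Theatres"))      # True
```

The heuristic is still ambiguous: a literal title in which `%` happens to be followed by two hex digits (a made-up example would be `50%20_off`) is classified as encoded.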

It looks like some of the resulting urls are encoded twice, though thankfully only a small number:

$ tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%' | sort | uniq
https://de.wikipedia.org/wiki/Georg-B%25C3%25BCchner-Platz
https://de.wikipedia.org/wiki/Kontorhaus_am_J%25C3%25B6debrunnen
https://en.wikipedia.org/wiki/Brighton_%2526_Hove_Greyhound_Stadium
https://en.wikipedia.org/wiki/de:Liste_der_Kulturdenkmäler_in_Schwachhausen#0218%252CT003
https://en.wikipedia.org/wiki/McMullen%2527s_Brewery
https://en.wikipedia.org/wiki/P%25C3%25A9cs_TV_Tower
https://en.wikipedia.org/wiki/Sedbergh_People%2527s_Hall
https://en.wikipedia.org/wiki/Sight_%2526_Sound_Theatres
https://es.wikipedia.org/wiki/100%25_Banco
https://es.wikipedia.org/wiki/Ruta_de_los_D%25C3%25B3lmenes
https://FR.wikipedia.org/wiki/Maisons_industrialis%25C3%25A9es_%25C3%25A0_Meudon
https://fr.wikipedia.org/wiki/Salm_(rivi%25C3%25A8re_de_Belgique)
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
https://sv.wikipedia.org/wiki/Kungliga_Tr%25C3%25A4dg%25C3%25A5rden_3
https://sv.wikipedia.org/wiki/Sverigev%25C3%25A4ggen
https://sv.wikipedia.org/wiki/V%25C3%25A4ttern,_Storfors_kommun
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Of those, all except the three below are malformed:

https://es.wikipedia.org/wiki/100%25_Banco
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Some appear to contain percent-encoded non-ASCII characters that were then encoded a second time, for example:

https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
with each extra %25 collapsed back to % becomes:
https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
which the browser converts to:
https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
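
The same two steps can be reproduced with Python's standard library. This is a sketch for this one url only, not a proposed fix for the generator.

```python
from urllib.parse import unquote

double_encoded = "https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus"

# Step 1: collapse each "%25" back into a bare "%".
single_encoded = double_encoded.replace("%25", "%")
print(single_encoded)
# https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus

# Step 2: decode the remaining escapes as UTF-8, as the browser does for display.
readable = unquote(single_encoded)
print(readable)
# https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
```

Applying step 1 blindly to every url would break the correctly encoded https://es.wikipedia.org/wiki/100%25_Banco, so any real cleanup would need to be selective.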
