Some pages have a `<link rel="canonical" href="{url}">` element in their markup, indicating the preferred, “canonical” URL for the page (more info here: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from the Wayback Machine, it would be great to include the canonical URL in the `source_metadata` field.
We already parse the HTML pages we’re importing to get their titles, and we could get the canonical link (if present) at a similar point in the process. Ideally, we should create a way to parse the page only once to get the title, the canonical link, and anything else we might want to extract from the page content in the future.
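As a rough illustration of the single-pass idea, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the project's existing title-parsing code: the names `PageMetadataParser` and `extract_page_metadata`, and the `canonical_url` key, are hypothetical and chosen just for this example.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Hypothetical helper: collect title and canonical URL in one pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.canonical_url = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self._in_title = True
        elif tag == 'link' and self.canonical_url is None:
            attributes = dict(attrs)
            # `rel` is a space-separated, case-insensitive list of tokens.
            rel_tokens = (attributes.get('rel') or '').lower().split()
            href = attributes.get('href')
            if 'canonical' in rel_tokens and href:
                self.canonical_url = href.strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace in the collected title text.
            self.title = ' '.join(''.join(self._title_parts).split())


def extract_page_metadata(html_text):
    """Return the first title and canonical URL found (None if absent)."""
    parser = PageMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {'title': parser.title, 'canonical_url': parser.canonical_url}
```

The returned dictionary could then be merged into `source_metadata` during import, and further fields could be added to the same parser later without a second pass over the page. A relative `href` is returned unchanged here; resolving it against the page URL is left open.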
Where we already parse markup for titles:
web-monitoring-processing/web_monitoring/cli/cli.py
Lines 408 to 413 in 21512eb
web-monitoring-processing/web_monitoring/utils.py
Lines 98 to 112 in 21512eb