Keep track of canonical URLs from page markup in source_metadata #728

Open
Mr0grog opened this issue Jun 29, 2021 · 1 comment
Mr0grog commented Jun 29, 2021

Some pages have a <link rel="canonical" href="{url}"> element in their markup, indicating a correct, “canonical” URL for the page (some more info here: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from the Wayback Machine, it would be great to include the canonical URL in the source_metadata field.

We already parse HTML pages that we’re importing to get their titles, and we could get the canonical link (if present) at a similar point in the process. Ideally we should create a way to only parse the page once to get title, canonical link, and anything else we might want to extract from the page content in the future.

Where we already parse markup for titles:

title = ''
if media_type in HTML_MEDIA_TYPES:
    encoding = detect_encoding(memento.content, memento.headers)
    title = utils.extract_title(memento.content, encoding)
elif media_type in PDF_MEDIA_TYPES or memento.content.startswith(b'%PDF-'):
    title = utils.extract_pdf_title(memento.content) or title

def extract_title(content_bytes, encoding='utf-8'):
    "Return content of <title> tag as string. On failure return empty string."
    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
    # The parser expects a file-like, so we mock one.
    content_as_file = io.StringIO(content_str)
    try:
        title = lxml.html.parse(content_as_file).find(".//title")
    except Exception:
        return ''
    if title is None or title.text is None:
        return ''
    # In HTML, all consecutive whitespace (including line breaks) collapses
    return WHITESPACE_PATTERN.sub(' ', title.text.strip())

Mr0grog commented Jun 29, 2021

It might be useful to do this for <link rel="shortlink">, too. I had totally forgotten about that one!
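For illustration, here is a rough sketch of what a single-pass extractor for the title, canonical link, and shortlink might look like. It is not the project's implementation: to keep it dependency-free it swaps in the standard library's `html.parser` instead of the `lxml` parser that `extract_title` uses, and the names `PageMetadataParser`, `extract_page_metadata`, `canonical_url`, and `shortlink_url` are hypothetical.

```python
import re
from html.parser import HTMLParser

WHITESPACE_PATTERN = re.compile(r'\s+')


class PageMetadataParser(HTMLParser):
    """Collect <title> text and selected <link rel="..."> hrefs in one pass."""

    # rel values to record. Hypothetical choice, based on this issue.
    LINK_RELS = ('canonical', 'shortlink')

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._title_done = False
        self._title_parts = []
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._title_done:
            self._in_title = True
        elif tag == 'link':
            attributes = dict(attrs)
            href = attributes.get('href')
            # rel is a space-separated, case-insensitive list of tokens.
            rels = (attributes.get('rel') or '').lower().split()
            for rel in rels:
                # Keep only the first href seen for each rel value.
                if rel in self.LINK_RELS and href and rel not in self.links:
                    self.links[rel] = href.strip()

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        # In HTML, all consecutive whitespace (including line breaks) collapses
        return WHITESPACE_PATTERN.sub(' ', ''.join(self._title_parts).strip())


def extract_page_metadata(content_bytes, encoding='utf-8'):
    """Return a dict of title and link URLs; missing values are empty strings."""
    empty = {'title': '', 'canonical_url': '', 'shortlink_url': ''}
    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
    parser = PageMetadataParser()
    try:
        parser.feed(content_str)
        parser.close()
    except Exception:
        return empty
    return {
        'title': parser.title,
        'canonical_url': parser.links.get('canonical', ''),
        'shortlink_url': parser.links.get('shortlink', ''),
    }
```

The importer could then call something like `extract_page_metadata(memento.content, encoding)` once per HTML page and copy the non-empty URL values into `source_metadata`. The key names used there would still need to be decided.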

@stale stale bot added the stale label Jan 9, 2022
@stale stale bot closed this as completed Apr 16, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@Mr0grog Mr0grog removed the stale label Apr 17, 2022
@Mr0grog Mr0grog reopened this Apr 17, 2022