Keep track of canonical URLs from page markup in `source_metadata` #728

Mr0grog · 2021-06-29T06:30:02Z

Some pages have a `<link rel="canonical" href="{url}">` element in their markup, indicating a correct, “canonical” URL for the page (some more info here: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from the Wayback Machine, it would be great to include the canonical URL in the `source_metadata` field.

We already parse HTML pages that we’re importing to get their titles, and we could get the canonical link (if present) at a similar point in the process. Ideally we should create a way to only parse the page once to get title, canonical link, and anything else we might want to extract from the page content in the future.

Where we already parse markup for titles:

web-monitoring-processing/web_monitoring/cli/cli.py

Lines 408 to 413 in 21512eb

    
    title = ''
    if media_type in HTML_MEDIA_TYPES:
        encoding = detect_encoding(memento.content, memento.headers)
        title = utils.extract_title(memento.content, encoding)
    elif media_type in PDF_MEDIA_TYPES or memento.content.startswith(b'%PDF-'):
        title = utils.extract_pdf_title(memento.content) or title

web-monitoring-processing/web_monitoring/utils.py

Lines 98 to 112 in 21512eb

    
    def extract_title(content_bytes, encoding='utf-8'):
        "Return content of <title> tag as string. On failure return empty string."
        content_str = content_bytes.decode(encoding=encoding, errors='ignore')
        # The parser expects a file-like, so we mock one.
        content_as_file = io.StringIO(content_str)
        try:
            title = lxml.html.parse(content_as_file).find(".//title")
        except Exception:
            return ''
        if title is None or title.text is None:
            return ''
        # In HTML, all consecutive whitespace (including line breaks) collapses
        return WHITESPACE_PATTERN.sub(' ', title.text.strip())
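
As a rough starting point, a single-pass extractor might look something like the sketch below. It is only an illustration: it uses the standard library's `html.parser` instead of `lxml` so it runs without dependencies, and the names `extract_page_metadata`, `_PageMetadataParser`, and the `canonical_url` key are hypothetical, not anything that exists in the codebase. The same shape should carry over to the `lxml` tree that `extract_title` already builds.

```python
import re
from html.parser import HTMLParser

# Assumed to match the existing WHITESPACE_PATTERN in utils.py.
WHITESPACE_PATTERN = re.compile(r'\s+')


class _PageMetadataParser(HTMLParser):
    """Collect the first <title> text and first canonical <link> in one pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._title_done = False
        self._title_parts = []
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._title_done:
            self._in_title = True
        elif tag == 'link' and self.canonical_url is None:
            attributes = dict(attrs)
            # `rel` is a space-separated, case-insensitive list of tokens.
            rel_tokens = (attributes.get('rel') or '').lower().split()
            href = (attributes.get('href') or '').strip()
            if 'canonical' in rel_tokens and href:
                self.canonical_url = href

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace, as extract_title does.
        return WHITESPACE_PATTERN.sub(' ', ''.join(self._title_parts).strip())


def extract_page_metadata(content_bytes, encoding='utf-8'):
    """Return a dict with the page title and canonical URL (hypothetical helper)."""
    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
    parser = _PageMetadataParser()
    try:
        parser.feed(content_str)
        parser.close()
    except Exception:
        return {'title': '', 'canonical_url': None}
    return {'title': parser.title, 'canonical_url': parser.canonical_url}
```

The caller in `cli.py` could then take the title from the returned dict and copy `canonical_url` into `source_metadata` when it is not `None`, so the page content is only parsed once.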


Mr0grog · 2021-06-29T06:35:37Z

It might be useful to do this for <link rel="shortlink">, too. I had totally forgotten about that one!
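
If we want more than one `rel` value, the lookup could be generalized along these lines. This is a sketch under a few assumptions: `extract_link_rels` and `_LinkRelParser` are made-up names, it uses the standard library's `html.parser` instead of `lxml`, it keeps only the first `<link>` found for each `rel`, and it resolves relative `href` values against the page URL while ignoring any `<base>` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkRelParser(HTMLParser):
    """Record the first href seen for each wanted <link rel="..."> token."""

    def __init__(self, wanted_rels):
        super().__init__()
        self.wanted_rels = frozenset(wanted_rels)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        href = (attributes.get('href') or '').strip()
        if not href:
            return
        # A single <link> can carry several rel tokens, in any letter case.
        for token in (attributes.get('rel') or '').lower().split():
            if token in self.wanted_rels:
                self.found.setdefault(token, href)


def extract_link_rels(html_text, page_url, wanted_rels=('canonical', 'shortlink')):
    """Return {rel: absolute_url} for each wanted rel present (hypothetical helper)."""
    parser = _LinkRelParser(wanted_rels)
    parser.feed(html_text)
    parser.close()
    return {rel: urljoin(page_url, href) for rel, href in parser.found.items()}
```

Only the `rel` values that actually appear in the page end up in the result, so it could be merged into `source_metadata` under whatever key names we settle on.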

Mr0grog added the enhancement and good-first-issue labels Jun 29, 2021

stale bot added the stale label Jan 9, 2022

stale bot closed this as completed Apr 16, 2022

edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022

Mr0grog removed the stale label Apr 17, 2022

Mr0grog reopened this Apr 17, 2022
