Optionally pre-check known versions in wm import ia-known-pages #664

Closed

Mr0grog opened this issue Nov 13, 2020 · 1 comment

Comments

Mr0grog (Member) commented Nov 13, 2020
Depends on #659, #660.

In the wm import ia-known-pages script, we first list all the pages in web-monitoring-db, then search for them in Wayback and import every memento we can find. Because we import data from overlapping time periods each time we run the import script (this works around occasional outages in the Wayback Machine’s indexing), we wind up importing a lot of versions that we already have in web-monitoring-db! That’s not strictly a problem (web-monitoring-db will just ignore data it already has), but it is a waste of bandwidth and resources.

Instead, the import script should first load all the versions in the timeframe we're checking so that, as it gets CDX results, it can check them against the list and skip over versions that are already in the DB.

After loading the list of page URLs in import_ia_db_urls():

def import_ia_db_urls(*, from_date=None, to_date=None, maintainers=None,
                      tags=None, skip_unchanged='resolved-response',
                      url_pattern=None, worker_count=0,
                      unplaybackable_path=None, dry_run=False):
    client = db.Client.from_env()
    logger.info('Loading known pages from web-monitoring-db instance...')
    urls, version_filter = _get_db_page_url_info(client, url_pattern)

We should load all the versions in the timeframe (using the new features in #660) and add them to version_filter, e.g.:

if should_precheck:
    logger.info('Pre-loading known versions...')

    # Identify a memento by its CDX-style timestamp plus its original URL.
    def memento_key(time, url):
        return f'{time.strftime("%Y%m%d%H%M%S")}|{url}'

    versions = client.list_all_versions(start_date=from_date,
                                        end_date=to_date,
                                        sort='capture_time:asc',
                                        chunk_size=1000)
    known_mementos = {memento_key(v['capture_time'], v['capture_url'])
                      for v in versions}

    # Wrap the existing filter so CDX records whose mementos are already
    # in the DB get skipped before we bother importing them.
    _filter = version_filter
    def precheck_filter(cdx_record):
        if memento_key(cdx_record.timestamp, cdx_record.url) in known_mementos:
            return False
        return _filter(cdx_record)

    version_filter = precheck_filter

Because this can be a time- and memory-consuming process for large timeframes, we should probably have a CLI option to turn it on (or off; not sure yet which default makes more sense).
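As a rough sketch of what that option could look like, assuming an argparse-style entry point (the actual wm command may use a different argument parser, and both the --precheck-versions flag name and the should_precheck parameter on import_ia_db_urls are hypothetical):

import argparse

parser = argparse.ArgumentParser(prog='wm import ia-known-pages')
# Hypothetical flag name; defaulting to off preserves current behavior,
# since prechecking costs time and memory up front.
parser.add_argument('--precheck-versions', action='store_true',
                    help='Pre-load known versions from web-monitoring-db '
                         'and skip CDX records we already have.')
args = parser.parse_args()

# Assumes import_ia_db_urls (defined above) grows a should_precheck kwarg.
import_ia_db_urls(from_date=None, to_date=None,
                  should_precheck=args.precheck_versions)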

Mr0grog added a commit that referenced this issue Nov 14, 2020
This adds a new set of `get_*()` methods to replace `list_*()` in `db.Client`. For example, `get_pages()` replaces `list_pages()`.

These new methods return an iterator over *the entire result set* rather than a single, paginated chunk of results. They make for much nicer usage and should make #664 easier.

The old methods will eventually be removed; they’re not what you want 99% of the time, and when you do need them, you can still call `client.request_json()`. It’s slightly more verbose, but that’s OK for the uncommon use case.

Fixes #660.
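For illustration, here is roughly how the two styles compare. This is only a sketch; the exact parameter names and response shape for the old methods are assumptions, not the final API:

from web_monitoring import db

client = db.Client.from_env()

# Old style: list_pages() returns a single paginated chunk, so callers
# have to manage pagination themselves.
chunk = client.list_pages(chunk_size=100)
for page in chunk['data']:
    print(page['url'])

# New style: get_pages() returns an iterator over the entire result set
# and handles pagination internally.
for page in client.get_pages():
    print(page['url'])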
Mr0grog (Member, Author) commented Nov 30, 2020

This was solved in #667.

Mr0grog closed this as completed Nov 30, 2020