Optionally pre-check known versions in wm import ia-known-pages #664

Closed

Mr0grog opened this issue Nov 13, 2020 · 1 comment

Comments

Mr0grog (Member) commented Nov 13, 2020
Depends on #659, #660.

In the wm import ia-known-pages script, we first list all the pages in web-monitoring-db, then search for them in Wayback and import every memento we can find. Because we import data from overlapping time periods each time we run the import script (this works around occasional outages in the Wayback Machine’s indexing), we wind up importing a lot of versions that we already have in web-monitoring-db! That’s not strictly a problem (web-monitoring-db will just ignore data it already has), but it is a waste of bandwidth and resources.

Instead, the import script should first load all the versions in the timeframe we're checking so that, as it gets CDX results, it can check them against the list and skip over versions that are already in the DB.

After loading the list of page URLs in import_ia_db_urls():

def import_ia_db_urls(*, from_date=None, to_date=None, maintainers=None,
                      tags=None, skip_unchanged='resolved-response',
                      url_pattern=None, worker_count=0,
                      unplaybackable_path=None, dry_run=False):
    client = db.Client.from_env()
    logger.info('Loading known pages from web-monitoring-db instance...')
    urls, version_filter = _get_db_page_url_info(client, url_pattern)

We should load all the versions in the timeframe (using the new features in #660) and add them to version_filter, e.g.:

if should_precheck:
    logger.info('Pre-loading known versions...')

    # Identify a memento by its CDX-style timestamp plus its original URL.
    def memento_key(time, url):
        return f'{time.strftime("%Y%m%d%H%M%S")}|{url}'

    versions = client.list_all_versions(start_date=from_date,
                                        end_date=to_date,
                                        sort='capture_time:asc',
                                        chunk_size=1000)
    known_mementos = {memento_key(v['capture_time'], v['capture_url'])
                      for v in versions}

    # Wrap the existing filter so CDX records whose mementos are already
    # in the DB get skipped before we bother importing them.
    _filter = version_filter
    def precheck_filter(cdx_record):
        if memento_key(cdx_record.timestamp, cdx_record.url) in known_mementos:
            return False
        return _filter(cdx_record)

    version_filter = precheck_filter

Because this can be a time- and memory-consuming process for large timeframes, we should probably have a CLI option to turn it on (or off; not sure yet which default makes more sense).
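As a rough sketch of what that option could look like, assuming an argparse-style entry point (the actual wm command may use a different argument parser, and both the --precheck-versions flag name and the should_precheck parameter on import_ia_db_urls are hypothetical):

import argparse

parser = argparse.ArgumentParser(prog='wm import ia-known-pages')
# Hypothetical flag name; defaulting to off preserves current behavior,
# since prechecking costs time and memory up front.
parser.add_argument('--precheck-versions', action='store_true',
                    help='Pre-load known versions from web-monitoring-db '
                         'and skip CDX records we already have.')
args = parser.parse_args()

# Assumes import_ia_db_urls (defined above) grows a should_precheck kwarg.
import_ia_db_urls(from_date=None, to_date=None,
                  should_precheck=args.precheck_versions)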

Mr0grog added a commit that referenced this issue Nov 14, 2020
This adds a new set of `get_*()` methods to replace `list_*()` in `db.Client`. For example, `get_pages()` replaces `list_pages()`.

These new methods return an iterator over *the entire result set* rather than a single, paginated chunk of results. They make for much nicer usage and should make #664 easier.

The old methods will eventually be removed; they’re not what you want 99% of the time, and when you do need them, you can still call `client.request_json()`. It’s slightly more verbose, but that’s OK for the uncommon use case.

Fixes #660.
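For illustration, here is roughly how the two styles compare. This is only a sketch; the exact parameter names and response shape for the old methods are assumptions, not the final API:

from web_monitoring import db

client = db.Client.from_env()

# Old style: list_pages() returns a single paginated chunk, so callers
# have to manage pagination themselves.
chunk = client.list_pages(chunk_size=100)
for page in chunk['data']:
    print(page['url'])

# New style: get_pages() returns an iterator over the entire result set
# and handles pagination internally.
for page in client.get_pages():
    print(page['url'])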
Mr0grog (Member, Author) commented Nov 30, 2020

This was solved in #667.

Mr0grog closed this as completed Nov 30, 2020