In the `wm import ia-known-pages` script, we first list all the pages in web-monitoring-db, then search for them in Wayback and import every memento we can find. Because we import data from overlapping time periods each time we run the import script (this works around occasional outages in the Wayback Machine’s indexing), we wind up importing a lot of versions that we already have in web-monitoring-db! That’s not strictly a problem (web-monitoring-db will just ignore data it already has), but it is a waste of bandwidth and resources.
Instead, the import script should first load all the versions in the timeframe we're checking so that, as it gets CDX results, it can check them against the list and skip over versions that are already in the DB.
After loading the list of page URLs in `import_ia_db_urls()` (web_monitoring/cli/cli.py, lines 555 to 561 at 06e3e51), we should load all the versions in the timeframe (using the new features in #660) and add them to `version_filter`, e.g.:
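A rough sketch of what this could look like (the `get_versions()` keyword arguments, the version record fields, and the shape of the filter below are assumptions for illustration, not the actual cli.py code):

```python
# Rough sketch only: argument and field names are illustrative, not verified
# against the real db.Client API or the current import code.
from web_monitoring import db

def load_known_version_keys(client, from_date, to_date):
    """Collect (url, capture_time) keys for versions already in web-monitoring-db."""
    known = set()
    # get_versions() (from #660) iterates the whole result set, so no manual paging.
    for version in client.get_versions(start_date=from_date, end_date=to_date):
        known.add((version['url'], version['capture_time']))
    return known

def skip_known_versions(cdx_records, known_keys):
    """Drop CDX records whose (url, timestamp) we already have a version for."""
    for record in cdx_records:
        if (record.url, record.timestamp) not in known_keys:
            yield record
```

Building the set once up front keeps the per-record check cheap, which matters when the CDX results for a large timeframe run into the millions.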
Because this can be a time- and memory-consuming process for large timeframes, we should probably have a CLI option to turn it on (or off, not sure which).
This adds a new set of `get_*()` methods to replace `list_*()` in `db.Client`. For example, `get_pages()` replaces `list_pages()`.
These new methods return an iterator over *the entire result set* rather than a single, paginated chunk of results. They make for much nicer usage and should make #664 easier.
The old methods will eventually be removed; they’re not what you want 99% of the time, and when you do need them, you can still call `client.request_json()`. It’s slightly more verbose, but that’s OK for the uncommon use case.
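Roughly how usage changes (a sketch; parameter names and the `Client.from_env()` constructor are assumptions here, not a guaranteed part of the API):

```python
# Illustrative only -- parameter names are examples, not the exact signatures.
from web_monitoring import db

client = db.Client.from_env()  # assumes credentials/URL come from the environment

# Before: list_pages() returned one paginated chunk, so callers managed paging:
# chunk = client.list_pages(chunk=1, chunk_size=100)
# ...then request the next chunk, and so on.

# After: get_pages() yields every matching page and handles pagination internally.
for page in client.get_pages():
    print(page['uuid'], page['url'])
```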
Fixes #660.
Depends on #659, #660.