
Add command to import known page URLs from IA #174

Merged: 19 commits into master on Sep 4, 2019

Conversation

Mr0grog
Member

@Mr0grog Mr0grog commented Mar 18, 2018

Fixes #86, based on #173.

This adds the import ia-known-pages command, which imports from the Internet Archive only the pages we already know about in the DB. Use it like so:

> ./scripts/wm import ia-known-pages --from 2018-03-17
Importing 192 URLs from 2 Domains:
  yosemite.epa.gov
  www.usgs.gov
importing: 0 versions [00:00, ? versions/s]Submitting Versions to web-monitoring-db...
importing: 7 versions [00:18,  1.85s/ versions]
Import jobs IDs: (66,)
Polling web-monitoring-db until import jobs are finished...

If you set LOG_LEVEL=INFO, it’ll tell you how many URLs IA had that we skipped because we didn’t know them.

If you set LOG_LEVEL=DEBUG, it’ll print every skipped URL.


# TODO: this should probably be a method on db.Client, but db.Client could also
# do well to transform the `links` into callables, e.g:
# more_pages = pages['links']['next']()
Member Author

@Mr0grog Mr0grog Mar 18, 2018

I was wary of actually doing this here because it:

  • Probably has a lot of room for discussion over different APIs/implementations
  • Ought to be applied across all list_* methods, which is not a small surface area
  • Probably entails big changes because we ought to break apart the URL composition/request from the response parsing bits.

Contributor

Agreed to all of the above.
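(Purely as an illustration of the callable-links idea discussed above, and not the actual db.Client API, wrapping pagination links in zero-argument callables might look roughly like this; the API URL is made up.)

import requests

# Hypothetical sketch: replace each pagination URL in `links` with a function
# that fetches and parses that page, so `pages['links']['next']()` works.
def linkify(session, response_json):
    links = response_json.get('links', {})
    response_json['links'] = {
        # Bind `url` as a default argument so each callable keeps its own URL.
        name: (lambda url=url: linkify(session, session.get(url).json()))
        for name, url in links.items() if url
    }
    return response_json

session = requests.Session()
pages = linkify(session, session.get('https://example-db.test/api/v0/pages').json())
more_pages = pages['links']['next']()  # only if a `next` link was present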

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this pull request Mar 18, 2018
Internet Archive stores all the redirects in the original request, which means they get replayed to us or anyone else who tries to retrieve an IA page. It turns out that sometimes that's a lot of redirects! Increase our limit to 10 for now (the default in HTTParty is 5).

This fixes some issues I hit while working on edgi-govdata-archiving/web-monitoring-processing#174
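(The change above is in the Ruby/HTTParty-based web-monitoring-db project. Just to illustrate the same knob in this repo's Python stack, and not as actual db code, configuring the redirect limit with requests looks like the following.)

import requests

session = requests.Session()
# Set the redirect limit explicitly (HTTParty defaults to 5, requests to 30);
# the point of the commit above is simply to allow IA's longer redirect chains.
session.max_redirects = 10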
@Mr0grog
Member Author

Mr0grog commented Mar 18, 2018

Also unsure if we should be using Tornado or Asyncio or something for this, since it’s probably a lot of HTTP traffic.

@Mr0grog
Member Author

Mr0grog commented Apr 15, 2018

Need to rebase and incorporate work from #179.

@Mr0grog force-pushed the 86-import-known-db-pages-from-ia branch from 5f0dab6 to b26e74d on April 16, 2018
@Mr0grog
Member Author

Mr0grog commented Apr 16, 2018

Rebased on master and added --skip-unchanged support that got merged in #179.

@Mr0grog
Member Author

Mr0grog commented Apr 16, 2018

Ugh, this got way more complicated, but I think it’s probably doing the job better now.

Contributor

@danielballan danielballan left a comment

Looks good.


@Mr0grog
Member Author

Mr0grog commented Sep 3, 2018

I think I need to re-work things here based on discussions I had with IA folks recently about revisit records and other kinds of un-playback-able records. We’ve made some pretty incorrect assumptions here about what we’re getting when following redirects and looking up CDX results.

  • Many mementos are not playback-able (for a variety of reasons). Requests to URLs corresponding to a non-playback-able memento are treated just like requests to non-existent mementos: you get redirected to the nearest-in-time playback-able one. You can tell whether you are getting a playback by checking for the presence of the Memento-Datetime header. It is only present in playback responses (including playbacks that are redirects); see the sketch after this list.

  • “Self-redirects,” or redirects to the same URL at a different protocol, are not playback-able.

  • Revisits may be playback-able. If there are very few captures of the URL, the revisit will play back. Otherwise, it will not. This threshold is arbitrary and potentially malleable, so we can’t meaningfully know when a revisit will be playback-able.

  • You can find the actual response data corresponding to a revisit by searching the whole domain (since the revisit may be a response from a totally different page) for the revisit’s hash, using the ?filter=digest:{hash goes here} query arg, but that query is very slow. This only gets a matching response body, though! If the response was a redirect with an empty body, this information is not going to help you because it doesn’t include headers — and you can’t tell from the CDX record what the revisit’s status code was (i.e. you can’t tell if it was a 3xx redirect).
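As a rough sketch of the checks described in the list above (this is not code from this PR, and the timestamp and digest below are made up):

import requests

# A playback response carries the Memento-Datetime header, even when the
# archived response is itself a redirect. A redirect to some other
# nearest-in-time capture does not carry it.
memento = requests.get(
    'http://web.archive.org/web/20180317000000id_/https://www.usgs.gov/',
    allow_redirects=False)
is_playback = 'Memento-Datetime' in memento.headers

# Hunting for the original response behind a revisit by digest. This scans the
# whole domain, can be very slow, and only finds a matching body; it cannot
# recover headers or tell you whether the revisit was a redirect.
cdx = requests.get('http://web.archive.org/cdx/search/cdx', params={
    'url': 'usgs.gov',
    'matchType': 'domain',
    'filter': 'digest:EXAMPLEDIGESTVALUE',
    'limit': 10,
})
print(cdx.text)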

The first major thing to note here is that this means not every CDX result can be transformed into an importable snapshot/version (so we need to be more careful with how we handle responses and redirects when loading a memento). We’re also going to encounter lots of revisits in our pages because they scrape so often, and knowing how to best handle those feels like it needs some thought.

@Mr0grog
Member Author

Mr0grog commented Sep 6, 2018

Rebased on master.

Also worth noting, re: the last comment about IA import issues, we talked about it in today’s meeting and I don’t think there are any clever solutions that will work well (mainly because of redirects). So the best we can do here is to fail saving a memento in timestamped_uri_to_version if the memento isn’t playback-able.

@Mr0grog
Member Author

Mr0grog commented Sep 11, 2018

Once we are happy with #255, I’ll rebase this on it. For now, the rebased and improved version is on 86-import-known-db-pages-from-ia-2.

@Mr0grog force-pushed the 86-import-known-db-pages-from-ia branch from a76c6ad to 079eeef on September 14, 2018
@Mr0grog
Member Author

Mr0grog commented Sep 14, 2018

Rebased this insane mess, now that #255 is merged.

Contributor

@danielballan danielballan left a comment

Read through again, looks fine. Made some picky suggestions about logging for your consideration.

Is there anything else that needs to be done before we can start trying this thing out in production?

            skipped += 1
            logger.debug('Skipping URL "%s"', version.url)
    except ValueError as error:
        logger.warn(error)
Contributor

Might as well use logger.exception for this.

except ValueError:
    logger.exception('Error while applying version_filter to URL %r', url)

The Traceback will be automatically captured by logger.exception and shown below the log message.

Member Author

@Mr0grog Mr0grog Sep 14, 2018

Oh, this is really here to handle the error if there were no versions to list:

if not last_hashes:
    raise ValueError("Internet archive does not have archived "
                     "versions of {}".format(url))

…which, well, probably shouldn’t actually be a ValueError at all, since it’s not really a problem with your input. Actually, I’m not really sure why we have an exception for this in the first place. Should we get rid of that?

Member Author

(Anyway, implicit point there is that this isn’t really an exception or unexpected scenario, so I don’t think logger.exception makes sense for it. But the code and the error classes involved really don’t make that clear.)

Contributor

Gotcha.

                view_url=version.view_url)
        except ia.MementoPlaybackError as error:
            wayback_errors['playback'].append(error)
            logger.info(f' {error}')
Contributor

In contrast to my comment about logger.exception, these messages make sense as INFO messages to me because they are tracking expected errors coming from a service that issues them frequently.

@Mr0grog
Member Author

Mr0grog commented Sep 14, 2018

Ahhhh, sorry, wasn’t clear; I wasn’t really feeling like this was super mergeable at this point. Was planning to try and do the async bit and clean all this up (e.g. the flow between things has gotten super wonky; this needs to at least be extracted out of cli.py and re-organized a bit, maybe into a class or classes).

@danielballan
Contributor

Gotcha. This seemed either ready or not at all ready, depending on scope.

@Mr0grog
Member Author

Mr0grog commented Sep 14, 2018

Yeah, I guess we could merge it in an ugly state. My main concern is the async bit, because a full run of this across all 90-ish domain names we are monitoring can take a while. I think we need parallel downloading from IA to make this feasible.

@danielballan
Contributor

I agree and I don't think we need to rush it. Was just trying to figure out if your "Rebased" comment implied "So can we merge this please?" or not. :-D Let's wait for async.

@Mr0grog force-pushed the 86-import-known-db-pages-from-ia branch from f538658 to 4edb330 on September 17, 2018
@Mr0grog
Member Author

Mr0grog commented Aug 30, 2019

@danielballan I might come back and try and tidy things up slightly over the next few days, but I don’t think I’m going to make any more significant architectural changes at this point, so it’s probably ready for a high-level review.

You might be able to see the places where I got stuck trying to architect a super fancy data pipeline system. I tried a few different approaches and think I came up with some neat stuff, but realized in the end that it was a lot of abstraction for something that would only be used once or twice in the codebase. It didn’t save enough work to be worthwhile.

Anyway, the big pieces here are probably:

  • The main work is in cli.import_ia_urls. It spawns threads for searching CDX, loading mementos, and sending import requests to web-monitoring-db. They all get connected with FiniteQueue instances (see below):

    def import_ia_urls(urls, *, from_date=None, to_date=None,
                       maintainers=None, tags=None,
                       skip_unchanged='resolved-response',
                       version_filter=None, worker_count=0,
                       create_pages=True, unplaybackable_path=None,
                       dry_run=False):
        skip_responses = skip_unchanged == 'response'
        worker_count = worker_count if worker_count > 0 else PARALLEL_REQUESTS
        unplaybackable = load_unplaybackable_mementos(unplaybackable_path)

        with utils.QuitSignal((signal.SIGINT, signal.SIGTERM)) as stop_event:
            cdx_records = utils.FiniteQueue()
            cdx_thread = threading.Thread(target=lambda: utils.iterate_into_queue(
                cdx_records,
                _list_ia_versions_for_urls(
                    urls,
                    from_date,
                    to_date,
                    skip_responses,
                    version_filter,
                    # Use a custom session to make sure CDX calls are extra robust.
                    client=ia.WaybackClient(ia.WaybackSession(retries=10, backoff=4)),
                    stop=stop_event)))
            cdx_thread.start()

            summary = {}
            versions_queue = utils.FiniteQueue()
            memento_thread = threading.Thread(target=lambda: WaybackRecordsWorker.parallel_with_retries(
                worker_count,
                summary,
                cdx_records,
                versions_queue,
                maintainers,
                tags,
                stop_event,
                unplaybackable,
                tries=(None,
                       dict(retries=3, backoff=4, timeout=(30.5, 2)),
                       dict(retries=7, backoff=4, timeout=60.5))))
            memento_thread.start()

            uploadable_versions = versions_queue
            if skip_unchanged == 'resolved-response':
                uploadable_versions = _filter_unchanged_versions(versions_queue)

            if dry_run:
                uploader = threading.Thread(target=lambda: _log_adds(uploadable_versions))
            else:
                uploader = threading.Thread(target=lambda: _add_and_monitor(uploadable_versions, create_pages, stop_event))
            uploader.start()

            cdx_thread.join()
            memento_thread.join()

            print('\nLoaded {total} CDX records:\n'
                  ' {success:6} successes ({success_pct:.2f}%),\n'
                  ' {playback:6} could not be played back ({playback_pct:.2f}%),\n'
                  ' {missing:6} had no actual memento ({missing_pct:.2f}%),\n'
                  ' {unknown:6} unknown errors ({unknown_pct:.2f}%).'.format(**summary))

            uploader.join()

            if not dry_run:
                print('Saving list of non-playbackable URLs...')
                save_unplaybackable_mementos(unplaybackable_path, unplaybackable)

  • cli.WaybackRecordsWorker is a kind of crazy megaclass. We spawn a bunch of them to read CDX records off a queue. A class method handles running N of them in parallel and another class method handles re-queueing them for retries.

    class WaybackRecordsWorker(threading.Thread):
        """
        WaybackRecordsWorker is a thread that takes CDX records from a queue and
        loads the corresponding mementos from Wayback. It then transforms the
        mementos into Web Monitoring import records and emits them on another
        queue. If a `failure_queue` is provided, records that fail to load in a way
        that might be worth retrying are emitted on that queue.
        """

        def __init__(self, records, results_queue, maintainers, tags, cancel,
                     failure_queue=None, session_options=None,
                     unplaybackable=None):
            super().__init__()
            self.summary = self.create_summary()
            self.results_queue = results_queue
            self.failure_queue = failure_queue
            self.cancel = cancel
            self.records = records
            self.maintainers = maintainers
            self.tags = tags
            self.unplaybackable = unplaybackable
            session_options = session_options or dict(retries=3, backoff=2,
                                                      timeout=(30.5, 2))
            session = ia.WaybackSession(**session_options)
            self.wayback = ia.WaybackClient(session=session)

        def is_active(self):
            return not self.cancel.is_set()

        def run(self):
            """
            Work through the queue of CDX records to load them from Wayback,
            transform them to Web Monitoring DB import entries, and queue them for
            importing.
            """
            while self.is_active():
                try:
                    record = next(self.records)
                    self.summary['total'] += 1
                except StopIteration:
                    break
                self.handle_record(record, retry_connection_failures=True)
            self.wayback.close()
            return self.summary

        def handle_record(self, record, retry_connection_failures=False):
            """
            Handle a single CDX record.
            """
            # Check for whether we already know this can't be played and bail out.
            if self.unplaybackable is not None and record.raw_url in self.unplaybackable:
                self.summary['playback'] += 1
                return

            try:
                version = self.process_record(record, retry_connection_failures=True)
                self.results_queue.put(version)
                self.summary['success'] += 1
            except ia.MementoPlaybackError as error:
                self.summary['playback'] += 1
                if self.unplaybackable is not None:
                    self.unplaybackable[record.raw_url] = datetime.utcnow()
                logger.info(f' {error}')
            except requests.exceptions.HTTPError as error:
                if error.response.status_code == 404:
                    logger.info(f' Missing memento: {record.raw_url}')
                    self.summary['missing'] += 1
                else:
                    # TODO: consider not logging this at a lower level, like debug
                    # unless failure_queue does not exist. Unsure how big a deal
                    # this error is to log if we are retrying.
                    logger.info(f' (HTTPError) {error}')
                    if self.failure_queue:
                        self.failure_queue.put(record)
                    else:
                        self.summary['unknown'] += 1
            except ia.WaybackRetryError as error:
                logger.info(f' {error}; URL: {record.raw_url}')
                if self.failure_queue:
                    self.failure_queue.put(record)
                else:
                    self.summary['unknown'] += 1
            except Exception as error:
                # FIXME: getting read timed out connection errors here...
                # requests.exceptions.ConnectionError: HTTPConnectionPool(host='web.archive.org', port=80): Read timed out.
                # TODO: don't count or log (well, maybe DEBUG log) if failure_queue
                # is present and we are ultimately going to retry.
                logger.exception(f' ({type(error)}) {error}; URL: {record.raw_url}')
                if self.failure_queue:
                    self.failure_queue.put(record)
                else:
                    self.summary['unknown'] += 1

        def process_record(self, record, retry_connection_failures=False):
            """
            Load the actual Wayback memento for a CDX record and transform it to
            a Web Monitoring DB import record.
            """
            try:
                return self.wayback.timestamped_uri_to_version(record.date,
                                                               record.raw_url,
                                                               url=record.url,
                                                               maintainers=self.maintainers,
                                                               tags=self.tags,
                                                               view_url=record.view_url)
            except Exception as error:
                # On connection failures, reset the session and try again. If we
                # don't do this, the connection pool for this thread is pretty much
                # dead. It's not clear to me whether there is a problem in urllib3
                # or Wayback's servers that requires this.
                # This unfortunately requires string checking because the error can
                # get wrapped up into multiple kinds of higher-level errors :(
                if retry_connection_failures and ('failed to establish a new connection' in str(error).lower()):
                    self.wayback.session.reset()
                    return self.process_record(record)
                # Otherwise, re-raise the error.
                raise error

        @classmethod
        def create_summary(cls):
            """
            Create a dictionary that summarizes the results of processing all the
            CDX records on a queue.
            """
            return {'total': 0, 'success': 0, 'playback': 0, 'missing': 0,
                    'unknown': 0}

        @classmethod
        def summarize(cls, workers, initial=None):
            """
            Combine the summaries from multiple `WaybackRecordsWorker` instances
            into a single summary.
            """
            return cls.merge_summaries((w.summary for w in workers), initial)

        @classmethod
        def merge_summaries(cls, summaries, initial=None):
            merged = initial or cls.create_summary()
            for summary in summaries:
                for key in merged.keys():
                    if key in summary:
                        merged[key] += summary[key]

            # Add percentage calculations
            if merged['total']:
                merged.update({f'{k}_pct': 100 * v / merged['total']
                               for k, v in merged.items()
                               if k != 'total' and not k.endswith('_pct')})
            else:
                merged.update({f'{k}_pct': 0.0
                               for k, v in merged.items()
                               if k != 'total' and not k.endswith('_pct')})

            return merged

        @classmethod
        def parallel(cls, count, *args, **kwargs):
            """
            Run several `WaybackRecordsWorker` instances in parallel. When this
            returns, the workers will have finished running.

            Parameters
            ----------
            count: int
                Number of instances to run in parallel.
            *args
                Arguments to pass to each instance.
            **kwargs
                Keyword arguments to pass to each instance.

            Returns
            -------
            list of WaybackRecordsWorker
            """
            workers = []
            for i in range(count):
                worker = cls(*args, **kwargs)
                workers.append(worker)
                worker.start()

            for worker in workers:
                worker.join()

            return workers

        @classmethod
        def parallel_with_retries(cls, count, summary, records, results_queue, *args, tries=None, **kwargs):
            """
            Run several `WaybackRecordsWorker` instances in parallel and retry
            records that fail to load.

            Parameters
            ----------
            count: int
                Number of instances to run in parallel.
            summary: dict
                Dictionary to populate with summary data from all worker runs.
            records: web_monitoring.utils.FiniteQueue
                Queue of CDX records to load mementos for.
            results_queue: web_monitoring.utils.FiniteQueue
                Queue to place resulting import records onto.
            *args
                Arguments to pass to each instance.
            **kwargs
                Keyword arguments to pass to each instance.

            Returns
            -------
            list of WaybackRecordsWorker
            """
            if tries is None or len(tries) == 0:
                tries = (None,)

            # Initialize the summary (we have to keep a reference so other threads can read)
            summary.update(cls.create_summary())

            total_tries = len(tries)
            retry_queue = None
            workers = []
            for index, try_setting in enumerate(tries):
                if retry_queue and not retry_queue.empty():
                    print(f'\nRetrying about {retry_queue.qsize()} failed records...', flush=True)
                    retry_queue.end()
                    records = retry_queue

                if index == total_tries - 1:
                    retry_queue = None
                else:
                    retry_queue = utils.FiniteQueue()

                workers.extend(cls.parallel(count, records, results_queue, *args, **kwargs))

            summary.update(cls.summarize(workers, summary))
            results_queue.end()

    In the super fancy stuff I came up with, I wanted to extract the logic for all the different kinds of errors, but again, it turned into a lot of abstraction for little gain.

  • utils.FiniteQueue is my super-simple take on making a queue that has a defined end and is iterable. It's useful for linking all the threads above together since they can just read to the end. There's something similar on PyPI, but for multiprocessing instead of threading, so it didn't seem like a fit here.

    class FiniteQueue(queue.SimpleQueue):
        """
        A queue that is iterable, with a defined end.

        The end of the queue is indicated by the `FiniteQueue.QUEUE_END` object.
        If you are using the iterator interface, you won't ever encounter it, but
        if reading the queue with `queue.get`, you will receive
        `FiniteQueue.QUEUE_END` if you’ve reached the end.
        """

        # Use a class instead of `object()` for more readable names for debugging.
        class QUEUE_END:
            ...

        def __init__(self):
            super().__init__()
            self._ended = False
            # The Queue documentation suggests that put/get calls can be
            # re-entrant, so we need to use RLock here.
            self._lock = threading.RLock()

        def end(self):
            self.put(self.QUEUE_END)

        def get(self, *args, **kwargs):
            with self._lock:
                if self._ended:
                    return self.QUEUE_END
                else:
                    value = super().get(*args, **kwargs)
                    if value is self.QUEUE_END:
                        self._ended = True
                    return value

        def __iter__(self):
            return self

        def __next__(self):
            item = self.get()
            if item is self.QUEUE_END:
                raise StopIteration
            return item
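    As a hypothetical usage sketch (not from the PR): a producer thread fills the queue and calls end(); consumers just iterate until the sentinel shows up.

    import threading

    results = FiniteQueue()

    def produce():
        for item in ('a', 'b', 'c'):
            results.put(item)
        results.end()  # tell consumers that nothing else is coming

    threading.Thread(target=produce).start()
    for item in results:  # iteration stops automatically at QUEUE_END
        print(item)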

  • Some other useful classes in utils for handling signals (similar to what you linked before) and for reference counting.

  • internetarchive.WaybackClient now depends on a new WaybackSession object that handles all the networking/connection/retry/reset logic. It turns out retrying properly needs some care, and while we'd like to keep a session open so we can re-use connections, Wayback frequently craps out on a connection (still a mystery to me), so having WaybackSession.reset() is a huge requirement. That single feature made the biggest difference in our error rate by far.

  • CDX searches now have some maddening correctness checking, since it turns out the CDX index is chock full of nonsense entries like http://<<mailto:[email protected]>>/ and even data: URLs. 😳
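    The PR's actual validation may differ; as a hedged sketch, the kind of check this implies is just a sanity filter on CDX result URLs before trying to load mementos for them:

    from urllib.parse import urlparse

    def looks_like_loadable_url(url):
        # Reject CDX entries whose URLs aren't plausible http(s) URLs.
        try:
            parsed = urlparse(url)
        except ValueError:
            return False
        if parsed.scheme not in ('http', 'https'):
            return False  # filters out data:, mailto:, and other junk schemes
        # ...and angle-bracket nonsense like the mailto example above.
        return bool(parsed.netloc) and '<' not in url and '>' not in url

    assert not looks_like_loadable_url('data:text/html,hi')
    assert looks_like_loadable_url('https://www.usgs.gov/')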

@Mr0grog
Member Author

Mr0grog commented Aug 30, 2019

(Also, as a bonus, WaybackSession now does a lot of retrying by default, and so should be way less failure-prone when building docs.)

Contributor

@danielballan danielballan left a comment

Overall I found this reasonably easy to follow, considering what it does. I like the FiniteQueue and the way that WaybackClient and WaybackSession work together. Yes, I was struck by WaybackRecordsWorker but I can see that breaking that up could be a lot of effort for little reusable abstraction and arguably not any better readability.

I think the overall structure is good to run with. I left some small comments related to error handling, only one or two of them important.


with utils.QuitSignal((signal.SIGINT, signal.SIGTERM)) as stop_event:
Contributor

This is clever.
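(For context outside the diff: a guess at the general shape of such a helper, not the PR's exact implementation, is a context manager that sets a threading.Event when one of the given signals arrives.)

import signal
import threading

class QuitSignal:
    # Sets an Event on SIGINT/SIGTERM so worker threads can shut down cleanly.
    def __init__(self, signals=(signal.SIGINT, signal.SIGTERM)):
        self.signals = signals
        self.event = threading.Event()
        self._previous = {}

    def __enter__(self):
        for sig in self.signals:
            self._previous[sig] = signal.signal(sig, lambda *_: self.event.set())
        return self.event

    def __exit__(self, *exc_info):
        # Put back whatever handlers were installed before.
        for sig, handler in self._previous.items():
            signal.signal(sig, handler)
        return False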

            skipped += 1
            logger.debug('Skipping URL "%s"', version.url)
    except ia.BlockedByRobotsError as error:
        logger.warn(str(error))
Contributor

Include the exception type.

Suggested change:
- logger.warn(str(error))
+ logger.warn(repr(error))

Member Author

In this case, I didn’t do that because this exception always has nice descriptive text:

except Exception:
    if 'RobotAccessControlException' in text:
        raise BlockedByRobotsError(f'CDX search for URL was blocked by robots.txt "{query["url"]}" (parameters: {final_query})')

Do you think we should still use repr here? Should I cut down the error message if so?

Member Author

@Mr0grog Mr0grog Sep 2, 2019

Side note: I guess I should move most of that text to the actual error class instead of here where we raise it.

Contributor

OK, doesn't seem like including the exception type in the warning would add much here then.

Member Author

Argh, went to move the text and remembered why I didn’t originally do that: we could probably end up encountering a robots.txt issue in other APIs besides CDX as well (although I’ve only seen it and know how it’s formatted in CDX).

Any thoughts on how to better design this for that? Should we let the constructor take some extra info like:

raise BlockedByRobotsError(query['url'], f'In CDX search {final_query}')

Should we leave it to where the error is handled?

try:
    raise ia.BlockedByRobotsError(query['url'])
except ia.BlockedByRobotsError as error:
    logger.warn(f'CDX search error: {error!r}')

Or something else? Leave it as-is?

Contributor

I see. I like the snippet above because it’s the job of the caller to explain what it was doing that caused the error. I don’t think the exception’s constructor needs to be extended to carry arbitrary contextual information.

Member Author

OK, I took a crack at that in 91e48ce.

@Mr0grog force-pushed the 86-import-known-db-pages-from-ia branch 2 times, most recently from b3c8cd0 to 58ea3d8, on September 2, 2019
We previously set special log levels for urllib3 because it was too noisy (it logged every retry as a warning, but Wayback's systems fail a lot and we *expect* having to retry often). However, we later stopped using urllib3's retry functionality because it couldn't distinguish between Wayback failures (which we want to retry) and mementos of failures (which are OK). So the custom logging setup is no longer necessary at all!

This also adds a few more comments to the retry code, and captures the above situation in the same place where we describe the problems with urllib3's built-in retry functionality that will hopefully get fixed in the future.
@Mr0grog
Member Author

Mr0grog commented Sep 2, 2019

@danielballan I think I addressed all the issues you brought up. Want to give it another once-over?

Way back at the end of last year, we added `status` as a first-class field on versions (edgi-govdata-archiving/web-monitoring-db#453), but never switched to actually sending it properly in our import data! This moves it from `source_metadata` (where it would be redundant) to the new first-class field.
The web-monitoring-db project is removing totals from list queries in order to improve performance (counting the total possible results is super expensive). You can still get the totals by using the `?include_total` query parameter, so this just makes sure to set that in the one query where we need that information (determining the size of the population to run a random sample over).

See the relevant change in the API: edgi-govdata-archiving/web-monitoring-db#596
@Mr0grog Mr0grog merged commit b362860 into master Sep 4, 2019
@lightandluck
Collaborator

Huzzah! 🎉

Mr0grog added a commit that referenced this pull request Sep 5, 2019
We made a bunch of improvements to our Wayback tools in #174, but we didn’t update the docs as well as we should have in that PR. This attempts to remedy that:

- Use `WaybackClient.get_memento()` in Wayback tutorial.
- Use the term “memento” instead of “snapshot” throughout the tutorial.
- Add a diagram and docstring to the top of the `cli` module.
- Ensure we generate API docs for `WaybackClient.get_memento()` and `WaybackSession`
Mr0grog added a commit that referenced this pull request Sep 6, 2019
In #174, I made a last minute fix to add `status` as a top level field in our version imports (it was already supported by the API, but we had failed to update our code to send it). I must have missed committing a file though, because it's failing!
Mr0grog added a commit that referenced this pull request Sep 10, 2019
In #174, I made a last minute fix to add `status` as a top level field in our version imports (it was already supported by the API, but we had failed to update our code to send it). I must have missed committing a file though, because it's failing!