Add command to import known page URLs from IA #174
Conversation
# TODO: this should probably be a method on db.Client, but db.Client could also
# do well to transform the `links` into callables, e.g:
# more_pages = pages['links']['next']()
I was wary of actually doing this here because it:
- Probably has a lot of room for discussion over different APIs/implementations
- Ought to be applied across all `list_*` methods, which is not a small surface area
- Probably entails big changes, because we ought to break apart the URL composition/request from the response-parsing bits.
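To illustrate the idea in the TODO, here's a minimal sketch of turning a page's `links` URLs into zero-argument callables. `fetch_json`, `bind_links`, and the response shape are illustrative assumptions, not the real `db.Client` API:

```python
# Hypothetical sketch: wrap each URL in a page's `links` dict in a callable
# that fetches and re-wraps the next page, so callers can write
# `more_pages = pages['links']['next']()`. `fetch_json` is an assumed
# stand-in for whatever db.Client uses to GET and parse a URL.
def bind_links(page, fetch_json):
    bound = dict(page)
    bound['links'] = {
        rel: (lambda url=url: bind_links(fetch_json(url), fetch_json))
        for rel, url in page.get('links', {}).items()
        if url is not None
    }
    return bound
```

With something like this in place, each `list_*` method would only need to pass its parsed response through `bind_links` before returning it.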
Agreed to all of the above.
Internet Archive stores all the redirects in the original request, which means they get replayed to us or anyone else who tries to retrieve an IA page. It turns out that sometimes that's a lot of redirects! Increase our limit to 10 for now (the default in HTTParty is 5). This fixes some issues I hit while working on edgi-govdata-archiving/web-monitoring-processing#174
Also unsure if we should be using Tornado or Asyncio or something for this, since it’s probably a lot of HTTP traffic.
(force-pushed from 1db5cfd to 5f0dab6)
Need to rebase and incorporate work from #179.
(force-pushed from 5f0dab6 to b26e74d)
Rebased on master and added
Ugh, this got way more complicated, but I think it’s probably doing the job better now.
Looks good.
I think I need to re-work things here based on discussions I had with IA folks recently about revisit records and other kinds of un-playback-able records. We’ve made some pretty incorrect assumptions here about what we’re getting when following redirects and looking up CDX results.
The first major thing to note here is that not every CDX result can be transformed into an importable snapshot/version (so we need to be more careful with how we handle responses and redirects when loading a memento). We’re also going to encounter lots of revisits in our pages because they scrape so often, and knowing how to best handle those feels like it needs some thought.
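As a rough sketch of the “be more careful with CDX results” point: revisit records in CDX output carry the mimetype `warc/revisit` and have no playable body of their own, so a first-pass filter might look like this (the field name follows the CDX JSON output; the function name is made up):

```python
def importable_records(cdx_records):
    """Yield only CDX records that could become importable versions.

    Revisit records (mimetype 'warc/revisit') point back at an earlier,
    byte-identical capture instead of containing a response body, so they
    can't be turned directly into a snapshot/version.
    """
    for record in cdx_records:
        if record.get('mimetype') == 'warc/revisit':
            continue  # would need to be resolved to the original capture
        yield record
```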
(force-pushed from 6951d82 to a76c6ad)
Rebased on master. Also worth noting, re: the last comment about IA import issues, we talked about it in today’s meeting and I don’t think there are any clever solutions that will work well (mainly because of redirects). So the best we can do here is to fail saving a memento in
Once we are happy with #255, I’ll rebase this on it. For now, the rebased and improved version is on
(force-pushed from a76c6ad to 079eeef)
Rebased this insane mess, now that #255 is merged.
Read through again, looks fine. Made some picky suggestions about logging for your consideration.
Is there anything else that needs to be done before we can start trying this thing out in production?
web_monitoring/cli.py (outdated)
skipped += 1
logger.debug('Skipping URL "%s"', version.url)
except ValueError as error:
    logger.warn(error)
Might as well use `logger.exception` for this:
except ValueError:
    logger.exception('Error while applying version_filter to URL %r', url)
The traceback will be automatically captured by `logger.exception` and shown below the log message.
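For reference, a runnable sketch of the suggested pattern (the surrounding function and its name are hypothetical, not from the PR):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def safe_filter(version_filter, url):
    """Apply a filter to a URL, logging (with traceback) instead of raising."""
    try:
        return version_filter(url)
    except ValueError:
        # logger.exception logs at ERROR level and automatically appends
        # the active traceback to the message.
        logger.exception('Error while applying version_filter to URL %r', url)
        return None
```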
Oh, this is really here to handle the error if there were no versions to list:
web-monitoring-processing/web_monitoring/internetarchive.py, lines 377-379 (at 4bb08f8):
if not last_hashes:
    raise ValueError("Internet archive does not have archived "
                     "versions of {}".format(url))
…which, well, probably shouldn’t actually be a `ValueError` at all, since it’s not really a problem with your input. Actually, I’m not really sure why we have an exception for this in the first place. Should we get rid of that?
(Anyway, the implicit point there is that this isn’t really an exception or unexpected scenario, so I don’t think `logger.exception` makes sense for it. But the code and the error classes involved really don’t make that clear.)
Gotcha.
web_monitoring/cli.py (outdated)
view_url=version.view_url)
except ia.MementoPlaybackError as error:
    wayback_errors['playback'].append(error)
    logger.info(f'  {error}')
In contrast to my comment about `logger.exception`, these messages make sense as INFO messages to me because they are tracking expected errors coming from a service that issues them frequently.
Ahhhh, sorry, I wasn’t clear; I wasn’t really feeling like this was super mergeable at this point. I was planning to try and do the async bit and clean all this up (e.g. the flow between things has gotten super wonky; this needs to at least be extracted out of
Gotcha. This seemed either ready or not at all ready, depending on scope.
Yeah, I guess we could merge it in an ugly state. My main concern is the async bit, because a full run of this across all 90-ish domain names we are monitoring can take a while. I think we need parallel downloading from IA to make this feasible.
I agree and I don't think we need to rush it. Was just trying to figure out if your "Rebased" comment implied "So can we merge this please?" or not. :-D Let's wait for async.
(force-pushed from f538658 to 4edb330)
@danielballan I might come back and try and tidy things up slightly over the next few days, but I don’t think I’m going to make any more significant architectural changes at this point, so it’s probably ready for a high-level review. You might be able to see the places where I got stuck trying to architect a super fancy data pipeline system. I tried a few different approaches and think I came up with some neat stuff, but realized in the end that it was a lot of abstraction for something that would only be used once or twice in the codebase. It didn’t save enough work to be worthwhile. Anyway, the big pieces here are probably:
(Also, as a bonus,
Overall I found this reasonably easy to follow, considering what it does. I like the `FiniteQueue` and the way that `WaybackClient` and `WaybackSession` work together. Yes, I was struck by `WaybackRecordsWorker`, but I can see that breaking that up could be a lot of effort for little reusable abstraction and arguably not any better readability.
I think the overall structure is good to run with. I left some small comments related to error handling, only one or two of them important.
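The actual `FiniteQueue` isn’t shown in this thread; one minimal way to build the concept (a thread-safe queue that consumers can iterate until a producer marks it finished) might be:

```python
import queue

class FiniteQueue(queue.Queue):
    """A Queue that is also an iterator: consumers can write
    `for item in q: ...`, and the loop ends once a producer calls
    `q.end()`. Minimal sketch; the real FiniteQueue may differ.
    """
    _END = object()  # sentinel marking "no more items"

    def end(self):
        self.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self.get()
        if item is self._END:
            # Re-signal the end so any other consumers also stop.
            self.put(self._END)
            raise StopIteration
        return item
```

Producers would put records from worker threads and call `end()` when done, letting the consumer side stay a plain `for` loop.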
with utils.QuitSignal((signal.SIGINT, signal.SIGTERM)) as stop_event:
This is clever.
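The `utils.QuitSignal` implementation isn’t quoted here, but the trick presumably amounts to installing signal handlers that set a `threading.Event` worker loops can poll. A minimal sketch (the real implementation may differ):

```python
import signal
import threading

class QuitSignal:
    """Context manager yielding a threading.Event that is set when any of
    the given signals (e.g. SIGINT, SIGTERM) arrives, so worker loops can
    check `stop_event.is_set()` and shut down cleanly.
    """
    def __init__(self, signals):
        self.signals = signals
        self.event = threading.Event()
        self._previous = {}

    def _handler(self, signum, frame):
        self.event.set()

    def __enter__(self):
        # Install our handler, remembering the previous ones for restore.
        for sig in self.signals:
            self._previous[sig] = signal.signal(sig, self._handler)
        return self.event

    def __exit__(self, *exc):
        for sig, prev in self._previous.items():
            signal.signal(sig, prev)
        return False
```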
web_monitoring/cli.py (outdated)
skipped += 1
logger.debug('Skipping URL "%s"', version.url)
except ia.BlockedByRobotsError as error:
    logger.warn(str(error))
Include the exception type:
- logger.warn(str(error))
+ logger.warn(repr(error))
In this case, I didn’t do that because this exception always has nice descriptive text:
web-monitoring-processing/web_monitoring/internetarchive.py, lines 501-503 (at 98daf63):
except Exception:
    if 'RobotAccessControlException' in text:
        raise BlockedByRobotsError(f'CDX search for URL was blocked by robots.txt "{query["url"]}" (parameters: {final_query})')
Do you think we should still use `repr` here? Should I cut down the error message if so?
Side note: I guess I should move most of that text to the actual error class instead of here where we raise it.
OK, doesn't seem like including the exception type in the warning would add much here then.
Argh, went to move the text and remembered why I didn’t originally do that: we could probably end up encountering a robots.txt issue in other APIs besides CDX as well (although I’ve only seen it, and know how it’s formatted, in CDX).
Any thoughts on how to better design this for that? Should we let the constructor take some extra info like:
raise BlockedByRobotsError(query['url'], f'In CDX search {final_query}')
Should we leave it to where the error is handled?
try:
    raise ia.BlockedByRobotsError(query['url'])
except ia.BlockedByRobotsError as error:
    logger.warn(f'CDX search error: {error!r}')
Or something else? Leave it as-is?
I see. I like the snippet above because it’s the job of the caller to explain what it was doing that caused the error. I don’t think the exception’s constructor needs to be extended to carry arbitrary contextual information.
OK, I took a crack at that in 91e48ce.
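A minimal sketch of the pattern agreed on above (not necessarily what 91e48ce actually does): the error carries just the URL plus a sensible default message, and callers add context when logging:

```python
class BlockedByRobotsError(Exception):
    """Raised when Wayback blocks access to a URL because of robots.txt."""
    def __init__(self, url):
        self.url = url
        # Default message lives on the error class, not at the raise site.
        super().__init__(f'{url} is blocked by robots.txt')
```

At the call site, context stays with the caller, e.g. `logger.warn(f'CDX search error: {error!r}')`.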
(force-pushed from b3c8cd0 to 58ea3d8)
We previously set special log levels for urllib3 because it was too noisy (it logged every retry as a warning, but Wayback's systems fail a lot and we *expect* having to retry often). However, we later stopped using urllib3's retry functionality because it couldn't distinguish between Wayback failures (which we want to retry) and mementos of failures (which are OK). So the custom logging setup is no longer necessary at all! This also adds a few more comments to the retry code, and captures the above situation in the same place where we describe the problems with urllib3's built-in retry functionality that will hopefully get fixed in the future.
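The distinction the commit message describes can be sketched like this (all names are illustrative; the real WaybackSession retry code is more involved):

```python
import time

# Statuses worth retrying when they come from Wayback *itself*. An archived
# 503 (a memento of a failed response) is legitimate data and must not be
# retried -- this is the case urllib3's Retry can't distinguish.
RETRYABLE_STATUSES = {502, 503, 504}

def request_with_retries(send, max_tries=3, backoff=0.0):
    """`send` performs one request and returns (status, is_memento)."""
    for attempt in range(max_tries):
        status, is_memento = send()
        if status in RETRYABLE_STATUSES and not is_memento:
            time.sleep(backoff * attempt)
            continue  # Wayback itself failed; try again
        return status, is_memento  # success, or an archived error page
    return status, is_memento
```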
@danielballan I think I addressed all the issues you brought up. Want to give it another once-over?
Way back at the end of last year, we added `status` as a first-class field on versions (edgi-govdata-archiving/web-monitoring-db#453), but never switched to actually sending it properly in our import data! This moves it from `source_metadata` (where it would be redundant) to the new first-class field.
The web-monitoring-db project is removing totals from list queries in order to improve performance (counting the total possible results is super expensive). You can still get the totals by using the `?include_total` query parameter, so this just makes sure to set that in the one query where we need that information (determining the size of the population to run a random sample over). See the relevant change in the API: edgi-govdata-archiving/web-monitoring-db#596
Huzzah! 🎉
We made a bunch of improvements to our Wayback tools in #174, but we didn’t update the docs as well as we should have in that PR. This attempts to remedy that: - Use `WaybackClient.get_memento()` in Wayback tutorial. - Use the term “memento” instead of “snapshot” throughout the tutorial. - Add a diagram and docstring to the top of the `cli` module. - Ensure we generate API docs for `WaybackClient.get_memento()` and `WaybackSession`
In #174, I made a last minute fix to add `status` as a top level field in our version imports (it was already supported by the API, but we had failed to update our code to send it). I must have missed committing a file though, because it's failing!
Fixes #86, based on #173.
This adds the `import ia-known-pages` command, which imports only pages we already know about in the DB from the Internet Archive. If you set `LOG_LEVEL=INFO`, it’ll tell you how many URLs IA had that we skipped because we didn’t know them. If you set `LOG_LEVEL=DEBUG`, it’ll print every skipped URL.
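As a hypothetical usage sketch (the `wm` entry-point name is an assumption about this repo's CLI and isn't quoted in this thread):

```shell
# Hypothetical invocation; adjust the entry point to match your install.
LOG_LEVEL=DEBUG wm import ia-known-pages
```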