Add command to import known page URLs from IA #174
Conversation
# TODO: this should probably be a method on db.Client, but db.Client could also
# do well to transform the `links` into callables, e.g:
# more_pages = pages['links']['next']()
I was wary of actually doing this here because it:
- Probably has a lot of room for discussion over different APIs/implementations
- Ought to be applied across all `list_*` methods, which is not a small surface area
- Probably entails big changes, because we ought to break apart the URL composition/request from the response-parsing bits.
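To illustrate the idea in the TODO, here's a minimal sketch of turning a page's `links` URLs into zero-argument callables. `fetch_json`, `bind_links`, and the response shape are illustrative assumptions, not the real `db.Client` API:

```python
# Hypothetical sketch: wrap each URL in a page's `links` dict in a callable
# that fetches and re-wraps the next page, so callers can write
# `more_pages = pages['links']['next']()`. `fetch_json` is an assumed
# stand-in for whatever db.Client uses to GET and parse a URL.
def bind_links(page, fetch_json):
    bound = dict(page)
    bound['links'] = {
        rel: (lambda url=url: bind_links(fetch_json(url), fetch_json))
        for rel, url in page.get('links', {}).items()
        if url is not None
    }
    return bound
```

With something like this in place, each `list_*` method would only need to pass its parsed response through `bind_links` before returning it.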
Agreed to all of the above.
Internet Archive stores all the redirects in the original request, which means they get replayed to us or anyone else who tries to retrieve an IA page. It turns out that sometimes that's a lot of redirects! Increase our limit to 10 for now (the default in HTTParty is 5). This fixes some issues I hit while working on edgi-govdata-archiving/web-monitoring-processing#174
Also unsure if we should be using Tornado or Asyncio or something for this, since it’s probably a lot of HTTP traffic.
(force-pushed from 1db5cfd to 5f0dab6)
Need to rebase and incorporate work from #179.
(force-pushed from 5f0dab6 to b26e74d)
Rebased on master and added
Ugh, this got way more complicated, but I think it’s probably doing the job better now.
Looks good.
I think I need to re-work things here based on discussions I had with IA folks recently about revisit records and other kinds of un-playback-able records. We’ve made some pretty incorrect assumptions here about what we’re getting when following redirects and looking up CDX results.
The first major thing to note here is that not every CDX result can be transformed into an importable snapshot/version (so we need to be more careful with how we handle responses and redirects when loading a memento). We’re also going to encounter lots of revisits in our pages because they scrape so often, and knowing how to best handle those feels like it needs some thought.
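As a rough sketch of the “be more careful with CDX results” point: revisit records in CDX output carry the mimetype `warc/revisit` and have no playable body of their own, so a first-pass filter might look like this (the field name follows the CDX JSON output; the function name is made up):

```python
def importable_records(cdx_records):
    """Yield only CDX records that could become importable versions.

    Revisit records (mimetype 'warc/revisit') point back at an earlier,
    byte-identical capture instead of containing a response body, so they
    can't be turned directly into a snapshot/version.
    """
    for record in cdx_records:
        if record.get('mimetype') == 'warc/revisit':
            continue  # would need to be resolved to the original capture
        yield record
```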
(force-pushed from 6951d82 to a76c6ad)
Rebased on master. Also worth noting, re: the last comment about IA import issues, we talked about it in today’s meeting and I don’t think there are any clever solutions that will work well (mainly because of redirects). So the best we can do here is to fail saving a memento in
Once we are happy with #255, I’ll rebase this on it. For now, the rebased and improved version is on
(force-pushed from a76c6ad to 079eeef)
Rebased this insane mess, now that #255 is merged.
Read through again, looks fine. Made some picky suggestions about logging for your consideration.
Is there anything else that needs to be done before we can start trying this thing out in production?
web_monitoring/cli.py (outdated)
skipped += 1
logger.debug('Skipping URL "%s"', version.url)
except ValueError as error:
    logger.warn(error)
Might as well use `logger.exception` for this:
except ValueError:
    logger.exception('Error while applying version_filter to URL %r', url)
The traceback will be automatically captured by `logger.exception` and shown below the log message.
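For reference, a runnable sketch of the suggested pattern (the surrounding function and its name are hypothetical, not from the PR):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def safe_filter(version_filter, url):
    """Apply a filter to a URL, logging (with traceback) instead of raising."""
    try:
        return version_filter(url)
    except ValueError:
        # logger.exception logs at ERROR level and automatically appends
        # the active traceback to the message.
        logger.exception('Error while applying version_filter to URL %r', url)
        return None
```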
Oh, this is really here to handle the error if there were no versions to list:
web-monitoring-processing/web_monitoring/internetarchive.py, lines 377-379 (at 4bb08f8):
if not last_hashes:
    raise ValueError("Internet archive does not have archived "
                     "versions of {}".format(url))
…which, well, probably shouldn’t actually be a `ValueError` at all, since it’s not really a problem with your input. Actually, I’m not really sure why we have an exception for this in the first place. Should we get rid of that?
(Anyway, the implicit point there is that this isn’t really an exception or unexpected scenario, so I don’t think `logger.exception` makes sense for it. But the code and the error classes involved really don’t make that clear.)
Gotcha.
web_monitoring/cli.py (outdated)
view_url=version.view_url)
except ia.MementoPlaybackError as error:
    wayback_errors['playback'].append(error)
    logger.info(f'  {error}')
In contrast to my comment about `logger.exception`, these messages make sense as INFO messages to me because they are tracking expected errors coming from a service that issues them frequently.
Ahhhh, sorry, I wasn’t clear; I wasn’t really feeling like this was super mergeable at this point. I was planning to try and do the async bit and clean all this up (e.g. the flow between things has gotten super wonky; this needs to at least be extracted out of
Gotcha. This seemed either ready or not at all ready, depending on scope.
Yeah, I guess we could merge it in an ugly state. My main concern is the async bit, because a full run of this across all 90-ish domain names we are monitoring can take a while. I think we need parallel downloading from IA to make this feasible.
I agree and I don't think we need to rush it. Was just trying to figure out if your "Rebased" comment implied "So can we merge this please?" or not. :-D Let's wait for async.
(force-pushed from f538658 to 4edb330)
@danielballan I might come back and try and tidy things up slightly over the next few days, but I don’t think I’m going to make any more significant architectural changes at this point, so it’s probably ready for a high-level review. You might be able to see the places where I got stuck trying to architect a super fancy data pipeline system. I tried a few different approaches and think I came up with some neat stuff, but realized in the end that it was a lot of abstraction for something that would only be used once or twice in the codebase. It didn’t save enough work to be worthwhile. Anyway, the big pieces here are probably:
(Also, as a bonus,
Overall I found this reasonably easy to follow, considering what it does. I like the `FiniteQueue` and the way that `WaybackClient` and `WaybackSession` work together. Yes, I was struck by `WaybackRecordsWorker`, but I can see that breaking that up could be a lot of effort for little reusable abstraction and arguably not any better readability.
I think the overall structure is good to run with. I left some small comments related to error handling, only one or two of them important.
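The actual `FiniteQueue` isn’t shown in this thread; one minimal way to build the concept (a thread-safe queue that consumers can iterate until a producer marks it finished) might be:

```python
import queue

class FiniteQueue(queue.Queue):
    """A Queue that is also an iterator: consumers can write
    `for item in q: ...`, and the loop ends once a producer calls
    `q.end()`. Minimal sketch; the real FiniteQueue may differ.
    """
    _END = object()  # sentinel marking "no more items"

    def end(self):
        self.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self.get()
        if item is self._END:
            # Re-signal the end so any other consumers also stop.
            self.put(self._END)
            raise StopIteration
        return item
```

Producers would put records from worker threads and call `end()` when done, letting the consumer side stay a plain `for` loop.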
with utils.QuitSignal((signal.SIGINT, signal.SIGTERM)) as stop_event:
This is clever.
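The `utils.QuitSignal` implementation isn’t quoted here, but the trick presumably amounts to installing signal handlers that set a `threading.Event` worker loops can poll. A minimal sketch (the real implementation may differ):

```python
import signal
import threading

class QuitSignal:
    """Context manager yielding a threading.Event that is set when any of
    the given signals (e.g. SIGINT, SIGTERM) arrives, so worker loops can
    check `stop_event.is_set()` and shut down cleanly.
    """
    def __init__(self, signals):
        self.signals = signals
        self.event = threading.Event()
        self._previous = {}

    def _handler(self, signum, frame):
        self.event.set()

    def __enter__(self):
        # Install our handler, remembering the previous ones for restore.
        for sig in self.signals:
            self._previous[sig] = signal.signal(sig, self._handler)
        return self.event

    def __exit__(self, *exc):
        for sig, prev in self._previous.items():
            signal.signal(sig, prev)
        return False
```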
web_monitoring/cli.py (outdated)
skipped += 1
logger.debug('Skipping URL "%s"', version.url)
except ia.BlockedByRobotsError as error:
    logger.warn(str(error))
Include the exception type:
- logger.warn(str(error))
+ logger.warn(repr(error))
In this case, I didn’t do that because this exception always has nice descriptive text:
web-monitoring-processing/web_monitoring/internetarchive.py, lines 501-503 (at 98daf63):
except Exception:
    if 'RobotAccessControlException' in text:
        raise BlockedByRobotsError(f'CDX search for URL was blocked by robots.txt "{query["url"]}" (parameters: {final_query})')
Do you think we should still use `repr` here? Should I cut down the error message if so?
Side note: I guess I should move most of that text to the actual error class instead of here where we raise it.
OK, doesn't seem like including the exception type in the warning would add much here then.
Argh, went to move the text and remembered why I didn’t originally do that: we could probably end up encountering a robots.txt issue in other APIs besides CDX as well (although I’ve only seen it, and know how it’s formatted, in CDX).
Any thoughts on how to better design this for that? Should we let the constructor take some extra info like:
raise BlockedByRobotsError(query['url'], f'In CDX search {final_query}')
Should we leave it to where the error is handled?
try:
    raise ia.BlockedByRobotsError(query['url'])
except ia.BlockedByRobotsError as error:
    logger.warn(f'CDX search error: {error!r}')
Or something else? Leave it as-is?
I see. I like the snippet above because it’s the job of the caller to explain what it was doing that caused the error. I don’t think the exception’s constructor needs to be extended to carry arbitrary contextual information.
OK, I took a crack at that in 91e48ce.
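A minimal sketch of the pattern agreed on above (not necessarily what 91e48ce actually does): the error carries just the URL plus a sensible default message, and callers add context when logging:

```python
class BlockedByRobotsError(Exception):
    """Raised when Wayback blocks access to a URL because of robots.txt."""
    def __init__(self, url):
        self.url = url
        # Default message lives on the error class, not at the raise site.
        super().__init__(f'{url} is blocked by robots.txt')
```

At the call site, context stays with the caller, e.g. `logger.warn(f'CDX search error: {error!r}')`.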
(force-pushed from b3c8cd0 to 58ea3d8)
We previously set special log levels for urllib3 because it was too noisy (it logged every retry as a warning, but Wayback's systems fail a lot and we *expect* having to retry often). However, we later stopped using urllib3's retry functionality because it couldn't distinguish between Wayback failures (which we want to retry) and mementos of failures (which are OK). So the custom logging setup is no longer necessary at all! This also adds a few more comments to the retry code, and captures the above situation in the same place where we describe the problems with urllib3's built-in retry functionality that will hopefully get fixed in the future.
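The distinction the commit message describes can be sketched like this (all names are illustrative; the real WaybackSession retry code is more involved):

```python
import time

# Statuses worth retrying when they come from Wayback *itself*. An archived
# 503 (a memento of a failed response) is legitimate data and must not be
# retried -- this is the case urllib3's Retry can't distinguish.
RETRYABLE_STATUSES = {502, 503, 504}

def request_with_retries(send, max_tries=3, backoff=0.0):
    """`send` performs one request and returns (status, is_memento)."""
    for attempt in range(max_tries):
        status, is_memento = send()
        if status in RETRYABLE_STATUSES and not is_memento:
            time.sleep(backoff * attempt)
            continue  # Wayback itself failed; try again
        return status, is_memento  # success, or an archived error page
    return status, is_memento
```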
@danielballan I think I addressed all the issues you brought up. Want to give it another once-over?
Way back at the end of last year, we added `status` as a first-class field on versions (edgi-govdata-archiving/web-monitoring-db#453), but never switched to actually sending it properly in our import data! This moves it from `source_metadata` (where it would be redundant) to the new first-class field.
The web-monitoring-db project is removing totals from list queries in order to improve performance (counting the total possible results is super expensive). You can still get the totals by using the `?include_total` query parameter, so this just makes sure to set that in the one query where we need that information (determining the size of the population to run a random sample over). See the relevant change in the API: edgi-govdata-archiving/web-monitoring-db#596
Huzzah! 🎉
We made a bunch of improvements to our Wayback tools in #174, but we didn’t update the docs as well as we should have in that PR. This attempts to remedy that: - Use `WaybackClient.get_memento()` in Wayback tutorial. - Use the term “memento” instead of “snapshot” throughout the tutorial. - Add a diagram and docstring to the top of the `cli` module. - Ensure we generate API docs for `WaybackClient.get_memento()` and `WaybackSession`
In #174, I made a last minute fix to add `status` as a top level field in our version imports (it was already supported by the API, but we had failed to update our code to send it). I must have missed committing a file though, because it's failing!
Fixes #86, based on #173.
This adds the `import ia-known-pages` command, which imports only pages we already know about in the DB from the Internet Archive. If you set `LOG_LEVEL=INFO`, it’ll tell you how many URLs IA had that we skipped because we didn’t know them. If you set `LOG_LEVEL=DEBUG`, it’ll print every skipped URL.
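As a hypothetical usage sketch (the `wm` entry-point name is an assumption about this repo's CLI and isn't quoted in this thread):

```shell
# Hypothetical invocation; adjust the entry point to match your install.
LOG_LEVEL=DEBUG wm import ia-known-pages
```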