Add exception for blocked sites in search() #34

Closed
Mr0grog opened this issue Mar 25, 2020 · 2 comments · Fixed by #46
Labels
enhancement New feature or request

Comments

Member

Mr0grog commented Mar 25, 2020

While looking at @edsu’s very awesome COVID-19 notebook, I found that CDX searches can return a special error for blocked sites, e.g. http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true

Just like we have a custom BlockedByRobotsError, we should have another error for this, rather than just raising a not-so-great HTTP error.

In this case, the response code is 403 and there is a header like:

X-Archive-Wayback-Runtime-Error: org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

(And the same text as the header in the response body.)

We can probably follow Wayback’s naming and call this AdministrativeAccessControlException or BlockedSiteError.

It might even make sense to generalize this for any 4xx/5xx response that has an X-Archive-Wayback-Runtime-Error header.
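As a rough illustration of the idea, here is a minimal sketch of how a response could be checked for this header. It is not the library's actual implementation: `raise_for_wayback_error` is a hypothetical helper, the exception classes are stand-ins defined only for this example, and the function takes a plain status code and header mapping rather than a real response object.

```python
# Stand-in exception classes for this sketch only (not the library's own).
class WaybackException(Exception):
    """Generic error reported by the Wayback Machine."""


class BlockedSiteError(WaybackException):
    """The requested URL has been blocked from access."""


RUNTIME_ERROR_HEADER = 'X-Archive-Wayback-Runtime-Error'


def raise_for_wayback_error(status_code, headers):
    """Hypothetical helper: raise if the response carries a runtime error header."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    normalized = {name.lower(): value for name, value in headers.items()}
    detail = normalized.get(RUNTIME_ERROR_HEADER.lower(), '')

    # Only treat 4xx/5xx responses that carry the header as Wayback errors.
    if status_code < 400 or not detail:
        return

    # The blocked-site case gets its own, more specific exception.
    if 'AdministrativeAccessControlException' in detail:
        raise BlockedSiteError(detail)

    # Anything else with the header falls back to the generic error.
    raise WaybackException(detail)
```

With something like this, a 403 carrying the header above would raise `BlockedSiteError`, while other error responses with the header would still raise the generic exception with the header text as the message.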

@Mr0grog Mr0grog added the enhancement New feature or request label Mar 25, 2020
Member Author

Mr0grog commented Mar 25, 2020

Also worth noting: in the v2 web/timemap/cdx API (that we haven’t yet implemented, see #8) this error is formatted a little differently.

  • No response body.

  • The header has the same name, but the value is more concise:

    X-Archive-Wayback-Runtime-Error: AdministrativeAccessControlException: Blocked Site Error
    
  • Still has a 403 status code.

It’d be good to implement this in a way that supports both.
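One way to support both formats might be to look only at the header and ignore the response body, since the v2 endpoint sends no body. The sketch below is an assumption about how that could work, not what the library does: `parse_runtime_error` and `is_blocked_site` are hypothetical names, and the parsing assumes the message itself never contains `': '`.

```python
def parse_runtime_error(value):
    """Hypothetical helper: split a header value into (exception_name, message).

    Assumes the value is a chain of exception names separated by ': ',
    ending with a human-readable message.
    """
    parts = [part.strip() for part in value.split(': ')]
    if len(parts) < 2:
        # No recognizable exception name; treat the whole value as the message.
        return (None, value.strip())
    message = parts[-1]
    # Keep only the class name, dropping any Java package prefix.
    name = parts[-2].rsplit('.', 1)[-1]
    return (name, message)


def is_blocked_site(value):
    """True if the innermost exception is the blocked-site exception."""
    name, _ = parse_runtime_error(value)
    return name == 'AdministrativeAccessControlException'


# Both the verbose (current CDX) and concise (v2) values parse the same way.
verbose = ('org.archive.util.io.RuntimeIOException: '
           'org.archive.wayback.exception.AdministrativeAccessControlException: '
           'Blocked Site Error')
concise = 'AdministrativeAccessControlException: Blocked Site Error'
print(parse_runtime_error(verbose) == parse_runtime_error(concise))
```

Matching on the innermost exception name rather than the full header string means the wrapping `RuntimeIOException` in the current API does not matter.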

Member Author

Mr0grog commented Jun 28, 2020

Some more quick notes:

So, I think we probably don’t need to worry too much about other possible exception types or more generic handling based on the X-Archive-Wayback-Runtime-Error header for now.

Mr0grog added a commit that referenced this issue Jun 28, 2020
Searching for a blocked URL used to raise a relatively uninformative and generic `WaybackException` error. Now it raises a `BlockedSiteError`. Fixes #34.
Mr0grog added a commit that referenced this issue Jun 29, 2020
`WaybackClient.search()` and `WaybackClient.get_memento()` now raise `BlockedSiteError` any time you request a URL that has been blocked from access (for example, in situations where the Internet Archive has received a takedown notice).

These previously would have resulted in different (and more generic, less informative) errors depending on which method you called. Now blocked URLs always cause the same error across this library.

Fixes #34.