
Releases: edgi-govdata-archiving/wayback

v0.4.5

01 Feb 19:01
0ef2797

In v0.4.4, we broke archived mementos of rate limit errors — they started raising exceptions instead of returning the actual memento. We now correctly return mementos of rate limit errors while still raising exceptions for actual live rate limit errors from the Wayback Machine itself. (#158)
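To illustrate the distinction, here is a minimal sketch of how a client might tell a live rate limit error apart from an archived capture of one. It assumes that archived responses carry a `Memento-Datetime` header and live errors do not. The function name `is_live_rate_limit` is hypothetical, and this is not necessarily how the library makes the decision.

```python
def is_live_rate_limit(status_code, headers):
    """Return True for a live 429, False for an archived capture of a 429.

    Assumption for illustration: a memento is recognizable by the presence
    of a Memento-Datetime header on the response.
    """
    # Compare header names case-insensitively.
    names = {name.lower() for name in headers}
    return status_code == 429 and "memento-datetime" not in names


# A live error from the server itself (no Memento-Datetime header).
live = is_live_rate_limit(429, {"Retry-After": "60"})

# An archived capture of a page that responded with 429 when it was crawled.
archived = is_live_rate_limit(
    429, {"Memento-Datetime": "Wed, 23 Mar 2005 15:53:00 GMT"}
)
```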

Full Changelog: v0.4.4...v0.4.5

Version 0.4.4

28 Nov 00:47
b8f647f

This release makes some small fixes to rate limits and retries to better match the current behavior of Wayback Machine servers:

  • Updates the WaybackClient.search rate limit to 1 call per second (it was previously 1.5 per second). (#140)

  • Delays retries for 60 seconds when receiving rate limit errors from the server. (#142)

  • Adds more logging around requests and rate limiting. This should make it easier to debug future rate limit issues. (#139)

  • Fixes calculation of the time attribute on wayback.exceptions.WaybackRetryError. It turns out it was only accounting for the time spent waiting between retries and skipping the time waiting for the server to respond! (#142)

  • Fixes some spots where we leaked HTTP connections during retries or during exception handling. (#142)
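The timing fix above can be sketched in isolation. The example below is a simplified stand-in, not the library's code: `RetriesExhausted`, `call_with_retries`, and the use of `ConnectionError` as the retryable failure are all hypothetical. It shows elapsed time being measured from the start of the first attempt, so it covers both the time spent waiting for responses and the delay between retries.

```python
import time


class RetriesExhausted(Exception):
    """Hypothetical error carrying the retry count and total elapsed time."""

    def __init__(self, retries, total_time):
        super().__init__(f"Gave up after {retries} retries in {total_time:.1f}s")
        self.retries = retries
        self.time = total_time


def call_with_retries(request, retries=2, wait=60.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call request() until it succeeds or retries run out."""
    start = clock()  # Start timing before the first attempt.
    for attempt in range(retries + 1):
        try:
            return request()
        except ConnectionError:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries:
                sleep(wait)
    # Elapsed time includes response time and time spent sleeping.
    raise RetriesExhausted(retries, clock() - start)
```

The `clock` and `sleep` parameters exist only so the timing can be checked without actually waiting.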

The next minor release (v0.5) will almost certainly include some bigger changes to how rate limits and retries are handled.

Version 0.4.3

26 Sep 16:56
edd08e8

This is mainly a compatibility release: it adds support for urllib3 v2.x and the next upcoming major release of Python, v3.12.0. It also adds support for multiple filters in searches. There are no breaking changes.

Features

You can now apply multiple filters to a search by using a list or tuple for the filter_field parameter of WaybackClient.search. Previously, you could only supply a string with a single filter. (#119)

For example, to search for all captures at nasa.gov with a 404 status code and “feature” somewhere in the URL:

client.search('nasa.gov/',
              match_type='prefix',
              filter_field=['statuscode:404',
                            'urlkey:.*feature.*'])
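For context, a rough sketch of how a list of filters could map onto a CDX query string, assuming each filter is sent as its own repeated `filter` parameter. The helper `build_cdx_query` is hypothetical and only builds a query string; it is not the library's internal implementation.

```python
from urllib.parse import urlencode


def build_cdx_query(url, match_type=None, filter_field=None):
    """Build a query string with one `filter` parameter per filter."""
    params = {"url": url}
    if match_type:
        params["matchType"] = match_type
    if filter_field:
        # Accept a single string as well as a list or tuple of strings.
        if isinstance(filter_field, str):
            filter_field = [filter_field]
        params["filter"] = list(filter_field)
    # doseq=True expands the list into repeated `filter=...` parameters.
    return urlencode(params, doseq=True)


query = build_cdx_query(
    "nasa.gov/",
    match_type="prefix",
    filter_field=["statuscode:404", "urlkey:.*feature.*"],
)
```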

Fixes & Maintenance

  • Add support for Python 3.12.0. (#123)
  • Add support for urllib3 v2.x (urllib3 v1.20+ also still works). (#116)

Version 0.4.3a1

22 Sep 22:02
45ff79e
Pre-release

This is a test release for properly supporting the upcoming release of Python 3.12.0. Please file an issue if you encounter problems using Wayback on Python 3.12.0rc3 or later. (#123)

Version 0.4.2

30 May 06:16
43f553f

Wayback is not compatible with urllib3 v2. This release updates the package's requirements so that pip and other package managers install compatible versions of Wayback and urllib3. There are no other fixes or new features.

Version 0.4.1

08 Mar 05:12
7fd74ef

Features

wayback.Memento now has a links property with information about other URLs that are related to the memento, such as the previous or next mementos in time. It’s a dict where the keys identify the relationship (e.g. 'prev memento') and the values are dicts with additional information about the link. (#57)

For example:

{
    'original': {
        'url': 'https://www.fws.gov/birds/',
        'rel': 'original'
    },
    'first memento': {
        'url': 'https://web.archive.org/web/20050323155300id_/http://www.fws.gov:80/birds',
        'rel': 'first memento',
        'datetime': 'Wed, 23 Mar 2005 15:53:00 GMT'
    },
    'prev memento': {
        'url': 'https://web.archive.org/web/20210125125216id_/https://www.fws.gov/birds/',
        'rel': 'prev memento',
        'datetime': 'Mon, 25 Jan 2021 12:52:16 GMT'
    },
    'next memento': {
        'url': 'https://web.archive.org/web/20210321180831id_/https://www.fws.gov/birds',
        'rel': 'next memento',
        'datetime': 'Sun, 21 Mar 2021 18:08:31 GMT'
    },
    'last memento': {
        'url': 'https://web.archive.org/web/20221006031005id_/https://fws.gov/birds',
        'rel': 'last memento',
        'datetime': 'Thu, 06 Oct 2022 03:10:05 GMT'
    }
}

One use for these is to iterate through additional mementos. For example, to get the previous memento:

client.get_memento(memento.links['prev memento']['url'])
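A relationship may not always be present in the dict (this sketch assumes, for instance, that the earliest memento has no 'prev memento' entry), so a small guard can help when walking through links. The helper `link_url` below is hypothetical and works on a plain dict shaped like the example above.

```python
def link_url(links, relationship):
    """Return the URL for a relationship such as 'prev memento', or None."""
    link = links.get(relationship)
    return link["url"] if link else None


# A trimmed-down copy of the example links dict shown above.
links = {
    "original": {
        "url": "https://www.fws.gov/birds/",
        "rel": "original",
    },
    "prev memento": {
        "url": "https://web.archive.org/web/20210125125216id_/https://www.fws.gov/birds/",
        "rel": "prev memento",
        "datetime": "Mon, 25 Jan 2021 12:52:16 GMT",
    },
}

previous_url = link_url(links, "prev memento")
missing_url = link_url(links, "next memento")  # None: not in this dict
```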

Fixes & Maintenance

  • Fix an issue where the Memento.url attribute might be slightly off from the exact URL that was captured (it could have a different protocol, different upper/lower-casing, etc.). (#99)

  • Fix an error when getting a memento for a redirect in view mode. If you called wayback.WaybackClient.get_memento with a URL that turned out to be a redirect at the given time and set the mode option to wayback.Mode.view, you’d get an exception saying “Memento at {url} could not be played.” Now this works just fine. (#109)

Version 0.4.0

10 Nov 18:35
e2af777

Breaking Changes

This release includes a significant overhaul of parameters for WaybackClient.search.

  • Removed parameters that did nothing, could break search, or that were for internal use only: gzip, showResumeKey, resumeKey, page, pageSize, previous_result.

  • Removed support for extra, arbitrary keyword parameters that could be added to each request to the search API.

  • All parameters now use snake_case. (Previously, parameters that were passed unchanged to the HTTP API used camelCase, while others used snake_case.) The old, non-snake-case names are deprecated, but still work. They’ll be completely removed in v0.5.0.

    • matchType → match_type
    • fastLatest → fast_latest
    • resolveRevisits → resolve_revisits
  • The limit parameter now has a default value. There are very few cases where you should not set a limit (not doing so will typically break pagination), and there is now a default value to help prevent mistakes. We’ve also added documentation to explain how and when to adjust this value, since it is pretty complex. (#65)

  • Expanded the method documentation to explain things in more depth and link to more external references.

While we were at it, we also renamed the datetime parameter of WaybackClient.get_memento to timestamp for consistency with the CdxRecord and Memento classes. The old name still works for now, but it will be fully removed in v0.5.0.
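As an illustration of the general pattern, a deprecated parameter name can keep working while emitting a warning. The function `get_memento_sketch` below is a hypothetical stand-in that only returns its arguments; it is not the library's implementation.

```python
import warnings


def get_memento_sketch(url, timestamp=None, **kwargs):
    """Accept the old `datetime` name as a deprecated alias for `timestamp`."""
    if "datetime" in kwargs:
        warnings.warn(
            'The "datetime" parameter is deprecated; use "timestamp" instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        if timestamp is not None:
            raise TypeError('Pass either "timestamp" or "datetime", not both.')
        timestamp = kwargs.pop("datetime")
    if kwargs:
        # Reject anything else so typos don't pass silently.
        raise TypeError(f"Unexpected arguments: {sorted(kwargs)}")
    return url, timestamp
```

Callers using the old name see a `DeprecationWarning` but get the same result as callers using `timestamp`.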

Features

  • Memento.headers is now case-insensitive. The keys of the headers dict are returned with their original case when iterating, but lookups are performed case-insensitively. For example:

    list(memento.headers) == ['Content-Type', 'Date']
    memento.headers['Content-Type'] == memento.headers['content-type']

    (#98)

  • There are now built-in, adjustable rate limits for calls to both search() and get_memento(). The default values should keep you from getting temporarily blocked by the Wayback Machine servers, but you can also adjust them when instantiating WaybackSession:

    # Limit get_memento() calls to 2 per second (or one every 0.5 seconds):
    client = WaybackClient(WaybackSession(memento_calls_per_second=2))
    
    # These now take a minimum of 0.5 seconds, even if the Wayback Machine
    # responds instantly (there's no delay on the first call):
    client.get_memento('http://www.noaa.gov/', timestamp='20180816111911')
    client.get_memento('http://www.noaa.gov/', timestamp='20180829092926')

    A huge thanks to @LionSzl for implementing this. (#12)
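To show the idea behind a calls-per-second limit, here is a minimal, single-threaded sketch that enforces a minimum interval between calls. The class `MinimumInterval` is hypothetical and much simpler than a production rate limiter; the library's own implementation may differ.

```python
import time


class MinimumInterval:
    """Block so that calls to wait() are at least 1/calls_per_second apart."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # No delay on the first call.

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Usage: call limiter.wait() immediately before each request.
limiter = MinimumInterval(calls_per_second=2)
```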

Fixes & Maintenance

  • All API requests to archive.org now use HTTPS instead of HTTP. Thanks to @sundhaug92 for calling this out. (#81)

  • Headers from the original archived response are again included in Memento.headers. As part of this, the headers attribute is now case-insensitive (see new features above), since the Internet Archive servers now return headers with different cases depending on how the request was made. (#98)
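The case-insensitive behavior described above can be sketched with a small read-only mapping that remembers the original casing of each key. The class `CaseInsensitiveHeaders` is a hypothetical illustration, not the type the library uses.

```python
from collections.abc import Mapping


class CaseInsensitiveHeaders(Mapping):
    """Read-only mapping: case-insensitive lookups, original case when iterating."""

    def __init__(self, headers):
        # Key by lowercase name, but keep the original name with the value.
        self._store = {
            name.lower(): (name, value) for name, value in dict(headers).items()
        }

    def __getitem__(self, name):
        return self._store[name.lower()][1]

    def __iter__(self):
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


headers = CaseInsensitiveHeaders({
    "Content-Type": "text/html",
    "Date": "Wed, 23 Mar 2005 15:53:00 GMT",
})
```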

Version 0.3.3

30 Sep 19:11
3ff9a73

This release extends the timestamp parsing fix from version 0.3.2 to handle the same problem in the month portion of timestamps, not just the day. It also includes a small performance improvement in timestamp parsing. Thanks to @edsu for discovering and addressing this issue. (#88)
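One possible way to tolerate such timestamps is sketched below. It treats a "00" month or day as "01" before parsing. That substitution is an assumption chosen for illustration and may not match how the library repairs these values, and `parse_timestamp_leniently` is a hypothetical name.

```python
from datetime import datetime, timezone


def parse_timestamp_leniently(text):
    """Parse a 14-digit YYYYMMDDhhmmss string, tolerating "00" month or day."""
    # Assumption for illustration: replace an invalid "00" with "01".
    month = text[4:6] if text[4:6] != "00" else "01"
    day = text[6:8] if text[6:8] != "00" else "01"
    fixed = text[:4] + month + day + text[8:]
    return datetime.strptime(fixed, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


bad_day = parse_timestamp_leniently("20000800173100")    # day is "00"
bad_month = parse_timestamp_leniently("20000015101010")  # month is "00"
```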

Full Changelog: v0.3.2...v0.3.3

Version 0.3.2

17 Nov 07:35
2047b07

Some Wayback CDX records have invalid timestamps with "00" for the day-of-month portion. wayback.WaybackClient.search previously raised an exception when parsing CDX records with this issue, but now handles them safely. Thanks to @8W9aG for discovering this issue and addressing it. (#85)

Version 0.3.1

15 Oct 03:30
b406f3d

Some Wayback CDX records have no length information, and previously caused WaybackClient.search to raise an exception. These records now have their length property set to None instead of a number. Thanks to @8W9aG for discovering this issue and addressing it! (#83)
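A small sketch of the kind of guard this implies, assuming a missing length shows up in the raw record as "-" or an empty string (an assumption about the raw data, not something stated above). The helper `parse_length` is hypothetical.

```python
def parse_length(raw):
    """Convert a raw CDX length field to an int, or None when it is missing."""
    # Assumption: missing values appear as "-" or "", not as digits.
    return int(raw) if raw and raw.isdigit() else None


known = parse_length("1234")
unknown = parse_length("-")
```

Code that consumes search results should therefore check for None before doing arithmetic on a record's length.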