
Python package docs

Akash Mahanty edited this page Feb 9, 2022 · 86 revisions

You are currently reading the waybackpy docs for using it as a Python package. If you want to use waybackpy as a CLI tool, visit our CLI docs.


Contents

Archiving or Saving a webpage

>>> import waybackpy
>>> 
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> 
>>> save_api = waybackpy.WaybackMachineSaveAPI(url, user_agent=user_agent)
>>> save_api.save()
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> save_api.cached_save
False
>>> save_api.headers
{'Server': 'nginx/1.19.5', 'Date': 'Sat, 22 Jan 2022 10:20:19 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Fri, 21 Jan 2022 23:32:39 GMT', 'x-archive-orig-server': 'mw1407.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Fri, 21 Jan 2022 23:16:22 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '38855', 'x-archive-orig-x-cache': 'cp4027 miss, cp4030 hit/2', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4030"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-set-cookie': 'WMF-Last-Access=22-Jan-2022;Path=/;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, WMF-Last-Access-Global=22-Jan-2022;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, GeoIP=US:CA:San_Francisco:37.78:-122.47:v4; Path=/; secure; Domain=.wikipedia.org', 'x-archive-orig-x-client-ip': '207.241.227.105', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '28504', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 
'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sat, 22 Jan 2022 10:20:14 GMT', 'link': '<https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timegate", <https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus>; rel="first memento"; datetime="Fri, 22 Apr 2005 13:01:29 GMT", <https://web.archive.org/web/20220118154923/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="prev memento"; datetime="Tue, 18 Jan 2022 15:49:23 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="last memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220122093153-wwwb-spn23.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=138.943024, exclusion.robots;dur=0.124457, exclusion.robots.policy;dur=0.114278, cdx.remote;dur=0.091306, esindex;dur=0.011012, LoadShardBlock;dur=101.247564, PetaboxLoader3.datanode;dur=44.420167, CDXLines.iter;dur=25.235685, PetaboxLoader3.resolve;dur=25.677021, load_resource;dur=4.737038', 'x-app-server': 'wwwb-app201', 'x-ts': '200', 'x-tr': '315', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculusIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 
'Content-Encoding': 'gzip'}
>>> save_api.timestamp()
datetime.datetime(2022, 1, 22, 10, 20, 14)
>>> save_api.archive_url
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'

Try this out in your browser @ https://replit.com/@akamhy/WaybackPySaveExample
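The 14-digit timestamp embedded in every archive URL is exactly what timestamp() parses. Purely as an illustration (this helper is not part of waybackpy), you can recover the datetime from any archive URL yourself with the standard library:

```python
from datetime import datetime

def timestamp_from_archive_url(archive_url: str) -> datetime:
    # Wayback Machine archive URLs embed a 14-digit yyyyMMddhhmmss
    # timestamp immediately after the "/web/" path segment.
    ts = archive_url.split("/web/")[1].split("/")[0]
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

url = "https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus"
print(timestamp_from_archive_url(url))  # 2022-01-22 10:20:14
```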


Retrieving archives of a webpage

Retrieving the oldest archive for a URL using oldest()
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.google.com"
>>> user_agent = "Any-user-agent-you-want"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>> availability_api.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> availability_api.json
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/19981111184551/http://google.com:80/', 'timestamp': '19981111184551'}}, 'timestamp': '199401221029'}
>>> availability_api.timestamp()
datetime.datetime(1998, 11, 11, 18, 45, 51)

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyOldestExample

Retrieving the newest archive for a URL using newest()
>>> import waybackpy
>>> url = "https://www.eff.org"
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.newest()
https://web.archive.org/web/20220122070041/https://www.eff.org/
>>> availability_api.archive_url
'https://web.archive.org/web/20220122070041/https://www.eff.org/'
>>> availability_api.timestamp()
datetime.datetime(2022, 1, 22, 7, 0, 41)
>>> availability_api.json
{'url': 'https://www.eff.org', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20220122070041/https://www.eff.org/', 'timestamp': '20220122070041'}}, 'timestamp': '20220122104234'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNewestExample

Retrieving an archive close to a specified year, month, day, hour, and minute, or to a UNIX timestamp, using near()
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.facebook.com/zuck"
>>> user_agent = "YOUR USER AGENT"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
>>> availability_api.near(year=2012, month=10, day=29, hour=12, minute=16)
https://web.archive.org/web/20121029122242/https://www.facebook.com/zuck
>>> availability_api.json
{'url': 'https://www.facebook.com/zuck', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20121029122242/https://www.facebook.com/zuck', 'timestamp': '20121029122242'}}, 'timestamp': '201210291216'}
>>> availability_api.timestamp()
datetime.datetime(2012, 10, 29, 12, 22, 42)
>>> import waybackpy
>>> url = "https://www.google.com" 
>>> unix_time = 1200144258 # you can pass str, int or float.
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.near(unix_timestamp=unix_time)
https://web.archive.org/web/20080114115458/http://www.google.com/
>>> availability_api.archive_url
'https://web.archive.org/web/20080114115458/http://www.google.com/'
>>> availability_api.timestamp()
datetime.datetime(2008, 1, 14, 11, 54, 58)
>>> availability_api.json
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20080114115458/http://www.google.com/', 'timestamp': '20080114115458'}}, 'timestamp': '20080112132418'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNearExample
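The Availability API matches near() requests against timestamps in the yyyyMMddhhmmss format seen in the JSON output above. A minimal sketch (plain standard library, not waybackpy internals) of how a requested moment or a UNIX timestamp maps to that format:

```python
from datetime import datetime, timezone

# Wayback-style request timestamp for a year/month/day/hour/minute query
requested = datetime(2012, 10, 29, 12, 16)
print(requested.strftime("%Y%m%d%H%M"))  # 201210291216, as in the JSON above

# A UNIX timestamp converts to the same 14-digit format via UTC
unix_time = 1200144258
moment = datetime.fromtimestamp(unix_time, tz=timezone.utc)
print(moment.strftime("%Y%m%d%H%M%S"))  # 20080112132418
```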


List of URLs that the Wayback Machine knows of and has archived for a domain name

  • To include URLs from subdomains, set subdomain=True

Please note that known_urls is built on top of the CDX API, and you can do much more once you master the CDX API interface (see the CDX Server API section below).

>>> import waybackpy
>>> user_agent = "This is an example user agent"
>>> url = "pypi.org"
>>> wayback = waybackpy.Url(url=url, user_agent=user_agent)
>>> known_urls = wayback.known_urls(subdomain=False)
>>> for url in known_urls:
...     print(url)
... 
https://pypi.org/project/coralillo/1.0.0/
https://pypi.org/project/coralillo/1.0.1/
https://pypi.org/project/coraline-eda/
https://pypi.org/project/coralinedb/
https://pypi.org/project/coralogix/
https://pypi.org/project/coralogix/0.2.5.10/
https://pypi.org/project/coralogix/0.2.6.0/
.
. # Millions of other URLs redacted from the output for readability and size limit of this GitHub wiki page
.
https://pypi.org/project/coralogix/0.2.6.5/
https://pypi.org/project/coralogix/0.2.6.6/
https://pypi.org/project/corappo/
https://pypi.org/project/coras/
https://pypi.org/project/corax/
https://pypi.org/project/cord-19-tools/
https://pypi.org/project/cord-robot/
https://pypi.org/project/cord-workflow-controller-client/
https://pypi.org/project/corda/

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py

CDX Server API

This CDX server API doc is derived from https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.

Basic usage

The following code snippet should print all archives whose URL begins with https://github.com/akamhy/, because the wildcard "*" is used.

from waybackpy import WaybackMachineCDXServerAPI
url = "https://github.com/akamhy/*"
user_agent = "Your-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent)
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot)
com,github)/akamhy/antispam 20210113054521 https://github.com/akamhy/antispam text/html 404 DOVRV3NM56PCPIQ2IH2RUINLRDDFXXZO 17318
com,github)/akamhy/dhashpy 20211001180207 https://github.com/akamhy/dhashpy text/html 200 56W6EQISXHZ4PXBCRN7G7ZGWPV2YEMQG 37087
com,github)/akamhy/dhashpy/code_menu_contents/main 20211001180209 https://github.com/akamhy/dhashpy/code_menu_contents/main text/html 200 TC4HVXHKIJIZVWOUIT2NP6M6CE7KGB4S 4929
.
. # Many URLs redacted for readability
.
com,github)/akamhy/waybackpy/workflows/ci/badge.svg 20211029162652 https://github.com/akamhy/waybackpy/workflows/CI/badge.svg warc/revisit - W3FRZ5W6JL4BXITTRZNIKH4YD5D7XCIA 1372
com,github)/akamhy/waybackpy/workflows/ci/badge.svg 20211208051152 https://github.com/akamhy/waybackpy/workflows/CI/badge.svg warc/revisit - W3FRZ5W6JL4BXITTRZNIKH4YD5D7XCIA 1371
com,github)/akamhy/youtubereviewbot 20200909031810 https://github.com/akamhy/YouTubeReviewBot text/html 404 K6CZ667YBWFMRSN3OKTPDD4HQS66TM3U 89082

Try this out in your browser @ https://repl.it/@akamhy/CDX-Basic-usage#main.py
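Each printed snapshot is one CDX record with seven space-separated fields, which waybackpy also exposes as snapshot attributes. Purely for illustration (this parsing is handled for you by waybackpy), a raw line splits like this:

```python
# The seven CDX fields, in the order they appear on each line
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"]

line = ("com,github)/akamhy/dhashpy 20211001180207 "
        "https://github.com/akamhy/dhashpy text/html 200 "
        "56W6EQISXHZ4PXBCRN7G7ZGWPV2YEMQG 37087")
record = dict(zip(FIELDS, line.split()))
print(record["statuscode"], record["timestamp"])  # 200 20211001180207
```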

URL Match Scope

The default behavior is to return matches for an exact URL. However, the CDX server can also return results matching a certain prefix, a certain host, or all sub-hosts by using the match_type= param.

  • match_type=exact (default if omitted) will return results matching exactly archive.org/about/
  • match_type=prefix will return results for all results under the path archive.org/about/
  • match_type=host will return results from host archive.org
  • match_type=domain will return results from host archive.org and all sub-hosts *.archive.org
from waybackpy import WaybackMachineCDXServerAPI
url = "archive.org/about/"
user_agent = "Your-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix")

snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/CDX-UrlMatchScope#main.py

Filtering
Date Range

Results may be filtered by timestamp using the start_timestamp= and end_timestamp= params. The ranges are inclusive and are specified in the same 1-to-14-digit format used for Wayback captures: yyyyMMddhhmmss

from waybackpy import WaybackMachineCDXServerAPI
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, start_timestamp=1998, end_timestamp=2000)
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/CDX-Filtering-Date-Range#main.py
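One way to picture the inclusive 1-to-14-digit bounds: a short bound like "1998" behaves as if padded out to cover the whole year. The sketch below (illustrative only, not waybackpy code) applies that idea to full 14-digit capture timestamps:

```python
def in_range(ts: str, start: str, end: str) -> bool:
    # Pad the lower bound with zeros and the upper bound with nines, so a
    # short bound like "1998" spans 19980101000000 through 19989999999999.
    return start.ljust(14, "0") <= ts <= end.ljust(14, "9")

captures = ["19981111184551", "19991201000000", "20010105120000"]
print([ts for ts in captures if in_range(ts, "1998", "2000")])
# ['19981111184551', '19991201000000']
```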

Regex filtering
  • It is possible to filter on a specific field or on the entire CDX line (which is space-delimited). Filtering by a specific field is often simpler. Any number of filter params of the following form may be specified: filters=["[!]field:regex"]

    • field is one of the named CDX fields (listed in the JSON query) or an index of the field. It is often useful to filter by mimetype or statuscode.

    • Optional: a ! before the query inverts the match, that is, it will return results that do NOT match the regex.

    • regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)

  • Ex: Query for 2 capture results with a non-200 status code:

from waybackpy import WaybackMachineCDXServerAPI
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, filters=["!statuscode:200"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 2:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering1#main.py
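The filtering is performed server-side (with Java regexes), but the "[!]field:regex" semantics are easy to mimic locally. A small sketch with made-up CDX records, purely to show how the filter expressions are read:

```python
import re

# Fake CDX records, for illustration only
records = [
    {"statuscode": "200", "original": "https://archive.org/"},
    {"statuscode": "301", "original": "http://archive.org/"},
    {"statuscode": "404", "original": "https://archive.org/missing"},
]

def matches(record: dict, filter_expr: str) -> bool:
    # "[!]field:regex" -- a leading "!" inverts the match
    invert = filter_expr.startswith("!")
    field, _, pattern = filter_expr.lstrip("!").partition(":")
    hit = re.fullmatch(pattern, record[field]) is not None
    return not hit if invert else hit

print([r["statuscode"] for r in records if matches(r, "!statuscode:200")])
# ['301', '404']
```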

  • Ex: Query for 10 capture results with a non-200 status code and a non-text/html MIME type matching a specific digest:
from waybackpy import WaybackMachineCDXServerAPI
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, filters=["!statuscode:200", "!mimetype:text/html", "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.digest, snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 10:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering2#main.py

Collapsing

A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing is applied to adjacent CDX lines: every capture after the first whose collapse field duplicates the previous one is filtered out. This is useful for thinning out captures that are 'too dense' or when looking for unique captures.

To use collapsing, add one or more field or field:N entries to collapses=[], where field is one of urlkey, timestamp, original, mimetype, statuscode, digest, or length, and N limits the comparison to the first N characters of the field.

  • Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given the 2 captures 20130226010000 and 20130226010800, since their first 10 digits, 2013022601, match, the 2nd capture will be filtered out.
from waybackpy import WaybackMachineCDXServerAPI
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, collapses=["timestamp:10"])
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-first#main.py
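The "at most 1 capture per hour" behavior is plain adjacent deduplication, which the standard library can demonstrate directly. An illustrative local sketch of "timestamp:10" collapsing (the real work happens server-side):

```python
from itertools import groupby

# Three captures; the first two share their first 10 timestamp digits
timestamps = ["20130226010000", "20130226010800", "20130226020000"]

# groupby only groups *adjacent* equal keys, matching collapse semantics:
# keep the first capture of each run, drop the rest
collapsed = [next(group) for _, group in groupby(timestamps, key=lambda ts: ts[:10])]
print(collapsed)  # ['20130226010000', '20130226020000']
```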

  • Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the CDX are not affected)
from waybackpy import WaybackMachineCDXServerAPI
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, collapses=["digest"])
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-second#main.py

  • Ex: Only show unique URLs in a prefix query (filtering out captures except for the first capture of a given URL). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
from waybackpy import WaybackMachineCDXServerAPI
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, collapses=["urlkey"], match_type="prefix")
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-last#main.py