Support multiple filters in CDX search (#127)
The `filter_field` parameter for `WaybackClient.search()` can now be a list or tuple of strings, letting you add multiple filters. For example, to search for all captures at `nasa.gov` with a 404 status and “feature” somewhere in the URL:

```python
client.search('nasa.gov/',
              match_type='prefix',
              from_date=date(2022, 1, 1),
              to_date=date(2022, 2, 1),
              filter_field=['statuscode:404',
                            'urlkey:.*feature.*'])
```

Thanks to @BilibalaX for starting this in #120.

Fixes #119.
Mr0grog committed Sep 25, 2023
1 parent 45ff79e commit f7aa5d8
Showing 6 changed files with 366 additions and 8 deletions.
11 changes: 10 additions & 1 deletion docs/source/release-history.rst
@@ -14,7 +14,16 @@ N/A
 Features
 ^^^^^^^^

-N/A
+- You can now apply multiple filters to a search by using a list or tuple for the ``filter_field`` parameter of :meth:`wayback.WaybackClient.search`. (:issue:`119`)
+
+  For example, to search for all captures at ``nasa.gov`` with a 404 status and “feature” somewhere in the URL:
+
+  .. code-block:: python
+
+     client.search('nasa.gov/',
+                   match_type='prefix',
+                   filter_field=['statuscode:404',
+                                 'urlkey:.*feature.*'])

 Fixes & Maintenance
17 changes: 10 additions & 7 deletions wayback/_client.py
@@ -456,7 +456,7 @@ class WaybackClient(_utils.DepthCountedContext):
     Parameters
     ----------
-    session : :class:`WaybackSession`, optional
+    session : WaybackSession, optional
     """
     def __init__(self, session=None):
         self.session = session or WaybackSession()
@@ -570,9 +570,12 @@ def search(self, url, *, match_type=None, limit=1000, offset=None,
             Only include captures before this date. Equivalent to the `to`
             argument in the CDX API. If it does not have a time zone, it is
             assumed to be in UTC.
-        filter_field : str, optional
-            A filter for any field in the results. Equivalent to the ``filter``
-            argument in the CDX API. (format: ``[!]field:regex``)
+        filter_field : str or list of str or tuple of str, optional
+            A filter or list of filters for any field in the results. Equivalent
+            to the ``filter`` argument in the CDX API. To apply multiple
+            filters, use a list of strings instead of a single string. Format:
+            ``[!]field:regex``, e.g. ``'!statuscode:200'`` to select only
+            captures with a non-200 status code.
         collapse : str, optional
             Collapse consecutive results that match on a given field. (format:
             `fieldname` or `fieldname:N` -- N is the number of chars to match.)
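To make the ``[!]field:regex`` format concrete, here is a minimal local sketch of how such filters could be evaluated against a single capture record. It is not part of this commit or of the CDX server. `matches_filter` and `matches_all` are hypothetical helpers, the sample record is made up, and the sketch assumes the regex must match the whole field value and that multiple filters are combined with AND, as the commit message's example implies.

```python
import re


def matches_filter(capture, filter_expr):
    """Hypothetical helper: apply one ``[!]field:regex`` filter to a dict."""
    # A leading "!" inverts the match.
    negate = filter_expr.startswith('!')
    if negate:
        filter_expr = filter_expr[1:]
    # Split on the first colon only; the regex itself may contain colons.
    field, _, pattern = filter_expr.partition(':')
    # Assumption: the regex has to match the entire field value.
    matched = re.fullmatch(pattern, str(capture[field])) is not None
    return matched != negate


def matches_all(capture, filters):
    """Accept a single filter string or a list/tuple of them (ANDed)."""
    if isinstance(filters, str):
        filters = [filters]
    return all(matches_filter(capture, f) for f in filters)


# Made-up record for illustration only.
capture = {'statuscode': '404', 'urlkey': 'gov,nasa)/feature/missing'}
print(matches_all(capture, ['statuscode:404', 'urlkey:.*feature.*']))  # True
print(matches_all(capture, '!statuscode:200'))                         # True
```

The real filtering happens on the server; this only illustrates how the negation prefix and the field/regex split read.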
@@ -636,9 +639,7 @@ def search(self, url, *, match_type=None, limit=1000, offset=None,
                           stacklevel=2)
         resolve_revisits = resolve_revisits or resolveRevisits

-        # TODO: support args that can be set multiple times: filter, collapse
-        # Should take input as a sequence and convert to repeat query args
-        # TODO: Check types
+        # TODO: Check types (requires major update)
         query_args = {'url': url, 'matchType': match_type, 'limit': limit,
                       'offset': offset, 'from': from_date,
                       'to': to_date, 'filter': filter_field,
@@ -651,6 +652,8 @@ def search(self, url, *, match_type=None, limit=1000, offset=None,
             if value is not None:
                 if isinstance(value, str):
                     query[key] = value
+                elif isinstance(value, (list, tuple)):
+                    query[key] = value
                 elif isinstance(value, date):
                     query[key] = _utils.format_timestamp(value)
                 else:
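The hunk above passes lists and tuples through unchanged, and the recorded cassettes show the result as repeated `filter=` query arguments. The following standalone sketch reproduces that encoding with the standard library only. `build_query` is a hypothetical stand-in for the loop in `search()`, the `strftime` format is an assumption inferred from the recorded URIs rather than the actual `_utils.format_timestamp`, and the `else` branch is a guess because the diff does not show it.

```python
from datetime import date
from urllib.parse import urlencode


def build_query(query_args):
    """Hypothetical stand-in for the value-normalizing loop in search()."""
    query = {}
    for key, value in query_args.items():
        if value is None:
            continue
        if isinstance(value, (str, list, tuple)):
            # Strings and sequences of strings are passed through as-is.
            query[key] = value
        elif isinstance(value, date):
            # Assumed 14-digit timestamp format, as seen in the cassettes.
            query[key] = value.strftime('%Y%m%d%H%M%S')
        else:
            # Assumption: other values are simply stringified.
            query[key] = str(value)
    return query


query = build_query({
    'url': 'nasa.gov/',
    'matchType': 'prefix',
    'limit': 10,
    'from': date(2022, 1, 1),
    'to': date(2022, 2, 1),
    'filter': ['statuscode:404', 'urlkey:.*feature.*'],
    'collapse': None,
})

# doseq=True turns each item of a list or tuple into its own query argument.
encoded = urlencode(query, doseq=True)
print(encoded)
```

With `doseq=True`, the two filters come out as `filter=statuscode%3A404&filter=urlkey%3A.%2Afeature.%2A`, which is the same fragment that appears in the recorded request URIs.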
111 changes: 111 additions & 0 deletions wayback/tests/cassettes/test_search_with_filter
@@ -0,0 +1,111 @@
interactions:
- request:
body: null
headers:
Accept-Encoding:
- gzip, deflate
User-Agent:
- wayback/0.4.3a1.post3.dev0+g738efb9 (+https://github.com/edgi-govdata-archiving/wayback)
method: GET
uri: https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix&limit=10&from=20220101000000&to=20220201000000&showResumeKey=true&resolveRevisits=true
response:
body:
string: !!binary |
H4sIAAAAAAAAAK3TPU/DMBAG4L2/wksHBGrsO19M2KqENA11ndB8NVsHlA5AkRo1/HwSBiSkWEUi
8nq6R3fvuTld7t4P58ONw4ADcNE/jh5IdmzbjwfH6bpuMRQsmtPFYe3LZ+sc27fXvpqz5TKDx9yE
lUxNHCujSrUyAWD0rItIU8oQCGbNGCFBqmsEcsFqdx1LN67qXGX+xg/W2wyNLsKooFWd7HOmSFiF
+2/h/N8pJHFpMQjFRAYiHzUEl6gmM8bTEJz4Txq2JACkzrfVTqPeGUqKkjauLqX2gyBP0v1Tvya0
dRcwSdakLAIKnEpwrQJOlsL4NQn06OoUf/x2/aIGYw7+0GQO3hzC218Szr4AdYEU1/sDAAA=
headers:
Connection:
- keep-alive
Content-Type:
- text/plain;charset=UTF-8
Date:
- Mon, 25 Sep 2023 00:35:32 GMT
Permissions-Policy:
- interest-cohort=()
Referrer-Policy:
- no-referrer-when-downgrade
Server:
- nginx/1.25.1
Transfer-Encoding:
- chunked
X-NA:
- '0'
X-NID:
- '-'
X-Page-Cache:
- BYPASS
X-RL:
- '0'
X-location:
- cdx
content-encoding:
- gzip
x-app-server:
- wwwb-app15
x-tr:
- '253'
x-ts:
- '200'
status:
code: 200
message: OK
- request:
body: null
headers:
Accept-Encoding:
- gzip, deflate
User-Agent:
- wayback/0.4.3a1.post3.dev0+g738efb9 (+https://github.com/edgi-govdata-archiving/wayback)
method: GET
uri: https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix&limit=10&from=20220101000000&to=20220201000000&filter=statuscode%3A404&showResumeKey=true&resolveRevisits=true
response:
body:
string: !!binary |
H4sIAAAAAAAAALWUX0/CMBTF3/kUvPTBKGt7266db4TIg8bE+A/DC2mgbIusXVh1+u3dIISpQx2J
e7g7SZv7O+e2W+xez6wu9AlGQLK0KFJnZ7mOTYEzvS6wNeVWZcYn2pog8dmqDwSAUBqRiFCAfuJ9
fo5xWZZB3SuI3Wundt68ebxRnPD+5fUEwvubOzZ5uBtyeTUUT49wPb29mk7E+Ene9EFR2oubxr3L
03mBtVk7q198rZdG+5d1xfXr1WxhsvfCp8vULD4HoJxyHm4CFC0JOvftnkS2JvFmnli3cvH7Hvic
+nli7MDltZFmDpA0pIKrQwfRsWnXEDISzRAGkCJIw84cEMIYhwNDvtjsHgLujpX0ByyjlFUv9Q9Y
/hWrkFru0hIixS5tK7XaPD6CqlSDOtjRaPVIxltpg2OyNa/jIF8sMQ0jAirTqZ0t9MrMrCv1c1Ct
NCxUirVb+LlD56+FNC9agOaANNnW/fgFi7hodROgEaAh2dYjhhPBIfr3uv/DKGAs/N3P93rM8ZFe
bRHBqIYgiBCMAwSisgSisvZHdfrJvOx9AOhJokklBgAA
headers:
Connection:
- keep-alive
Content-Type:
- text/plain;charset=UTF-8
Date:
- Mon, 25 Sep 2023 00:35:33 GMT
Permissions-Policy:
- interest-cohort=()
Referrer-Policy:
- no-referrer-when-downgrade
Server:
- nginx/1.25.1
Transfer-Encoding:
- chunked
X-NA:
- '0'
X-NID:
- '-'
X-Page-Cache:
- BYPASS
X-RL:
- '0'
X-location:
- cdx
content-encoding:
- gzip
x-app-server:
- wwwb-app52
x-tr:
- '371'
x-ts:
- '200'
status:
code: 200
message: OK
version: 1
111 changes: 111 additions & 0 deletions wayback/tests/cassettes/test_search_with_filter_list
@@ -0,0 +1,111 @@
interactions:
- request:
body: null
headers:
Accept-Encoding:
- gzip, deflate
User-Agent:
- wayback/0.4.3a1.post3.dev0+g738efb9 (+https://github.com/edgi-govdata-archiving/wayback)
method: GET
uri: https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix&limit=10&from=20220101000000&to=20220201000000&showResumeKey=true&resolveRevisits=true
response:
body:
string: !!binary |
H4sIAAAAAAAAAK3TPU/DMBAG4L2/wksHBGrsO19M2KqENA11ndB8NVsHlA5AkRo1/HwSBiSkWEUi
8nq6R3fvuTld7t4P58ONw4ADcNE/jh5IdmzbjwfH6bpuMRQsmtPFYe3LZ+sc27fXvpqz5TKDx9yE
lUxNHCujSrUyAWD0rItIU8oQCGbNGCFBqmsEcsFqdx1LN67qXGX+xg/W2wyNLsKooFWd7HOmSFiF
+2/h/N8pJHFpMQjFRAYiHzUEl6gmM8bTEJz4Txq2JACkzrfVTqPeGUqKkjauLqX2gyBP0v1Tvya0
dRcwSdakLAIKnEpwrQJOlsL4NQn06OoUf/x2/aIGYw7+0GQO3hzC218Szr4AdYEU1/sDAAA=
headers:
Connection:
- keep-alive
Content-Type:
- text/plain;charset=UTF-8
Date:
- Mon, 25 Sep 2023 00:35:34 GMT
Permissions-Policy:
- interest-cohort=()
Referrer-Policy:
- no-referrer-when-downgrade
Server:
- nginx/1.25.1
Transfer-Encoding:
- chunked
X-NA:
- '0'
X-NID:
- '-'
X-Page-Cache:
- BYPASS
X-RL:
- '0'
X-location:
- cdx
content-encoding:
- gzip
x-app-server:
- wwwb-app15
x-tr:
- '132'
x-ts:
- '200'
status:
code: 200
message: OK
- request:
body: null
headers:
Accept-Encoding:
- gzip, deflate
User-Agent:
- wayback/0.4.3a1.post3.dev0+g738efb9 (+https://github.com/edgi-govdata-archiving/wayback)
method: GET
uri: https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix&limit=10&from=20220101000000&to=20220201000000&filter=statuscode%3A404&filter=urlkey%3A.%2Afeature.%2A&showResumeKey=true&resolveRevisits=true
response:
body:
string: !!binary |
H4sIAAAAAAAAAK2TQWvjMBCF7/0VueSwUEWWLMf23sIuPbQUSrvblF6WqTWRRG0pyEq8+fcb24kT
SgJNWbDxG8yMvvfGVm59baGGb3TMo+CWpqgpoHcWVqHVC4Sw8ljT4Ms/EqtNHczCoJzoUJUjHnEe
MZYzwYSYjnQIy/o7pU3TTNqhE+XWX5ob8G+gnRKRGN3ez/n018NTPP/9NBPp3Sx5eeb3r493r/Pk
5iV9GPGMpVfqhJOAhbaudGpzOPDdhEKjJW7Zghz74CmbskRknY/zNj479FITaZ4cmYAKh5SocnvE
iGVJytPTUX/ooZcjZPl5hHHBxyCHp0e0vWyMVb2qjNJhs9POL/XhDQT0Bsq6L82uFYwvPCxCX0ms
jbKHryoSScRObuMj2g8+nv0cnh1aJzu0Tu3Qer1H66sBrSvNrnVA66oe7fJEMxGdT5S0lz9aLc9F
8qnV9p1fWfD0COcwT0rwkm4RctpoCEQ6rEnQSCpTvm9IA9sb2wSJXr29lUjASqLAADF2jdt/V21D
3C8uZnHMk/y0kf985sURxGnMr/4BbQemSPYEAAA=
headers:
Connection:
- keep-alive
Content-Type:
- text/plain;charset=UTF-8
Date:
- Mon, 25 Sep 2023 00:35:34 GMT
Permissions-Policy:
- interest-cohort=()
Referrer-Policy:
- no-referrer-when-downgrade
Server:
- nginx/1.25.1
Transfer-Encoding:
- chunked
X-NA:
- '0'
X-NID:
- '-'
X-Page-Cache:
- BYPASS
X-RL:
- '0'
X-location:
- cdx
content-encoding:
- gzip
x-app-server:
- wwwb-app53
x-tr:
- '172'
x-ts:
- '200'
status:
code: 200
message: OK
version: 1
58 changes: 58 additions & 0 deletions wayback/tests/cassettes/test_search_with_filter_tuple
@@ -0,0 +1,58 @@
interactions:
- request:
body: null
headers:
Accept-Encoding:
- gzip, deflate
User-Agent:
- wayback/0.4.3a1.post3.dev0+g738efb9 (+https://github.com/edgi-govdata-archiving/wayback)
method: GET
uri: https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix&limit=10&from=20220101000000&to=20220201000000&filter=statuscode%3A404&filter=urlkey%3A.%2Afeature.%2A&showResumeKey=true&resolveRevisits=true
response:
body:
string: !!binary |
H4sIAAAAAAAAAK2TQWvjMBCF7/0VueSwUEWWLMf23sIuPbQUSrvblF6WqTWRRG0pyEq8+fcb24kT
SgJNWbDxG8yMvvfGVm59baGGb3TMo+CWpqgpoHcWVqHVC4Sw8ljT4Ms/EqtNHczCoJzoUJUjHnEe
MZYzwYSYjnQIy/o7pU3TTNqhE+XWX5ob8G+gnRKRGN3ez/n018NTPP/9NBPp3Sx5eeb3r493r/Pk
5iV9GPGMpVfqhJOAhbaudGpzOPDdhEKjJW7Zghz74CmbskRknY/zNj479FITaZ4cmYAKh5SocnvE
iGVJytPTUX/ooZcjZPl5hHHBxyCHp0e0vWyMVb2qjNJhs9POL/XhDQT0Bsq6L82uFYwvPCxCX0ms
jbKHryoSScRObuMj2g8+nv0cnh1aJzu0Tu3Qer1H66sBrSvNrnVA66oe7fJEMxGdT5S0lz9aLc9F
8qnV9p1fWfD0COcwT0rwkm4RctpoCEQ6rEnQSCpTvm9IA9sb2wSJXr29lUjASqLAADF2jdt/V21D
3C8uZnHMk/y0kf985sURxGnMr/4BbQemSPYEAAA=
headers:
Connection:
- keep-alive
Content-Type:
- text/plain;charset=UTF-8
Date:
- Mon, 25 Sep 2023 00:35:38 GMT
Permissions-Policy:
- interest-cohort=()
Referrer-Policy:
- no-referrer-when-downgrade
Server:
- nginx/1.25.1
Transfer-Encoding:
- chunked
X-NA:
- '0'
X-NID:
- '-'
X-Page-Cache:
- BYPASS
X-RL:
- '0'
X-location:
- cdx
content-encoding:
- gzip
x-app-server:
- wwwb-app53
x-tr:
- '137'
x-ts:
- '200'
status:
code: 200
message: OK
version: 1
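The list and tuple cassettes record the same request URI. As a quick sanity check, that URI can be decoded with the standard library to confirm that both filters arrived as separate `filter` arguments. This is only an illustrative sketch; the URI is copied from the cassettes above and nothing here comes from the test suite itself.

```python
from urllib.parse import parse_qs, urlsplit

# Request URI as recorded in the list and tuple cassettes.
uri = ('https://web.archive.org/cdx/search/cdx?url=nasa.gov%2F&matchType=prefix'
       '&limit=10&from=20220101000000&to=20220201000000'
       '&filter=statuscode%3A404&filter=urlkey%3A.%2Afeature.%2A'
       '&showResumeKey=true&resolveRevisits=true')

# parse_qs collects repeated arguments into a list per key.
params = parse_qs(urlsplit(uri).query)
print(params['filter'])  # ['statuscode:404', 'urlkey:.*feature.*']
```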