Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add search_v2() method #103

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from
Draft

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Nov 2, 2022

🚧 Work in Progress! 🚧

This adds support for the Internet Archive's new, beta CDX search endpoint at /web/timemap/cdx. It deals with pagination much better and is eventually slated to replace the search currently at /cdx/search/cdx, but is a little slower and still being tested. Fixes #8.

There are still a bunch of things to be done before merging:

  • Get clarity from Internet Archive folks on fields, order (see if they get output=json working), etc.
  • Determine whether all the other parameters still work and work the same (filter, collapse, resolveRevisits, etc) and whether there are new ones we can/should use. (Update: resolveRevisits is badly broken, but the rest are the same as original search and work fine. Checking w/ Wayback folks for more detail.)
  • Implement rate limiting
  • Determine if search_v2 is the right name, or if it should be something else (search_beta()? search_next()?)
  • Add tests

Comment on lines 752 to 790
# Since pages are a number of *blocks searched* and not results, a page
# in the middle of the result set may have nothing in it. The only way
# to know when to stop iterating is to check how many pages there are.
page_count = int(self.session.request('GET', CDX_SEARCH_2_URL, params={
**query,
'showNumPages': 'true'
}).text)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be necessary! I just discovered that sending too high a page value gets a 400 error with the header x-archive-wayback-runtime-error: page must be smaller than numpages, so we can in theory check for that and stop.

That said, that’s a very human-readable message and feels unstable. We should check with folks at the Internet Archive about what approach they’d prefer people use.

@Mr0grog
Copy link
Member Author

Mr0grog commented Nov 2, 2022

Update: the way you control output format in the new search is not with ?output=<format> but instead with the path: /web/timemap/<format>. So the right way to get JSON is:

GET /web/timemap/json?url=<url>&other=params&go=here

This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested.

This commit is a start, but we still need to do more detailed testing and talk more with the Wayback Machine team about things that are unclear here. I'm also not sure if `filter`, `collapse`, `resolveRevisits`, etc. are actually supported.

Fixes #8.
@Mr0grog Mr0grog force-pushed the 8-the-new-search-is-a-better-search branch from ada8423 to 5093982 Compare December 20, 2022 20:07
@Mr0grog Mr0grog force-pushed the 8-the-new-search-is-a-better-search branch from 5c41ba6 to 42d5f7d Compare December 20, 2022 21:04
@Mr0grog
Copy link
Member Author

Mr0grog commented Dec 21, 2022

Some updates:

  • Most of the other parameters continue to work just as they did. But resolveRevisits seems to break searches. I’ve ignored it in the code for now and asked Wayback folks about whether it planned or no longer intended to work, and will figure out the right way to handle it after I hear back.

  • I added copies of most of the search() tests. I think I should probably re-organize these so they are interleaved with the original search tests (so it’s easier to make sure a new test covers both) or even parametrize some of these tests so one test covers both implementations. (Not feasible in cases where we are exercising behavior or arguments that differ between the two.)

  • There is a weird issue where the search sometimes returns some very recent results (like just the last week-ish?) even if they are after the to timestamp. Our tests currently fail because of this. I’ve asked Wayback folks about it (seems like a bug), and am debating whether the best behavior here is to skip or implement the test differently (and expose users to the bug), or to strip the bad results client-side.

At this point, there’s a little more I can do (rate limiting, cleanup), but we the main blocker by lack of clarity on bugs/intended behavior from Wayback, which we’ll have to wait to hear back on.

@Mr0grog Mr0grog added this to the v0.5.x milestone Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Implement CDX search based on newer timemap CDX API
1 participant