Implement CDX search based on newer timemap CDX API #8

Open
Mr0grog opened this issue Jan 5, 2018 · 12 comments · May be fixed by #103

@Mr0grog
Member

Mr0grog commented Jan 5, 2018

From a conversation on the Internet Archive’s Research Slack today:

kenji
Igor http://spacex.com/robots.txt has Disallow: /includes/ and http://web.archive.org/cdx/search still honors robots.txt exclusion (because it’s served by older wayback machine), while playback ignores robots.txt (served by new wayback machine).

http://web.archive.org/web/timemap/cdx?url=www.spacex.com&matchType=domain&gzip=false&filter=statuscode:200&to=20041229235959 will give you more results, including those under the /includes/ path. /web/timemap/cdx is served by new wayback.

I’m sorry for the confusing, inconsistent results - we’re trying to migrate all services to new wayback

oh btw, a tip: to=2004 will be interpreted as 20041231235959 (if you’re not excluding day 30 and 31 on purpose 😄) (edited)

Igor
kenji Thank you!

mr0grog
Oh, I did not know about /web/timemap/cdx as opposed to just /cdx/search/cdx. Should I be using the former instead of the latter?

kenji
/web/timemap/cdx is better functionality-wise, but it’s slower than /cdx/search. So I’d suggest /cdx/search as long as it works ok for your purpose.

mr0grog
ah, ok
Will need to consider which is the right path. Is there anything that documents the functional differences? e.g. the robots.txt issue would be a hard one to discover

Do you have a rough sense of how much slower /web/timemap/cdx is?

kenji
I don’t have good benchmark result (it’s nice to have), but I find /web/timemap/cdx 10-20% slower for matchType=exact query. matchType=domain can be much slower.

We need to look into whether we should switch to /web/timemap/cdx.
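For comparing the two endpoints side by side, here is a minimal sketch that builds the same query against either one. The helper name `cdx_query_url` and the constants are made up for illustration; the parameters are the ones from kenji's example URL above.

```python
from urllib.parse import urlencode

# The two endpoints discussed in the conversation above.
OLD_CDX = "http://web.archive.org/cdx/search/cdx"
TIMEMAP_CDX = "http://web.archive.org/web/timemap/cdx"


def cdx_query_url(url, endpoint=OLD_CDX, **params):
    """Build a query URL for either endpoint (hypothetical helper)."""
    query = {"url": url, **params}
    # Keep ":" unescaped so filters such as "statuscode:200" stay readable.
    return f"{endpoint}?{urlencode(query, safe=':')}"


# Same parameters as the example URL, but using the year-only `to` value,
# which kenji says is interpreted as 20041231235959.
example = cdx_query_url(
    "www.spacex.com",
    endpoint=TIMEMAP_CDX,
    matchType="domain",
    gzip="false",
    filter="statuscode:200",
    to="2004",
)
```

Swapping `endpoint=OLD_CDX` into the same call gives the equivalent `/cdx/search/cdx` query, which makes it easy to diff the two result sets.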

@Mr0grog Mr0grog self-assigned this Jan 5, 2018
@Mr0grog
Member Author

Mr0grog commented Dec 5, 2018

Other notes I have discovered in edgi-govdata-archiving/web-monitoring-processing#174: this new API doesn’t support resumeKey; you have to use page and pageSize for iterating through results (which is not as straightforward as you might think).
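A rough sketch of what iterating with `page` and `pageSize` could look like. Two things here are assumptions and not confirmed behavior: that pages are numbered from 0, and that an empty page marks the end of results. The function names are hypothetical.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode
from urllib.request import urlopen

TIMEMAP_CDX = "http://web.archive.org/web/timemap/cdx"


def fetch_timemap_cdx_page(url: str, page: int, page_size: int = 5) -> List[str]:
    """Fetch one page of results as a list of non-empty CDX lines."""
    query = urlencode({"url": url, "page": page, "pageSize": page_size})
    with urlopen(f"{TIMEMAP_CDX}?{query}") as response:
        text = response.read().decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]


def iter_pages(fetch_page: Callable[[int], List[str]],
               max_pages: int = 1000) -> Iterator[str]:
    """Yield lines page by page, starting at page 0 (ASSUMED numbering).

    ASSUMPTION: an empty page means there are no more results. The real
    stopping rule may differ, so max_pages acts as a safety cap.
    """
    for page in range(max_pages):
        lines = fetch_page(page)
        if not lines:
            return
        yield from lines


# Usage (performs network requests, so it is not run here):
# for line in iter_pages(lambda p: fetch_timemap_cdx_page("epa.gov", p)):
#     print(line)
```

Passing the fetcher in as a callable keeps the paging loop testable without touching the network.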

@Mr0grog
Member Author

Mr0grog commented Mar 11, 2019

Update: since the above conversation happened, Wayback folks have started gently pushing us to more actively use the newer services, like timemap CDX and SPN2. So I think the answer to this issue is probably “yes we should” now.

@Mr0grog Mr0grog changed the title from “Investigate whether we should be using IA timemap CDX API” to “Implement CDX search based on newer timemap CDX API” Mar 11, 2019
@Mr0grog
Member Author

Mr0grog commented Mar 11, 2019

Since this is still beta-ish, we should probably implement this alongside the old /cdx/search API.

@stale stale bot closed this as completed Nov 14, 2019
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Nov 15, 2019
@Mr0grog Mr0grog reopened this Nov 15, 2019
@Mr0grog
Member Author

Mr0grog commented Nov 15, 2019

I’ve been holding off on this since @danielballan is in the middle of splitting off this code into https://github.com/edgi-govdata-archiving/wayback. This should still be done, but in that new repo once it’s ready.

@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Nov 28, 2019
@danielballan
Contributor

Note to selves: once this is closed, it might be kind to state in the release notes how to migrate wayback v0.1 code to whatever API we settle on for timemap, if doing so is not too much trouble.

@Mr0grog
Member Author

Mr0grog commented Dec 2, 2019

FWIW, I think the API (from a user of this package’s perspective) would be the same. The Timemap CDX API (which, to be clear, is not the timemap API, which is a whole other thing!):

  • Returns data in the same format as the CDX API, but has some extra fields on the end that aren’t generally useful unless you have access to internal archive.org services (supposedly these will be removed from the public API at some point).

  • Does paging differently, but we don’t expose access to the paging in our Python API anyway, so this should mostly be an implementation detail that is largely invisible to a user. (In the current CDX API, you can paginate via resumeKey or via actual page size & number, but the latter will not give you recent data. In the new Timemap CDX API, there is no resumeKey and you must use page size & number, but it should include up-to-date data.)
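To illustrate the first point, a parser can read the leading fields and set aside whatever trails them. This is only a sketch: the field order is assumed from the usual CDX output, the sample line is invented, and `CdxRecord` and `parse_cdx_line` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class CdxRecord(NamedTuple):
    # ASSUMED order of the leading, space-separated fields.
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    # Anything after the fields above; may be empty and may disappear later.
    extra: Tuple[str, ...]


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one result line, tolerating extra trailing fields."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"Expected at least 6 fields, got {len(fields)}")
    return CdxRecord(*fields[:6], extra=tuple(fields[6:]))


# Invented sample line, not real archive data.
sample = "gov,epa)/ 20190101000000 https://www.epa.gov/ text/html 200 ABC123 1234 5678 example.warc.gz"
record = parse_cdx_line(sample)
```

Keeping the trailing fields in a separate tuple means calling code never depends on them being present.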

@danielballan
Contributor

Ah, I was conflating the Timemap CDX API with the timemap API. I have half-absorbed the fact that they are different things, but I got confused here. Which one did wayback v0.1 implement?

@Mr0grog
Member Author

Mr0grog commented Dec 2, 2019

Wayback v0.1 implemented the Timemap API (not Timemap CDX, which isn’t really its name, but it doesn’t have one, and ¯\_(ツ)_/¯).

If helpful (since Wayback APIs are a half-documented, scattered situation):

The CDX API, which lets you search through a CDX-based index (and returns a subset of fields from each matching CDX record), is at http://web.archive.org/cdx/search/cdx

The “Timemap CDX” API is the same thing, but uses different code and (I think?) a separate CDX index, and is at http://web.archive.org/web/timemap/cdx

(I call it “Timemap CDX” because of the URL. I have also heard “new CDX,” “beta CDX,” “CDX v2,” etc.)

The Timemap API is part of the Memento protocol (guide, RFC, Wayback-specific “docs”) which is a semi-standard agreed to by lots of archives. It doesn’t allow searching (it just lists mementos for a given URL), and lists results in HTTP Link header format at http://web.archive.org/web/timemap/link/<url>, e.g. http://web.archive.org/web/timemap/link/https://www.epa.gov/

(There is supposed to be an official JSON format, but I don’t know how to get it from Wayback. http://web.archive.org/web/timemap/json/<url> returns timemap data in CDX-json format, which is 🤷‍♀)
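For contrast with the CDX-style output, here is a small sketch of reading the link-format body that the Timemap API returns. The sample text is invented (shaped after the Memento RFC's examples), the regular expressions only handle quoted attribute values, and the function names are hypothetical.

```python
import re

# One entry looks like: <target-url>; attr="value"; attr="value",
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_format(text):
    """Return one dict per entry, with the target under the "url" key."""
    entries = []
    for target, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        attrs["url"] = target
        entries.append(attrs)
    return entries


def mementos(entries):
    """Keep entries whose rel value includes the token "memento"."""
    return [e for e in entries if "memento" in e.get("rel", "").split()]


# Invented sample, not a real response from web.archive.org.
SAMPLE = '''<https://www.epa.gov/>; rel="original",
<http://web.archive.org/web/timemap/link/https://www.epa.gov/>; rel="self"; type="application/link-format"; from="Thu, 01 Jan 2015 00:00:00 GMT",
<http://web.archive.org/web/20150101000000/https://www.epa.gov/>; rel="first memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",
'''

captures = mementos(parse_link_format(SAMPLE))
```

Matching on quoted values instead of splitting on commas matters because the datetime values themselves contain commas.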

@Mr0grog
Member Author

Mr0grog commented Dec 2, 2019

I kind of feel like Timemap may be redundant when you have CDX available (since you can always search CDX for an “exact” [really SURT, not exact] URL match). But it’s possible timemap may be more optimized.

@Mr0grog
Member Author

Mr0grog commented Dec 2, 2019

Also, best documentation link I know of is here: https://archive.readme.io/docs

It’s mostly links to other docs, but at least it gets most of the APIs listed. (Not sure how much it’s kept up-to-date, though. 🙁)

@Mr0grog
Member Author

Mr0grog commented Oct 26, 2022

Some updates here from recent conversations:

  • The old CDX search (/cdx/search/cdx) has some real funky issues around limit and showResumeKey that were major drivers for this new CDX search (/web/timemap/cdx). (See #65, “Add a default limit to WaybackClient.search()”.)
  • The new search supports limit, but not showResumeKey, and doesn’t do weird stuff with limit.
  • The new search only paginates with page + pageSize (which are still about blocks; pageSize does not refer to a number of results). It is reliable and includes all the indexes (so it’s up-to-date).
  • BUT if you use a non-exact search (i.e. matchType=prefix|host|domain or you use an * in the URL), it does not include the index for recent SavePageNow captures. It takes roughly 3 days for things in that index to make it into other indexes that do support those queries. So there are still caveats here, but they are simpler to explain and are actually pretty predictable (the out-of-date issue is only a few days, not a few months).
  • archive.org is doing a slow transition to the new search, using it for some things under the hood to test it out.
  • Eventually (no concrete timeline yet) the old search will be replaced with the new one.
  • The new search includes extra fields (length, offset, WARC filename) that they expect to remove when replacing the old search, so we should not expect them to always be present.
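The recency caveat above is predictable enough to encode as a check. A minimal sketch, with a hypothetical function name, that flags queries which (per the notes above) would not see the recent SavePageNow index:

```python
# matchType values that make a search non-exact, per the notes above.
NON_EXACT_MATCH_TYPES = {"prefix", "host", "domain"}


def may_miss_recent_captures(url, match_type="exact"):
    """True if the query is non-exact: a non-exact matchType or a "*" in the URL.

    Such queries reportedly lag by roughly 3 days for SavePageNow captures.
    """
    return match_type in NON_EXACT_MATCH_TYPES or "*" in url
```

A client could use a check like this to warn callers or to document the lag, without changing how the query is sent.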

So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):

  1. search_v1() uses /cdx/search/cdx and paginates via showResumeKey (i.e. what is currently called search()).
  2. search_v2() uses /web/timemap/cdx and paginates via page + pageSize (i.e. the new search).
  3. search() just forwards to one of those implementations.
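As a sketch of that shape (not the package's actual code), the three methods could be laid out like this. The class name, the `use_timemap` flag, and the returned "plan" dictionaries are all placeholders standing in for real HTTP requests.

```python
class SearchSketch:
    """Strawman layout for the three proposed methods (hypothetical class)."""

    def __init__(self, use_timemap=False):
        # Hypothetical switch for which implementation search() forwards to.
        self.use_timemap = use_timemap

    def search_v1(self, url, **params):
        # Old endpoint, paginated via showResumeKey.
        return self._plan("/cdx/search/cdx", "showResumeKey", url, params)

    def search_v2(self, url, **params):
        # New endpoint, paginated via page + pageSize.
        return self._plan("/web/timemap/cdx", "page + pageSize", url, params)

    def search(self, url, **params):
        # Forward to one of the two implementations.
        impl = self.search_v2 if self.use_timemap else self.search_v1
        return impl(url, **params)

    @staticmethod
    def _plan(path, paging, url, params):
        # Placeholder: describe the request instead of performing it.
        return {"path": path, "paging": paging, "params": {"url": url, **params}}


old_plan = SearchSketch().search("epa.gov", limit=10)
new_plan = SearchSketch(use_timemap=True).search("epa.gov", limit=10)
```

Keeping `search()` as a thin forwarder means the default can change later without breaking callers who picked a specific implementation.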

I’m also thinking we might want to rename the search*() methods to listMementos() or listCaptures() or something, since the Internet Archive now has an actual free-text search of Wayback (e.g. https://web.archive.org/web/*/environment, which is powered by https://web.archive.org/__wb/search/anchor?q=<text>). There are also some endpoints at https://be-api.us.archive.org/ia-pub-fts-api, /services/search/v1/scrape, and /advancedsearch.php, but I don’t know enough about their differences or pros and cons.

That renaming might be out of scope here, though.

Mr0grog added a commit that referenced this issue Nov 2, 2022
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested.

This commit is a start, but we still need to do more detailed testing and talk more with the Wayback Machine team about things that are unclear here. I'm also not sure if `filter`, `collapse`, `resolveRevisits`, etc. are actually supported.

Fixes #8.
@Mr0grog Mr0grog linked a pull request Nov 2, 2022 that will close this issue
5 tasks
@Mr0grog Mr0grog added this to the v0.4.x milestone Nov 10, 2022
Mr0grog added a commit that referenced this issue Dec 20, 2022
@Mr0grog
Member Author

Mr0grog commented Dec 13, 2023

Circling back on the naming issue here, my current feeling is that the name should involve timemap rather than v2. The two have existed alongside each other for a long time now, and it’s no longer clear exactly what the migration or succession path is supposed to be. At one point I was told that the old CDX search at /cdx/search/cdx would call into the new implementation at /web/timemap/cdx under the hood, but trying the two confirms that they hit different backend servers and behave differently, and it’s been several years.
