I think this is medium-term (not sure we have the time/effort to spare right now), but thinking about all the analysis work @ericnost has been doing, I wonder if we can/should make bulk and/or processed data available, so that large-scale analysis doesn’t require making big API queries and then a boatload of requests for the response body of each version.
Could we host files that bundle up large sets of metadata and response content? (Not sure if the metadata would be best mixed in with the bodies or not. Maybe these are WARCs?)
Could we host files that bundle up large sets of processed response bodies? (E.g. lists of all the links, readable text extracted from HTML and PDF [maybe separated into header/body/footer if we can, maybe plain text or maybe tokenized], etc.)
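To make the "processed response bodies" idea a bit more concrete, here is a rough sketch of what one processed record per version might look like. It only handles HTML, uses nothing but the Python standard library, and every name in it (`LinkAndTextExtractor`, `process_version`, `to_json_lines`, and the record fields) is made up for illustration, not something that exists in our codebase. JSON Lines is just one possible output format.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Collects link targets and visible text from one HTML document."""

    # Content of these tags is not readable text, so it is skipped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def process_version(version_id, url, html):
    """Build one processed record (hypothetical schema) for a version."""
    parser = LinkAndTextExtractor(url)
    parser.feed(html)
    parser.close()
    return {
        "version_id": version_id,
        "url": url,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }


def to_json_lines(records):
    """Serialize records one per line, so a bulk file can be streamed."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

Something like `to_json_lines(process_version(v_id, url, body) for ...)` could then be written out in large batches and hosted as a static file. Splitting text into header/body/footer, tokenizing, and PDF extraction would all need more than this.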
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
Somewhat related to edgi-govdata-archiving/web-monitoring-db#45.
/cc @ericnost @danielballan @jsnshrmn