Feature: extract WARCs specified with index/length #7

Open
gwern opened this issue Nov 29, 2015 · 1 comment

gwern commented Nov 29, 2015

In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to extract them all using warcat.

One might have already checked the CDX files (to find which mega WARC to download) and so know the index (byte offset) and length of the record. Given those, it's possible to seek directly in the mega WARC and extract the sequence of bytes which makes up a particular WARC. For example, using a CDX line like

[...] unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

I can handwrite the extraction using dd, passing the index as skip and the length as count:

$ dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.warc.gz bs=1 && gunzip 2.warc.gz
1326824+0 records in
1326824+0 records out
1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s

That is >11,200x faster than extracting everything with warcat and then looking for the file I need.

The downsides are that it means messing with dd, is totally inaccessible to non-programmers, is inconvenient to script, etc.

It'd be great if warcat could include some additional arguments to the extract functionality, like a pair of --length=n and --index=i flags, to provide a nicer interface for pulling out a few WARCs.
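
As a rough sketch of what such flags might do internally (this is not warcat's actual code; `parse_cdx_tail` and `extract_record` are hypothetical names), assuming each record in the mega WARC is its own gzip member and the CDX line ends with length, index, filename as in the example above:

```python
import gzip


def parse_cdx_tail(cdx_line):
    """Return (length, index, filename) from the last three CDX fields.

    Assumption: the line ends with "<length> <index> <filename>",
    matching the example line in this issue.
    """
    _, length, index, filename = cdx_line.rsplit(None, 3)
    return int(length), int(index), filename


def extract_record(megawarc_path, index, length, out_path=None):
    """Read `length` bytes starting at byte `index` and gunzip them."""
    with open(megawarc_path, "rb") as f:
        f.seek(index)
        compressed = f.read(length)
    if len(compressed) != length:
        raise ValueError("short read: index/length go past the end of the file")
    if out_path is not None:
        # Keep the still-compressed record, like the dd output file.
        with open(out_path, "wb") as out:
            out.write(compressed)
    # Assumption: the slice is a complete gzip member.
    return gzip.decompress(compressed)
```

Unlike dd with bs=1, this seeks once and reads the whole slice in a single call, so it should not be limited by one-byte reads.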

This would also go very well with HTTP Range support; then you could look up the index/length in a CDX file, seek right to the specific binary sequence on Archive.org, and download only the few MB you need instead of, say, a giant 52GB megawarc. (You could imagine building an on-demand extraction service using this: store only the master index on your server, and when a user requests a particular file, look up the WARC index/length in the master index, call warcat to extract the specific WARC from the IA-hosted megawarc, and return that to the user. So you don't need to store all 9TB or whatever.)
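
To illustrate the Range idea, here is a minimal sketch using only Python's standard library. `range_header` and `fetch_range` are hypothetical helpers, not part of warcat, and the URL is left as a parameter since I'm not assuming any particular download path. It also assumes the server honors Range requests and answers with 206 Partial Content.

```python
import urllib.request


def range_header(index, length):
    """Build an HTTP Range value for `length` bytes starting at `index`."""
    if index < 0 or length <= 0:
        raise ValueError("index must be >= 0 and length must be > 0")
    # Range end positions are inclusive, hence the -1.
    return "bytes={}-{}".format(index, index + length - 1)


def fetch_range(url, index, length):
    """Download only the requested slice of a remote file."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(index, length)}
    )
    with urllib.request.urlopen(request) as response:
        # A 200 here would mean the server ignored the Range header
        # and is about to send the entire file.
        if response.status != 206:
            raise RuntimeError("server did not return a partial response")
        return response.read()
```

For the CDX line above, `range_header(19810951910, 1326824)` gives `bytes=19810951910-19812278733`, and the bytes returned by `fetch_range` could then be gunzipped just like the dd output.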

chfoo (owner) commented Dec 2, 2015

Also, to add: Warcat currently uses Python's built-in HTTP library, which does not handle the edge cases that web browsers do.
