Extract performance is extremely slow on megawarcs #9

Open
gwern opened this issue Nov 29, 2015 · 1 comment

gwern commented Nov 29, 2015

I was recently working with a megawarc from the Google Reader crawl, around 25GB in size, on an Amazon EC2 server. It took a few hours to download, and from past experience with gunzip, I knew decompressing and writing it to disk would take a similar amount of time.

I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO of writing each record to a different file should add minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, even though gunzipping should be IO-bound, not CPU-bound (which suggests an algorithmic problem somewhere to me). After 3 days it still had not extracted all the files from the megawarc, and I believe it was less than three-quarters done; unfortunately, it crashed at some point on the third day, so I never found out how long it would take. (I also don't know why the crash happened, and at 3 days per run to reach another crash with more logging enabled, I wasn't going to find out.)

This slowness makes warcat not very useful for working with a megawarc, and I wound up taking a completely different approach: using dd with the offset/length fields from the CDX metadata to pull out just the specific WARC records I needed.
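
Since each record in a .warc.gz is stored as an independent gzip member, that approach amounts to seeking to the CDX offset, reading the recorded length, and gunzipping just that slice. A minimal sketch in Python, assuming the CDX index supplies the compressed record offset and size (the V and S fields in the usual CDX legend):

```python
import gzip

def extract_record(megawarc_path, offset, length, out_path):
    """Pull one record out of a .warc.gz by the byte offset/length taken
    from the CDX index, then decompress it. Roughly equivalent to:
        dd if=mega.warc.gz bs=1 skip=OFFSET count=LENGTH | gunzip > out
    This works because each record is its own gzip member, so the slice
    decompresses independently of the rest of the megawarc.
    """
    with open(megawarc_path, 'rb') as f:
        f.seek(offset)
        compressed = f.read(length)
    with open(out_path, 'wb') as out:
        out.write(gzip.decompress(compressed))
```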

chfoo (Owner) commented Dec 2, 2015

Yeah, the thing is ridiculously slow. You can't expect gunzip performance because it's not just gunzipping: when extracting, it has to parse every WARC record, since the format is human-readable, and string processing in Python is not that great.
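
To make the overhead concrete, here is a minimal sketch (not warcat's actual code) of the per-record work that extraction implies: every record starts with plain-text headers that must be scanned and split in pure Python before the body can be copied out.

```python
def read_warc_record(stream):
    """Read one WARC record from a decompressed byte stream:
    a version line, CRLF-terminated text headers, a blank line,
    then exactly Content-Length bytes of body."""
    version = stream.readline()                  # e.g. b"WARC/1.0\r\n"
    if not version.strip():
        return None                              # end of stream
    headers = {}
    for line in iter(stream.readline, b'\r\n'):  # headers end at a blank line
        key, _, value = line.decode('ascii', 'replace').partition(':')
        headers[key.strip()] = value.strip()
    body = stream.read(int(headers['Content-Length']))
    stream.readline()                            # consume the two CRLFs
    stream.readline()                            # separating records
    return headers, body
```

Each of those decode/partition/strip calls runs for every header line of every record, which is where the 100% CPU goes.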

I've been meaning to write a faster implementation, but I haven't gotten the incentive to do so. 😞

I am open to any suggestions.
