Extract performance is extremely slow on megawarcs #9

Open
gwern opened this issue Nov 29, 2015 · 1 comment

gwern commented Nov 29, 2015

I was recently working with a megawarc from the Google Reader crawl, around 25GB in size, on an Amazon EC2 server. It took a few hours to download, and from past experience with gunzip, I knew decompressing and writing it to disk would take a similar amount of time.

I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO of writing each record to a different file should add minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, even though gunzipping should be IO-bound, not CPU-bound (which suggests an algorithmic problem somewhere to me). After 3 days it still had not extracted all the files from the megawarc, and I believe it was less than three-quarters done; unfortunately, it crashed at some point on the third day, so I never found out how long it would take. (I also don't know why the crash happened, and at 3 days per run to reach another crash with more logging enabled, I wasn't going to find out.)

This slowness makes warcat not very useful for working with a megawarc, and I wound up taking a completely different approach: using dd with the offset/length fields from the CDX metadata to pull out just the specific WARC records I needed.
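
Since each record in a .warc.gz is stored as an independent gzip member, that approach amounts to seeking to the CDX offset, reading the recorded length, and gunzipping just that slice. A minimal sketch in Python, assuming the CDX index supplies the compressed record offset and size (the V and S fields in the usual CDX legend):

```python
import gzip

def extract_record(megawarc_path, offset, length, out_path):
    """Pull one record out of a .warc.gz by the byte offset/length taken
    from the CDX index, then decompress it. Roughly equivalent to:
        dd if=mega.warc.gz bs=1 skip=OFFSET count=LENGTH | gunzip > out
    This works because each record is its own gzip member, so the slice
    decompresses independently of the rest of the megawarc.
    """
    with open(megawarc_path, 'rb') as f:
        f.seek(offset)
        compressed = f.read(length)
    with open(out_path, 'wb') as out:
        out.write(gzip.decompress(compressed))
```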

chfoo (Owner) commented Dec 2, 2015

Yeah, the thing is ridiculously slow. You can't expect gunzip performance because it's not just gunzipping: when extracting, it has to parse every WARC record, since the format is human-readable, and string processing in Python is not that great.
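
To make the overhead concrete, here is a minimal sketch (not warcat's actual code) of the per-record work that extraction implies: every record starts with plain-text headers that must be scanned and split in pure Python before the body can be copied out.

```python
def read_warc_record(stream):
    """Read one WARC record from a decompressed byte stream:
    a version line, CRLF-terminated text headers, a blank line,
    then exactly Content-Length bytes of body."""
    version = stream.readline()                  # e.g. b"WARC/1.0\r\n"
    if not version.strip():
        return None                              # end of stream
    headers = {}
    for line in iter(stream.readline, b'\r\n'):  # headers end at a blank line
        key, _, value = line.decode('ascii', 'replace').partition(':')
        headers[key.strip()] = value.strip()
    body = stream.read(int(headers['Content-Length']))
    stream.readline()                            # consume the two CRLFs
    stream.readline()                            # separating records
    return headers, body
```

Each of those decode/partition/strip calls runs for every header line of every record, which is where the 100% CPU goes.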

I've been meaning to write a faster implementation, but I haven't gotten the incentive to do so. 😞

I am open to any suggestions.
