I am currently processing a large-ish (on the order of 600GB) batch of WARC files containing a number of dumped homepages.
I am sifting through all of these files for image content, which I then extract and process further. Once in a while, I come across records in the WARC files that cause PIL to emit warnings, all similar to this:
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 14392491347
Remainder: b'l>\r\n'
Looking at the remainder above, it seems to me that PIL is expecting the content to end a little bit too early since it does in fact seem to end with a newline a few characters further ahead.
Now, I cannot quite guess where this problem comes from. I guess it could be an error in the software that originally encoded the image. It could be an error in the software that originally dumped the web page into the WARC file, or perhaps it could be an error in warcio determining the size of the payload in the WARC record?
I would appreciate any help in determining the cause of this problem, and I am happy to help debug it. I am also going to refer to this issue in another issue that I will post shortly and that I suspect may be related.
Hm, perhaps the error message can be improved. I think this is usually a sign that the Content-Length is too short, e.g. in the above case, if it were 2 bytes larger, the record would parse correctly.
Do you have an example that you could share? It would be good to confirm that this is the case, and not warcio messing up on the parsing.
To make this easier, it should probably print the offset of the valid record, so that the record can be extracted for testing.
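To check the Content-Length hypothesis on an extracted, uncompressed record, something like the following rough sketch might help. It uses only the standard library and does not call warcio; the helper names (`content_length_shortfall`, `build_record`) and the synthetic record are made up for illustration. It assumes the record block is followed by the usual two CRLFs, and simply counts how many bytes sit between the declared end of the block and that terminator.

```python
CRLF2 = b"\r\n\r\n"  # terminator expected after each record block


def content_length_shortfall(raw, block_start, declared_length, max_scan=64):
    """Heuristic: return how many bytes lie between the declared end of the
    block and the next CRLF CRLF. 0 means the boundary is where the header
    says it is; None means no terminator was found within max_scan bytes."""
    end = block_start + declared_length
    tail = raw[end:end + max_scan]
    idx = tail.find(CRLF2)
    return None if idx == -1 else idx


def build_record(body, declared_length):
    """Build a synthetic (hypothetical) record with a chosen Content-Length.
    Returns the raw bytes and the offset at which the block starts."""
    header = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: " + str(declared_length).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return header + body + CRLF2, len(header)


body = b"<html>hello</html>"

# Content-Length declared 2 bytes too short, as suspected in this issue.
raw, start = build_record(body, len(body) - 2)
print(raw[start + len(body) - 2:start + len(body) + 2])          # b'l>\r\n'
print(content_length_shortfall(raw, start, len(body) - 2))       # 2
```

A result of 2 here would match the "+2" guess above. Since a payload can legitimately contain CRLF CRLF, this is only a hint, not proof that the header is wrong.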