Description
When crawling using Heritrix, if both sendIfModifiedSince and writeRevisitForNotModified are set to true (although the latter has been deprecated, presumably equivalent to always being true), a server may respond with an empty response and a WARC record like the following can be written (taken from the warc-specification project):
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.194.151.38
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Truncated: length
WARC-Etag: "4078134-aed6-6117a140"
WARC-Record-ID: <urn:uuid:d41c9044-fad4-402a-bdc8-ff6c63d0f419>
Content-Length: 0
Here the WARC-Payload-Digest has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
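To check the claim about the digest, here is a minimal Python sketch that recomputes the value, assuming the digest is a base32-encoded SHA-1 as the `sha1:` prefix in the record suggests. The helper names `warc_payload_digest` and `has_empty_payload_digest`, and the plain dict standing in for parsed WARC headers, are hypothetical and not part of Heritrix or OpenWayback.

```python
import base64
import hashlib


def warc_payload_digest(payload: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form seen in the record above."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of a zero-length payload.
EMPTY_PAYLOAD_DIGEST = warc_payload_digest(b"")


def has_empty_payload_digest(headers: dict) -> bool:
    """True if a revisit record carries a digest computed over zero bytes.

    `headers` is a hypothetical dict of already-parsed WARC header fields.
    """
    return (
        headers.get("WARC-Type") == "revisit"
        and headers.get("WARC-Payload-Digest") == EMPTY_PAYLOAD_DIGEST
    )


# Header fields copied from the record in this issue.
record = {
    "WARC-Type": "revisit",
    "WARC-Target-URI": "http://www.bl.uk/",
    "WARC-Payload-Digest": "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ",
    "WARC-Profile": "http://netpreserve.org/warc/1.0/revisit/server-not-modified",
}

print(EMPTY_PAYLOAD_DIGEST)
print(has_empty_payload_digest(record))
```

A check like this could let a replay or indexing tool flag such records and fall back to something other than a digest lookup, though which fallback is appropriate is still an open question.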
The WARC spec. does say that:
For records using this profile, the payload is defined as the original payload content from which a 'LastModified' and/or 'ETag' value was taken.
Whether this means that the WARC-Payload-Digest should be calculated on the original payload of the revisited record, I'm not sure. However, the above is a live, written WARC, so we should probably figure out how to handle such records.