Deduplication & "Not Modified" WARC Records #224

@PsypherPunk

When crawling with Heritrix, if both sendIfModifiedSince and writeRevisitForNotModified are set to true (the latter has been deprecated, presumably making it equivalent to always being true), a server may reply with an empty "Not Modified" response, and a WARC record like the following can be written (taken from the warc-specification project):

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.194.151.38
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Truncated: length
WARC-Etag: "4078134-aed6-6117a140"
WARC-Record-ID: <urn:uuid:d41c9044-fad4-402a-bdc8-ff6c63d0f419>
Content-Length: 0

Here the WARC-Payload-Digest has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
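
As a quick check of that claim, the following sketch hashes a zero-length payload and formats it the way the record above does (SHA-1, Base32-encoded, `sha1:` prefix). The function name `warc_sha1_digest` is made up for illustration and is not part of Heritrix or OpenWayback.

```python
import base64
import hashlib


def warc_sha1_digest(payload: bytes) -> str:
    """Format a SHA-1 digest in the 'sha1:<base32>' form seen in the record above.

    Hypothetical helper for illustration only.
    """
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of an empty (zero-length) payload.
EMPTY_PAYLOAD_DIGEST = warc_sha1_digest(b"")
print(EMPTY_PAYLOAD_DIGEST)  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

The printed value is identical to the WARC-Payload-Digest in the record, which is consistent with the digest having been computed over no content at all rather than over the original payload.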

The WARC specification does say:

For records using this profile, the payload is defined as the original payload content from which a 'LastModified' and/or 'ETag' value was taken.

Whether this means that the WARC-Payload-Digest should be calculated on the revisited record, I'm not sure. However, the above is a live, written WARC, so we should probably figure out how to handle such records.
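
One possible first step, sketched here purely as an assumption and not as existing OpenWayback behaviour, is to recognise records whose digest cannot identify the original payload: a revisit record using the server-not-modified profile whose WARC-Payload-Digest is the digest of empty content. The helper names `parse_warc_headers` and `digest_is_unusable` are hypothetical, and the sample uses only a subset of the headers from the record above.

```python
import base64
import hashlib

NOT_MODIFIED_PROFILE = "http://netpreserve.org/warc/1.0/revisit/server-not-modified"
# SHA-1 of zero bytes, Base32-encoded, in the form used by the record above.
EMPTY_SHA1 = "sha1:" + base64.b32encode(hashlib.sha1(b"").digest()).decode("ascii")


def parse_warc_headers(block: str) -> dict:
    """Parse 'Name: value' header lines into a dict keyed by lower-cased name.

    Hypothetical helper; lines without a colon (e.g. 'WARC/1.0') are skipped.
    """
    headers = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def digest_is_unusable(headers: dict) -> bool:
    """True if this is a not-modified revisit whose digest is that of empty content."""
    return (
        headers.get("warc-type") == "revisit"
        and headers.get("warc-profile") == NOT_MODIFIED_PROFILE
        and headers.get("warc-payload-digest") == EMPTY_SHA1
    )


SAMPLE = """WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
Content-Length: 0
"""

sample_headers = parse_warc_headers(SAMPLE)
print(digest_is_unusable(sample_headers))
```

For records flagged this way, a replay tool would need some key other than the digest to locate the original capture (for example the WARC-Target-URI together with an earlier WARC-Date); which key is appropriate is exactly the open question here.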
