[Feature]: Deduplicate crawls #2860

@SuaYoo

Description


Repeated crawling of the same sites often yields duplicate data, which significantly increases the storage needed for web archives. Deduplication reduces storage use by storing each unique piece of content only once.
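For illustration, here is a minimal sketch of content-hash deduplication, the general principle behind WARC "revisit" records: hash each response payload and, when a digest repeats, store a small reference to the original capture instead of the full payload. This is not Browsertrix's actual design; `DedupIndex` and its API are hypothetical names for this sketch.

```python
import hashlib

class DedupIndex:
    """Hypothetical index mapping payload digests to the first capture seen."""

    def __init__(self):
        self._seen = {}  # digest -> (url, record_id) of the original capture

    def check(self, url, payload, record_id):
        digest = "sha256:" + hashlib.sha256(payload).hexdigest()
        original = self._seen.get(digest)
        if original is None:
            # First capture with this payload: remember it and store in full.
            self._seen[digest] = (url, record_id)
            return ("store", digest, None)
        # Duplicate payload: store only a small reference (cf. WARC revisit).
        return ("revisit", digest, original)


index = DedupIndex()
action, digest, original = index.check("https://example.com/", b"<html>...</html>", "rec-1")
# -> ("store", ...): payload not seen before, write the full record
action, digest, original = index.check("https://example.com/", b"<html>...</html>", "rec-2")
# -> ("revisit", ...): duplicate payload, write only a pointer to rec-1
```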

Requirements

See subtasks

Context

Related:

Metadata

Labels

- back end: Requires back end dev work
- feature design: This issue tracks smaller sub-issues that compose a feature
- front end: Requires front end dev work
- ui/ux: This issue requires UI/UX work

Projects

Status

Todo
