This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

Prevent collect.js from sending duplicate DOM content to collector #9

Open
groovecoder opened this issue May 21, 2015 · 1 comment


@groovecoder
Contributor

Every visitor to a page re-sends the full DOM content to the webalyzer backend.

Instead, collect.js should calculate a hash of the DOM content, and send a check request to the webalyzer backend (e.g., HEAD /collector/{hash}) to see if it already has the DOM content.

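If the backend has the DOM content, it should increment a counter for that page and signal that to collect.js (for a HEAD request, which carries no response body, this would be a success status rather than a returned value).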

If the backend does NOT have the DOM content, collect.js should send the DOM content along with the hash value for the backend to store.
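
The check-then-send flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than as collect.js code, `InMemoryCollector` is a hypothetical stand-in for the webalyzer backend, and MD5 is an assumed choice since the proposal does not name a hash function.

```python
import hashlib


def dom_hash(html):
    """Hex digest of the page HTML. MD5 is an assumption, not part of the proposal."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


class InMemoryCollector:
    """Hypothetical stand-in for the webalyzer backend."""

    def __init__(self):
        self.pages = {}    # hash -> stored HTML
        self.repeats = {}  # hash -> number of repeat visits

    def check(self, digest):
        """Models HEAD /collector/{hash}: True if known, and counts the repeat."""
        if digest in self.pages:
            self.repeats[digest] += 1
            return True
        return False

    def store(self, digest, html):
        """Models the full upload that follows a failed check."""
        self.pages[digest] = html
        self.repeats[digest] = 0


def collect(html, backend):
    """Send the full HTML only if the backend does not know its hash.

    Returns True when the full HTML had to be sent.
    """
    digest = dom_hash(html)
    if backend.check(digest):
        return False
    backend.store(digest, html)
    return True
```

With this sketch, the first visitor to a page pays for the full upload, and every later visitor with identical DOM content sends only the hash.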

@peterbe
Contributor

peterbe commented May 22, 2015

Excellent idea! I can imagine a diff like this:

+import hashlib
+
+from django.dispatch import receiver
+
 class Page(models.Model):
     domain = models.CharField(max_length=100, db_index=True)
     url = models.URLField()
     html = models.TextField()
     size = models.PositiveIntegerField(default=0)
     title = models.TextField(null=True)
     added = models.DateTimeField(auto_now_add=True)
     modified = models.DateTimeField(auto_now=True)
+    html_hash = models.CharField(max_length=32, db_index=True)
+    repeats = models.PositiveIntegerField(default=0)

+@receiver(models.signals.pre_save, sender=Page)
+def set_html_hash(sender, instance, **kwargs):
+    # Fill in the hash once, just before the page is first saved.
+    if not instance.html_hash:
+        # md5 needs bytes; its hex digest is 32 characters (max_length=32).
+        instance.html_hash = hashlib.md5(instance.html.encode('utf-8')).hexdigest()
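
As a standalone illustration of the hashing step in that signal handler, here is a minimal sketch that runs without Django. The helper name `html_hash` is hypothetical. It encodes text to UTF-8 first, because `hashlib.md5` accepts only bytes on Python 3, and the resulting hex digest is always 32 characters, which is why `max_length=32` is enough for the new column.

```python
import hashlib


def html_hash(html):
    """Return the MD5 hex digest of a page's HTML.

    Accepts either text or bytes; text is encoded as UTF-8 before hashing.
    """
    if isinstance(html, str):
        html = html.encode("utf-8")
    return hashlib.md5(html).hexdigest()
```

Encoding explicitly also keeps the digest stable for pages that contain non-ASCII characters.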

@stephaniehobson stephaniehobson self-assigned this Jun 5, 2015
stephaniehobson added a commit that referenced this issue Jun 29, 2015
Issue #9 - Prevent duplicate HTML collection