Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The asset caching mechanism should update assets based on HTTP headers #6

Open
mpiraux opened this issue Oct 4, 2018 · 4 comments
Open
Labels
enhancement New feature or request

Comments

@mpiraux
Copy link
Collaborator

mpiraux commented Oct 4, 2018

Currently, the caching mechanism is not updating assets in cache at any time. It should use the information exposed in HTTP headers (Last-Modified, Cache-Control, etc) to update assets in cache.

We have to define a policy for this update mechanism to trigger. Sending an HTTP request every time the PluginManager sees a slide referencing an external asset seems a bit heavy, but it is the policy that ensures the most consistent cache.

@Drumor
Copy link
Contributor

Drumor commented Oct 18, 2018

Before we discussed about this case, I wrote this in the cache manager

def cache_file_at_url(self, url):
    """ Caches the remote file or retrieves its cached version if one exists. Returns None if an error occurs. """
    filename, extension = os.path.splitext(url)
    cache_asset_hash = hash(str(self.channel_id) + filename)
    asset = Asset.selectBy(plugin_channel=self.channel_id, filename=filename, is_cached=True).getOne(None)
    if asset is not None:
        # image comparaison on
        import urllib.request,web,math
        original = str(url)
        cache_file = ((str(web.ctx.homedomain)+ '/' + asset.path) if asset.path is not None else str(web.ctx.homedomain)+ '/cache/' + str(
            asset.id))
        from PIL import Image
        original_file = io.BytesIO(urllib.request.urlopen(original).read())
        cache_file_open = io.BytesIO(urllib.request.urlopen(cache_file).read())
        image1 = Image.open(original_file)
        image2 = Image.open(cache_file_open)
        h1 = image1.histogram()
        h2 = image2.histogram()
        rms = math.sqrt(sum(map(lambda a, b: (a - b) ** 2, h1, h2)) / len(h1))

        print(rms)
        if not rms == 0:
            asset = None

    if asset is None:
        if CacheManager._get_lock(cache_asset_hash, blocking=False):
            try:
                asset = Asset(plugin_channel=self.channel_id, filename=filename, user=None,
                              extension=extension if extension is not None and len(extension) > 0 else None,
                              is_cached=True)
                self.download_manager.enqueue_asset(asset)
            except (URLError, OSError):
                logger.warning('Exception encountered when attempting to cache file at url %s', url, exc_info=True)
                CacheManager._release_lock(cache_asset_hash)
                return None
            CacheManager._release_lock(cache_asset_hash)
        else:
            CacheManager._get_lock(cache_asset_hash, blocking=True)
            CacheManager._release_lock(cache_asset_hash)
            asset = Asset.selectBy(plugin_channel=self.channel_id, filename=filename, is_cached=True).getOne(None)

    return asset

_name_to_lock = {}
_master_lock = Lock()  # The lock of all locks

It's probably awful and works for image only. I'll change this code and work with your option!

@Drumor
Copy link
Contributor

Drumor commented Oct 18, 2018

For above example, it's not fully functional as it doesn't dereference/rereference the asset.

Drumor added a commit that referenced this issue Oct 18, 2018
@mpiraux
Copy link
Collaborator Author

mpiraux commented Oct 18, 2018

For above example, it's not fully functional as it doesn't dereference/rereference the asset.

It has several flaws actually. First, it assumes that assets are images, this is not true as it can be videos too.
Second, it downloads the whole file at the url and then performs a comparison. There is no gain in spending CPU time to compare the files if it has already been downloaded. The real cost we want to avoid is the download of the remote file. I would expect that using HTTP caching mechanism we can avoid downloading files entirely if not needed.

Regarding #12, this piece of code will be executed each time the PluginManager looks at a slide containing an URL. Opening the remote file and the local file (!) via HTTP(S) there will impact the slides generation latency quite heavily. I think the DownloadManager is more suited to handle the task. It could be extended to refresh an asset using the HTTP If-Modified-Since header.

I implemented this inside the client cache_daemon actually, have a look at these:
https://github.com/UCL-INGI/ICTV/blob/master/ictv/client/cache_daemon.py#L134
https://github.com/UCL-INGI/ICTV/blob/master/ictv/client/cache_daemon.py#L288

@Drumor
Copy link
Contributor

Drumor commented Oct 19, 2018

First part was a first inspiration and can be cancelled.
For #12 , i'll adapt it with your remarks.

@mpiraux mpiraux added the enhancement New feature or request label Nov 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants