Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CachedViews #146

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Add CachedViews #146

wants to merge 14 commits into from

Conversation

iamkroot
Copy link

@iamkroot iamkroot commented May 11, 2022

Motivation

  • Over at trakt-scrobbler, some of my confuse templates are getting more and more computationally expensive. Example: I have a template that needs to parse a list of strings as Regex patterns.
  • Repeatedly computing the view value according to the template is not feasible in the hot path, so I have to settle with doing a view.get(template) in the global scope once when the module loads.
  • My app is meant to be used as a long running service. Whenever the user changes a config value externally (via a cli), they have to restart the service to have their change registered, because of the limitation above.
  • It would be good if the updated value could be reflected immediately from the next access of the view.

Insight

The main pattern to notice is that we are extremely likely to use the same template to parse any given view over time. Once we decide the specific template for a view, we want to keep calling get with the same template over and over again (in case the value has changed).

Method

This calls for the introduction of a caching layer that can store the values for expensive templates. In most cases, the config gets are much more frequent than sets, so it will benefit from such caching.

API

  • The PR adds a new subclass of ConfigView, called the CachedConfigView.
  • The main addition to it is the get_handle(self, template) method. Instead of computing the value of the view according to the given template eagerly, it returns a CachedConfigHandle object that can be used as a proxy for the value.
  • In the first call to get on a CachedConfigHandle, the template is applied to the view, and the value is stored internally. All subsequent gets can re-use this cached value.
  • When the view is updated from elsewhere, all handle objects for that view are automatically invalidated, and the next call to get on them will cause a re-computation of the value.
  • Note that we still retain the ability to call get directly on a view for cases when the caching doesn't make sense (cheap to compute values).

Example

To view this in action, see the ipc branch of my app.
You can also get an overview of the API from the unit tests.

TODO

This is just a draft PR, tailored to my specific use-case. I am sure I will have missed some edge cases.

  • The nomenclature of invalidate and unset is misleading, and does not fit the usual meanings - an invalidated cache is usually supposed to be fixable by performing a recomputation, but that is unset here. The invalidated here means the underlying view itself has been rendered invalid, probably because some of its parents' values changed without preserving the structure. Need to find better names for these. (Can't tackle both the hard things in Computer Science at once 😮‍💨)
  • Need to add more docs/comments
  • Add more test cases
  • Fix type hints; make them Py2 compatible maybe?
  • Do we really need a get_handle on the RootView?
  • Probably need to make it thread-safe

I will be happy to clean it up after an initial review from your end.

Related

This PR is kind of the other half of #130. Once this is done, I probably will need to solve the "persist changes to disk" problem too.

@iamkroot iamkroot marked this pull request as ready for review May 11, 2022 17:15
@sampsyo
Copy link
Member

sampsyo commented May 12, 2022

Awesome; this is a super cool idea! I’d love to incorporate this.

Here are some initial high-level impressions:

  • A very small nit: maybe this new stuff should live in a confuse.cache module instead of directly in confuse.core.
  • I see that calling get on an invalidated handle raises an error. It could be marginally more convenient to instead just automatically refresh the value in the handle when this happens. We still have the view and the template, so hopefully this is just a matter of calling view.get(template)?
  • This is not exactly an actionable note, but just to record some thoughts… your design got me thinking about whether it would be possible to eliminate the distinction between cached handles and views themselves. The special CachedViewMixin would take care of caching itself. That is, you could call view.cache_get(tmpl) or similar, and this would mean that future calls for the same view.cache_get(tmpl) would return a cached value. The cached source would then just need to invalidate all these views. However, the problem with this would be that the view would need to maintain a template-to-value mapping... which would mean views would need to be equatable, which seems tricky. So the handle “layer” is probably here to stay.

I don’t think we need to worry about py2 support; we should officially drop that if we haven’t already.

@iamkroot
Copy link
Author

iamkroot commented May 15, 2022

Thanks for giving this a look!

maybe this new stuff should live in a confuse.cache module

Will do.

automatically refresh the value in the handle when this happens

There is a slight confusion in the naming as I mentioned in the first TODO. The "invalid" state in the code today means that the underlying view itself has been invalidated - the key doesn't exist due to some structural modification. In this case, it is not possible to get any value from the current handle, no matter the template. We have to raise an exception in this case. See unit tests for some examples. Do note that this is will be a pretty uncommon occurance- the code is here for the sake of completeness.

For cases when the view does remain valid, we keep the value as undefined, and it does get refreshed on the next call to handle.get() by doing a view.get(template).

How about renaming the current undefined to invalid (to match general usage), and the current invalid to something like invalid_view or destroyed_view?

you could call view.cache_get(tmpl)

I did explore this API initially, but as you noticed, it will require keeping a mapping from templates to values inside a view. The major issue here would be hashing the template on every call to cache_get. Many templates wrap over dicts and lists internally, which would have to be frozen before being hashed. This computation would be undesired for the common case, where the template structure doesn't change at all. The handle abstraction is superior for this reason IMO.

I don’t think we need to worry about py2 support; we should officially drop that if we haven’t already.

I see that the Build CI has failed on the Py2.7 setup because of the new type annotation syntax. Could you please specify what versions of Python would you like to target going forward? (I'd prefer >=3.6 due to the type syntax changes)

@sampsyo
Copy link
Member

sampsyo commented May 15, 2022

Aha, thank you for explaining! I obviously read your original explanation of the "invalid" state too quickly. This makes sense! I suppose I like the idea of renaming the state. Maybe something like "missing" would be most evocative?

(I know this is a stretch, but as someone who thinks about cache coherence protocols in my day job, I tend to think of "invalid" as meaning merely "I don't have the data locally, but I know where to go get it if anybody needs it." Not that this is the normal usage in the real world, of course…)

And good point about the CI and Python 2. Yes, the new minimum should be 3.6 (matching beets). I'll work on updating the docs, metadata, and Actions config…

Also removes the ConfigHandleInvalidatedError since it is the same as
NotFoundError.
@iamkroot
Copy link
Author

iamkroot commented May 15, 2022

I realized that ConfigHandleInvalidatedError that I'd cooked up is pretty much the same as a NotFoundError that is raised when the value (really, the view) is missing. So now we raise this error in the same circumstances.

Do we want to support providing a default value for this case as an arg to get_handle or CachedHandle.get()?

@sampsyo
Copy link
Member

sampsyo commented May 16, 2022

That unification of exception types seems like a good idea to me!

If I understand correctly, I think it's probably not necessary to add a second "layer" of defaults… that is, hopefully, you can get all the default behavior you want by using appropriate templates and fallback sources. Maybe put differently, it seems like the "law" should be that config.get_handle(view).get() should always produce the same value (or raise the same exception) as config.get(view); as a consequence, you can set up default fallbacks in the same way for both. Does that philosophizing make any kind of sense, roughly?

This matches the behaviour of view.get(template)
@iamkroot
Copy link
Author

That does sound very clean. I have updated the code to call template.get_default_value() when we call handle.get() on a missing view. This will match the existing behaviour of view.get(template) pretty much perfectly (return default val if present in template, otherwise raise a NotFoundError).

Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great!! Here are a few comments/thoughts… one of which is somewhat philosophical; I apologize if that one went off the deep end.

confuse/cache.py Outdated
Comment on lines 11 to 12
_INVALID = object()
_MISSING = object()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some short comments on these to explain what they mean would be helpful?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

confuse/cache.py Outdated Show resolved Hide resolved
# keep track of all the handles from this view
self.handles: List[CachedHandle] = []
# need to cache the subviews to be able to access their handles
self.subviews: Dict[str, CachedConfigView] = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is a good idea, but just to try it out here: would it simplify things at all to keep the list of handles only on the RootView? That is, any time any data changes anywhere in the entire configuration, we'd invalidate everything. I know this is possibly a bit more inefficient, but it would centralize the tracking of handles and avoid the need for all views to keep track of their subviews for invalidation purposes.

The reason I'm slightly nervous (perhaps unfoundedly) about the latter thing is that we have never before required that subviews be unique… that is, doing config['foo'] twice in "normal" Confuse could give you back two different objects (equivalent objects, but different ones). These objects were short-lived and disposable; they generally did not stick around. Now we're saying that they must stick around, and in fact, that doing config['foo'] twice always gives you back exactly the same memoized view object. I'm just a tad worried that this means that, if there were some other way of constructing a view object onto the same "place" in the configuration hierarchy (such as via iteration?), it would not enjoy the same invalidation benefits. So this seems to place a few more restrictions on the way that views can be created, stored, and used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should mention that I could perhaps try prototyping this idea if it seems plausible to you!

confuse/cache.py Outdated Show resolved Hide resolved
@iamkroot
Copy link
Author

iamkroot commented May 18, 2022

Two things-

  1. I took another look at the docs relating to "missing" views (the View Theory section). It says-

    resolve a view at the same granularity you want config files to override each other

    In essence, we will return an older value for a sub-view even when the parent view has been updated to change its structure.

    It feels a bit counter-intuitive to me (why am I able to read an overridden value?), but since this is the expected behaviour, it seems like the "missing" case is not needed at all for our cache. This will simplify the invalidation logic considerably :) Will push an update soon.

  2. Regarding the creation of non-ephemeral subviews, I do agree that it deviates from the existing behaviour. But as far as I can tell, all code paths in which a Subview is created go through the __getitem__ method, which we have overriden (even iteration eventually does self[index]). So I don't think it will pose any danger to return the same subview each time.

    On the other hand, I do agree that this might trip someone up (especially if the user tries to create a CachedSubview manually). Keeping all the handles at the root does sound like a way around this, but I'm not too happy that a single set would invalidate everything. I think it should be possible to maintain the entire tree of handles with the same structure as the views, kinda like a "shadow config". (Though I'm not sure what happens in the "missing" case)

@iamkroot
Copy link
Author

@sampsyo Do we want to resume this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants