Making cachalot caches faster by partitioning data to avoid invalidation #164

Open
Andrew-Chen-Wang opened this issue Sep 24, 2020 · 12 comments

Comments

@Andrew-Chen-Wang
Collaborator

Description

Besides the new PR which allows for a context manager to disable cachalot, there needs to be a better way to avoid constant cache invalidation.

The following proposal is based on django-cache-machine's per object caching. Although we're caching every query, we can include some extra information inside that cache key. Inspired by some work with Simple JWT and PRNGs, I believe what we can do is the following:

Each cache key becomes a three-part key. Say a single row is modified: we don't want to invalidate every cache related to that table. Instead, we can use cache key wildcards, e.g. cache.delete("beginning:TABLE CACHE KEY:CACHALOT ORIGINAL GENERATED CACHE KEY"), to find only the cached data that pertains to the affected primary keys of the table.

In other words, the IDs of the objects a query touches are fed into some kind of hashing algorithm to produce the first part of the key. The existing table cache key generator then supplies the second part (giving keys of the form ASD:ASD:ASD). This way, whenever one or more primary keys are modified, we can use that format to target only the matching cache entries.
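
A minimal sketch of what that key layout and wildcard invalidation could look like (names are illustrative; pattern deletes assume access to a raw Redis client, since Django's cache API has no wildcard delete):

from hashlib import sha1

# Hypothetical three-part key: <pk-derived part>:<table key>:<query key>
def make_partitioned_key(pk_part, db_alias, table, query_key):
    table_key = sha1(('%s:%s' % (db_alias, table)).encode('utf-8')).hexdigest()
    return '%s:%s:%s' % (pk_part, table_key, query_key)

# Delete only the entries whose first segment matches the affected primary keys,
# leaving the rest of the table's cached queries untouched.
def invalidate_rows(redis_client, pk_part, db_alias, table):
    table_key = sha1(('%s:%s' % (db_alias, table)).encode('utf-8')).hexdigest()
    for key in redis_client.scan_iter(match='%s:%s:*' % (pk_part, table_key)):
        redis_client.delete(key)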

Rationale

Django cachalot has this single issue: it caches everything. All queries are cached, and once a modification is made on a table, all caches linked to that table are invalidated. That's not ideal: the resulting burst of cache misses adds latency and, under heavy load, can overwhelm the server.
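
For illustration, here's roughly what the current behaviour looks like, written in the style of cachalot's own test suite (assumes its Test model and assertNumQueries; a sketch, not an exact test):

# First evaluation hits the DB and is cached; the second is a cache hit.
with self.assertNumQueries(1):
    list(Test.objects.filter(name='test1'))
with self.assertNumQueries(0):
    list(Test.objects.filter(name='test1'))

# ANY write to the table...
Test.objects.create(name='other')

# ...invalidates every cached query on that table, even unrelated ones.
with self.assertNumQueries(1):
    list(Test.objects.filter(name='test1'))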

@abodacs

abodacs commented Feb 12, 2021

@Andrew-Chen-Wang
Is that something like per-host caching? Something like:

# get_request() returns the current request saved to a threadlocal by some middleware
cachalot_prefix = lambda _: get_request().get_host()

@Andrew-Chen-Wang
Collaborator Author

No. Currently cachalot seems to only be useful if your model, just the model, does a single type of SELECT query. I imagine the original author's intention was for you to always do Model.objects.all() or reuse the same filter constantly. Check out monkey_patch.py and see which cache key keeps changing.

What I'm going to do is allow for caching of ALL queries rather than just one for a single model. It'll be an experimental feature I suppose. I'll post my draft proposal at some point next week; just been busy lately.

@jcass77

jcass77 commented Feb 14, 2021

I like the idea of making cache invalidation more conservative, though it might be tricky to get right.

Cachalot's simplicity might also be its greatest strength: I've tried several of the other caching apps for Django and keep coming back to cachalot as the only 'install it and forget about it' option that delivers reliable cache hits.

The others all seem to make compromises that can produce surprising results, or require the user to do fairly hands-on, active cache management themselves anyway. Some examples from other projects' docs might help to illustrate the challenges of implementing finer-grained invalidation:

Calls to QuerySet.count() can be cached, but they cannot be reliably invalidated...
...does not invalidate queries when a new object is created...
Nothing will be cached if the QuerySet is not iterated through completely...
At this time, caching has not been implemented for QuerySet.values or QuerySet.values_list...
Normally qs.update(...) doesn't emit any events and thus doesn't trigger invalidation....
Conditions other than __exact, __in and __isnull=True don't make invalidation more granular...
Conditions on TextFields, FileFields and BinaryFields don't make it either...
Update of "selected_related" object does not invalidate cache for queryset...
Mass updates don't trigger invalidation by default...
Sliced queries are invalidated as non-sliced ones...
Doesn't work with .raw() and other sql queries...
Conditions on subqueries don't affect invalidation...
Doesn't work right with multi-table inheritance...

I'm not saying that cachalot does not suffer from some of the above issues as well (though I am not aware of any), but it does seem to avoid large categories of invalidation-related issues by focusing solely on the table level.

So I guess what I am saying is that there are already various options out there that try to do cache invalidation at a finer level of granularity with some degree of success. But in my experience cachalot is the only one that produces query results that are not stale and that are based on the data as it actually exists in the database at any given point in time.

@jcass77

jcass77 commented Feb 14, 2021

Everything mentioned above is probably already captured more succinctly in the feature comparison table included in django-cachalot's project documentation.

@Andrew-Chen-Wang, do you think it would be possible to improve cache invalidation while still maintaining all of the features / benefits as outlined in that table?

@Andrew-Chen-Wang
Collaborator Author

Andrew-Chen-Wang commented Feb 14, 2021 via email

@Andrew-Chen-Wang

This comment has been minimized.

@Andrew-Chen-Wang
Collaborator Author

This is my current draft proposal (note that it was before I found out that every new query on a table replaces another query's cache). Once I finish writing it, it'll make a little more sense, and I'll hide this comment:


Cachalot should specifically find which values are grabbed per query. Let's say we have a table like this:

from django.db import models

class Book(models.Model):
    id = models.BigAutoField(primary_key=True)
    title = models.CharField(max_length=255)
    content = models.TextField(max_length=1000000)

Consider the following insertions and SELECTs.

book1 = Book.objects.create(title="hello1", content="content")
book2 = Book.objects.create(title="hello2", content="content")
# Make a SELECT and force Django to evaluate it.
query1 = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))

The query is cached. The cache key is derived from the actual SQL of the SELECT together with the db alias used; a second key is derived from the table and the db alias. Both are individually hashed using sha1 to become cache keys, as seen here:

from hashlib import sha1

def get_query_cache_key(compiler):
    """
    Generates a cache key from a SQLCompiler.
    This cache key is specific to the SQL query and its context
    (which database is used). The same query in the same context
    (= the same database) must generate the same cache key.
    :arg compiler: A SQLCompiler that will generate the SQL query
    :type compiler: django.db.models.sql.compiler.SQLCompiler
    :return: A cache key
    :rtype: int
    """
    sql, params = compiler.as_sql()
    check_parameter_types(params)
    cache_key = '%s:%s:%s' % (compiler.using, sql,
                              [str(p) for p in params])
    return sha1(cache_key.encode('utf-8')).hexdigest()

def get_table_cache_key(db_alias, table):
    """
    Generates a cache key from a SQL table.
    :arg db_alias: Alias of the used database
    :type db_alias: str or unicode
    :arg table: Name of the SQL table
    :type table: str or unicode
    :return: A cache key
    :rtype: int
    """
    cache_key = '%s:%s' % (db_alias, table)
    return sha1(cache_key.encode('utf-8')).hexdigest()
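
For the Book table above (assuming its database table name is app_book, as used later in this proposal), the table part of the key would simply be:

get_table_cache_key('default', 'app_book')
# -> the sha1 hex digest of 'default:app_book' (a 40-character string)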


Example 1: specifying which columns to grab

So here's what I'm proposing. Let's say we update a specific column:

# Grabs only id and title with specific title parameter:
query1 = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))
# Grabs all columns
query2 = list(Book.objects.filter(title__startswith="hello"))

# Update attribute "content"
book1.content = "New Content 1"
book1.save(update_fields=["content"])
"""
You must specify update_fields. Otherwise, cachalot
(in this proposal) will invalidate all queries. In general,
it's also good practice since the UPDATE query uses
the update_fields list to write which fields should be updated.
If none are specified, all fields are updated.
"""

# Let's re-evaluate the two queries:
cached = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))
not_cached = list(Book.objects.filter(title__startswith="hello"))

What happened? The first query still has a cache, but the second query does not. Invalidation here depends on the columns the query grabbed. The first query only grabbed the id and title attributes; since only the content attribute changed, its cache was never invalidated. The second query's cache was invalidated because it grabs all columns, including the changed content attribute.

Example 2: specifying which columns to filter on

The filter clause (i.e. the SQL WHERE) matters too. Both queries above filter on the title attribute, so if we update a single row's title, both of their caches would be invalidated. A query that filtered on content__startswith instead of title__startswith would not be invalidated by that title update.
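
Both examples boil down to a simple rule. Here's an illustrative sketch (not cachalot code; the names are made up):

# Decide whether a cached query must be invalidated, given the set of columns
# it selects or filters on and the set of columns touched by an UPDATE.
def should_invalidate(query_columns, updated_columns):
    # Invalidate only if the query reads or filters on a column that changed.
    return bool(set(query_columns) & set(updated_columns))

# Example 1: values_list("id", "title") filtered on title; the UPDATE touches only "content".
should_invalidate({"id", "title"}, {"content"})             # False -> cache kept
# The full-row query selects every column, so the same UPDATE invalidates it.
should_invalidate({"id", "title", "content"}, {"content"})  # True -> cache invalidated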

Other worries

  • Subqueries
  • Annotations
  • Basically, anything involving more complex queries

Method of Invalidation

We'll need to change up how the cache keys are generated. Currently, every time a table is updated, whether it's many records, a single record, or just one specific column being modified, all queries related to that table are invalidated. What I'm currently thinking (I've mostly forgotten about this; sorry, I underestimated the college workload with a startup in tow too):

  • Using sqlparse, find all columns that are selected or used in a WHERE clause or some other filtration method (e.g. SELECT table.id FROM app_book); a rough sketch follows below. This is a little difficult when it comes to subqueries.
  • Like before, we still need to hash the query so we know which cache key to specifically grab.
  • We'll add a new prefix: the columns that are involved in the operation. Heads up, this is where I get stuck. But anyways:

This way, when cachalot tries to grab the cache for a specific query, it first checks for the specific table (and which database alias was used), then it ALSO checks for which columns are involved. If the content column was changed, even JUST for a single record, then all queries that have that column would be invalidated.
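
Here's a rough sketch of the sqlparse idea mentioned above (illustrative only; real queries with JOINs, aliases like U0.id, and subqueries need much more care):

import sqlparse
from sqlparse.sql import Comparison, Identifier, Where

# Pull column names out of a simple WHERE clause.
def where_columns(sql):
    columns = set()
    statement = sqlparse.parse(sql)[0]
    for token in statement.tokens:
        if isinstance(token, Where):
            for sub in token.get_sublists():
                if isinstance(sub, Comparison) and isinstance(sub.left, Identifier):
                    # '"app_book"."title"' -> 'title'
                    columns.add(sub.left.get_real_name())
    return columns

where_columns('SELECT "app_book"."id" FROM "app_book" WHERE "app_book"."title" = %s')
# -> {'title'}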

Recall that cachalot is not a per-object caching system. There may be a better method, for example knowing whether a SELECT statement only grabbed a single object, but that would overcomplicate this and, design-wise, is very difficult to determine as database schemas change.

But why did I say "I'm stuck?" Let's recall the original purpose: individual column attribution. The current format could look something like: a6df5as7d8fde72:id,column,column2:83cf743b02.

Unlike a table name, which is a single "thing," the specified columns could appear in any order (we can of course sort them alphabetically, but besides that) and can change with schema redesigns or SELECT ordering. Hashing isn't really an option:

If we specify columns id,title,content and in a different query we specify id,title, the two hashes are different. Additional worries include subqueries (we could do "table.id" like Django, but sometimes Django outputs "U.id" for getting foreign keys and their attributes).
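
To illustrate the problem (hypothetical key handling, not cachalot code):

from hashlib import sha1

# Overlapping column sets still produce unrelated digests, so a hashed middle
# segment can only be matched by exact equality, never by column membership.
sha1(b'content,id,title').hexdigest()   # one digest
sha1(b'id,title').hexdigest()           # a completely different digest

# Keeping the columns in plaintext (sorted, comma-joined) in the middle segment
# lets invalidation check membership instead of equality:
key_middle = ','.join(sorted(['id', 'title']))   # 'id,title'
'content' in key_middle.split(',')               # False -> this cache entry survives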


References

This is where we cache all queries:

return _get_result_or_execute_query(
and this is where the caching actually happens:
def _get_result_or_execute_query(execute_query_func, cache,

@Andrew-Chen-Wang
Collaborator Author

@jcass77 Sorry about the long proposal. You don't need to read it; just something to get out there for now since I'm busy with school. This proposal (and the other solution I mentioned) will still be in the code base, but that huge proposal I just wrote is mainly to fix this: "Useful for > 50 modifications per minute: X" in that table.

Cachalot does per-table caching, but each table query is completely different and there's only one cache per table.

UNLESS I'm reading the code wrong. I didn't have time to just test it out unfortunately, so I'm looking to do that next week and deduce if it's true there's only a single cache for one table.

@jcass77

jcass77 commented Feb 15, 2021

Cachalot does per-table caching, but each table query is completely different and there's only one cache per table.

UNLESS I'm reading the code wrong. I didn't have time to just test it out unfortunately, so I'm looking to do that next week and deduce if it's true there's only a single cache for one table.

I wrote a quick test and it seems like django-cachalot maintains a cache for multiple queries on the same table, as expected:

def test_cache_maintained_for_multiple_queries_on_same_table(self):
    # Create cached entry for first query
    with self.assertNumQueries(1):
        Test.objects.get(name='test1')
    with self.assertNumQueries(0):
        Test.objects.get(name='test1')

    # Create cached entry for second query
    with self.assertNumQueries(1):
        Test.objects.get(name='test2')
    with self.assertNumQueries(0):
        Test.objects.get(name='test2')

    # Verify both queries are still cached
    with self.assertNumQueries(0):
        Test.objects.get(name='test1')
    with self.assertNumQueries(0):
        Test.objects.get(name='test2')

Not sure if that covers the use case that you are referring to, but it seems fine.

@Andrew-Chen-Wang
Collaborator Author

Andrew-Chen-Wang commented Feb 15, 2021

as expected

Yup, as expected :) Sorry, I was just not reading the code right.

Edit: ugh my mistake. I didn't see this (my brain was not functioning correctly last night, sorry):

to_be_set[cache_key] = (now, result)

Not sure if that covers the usecase that you are referring to

The per-query caching is correct, but this issue is more about the values selected and filtered per query. Instead of invalidating an entire table, only invalidate caches that involve certain columns. So if the column/attribute title was updated for one or more objects, invalidate any cache that filters on or selects the title attribute. If a query didn't touch it, don't invalidate that cache.

In other words, more refined invalidation of caching.

@abodacs

abodacs commented Apr 6, 2021

@Andrew-Chen-Wang

Could you point me to where in the code I need to patch to achieve "adding a prefix to cache keys"?
https://github.com/Suor/django-cacheops#sharing-redis-instance

The main goal is to add a prefix by hostname. Thanks
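
(For context, a host-based prefix can also be applied at the Django cache level, independently of cachalot, via the standard KEY_FUNCTION setting. A minimal sketch, assuming the same middleware-provided get_request() helper as above:)

# settings.py (sketch; get_request() and the module path are hypothetical)
def host_prefixed_key(key, key_prefix, version):
    return '%s:%s:%s:%s' % (get_request().get_host(), key_prefix, version, key)

CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
        'KEY_FUNCTION': 'myproject.settings.host_prefixed_key',
    }
}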

@Andrew-Chen-Wang
Collaborator Author

@abodacs ref #176; this is not the issue you were looking for.
