Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Last N Value Cache #25091

Open
11 of 15 tasks
hiltontj opened this issue Jun 25, 2024 · 7 comments
Open
11 of 15 tasks

Last N Value Cache #25091

hiltontj opened this issue Jun 25, 2024 · 7 comments
Assignees

Comments

@hiltontj
Copy link
Contributor

hiltontj commented Jun 25, 2024

A Last N Value Cache will allow users to access the last value of many series (either by identifier or group) very quickly (<10ms).

Users should be able to specify for a given table and set of columns, the last N values they want to keep cached in RAM. This will be a feature available in both open source and Pro, but there will be limitations in the former.

For a given table, the user would specify the lookup key (i.e. columns to lookup by), the number of values to cache, and the columns (either by name or *) that they want in the cache. The time of the values will always be included.

Cache Creation

To create a cache, users specify:

  1. Name of the table and database
  2. A name for the cache (default to <table_name>_<key_columns>_last_cache)
  3. Key columns (default to series key, tag set sort order, or tags in lexicographic order)
  4. Number of values to store (default to 1, limit to 10)
  5. Columns to cache (default to all non-key columns)
  6. A Time to Live (TTL) which specifies how long values will live in the cache until they are evicted (default to 4 hours)
  7. (Pro only) How far back to query to load cache on boot-up

We would like the front-end for this to be available via a REST API.

The configuration of each cache will be stored in the catalog.

Populating the Cache

In open source, the cache should be populated as a write through while the server is running. In Pro, this will also be the case, but Pro will also have the ability to fill the cache from historical data on boot-up.

Cache Queries

Querying the cache will require a specialized query. The query syntax could look like so:

SELECT foo, time FROM last_cache('some_table');
SELECT foo, time FROM last_cache('some_table') WHERE cola in ['pepsi', 'coke'];
SELECT foo, time FROM last_cache('some_table') WHERE key_col = 'someval';

This is a use-case for DataFusion's User-Defined Table Functions (UDTF).

In some cases, query predicates may be handled directly by the cache's TableProvider/TableFunctionImpl, while more complicated predicates could just be passed back up to the query engine, but where we draw that line remains TBD.

Other Requirements

  • The key columns, if not specified on creation of the cache, will default to the series key (if present), the user defined sort order (if present), or the tags in lexicographical order.
  • Only string, int, or bool columns can be used as key columns.
  • Values columns in the cache, if not specified on creation, will default to all non-key columns in the table, and newly added fields should be used in the cache when added.
  • (Pro) The last N value cache populate on boot-up can be disabled on start-up of the server for recovery purposes.
@hiltontj hiltontj self-assigned this Jun 25, 2024
@alamb
Copy link
Contributor

alamb commented Jun 25, 2024

BTW another potential implementation would be to use an OptimizerRule to rewrite plans with relevant references to use a new table provider. Here is an example of how to do that: apache/datafusion#11087

@pauldix
Copy link
Member

pauldix commented Jun 25, 2024

I'd like it to be explicit to the user that they're requesting values from the cache. That way they know the semantics behind it (i.e. the cache only has data from when the server was running and accepting writes).

We could do the optimizer in addition to that, but ensuring the actual result is the same as a non-optimized result will be tricky as it's just a cache and not the raw underlying data.

@hiltontj
Copy link
Contributor Author

hiltontj commented Jul 2, 2024

Here is a good example of the last cache behaviour and how the key columns are used in the cache: #25109 (review)

@hiltontj
Copy link
Contributor Author

hiltontj commented Jul 2, 2024

Edit: added some additional points to the Other Requirements section in issue description detailing key and value column requirements:

  • The key columns, if not specified on creation of the cache, will default to the series key (if present), the user defined sort order (if present), or the tags in lexicographical order.
  • Only string, int, or bool columns can be used as key columns.
  • Values columns in the cache, if not specified on creation, will default to all non-key columns in the table, and newly added fields should be used in the cache when added.

@hiltontj
Copy link
Contributor Author

hiltontj commented Jul 2, 2024

Edit: added cache creation requirements to original issue description:

(2.) A name for the cache (default to <table_name>_<key_columns>_last_cache)
(6.) A Time to Live (TTL) which specifies how long values will live in the cache until they are evicted (default to 4 hours)

TTL was suggested here: #25109 (comment)

@hiltontj
Copy link
Contributor Author

hiltontj commented Jul 2, 2024

Keeping track of some questions I have here, I will update this as I think of more.

  1. Is it allowed for a user to specify an empty set of key columns, or must they always specify at least one column for the cache key?

  2. Say there are multiple caches on a given table (named cpu) and their names are cache1 and cache2, how does the following query behave?

    SELECT * FROM last_cache('cpu');

    Does it pull and merge records from all caches associated with the cpu table? or is an error returned stating that a cache name must be specified, e.g.,

    SELECT * FROM last_cache('cpu', 'cache1');

    Alternatively, do we base the cache selection on the predicate provided? So, if cache1 has key [region, host] while cache2 has key [id], given the following queries:

    SELECT * FROM last_cache('cpu') WHERE region = '_' AND host = '_'              -- use cache1
    SELECT * FROM last_cache('cpu') WHERE id = '_'                                 -- use cache2
    SELECT * FROM last_cache('cpu') WHERE region = '_' AND host = '_' AND id = '_' -- error ?
  3. Is using a predicate for a non-existent key column an error, or ignored?

    For example, if a cache on the cpu table has the key columns region and host, and the following query is made:

    SELECT * FROM last_cache('cpu') WHERE container_id = '12345'
                                          ^^^^^^^^^^^^
                                          There is no key named 'container_id'

@hiltontj
Copy link
Contributor Author

hiltontj commented Jul 2, 2024

Is it allowed for a user to specify an empty set of key columns, or must they always specify at least one column for the cache key?

I think this would be okay, it is just going to cache all columns (or whatever value columns they specify), but wont be optimized for any columns as predicates.

Does it pull and merge records from all caches associated with the cpu table? or is an error returned stating that a cache name must be specified

My thinking is that it can pull records from all associated caches when a name is not specified - all the caches associated with a table will all be populated from the same write batches, so their time stamps should be aligned and therefore this would be possible.

In a lot of cases, users may only have one cache associated with a given table, so I think it would be a better UX if they did not have to specify the cache name on every call. A more common use case than above would be if there was a single cache associated with a table. For example, if there is only one cache, named cache1, associated with the cpu table, then the following two queries would be equivalent:

SELECT * FROM last_cache('cpu') -- defaults to using cache1, the only cache for the `cpu` table
SELECT * FROM last_cache('cpu', 'cache1')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants