-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to store cache to disk #20
Comments
Thanks for the background info!
|
I like your ideas. I sometimes forget how much you can do with Rust macros. What about having users use the |
it might be possible to make a custom |
So I'm also currently looking at this feature for a side project and would also be willing to help implement this. Personally, I'd already be happy with having
@jaemk you mean having "permanent" variants of
@cjbassi Implementing |
My original thinking around this was that introducing serialize-ability to the existing cache types would mean that the type constraints would need to change to include Serialize/Deserialize, and that's not a limitation that everyone cares about. Now that I'm thinking about it again, a better approach would probably be a new trait that lets you convert to/from a cache-type (whose types are constrained to Serialize/Deserialize) and a json string (or file containing json). I would think there would be "no promises" around efficiency of storing/loading this data since it may (in some circumstances) require making a bunch of copies to covert to/from a cache-type's layout and a layout that's serde/json-friendly.
Some minimal api like (assuming this new trait is in scope) #[cached(using_file="up/to/you/to/manage")]
fn do_stuff(...) -> u32 { ... } would probably be sufficient (?). |
@jaemk Hmmm. Do you reckon an API like |
@jaemk in the meantime, would it be acceptable to make |
@jakobn-ai Good idea - yeah, I'd be ok with making the |
Hi @jaemk @jakobn-ai , |
@jqnatividad Bincode looks interesting. Atm, I'm using self-written formatters with serde and flate2-rs. |
Hi! This sounds like a super useful feature, the API cached has makes it so easy to use, having it persist to disk as well would be great. As for I'm also thinking that it might be useful to have it first search a smaller memory cache, and only if that fails look on the disk for a cached result. What about implementing something like that? I tried giving it a shot to code a draft this, but I still know to little about macros to follow that part of the code I realized when I opened the source. Anyone else that has made some experiments so far? |
I've thought about this a bit, but haven't gotten to the point of trying to write any code related to it. Seems to boil down to two areas of interest that can be implemented separately: de/serialization
persistence (macro integration) One thing to get out of the way - I think persistence should be an atomic full-write, and we shouldn't go anywhere near trying to incrementally write to disk. Some options for persistence:
|
For my use case I would be more interested in serializing/deserializing by writing the struct binary data to files directly, as this is only for caching, I would not be that interested in having the files on disk human readable in JSON format or other text formats. A huge plus would also be if it would be possible without the need to manually add derive traits to every data type that I might want to cache. Bincode looked interesting but also seem to need derive traits Encode/Decode to the structs, same as serde's Serialize/Deserialize. The cached crate has a big bonus as I think about it in that it does not need a lot of setup with the easy use cases, only having to add something like The feature needs might of course vary depending on use case, but for the ones I'm thinking of saving as binary seems more performant than strings. Just some thoughts at least. |
Just wanted to add that with the recent addition of the In my application, I used the use once_cell::sync::Lazy;
static DEFAULT_REDIS_CONN_STR: &str = "redis://127.0.0.1:6379";
static DEFAULT_REDIS_TTL_SECONDS: u64 = 60 * 60 * 24 * 28; // 28 days in seconds
struct RedisConfig {
conn_str: String,
ttl_secs: u64,
ttl_refresh: bool,
}
impl RedisConfig {
fn load() -> Self {
Self {
conn_str: std::env::var("QSV_REDIS_CONNECTION_STRING")
.unwrap_or_else(|_| DEFAULT_REDIS_CONN_STR.to_string()),
ttl_secs: std::env::var("QSV_REDIS_TTL_SECS")
.unwrap_or_else(|_| DEFAULT_REDIS_TTL_SECONDS.to_string())
.parse()
.unwrap(),
ttl_refresh: std::env::var("QSV_REDIS_TTL_REFRESH").is_ok(),
}
}
}
static REDISCONFIG: Lazy<RedisConfig> = Lazy::new(RedisConfig::load);
#[io_cached(
type = "cached::RedisCache<String, String>",
key = "String",
convert = r#"{ format!("{}", url) }"#,
create = r##" {
RedisCache::new("qf", REDISCONFIG.ttl_secs)
.set_refresh(REDISCONFIG.ttl_refresh)
.set_connection_string(&REDISCONFIG.conn_str)
.build()
.expect("error building redis cache")
} "##,
map_error = r##"|e| FetchRedisError::RedisError(format!("{:?}", e))"##,
with_cached_flag = true
)]
fn get_redis_response(
url: &str,
client: &reqwest::blocking::Client,
limiter: &governor::RateLimiter<NotKeyed, InMemoryState, DefaultClock, NoOpMiddleware>,
flag_jql: &Option<String>,
flag_store_error: bool,
) -> Result<cached::Return<String>, FetchRedisError> {
// code here
} And with a robust, mature cache store like Redis, you have the added bonus of being able to use tools like cc @cjbassi , @jakobn-ai , @tirithen |
I re-discovered the same limitation yet again when starting on another project. The Redis support is great for anyone using Redis on their project. But only adding an entire Redis service just to use this crate makes little sense for projects not using Redis for persisting other state. Having an easy to user save to disk option as well makes a lot of sense for some projects and would therefore make sense as another option. For a lot of projects part of the goal is to have it run as a single executable without relying on external services to run, in those cases Redis is not an option. Does anyone have hints on other good crates that offer cache annotation with disk persistence that we that need it can try out until this issue is resolved? |
For my qsv project, leveraging Redis to persist the cache across sessions was "good enough." But I agree that it'd be nice, if its CLI didn't have an external dependency (side note, qsv optionally bundles luau and python as interpreters to help with data-wrangling tasks, and bundling luau was soooo much easier, even though its relatively obscure as its purpose made to be an embeddable interpreter compared to Python with its external dependencies). I recently came across the memmap2 crate, and it seems to be well-suited for persisting the cache to disk. @tirithen , perhaps, you can give integrating it a try? |
@jqnatividad thank you for the suggestion, I'm interested in giving it a try, I'm looking on the code now. My current rust skill set makes it hard to read though the macro related code. I still think it is an interesting challenge, but I might need some help to get going. Then there is also the challenge of agreeing on an API to try out. For the feature to be worth it, I think that the API must be fairly minimal. With Redis there are host and credentials configuration. For disk cache case the user is already logged in so it should be possible to simplify. What I'm hoping for is something similar to adding a #[cached(disk=true, result=true)]
async fn fetch_wikipage(name: &str) -> Result<String, Error> {
reqwest::get(format!("https://en.wikipedia.org/wiki/{}", name)).await?.text().await?
} It would then use memmap2 or similar to store the cached result to a file named Probably also hide this code behind a feature flag as it does not make much sense in a WASM build. How would something like this look to you @jaemk ? Is the cache file paths to hard coded, do we need configuration options, or could something as simple as a single flag work well enough? You have previously mentioned a separate cache type. The thing with that is the ease of use, for example providing a cache file path per function becomes to verbose on my opinion. But maybe having a In any case I'm willing to give it a try at least if I can have some directions on how to get started, and at least a draft of the the desired API. |
Thanks for the suggestions and quesions @jqnatividad and @tirithen! There's a been good amount of thinking in this thread about how the disk writes should be handled and how the interface should look. I think I have an answer as to how I'd like both to be implemented: For actual persistence, use https://github.com/spacejam/sled instead of cooking up a home grown disk storage solution. This provides a nice kv interface over a file and lets you configure the important bits: background flush interval (and also allows manual flushing), file location, and in-process cache size (to avoid disk). For the interface, a new cache type should be created that largely mirrors the Redis and AsyncRedis cache types. It should implement the IoCached traits which will make it compatible with the |
@jaemk and @jqnatividad I got some time to make progress on the disk cache. I added it as a draft pull request. It is by no means complete, but as it takes some time for me to write it would be great with some feedback on if it looks like its going in the right direction before I spend to much more time. The most interesting one is the disk.rs store (for now no async code, only sync) https://github.com/jaemk/cached/pull/145/files#diff-44969ddcb461480249b34639a98545dcffd53d3383bae5084c0cf5b314b8faa8 I got the tests inside that file to pass, but got a bit stuck with the proc macro parts. I used sled as suggested, they however does not have built-in support for expiring keys so for now I just added a method that cleans when cache_get runs and more than 10 seconds has elapsed. Maybe this part should run on a separate thread somehow later on to not stall the result from the cache_get call. Have a look on what you think about the approach, and if it makes sense I could continue a bit more. |
I'd also love to see a disk cache, think on kubernetes volumes that are very easy to setup compared to a redis setup |
Would love to see this (as well as autoloading from disk) - sqlite and other "caching" alternatives are much slower. |
@tirithen what's the state of your 1-year-old PR? |
@omid hi! I have not worked on it for a while now. My first problem was that I found it took a lot of time to work with as the code was so abstract and I have no longer had the same direct needs for this for a while. As I remember it, also my personal needs was for a solution with a simple minimal annotations to cache a function result to disc rather than the more verbose/complete I felt that the preference of the maintainers was something more like the more advanced configuration capabilities of the |
Hey guys, I will pick up the old PR and finish it up. We can have a simpler annotation than what's required for redis since there isn't a need to connect to some external system |
Something like #[disk_cached] would be great shorthand, with a sensible default of course - optionally passing in a folder or file (which one is faster? gonna need to benchmark that) would be good, |
Permanent as in it's valid as long as the backing file exists? I think so |
Yep, that's it - also a hook to programmatically delete a certain key could come in handy |
@jaemk thanks for considering implementing this, I'm certain that it will be a great addition for single user applications such as cli tools and local gui applications and similar that "restarts" often. If it helps my use case needs have mostly been about "throttling" things like HTTP requests with a timeout for cli tools. As they restart each time you execute them in-memory does not make sense, and as the cli tool should be self contained external services like Redis does not make any sense. Both in-memory and Redis mainly makes sense for servers/services. I understand that there might be some configuration needed, it is a great thing being able to tweak where possible, but having as many of these as opt-in as possible I believe would be great. |
merged #145 , released in |
@AstroOrbis any cache-annotated function has a cache that's the function name in all caps (unless ex: Line 41 in afb670c
|
if it's by function name, wouldn't conflicts arise? Something like cachedir/crate_name/function_name maybe would work better, dunno |
Sure, conflicts are possible. That's why there's the |
Fantastic! You got it merged, this will be great for cli tools for caching between runs. |
@jaemk Should we close this? |
I saw #11 and I think it would be good to add this functionality.
Some prior art that I'm aware of includes https://github.com/brmscheiner/memorize.py which is a python library for memoization that supports storing the cache in disk. Some details about that library:
<filename-of-function>_<function-name>.cache
IMO, I think some better options would be to:
$XDG_CACHE_HOME/<appname>
or~/.cache/<appname>
by defaultLet me know what you think!
edit: So I noticed you can specify a key format, so that would solve the json tuple issue actually. Definitely a nice feature!
edit2: It would probably be good to store the caches in a subdirectory of
~/.cache/<appname>
, like in a directory calledmemoized
or something.The text was updated successfully, but these errors were encountered: