Conversation

@koushiro
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Propose a Cache Layer design

What changes are included in this PR?

A new RFC

Are there any user-facing changes?

@koushiro koushiro requested a review from Xuanwo as a code owner June 16, 2025 13:07
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 16, 2025
@koushiro koushiro changed the title RFC: Cache Layer RFC-6297: Cache Layer Jun 16, 2025
@Xuanwo
Member

Xuanwo commented Jun 17, 2025

My current understanding of the cache layer is that it is a very thin component that simply wraps two (or more) OpenDAL operators. The cache policy and strategy should be defined by a separate trait. I haven't explored this idea in depth yet, but it might look something like the following:

let cache_layer = CacheLayer::builder()
    .route("*.hint", WholeCache::new(1024), memory_cache)
    .route("*.json", FixedSizeCache::new(64 << 20), memory_cache)
    .route("*.parquet", ParquetCache::new(), foyer_cache)
    .build()?;
let op = op.layer(cache_layer);

WholeCache, FixedSizeCache, and ParquetCache are all implementations of the CachePolicy trait and can be provided by users. OpenDAL can offer some default implementations as a starting point. In other words, the cache layer itself does not make any decisions for users.

@koushiro
Member Author

koushiro commented Jun 26, 2025

My current understanding of the cache layer is that it is a very thin component that simply wraps two (or more) OpenDAL operators.

I agree. I think it's best to simplify the cache layer design first, without considering a unified cache policy/strategy design; instead, let the corresponding OpenDAL service manage it itself. After collecting enough use cases and feedback, we can consider the policy/strategy design and provide some common policy/strategy implementations.

Copilot AI review requested due to automatic review settings December 27, 2025 07:53

Copilot AI left a comment


Pull request overview

This PR proposes RFC-6297, which introduces a Cache Layer to OpenDAL for transparent read-through and write-through caching capabilities. The RFC aims to improve performance by allowing users to cache data from slower storage services (like S3) in faster ones (like Memory or Redis).

Key changes:

  • Introduces a new CacheLayer that wraps existing storage with caching functionality
  • Defines a CacheStorage trait for flexible cache backend implementations
  • Proposes CacheReader and CacheWriter components for handling cached and uncached data flows


@koushiro koushiro marked this pull request as draft December 27, 2025 08:01
@koushiro koushiro marked this pull request as ready for review December 27, 2025 12:29
@koushiro koushiro force-pushed the add-cache-layer-rfc branch from 2bd38da to 6dbc7a7 Compare December 29, 2025 10:21
@Xuanwo
Member

Xuanwo commented Dec 29, 2025

Thank you for working on this! I will find some time to review this RFC again.

}

async fn delete(&self) -> Result<(RpDelete, Self::Deleter)> {
self.inner.delete().await
Contributor

Should we also delete the file in the cache service? We shouldn't depend on the eviction policy of the underlying service here, since we may keep reading the deleted file if we use memory as the cache service.

Contributor

@flaneur2020 flaneur2020 Jan 4, 2026

It'll be nice to invalidate the cache when delete() is called, IMO; in this manner, we could claim "best effort" cache consistency.
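For illustration, a minimal sketch of this best-effort invalidation (the `delete_path` helper and `cache_key` method are hypothetical; in the real `Access` trait, `delete` returns a `Deleter`, so the wrapping would live there):

```rust
// Hypothetical sketch: best-effort cache invalidation on delete.
async fn delete_path(&self, path: &str) -> Result<()> {
    // Delete from the underlying service first.
    self.inner.delete(path).await?;
    // Best-effort invalidation: a cache error must not fail the delete.
    let _ = self.cache.delete(&self.cache_key(path)).await;
    Ok(())
}
```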


1. **Buffer Accumulation**: All written data is accumulated in an internal `buffer`
2. **Primary Write**: Data is always written to the underlying service first via `inner.write()`
3. **Cache Write**: If `cache_write` is enabled and the write to the underlying service succeeds, the complete data is written to cache
Contributor

Should we invalidate the cached data after a successful write even if `cache_write` is disabled?

1. **Buffer Accumulation**: All written data is accumulated in an internal `buffer`
2. **Primary Write**: Data is always written to the underlying service first via `inner.write()`
3. **Cache Write**: If `cache_write` is enabled and the write to the underlying service succeeds, the complete data is written to cache
4. **Atomic Caching**: Cache operations happen only after successful completion to ensure cache consistency
Contributor

This is not atomic, as the cache operation could fail.
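For reference, a minimal sketch of the `close()` flow described by the list above, with its best-effort (non-atomic) nature made explicit; the field names (`buffer`, `cache`, `cache_key`, `options`) are assumptions, not the RFC's exact code:

```rust
// Hypothetical sketch of CacheWriter::close().
async fn close(&mut self) -> Result<Metadata> {
    // Step 2: primary write -- the underlying service always comes first.
    let meta = self.inner.close().await?;
    // Step 3: cache write -- only after the primary write succeeded, and
    // only best-effort: a cache failure must not fail the user's write.
    if self.options.write {
        let _ = self
            .cache
            .write(&self.cache_key, self.buffer.clone())
            .await;
    }
    Ok(meta)
}
```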


# Future possibilities

- Customizable Cache Key Generation:
Contributor

We may need some keys like `{path}-{version}`, `{path}-{range}`, etc., as sketched below.
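A minimal sketch of such a scheme (the `cache_key` function and its parameters are hypothetical):

```rust
// Hypothetical version- and range-aware cache key generation.
fn cache_key(path: &str, version: Option<&str>, range: Option<(u64, u64)>) -> String {
    let mut key = path.to_string();
    if let Some(v) = version {
        key.push_str(&format!("-{v}"));
    }
    if let Some((start, end)) = range {
        key.push_str(&format!("-{start}-{end}"));
    }
    key
}
```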


// Try cache first if read caching is enabled
if self.cache_options.read {
match self.cache_service.read(&cache_key).await {
Contributor

@flaneur2020 flaneur2020 Jan 4, 2026

When dealing with object storages, it's considered common practice to merge smaller writes into a bigger object, which often makes each object exceed 512 MB or even reach into gigabytes.

Also, when objects are accessed, they're often scanned within a range, and the range is often maintained in an outside index (like Iceberg's manifest). We'd get read amplification if we had to load the entire object from cache.

It'd be nice to have an abstraction here, perhaps called CacheMapper or CacheChunker, which handles the shaping of cache accesses, making the object-to-cache-entry relationship 1:M. This allows us to cache not the entire object but a subset of it by range, as sketched below.
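A minimal sketch of the chunking idea (`chunk_keys` is hypothetical; a real CacheChunker would also handle assembling the chunks back into the requested range):

```rust
// Hypothetical: map a ranged read onto fixed-size, individually
// cacheable chunks, making object-to-cache-entry 1:M.
fn chunk_keys(path: &str, offset: u64, len: u64, chunk_size: u64) -> Vec<String> {
    if len == 0 {
        return Vec::new();
    }
    let first = offset / chunk_size;
    let last = (offset + len - 1) / chunk_size;
    (first..=last).map(|i| format!("{path}-chunk-{i}")).collect()
}
```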


1. **Cache Key Strategy**: Should we provide options for custom cache key generation (e.g., hashing, prefixing)?

2. **Invalidation API**: Should we provide explicit cache invalidation methods, or rely entirely on the underlying cache storage?
Contributor

+1 for having no invalidation API; it should depend solely on the underlying eviction strategy.

}
}

async fn write(&self, key: &str, value: Vec<u8>) -> Result<()> {
Contributor

How about using a `Buffer` instead of `Vec<u8>` here?

fn stat(&self, key: &str) -> impl Future<Output = Result<Metadata>> + MaybeSend;

/// Check whether `key` exists in the cache.
fn exists(&self, key: &str) -> impl Future<Output = Result<bool>> + MaybeSend;
Contributor

Is this `exists` method considered a must? IMO we could already check existence with the `stat` method.

Or, IIUC, we could have a default impl of `exists` that calls the `stat` method, as sketched below.


Storage access performance varies greatly across different storage services.
Remote object stores like S3 or GCS have much higher latency than local storage or in-memory caches.
In many applications, particularly those with read-heavy workloads or repeated access to the same data, caching can significantly improve performance.
Contributor

Shall we assume the content of objects is immutable at any path when using CacheLayer?

IMO keeping content immutable is common practice for object store caching, since immutability removes the hardest part of cache consistency. However, it requires users to keep this constraint in mind.

// Enable cache promotion during read operations (default: true)
read_promotion: true,
// Enable write-through caching (default: true)
write: true,
Contributor

Is it possible to use an enum for the write option to separate write-through from write-back?

Sometimes write-back could be useful: for example, if the upstream S3 service becomes unavailable, a write-back cache could help keep our service available by buffering the writes. A possible shape is sketched below.
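For illustration, a hypothetical enum along those lines (names are assumptions):

```rust
// Hypothetical replacement for the boolean `write` option.
pub enum CacheWriteMode {
    /// Don't touch the cache on writes.
    Disabled,
    /// Write to the underlying service first, then populate the cache.
    WriteThrough,
    /// Acknowledge once the cache holds the data; flush to the underlying
    /// service asynchronously (can buffer writes through upstream outages).
    WriteBack,
}
```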

@Xuanwo
Member

Xuanwo commented Jan 4, 2026

Hello everyone, before we continue discussing the Cache Layer, I invite everyone to reach a consensus on the Route Layer's design. I believe this approach can greatly simplify our Cache Layer design and allow it to focus solely on caching.

RFC: #7130

cc @koushiro @flaneur2020 @meteorgan @killme2008, also @MrCroxx for work on foyer layer.

```rust
let op = s3.layer(
CacheLayer::new(memory)
.with_options(CacheOptions {
Member

Instead of providing CacheOptions, I prefer CacheLayer to do less work and delegate the different options to a new trait called CachePolicy.

In this way, every cache layer will be composed of an Operator and a CachePolicy.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CacheDirective {
    Bypass,
    Use { chunk_size: Option<u32>, fill: bool },
}

pub trait CachePolicy: Send + Sync + 'static {
    fn evaluate(&self, req: &CacheRequest<'_>) -> CacheDirective;
}

CachePolicy can decide everything about caching itself, but we can start simple:

  • bypass or use the cache
  • if the cache missed, whether we need to fill it
  • how to cache it, as a whole or in chunks

We can ship some widely used policies too:

  • WholeCachePolicy
  • ChunkedCachePolicy
  • MetadataOnlyCachePolicy

In this way, users can extend cache behavior based on their own needs.
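For example, a minimal policy sketch built on the trait above; the `content_length` hint on `CacheRequest` is an assumption for illustration:

```rust
// Hypothetical policy: cache small objects whole, chunk large ones,
// bypass when the size is unknown.
pub struct SizeThresholdPolicy {
    whole_limit: u64,
    chunk_size: u32,
}

impl CachePolicy for SizeThresholdPolicy {
    fn evaluate(&self, req: &CacheRequest<'_>) -> CacheDirective {
        match req.content_length {
            Some(len) if len <= self.whole_limit => CacheDirective::Use {
                chunk_size: None, // cache as a whole
                fill: true,
            },
            Some(_) => CacheDirective::Use {
                chunk_size: Some(self.chunk_size), // cache in chunks
                fill: true,
            },
            None => CacheDirective::Bypass,
        }
    }
}
```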

Member Author

The CacheDirective you mentioned appears to only cover read operations. Does it lack provisions for write or delete operations? Should we also include an Invalidate directive?

Member Author

For CacheRequest, I currently think it is composed as follows, but I'm not sure if I should directly use the existing ops types (OpStat/OpRead/OpWrite/OpDelete) in CacheOperation. What do you think?

pub struct CacheRequest<'a> {
    /// Path as seen by OpenDAL.
    pub path: &'a str,
    /// Operation kind (stat/read/write/delete)
    pub op: CacheOperation,
    // Additional fields can be added in the future.
}
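A sketch of what `CacheOperation` could look like if it reused the existing ops types directly (illustrative only):

```rust
// Hypothetical: wrap the existing ops types instead of redefining them.
pub enum CacheOperation {
    Stat(OpStat),
    Read(OpRead),
    Write(OpWrite),
    Delete(OpDelete),
}
```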

Member

Does it lack provisions for write or delete operations? Should we also include an Invalidate directive?

Yes, I do think we lack invalidation support here. However, I believe it's not a blocker and we can add it later. We can test CacheLayer with users who ensure that files are immutable.

I'm not sure if I should directly use the existing ops types (OpStat/OpRead/OpWrite/OpDelete) in CacheOperation

I think it's fine. I also think using a String for the path is fine unless we are sure this is a bottleneck.

## Architecture

The Cache Layer implements the `Layer` trait and wraps an underlying `Access` implementation with caching capabilities.
It introduces a `CacheService` trait that defines the interface for cache operations.
Member

I suggest we just accept Operator as the public API and convert it into an Accessor inside. Also, we should not add a new API surface like CacheService.

By simply reusing the Access trait, we can avoid a lot of repetitive work.


```rust
#[derive(Clone, Copy, Debug)]
pub struct CacheOptions {
Member

As discussed before, I think we should expose a policy trait instead.

cache_options: CacheOptions,
}

impl<A: Access, S: CacheService> LayeredAccess for CacheAccessor<A, S> {
Member

The proposal should not include detailed code, as it could divert our focus.
