feat: Apache OpenDAL™ compatible backends #6177
Conversation
## Greptile Overview

### Greptile Summary

This PR integrates Apache OpenDAL™ as a catch-all backend, enabling support for 62+ storage backends through a unified interface.

**Confidence Score:** 4/5
### Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant IOClient
    participant OpenDALSource
    participant Operator as OpenDAL Operator
    User->>IOClient: read_parquet("oss://bucket/file.parquet", io_config)
    IOClient->>IOClient: parse_url("oss://...")
    Note over IOClient: Unknown scheme -> SourceType::OpenDAL{scheme:"oss"}
    IOClient->>IOClient: Check backends config
    alt Backend configured
        IOClient->>OpenDALSource: get_client("oss", config)
        OpenDALSource->>Operator: Operator::via_iter(scheme, config)
        Operator-->>OpenDALSource: Operator instance
        OpenDALSource-->>IOClient: Arc<OpenDALSource>
    else Backend not configured
        IOClient-->>User: Error: Configure via IOConfig(backends=...)
    end
    IOClient->>OpenDALSource: get(uri, range, io_stats)
    OpenDALSource->>OpenDALSource: url_to_opendal_path(uri)
    Note over OpenDALSource: Extracts path from URL
    OpenDALSource->>Operator: read_with(path).range(...)
    Operator-->>OpenDALSource: Data bytes
    OpenDALSource-->>IOClient: GetResult::Stream
    IOClient-->>User: DataFrame
```
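The `url_to_opendal_path` step in the diagram above strips the scheme and authority from the URL, leaving only the object key for the Operator (which is already scoped to the bucket via its config). A minimal Python sketch of that behavior, assuming the helper works as the diagram describes (the name comes from the diagram; the real Rust implementation may differ):

```python
from urllib.parse import urlparse

def url_to_opendal_path(uri: str) -> str:
    """Extract the OpenDAL object path from a URL like 'oss://bucket/key'.

    The bucket is assumed to be part of the Operator's configuration,
    so only the key (the URL's path component) is returned.
    """
    parsed = urlparse(uri)
    # Strip the leading slash so the path is relative to the bucket root
    return parsed.path.lstrip("/")
```

For example, `url_to_opendal_path("oss://my-bucket/data/file.parquet")` yields `"data/file.parquet"`, which is what `Operator::read_with` expects.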
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
## Codecov Report

❌ Patch coverage is

### Additional details and impacted files

```
@@            Coverage Diff            @@
##             main    #6177     +/-   ##
==========================================
+ Coverage   73.21%   73.33%    +0.11%
==========================================
  Files         994      996        +2
  Lines      129815   130930     +1115
==========================================
+ Hits        95050    96017      +967
- Misses      34765    34913      +148
```
Hi @universalmind303, thanks for creating this PR! I've reviewed the generic OpenDAL backend approach and have some thoughts to share. Overall, I think the generic OpenDAL backend is a great direction for supporting the long tail of object stores. However, based on my experience implementing COS support (PR #6125) and feedback from engineers who have used OpenDAL's COS implementation in production (e.g., lance-format/lance#5740), I'd like to raise a few concerns:
The `CosConfig` options in OpenDAL are quite limited — they had to set `disable_config_load = false` to support environment variables like `TENCENTCLOUD_SECURITY_TOKEN` and `TENCENTCLOUD_REGION`. Typed configuration with clear field names (`region`, `secret_id`, `secret_key`, `security_token`, etc.) would be much friendlier for users.
The generic OpenDAL backend (this PR) is excellent for the long tail of storage services that don't justify dedicated support. What do you think?
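To illustrate the typed-config argument above, here is a rough Python sketch of what a dedicated `CosConfig` could look like, with `from_env()` scanning the `TENCENTCLOUD_*` variables mentioned above and `to_opendal_config()` flattening into the string map OpenDAL expects. This is illustrative only — field and key names are assumptions, not Daft's actual implementation:

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class CosConfig:
    """Illustrative typed COS config; field names mirror the comment above."""
    region: Optional[str] = None
    secret_id: Optional[str] = None
    secret_key: Optional[str] = None
    security_token: Optional[str] = None

    @classmethod
    def from_env(cls) -> "CosConfig":
        # Scan TENCENTCLOUD_* environment variables, as described above
        return cls(
            region=os.environ.get("TENCENTCLOUD_REGION"),
            secret_id=os.environ.get("TENCENTCLOUD_SECRET_ID"),
            secret_key=os.environ.get("TENCENTCLOUD_SECRET_KEY"),
            security_token=os.environ.get("TENCENTCLOUD_SECURITY_TOKEN"),
        )

    def to_opendal_config(self, bucket: str) -> Dict[str, str]:
        # Flatten typed fields into the string map OpenDAL consumes
        config = {"bucket": bucket}
        for key in ("region", "secret_id", "secret_key", "security_token"):
            value = getattr(self, key)
            if value:
                config[key] = value
        return config
```

The point is that a typed struct documents the available options and validates them up front, rather than pushing raw key/value strings through to OpenDAL.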
@XuQianJin-Stars that sounds like a good compromise for me! I'll try to get this merged by EOD today and then you can rebase your PR on top of this to add first class support for COS.
**srilman** left a comment:
Overall LGTM, just had some thoughts on the interface that I wanted your thoughts on. Thanks @universalmind303!
```toml
home = "0.5.12"
itertools = {workspace = true}
log = {workspace = true}
opendal = {workspace = true, features = [
```
Looking at the list of features, I had some questions:
- Why not enable `executors-tokio` or `internal-tokio-rt`? They are enabled by default, and seem reasonable?
- Can we add comments as to what each service enables? I can't really tell what `oss` and `obs` are for.
- Do we want `services-fs`? Shouldn't that be covered by our `file://`?
- Are there any other interesting ones that you saw that we should add a TODO for?
I had `services-fs` enabled to make it easier/possible to integration test this.
- Update opendal from 0.51 to 0.55 (latest)
- Rename `backends` to `opendal_backends` for clarity
- Add comments to opendal feature flags explaining each service
- Add `executors-tokio` feature for tokio async execution
- Use `Reader::into_bytes_stream` for proper streaming instead of reading entire files into memory
- Fall back to an empty config when no backend config is provided, allowing OpenDAL backends that don't require config to work without an explicit IOConfig
- Improve error messages to list available OpenDAL schemes and suggest IOConfig configuration
- Fix API changes in opendal 0.55 (write/close now return Metadata)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**srilman** left a comment:
Thanks @universalmind303!
PR #6177 (OpenDAL) added ~110 new crate entries to `Cargo.lock`, which invalidated the Rust compilation cache. The `integration-test-build` job needs ~29 minutes for a full rebuild, but the 30-minute timeout causes it to be cancelled before the post-job cache save runs. This creates a deadlock: every subsequent build also misses the cache and times out.

Bumping to 45 minutes lets the first cache-miss build complete and save, after which subsequent builds should return to ~10 minutes.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Changes Made

This PR adds support for Tencent Cloud COS (Cloud Object Storage) by leveraging the generic OpenDAL backend infrastructure introduced in #6177. Instead of implementing a dedicated COS source from scratch, we provide a lightweight `CosConfig` that converts to OpenDAL configuration, reusing the existing `OpenDALSource` for all I/O operations.

### Architecture

```mermaid
flowchart TD
    A["User API: IOConfig(cos=CosConfig(region='ap-guangzhou', ...))"] --> B["Config: CosConfig.to_opendal_config(bucket) → BTreeMap<String, String>"]
    B --> C["I/O: OpenDALSource (generic, supports read/write/list/glob/multipart)"]
    C --> D["Backend: Apache OpenDAL → Tencent COS"]
```

### Implementation Details

**New Files:**
- `src/common/io-config/src/cos.rs` - `CosConfig` struct with region, endpoint, credentials, and connection settings. Includes `from_env()` for automatic environment variable scanning and `to_opendal_config()` to convert into an OpenDAL-compatible config map.

**Modified Files:**
- `src/common/io-config/src/lib.rs` - Export the `CosConfig` module
- `src/common/io-config/src/config.rs` - Add `cos` field to the `IOConfig` struct
- `src/common/io-config/src/python.rs` - Python bindings for the `CosConfig` class with full API support
- `src/daft-io/src/lib.rs` - Route `cos://` and `cosn://` URL schemes to `SourceType::OpenDAL { scheme: "cos" }`, with special handling to extract the bucket from the URL and merge `CosConfig` into the OpenDAL config
- `daft/io/__init__.py` - Export `CosConfig` to the Python API

**Key Design Decisions:**
- **No dedicated CosSource** — instead of ~1,000 lines of custom I/O code, COS reuses the generic `OpenDALSource`, which already implements `ObjectSource` (including multipart write support)
- **Lightweight CosConfig preserved** — provides a user-friendly Python API with automatic region↔endpoint derivation and environment variable scanning (`COS_*` / `TENCENTCLOUD_*`), rather than requiring raw `opendal_backends` dicts

**Supported Features:**
- URL schemes: `cos://bucket/key` and `cosn://bucket/key` (Hadoop CosN compatible)
- Environment variables: `COS_*` and `TENCENTCLOUD_*` prefixes for configuration
- Operations: read, write (including multipart), list, delete, glob pattern matching
- Authentication: permanent keys and STS temporary credentials
- Full Python API with the `CosConfig` class

**Why OpenDAL:** Since Tencent COS doesn't have an official Rust SDK, we use Apache OpenDAL, which provides a unified data access layer supporting 70+ storage backends including Tencent COS. This PR builds on #6177, which already added the generic OpenDAL backend infrastructure.

### Example Usage

**Reading from COS with explicit credentials:**

```python
import daft
from daft.io import IOConfig, CosConfig

io_config = IOConfig(
    cos=CosConfig(
        region="ap-guangzhou",
        secret_id="your-secret-id",
        secret_key="your-secret-key",
    )
)
df = daft.read_parquet("cos://my-bucket/path/to/data.parquet", io_config=io_config)
df.show()
```
## Changes Made
Adds **protocol aliases** to `IOConfig`: user-defined mappings from
custom scheme names to existing schemes. For example, `"my-s3" -> "s3"`
lets organizations use domain-specific protocol names that route to
standard backends (including native S3, Azure, GCS — not just OpenDAL).
**Python API:**
```python
io_config = IOConfig(
protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)
```
### Implementation
- **`src/common/io-config/src/config.rs`** — Added `protocol_aliases:
BTreeMap<String, String>` field to `IOConfig`, display support, and
`validate_protocol_aliases()` that rejects alias keys matching built-in
schemes.
- **`src/daft-io/src/lib.rs`** — Added `resolve_url_alias()` using `Cow`
for zero-allocation on the common (no-alias) path. Integrated into
`get_source_and_path()`, `single_url_get()`, `single_url_put()`, and
`single_url_get_size()`. Added 7 Rust unit tests.
- **`src/common/io-config/src/python.rs`** — Added `protocol_aliases`
parameter to `IOConfig::new()` and `replace()` with case normalization
and validation. Added getter.
- **`daft/daft/__init__.pyi`** — Updated type stubs.
- **`tests/io/test_protocol_aliases.py`** — 9 config tests + 2
integration tests using OpenDAL `fs` backend.
### Design Decisions
- **Single-level resolution** — no chaining, avoids infinite loops
- **Built-in scheme protection** — aliasing `s3`, `gcs`, etc. as keys is
rejected at construction time
- **Case-insensitive** — consistent with `parse_url()` which already
lowercases schemes
- **Minimal change surface** — `parse_url()` and its 17+ external
callers remain untouched; alias resolution happens in `IOClient` methods
before calling `parse_url()`
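The design decisions above (single-level resolution, built-in scheme protection, case-insensitivity) can be sketched in Python. The real implementation is in Rust and uses `Cow` to avoid allocation on the no-alias path; the scheme list and function bodies here are illustrative assumptions, not Daft's exact code:

```python
from typing import Dict

# Illustrative subset of built-in schemes that may not be used as alias keys
BUILTIN_SCHEMES = {"s3", "gs", "gcs", "az", "http", "https", "file"}

def validate_protocol_aliases(aliases: Dict[str, str]) -> None:
    # Built-in scheme protection: reject colliding keys at construction time
    for key in aliases:
        if key.lower() in BUILTIN_SCHEMES:
            raise ValueError(f"cannot alias built-in scheme: {key}")

def resolve_url_alias(url: str, aliases: Dict[str, str]) -> str:
    # Single-level resolution: rewrite the scheme at most once, never chain
    scheme, sep, rest = url.partition("://")
    if not sep:
        return url
    # Case-insensitive lookup, matching parse_url()'s lowercasing
    target = aliases.get(scheme.lower())
    return f"{target}://{rest}" if target else url
```

For example, `resolve_url_alias("my-s3://bucket/path", {"my-s3": "s3"})` produces `"s3://bucket/path"`, which then flows through the unchanged `parse_url()`.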
## Related Issues
Builds on PR #6177 (OpenDAL support).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Changes Made

- Integrates Apache OpenDAL as a catch-all backend so that any OpenDAL-supported storage (62+ backends) works out of the box via `IOConfig(opendal_backends={...})`
- Native backends (S3, GCS, Azure, HTTP, HF, TOS, etc.) are unchanged and always take priority
- Unknown URL schemes now route through OpenDAL instead of erroring immediately, with a helpful error message if no backend config is provided

Example usage:
```py
io_config = IOConfig(
opendal_backends={"oss": {"bucket": "my-bucket", "access_key_id": "...", "access_key_secret": "..."}}
)
df = daft.read_parquet("oss://my-bucket/data.parquet", io_config=io_config)
```
More examples:
```py
# using the opendal github backend
io_config = IOConfig(
opendal_backends={
"github": {
"owner": "Eventual-Inc",
"repo": "Daft",
}
}
)
df = daft.read_json("github://main/tests/assets/json-data/sample1.json", io_config=io_config)
```
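The routing behavior described above — native backends first, OpenDAL as a fallback for unknown schemes, with an empty config when no backend config is provided — can be sketched as follows. The scheme list and return shape are illustrative assumptions, not Daft's exact routing code:

```python
from typing import Dict, Optional, Tuple

# Illustrative set of schemes handled by dedicated native sources
NATIVE_SCHEMES = {"s3", "gs", "az", "http", "https", "file", "hf", "tos"}

def route_scheme(
    url: str,
    opendal_backends: Optional[Dict[str, dict]] = None,
) -> Tuple[str, str, Optional[dict]]:
    """Decide which source type handles a URL, mirroring the priority above."""
    scheme = url.split("://", 1)[0].lower()
    if scheme in NATIVE_SCHEMES:
        # Native backends are unchanged and always take priority
        return ("native", scheme, None)
    # Unknown schemes fall through to OpenDAL; backends that need no config
    # still work because an empty config is used when none was provided
    config = (opendal_backends or {}).get(scheme, {})
    return ("opendal", scheme, config)
```

Here `route_scheme("s3://b/k")` stays on the native path, while `route_scheme("oss://b/k", {"oss": {...}})` routes to OpenDAL with the configured backend settings.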
## Related Issues
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>