
feat: Apache OpenDAL™ compatible backends #6177

Merged
desmondcheongzx merged 8 commits into main from universalmind303/opendal on Feb 19, 2026

Conversation

@universalmind303 (Member) commented Feb 11, 2026

Changes Made

- Integrates Apache OpenDAL as a catch-all backend so that any OpenDAL-supported storage (62+ backends) works out of the box via `IOConfig(backends={...})`.
- Native backends (S3, GCS, Azure, HTTP, HF, TOS, etc.) are unchanged and always take priority.
- Unknown URL schemes now route through OpenDAL instead of erroring immediately, with a helpful error message if no backend config is provided.

Example usage:

```py
io_config = IOConfig(
    opendal_backends={"oss": {"bucket": "my-bucket", "access_key_id": "...", "access_key_secret": "..."}}
)
df = daft.read_parquet("oss://my-bucket/data.parquet", io_config=io_config)
```

More examples:

```py
# using the opendal github backend
io_config = IOConfig(
    opendal_backends={
        "github": {
            "owner": "Eventual-Inc",
            "repo": "Daft",
        }
    }
)
df = daft.read_json("github://main/tests/assets/json-data/sample1.json", io_config=io_config)
```
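The routing rule described above (native backends always win; everything else falls through to the OpenDAL backends dict, and an unconfigured scheme raises with a configuration hint) can be sketched in plain Python. The function, scheme list, and return values here are illustrative only, not Daft's actual implementation:

```python
# Hypothetical sketch of the scheme-routing rule: native schemes are
# matched first; anything else falls through to the OpenDAL backends
# dict; an unconfigured scheme raises with a configuration hint.
NATIVE_SCHEMES = {"s3", "gs", "az", "http", "https", "hf", "tos", "file"}

def route_scheme(scheme: str, opendal_backends: dict) -> str:
    scheme = scheme.lower()
    if scheme in NATIVE_SCHEMES:
        return f"native:{scheme}"      # native backends always take priority
    if scheme in opendal_backends:
        return f"opendal:{scheme}"     # catch-all OpenDAL fallback
    raise ValueError(
        f"Unknown scheme {scheme!r}: configure it via "
        "IOConfig(opendal_backends={...})"
    )

print(route_scheme("s3", {}))                          # native:s3
print(route_scheme("oss", {"oss": {"bucket": "b"}}))   # opendal:oss
```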

Related Issues

github-actions bot added the feat label Feb 11, 2026

greptile-apps bot (Contributor) commented Feb 11, 2026

Greptile Summary

This PR integrates Apache OpenDAL™ as a catch-all backend, enabling support for 62+ storage backends through a unified interface.

Key Changes

  • OpenDAL Integration: Added new OpenDALSource module implementing the ObjectSource trait with full support for read, write, glob, list, and delete operations
  • Automatic Fallback: Unknown URL schemes now route through OpenDAL instead of erroring immediately, with clear error messages when backend configuration is missing
  • Configuration: Added backends field to IOConfig allowing users to configure OpenDAL backends via IOConfig(backends={"scheme": {...}})
  • Writer Support: Refactored CSV/JSON/Parquet writers to support OpenDAL backends through the supports_native_writer() abstraction
  • Breaking Change: SourceType enum changed from Copy to Clone due to addition of OpenDAL { scheme: String } variant

Implementation Quality

  • Well-structured error handling with proper mapping from OpenDAL errors to Daft error types
  • Comprehensive test coverage including read/write roundtrips, globbing, partitioning, and serialization
  • Proper URL path extraction logic that strips scheme/host to work with OpenDAL's operator model
  • Multipart writer implementation supporting streaming writes to OpenDAL backends
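The multipart writer mentioned above follows a common buffering pattern: accumulate bytes until a part-size threshold is reached, flush full parts as they fill, and flush the remainder on close. A minimal Python sketch under those assumptions (class name and part size are hypothetical, not Daft's code):

```python
# Hypothetical multipart streaming writer: bytes are buffered until a
# part-size threshold is reached, then flushed as one "part"; close()
# flushes whatever remains. Illustrative only.
class MultipartWriter:
    def __init__(self, part_size: int = 8):
        self.part_size = part_size
        self.buffer = bytearray()
        self.parts: list[bytes] = []

    def write(self, chunk: bytes) -> None:
        self.buffer.extend(chunk)
        # Flush as many full-size parts as the buffer now holds.
        while len(self.buffer) >= self.part_size:
            self.parts.append(bytes(self.buffer[: self.part_size]))
            del self.buffer[: self.part_size]

    def close(self) -> bytes:
        # The final part may be smaller than part_size.
        if self.buffer:
            self.parts.append(bytes(self.buffer))
            self.buffer.clear()
        return b"".join(self.parts)

w = MultipartWriter(part_size=4)
w.write(b"hello")
w.write(b"world!")
data = w.close()
print([len(p) for p in w.parts])  # [4, 4, 3]
```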

Minor Issues

  • URL reconstruction in the ls method could be more robust when handling schemes without host components

Confidence Score: 4/5

  • This PR is safe to merge with only minor refinements needed
  • The implementation is well-designed with comprehensive test coverage and proper error handling. The integration follows established patterns in the codebase. Only minor style improvements suggested for URL handling robustness. The breaking change to SourceType (Copy -> Clone) is necessary and handled correctly throughout.
  • No files require special attention - the implementation is solid across all changes

Important Files Changed

- `src/daft-io/src/opendal_source.rs` — New OpenDAL integration module with a comprehensive ObjectSource implementation. Well-structured with proper error handling and extensive unit tests. Minor issue with URL reconstruction in the ls method.
- `src/common/io-config/src/config.rs` — Added backends field to the IOConfig struct for OpenDAL configuration. Clean implementation with proper display formatting.
- `src/daft-io/src/lib.rs` — Integrated OpenDAL as a fallback for unknown URL schemes. Changed SourceType to non-Copy due to the String field. Routes unknown schemes to OpenDAL instead of erroring.
- `src/daft-writers/src/csv_writer.rs` — Refactored to use the supports_native_writer() check instead of hardcoded source types, enabling CSV writes via OpenDAL backends.
- `tests/io/test_opendal.py` — Comprehensive test suite covering read/write operations, globbing, serialization, and various data types through the OpenDAL fs backend. Excellent coverage.

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant IOClient
    participant OpenDALSource
    participant Operator as OpenDAL Operator

    User->>IOClient: read_parquet("oss://bucket/file.parquet", io_config)
    IOClient->>IOClient: parse_url("oss://...")
    Note over IOClient: Unknown scheme -> SourceType::OpenDAL{scheme:"oss"}

    IOClient->>IOClient: Check backends config
    alt Backend configured
        IOClient->>OpenDALSource: get_client("oss", config)
        OpenDALSource->>Operator: Operator::via_iter(scheme, config)
        Operator-->>OpenDALSource: Operator instance
        OpenDALSource-->>IOClient: Arc<OpenDALSource>
    else Backend not configured
        IOClient-->>User: Error: Configure via IOConfig(backends=...)
    end

    IOClient->>OpenDALSource: get(uri, range, io_stats)
    OpenDALSource->>OpenDALSource: url_to_opendal_path(uri)
    Note over OpenDALSource: Extracts path from URL
    OpenDALSource->>Operator: read_with(path).range(...)
    Operator-->>OpenDALSource: Data bytes
    OpenDALSource-->>IOClient: GetResult::Stream
    IOClient-->>User: DataFrame
```

greptile-apps bot left a comment:

13 files reviewed, 1 comment

greptile-apps bot (Contributor) commented Feb 11, 2026

Additional Comments (1)

src/common/io-config/src/config.rs
Display trait for IOConfig doesn't include backends field - formatting will be incomplete

```rust
impl Display for IOConfig {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::result::Result<(), std::fmt::Error> {
        write!(
            f,
            "IOConfig:
{}
{}
{}
{}
{}
{}
{}
{}
Backends: {:?}",
            self.s3, self.azure, self.gcs, self.tos, self.http, self.unity, self.gravitino, self.hf,
            self.backends
        )
    }
}
```

@universalmind303 (Member, Author) commented: @greptileai

greptile-apps bot left a comment:

13 files reviewed, 1 comment

Comment thread src/daft-io/src/opendal_source.rs Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 79.66102% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.33%. Comparing base (f258ff9) to head (5d0d2ef).
⚠️ Report is 23 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-io/src/opendal_source.rs | 77.55% | 55 Missing ⚠️ |
| src/daft-writers/src/csv_writer.rs | 40.00% | 3 Missing ⚠️ |
| src/common/io-config/src/config.rs | 66.66% | 1 Missing ⚠️ |
| src/daft-io/src/lib.rs | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```
@@            Coverage Diff             @@
##             main    #6177      +/-   ##
==========================================
+ Coverage   73.21%   73.33%   +0.11%
==========================================
  Files         994      996       +2
  Lines      129815   130930    +1115
==========================================
+ Hits        95050    96017     +967
- Misses      34765    34913     +148
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/common/io-config/src/python.rs | 46.21% <100.00%> (+1.21%) ⬆️ |
| src/daft-functions-list/src/kernels.rs | 98.24% <100.00%> (+3.67%) ⬆️ |
| src/daft-writers/src/json_writer.rs | 82.35% <100.00%> (ø) |
| src/daft-writers/src/parquet_writer.rs | 90.82% <100.00%> (ø) |
| src/daft-writers/src/utils.rs | 91.40% <100.00%> (ø) |
| src/common/io-config/src/config.rs | 97.82% <66.66%> (-2.18%) ⬇️ |
| src/daft-io/src/lib.rs | 77.90% <92.30%> (+1.29%) ⬆️ |
| src/daft-writers/src/csv_writer.rs | 88.75% <40.00%> (+6.89%) ⬆️ |
| src/daft-io/src/opendal_source.rs | 77.55% <77.55%> (ø) |

... and 8 files with indirect coverage changes


@XuQianJin-Stars (Contributor) commented:

Hi @universalmind303, thanks for creating this PR! I've reviewed the generic OpenDAL backend approach and have some thoughts to share.

Overall, I think the generic OpenDAL backend is a great direction for supporting the long tail of object stores. However, based on my experience implementing COS support (PR #6125) and feedback from engineers who have used OpenDAL's COS implementation in production (e.g., lance-format/lance#5740), I'd like to raise a few concerns:

1. **OpenDAL's individual service implementations can be rough around the edges.** Feedback from the Lance project's COS integration (which also uses OpenDAL under the hood) suggests that OpenDAL's service-specific implementations tend to be too simplistic and may have subtle issues in production. For example:
   - The CosConfig options in OpenDAL are quite limited — they had to set disable_config_load = false to support environment variables like TENCENTCLOUD_SECURITY_TOKEN and TENCENTCLOUD_REGION.
   - The recommendation from engineers with production COS experience is that using the S3-compatible interface (s3:// scheme) with an S3 Rust SDK would be more robust, since COS is S3-compatible and the S3 ecosystem is far more battle-tested.

2. **The generic backends dict loses type safety and discoverability.** My dedicated CosConfig approach provides:
   - Typed configuration with clear field names (region, secret_id, secret_key, security_token, etc.)
   - Python IDE autocompletion and type checking via .pyi stubs
   - A from_env() method that knows exactly which environment variables to load (COS_ENDPOINT, TENCENTCLOUD_SECRET_ID, etc.)
   - Sensible defaults (timeouts, retry counts, concurrency limits)

   With the generic backends={"cos": {...}} approach, users need to know the exact OpenDAL config key names, and there's no IDE support or validation.

3. **Suggested path forward.** I think both approaches can coexist:
   - The generic OpenDAL backend (this PR) is excellent for the long tail of storage services that don't justify dedicated support.
   - Dedicated backends for popular services (like COS, which is widely used in the Chinese cloud ecosystem) can provide a better user experience with typed configs, environment variable support, and optimized implementations.

   I'm happy to rebase my PR #6125 on top of this one once it merges, and we could register cos:// / cosn:// as a dedicated scheme that takes priority over the generic OpenDAL fallback — similar to how S3, GCS, Azure, and TOS are handled today.

What do you think?
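The trade-off raised in this comment (typed config versus raw backends dict) can be made concrete with a small Python sketch. Everything here is illustrative: the field names and env-var names mirror the comment above, not Daft's actual CosConfig API.

```python
import os
from dataclasses import dataclass

# Hypothetical typed config: known fields, from_env() that knows its
# environment variables, and a conversion into the raw key/value map a
# generic OpenDAL-style backends dict would take.
@dataclass
class CosConfig:
    region: str
    secret_id: str
    secret_key: str

    @classmethod
    def from_env(cls) -> "CosConfig":
        # A typed config can document exactly which variables it reads.
        return cls(
            region=os.environ["TENCENTCLOUD_REGION"],
            secret_id=os.environ["TENCENTCLOUD_SECRET_ID"],
            secret_key=os.environ["TENCENTCLOUD_SECRET_KEY"],
        )

    def to_opendal_config(self, bucket: str) -> dict:
        # Flatten into the untyped map; key names become an internal detail
        # instead of something every user must memorize.
        return {
            "bucket": bucket,
            "region": self.region,
            "secret_id": self.secret_id,
            "secret_key": self.secret_key,
        }

cfg = CosConfig(region="ap-guangzhou", secret_id="id", secret_key="key")
print(cfg.to_opendal_config("my-bucket")["region"])  # ap-guangzhou
```

With the typed wrapper, IDEs can autocomplete fields and catch typos at construction time; the raw dict defers every such error to runtime inside the backend.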

@universalmind303 (Member, Author) commented:

@XuQianJin-Stars that sounds like a good compromise for me! I'll try to get this merged by EOD today and then you can rebase your PR on top of this to add first class support for COS.

@srilman (Contributor) left a comment:

Overall LGTM, just had some thoughts on the interface that I wanted your thoughts on. Thanks @universalmind303!

Comment thread Cargo.toml Outdated
Comment thread src/daft-io/Cargo.toml
```toml
home = "0.5.12"
itertools = {workspace = true}
log = {workspace = true}
opendal = {workspace = true, features = [
```
Contributor commented:

Looking at the list of features, had some questions

  • Why not enable executors-tokio or internal-tokio-rt? They are enabled by default, and seems reasonable?
  • Can we add comments as to what each service enables what? Cause I can't really tell what oss and obs are for
  • Do we want services-fs? Shouldn't that be covered by our file://?
  • Should we add a TODO for any other interesting services that you saw?

@universalmind303 (Member, Author) replied:

I had services-fs enabled to make it easier/possible to integration test this.

Comment thread src/daft-io/src/lib.rs Outdated
Comment thread src/daft-io/src/lib.rs
Comment thread daft/daft/__init__.pyi Outdated
Comment thread src/daft-io/src/opendal_source.rs Outdated
- Update opendal from 0.51 to 0.55 (latest)
- Rename `backends` to `opendal_backends` for clarity
- Add comments to opendal feature flags explaining each service
- Add `executors-tokio` feature for tokio async execution
- Use `Reader::into_bytes_stream` for proper streaming instead of
  reading entire files into memory
- Fallback to empty config when no backend config provided, allowing
  OpenDAL backends that don't require config to work without explicit
  IOConfig
- Improve error messages to list available OpenDAL schemes and suggest
  IOConfig configuration
- Fix API changes in opendal 0.55 (write/close now return Metadata)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@srilman (Contributor) left a comment

@desmondcheongzx desmondcheongzx enabled auto-merge (squash) February 19, 2026 00:29
@desmondcheongzx desmondcheongzx merged commit b55e634 into main Feb 19, 2026
98 of 106 checks passed
@desmondcheongzx desmondcheongzx deleted the universalmind303/opendal branch February 19, 2026 00:59
desmondcheongzx added a commit that referenced this pull request Feb 19, 2026

PR #6177 (OpenDAL) added ~110 new crate entries to Cargo.lock, which
invalidated the Rust compilation cache. The `integration-test-build` job
needs ~29 minutes for a full rebuild, but the 30-minute timeout causes
it to be cancelled before the post-job cache save runs. This creates a
deadlock: every subsequent build also misses the cache and times out.

Bumping to 45 minutes lets the first cache-miss build complete and save,
after which subsequent builds should return to ~10 minutes.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
universalmind303 pushed a commit that referenced this pull request Feb 19, 2026
## Changes Made

This PR adds support for Tencent Cloud COS (Cloud Object Storage) by
leveraging the generic OpenDAL backend infrastructure introduced in
#6177. Instead of implementing a dedicated COS source from scratch, we
provide a lightweight `CosConfig` that converts to OpenDAL
configuration, reusing the existing `OpenDALSource` for all I/O
operations.

### Architecture
```mermaid
flowchart TD
    A["User API: IOConfig(cos=CosConfig(region=&quot;ap-guangzhou&quot;, ...))"] --> B["Config: CosConfig.to_opendal_config(bucket) → BTreeMap&lt;String, String&gt;"]
    B --> C["I/O: OpenDALSource (generic, supports read/write/list/glob/multipart)"]
    C --> D["Backend: Apache OpenDAL → Tencent COS"]
```



### Implementation Details

**New Files:**
- `src/common/io-config/src/cos.rs` - `CosConfig` struct with region,
endpoint, credentials, and connection settings. Includes `from_env()`
for automatic environment variable scanning and `to_opendal_config()` to
convert into OpenDAL-compatible config map.

**Modified Files:**
- `src/common/io-config/src/lib.rs` - Export CosConfig module
- `src/common/io-config/src/config.rs` - Add `cos` field to `IOConfig`
struct
- `src/common/io-config/src/python.rs` - Python bindings for `CosConfig`
class with full API support
- `src/daft-io/src/lib.rs` - Route `cos://` and `cosn://` URL schemes to
`SourceType::OpenDAL { scheme: "cos" }`, with special handling to
extract the bucket from the URL and merge `CosConfig` into the OpenDAL
config
- `daft/io/__init__.py` - Export `CosConfig` to Python API

**Key Design Decisions:**
- **No dedicated CosSource** — instead of ~1,000 lines of custom I/O
code, COS reuses the generic `OpenDALSource` which already implements
`ObjectSource` (including multipart write support)
- **Lightweight CosConfig preserved** — provides a user-friendly Python
API with automatic region↔endpoint derivation and environment variable
scanning (`COS_*` / `TENCENTCLOUD_*`), rather than requiring raw
`opendal_backends` dicts

**Supported Features:**
- URL schemes: `cos://bucket/key` and `cosn://bucket/key` (Hadoop CosN
compatible)
- Environment variables: `COS_*` and `TENCENTCLOUD_*` prefixes for
configuration
- Operations: read, write (including multipart), list, delete, glob
pattern matching
- Authentication: permanent keys and STS temporary credentials
- Full Python API with `CosConfig` class

**Why OpenDAL:** Since Tencent COS doesn't have an official Rust SDK, we
use Apache OpenDAL which provides a unified data access layer supporting
70+ storage backends including Tencent COS. This PR builds on #6177
which already added the generic OpenDAL backend infrastructure.

### Example Usage

**Reading from COS with explicit credentials:**
```python
import daft
from daft.io import IOConfig, CosConfig

io_config = IOConfig(
    cos=CosConfig(
        region="ap-guangzhou",
        secret_id="your-secret-id",
        secret_key="your-secret-key",
    )
)

df = daft.read_parquet("cos://my-bucket/path/to/data.parquet", io_config=io_config)
df.show()
```

desmondcheongzx pushed a commit that referenced this pull request Mar 6, 2026
## Changes Made

Adds **protocol aliases** to `IOConfig`: user-defined mappings from
custom scheme names to existing schemes. For example, `"my-s3" -> "s3"`
lets organizations use domain-specific protocol names that route to
standard backends (including native S3, Azure, GCS — not just OpenDAL).

**Python API:**
```python
io_config = IOConfig(
    protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
    s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)
```

### Implementation

- **`src/common/io-config/src/config.rs`** — Added `protocol_aliases:
BTreeMap<String, String>` field to `IOConfig`, display support, and
`validate_protocol_aliases()` that rejects alias keys matching built-in
schemes.
- **`src/daft-io/src/lib.rs`** — Added `resolve_url_alias()` using `Cow`
for zero-allocation on the common (no-alias) path. Integrated into
`get_source_and_path()`, `single_url_get()`, `single_url_put()`, and
`single_url_get_size()`. Added 7 Rust unit tests.
- **`src/common/io-config/src/python.rs`** — Added `protocol_aliases`
parameter to `IOConfig::new()` and `replace()` with case normalization
and validation. Added getter.
- **`daft/daft/__init__.pyi`** — Updated type stubs.
- **`tests/io/test_protocol_aliases.py`** — 9 config tests + 2
integration tests using OpenDAL `fs` backend.

### Design Decisions

- **Single-level resolution** — no chaining, avoids infinite loops
- **Built-in scheme protection** — aliasing `s3`, `gcs`, etc. as keys is
rejected at construction time
- **Case-insensitive** — consistent with `parse_url()` which already
lowercases schemes
- **Minimal change surface** — `parse_url()` and its 17+ external
callers remain untouched; alias resolution happens in `IOClient` methods
before calling `parse_url()`
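The three alias-resolution rules listed above (single-level lookup with no chaining, case-insensitivity, and rejection of built-in schemes as alias keys) can be sketched in a few lines of Python. The function names and scheme set are illustrative, not the actual Rust implementation:

```python
# Hypothetical sketch of the protocol-alias semantics described above.
BUILTIN_SCHEMES = {"s3", "gs", "az", "http", "https", "file"}

def validate_protocol_aliases(aliases: dict) -> dict:
    # Case-insensitive: normalize keys and values at construction time.
    normalized = {k.lower(): v.lower() for k, v in aliases.items()}
    # Built-in scheme protection: reject alias keys shadowing real schemes.
    for key in normalized:
        if key in BUILTIN_SCHEMES:
            raise ValueError(f"cannot alias built-in scheme {key!r}")
    return normalized

def resolve_scheme(scheme: str, aliases: dict) -> str:
    # Single-level: one lookup, never followed transitively, so an
    # alias chain like a -> b -> c cannot loop.
    return aliases.get(scheme.lower(), scheme.lower())

aliases = validate_protocol_aliases({"My-S3": "s3", "company-store": "gcs"})
print(resolve_scheme("MY-S3", aliases))  # s3
print(resolve_scheme("s3", aliases))     # s3
```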

## Related Issues

Builds on PR #6177 (OpenDAL support).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
gavin9402 pushed a commit to gavin9402/Daft that referenced this pull request Apr 7, 2026
gavin9402 pushed a commit to gavin9402/Daft that referenced this pull request Apr 7, 2026
gavin9402 pushed a commit to gavin9402/Daft that referenced this pull request Apr 7, 2026
gavin9402 pushed a commit to gavin9402/Daft that referenced this pull request Apr 7, 2026
4 participants