Skip to content

feat: Add protocol aliases for IOConfig#6252

Merged
desmondcheongzx merged 1 commit intomainfrom
universalmind303/protocol-aliases
Mar 6, 2026
Merged

feat: Add protocol aliases for IOConfig#6252
desmondcheongzx merged 1 commit intomainfrom
universalmind303/protocol-aliases

Conversation

@universalmind303
Copy link
Copy Markdown
Member

Changes Made

Adds protocol aliases to IOConfig: user-defined mappings from custom scheme names to existing schemes. For example, "my-s3" -> "s3" lets organizations use domain-specific protocol names that route to standard backends (including native S3, Azure, GCS — not just OpenDAL).

Python API:

io_config = IOConfig(
    protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
    s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)

Implementation

  • src/common/io-config/src/config.rs — Added protocol_aliases: BTreeMap<String, String> field to IOConfig, display support, and validate_protocol_aliases() that rejects alias keys matching built-in schemes.
  • src/daft-io/src/lib.rs — Added resolve_url_alias() using Cow for zero-allocation on the common (no-alias) path. Integrated into get_source_and_path(), single_url_get(), single_url_put(), and single_url_get_size(). Added 7 Rust unit tests.
  • src/common/io-config/src/python.rs — Added protocol_aliases parameter to IOConfig::new() and replace() with case normalization and validation. Added getter.
  • daft/daft/__init__.pyi — Updated type stubs.
  • tests/io/test_protocol_aliases.py — 9 config tests + 2 integration tests using OpenDAL fs backend.

Design Decisions

  • Single-level resolution — no chaining, avoids infinite loops
  • Built-in scheme protection — aliasing s3, gcs, etc. as keys is rejected at construction time
  • Case-insensitive — consistent with parse_url() which already lowercases schemes
  • Minimal change surfaceparse_url() and its 17+ external callers remain untouched; alias resolution happens in IOClient methods before calling parse_url()

Related Issues

Builds on PR #6177 (OpenDAL support).

🤖 Generated with Claude Code

Allow user-defined mappings from custom scheme names to existing schemes
(e.g., "my-s3" -> "s3") so organizations can use domain-specific protocol
names that route to any backend including native S3, Azure, GCS, and OpenDAL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@universalmind303 universalmind303 requested a review from a team as a code owner February 19, 2026 22:14
@github-actions github-actions Bot added the feat label Feb 19, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 19, 2026

Greptile Summary

This PR adds protocol aliases to IOConfig, allowing users to define custom scheme names (e.g., "my-s3") that transparently rewrite to existing schemes (e.g., "s3") before IO operations are dispatched. The implementation includes case normalization, built-in scheme protection, single-level resolution, and a zero-allocation fast path via Cow::Borrowed, with good Python API parity.

However, there is one critical integration bug that prevents the feature from working:

  • IOClient::glob skips alias resolution for the URL passed to source.globget_source correctly resolves the alias internally (via get_source_and_path), so the right source object is selected. However, the raw, unresolved URL is then forwarded to source.glob(input.as_str(), ...). The underlying source implementation parses the URL to extract the scheme and path; receiving the custom aliased scheme instead of the canonical one causes the glob to fail. This affects all glob-based reads (e.g., read_parquet("myfs://*.parquet", ...)) and the test_alias_to_fs_write_and_read integration test exercises exactly this code path.

Additionally, there are two minor concerns:

  • validate_protocol_aliases checks alias keys against built-in schemes but does not validate that alias targets are known schemes or registered OpenDAL backends, so misconfigured aliases fail silently at runtime.
  • A duplicate #[allow(clippy::too_many_arguments)] attribute is present on both new and replace in python.rs.

Confidence Score: 2/5

  • Not safe to merge — critical bug in glob function breaks all glob-based reads with aliased schemes.
  • The feature is well-designed with good case normalization, built-in scheme protection, and zero-allocation fast paths. However, the IOClient::glob function has a critical integration bug: it passes the unresolved (aliased) URL to source.glob() instead of the resolved URL. This causes any glob-based read operation (e.g., read_parquet("myfs://*.parquet", ...)) to fail at runtime. The integration test test_alias_to_fs_write_and_read (which uses a glob pattern) will likely fail when run. Additionally, alias target validation at construction time would improve error messages, and there is a redundant clippy suppression attribute.
  • src/daft-io/src/lib.rs (glob function line 351–354), tests/io/test_protocol_aliases.py (integration test that exercises the buggy path)

Sequence Diagram

sequenceDiagram
    participant User
    participant IOClient
    participant resolve_url_alias
    participant parse_url
    participant ObjectSource

    User->>IOClient: single_url_get("my-s3://bucket/file.parquet")
    IOClient->>resolve_url_alias: resolve("my-s3://...", config)
    resolve_url_alias-->>IOClient: "s3://bucket/file.parquet" (Cow::Owned)
    IOClient->>parse_url: parse("s3://bucket/file.parquet")
    parse_url-->>IOClient: (SourceType::S3, path)
    IOClient->>ObjectSource: get(path, ...)
    ObjectSource-->>User: GetResult

    User->>IOClient: glob("my-s3://bucket/*.parquet")  [BUGGY PATH]
    IOClient->>ObjectSource: get_source("my-s3://...") → resolves alias internally → S3Source
    IOClient->>ObjectSource: source.glob("my-s3://bucket/*.parquet")  ← unresolved URL!
    ObjectSource-->>IOClient: Error: unknown scheme "my-s3"
Loading

Comments Outside Diff (1)

  1. src/daft-io/src/lib.rs, line 351-354 (link)

    glob passes unresolved URL to source

    The glob function resolves the alias when calling self.get_source(&input) (because get_source delegates to get_source_and_path, which applies resolve_url_alias). However, the original — unresolved — URL is then forwarded directly to source.glob(input.as_str(), ...). This means the underlying source (e.g., OpenDALSource for scheme fs) receives a URL like myfs://localhost/out/*.parquet instead of fs://localhost/out/*.parquet. When the source internally parses that URL to determine the scheme and path, it will see the aliased scheme myfs, which has no registered backend, causing the glob to fail.

    The integration test test_alias_to_fs_write_and_read exercises exactly this path via daft.read_parquet("myfs://localhost/out/*.parquet", ...) and is likely to surface this at test time.

    The fix is to resolve the alias before both calls:

    pub async fn glob(
        &self,
        input: String,
        fanout_limit: Option<usize>,
        page_size: Option<i32>,
        limit: Option<usize>,
        io_stats: Option<Arc<IOStatsContext>>,
        file_format: Option<FileFormat>,
    ) -> Result<BoxStream<'static, Result<FileMetadata>>> {
        let resolved = resolve_url_alias(&input, &self.config);
        let source = self.get_source(&resolved).await?;
        let files = source
            .glob(
                resolved.as_ref(),
                fanout_limit,
                page_size,
                limit,
                io_stats,
                file_format,
            )
            .await?;
        Ok(files)
    }

Last reviewed commit: 287272c

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 19, 2026

Additional Comments (1)

src/daft-io/src/python.rs
parse_url is called on the unresolved input, but get_source (line 42) resolves the alias internally. This causes the path to contain the original scheme instead of the resolved one.

            let resolved = crate::resolve_url_alias(&input, &io_config.unwrap_or_default().config);
            let (_, path) = parse_url(&resolved)?;
            let runtime_handle = get_io_runtime(multithreaded_io);

            runtime_handle.block_on_current_thread(async {
                let source = io_client.get_source(&resolved).await?;

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 99.41176% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 75.59%. Comparing base (ffe0024) to head (287272c).
⚠️ Report is 80 commits behind head on main.

Files with missing lines Patch % Lines
src/common/io-config/src/config.rs 92.30% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6252      +/-   ##
==========================================
+ Coverage   73.44%   75.59%   +2.15%     
==========================================
  Files        1001     1023      +22     
  Lines      133163   150140   +16977     
==========================================
+ Hits        97798   113500   +15702     
- Misses      35365    36640    +1275     
Files with missing lines Coverage Δ
src/common/io-config/src/python.rs 60.31% <100.00%> (+6.65%) ⬆️
src/daft-io/src/lib.rs 81.23% <100.00%> (+6.84%) ⬆️
src/common/io-config/src/config.rs 96.82% <92.30%> (-1.18%) ⬇️

... and 239 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Collaborator

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@desmondcheongzx
Copy link
Copy Markdown
Collaborator

Oops sorry! Closed when I meant to merge

@desmondcheongzx desmondcheongzx enabled auto-merge (squash) March 6, 2026 01:06
Comment on lines 216 to 231
@@ -226,7 +225,8 @@ impl IOConfig {
tos=None,
gravitino=None,
cos=None,
opendal_backends=None
opendal_backends=None,
protocol_aliases=None
))]
#[allow(clippy::too_many_arguments)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate #[allow(clippy::too_many_arguments)] suppression

The #[allow(clippy::too_many_arguments)] attribute appears twice for the new method — once at line 216 before the #[pyo3(signature = (...))] attribute and again at line 231 immediately before the fn declaration. The same duplication occurs on the replace method (lines 273 and 288). Per project rules, clippy warnings should be fixed rather than suppressed, but at minimum the redundant duplicate suppressions should be removed.

Context Used: Rule from dashboard - Fix clippy warnings instead of suppressing them with allow attributes in Rust code. (source)

Comment on lines +87 to +101
pub fn validate_protocol_aliases(&self) -> std::result::Result<(), String> {
const BUILTIN_SCHEMES: &[&str] = &[
"file", "http", "https", "s3", "s3a", "s3n", "az", "abfs", "abfss", "gcs", "gs", "hf",
"tos", "cos", "cosn", "vol+dbfs", "dbfs", "gvfs",
];
for key in self.protocol_aliases.keys() {
if BUILTIN_SCHEMES.contains(&key.as_str()) {
return Err(format!(
"Protocol alias key '{key}' conflicts with built-in scheme. \
Aliases can only map new custom scheme names to existing schemes."
));
}
}
Ok(())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alias target values are not validated against known schemes

validate_protocol_aliases only verifies that alias keys don't shadow built-in schemes. It does not validate that alias values (targets) actually refer to a known scheme or a registered OpenDAL backend. A config like {"my-proto": "typo-schme"} will be accepted at construction time and only fail at runtime when a URL is first resolved. Consider validating that alias targets are either built-in schemes or present in opendal_backends, so users catch configuration mistakes early.

@desmondcheongzx desmondcheongzx merged commit 80d670f into main Mar 6, 2026
88 of 90 checks passed
@desmondcheongzx desmondcheongzx deleted the universalmind303/protocol-aliases branch March 6, 2026 02:04
gavin9402 pushed a commit to gavin9402/Daft that referenced this pull request Apr 7, 2026
## Changes Made

Adds **protocol aliases** to `IOConfig`: user-defined mappings from
custom scheme names to existing schemes. For example, `"my-s3" -> "s3"`
lets organizations use domain-specific protocol names that route to
standard backends (including native S3, Azure, GCS — not just OpenDAL).

**Python API:**
```python
io_config = IOConfig(
    protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
    s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)
```

### Implementation

- **`src/common/io-config/src/config.rs`** — Added `protocol_aliases:
BTreeMap<String, String>` field to `IOConfig`, display support, and
`validate_protocol_aliases()` that rejects alias keys matching built-in
schemes.
- **`src/daft-io/src/lib.rs`** — Added `resolve_url_alias()` using `Cow`
for zero-allocation on the common (no-alias) path. Integrated into
`get_source_and_path()`, `single_url_get()`, `single_url_put()`, and
`single_url_get_size()`. Added 7 Rust unit tests.
- **`src/common/io-config/src/python.rs`** — Added `protocol_aliases`
parameter to `IOConfig::new()` and `replace()` with case normalization
and validation. Added getter.
- **`daft/daft/__init__.pyi`** — Updated type stubs.
- **`tests/io/test_protocol_aliases.py`** — 9 config tests + 2
integration tests using OpenDAL `fs` backend.

### Design Decisions

- **Single-level resolution** — no chaining, avoids infinite loops
- **Built-in scheme protection** — aliasing `s3`, `gcs`, etc. as keys is
rejected at construction time
- **Case-insensitive** — consistent with `parse_url()` which already
lowercases schemes
- **Minimal change surface** — `parse_url()` and its 17+ external
callers remain untouched; alias resolution happens in `IOClient` methods
before calling `parse_url()`

## Related Issues

Builds on PR Eventual-Inc#6177 (OpenDAL support).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants