feat: Add protocol aliases for IOConfig (#6252)
Allow user-defined mappings from custom scheme names to existing schemes (e.g., `"my-s3" -> "s3"`) so organizations can use domain-specific protocol names that route to any backend, including native S3, Azure, GCS, and OpenDAL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**Greptile Summary**

This PR adds protocol aliases to `IOConfig`. However, there is one critical integration bug that prevents the feature from working: the glob path passes the unresolved URL to the source (see the sequence diagram below).

Additionally, there are two minor concerns: a duplicated clippy suppression and unvalidated alias targets (see the review comments below).
Confidence Score: 2/5
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant User
    participant IOClient
    participant resolve_url_alias
    participant parse_url
    participant ObjectSource
    User->>IOClient: single_url_get("my-s3://bucket/file.parquet")
    IOClient->>resolve_url_alias: resolve("my-s3://...", config)
    resolve_url_alias-->>IOClient: "s3://bucket/file.parquet" (Cow::Owned)
    IOClient->>parse_url: parse("s3://bucket/file.parquet")
    parse_url-->>IOClient: (SourceType::S3, path)
    IOClient->>ObjectSource: get(path, ...)
    ObjectSource-->>User: GetResult
    User->>IOClient: glob("my-s3://bucket/*.parquet") [BUGGY PATH]
    IOClient->>ObjectSource: get_source("my-s3://...") → resolves alias internally → S3Source
    IOClient->>ObjectSource: source.glob("my-s3://bucket/*.parquet") ← unresolved URL!
    ObjectSource-->>IOClient: Error: unknown scheme "my-s3"
```
**Codecov Report**

❌ Patch coverage is

```
@@            Coverage Diff             @@
##              main    #6252       +/-   ##
==========================================
+ Coverage    73.44%   75.59%    +2.15%
==========================================
  Files         1001     1023       +22
  Lines       133163   150140    +16977
==========================================
+ Hits         97798   113500    +15702
- Misses       35365    36640     +1275
```
Oops sorry! Closed when I meant to merge
```diff
@@ -226,7 +225,8 @@ impl IOConfig {
     tos=None,
     gravitino=None,
     cos=None,
-    opendal_backends=None
+    opendal_backends=None,
+    protocol_aliases=None
 ))]
 #[allow(clippy::too_many_arguments)]
```
**Duplicate `#[allow(clippy::too_many_arguments)]` suppression**

The `#[allow(clippy::too_many_arguments)]` attribute appears twice for the new method — once at line 216 before the `#[pyo3(signature = (...))]` attribute and again at line 231 immediately before the `fn` declaration. The same duplication occurs on the `replace` method (lines 273 and 288). Per project rules, clippy warnings should be fixed rather than suppressed, but at minimum the redundant duplicate suppressions should be removed.

Context used: rule from dashboard — "Fix clippy warnings instead of suppressing them with allow attributes in Rust code."
```rust
pub fn validate_protocol_aliases(&self) -> std::result::Result<(), String> {
    const BUILTIN_SCHEMES: &[&str] = &[
        "file", "http", "https", "s3", "s3a", "s3n", "az", "abfs", "abfss", "gcs", "gs", "hf",
        "tos", "cos", "cosn", "vol+dbfs", "dbfs", "gvfs",
    ];
    for key in self.protocol_aliases.keys() {
        if BUILTIN_SCHEMES.contains(&key.as_str()) {
            return Err(format!(
                "Protocol alias key '{key}' conflicts with built-in scheme. \
                 Aliases can only map new custom scheme names to existing schemes."
            ));
        }
    }
    Ok(())
}
```
**Alias target values are not validated against known schemes**

`validate_protocol_aliases` only verifies that alias keys don't shadow built-in schemes. It does not validate that alias values (targets) actually refer to a known scheme or a registered OpenDAL backend. A config like `{"my-proto": "typo-schme"}` will be accepted at construction time and only fail at runtime when a URL is first resolved. Consider validating that alias targets are either built-in schemes or present in `opendal_backends`, so users catch configuration mistakes early.
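A possible shape for that extended check — a standalone sketch, not Daft's actual code: the free-function signature and the `BTreeSet` of OpenDAL backend names are illustrative stand-ins for the real `&self` method and config fields.

```rust
use std::collections::{BTreeMap, BTreeSet};

const BUILTIN_SCHEMES: &[&str] = &[
    "file", "http", "https", "s3", "s3a", "s3n", "az", "abfs", "abfss",
    "gcs", "gs", "hf", "tos", "cos", "cosn", "vol+dbfs", "dbfs", "gvfs",
];

// Sketch: validate both alias keys (must not shadow built-ins) and alias
// targets (must be a built-in scheme or a registered OpenDAL backend),
// so typos like {"my-proto": "typo-schme"} fail at construction time.
fn validate_protocol_aliases(
    aliases: &BTreeMap<String, String>,
    opendal_backends: &BTreeSet<String>,
) -> Result<(), String> {
    for (key, target) in aliases {
        if BUILTIN_SCHEMES.contains(&key.as_str()) {
            return Err(format!(
                "Protocol alias key '{key}' conflicts with built-in scheme. \
                 Aliases can only map new custom scheme names to existing schemes."
            ));
        }
        if !BUILTIN_SCHEMES.contains(&target.as_str())
            && !opendal_backends.contains(target)
        {
            return Err(format!(
                "Protocol alias '{key}' points at unknown scheme '{target}'; \
                 targets must be built-in schemes or registered OpenDAL backends."
            ));
        }
    }
    Ok(())
}
```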
## Changes Made
Adds **protocol aliases** to `IOConfig`: user-defined mappings from
custom scheme names to existing schemes. For example, `"my-s3" -> "s3"`
lets organizations use domain-specific protocol names that route to
standard backends (including native S3, Azure, GCS — not just OpenDAL).
**Python API:**
```python
io_config = IOConfig(
protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)
```
### Implementation
- **`src/common/io-config/src/config.rs`** — Added `protocol_aliases:
BTreeMap<String, String>` field to `IOConfig`, display support, and
`validate_protocol_aliases()` that rejects alias keys matching built-in
schemes.
- **`src/daft-io/src/lib.rs`** — Added `resolve_url_alias()` using `Cow`
for zero-allocation on the common (no-alias) path. Integrated into
`get_source_and_path()`, `single_url_get()`, `single_url_put()`, and
`single_url_get_size()`. Added 7 Rust unit tests.
- **`src/common/io-config/src/python.rs`** — Added `protocol_aliases`
parameter to `IOConfig::new()` and `replace()` with case normalization
and validation. Added getter.
- **`daft/daft/__init__.pyi`** — Updated type stubs.
- **`tests/io/test_protocol_aliases.py`** — 9 config tests + 2
integration tests using OpenDAL `fs` backend.
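The `Cow`-based `resolve_url_alias()` described above might look roughly like this — a standalone sketch under stated assumptions (the real function lives in `src/daft-io/src/lib.rs` and takes the `IOConfig` rather than a bare map):

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

// Sketch: return Cow::Borrowed (zero allocation) when no alias applies,
// and Cow::Owned with the rewritten scheme otherwise. Matching is
// case-insensitive, consistent with parse_url()'s scheme lowercasing.
fn resolve_url_alias<'a>(
    url: &'a str,
    aliases: &BTreeMap<String, String>,
) -> Cow<'a, str> {
    if let Some((scheme, rest)) = url.split_once("://") {
        if let Some(target) = aliases.get(&scheme.to_ascii_lowercase()) {
            // Single-level resolution: the target is used as-is and never
            // looked up again, so alias chains cannot loop.
            return Cow::Owned(format!("{target}://{rest}"));
        }
    }
    Cow::Borrowed(url)
}
```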
### Design Decisions
- **Single-level resolution** — no chaining, avoids infinite loops
- **Built-in scheme protection** — aliasing `s3`, `gcs`, etc. as keys is
rejected at construction time
- **Case-insensitive** — consistent with `parse_url()` which already
lowercases schemes
- **Minimal change surface** — `parse_url()` and its 17+ external
callers remain untouched; alias resolution happens in `IOClient` methods
before calling `parse_url()`
## Related Issues
Builds on PR Eventual-Inc#6177 (OpenDAL support).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>