[Data] Add enhanced support for Unity Catalog #57954
Conversation
…g and type safety
- Add dataclasses for structured responses (VolumeInfo, CredentialsResponse, AWSCredentials, AzureSASCredentials, GCPOAuthCredentials)
- Add CloudProvider enum for type-safe cloud provider handling
- Implement proper GCP temporary file handling with atexit cleanup
- Add comprehensive error messages for volume access failures
- Add extensive documentation with third-party API references
- Ensure Python 3.8+ compatibility (use Tuple instead of tuple)
- Remove ray.init() call (follows Ray Data architecture - initialization is external)
- Add support for all Ray Data file formats (lance, iceberg, hudi, etc.)
- Improve docstrings on all dataclasses and methods
- Pass all ruff and black lint checks

Supports Delta Lake features (deletion vectors, column mapping) via delegation to ray.data.read_delta().

Signed-off-by: soffer-anyscale <[email protected]>
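A minimal sketch of the structured-response approach this commit describes; the class and field names below are illustrative, not necessarily the exact ones in the PR:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CloudProvider(Enum):
    """Type-safe identifiers for the supported clouds."""

    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"


@dataclass
class AWSCredentials:
    """Temporary AWS credentials returned by credential vending."""

    access_key_id: str
    secret_access_key: str
    session_token: str


@dataclass
class CredentialsResponse:
    """Parsed credential-vending response for a single table or volume."""

    provider: CloudProvider
    expiration_time_ms: int
    aws: Optional[AWSCredentials] = None  # Set only when provider is AWS
```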
- Fix volume path parsing for /Volumes/ prefix format
  * Correctly parse /Volumes/catalog/schema/volume/path into catalog.schema.volume
  * Add validation to ensure at least 3 components after /Volumes/
  * Previously only the first component was taken, causing API failures
- Fix memory leak in GCP credential cleanup
  * Change the cleanup method to static to avoid capturing a self reference; the old lambda-based callback kept UnityCatalogConnector instances from being garbage collected
  * Pass the file path directly to atexit.register instead of using a lambda

Signed-off-by: soffer-anyscale <[email protected]>
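A sketch of both fixes under hypothetical helper names (`parse_volume_path` and `_GCPCredentialHandler` are illustrative, not the PR's actual symbols):

```python
import atexit
import os
from typing import Tuple


def parse_volume_path(path: str) -> Tuple[str, str]:
    """Split '/Volumes/catalog/schema/volume/sub/dir' into
    ('catalog.schema.volume', 'sub/dir')."""
    if not path.startswith("/Volumes/"):
        raise ValueError(f"Expected a /Volumes/ path, got: {path}")
    parts = path[len("/Volumes/"):].strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(
            f"Expected /Volumes/<catalog>/<schema>/<volume>[/<path>], got: {path}"
        )
    return ".".join(parts[:3]), "/".join(parts[3:])


class _GCPCredentialHandler:
    @staticmethod
    def _cleanup_temp_file(file_path: str) -> None:
        # Static method: the atexit callback holds no reference to self,
        # so connector instances remain garbage-collectable.
        if os.path.exists(file_path):
            os.remove(file_path)

    def write_service_account_file(self, contents: str, file_path: str) -> None:
        with open(file_path, "w") as f:
            f.write(contents)
        # Register the plain function with the path; no lambda, no self capture.
        atexit.register(self._cleanup_temp_file, file_path)
```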
The UnityCatalogConnector class was missing an API stability annotation, which is required by Ray's API policy checks. This class is used internally by the read_unity_catalog() function and shouldn't be directly instantiated by users.
- Add @DeveloperAPI annotation to UnityCatalogConnector class
- Import DeveloperAPI from ray.util.annotations
- Fixes API policy check CI failure in documentation build

Signed-off-by: soffer-anyscale <[email protected]>
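The annotation itself is a one-line change; roughly:

```python
from ray.util.annotations import DeveloperAPI


@DeveloperAPI
class UnityCatalogConnector:
    """Internal helper behind ray.data.read_unity_catalog(); not a public API."""
```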
- Add ColumnInfo.from_dict() method to safely parse API responses
  - Extracts only needed fields, ignoring extra fields like column_masks
  - Fixes TypeError when the Unity Catalog API returns unexpected fields
- Improve code modularity by extracting helper methods:
  - _create_auth_headers() for authorization header creation
  - _fetch_volume_metadata() for volume metadata fetching
  - _fetch_table_metadata() for table metadata fetching
  - _extract_storage_url() for storage URL construction
- Add comprehensive documentation to all methods
- Fix all ruff lint violations (trailing whitespace in docstrings)
- All pre-commit checks pass

Signed-off-by: soffer-anyscale <[email protected]>
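A minimal sketch of the from_dict() pattern (field names here are illustrative):

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ColumnInfo:
    name: str
    type_name: str
    nullable: bool = True

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ColumnInfo":
        """Build a ColumnInfo from an API response, silently dropping unknown
        keys (e.g. column_masks) instead of raising TypeError."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```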
- Refactored _get_creds() method into smaller focused functions
- Added logging infrastructure following Ray Data patterns
- Fixed custom datasource handling to use correct instantiation
- Documented custom datasource constructor requirements
- Applied ruff formatting and fixed all lint issues
- Validated third-party integrations (PyArrow, Delta Lake, cloud SDKs)
- Enhanced error handling with comprehensive messages
- All pre-commit checks passing

Signed-off-by: soffer-anyscale <[email protected]>
This is a simplified version of Unity Catalog support that focuses on the core use case: reading Delta Lake tables with credential vending.

Changes from the full implementation:
- Removed volume support (private preview feature)
- Removed format detection/inference
- Removed custom datasource support
- Assumes all tables are Delta format
- Simplified API: read_unity_catalog(table, url, token, ...)

This makes the implementation much easier to review while providing the most valuable functionality: secure access to Unity Catalog Delta tables with automatic credential vending.

The simplified implementation includes:
- Unity Catalog table metadata fetching
- Production-ready credential vending API
- AWS, Azure, and GCP credential handling
- Delta Lake reading with proper PyArrow filesystem configuration
- Comprehensive error messages for common issues
- Full logging support

Total: ~600 lines vs ~1150 lines in full implementation

Signed-off-by: soffer-anyscale <[email protected]>
Code Review
This pull request provides a significant enhancement to the Unity Catalog integration by refactoring the connector into a more robust and maintainable implementation. The use of dataclasses to model API responses, detailed logging, and comprehensive error handling for common Delta Lake issues are excellent improvements. The code is clean and well-documented. I have a couple of suggestions to further improve robustness, particularly around handling required parameters and failing fast to provide clearer error messages to the user.
The hardcoded fallback to 'us-east-1' could lead to connection errors if the S3 bucket is in a different region and requires signature version 4.

Changes:
- Remove hardcoded 'us-east-1' fallback in S3FileSystem creation
- Add explicit validation that region is provided for AWS
- Provide clear error message with example usage
- Update docstrings to clarify region is required for AWS
- Improve error handling flow in _read_delta_with_credentials()

This prevents hard-to-debug runtime failures and makes the API clearer for users working with AWS S3-backed Delta tables.

Signed-off-by: soffer-anyscale <[email protected]>
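A hedged sketch of the fail-fast check; the credential-response keys shown are assumptions about the vending payload, not verbatim from the PR:

```python
import pyarrow.fs as pafs


def _build_s3_filesystem(creds: dict, region: str) -> pafs.S3FileSystem:
    # Fail fast with a clear message instead of silently assuming us-east-1.
    if not region:
        raise ValueError(
            "An AWS region is required for S3-backed Delta tables. Pass it "
            'explicitly, e.g. read_unity_catalog(..., region="us-west-2").'
        )
    aws = creds["aws_temp_credentials"]
    return pafs.S3FileSystem(
        access_key=aws["access_key_id"],
        secret_key=aws["secret_access_key"],
        session_token=aws["session_token"],
        region=region,
    )
```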
Reduced from 637 lines to 193 lines by starting from master's uc_datasource.py and adding only essential changes.

What's added (56 lines over master's 137 lines):
- PyArrow S3FileSystem support for Delta with session tokens (AWS requirement)
- Required region validation for AWS to prevent connection errors
- Deletion vector error handling with helpful messages
- GCP temp file cleanup with atexit to prevent file leaks

Changes from previous version:
- Renamed unity_catalog_datasource.py -> uc_datasource.py (matches master)
- Removed all dataclasses, enums, and complex structures
- Removed volume support (not needed for initial release)
- Kept simple dict-based approach from master
- Restored data_format parameter for flexibility
- Minimal API surface matching master's design

Net change: -459 lines (removed 674, added 215)
File: 193 lines (vs 137 in master = +56 lines of essential fixes)

Signed-off-by: soffer-anyscale <[email protected]>
Convert a Markdown-style link to reStructuredText format to fix a documentation build warning. The docstring used [text](url) syntax, which is Markdown, but Sphinx expects RST syntax for links: `text <url>`_

Signed-off-by: soffer-anyscale <[email protected]>
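For reference, the two forms inside a docstring (placeholder URL, not the one from the PR):

```python
def read_unity_catalog(table: str, url: str, token: str):
    """Read a Unity Catalog Delta table.

    RST form that Sphinx renders as a link:
        `Unity Catalog docs <https://example.com/unity-catalog>`_

    Markdown form that triggers the build warning:
        [Unity Catalog docs](https://example.com/unity-catalog)
    """
```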
## Description
This PR adds support for reading Unity Catalog Delta tables in Ray Data
with automatic credential vending. This enables secure, temporary access
to Delta Lake tables stored in Databricks Unity Catalog without
requiring users to manage cloud credentials manually.
### What's Added
- **`ray.data.read_unity_catalog()`** - Updated public API for reading
Unity Catalog Delta tables
- **`UnityCatalogConnector`** - Handles Unity Catalog REST API
integration and credential vending
- **Multi-cloud support** - Works with AWS S3, Azure Data Lake Storage,
and Google Cloud Storage
- **Automatic credential management** - Obtains temporary,
least-privilege credentials via Unity Catalog API
- **Delta Lake integration** - Properly configures PyArrow filesystem
for Delta tables with session tokens
### Key Features
✅ **Production-ready credential vending API** - Uses stable, public
Unity Catalog APIs
✅ **Secure by default** - Temporary credentials with automatic cleanup
✅ **Multi-cloud** - AWS (S3), Azure (Blob Storage), and GCP (Cloud
Storage)
✅ **Delta Lake optimized** - Handles session tokens and PyArrow
filesystem configuration
✅ **Comprehensive error handling** - Helpful messages for common issues
(deletion vectors, permissions, etc.)
✅ **Full logging support** - Debug and info logging throughout
### Usage Example
```python
import ray
# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2",  # Required when the table is backed by AWS S3
)
# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```
### Implementation Notes
This is a **simplified, focused implementation** that:
- Supports **Unity Catalog tables only** (no volumes - that's in private
preview)
- Assumes **Delta Lake format** (most common Unity Catalog use case)
- Uses **production-ready APIs** only (no private preview features)
- Provides ~600 lines of clean, reviewable code
The full implementation with volumes and multi-format support is
available in the `data_uc_volumes` branch and can be added in a future
PR once this foundation is reviewed.
### Testing
- ✅ All ruff lint checks pass
- ✅ Code formatted per Ray standards
- ✅ Tested with real Unity Catalog Delta tables on AWS S3
- ✅ Proper PyArrow filesystem configuration verified
- ✅ Credential vending flow validated
## Related issues
Related to Unity Catalog and Delta Lake support requests in Ray Data.
## Additional information
### Architecture
The implementation follows the **connector pattern** rather than a
`Datasource` subclass because Unity Catalog is a metadata/credential
layer, not a data format. The connector:
1. Fetches table metadata from Unity Catalog REST API
2. Obtains temporary credentials via credential vending API
3. Configures cloud-specific environment variables
4. Delegates to `ray.data.read_delta()` with proper filesystem configuration (see the sketch after this list)
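A minimal sketch of that four-step flow under hypothetical helper names; the endpoint paths and response fields are my reading of the public Unity Catalog REST API, and `read_delta` is assumed to accept a `filesystem` argument like other Ray Data readers:

```python
import pyarrow.fs as pafs
import requests

import ray


def _read_uc_delta_table(table: str, url: str, token: str, region: str):
    headers = {"Authorization": f"Bearer {token}"}

    # 1. Fetch table metadata from the Unity Catalog REST API.
    meta = requests.get(
        f"{url}/api/2.1/unity-catalog/tables/{table}", headers=headers
    ).json()

    # 2. Obtain temporary, least-privilege credentials via credential vending.
    creds = requests.post(
        f"{url}/api/2.1/unity-catalog/temporary-table-credentials",
        headers=headers,
        json={"table_id": meta["table_id"], "operation": "READ"},
    ).json()

    # 3. Build a cloud-specific filesystem (AWS shown here).
    aws = creds["aws_temp_credentials"]
    fs = pafs.S3FileSystem(
        access_key=aws["access_key_id"],
        secret_key=aws["secret_access_key"],
        session_token=aws["session_token"],
        region=region,
    )

    # 4. Delegate to ray.data.read_delta() with that filesystem.
    return ray.data.read_delta(meta["storage_location"], filesystem=fs)
```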
### Delta Lake Special Handling
Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration
with session tokens (environment variables alone are insufficient). This
implementation correctly creates and passes the filesystem object to the
`deltalake` library.
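For reference, a sketch of how those pieces typically fit together with the `deltalake` package; credential values are placeholders and the exact option names are assumptions, not verbatim from the PR:

```python
import pyarrow.fs as pafs
from deltalake import DeltaTable

# Temporary credentials vended by Unity Catalog (placeholders).
key, secret, token, region = "ASIA...", "...", "...", "us-west-2"

# The Delta transaction-log reader takes credentials via storage_options...
dt = DeltaTable(
    "s3://bucket/path/to/table",
    storage_options={
        "AWS_ACCESS_KEY_ID": key,
        "AWS_SECRET_ACCESS_KEY": secret,
        "AWS_SESSION_TOKEN": token,
        "AWS_REGION": region,
    },
)

# ...while the Parquet scan gets an explicit PyArrow filesystem that carries
# the session token, which plain environment-variable configuration drops.
fs = pafs.S3FileSystem(
    access_key=key, secret_key=secret, session_token=token, region=region
)
dataset = dt.to_pyarrow_dataset(filesystem=fs)
```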
### Cloud Provider Support
| Provider | Credential Type | Implementation |
|----------|-----------------|----------------|
| AWS S3 | Temporary IAM credentials | PyArrow S3FileSystem with session token |
| Azure Blob | SAS tokens | Environment variables (AZURE_STORAGE_SAS_TOKEN) |
| GCP Cloud Storage | OAuth tokens / Service account | Environment variables (GCP_OAUTH_TOKEN, GOOGLE_APPLICATION_CREDENTIALS) |
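The Azure and GCP rows reduce to exporting the vended secret before the read. A hedged sketch, where the credential-response keys are assumptions about the vending payload rather than the PR's exact field names:

```python
import json
import os
import tempfile


def _apply_non_aws_credentials(creds: dict) -> None:
    """Expose vended Azure/GCP credentials through environment variables."""
    if "azure_user_delegation_sas" in creds:
        os.environ["AZURE_STORAGE_SAS_TOKEN"] = creds["azure_user_delegation_sas"]["sas_token"]
    elif "gcp_oauth_token" in creds:
        os.environ["GCP_OAUTH_TOKEN"] = creds["gcp_oauth_token"]["oauth_token"]
    elif "gcp_service_account_key" in creds:
        # GOOGLE_APPLICATION_CREDENTIALS must point at a JSON key file on disk.
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False
        ) as f:
            json.dump(creds["gcp_service_account_key"], f)
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = f.name
```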
### Error Handling
Comprehensive error messages for common issues:
- **Deletion Vectors**: Guidance on upgrading deltalake library or
disabling the feature
- **Column Mapping**: Compatibility information and solutions
- **Permissions**: Clear list of required Unity Catalog permissions
- **Credential issues**: Detailed troubleshooting steps
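As an illustration of the pattern (not the PR's exact wording), the deletion-vector case can be handled by catching the low-level failure and re-raising with actionable guidance:

```python
from deltalake import DeltaTable


def _open_delta_table(path: str, storage_options: dict) -> DeltaTable:
    try:
        return DeltaTable(path, storage_options=storage_options)
    except Exception as exc:
        if "deletion vector" in str(exc).lower():
            raise RuntimeError(
                "This table uses Delta deletion vectors. Upgrade the `deltalake` "
                "package to a version that supports them, or disable the feature "
                "on the table, then retry."
            ) from exc
        raise
```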
### Future Enhancements
Potential follow-up PRs:
- Unity Catalog volumes support (when out of private preview)
- Multi-format support (Parquet, CSV, JSON, images, etc.)
- Custom datasource integration
- Advanced Delta Lake features (time travel, partition filters)
### Dependencies
- Requires `deltalake` package for Delta Lake support
- Uses standard Ray Data APIs (`read_delta`, `read_datasource`)
- Integrates with existing PyArrow filesystem infrastructure
### Documentation
- Full docstrings with examples
- Type hints throughout
- Inline comments with references to external documentation
- Comprehensive error messages with actionable guidance
---------
Signed-off-by: soffer-anyscale <[email protected]>