Conversation

@soffer-anyscale
Contributor

Description

This PR adds support for reading Unity Catalog Delta tables in Ray Data with automatic credential vending. This enables secure, temporary access to Delta Lake tables stored in Databricks Unity Catalog without requiring users to manage cloud credentials manually.

What's Added

  • ray.data.read_unity_catalog() - Updated public API for reading Unity Catalog Delta tables
  • UnityCatalogConnector - Handles Unity Catalog REST API integration and credential vending
  • Multi-cloud support - Works with AWS S3, Azure Data Lake Storage, and Google Cloud Storage
  • Automatic credential management - Obtains temporary, least-privilege credentials via Unity Catalog API
  • Delta Lake integration - Properly configures PyArrow filesystem for Delta tables with session tokens

Key Features

  • Production-ready credential vending API - Uses stable, public Unity Catalog APIs
  • Secure by default - Temporary credentials with automatic cleanup
  • Multi-cloud - AWS (S3), Azure (Blob Storage), and GCP (Cloud Storage)
  • Delta Lake optimized - Handles session tokens and PyArrow filesystem configuration
  • Comprehensive error handling - Helpful messages for common issues (deletion vectors, permissions, etc.)
  • Full logging support - Debug and info logging throughout

Usage Example

```python
import ray

# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2"  # Optional, for AWS
)

# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```

Implementation Notes

This is a simplified, focused implementation that:

  • Supports Unity Catalog tables only (no volumes - that's in private preview)
  • Assumes Delta Lake format (most common Unity Catalog use case)
  • Uses production-ready APIs only (no private preview features)
  • Provides ~600 lines of clean, reviewable code

The full implementation with volumes and multi-format support is available in the data_uc_volumes branch and can be added in a future PR once this foundation is reviewed.

Testing

  • ✅ All ruff lint checks pass
  • ✅ Code formatted per Ray standards
  • ✅ Tested with real Unity Catalog Delta tables on AWS S3
  • ✅ Proper PyArrow filesystem configuration verified
  • ✅ Credential vending flow validated

Related issues

Related to Unity Catalog and Delta Lake support requests in Ray Data.

Additional information

Architecture

The implementation follows the connector pattern rather than a Datasource subclass because Unity Catalog is a metadata/credential layer, not a data format. The connector (sketched after this list):

  1. Fetches table metadata from Unity Catalog REST API
  2. Obtains temporary credentials via credential vending API
  3. Configures cloud-specific environment variables
  4. Delegates to ray.data.read_delta() with proper filesystem configuration
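
A minimal end-to-end sketch of that flow is shown below. It is not the code from this PR: the REST endpoint paths and response field names are assumptions based on the public Unity Catalog / Databricks credential-vending documentation, and the `filesystem` keyword on `ray.data.read_delta()` is assumed from the description above.

```python
import pyarrow.fs
import requests

import ray


def read_uc_delta_table_sketch(table: str, url: str, token: str, region: str):
    """Hedged sketch of the connector flow; not the PR's actual implementation."""
    headers = {"Authorization": f"Bearer {token}"}

    # 1. Fetch table metadata (storage location, table id) from the UC REST API.
    meta = requests.get(
        f"{url}/api/2.1/unity-catalog/tables/{table}", headers=headers
    ).json()

    # 2. Obtain temporary, read-scoped credentials via credential vending.
    creds = requests.post(
        f"{url}/api/2.1/unity-catalog/temporary-table-credentials",
        headers=headers,
        json={"table_id": meta["table_id"], "operation": "READ"},
    ).json()["aws_temp_credentials"]

    # 3. Build an explicit PyArrow S3 filesystem carrying the session token
    #    (AWS path; Azure/GCP set environment variables instead).
    fs = pyarrow.fs.S3FileSystem(
        access_key=creds["access_key_id"],
        secret_key=creds["secret_access_key"],
        session_token=creds["session_token"],
        region=region,
    )

    # 4. Delegate to Ray Data's Delta reader with the configured filesystem.
    return ray.data.read_delta(meta["storage_location"], filesystem=fs)
```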

Delta Lake Special Handling

Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration with session tokens (environment variables alone are insufficient). This implementation correctly creates and passes the filesystem object to the deltalake library.
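
For context, a sketch of the difference is below. It assumes the `deltalake` package's documented `storage_options` keys and the `filesystem` parameter of `to_pyarrow_dataset()`; the bucket path and credential values are placeholders standing in for what Unity Catalog vends.

```python
import pyarrow.fs
from deltalake import DeltaTable

# Placeholders for the temporary credentials vended by Unity Catalog.
creds = {
    "access_key_id": "ASIA...",
    "secret_access_key": "...",
    "session_token": "...",
}

# Explicit filesystem carrying the session token: exporting only
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars drops the token.
fs = pyarrow.fs.S3FileSystem(
    access_key=creds["access_key_id"],
    secret_key=creds["secret_access_key"],
    session_token=creds["session_token"],
    region="us-west-2",
)

# deltalake reads the transaction log via storage_options and the Parquet
# data files via the PyArrow filesystem passed explicitly.
dt = DeltaTable(
    "s3://my-bucket/path/to/table",
    storage_options={
        "AWS_ACCESS_KEY_ID": creds["access_key_id"],
        "AWS_SECRET_ACCESS_KEY": creds["secret_access_key"],
        "AWS_SESSION_TOKEN": creds["session_token"],
        "AWS_REGION": "us-west-2",
    },
)
dataset = dt.to_pyarrow_dataset(filesystem=fs)
```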

Cloud Provider Support

| Provider | Credential Type | Implementation |
|----------|-----------------|----------------|
| AWS S3 | Temporary IAM credentials | PyArrow S3FileSystem with session token |
| Azure Blob | SAS tokens | Environment variables (AZURE_STORAGE_SAS_TOKEN) |
| GCP Cloud Storage | OAuth tokens / Service account | Environment variables (GCP_OAUTH_TOKEN, GOOGLE_APPLICATION_CREDENTIALS) |
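
The AWS row is the explicit-filesystem path shown above; for the other rows, a rough sketch of the environment-variable handling is below. The env var names are the ones from the table; the credential field names are assumptions made for illustration only.

```python
import os


def configure_non_aws_env_sketch(provider: str, creds: dict) -> None:
    """Illustrative only: export vended credentials for downstream readers."""
    if provider == "azure":
        # Azure: short-lived SAS token from credential vending.
        os.environ["AZURE_STORAGE_SAS_TOKEN"] = creds["sas_token"]
    elif provider == "gcp":
        if "oauth_token" in creds:
            # GCP: temporary OAuth token.
            os.environ["GCP_OAUTH_TOKEN"] = creds["oauth_token"]
        else:
            # GCP: path to a service-account key written to a temp file
            # (cleaned up at process exit; see the commit notes below).
            os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = creds["key_file_path"]
```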

Error Handling

Comprehensive error messages for common issues:

  • Deletion Vectors: Guidance on upgrading deltalake library or disabling the feature
  • Column Mapping: Compatibility information and solutions
  • Permissions: Clear list of required Unity Catalog permissions
  • Credential issues: Detailed troubleshooting steps

Future Enhancements

Potential follow-up PRs:

  • Unity Catalog volumes support (when out of private preview)
  • Multi-format support (Parquet, CSV, JSON, images, etc.)
  • Custom datasource integration
  • Advanced Delta Lake features (time travel, partition filters)

Dependencies

  • Requires deltalake package for Delta Lake support
  • Uses standard Ray Data APIs (read_delta, read_datasource)
  • Integrates with existing PyArrow filesystem infrastructure

Documentation

  • Full docstrings with examples
  • Type hints throughout
  • Inline comments with references to external documentation
  • Comprehensive error messages with actionable guidance

…g and type safety

- Add dataclasses for structured responses (VolumeInfo, CredentialsResponse, AWSCredentials, AzureSASCredentials, GCPOAuthCredentials)
- Add CloudProvider enum for type-safe cloud provider handling
- Implement proper GCP temporary file handling with atexit cleanup
- Add comprehensive error messages for volume access failures
- Add extensive documentation with third-party API references
- Ensure Python 3.8+ compatibility (use Tuple instead of tuple)
- Remove ray.init() call (follows Ray Data architecture - initialization is external)
- Add support for all Ray Data file formats (lance, iceberg, hudi, etc.)
- Improve docstrings on all dataclasses and methods
- Pass all ruff and black lint checks

Supports Delta Lake features (deletion vectors, column mapping) via delegation to ray.data.read_delta().

Signed-off-by: soffer-anyscale <[email protected]>
- Fix volume path parsing for /Volumes/ prefix format (see the sketch below)
  * Correctly parse /Volumes/catalog/schema/volume/path into catalog.schema.volume
  * Add validation to ensure at least 3 components after /Volumes/
  * Previously only took first component, causing API failures

- Fix memory leak in GCP credential cleanup
  * Change cleanup method to static to avoid capturing self reference
  * Prevents UnityCatalogConnector instances from being garbage collected
  * Pass file path directly to atexit.register instead of using lambda

Signed-off-by: soffer-anyscale <[email protected]>
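
A small sketch of the path-parsing fix from the commit message above; the function name and return shape are illustrative, not the PR's exact code. For the second fix, the key point is that `atexit.register(cleanup_fn, path)` binds the file path directly rather than a lambda closing over `self`, so connector instances can still be garbage-collected.

```python
from typing import Tuple


def parse_volume_path(path: str) -> Tuple[str, str]:
    """Parse /Volumes/<catalog>/<schema>/<volume>[/<sub-path>] (illustrative)."""
    prefix = "/Volumes/"
    if not path.startswith(prefix):
        raise ValueError(f"Expected a {prefix} path, got: {path}")
    parts = [p for p in path[len(prefix):].split("/") if p]
    # Require at least catalog/schema/volume after the prefix.
    if len(parts) < 3:
        raise ValueError(
            "Volume path must be /Volumes/<catalog>/<schema>/<volume>[/<sub-path>]"
        )
    volume_full_name = ".".join(parts[:3])   # e.g. "main.sales.raw_files"
    sub_path = "/".join(parts[3:])           # remaining path inside the volume
    return volume_full_name, sub_path


# parse_volume_path("/Volumes/main/sales/raw_files/2024/01/data.csv")
# -> ("main.sales.raw_files", "2024/01/data.csv")
```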
The UnityCatalogConnector class was missing an API stability annotation,
which is required by Ray's API policy checks. This class is used internally
by the read_unity_catalog() function and shouldn't be directly instantiated
by users.

- Add @DeveloperAPI annotation to UnityCatalogConnector class
- Import DeveloperAPI from ray.util.annotations
- Fixes API policy check CI failure in documentation build

Signed-off-by: soffer-anyscale <[email protected]>
- Add ColumnInfo.from_dict() method to safely parse API responses (see the sketch below)
- Extracts only needed fields, ignoring extra fields like column_masks
- Fixes TypeError when Unity Catalog API returns unexpected fields
- Improve code modularity by extracting helper methods:
  - _create_auth_headers() for authorization header creation
  - _fetch_volume_metadata() for volume metadata fetching
  - _fetch_table_metadata() for table metadata fetching
  - _extract_storage_url() for storage URL construction
- Add comprehensive documentation to all methods
- Fix all ruff lint violations (trailing whitespace in docstrings)
- All precommit checks pass

Signed-off-by: soffer-anyscale <[email protected]>
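
A sketch of the `from_dict()` pattern described above: only declared dataclass fields are kept, so unexpected response fields such as `column_masks` no longer raise `TypeError`. The field names below are illustrative, not copied from the PR.

```python
from dataclasses import dataclass, fields


@dataclass
class ColumnInfo:
    name: str
    type_name: str
    nullable: bool = True

    @classmethod
    def from_dict(cls, data: dict) -> "ColumnInfo":
        # Keep only the fields this dataclass declares; drop everything else.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# Extra fields such as "column_masks" are silently ignored:
# ColumnInfo.from_dict({"name": "amount", "type_name": "DOUBLE", "column_masks": []})
# -> ColumnInfo(name="amount", type_name="DOUBLE", nullable=True)
```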
- Refactored _get_creds() method into smaller focused functions
- Added logging infrastructure following Ray Data patterns
- Fixed custom datasource handling to use correct instantiation
- Documented custom datasource constructor requirements
- Applied ruff formatting and fixed all lint issues
- Validated third-party integrations (PyArrow, Delta Lake, cloud SDKs)
- Enhanced error handling with comprehensive messages
- All pre-commit checks passing

Signed-off-by: soffer-anyscale <[email protected]>
This is a simplified version of Unity Catalog support that focuses on the core
use case: reading Delta Lake tables with credential vending.

Changes from the full implementation:
- Removed volume support (private preview feature)
- Removed format detection/inference
- Removed custom datasource support
- Assumes all tables are Delta format
- Simplified API: read_unity_catalog(table, url, token, ...)

This makes the implementation much easier to review while providing the most
valuable functionality: secure access to Unity Catalog Delta tables with
automatic credential vending.

The simplified implementation includes:
- Unity Catalog table metadata fetching
- Production-ready credential vending API
- AWS, Azure, and GCP credential handling
- Delta Lake reading with proper PyArrow filesystem configuration
- Comprehensive error messages for common issues
- Full logging support

Total: ~600 lines vs ~1150 lines in full implementation
Signed-off-by: soffer-anyscale <[email protected]>
@soffer-anyscale requested a review from a team as a code owner October 21, 2025 17:06

@gemini-code-assist bot left a comment

Code Review

This pull request provides a significant enhancement to the Unity Catalog integration by refactoring the connector into a more robust and maintainable implementation. The use of dataclasses to model API responses, detailed logging, and comprehensive error handling for common Delta Lake issues are excellent improvements. The code is clean and well-documented. I have a couple of suggestions to further improve robustness, particularly around handling required parameters and failing fast to provide clearer error messages to the user.

The hardcoded fallback to 'us-east-1' could lead to connection errors if the
S3 bucket is in a different region and requires signature version 4.

Changes:
- Remove hardcoded 'us-east-1' fallback in S3FileSystem creation
- Add explicit validation that region is provided for AWS (see the sketch below)
- Provide clear error message with example usage
- Update docstrings to clarify region is required for AWS
- Improve error handling flow in _read_delta_with_credentials()

This prevents hard-to-debug runtime failures and makes the API clearer for
users working with AWS S3-backed Delta tables.

Signed-off-by: soffer-anyscale <[email protected]>
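
A minimal sketch of the fail-fast check described above; the function name and message wording are illustrative.

```python
def _validate_aws_region_sketch(region, storage_url: str) -> str:
    """Fail fast instead of silently defaulting to us-east-1."""
    if storage_url.startswith("s3://") and not region:
        raise ValueError(
            "`region` is required for AWS S3-backed Delta tables. Example: "
            "ray.data.read_unity_catalog(table=..., url=..., token=..., "
            "region='us-west-2')"
        )
    return region
```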
Reduced from 637 lines to 193 lines by starting from master's uc_datasource.py
and adding only essential changes:

What's added (56 lines over master's 137 lines):
- PyArrow S3FileSystem support for Delta with session tokens (AWS requirement)
- Required region validation for AWS to prevent connection errors
- Deletion vector error handling with helpful messages
- GCP temp file cleanup with atexit to prevent file leaks

Changes from previous version:
- Renamed unity_catalog_datasource.py -> uc_datasource.py (matches master)
- Removed all dataclasses, enums, and complex structures
- Removed volume support (not needed for initial release)
- Kept simple dict-based approach from master
- Restored data_format parameter for flexibility
- Minimal API surface matching master's design

Net change: -459 lines (removed 674, added 215)
File: 193 lines (vs 137 in master = +56 lines of essential fixes)

Signed-off-by: soffer-anyscale <[email protected]>
@ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Oct 21, 2025
Convert Markdown-style link to reStructuredText format to fix documentation build warning. The docstring used [text](url) syntax which is Markdown, but Sphinx expects RST syntax for links: `text <url>`_

Signed-off-by: soffer-anyscale <[email protected]>
@soffer-anyscale added the go (add ONLY when ready to merge, run all tests) label Oct 22, 2025
@bveeramani merged commit 16ef6a0 into master Oct 23, 2025
7 checks passed
@bveeramani deleted the data_uc_tables_delta_only branch October 23, 2025 17:08
gvspraveen pushed a commit that referenced this pull request Oct 23, 2025
aslonnie pushed a commit that referenced this pull request Oct 23, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025