feat(ibis): introduce Databricks connector #1361
Walkthrough

Adds Databricks support across Python models, data source routing, metadata provider, tests, tooling, and Rust manifest mapping to enable Databricks connections, metadata extraction, query routing, and function description updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Router/API
    participant DataSrcExt as DataSourceExtension
    participant Ibis as ibis Databricks Backend
    participant Meta as DatabricksMetadata
    participant DB as Databricks SQL
    Client->>API: POST /v2/connector/databricks/metadata/tables\n(connectionInfo)
    API->>DataSrcExt: get_connection(DataSource.databricks, info)
    DataSrcExt->>Ibis: get_databricks_connection(info)
    Ibis-->>DataSrcExt: Backend
    API->>Meta: DatabricksMetadata(connection_info)
    API->>Meta: get_table_list()
    Meta->>DB: SELECT ... FROM INFORMATION_SCHEMA.TABLES/COLUMNS
    DB-->>Meta: rows
    Meta->>Meta: _transform_column_type(...) per column
    Meta-->>API: List[Table]
    API-->>Client: 200 OK with metadata
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
wren-core/core/src/logical_plan/utils.rs (1)
185-185: LGTM – Defensive type mapping follows the established pattern. The mapping of "any" to `DataType::Utf8` is consistent with how other unsupported or dynamic types (json, xml, uuid, unknown) are handled in this codebase.

Consider adding a test case for the "any" type mapping in `test_map_data_type()` (around line 430) for completeness, similar to the existing test cases for other type mappings. Additionally, you might want to verify whether the "any" type is actually encountered in Databricks metadata discovery, or if this is a preemptive/hypothetical addition. If it's not actively used, you could defer this mapping until it's needed.

```rust
// Add to the test_map_data_type test cases around line 472:
("any", DataType::Utf8),
```

ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
12-16: Consider removing the redundant marker application. The `pytest_collection_modifyitems` hook adds the `databricks` marker to tests, but this is already accomplished by the module-level `pytestmark = pytest.mark.databricks` on line 7. The hook appears unnecessary unless you intend to mark tests in subdirectories.
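The redundancy can be seen in a minimal sketch (a standalone illustration with assumed test names, not the actual conftest): a module-level `pytestmark` already applies the marker to every test collected from the module, so a collection hook that adds the same marker does nothing new.

```python
import pytest

# A module-level pytestmark applies this marker to every test in the module,
# which is why a pytest_collection_modifyitems hook adding the same marker
# would be redundant for tests defined here.
pytestmark = pytest.mark.databricks


def test_example():
    # Selected by `pytest -m databricks` thanks to the module-level mark.
    assert True
```
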
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- ibis-server/poetry.lock is excluded by !**/*.lock
- ibis-server/resources/function_list/databricks.csv is excluded by !**/*.csv
📒 Files selected for processing (13)
- ibis-server/app/model/__init__.py (3 hunks)
- ibis-server/app/model/data_source.py (6 hunks)
- ibis-server/app/model/metadata/databricks.py (1 hunks)
- ibis-server/app/model/metadata/factory.py (2 hunks)
- ibis-server/pyproject.toml (3 hunks)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1 hunks)
- ibis-server/tests/routers/v3/connector/databricks/test_function.py (1 hunks)
- ibis-server/tests/routers/v3/connector/databricks/test_metadata.py (1 hunks)
- ibis-server/tests/routers/v3/connector/databricks/test_query.py (1 hunks)
- ibis-server/tools/update_databricks_functions.py (1 hunks)
- wren-core-base/manifest-macro/src/lib.rs (1 hunks)
- wren-core-base/src/mdl/manifest.rs (1 hunks)
- wren-core/core/src/logical_plan/utils.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-29T05:49:45.513Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1301
File: ibis-server/pyproject.toml:83-83
Timestamp: 2025-08-29T05:49:45.513Z
Learning: Polars dependency in ibis-server/pyproject.toml is correctly placed in dev dependencies because it's only used by helper tools in the tools/data_source/ directory, not in the main production application runtime.
Applied to files:
ibis-server/pyproject.toml
🧬 Code graph analysis (8)

wren-core-base/src/mdl/manifest.rs (1)
- ibis-server/app/model/data_source.py (1): `DataSource` (60-214)

ibis-server/app/model/metadata/factory.py (2)
- ibis-server/app/model/metadata/databricks.py (1): `DatabricksMetadata` (35-181)
- ibis-server/app/model/data_source.py (1): `DataSource` (60-214)

ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
- ibis-server/tools/update_databricks_functions.py (1): `connect` (43-66)

ibis-server/app/model/data_source.py (2)
- ibis-server/app/model/__init__.py (2): `DatabricksConnectionInfo` (367-382), `QueryDatabricksDTO` (70-71)
- ibis-server/tools/update_databricks_functions.py (1): `connect` (43-66)

ibis-server/app/model/metadata/databricks.py (4)
- ibis-server/app/model/__init__.py (1): `DatabricksConnectionInfo` (367-382)
- ibis-server/app/model/data_source.py (3): `DataSource` (60-214), `get_connection` (78-82), `get_connection` (238-253)
- ibis-server/app/model/metadata/dto.py (5): `Constraint` (94-100), `ConstraintType` (88-91), `RustWrenEngineColumnType` (13-58), `Table` (80-85), `TableProperties` (70-77)
- ibis-server/app/model/metadata/metadata.py (1): `Metadata` (7-21)

ibis-server/tests/routers/v3/connector/databricks/test_metadata.py (2)
- ibis-server/tests/conftest.py (1): `client` (18-23)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1): `connection_info` (55-60)

ibis-server/tests/routers/v3/connector/databricks/test_query.py (4)
- wren-core-base/manifest-macro/src/lib.rs (1): `manifest` (26-56)
- ibis-server/tests/routers/v3/connector/databricks/test_function.py (1): `manifest_str` (32-33)
- ibis-server/tests/conftest.py (1): `client` (18-23)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1): `connection_info` (55-60)

ibis-server/tests/routers/v3/connector/databricks/test_function.py (3)
- ibis-server/app/config.py (1): `get_config` (98-99)
- ibis-server/tests/conftest.py (2): `file_path` (10-11), `client` (18-23)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1): `connection_info` (55-60)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci
🔇 Additional comments (22)
wren-core-base/manifest-macro/src/lib.rs (1)
111-112: LGTM! The Databricks variant addition follows the established pattern and is positioned consistently with other data source variants.
wren-core-base/src/mdl/manifest.rs (1)
122-122: LGTM! The Display implementation for Databricks is consistent with other data source variants.
ibis-server/pyproject.toml (3)
23-23: LGTM! The databricks extra is correctly added to the ibis-framework dependency list.

52-52: LGTM! The databricks-sql-connector dependency is correctly configured with pyarrow extras for efficient data transfer.

103-103: LGTM! The pytest marker for databricks tests follows the established pattern.
ibis-server/app/model/metadata/factory.py (2)
6-6: LGTM! The import statement is correctly placed and follows the alphabetical ordering.

39-39: LGTM! The metadata factory mapping for Databricks is correctly configured.
ibis-server/app/model/__init__.py (3)
70-71: LGTM! The QueryDatabricksDTO follows the established pattern for query DTOs.

367-382: LGTM! The DatabricksConnectionInfo model is well-structured with appropriate use of SecretStr for sensitive fields and clear field documentation.

510-510: LGTM! The ConnectionInfo union is correctly updated to include DatabricksConnectionInfo.
ibis-server/tests/routers/v3/connector/databricks/conftest.py (3)
1-9: LGTM! The module setup correctly imports dependencies and configures the pytest marker for all tests.

19-51: LGTM! The fixture properly sets up the test schema and tables with appropriate constraints. The connection is correctly closed in the finally block. Using `CREATE OR REPLACE` ensures idempotency.
54-60: Verify environment variables are set before running tests. The fixture retrieves environment variables without validation. Tests will fail with unclear errors if variables are missing.
Consider adding validation:
```diff
 @pytest.fixture(scope="module")
 def connection_info() -> dict[str, str]:
+    required_vars = ["DATABRICKS_SERVER_HOSTNAME", "DATABRICKS_HTTP_PATH", "DATABRICKS_TOKEN"]
+    missing = [var for var in required_vars if not os.getenv(var)]
+    if missing:
+        pytest.skip(f"Databricks tests require environment variables: {', '.join(missing)}")
+
     return {
         "serverHostname": os.getenv("DATABRICKS_SERVER_HOSTNAME"),
         "httpPath": os.getenv("DATABRICKS_HTTP_PATH"),
         "accessToken": os.getenv("DATABRICKS_TOKEN"),
     }
```

ibis-server/tests/routers/v3/connector/databricks/test_metadata.py (4)
1-1: Verify the API version mismatch. The test file is located in `v3/connector/databricks/` but uses `v2_base_url = "/v2/connector/databricks"`. Confirm whether metadata endpoints are intentionally still on v2 or if this should use v3.
4-30: LGTM! The test comprehensively validates table metadata including structure, properties, and column details.

32-49: LGTM! The test properly validates foreign key constraint metadata, confirming the constraint relationship between t2 and t1.

51-57: LGTM! The test validates version retrieval. Consider adding validation of the version format if there's a predictable pattern.
ibis-server/app/model/data_source.py (5)
23-23: LGTM! The Databricks imports are correctly placed and alphabetically ordered.
Also applies to: 35-35
76-76: LGTM! The databricks data source is correctly added to the DataSource enum.

182-183: LGTM! The connection info building for Databricks follows the established pattern.

233-233: LGTM! The DataSourceExtension mapping for Databricks is correctly configured.
417-423: Verify timeout configuration for Databricks. The connection method correctly implements the `ibis.databricks.connect` call. However, note that other data sources (postgres, clickhouse, trino, bigquery) have timeout configuration in the `get_connection_info` method (lines 90-135), but Databricks does not. Confirm whether Databricks requires query timeout configuration or if it's handled differently.
Actionable comments posted: 2
🧹 Nitpick comments (1)
ibis-server/app/model/metadata/databricks.py (1)
41-60: Consider filtering additional system schemas. The WHERE clause excludes `information_schema`, but Databricks also has system catalogs (such as `system`) that typically contain internal metadata tables. Consider whether these should also be filtered out to avoid exposing system internals in the metadata API.

```diff
 WHERE
-    c.TABLE_SCHEMA NOT IN ('information_schema')
+    c.TABLE_SCHEMA NOT IN ('information_schema')
+    AND c.TABLE_CATALOG NOT IN ('system')
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- ibis-server/app/model/metadata/databricks.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ibis-server/app/model/metadata/databricks.py (4)
- ibis-server/app/model/__init__.py (1): `DatabricksConnectionInfo` (367-382)
- ibis-server/app/model/data_source.py (3): `DataSource` (60-214), `get_connection` (78-82), `get_connection` (238-253)
- ibis-server/app/model/metadata/dto.py (5): `Constraint` (94-100), `ConstraintType` (88-91), `RustWrenEngineColumnType` (13-58), `Table` (80-85), `TableProperties` (70-77)
- ibis-server/app/model/metadata/metadata.py (1): `Metadata` (7-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: check Cargo.toml formatting
- GitHub Check: cargo test (amd64)
- GitHub Check: cargo test (arm64)
- GitHub Check: clippy
- GitHub Check: ci
- GitHub Check: test
🔇 Additional comments (4)
ibis-server/app/model/metadata/databricks.py (4)
35-38: LGTM! The initialization properly extends the base `Metadata` class and establishes the Databricks connection using the standardized `DataSource` pattern.
61-99: LGTM! The result processing correctly builds `Table` objects with proper handling for complex types (`array`, `map`, `struct`) and standard type transformation. The logic for grouping columns by table and handling nullability is sound.
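The grouping logic referred to above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the row shape and the plain-dict column representation are assumptions standing in for the real `INFORMATION_SCHEMA` result and `Table`/`TableProperties` models.

```python
from collections import defaultdict


def group_columns_by_table(rows):
    """Group INFORMATION_SCHEMA-style column rows into per-table column lists.

    Each row is assumed to be a dict with table_schema, table_name,
    column_name, data_type, and is_nullable keys.
    """
    tables = defaultdict(list)
    for row in rows:
        key = f"{row['table_schema']}.{row['table_name']}"
        tables[key].append(
            {
                "name": row["column_name"],
                "type": row["data_type"],
                # INFORMATION_SCHEMA reports nullability as 'YES'/'NO'
                "notNull": row["is_nullable"].upper() == "NO",
            }
        )
    return dict(tables)


rows = [
    {"table_schema": "s", "table_name": "t1", "column_name": "id",
     "data_type": "int", "is_nullable": "NO"},
    {"table_schema": "s", "table_name": "t1", "column_name": "name",
     "data_type": "string", "is_nullable": "YES"},
]
grouped = group_columns_by_table(rows)
```
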
148-153: LGTM! The version retrieval correctly uses Databricks' `current_version().dbsql_version` function and extracts the scalar result appropriately.
155-184: LGTM! The helper methods correctly format table/constraint names and transform Databricks data types to the internal enum representation. The special handling for parameterized types (decimal, geography, geometry) and logging for unknown types are appropriate.
This PR has been tested locally.
Actionable comments posted: 1
♻️ Duplicate comments (1)
ibis-server/app/model/metadata/databricks.py (1)
15-32: Add missing `interval` type mapping. The `DATABRICKS_TYPE_MAPPING` is missing the `interval` entry. A previous review comment on these lines was marked as addressed in commit b595c06, but the code still lacks this mapping. Databricks SQL supports INTERVAL types (year-month and day-time forms), and `RustWrenEngineColumnType.INTERVAL` exists in the enum. Without this mapping, interval columns will be classified as `UNKNOWN` by `_transform_column_type` (line 181).

Add the missing entry:

```diff
     "tinyint": RustWrenEngineColumnType.TINYINT,
     "variant": RustWrenEngineColumnType.VARIANT,
     "object": RustWrenEngineColumnType.JSON,
+    "interval": RustWrenEngineColumnType.INTERVAL,
 }
```
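Why a missing entry silently degrades to `UNKNOWN` can be shown with a self-contained sketch. The enum below is a hypothetical stand-in for the real `RustWrenEngineColumnType` in `app/model/metadata/dto.py`, and the mapping mirrors only the shape of the real table, not its contents.

```python
from enum import Enum


class RustWrenEngineColumnType(Enum):
    # Simplified stand-in for the real enum in app/model/metadata/dto.py.
    TINYINT = "tinyint"
    INTERVAL = "interval"
    UNKNOWN = "unknown"


DATABRICKS_TYPE_MAPPING = {
    "tinyint": RustWrenEngineColumnType.TINYINT,
    # "interval" intentionally absent here to mirror the reported gap.
}


def transform_column_type(data_type: str) -> RustWrenEngineColumnType:
    # Unmapped type names fall back to UNKNOWN, as the review describes.
    return DATABRICKS_TYPE_MAPPING.get(
        data_type.lower(), RustWrenEngineColumnType.UNKNOWN
    )


before = transform_column_type("INTERVAL")  # falls through without the entry
DATABRICKS_TYPE_MAPPING["interval"] = RustWrenEngineColumnType.INTERVAL
after = transform_column_type("INTERVAL")   # resolved once the entry exists
```
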
🧹 Nitpick comments (1)
ibis-server/app/model/metadata/databricks.py (1)
166-187: Consider documenting the fallback behavior. The method logs a warning when encountering unmapped types (lines 184-185) and returns `UNKNOWN`. This is good defensive coding. Consider adding a docstring or inline comment explaining that `array`, `map`, and `struct` types are handled separately by the caller and should not be passed to this method, clarifying the expected input domain.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/ibis-ci.yml (1 hunks)
- ibis-server/app/model/metadata/databricks.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ibis-server/app/model/metadata/databricks.py (4)
- ibis-server/app/model/__init__.py (1): `DatabricksConnectionInfo` (367-382)
- ibis-server/app/model/data_source.py (3): `DataSource` (60-214), `get_connection` (78-82), `get_connection` (238-253)
- ibis-server/app/model/metadata/dto.py (5): `Constraint` (94-100), `ConstraintType` (88-91), `RustWrenEngineColumnType` (13-58), `Table` (80-85), `TableProperties` (70-77)
- ibis-server/app/model/metadata/metadata.py (1): `Metadata` (7-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: check Cargo.toml formatting
- GitHub Check: cargo test (amd64)
- GitHub Check: clippy
- GitHub Check: cargo test (arm64)
- GitHub Check: test
- GitHub Check: ci
🔇 Additional comments (2)
ibis-server/app/model/metadata/databricks.py (2)
85-88: Verify the approach for complex types. Complex types (`array`, `map`, `struct`) are stored as raw string representations (e.g., `"array<int>"`), while scalar types are normalized to `RustWrenEngineColumnType` enum values. This inconsistency means downstream consumers must handle two different type representations.

Confirm whether:
- The string representation for complex types is the intended design.
- Downstream code (Rust engine, query processing) correctly handles both enum values and type strings.
If both representations are supported by design, consider adding a comment explaining this dual approach for future maintainers.
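A downstream consumer handling this dual representation might branch as follows. This is a hedged sketch under the assumptions stated in the review (complex types arrive as raw strings, scalars as enum members); the function and enum names are illustrative, not the project's actual API.

```python
from enum import Enum


class ColumnType(Enum):
    # Stand-in for a scalar RustWrenEngineColumnType value.
    INT = "int"


def is_complex_type(column_type) -> bool:
    # Complex types are kept as raw strings like "array<int>";
    # scalar types are normalized to enum members, so a consumer
    # can distinguish the two representations by a type check.
    return isinstance(column_type, str) and column_type.lower().startswith(
        ("array", "map", "struct")
    )
```
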
101-149: LGTM! The constraint query correctly handles catalog and schema boundaries. The constraint extraction query now includes comprehensive join conditions on `constraint_schema`, `table_catalog`, `table_schema`, `table_name` (lines 115-118), and `constraint_catalog`, `constraint_schema` (lines 121-122). This prevents cross-catalog or cross-schema constraint mismatches, addressing all concerns raised in previous reviews.
Thanks @goldmedal!
Description
This PR introduces the Databricks connector. The connection info would be:
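The original example did not survive extraction; inferred from the test fixture in `conftest.py`, the connection info would look roughly like this (field names taken from the fixture; values are placeholders):

```python
# Hypothetical example inferred from the test fixture fields
# (serverHostname, httpPath, accessToken); values are placeholders.
connection_info = {
    "serverHostname": "dbc-xxxxxxxx-xxxx.cloud.databricks.com",
    "httpPath": "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",
    "accessToken": "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
}
```
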
For more authentication information, see the Databricks SQL Connector for Python documentation.
Something worth knowing
`ibis-server/tools/update_databricks_functions.py` is a tool for updating the function descriptions from Databricks.

Summary by CodeRabbit
New Features
Tests
Chores