feat(ibis): optimize Databricks connector and support service principal connection #1373
Conversation
Walkthrough

This PR adds service principal authentication support for Databricks and refactors the Databricks connector and metadata layer to use the dedicated DatabricksConnector.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant QueryRouter
    participant Connector as DatabricksConnector
    participant DBsql as databricks.sql
    rect rgba(200, 220, 255, 0.3)
        Note over Client,DBsql: Token-Based Authentication
        Client->>QueryRouter: POST query (token auth)
        QueryRouter->>Connector: new DatabricksConnector(connection_info)
        Connector->>DBsql: connect(server_hostname, http_path, access_token)
        DBsql-->>Connector: connection established
    end
    rect rgba(200, 255, 220, 0.3)
        Note over Client,DBsql: Service Principal Authentication
        Client->>QueryRouter: POST query (service principal auth)
        QueryRouter->>Connector: new DatabricksConnector(connection_info)
        Connector->>Connector: build oauth_service_principal(DbConfig)
        Connector->>DBsql: connect(server_hostname, http_path, credentials_provider)
        DBsql-->>Connector: connection established
    end
    rect rgba(255, 240, 200, 0.3)
        Note over Client,DBsql: Query Execution
        Connector->>DBsql: execute SQL
        DBsql->>DBsql: fetch results as Arrow
        DBsql-->>Connector: results
        Connector-->>QueryRouter: results
        QueryRouter-->>Client: 200 OK with data
    end
    Connector->>DBsql: close()
    DBsql-->>Connector: disconnected
```
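The two connect paths in the diagram differ only in how credentials reach `databricks.sql.connect`. A minimal, hypothetical sketch of that dispatch (field names follow the sample connection infos later in this PR; the placeholder lambda stands in for the real `oauth_service_principal` credentials provider, and no real connection is made):

```python
# Hypothetical sketch of the token vs. service-principal dispatch from the
# diagram. Only the kwargs construction is shown.
def build_connect_kwargs(info: dict) -> dict:
    """Map a Databricks connection_info payload to databricks.sql.connect kwargs."""
    kwargs = {
        "server_hostname": info["serverHostname"],
        "http_path": info["httpPath"],
    }
    if info.get("databricks_type") == "service_principal":
        # Real code passes a callable built from oauth_service_principal(...);
        # a placeholder callable stands in here.
        kwargs["credentials_provider"] = lambda: None
    else:
        kwargs["access_token"] = info["accessToken"]
    return kwargs
```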
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20-25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (7)
ibis-server/app/model/__init__.py (1)
417-420: Consider including service principal in the global ConnectionInfo union

Right now `ConnectionInfo` only includes `DatabricksTokenConnectionInfo`, while service principal is only reachable via `DatabricksConnectionUnion` (e.g., `QueryDatabricksDTO`). If you intend service principal to be usable anywhere `ConnectionInfo` is accepted (e.g., `ValidateDTO`, `TranspileDTO`), it may be clearer and more consistent to also include `DatabricksServicePrincipalConnectionInfo` there. For example:

```diff
-    | RedshiftIAMConnectionInfo
-    | SnowflakeConnectionInfo
-    | DatabricksTokenConnectionInfo
+    | RedshiftIAMConnectionInfo
+    | SnowflakeConnectionInfo
+    | DatabricksTokenConnectionInfo
+    | DatabricksServicePrincipalConnectionInfo
```

If those endpoints are intentionally restricted to token auth, keeping it as-is is fine but might be worth documenting.
Also applies to: 548-548
ibis-server/tests/routers/v3/connector/databricks/test_query.py (1)
99-115: Consider asserting result contents for the service principal path
`test_query_with_service_principal` currently verifies status code and basic shape, which is good for wiring coverage, but it wouldn't catch subtle differences (e.g., wrong dtypes or data values) between token and service principal flows. You could strengthen it by reusing the same row/dtype assertions as `test_query`, for example:

```diff
 async def test_query_with_service_principal(
     client, manifest_str, service_principal_connection_info
 ):
@@
-    assert len(result["columns"]) == len(manifest["models"][0]["columns"])
-    assert len(result["data"]) == 1
+    assert len(result["columns"]) == len(manifest["models"][0]["columns"])
+    assert len(result["data"]) == 1
+
+    assert result["data"][0] == [
+        1,
+        184501,
+        "O",
+        "203010.51",
+        "1996-01-02",
+        "1_184501",
+        "2024-01-01 23:59:59.000000 +00:00",
+        "2024-01-01 23:59:59.000000 +00:00",
+        None,
+    ]
+    # Optionally also assert dtypes if you want full parity with the token path.
```

Not mandatory, but it would give you stronger guarantees that both auth methods behave identically.
ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
55-61: Fixtures align with the new discriminated Databricks connection types
`connection_info` now includes `"databricks_type": "token"` and uses the same JSON field names as `DatabricksTokenConnectionInfo`, which will let Pydantic pick the correct variant. `service_principal_connection_info` similarly matches `DatabricksServicePrincipalConnectionInfo` (minus the optional `azureTenantId`), giving good coverage of the service principal path.

If you plan to exercise the `azure_tenant_id` branch as well, you could extend the fixture to read `TEST_DATABRICKS_AZURE_TENANT_ID` and include `"azureTenantId"` when present:

```diff
 def service_principal_connection_info() -> dict[str, str]:
-    return {
-        "databricks_type": "service_principal",
-        "serverHostname": os.getenv("TEST_DATABRICKS_SERVER_HOSTNAME"),
-        "httpPath": os.getenv("TEST_DATABRICKS_HTTP_PATH"),
-        "clientId": os.getenv("TEST_DATABRICKS_CLIENT_ID"),
-        "clientSecret": os.getenv("TEST_DATABRICKS_CLIENT_SECRET"),
-    }
+    info = {
+        "databricks_type": "service_principal",
+        "serverHostname": os.getenv("TEST_DATABRICKS_SERVER_HOSTNAME"),
+        "httpPath": os.getenv("TEST_DATABRICKS_HTTP_PATH"),
+        "clientId": os.getenv("TEST_DATABRICKS_CLIENT_ID"),
+        "clientSecret": os.getenv("TEST_DATABRICKS_CLIENT_SECRET"),
+    }
+    azure_tenant_id = os.getenv("TEST_DATABRICKS_AZURE_TENANT_ID")
+    if azure_tenant_id:
+        info["azureTenantId"] = azure_tenant_id
+    return info
```

This keeps the happy path as-is but allows you to test the optional tenant ID behavior when available.
Also applies to: 64-72
ibis-server/app/model/metadata/databricks.py (1)
3-4: Switch to DatabricksConnector/query is coherent; consider widening the type hint

The refactor of `DatabricksMetadata` to use `DatabricksConnector` and its `query` method looks consistent:

- `__init__` now builds a `DatabricksConnector` once and reuses it.
- `get_table_list`, `get_constraints`, and `get_version` all use `self.connection.query(...).to_pandas()`, which matches the connector's contract.

One small refinement: the `__init__` signature currently restricts `connection_info` to `DatabricksTokenConnectionInfo`, but `DatabricksConnector` itself accepts the full `DatabricksConnectionUnion`. To keep metadata in sync with the connector and make it straightforward to support service principal here as well, you could widen the annotation and import:

```diff
-from app.model import DatabricksTokenConnectionInfo
+from app.model import DatabricksConnectionUnion
@@
-class DatabricksMetadata(Metadata):
-    def __init__(self, connection_info: DatabricksTokenConnectionInfo):
+class DatabricksMetadata(Metadata):
+    def __init__(self, connection_info: DatabricksConnectionUnion):
         super().__init__(connection_info)
         self.connection = DatabricksConnector(connection_info)
```

This doesn't change runtime behavior but makes the type contract accurately reflect what the class can already handle via the connector.
Also applies to: 36-39, 61-61, 125-125, 151-156
ibis-server/app/model/data_source.py (1)
424-429: Confirm that `get_databricks_connection` is never called with a service-principal info

The narrowed type to `DatabricksTokenConnectionInfo` matches the implementation (expects `access_token`), but `_build_connection_info` can now also produce `DatabricksServicePrincipalConnectionInfo`. Please ensure that any code path that calls `DataSource.databricks.get_connection(info)` only ever passes a token-based info, with service-principal flows going exclusively through `DatabricksConnector`.

If service-principal ever needs ibis-based metadata as well, consider either:

- Accepting the union and branching on `databricks_type` here, or
- Clearly guarding the call site(s) to enforce token-only usage.
ibis-server/app/model/connector.py (2)
30-32: Databricks driver and SDK imports are fine assuming they're hard dependencies

Importing `databricks-sql-connector` and `databricks-sdk` at module import time is acceptable if these are now mandatory runtime dependencies (as implied by pyproject). If you intend Databricks to remain optional, consider moving these imports into `DatabricksConnector.__init__` instead so non-Databricks deployments don't fail at import time.
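The deferred-import alternative follows a common pattern; here is a sketch using the stdlib `json` module as a stand-in for the Databricks driver (the class name is hypothetical):

```python
class LazyConnector:
    """Illustrates deferring a heavy or optional import until first use."""

    def __init__(self):
        # Deferred: importing *this* module no longer requires the dependency.
        # json stands in for databricks-sql-connector here.
        import json

        self._codec = json

    def encode(self, obj) -> str:
        return self._codec.dumps(obj)
```

With this shape, only deployments that actually construct the connector pay the import cost (or see the ImportError).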
618-666: DatabricksConnector implementation looks correct; consider small polish

The connector correctly:

- Distinguishes token vs service-principal configs via `isinstance` on the shared union type.
- Uses `dbsql.connect(...)` with `server_hostname`, `http_path`, and either `access_token` or a `credentials_provider` built from `oauth_service_principal(DbConfig(**kwargs))`.
- Executes queries via a cursor context manager and returns Arrow tables using `fetchmany_arrow(limit)` / `fetchall_arrow()`.
- Implements `dry_run` using a `SELECT * FROM ({sql}) AS sub LIMIT 0` wrapper consistent with other connectors.
- Closes the underlying connection with error logging.

Two minor follow-ups you might consider:

- Add explicit type hints for `query` and `dry_run` for consistency with other connectors:

```diff
-    def query(self, sql, limit=None):
+    def query(self, sql: str, limit: int | None = None) -> pa.Table:
     ...
-    def dry_run(self, sql):
+    def dry_run(self, sql: str) -> None:
```

- Optionally guard against negative `limit` values (e.g., treat `< 0` as no limit or raise) before calling `fetchmany_arrow(limit)`.

Please double-check against the versions in use that:

- `oauth_service_principal(DbConfig(**kwargs))` returns an object suitable for the `credentials_provider` parameter of `dbsql.connect`, and
- `fetchall_arrow` / `fetchmany_arrow` have the expected signatures and return `pyarrow.Table`.
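The suggested negative-limit guard (treating `< 0` as no limit) can be sketched with a fake cursor, since the real `fetchmany_arrow` / `fetchall_arrow` come from databricks-sql-connector and return `pyarrow.Table`; plain lists stand in here:

```python
class FakeCursor:
    """Minimal stand-in for a databricks.sql cursor, returning plain lists."""

    def __init__(self, rows):
        self._rows = rows

    def fetchmany_arrow(self, size):
        return self._rows[:size]

    def fetchall_arrow(self):
        return self._rows


def fetch_with_limit(cursor, limit=None):
    # Guard: None or a negative value means "no limit" rather than an
    # empty or driver-dependent fetch.
    if limit is None or limit < 0:
        return cursor.fetchall_arrow()
    return cursor.fetchmany_arrow(limit)
```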
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`ibis-server/poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- `ibis-server/app/model/__init__.py` (4 hunks)
- `ibis-server/app/model/connector.py` (4 hunks)
- `ibis-server/app/model/data_source.py` (3 hunks)
- `ibis-server/app/model/metadata/databricks.py` (5 hunks)
- `ibis-server/pyproject.toml` (1 hunks)
- `ibis-server/tests/routers/v3/connector/databricks/conftest.py` (1 hunks)
- `ibis-server/tests/routers/v3/connector/databricks/test_query.py` (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-18T02:23:34.040Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().
Applied to files:
ibis-server/app/model/connector.py
📚 Learning: 2025-06-18T02:23:34.040Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.
Applied to files:
ibis-server/app/model/connector.py
📚 Learning: 2025-08-29T05:49:45.513Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1301
File: ibis-server/pyproject.toml:83-83
Timestamp: 2025-08-29T05:49:45.513Z
Learning: Polars dependency in ibis-server/pyproject.toml is correctly placed in dev dependencies because it's only used by helper tools in the tools/data_source/ directory, not in the main production application runtime.
Applied to files:
ibis-server/pyproject.toml
🧬 Code graph analysis (5)
ibis-server/app/model/connector.py (5)
- ibis-server/wren/session/__init__.py (3)
  - `sql` (37-52), `execute` (145-164), `dry_run` (136-143)
- ibis-server/app/model/__init__.py (2)
  - `DatabricksServicePrincipalConnectionInfo` (387-414), `DatabricksTokenConnectionInfo` (367-383)
- ibis-server/app/model/data_source.py (1)
  - `DataSource` (61-220)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  - `connection_info` (55-61)
- ibis-server/tools/update_databricks_functions.py (1)
  - `connect` (43-66)
ibis-server/app/model/__init__.py (1)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  - `connection_info` (55-61)
ibis-server/app/model/metadata/databricks.py (3)
- ibis-server/app/model/__init__.py (1)
  - `DatabricksTokenConnectionInfo` (367-383)
- ibis-server/app/model/connector.py (9)
  - `DatabricksConnector` (618-666), `query` (102-135), `query` (189-194), `query` (310-314), `query` (377-384), `query` (440-478), `query` (498-506), `query` (595-603), `query` (646-655)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  - `connection_info` (55-61)
ibis-server/tests/routers/v3/connector/databricks/test_query.py (3)
- ibis-server/tests/conftest.py (1)
  - `client` (18-23)
- ibis-server/tests/routers/v3/connector/databricks/test_function.py (1)
  - `manifest_str` (32-33)
- ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  - `service_principal_connection_info` (65-72)
ibis-server/app/model/data_source.py (1)
- ibis-server/app/model/__init__.py (2)
  - `DatabricksServicePrincipalConnectionInfo` (387-414), `DatabricksTokenConnectionInfo` (367-383)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: ci
- GitHub Check: ci
🔇 Additional comments (8)
ibis-server/pyproject.toml (1)
53-53: databricks-sdk dependency addition looks appropriate

Adding `databricks-sdk` to main dependencies aligns with the new `DatabricksConnector` service principal flow that relies on the SDK; no further changes needed here from what's visible in this PR.

ibis-server/app/model/__init__.py (2)
70-71: Strong typing for Databricks query DTO looks good

Narrowing `QueryDatabricksDTO.connection_info` to `DatabricksConnectionUnion` (instead of the broader `dict | ConnectionInfo`) aligns the API with the new discriminated union and enforces better validation for both token and service principal flows.
367-420: Databricks token + service principal models are consistent with the JSON contract

The `DatabricksTokenConnectionInfo` and `DatabricksServicePrincipalConnectionInfo` models look well-structured:

- Field aliases (`serverHostname`, `httpPath`, `accessToken`, `clientId`, `clientSecret`, `azureTenantId`) match the test fixtures.
- `databricks_type` literals provide a clear discriminator for the union.
- Optional `azure_tenant_id` with a default keeps the schema flexible.

No issues from a modeling perspective.
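The alias behavior those fields rely on can be illustrated with a plain mapping (Pydantic handles this via field aliases; the dict and helper below are only an illustration of the token variant, not the real model):

```python
# Illustrative camelCase-to-snake_case alias table for the token model.
TOKEN_ALIASES = {
    "serverHostname": "server_hostname",
    "httpPath": "http_path",
    "accessToken": "access_token",
}


def to_snake(payload: dict) -> dict:
    """Rename aliased keys, passing unknown keys through unchanged."""
    return {TOKEN_ALIASES.get(key, key): value for key, value in payload.items()}
```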
ibis-server/tests/routers/v3/connector/databricks/test_query.py (1)
52-52: Manifest and dtype expectations updated consistently for Databricks

Adding `"dataSource": "databricks"` to the manifest and updating the expected dtypes (`totalprice` as `decimal128(18, 2)` and timestamps with `tz=Etc/UTC`) look consistent with the behavior you'd expect from the new Databricks connector/driver stack. The assertions remain precise and should catch regressions in type mapping.

Also applies to: 90-96
ibis-server/app/model/data_source.py (2)
23-24: Databricks connection info imports are correctly wired

Importing the two concrete Databricks connection info types here cleanly exposes them to the DataSource logic and keeps all other behavior unchanged.
184-189: Databricks discriminant logic looks correct and is backward compatible

The `databricks_type` check mirrors the existing Redshift IAM pattern and preserves the token path as the default when the field is absent or not `"service_principal"`. This gives a safe, backward-compatible split between token and service principal configs.
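A minimal sketch of that backward-compatible default (hypothetical helper name; the real branch lives in `_build_connection_info`):

```python
def resolve_databricks_auth(info: dict) -> str:
    """Anything without databricks_type == "service_principal" is token auth."""
    if info.get("databricks_type") == "service_principal":
        return "service_principal"
    return "token"
```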
46-48: Databricks connection models correctly exposed to the connector

Wiring `DatabricksConnectionUnion` plus the two concrete types into this module keeps connector type usage consistent with the shared model definitions.
97-98: Routing Databricks to the dedicated connector is clean and localized

The new `elif data_source == DataSource.databricks` branch ensures all Databricks work goes through `DatabricksConnector` without affecting other sources, preserving the existing Connector dispatch pattern.
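The dispatch shape described above can be sketched as follows (class names are simplified stand-ins for the real connectors in `connector.py`):

```python
class GenericConnector:
    """Stand-in for the default ibis-backed connector path."""

    def __init__(self, data_source: str):
        self.data_source = data_source


class DatabricksConnector:
    """Stand-in for the dedicated Databricks connector."""

    def __init__(self, connection_info: dict):
        self.connection_info = connection_info


def make_connector(data_source: str, connection_info: dict):
    # Databricks takes the dedicated branch; every other source is untouched.
    if data_source == "databricks":
        return DatabricksConnector(connection_info)
    return GenericConnector(data_source)
```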
Sample PAT connection info
```json
{
  "databricks_type": "token",
  "serverHostname": "https://dbc-1234555-abd4.cloud.databricks.com",
  "httpPath": "/sql/1.0/warehouses/1234567777",
  "accessToken": "41e4f097b2066c1869a..."
}
```

Sample Service Principal connection info
```json
{
  "databricks_type": "service_principal",
  "serverHostname": "https://dbc-1234555-abd4.cloud.databricks.com",
  "httpPath": "/sql/1.0/warehouses/1234567777",
  "clientId": "0aass6cc-8339-4a1b-1234-aabbccdd",
  "clientSecret": "dose388ed611d3e564a291ssfassf"
}
```

Summary by CodeRabbit
New Features
Chores
Tests