
feat(ibis): optimize Databricks connector and support service principal connection #1373

Merged
douenergy merged 3 commits into Canner:main from goldmedal:feat/optimize-databricks-connector
Nov 19, 2025

Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Nov 14, 2025

  • Use the Databricks Python driver directly
  • Support OAuth M2M (machine-to-machine) authentication

Sample PAT connection info

{
    "databricks_type": "token",
    "serverHostname": "https://dbc-1234555-abd4.cloud.databricks.com",
    "httpPath": "/sql/1.0/warehouses/1234567777",
    "accessToken": "41e4f097b2066c1869a..."
}

Sample Service Principal connection info

{
    "databricks_type": "service_principal",
    "serverHostname": "https://dbc-1234555-abd4.cloud.databricks.com",
    "httpPath": "/sql/1.0/warehouses/1234567777",
    "clientId": "0aass6cc-8339-4a1b-1234-aabbccdd",
    "clientSecret": "dose388ed611d3e564a291ssfassf"
}
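
As a rough illustration, the required fields of each payload shape can be checked with a small helper. This is a sketch only: the helper name, the error type, and the strict rejection of unknown databricks_type values are assumptions for this example, not part of the actual API.

```python
# Sketch only: field names follow the sample payloads above; the helper
# name and its error behavior are hypothetical, not the real validation.
REQUIRED_FIELDS = {
    "token": {"serverHostname", "httpPath", "accessToken"},
    "service_principal": {"serverHostname", "httpPath", "clientId", "clientSecret"},
}

def validate_connection_info(info: dict) -> str:
    """Return the databricks_type if the payload carries its required fields."""
    # When databricks_type is absent, token auth is the default variant
    kind = info.get("databricks_type", "token")
    required = REQUIRED_FIELDS.get(kind)
    if required is None:
        raise ValueError(f"unknown databricks_type: {kind!r}")
    missing = required - info.keys()
    if missing:
        raise ValueError(f"missing fields for {kind}: {sorted(missing)}")
    return kind
```

In the real code path, this kind of shape checking is handled by pydantic model validation rather than a hand-rolled helper.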

Summary by CodeRabbit

  • New Features

    • Added Databricks connector with service principal authentication support, complementing existing token-based authentication.
  • Chores

    • Added databricks-sdk dependency.
  • Tests

    • Added test coverage for service principal authentication flow.

@coderabbitai
Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

This PR adds service principal authentication support for Databricks. It refactors DatabricksConnectionInfo to DatabricksTokenConnectionInfo, introduces DatabricksServicePrincipalConnectionInfo, and unifies both via a discriminated union type. A new DatabricksConnector class replaces DataSource-based connections, supporting both authentication flows. Metadata and tests are updated accordingly.

Changes

  • Databricks Connection Models (ibis-server/app/model/__init__.py, ibis-server/app/model/data_source.py):
    Renamed DatabricksConnectionInfo to DatabricksTokenConnectionInfo with a databricks_type: Literal["token"] field; added DatabricksServicePrincipalConnectionInfo with databricks_type: Literal["service_principal"] and credential fields; introduced the discriminated union DatabricksConnectionUnion; updated public API exports.
  • Databricks Connector (ibis-server/app/model/connector.py):
    Implemented the new DatabricksConnector class with dual authentication paths (token-based and service principal via oauth_service_principal); added query(), dry_run(), and close() methods; integrated it into the central connector factory.
  • Metadata Layer (ibis-server/app/model/metadata/databricks.py):
    Updated DatabricksMetadata.__init__ to use DatabricksTokenConnectionInfo; replaced DataSource.databricks.get_connection() with DatabricksConnector initialization; switched SQL execution from the sql() to the query() method.
  • Dependencies (ibis-server/pyproject.toml):
    Added databricks-sdk with version constraint ^0.73.0.
  • Tests (ibis-server/tests/routers/v3/connector/databricks/conftest.py, ibis-server/tests/routers/v3/connector/databricks/test_query.py):
    Added a service_principal_connection_info fixture; updated the connection_info fixture with databricks_type and TEST_DATABRICKS_* environment variables; added the test_query_with_service_principal test; adjusted the manifest and expected data types.

Sequence Diagram

sequenceDiagram
    participant Client
    participant QueryRouter
    participant Connector as DatabricksConnector
    participant DBsql as databricks.sql
    
    rect rgba(200, 220, 255, 0.3)
    Note over Client,DBsql: Token-Based Authentication
    Client->>QueryRouter: POST query (token auth)
    QueryRouter->>Connector: new DatabricksConnector(connection_info)
    Connector->>DBsql: connect(server_hostname, http_path, access_token)
    DBsql-->>Connector: connection established
    end
    
    rect rgba(200, 255, 220, 0.3)
    Note over Client,DBsql: Service Principal Authentication
    Client->>QueryRouter: POST query (service principal auth)
    QueryRouter->>Connector: new DatabricksConnector(connection_info)
    Connector->>Connector: build oauth_service_principal(DbConfig)
    Connector->>DBsql: connect(server_hostname, http_path, credentials_provider)
    DBsql-->>Connector: connection established
    end
    
    rect rgba(255, 240, 200, 0.3)
    Note over Client,DBsql: Query Execution
    Connector->>DBsql: execute SQL
    DBsql->>DBsql: fetch results as Arrow
    DBsql-->>Connector: results
    Connector-->>QueryRouter: results
    QueryRouter-->>Client: 200 OK with data
    end
    
    Connector->>DBsql: close()
    DBsql-->>Connector: disconnected

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20-25 minutes

  • Areas requiring extra attention:
    • DatabricksConnector implementation: verify both authentication flows (token and service principal) correctly handle credentials and error cases
    • Discriminated union logic in model/init.py: ensure databricks_type field properly discriminates between connection types
    • Metadata layer changes: confirm the switch from DataSource.databricks to direct DatabricksConnector usage doesn't break existing behavior
    • query() vs. sql() method change: verify Arrow result handling remains consistent with previous DataSource approach

Suggested labels

ibis, dependencies, python

Suggested reviewers

  • douenergy

Poem

🐰 Two paths converge in the Databricks wood,
Token and principal, misunderstood—
'Til discriminated union made them one,
The connector now dances, queries run!
With service and secret, authentication's won! 🔐

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 15.79%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The title accurately summarizes the main changes: optimizing the Databricks connector and adding service principal authentication support, which are the core objectives of the PR.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

@github-actions github-actions bot added the ibis, dependencies (Pull requests that update a dependency file), and python (Pull requests that update Python code) labels Nov 14, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
ibis-server/app/model/__init__.py (1)

417-420: Consider including service principal in the global ConnectionInfo union

Right now ConnectionInfo only includes DatabricksTokenConnectionInfo, while service principal is only reachable via DatabricksConnectionUnion (e.g., QueryDatabricksDTO).

If you intend service principal to be usable anywhere ConnectionInfo is accepted (e.g., ValidateDTO, TranspileDTO), it may be clearer and more consistent to also include DatabricksServicePrincipalConnectionInfo there.

For example:

-    | RedshiftIAMConnectionInfo
-    | SnowflakeConnectionInfo
-    | DatabricksTokenConnectionInfo
+    | RedshiftIAMConnectionInfo
+    | SnowflakeConnectionInfo
+    | DatabricksTokenConnectionInfo
+    | DatabricksServicePrincipalConnectionInfo

If those endpoints are intentionally restricted to token auth, keeping it as‑is is fine but might be worth documenting.

Also applies to: 548-548

ibis-server/tests/routers/v3/connector/databricks/test_query.py (1)

99-115: Consider asserting result contents for the service principal path

test_query_with_service_principal currently verifies status code and basic shape, which is good for wiring coverage, but it wouldn’t catch subtle differences (e.g., wrong dtypes or data values) between token and service principal flows.

You could strengthen it by reusing the same row/dtype assertions as test_query, for example:

 async def test_query_with_service_principal(
     client, manifest_str, service_principal_connection_info
 ):
@@
-    assert len(result["columns"]) == len(manifest["models"][0]["columns"])
-    assert len(result["data"]) == 1
+    assert len(result["columns"]) == len(manifest["models"][0]["columns"])
+    assert len(result["data"]) == 1
+
+    assert result["data"][0] == [
+        1,
+        184501,
+        "O",
+        "203010.51",
+        "1996-01-02",
+        "1_184501",
+        "2024-01-01 23:59:59.000000 +00:00",
+        "2024-01-01 23:59:59.000000 +00:00",
+        None,
+    ]
+    # Optionally also assert dtypes if you want full parity with the token path.

Not mandatory, but it would give you stronger guarantees that both auth methods behave identically.

ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)

55-61: Fixtures align with the new discriminated Databricks connection types

  • connection_info now includes "databricks_type": "token" and uses the same JSON field names as DatabricksTokenConnectionInfo, which will let Pydantic pick the correct variant.
  • service_principal_connection_info similarly matches DatabricksServicePrincipalConnectionInfo (minus the optional azureTenantId), giving good coverage of the service principal path.

If you plan to exercise the azure_tenant_id branch as well, you could extend the fixture to read TEST_DATABRICKS_AZURE_TENANT_ID and include "azureTenantId" when present:

 def service_principal_connection_info() -> dict[str, str]:
-    return {
-        "databricks_type": "service_principal",
-        "serverHostname": os.getenv("TEST_DATABRICKS_SERVER_HOSTNAME"),
-        "httpPath": os.getenv("TEST_DATABRICKS_HTTP_PATH"),
-        "clientId": os.getenv("TEST_DATABRICKS_CLIENT_ID"),
-        "clientSecret": os.getenv("TEST_DATABRICKS_CLIENT_SECRET"),
-    }
+    info = {
+        "databricks_type": "service_principal",
+        "serverHostname": os.getenv("TEST_DATABRICKS_SERVER_HOSTNAME"),
+        "httpPath": os.getenv("TEST_DATABRICKS_HTTP_PATH"),
+        "clientId": os.getenv("TEST_DATABRICKS_CLIENT_ID"),
+        "clientSecret": os.getenv("TEST_DATABRICKS_CLIENT_SECRET"),
+    }
+    azure_tenant_id = os.getenv("TEST_DATABRICKS_AZURE_TENANT_ID")
+    if azure_tenant_id:
+        info["azureTenantId"] = azure_tenant_id
+    return info

This keeps the happy path as-is but allows you to test the optional tenant ID behavior when available.

Also applies to: 64-72

ibis-server/app/model/metadata/databricks.py (1)

3-4: Switch to DatabricksConnector/query is coherent; consider widening the type hint

The refactor of DatabricksMetadata to use DatabricksConnector and its query method looks consistent:

  • __init__ now builds a DatabricksConnector once and reuses it.
  • get_table_list, get_constraints, and get_version all use self.connection.query(...).to_pandas(), which matches the connector’s contract.

One small refinement: the __init__ signature currently restricts connection_info to DatabricksTokenConnectionInfo, but DatabricksConnector itself accepts the full DatabricksConnectionUnion. To keep metadata in sync with the connector and make it straightforward to support service principal here as well, you could widen the annotation and import:

-from app.model import DatabricksTokenConnectionInfo
+from app.model import DatabricksConnectionUnion
@@
-class DatabricksMetadata(Metadata):
-    def __init__(self, connection_info: DatabricksTokenConnectionInfo):
+class DatabricksMetadata(Metadata):
+    def __init__(self, connection_info: DatabricksConnectionUnion):
         super().__init__(connection_info)
         self.connection = DatabricksConnector(connection_info)

This doesn’t change runtime behavior but makes the type contract accurately reflect what the class can already handle via the connector.

Also applies to: 36-39, 61-61, 125-125, 151-156

ibis-server/app/model/data_source.py (1)

424-429: Confirm that get_databricks_connection is never called with a service-principal info

The narrowed type to DatabricksTokenConnectionInfo matches the implementation (expects access_token), but _build_connection_info can now also produce DatabricksServicePrincipalConnectionInfo. Please ensure that any code path that calls DataSource.databricks.get_connection(info) only ever passes a token-based info, with service-principal flows going exclusively through DatabricksConnector.

If service-principal ever needs ibis-based metadata as well, consider either:

  • Accepting the union and branching on databricks_type here, or
  • Clearly guarding the call site(s) to enforce token-only usage.
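
One possible shape for such a guard is sketched below. The class and function names here are hypothetical stand-ins redefined as plain dataclasses for the example; the PR's actual models are pydantic classes with aliased fields.

```python
# Hypothetical guard for call sites that must stay token-only; the
# connection-info classes are simplified stand-ins, not the real models.
from dataclasses import dataclass

@dataclass
class DatabricksTokenConnectionInfo:
    server_hostname: str
    http_path: str
    access_token: str

@dataclass
class DatabricksServicePrincipalConnectionInfo:
    server_hostname: str
    http_path: str
    client_id: str
    client_secret: str

def require_token_info(info) -> DatabricksTokenConnectionInfo:
    """Reject non-token connection infos before an ibis-based code path."""
    if not isinstance(info, DatabricksTokenConnectionInfo):
        raise TypeError(
            "this code path only supports token auth; route "
            "service-principal connections through DatabricksConnector"
        )
    return info
```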
ibis-server/app/model/connector.py (2)

30-32: Databricks driver and SDK imports are fine assuming they’re hard dependencies

Importing databricks-sql-connector and databricks-sdk at module import time is acceptable if these are now mandatory runtime dependencies (as implied by pyproject). If you intend Databricks to remain optional, consider moving these imports into DatabricksConnector.__init__ instead so non-Databricks deployments don’t fail at import time.


618-666: DatabricksConnector implementation looks correct; consider small polish

The connector correctly:

  • Distinguishes token vs service-principal configs via isinstance on the shared union type.
  • Uses dbsql.connect(...) with server_hostname, http_path, and either access_token or a credentials_provider built from oauth_service_principal(DbConfig(**kwargs)).
  • Executes queries via a cursor context manager and returns Arrow tables using fetchmany_arrow(limit) / fetchall_arrow().
  • Implements dry_run using a SELECT * FROM ({sql}) AS sub LIMIT 0 wrapper consistent with other connectors.
  • Closes the underlying connection with error logging.

Two minor follow-ups you might consider:

  • Add explicit type hints for query and dry_run for consistency with other connectors:
-    def query(self, sql, limit=None):
+    def query(self, sql: str, limit: int | None = None) -> pa.Table:
...
-    def dry_run(self, sql):
+    def dry_run(self, sql: str) -> None:
  • Optionally guard against negative limit values (e.g., treat < 0 as no limit or raise) before calling fetchmany_arrow(limit).

Please double-check against the versions in use that:

  • oauth_service_principal(DbConfig(**kwargs)) returns an object suitable for the credentials_provider parameter of dbsql.connect, and
  • fetchall_arrow / fetchmany_arrow have the expected signatures and return pyarrow.Table.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c53150e and f753e4e.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • ibis-server/app/model/__init__.py (4 hunks)
  • ibis-server/app/model/connector.py (4 hunks)
  • ibis-server/app/model/data_source.py (3 hunks)
  • ibis-server/app/model/metadata/databricks.py (5 hunks)
  • ibis-server/pyproject.toml (1 hunks)
  • ibis-server/tests/routers/v3/connector/databricks/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/databricks/test_query.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-18T02:23:34.040Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object without requiring conn.register().

Applied to files:

  • ibis-server/app/model/connector.py
📚 Learning: 2025-06-18T02:23:34.040Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1224
File: ibis-server/app/util.py:49-56
Timestamp: 2025-06-18T02:23:34.040Z
Learning: DuckDB supports querying PyArrow Tables directly in SQL queries without needing to register them. When a pa.Table object is referenced in a FROM clause (e.g., "SELECT ... FROM df" where df is a pa.Table), DuckDB automatically handles the PyArrow object via its "replacement scan" mechanism that recognizes Python variables referencing Arrow objects as SQL tables. No conn.register() call is required.

Applied to files:

  • ibis-server/app/model/connector.py
📚 Learning: 2025-08-29T05:49:45.513Z
Learnt from: goldmedal
Repo: Canner/wren-engine PR: 1301
File: ibis-server/pyproject.toml:83-83
Timestamp: 2025-08-29T05:49:45.513Z
Learning: Polars dependency in ibis-server/pyproject.toml is correctly placed in dev dependencies because it's only used by helper tools in the tools/data_source/ directory, not in the main production application runtime.

Applied to files:

  • ibis-server/pyproject.toml
🧬 Code graph analysis (5)
ibis-server/app/model/connector.py (5)
ibis-server/wren/session/__init__.py (3)
  • sql (37-52)
  • execute (145-164)
  • dry_run (136-143)
ibis-server/app/model/__init__.py (2)
  • DatabricksServicePrincipalConnectionInfo (387-414)
  • DatabricksTokenConnectionInfo (367-383)
ibis-server/app/model/data_source.py (1)
  • DataSource (61-220)
ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  • connection_info (55-61)
ibis-server/tools/update_databricks_functions.py (1)
  • connect (43-66)
ibis-server/app/model/__init__.py (1)
ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  • connection_info (55-61)
ibis-server/app/model/metadata/databricks.py (3)
ibis-server/app/model/__init__.py (1)
  • DatabricksTokenConnectionInfo (367-383)
ibis-server/app/model/connector.py (9)
  • DatabricksConnector (618-666)
  • query (102-135)
  • query (189-194)
  • query (310-314)
  • query (377-384)
  • query (440-478)
  • query (498-506)
  • query (595-603)
  • query (646-655)
ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  • connection_info (55-61)
ibis-server/tests/routers/v3/connector/databricks/test_query.py (3)
ibis-server/tests/conftest.py (1)
  • client (18-23)
ibis-server/tests/routers/v3/connector/databricks/test_function.py (1)
  • manifest_str (32-33)
ibis-server/tests/routers/v3/connector/databricks/conftest.py (1)
  • service_principal_connection_info (65-72)
ibis-server/app/model/data_source.py (1)
ibis-server/app/model/__init__.py (2)
  • DatabricksServicePrincipalConnectionInfo (387-414)
  • DatabricksTokenConnectionInfo (367-383)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: ci
  • GitHub Check: ci
🔇 Additional comments (8)
ibis-server/pyproject.toml (1)

53-53: databricks-sdk dependency addition looks appropriate

Adding databricks-sdk to main dependencies aligns with the new DatabricksConnector service principal flow that relies on the SDK; no further changes needed here from what’s visible in this PR.

ibis-server/app/model/__init__.py (2)

70-71: Strong typing for Databricks query DTO looks good

Narrowing QueryDatabricksDTO.connection_info to DatabricksConnectionUnion (instead of the broader dict | ConnectionInfo) aligns the API with the new discriminated union and enforces better validation for both token and service principal flows.


367-420: Databricks token + service principal models are consistent with the JSON contract

The DatabricksTokenConnectionInfo and DatabricksServicePrincipalConnectionInfo models look well‑structured:

  • Field aliases (serverHostname, httpPath, accessToken, clientId, clientSecret, azureTenantId) match the test fixtures.
  • databricks_type literals provide a clear discriminator for the union.
  • Optional azure_tenant_id with a default keeps the schema flexible.

No issues from a modeling perspective.

ibis-server/tests/routers/v3/connector/databricks/test_query.py (1)

52-52: Manifest and dtype expectations updated consistently for Databricks

Adding "dataSource": "databricks" to the manifest and updating the expected dtypes (totalprice as decimal128(18, 2) and timestamps with tz=Etc/UTC) look consistent with the behavior you’d expect from the new Databricks connector/driver stack. The assertions remain precise and should catch regressions in type mapping.

Also applies to: 90-96

ibis-server/app/model/data_source.py (2)

23-24: Databricks connection info imports are correctly wired

Importing the two concrete Databricks connection info types here cleanly exposes them to the DataSource logic and keeps all other behavior unchanged.


184-189: Databricks discriminant logic looks correct and is backward compatible

The databricks_type check mirrors the existing Redshift IAM pattern and preserves the token path as the default when the field is absent or not "service_principal". This gives a safe, backward-compatible split between token and service principal configs.
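
A minimal re-creation of that backward-compatible branching, for illustration only; the real code builds pydantic models inside _build_connection_info rather than returning a string.

```python
# Illustrative dispatch on the databricks_type discriminator; an absent
# or unrecognized value falls back to token auth, so pre-existing
# token-only payloads keep working unchanged.
def pick_variant(connection_info: dict) -> str:
    if connection_info.get("databricks_type") == "service_principal":
        return "service_principal"
    return "token"
```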

ibis-server/app/model/connector.py (2)

46-48: Databricks connection models correctly exposed to the connector

Wiring DatabricksConnectionUnion plus the two concrete types into this module keeps connector type usage consistent with the shared model definitions.


97-98: Routing Databricks to the dedicated connector is clean and localized

The new elif data_source == DataSource.databricks branch ensures all Databricks work goes through DatabricksConnector without affecting other sources, preserving the existing Connector dispatch pattern.

@goldmedal goldmedal requested a review from douenergy November 14, 2025 08:59
@douenergy douenergy merged commit d759d81 into Canner:main Nov 19, 2025
7 checks passed
nhaluc1005 pushed a commit to nhaluc1005/text2sql-practice that referenced this pull request Apr 3, 2026

Labels

  • dependencies: Pull requests that update a dependency file
  • ibis
  • python: Pull requests that update Python code

Projects

None yet


2 participants