
Added user-friendly API to execute SQL statements: `for row in w.statement_execution.iterate_rows(warehouse_id, 'SELECT * FROM samples.nyctaxi.trips LIMIT 10'): print(row.as_dict())` #295

Open · nfx wants to merge 10 commits into main from sql/execute-mixin
Conversation

@nfx (Contributor) commented Aug 17, 2023

Execute SQL statements in a stateless manner.

The primary use case of the :py:meth:`iterate_rows` and :py:meth:`execute` methods is executing SQL queries
in a stateless manner directly from the Databricks SDK for Python, without requiring any external dependencies.
Results are fetched in JSON format through presigned external links. This is a good fit for serverless applications
like AWS Lambda, Azure Functions, or any other short-lived containerised application, where container startup
time is faster with a smaller dependency set.

```python
for (pickup_zip, dropoff_zip) in w.statement_execution.iterate_rows(warehouse_id,
        'SELECT pickup_zip, dropoff_zip FROM nyctaxi.trips LIMIT 10', catalog='samples'):
    print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```

The :py:meth:`iterate_rows` method returns an iterator of objects that resemble the :class:`pyspark.sql.Row` API,
but full compatibility is not the goal of this implementation.

```python
iterate_rows = functools.partial(w.statement_execution.iterate_rows, warehouse_id, catalog='samples')
for row in iterate_rows('SELECT * FROM nyctaxi.trips LIMIT 10'):
    pickup_time, dropoff_time = row[0], row[1]
    pickup_zip = row.pickup_zip
    dropoff_zip = row['dropoff_zip']
    all_fields = row.as_dict()
    print(f'{pickup_zip}@{pickup_time} -> {dropoff_zip}@{dropoff_time}: {all_fields}')
```

When you only need to execute the query and do not need to iterate over the results, use :py:meth:`execute`.

```python
w.statement_execution.execute(warehouse_id, 'CREATE TABLE foo AS SELECT * FROM range(10)')
```

Applications that need a more traditional Python SQL API with cursors, efficient transfer of hundreds of
megabytes or gigabytes of data serialized in the Apache Arrow format, and low result-fetching latency should use
the stateful Databricks SQL Connector for Python.
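
For comparison, a minimal sketch of that stateful path using the separate databricks-sql-connector package (the hostname, HTTP path, and token below are placeholders, not taken from this PR):

```python
from databricks import sql

# Connection details are placeholders; fill them in from your workspace.
with sql.connect(server_hostname='<workspace-hostname>',
                 http_path='/sql/1.0/warehouses/<warehouse-id>',
                 access_token='<personal-access-token>') as connection:
    with connection.cursor() as cursor:
        cursor.execute('SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips LIMIT 10')
        for pickup_zip, dropoff_zip in cursor.fetchall():
            print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```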

New integration tests

[screenshot: integration test results]

@nfx requested a review from mgyucht on August 17, 2023 15:01
@codecov-commenter commented Aug 17, 2023

Codecov Report

Patch coverage is 89.92% of modified lines.

| Files Changed                | Coverage |
|------------------------------|----------|
| databricks/sdk/mixins/sql.py | 89.51%   |
| databricks/sdk/core.py       | 92.30%   |
| databricks/sdk/__init__.py   | 100.00%  |


@mgyucht (Contributor) left a comment

Couple questions but overall this looks great! Thanks for contributing this.

@sander-goos left a comment

Nice! Are you planning to support this in all SDKs?

```python
for (pickup_zip, dropoff_zip) in w.statement_execution.execute_fetch_all(
        warehouse_id, 'SELECT pickup_zip, dropoff_zip FROM nyctaxi.trips LIMIT 10',
        catalog='samples'):
    print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```
@nfx force-pushed the sql/execute-mixin branch from df6d924 to 2bf281d on September 1, 2023 07:07
@nfx requested review from mgyucht and sander-goos on September 1, 2023 11:37
@nfx removed the do-not-merge label on Sep 1, 2023
@nfx enabled auto-merge on September 1, 2023 11:37
@nfx changed the title from "Added user-friendly API to execute SQL statements" to "Added user-friendly API to execute SQL statements: for row in w.statement_execution.iterate_rows(warehouse_id, 'SELECT * FROM samples.nyctaxi.trips LIMIT 10'): print(row.as_dict())" on Sep 1, 2023
@sander-goos left a comment

Great stuff, left a few comments

```python
@staticmethod
def _parse_timestamp(value: str) -> datetime.datetime:
    # make it work with Python 3.7 to 3.10 as well
    return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))
```


I think this would break when timezone is not UTC

@nfx (Contributor, Author)

@sander-goos we always return in UTC, afaik


No, it will use the current timezone from the spark session. By default this is UTC but it can be changed by a SQL admin.
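
To illustrate the concern (this snippet is illustrative, not part of the PR), here is how that parser behaves for a few input shapes; the offset-less case is the one that silently loses the session timezone:

```python
import datetime

def _parse_timestamp(value: str) -> datetime.datetime:
    # same approach as the PR: make 'Z'-suffixed values parse on Python 3.7-3.10
    return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))

print(_parse_timestamp('2023-08-17T15:01:00Z'))       # timezone-aware, UTC
print(_parse_timestamp('2023-08-17T15:01:00+02:00'))  # timezone-aware, keeps the +02:00 offset
print(_parse_timestamp('2023-08-17T15:01:00'))        # naive: the session timezone is lost
```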

```python
warehouse_id: str,
statement: str,
*,
byte_limit: Optional[int] = None,
```


Let's add row_limit as well

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be great if we could add parameters: https://docs.databricks.com/api/workspace/statementexecution/executestatement.
But we can do that in a follow-up as well.

@nfx (Contributor, Author)

@sander-goos not in this PR, feel free to add a follow-up PR with these additions.
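
For reference, a hedged sketch of what that follow-up could look like, assuming the parameters support described in the linked Statement Execution API docs and the `StatementParameterListItem` type from `databricks.sdk.service.sql` (names here are not from this PR):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

w = WorkspaceClient()
# Named parameter markers (:zip) are bound via the parameters list; the value is
# passed as a string and the type tells the warehouse how to cast it.
response = w.statement_execution.execute_statement(
    warehouse_id='<warehouse-id>',
    statement='SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips WHERE pickup_zip = :zip LIMIT 10',
    parameters=[StatementParameterListItem(name='zip', value='10282', type='INT')])
```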

```python
result_data = self.get_statement_result_chunk_n(execute_response.statement_id,
                                                external_link.next_chunk_index)

def _iterate_inline_disposition(self, execute_response: ExecuteStatementResponse) -> Iterator[Row]:
```


Should we remove this if it's not used?

```python
    cancel_execution.assert_called_with('bcd')


def test_fetch_all_no_chunks(config, mocker):
```


nit: test_fetch_all_single_chunk

```python
        raise DatabricksError(message, error_code=error_code)
    raise DatabricksError(status.state.value)

def execute(self,
```


Perhaps we can name this: execute_sync or execute_and_wait to make clear this is waiting for the query to finish up to some specified timeout?

@nfx (Contributor, Author)

_and_wait has slightly different semantics in the SDK, as those methods return a future-like object

```python
        msg = f"timed out after {timeout}: {status_message}"
        raise TimeoutError(msg)

    def iterate_rows(self,
```


I think iterate_rows doesn't fully cover what this method is doing. As this handles both execution and result fetching, what about execute_and_fetch?

```python
    assert isinstance(rows[0].since, datetime.date)
    assert isinstance(rows[0].now, datetime.datetime)

    http_get.assert_called_with('https://singed-url')
```


Can we assert that we never submit the authorization token when fetching any external link?

@nfx (Contributor, Author)

will do
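
For reference, a rough sketch of what that assertion could look like, in the same mocker/mock style as the PR's tests (the setup below is illustrative, not lifted from the actual test file):

```python
from unittest import mock

# Stand-in for the patched requests.Session.get used by the existing tests; in the
# real test this GET happens inside iterate_rows() when it follows an external link.
http_get = mock.Mock(name='Session.get')
http_get('https://signed-url', headers={'Accept': 'application/json'})

# The extra assertion: no call to a presigned URL may carry workspace credentials.
for call in http_get.call_args_list:
    assert 'Authorization' not in call.kwargs.get('headers', {})
    assert call.kwargs.get('auth') is None
```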

```python
        # ensure that we close the HTTP session after fetching the external links
        result_data = execute_response.result
        row_factory, col_conv = self._result_schema(execute_response)
        with self._api._new_session() as http:
```


Does this make sure we don't use the authorization header that's used for the internal requests? That is important.

@nfx (Contributor, Author)

@sander-goos `self._api._new_session()` starts an HTTP session without our auth headers, because pre-signed URLs don't expect any Authorization headers.

All other requests follow the normal flow.
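
As a small illustration of that point (the presigned URL below is just a placeholder, and no network call is made), a fresh requests.Session prepares requests with no Authorization header attached:

```python
import requests

# The request is only prepared, not sent, to show which headers it would carry.
presigned_url = 'https://example-bucket.s3.amazonaws.com/chunk-0?X-Amz-Signature=abc123'
with requests.Session() as http:
    prepared = http.prepare_request(requests.Request('GET', presigned_url))
    assert 'Authorization' not in prepared.headers
```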

@nfx added the ergonomics (UX of SDK) label on Sep 25, 2023
@alexott (Contributor) commented Oct 4, 2023

That would be a really nice way to use this in the Airflow provider, because right now the dependencies are too heavyweight.

nfx added a commit to databrickslabs/lsql that referenced this pull request Oct 5, 2023
Initial port from Databricks Python SDK PR: databricks/databricks-sdk-py#295
auto-merge was automatically disabled November 23, 2023 15:14

Merge queue setting changed

Labels: ergonomics (UX of SDK)
5 participants