
Added user-friendly API to execute SQL statements: `for row in w.statement_execution.iterate_rows(warehouse_id, 'SELECT * FROM samples.nyctaxi.trips LIMIT 10'): print(row.as_dict())` #295

Open · nfx wants to merge 10 commits into main from sql/execute-mixin
Conversation

@nfx (Contributor) commented Aug 17, 2023

Execute SQL statements in a stateless manner.

The primary use case of the :py:meth:`iterate_rows` and :py:meth:`execute` methods is executing SQL queries
in a stateless manner directly from the Databricks SDK for Python, without requiring any external dependencies.
Results are fetched in JSON format through presigned external links. This is a good fit for serverless applications
like AWS Lambda, Azure Functions, or any other short-lived containerised application, where container startup
time is faster with a smaller dependency set.

```python
for (pickup_zip, dropoff_zip) in w.statement_execution.iterate_rows(warehouse_id,
        'SELECT pickup_zip, dropoff_zip FROM nyctaxi.trips LIMIT 10', catalog='samples'):
    print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```

The :py:meth:`iterate_rows` method returns an iterator of objects that resemble the :class:`pyspark.sql.Row` API,
but full compatibility is not the goal of this implementation.

```python
iterate_rows = functools.partial(w.statement_execution.iterate_rows, warehouse_id, catalog='samples')
for row in iterate_rows('SELECT * FROM nyctaxi.trips LIMIT 10'):
    pickup_time, dropoff_time = row[0], row[1]
    pickup_zip = row.pickup_zip
    dropoff_zip = row['dropoff_zip']
    all_fields = row.as_dict()
    print(f'{pickup_zip}@{pickup_time} -> {dropoff_zip}@{dropoff_time}: {all_fields}')
```

When you only need to execute the query and do not need to iterate over the results, use :py:meth:`execute`.

```python
w.statement_execution.execute(warehouse_id, 'CREATE TABLE foo AS SELECT * FROM range(10)')
```

Applications that need a more traditional Python SQL API with cursors, efficient transfer of hundreds of
megabytes or gigabytes of data serialized in the Apache Arrow format, and low result-fetching latency should use
the stateful Databricks SQL Connector for Python.
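
For comparison, a minimal sketch of that stateful path using the separate databricks-sql-connector package (the hostname, HTTP path, and token below are placeholders, not taken from this PR):

```python
from databricks import sql

# Connection details are placeholders; fill them in from your workspace.
with sql.connect(server_hostname='<workspace-hostname>',
                 http_path='/sql/1.0/warehouses/<warehouse-id>',
                 access_token='<personal-access-token>') as connection:
    with connection.cursor() as cursor:
        cursor.execute('SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips LIMIT 10')
        for pickup_zip, dropoff_zip in cursor.fetchall():
            print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```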

New integration tests

[screenshot: integration test results]

@nfx requested a review from mgyucht on August 17, 2023 15:01
@codecov-commenter commented Aug 17, 2023

Codecov Report

Patch coverage is 89.92% of modified lines.

| Files Changed                | Coverage |
|------------------------------|----------|
| databricks/sdk/mixins/sql.py | 89.51%   |
| databricks/sdk/core.py       | 92.30%   |
| databricks/sdk/__init__.py   | 100.00%  |


@mgyucht (Contributor) left a comment

Couple questions but overall this looks great! Thanks for contributing this.

@sander-goos left a comment

Nice! Are you planning to support this in all SDKs?

```python
for (pickup_zip, dropoff_zip) in w.statement_execution.execute_fetch_all(
        warehouse_id, 'SELECT pickup_zip, dropoff_zip FROM nyctaxi.trips LIMIT 10',
        catalog='samples'):
    print(f'pickup_zip={pickup_zip}, dropoff_zip={dropoff_zip}')
```
@nfx force-pushed the sql/execute-mixin branch from df6d924 to 2bf281d on September 1, 2023 07:07
@nfx requested review from mgyucht and sander-goos on September 1, 2023 11:37
@nfx removed the do-not-merge label on Sep 1, 2023
@nfx enabled auto-merge on September 1, 2023 11:37
@nfx changed the title from "Added user-friendly API to execute SQL statements" to "Added user-friendly API to execute SQL statements: for row in w.statement_execution.iterate_rows(warehouse_id, 'SELECT * FROM samples.nyctaxi.trips LIMIT 10'): print(row.as_dict())" on Sep 1, 2023
@sander-goos left a comment

Great stuff, left a few comments

```python
@staticmethod
def _parse_timestamp(value: str) -> datetime.datetime:
    # make it work with Python 3.7 to 3.10 as well
    return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))
```


I think this would break when timezone is not UTC

@nfx (Contributor, Author)

@sander-goos we always return in UTC, afaik


No, it will use the current timezone from the spark session. By default this is UTC but it can be changed by a SQL admin.
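
To illustrate the concern (this snippet is illustrative, not part of the PR), here is how that parser behaves for a few input shapes; the offset-less case is the one that silently loses the session timezone:

```python
import datetime

def _parse_timestamp(value: str) -> datetime.datetime:
    # same approach as the PR: make 'Z'-suffixed values parse on Python 3.7-3.10
    return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))

print(_parse_timestamp('2023-08-17T15:01:00Z'))       # timezone-aware, UTC
print(_parse_timestamp('2023-08-17T15:01:00+02:00'))  # timezone-aware, keeps the +02:00 offset
print(_parse_timestamp('2023-08-17T15:01:00'))        # naive: the session timezone is lost
```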

```python
warehouse_id: str,
statement: str,
*,
byte_limit: Optional[int] = None,
```


Let's add row_limit as well

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be great if we could add parameters: https://docs.databricks.com/api/workspace/statementexecution/executestatement.
But we can do that in a follow-up as well.

@nfx (Contributor, Author)

@sander-goos not in this PR, feel free to add a follow-up PR with these additions.
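
For reference, a hedged sketch of what that follow-up could look like, assuming the parameters support described in the linked Statement Execution API docs and the `StatementParameterListItem` type from `databricks.sdk.service.sql` (names here are not from this PR):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

w = WorkspaceClient()
# Named parameter markers (:zip) are bound via the parameters list; the value is
# passed as a string and the type tells the warehouse how to cast it.
response = w.statement_execution.execute_statement(
    warehouse_id='<warehouse-id>',
    statement='SELECT pickup_zip, dropoff_zip FROM samples.nyctaxi.trips WHERE pickup_zip = :zip LIMIT 10',
    parameters=[StatementParameterListItem(name='zip', value='10282', type='INT')])
```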

```python
result_data = self.get_statement_result_chunk_n(execute_response.statement_id,
                                                external_link.next_chunk_index)

def _iterate_inline_disposition(self, execute_response: ExecuteStatementResponse) -> Iterator[Row]:
```


Should we remove this if it's not used?

```python
    cancel_execution.assert_called_with('bcd')


def test_fetch_all_no_chunks(config, mocker):
```


nit: test_fetch_all_single_chunk

```python
        raise DatabricksError(message, error_code=error_code)
    raise DatabricksError(status.state.value)

def execute(self,
```


Perhaps we can name this: execute_sync or execute_and_wait to make clear this is waiting for the query to finish up to some specified timeout?

@nfx (Contributor, Author)

_and_wait has slightly different semantics in the SDK, as those methods return a future-like object

```python
        msg = f"timed out after {timeout}: {status_message}"
        raise TimeoutError(msg)

    def iterate_rows(self,
```


I think iterate_rows doesn't fully cover what this method is doing. As this handles both execution and result fetching, what about execute_and_fetch?

```python
    assert isinstance(rows[0].since, datetime.date)
    assert isinstance(rows[0].now, datetime.datetime)

    http_get.assert_called_with('https://singed-url')
```


Can we assert that we never submit the authorization token when fetching any external link?

@nfx (Contributor, Author)

will do
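
For reference, a rough sketch of what that assertion could look like, in the same mocker/mock style as the PR's tests (the setup below is illustrative, not lifted from the actual test file):

```python
from unittest import mock

# Stand-in for the patched requests.Session.get used by the existing tests; in the
# real test this GET happens inside iterate_rows() when it follows an external link.
http_get = mock.Mock(name='Session.get')
http_get('https://signed-url', headers={'Accept': 'application/json'})

# The extra assertion: no call to a presigned URL may carry workspace credentials.
for call in http_get.call_args_list:
    assert 'Authorization' not in call.kwargs.get('headers', {})
    assert call.kwargs.get('auth') is None
```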

```python
        # ensure that we close the HTTP session after fetching the external links
        result_data = execute_response.result
        row_factory, col_conv = self._result_schema(execute_response)
        with self._api._new_session() as http:
```


Does this make sure we don't use the authorization header that's used for the internal requests? That is important.

@nfx (Contributor, Author)

@sander-goos `self._api._new_session()` starts an HTTP session without our auth headers, because pre-signed URLs don't expect any Authorization headers.

All other requests follow the normal flow.
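
As a small illustration of that point (the presigned URL below is just a placeholder, and no network call is made), a fresh requests.Session prepares requests with no Authorization header attached:

```python
import requests

# The request is only prepared, not sent, to show which headers it would carry.
presigned_url = 'https://example-bucket.s3.amazonaws.com/chunk-0?X-Amz-Signature=abc123'
with requests.Session() as http:
    prepared = http.prepare_request(requests.Request('GET', presigned_url))
    assert 'Authorization' not in prepared.headers
```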

@nfx added the ergonomics (UX of SDK) label on Sep 25, 2023
@alexott (Contributor) commented Oct 4, 2023

That would be a really nice way to use this in the Airflow provider, because right now the dependencies are too heavyweight.

nfx added a commit to databrickslabs/lsql that referenced this pull request Oct 5, 2023
Initial port from Databricks Python SDK PR: databricks/databricks-sdk-py#295
auto-merge was automatically disabled November 23, 2023 15:14

Merge queue setting changed

Labels: ergonomics (UX of SDK)
5 participants