[PECO-1803] Databricks sqlalchemy is split into this folder #1
Changes from 9 commits
53e63ab
91d6cd2
4bc939b
dc0a25b
433c172
68fa020
296c5a3
7abe28e
e12abcb
8d31267
786c0be
a7bbc5a
073781e
52c0a29
ef0ee61
a317b39
5ac49fe
144907c
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| name: Integration | ||
|
|
||
| on: | ||
| pull_request: | ||
| types: [ opened, synchronize, reopened ] | ||
| branches: [ main, PECO-1803 ] | ||
| workflow_dispatch: | ||
|
|
||
| jobs: | ||
| build_and_test: | ||
| runs-on: ubuntu-latest | ||
| environment: azure-prod | ||
| env: | ||
| DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DATABRICKS_SERVER_HOSTNAME }} | ||
| DATABRICKS_HTTP_PATH: ${{ secrets.DATABRICKS_HTTP_PATH }} | ||
| DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }} | ||
| DATABRICKS_CATALOG: ${{ secrets.DATABRICKS_CATALOG }} | ||
| DATABRICKS_SCHEMA: ${{ secrets.DATABRICKS_SCHEMA }} | ||
| DATABRICKS_USER: ${{ secrets.DATABRICKS_USER }} | ||
|
|
||
| steps: | ||
| # Checkout your own repository | ||
| - name: Checkout Repository | ||
| uses: actions/checkout@v3 | ||
|
|
||
| # Checkout the other repository | ||
| - name: Checkout Dependency Repository | ||
| uses: actions/checkout@v3 | ||
| with: | ||
| repository: jprakash-db/databricks-sql-python | ||
| path: databricks_sql_python | ||
| ref: jprakash-db/PECO-1803 | ||
|
||
|
|
||
| # Set up Python | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: '3.9' | ||
|
|
||
| # Install Poetry | ||
| - name: Install Poetry | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip3 install poetry | ||
| python3 -m venv venv | ||
| ls databricks_sql_python/databricks_sql_connector_core | ||
|
||
|
|
||
| # Install the requirements of your repository | ||
| - name: Install Dependencies | ||
| run: | | ||
| source venv/bin/activate | ||
| poetry build | ||
| pip3 install dist/*.whl | ||
|
|
||
| # Install the .whl file found in the dependency repository's dist/ folder (this step does not build it) | ||
| - name: Build Dependency Package | ||
| run: | | ||
| source venv/bin/activate | ||
| pip3 install databricks_sql_python/databricks_sql_connector_core/dist/*.whl | ||
|
||
|
|
||
| # List the installed packages and install pytest (the tests run in the next step) | ||
| - name: Run Tests | ||
| run: | | ||
| source venv/bin/activate | ||
| pip3 list | ||
| pip3 install pytest | ||
|
|
||
| - name: Main Tests | ||
| run: | | ||
| source venv/bin/activate | ||
| pytest src/databricks_sqlalchemy/test_local | ||
|
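The `env` block in this workflow passes six `DATABRICKS_*` secrets to the job. A secret that is not configured typically arrives as an empty string, so the failure only surfaces later inside a test. As an illustration only (it is not part of the workflow in this diff, and the helper name `missing_databricks_env` is hypothetical), a small pre-flight check along these lines could report the gap up front:

```python
import os

# Variable names copied from the workflow's `env` block.
REQUIRED_VARS = (
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_TOKEN",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_USER",
)


def missing_databricks_env(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A test session could call `missing_databricks_env()` once at start-up and skip or fail with the returned names in the message.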
Can you add a fresh README? Feel free to do it in a separate PR if you like.
I see that there's already a README. I think this is the correct location, no? Let's move it here.
Author reply: @madhav-db Fixed it
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +0,0 @@ | ||
| # SQLAlchemy Dialect for Databricks | ||
|
|
||
| See PECO-1396 for more information about this repository. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| import pytest | ||
|
|
||
| class DatabricksImportError(Exception): | ||
| pass | ||
|
|
||
| class TestLibraryDependencySuite: | ||
|
|
||
| @pytest.mark.skipif(pytest.importorskip("databricks_sql_connector_core"), reason="databricks_sql_connector_core is present") | ||
| def test_sql_core(self): | ||
| with pytest.raises(DatabricksImportError, match="databricks_sql_connector_core module is not available"): | ||
| try: | ||
| import databricks_sql_connector_core | ||
| except ImportError: | ||
| raise DatabricksImportError("databricks_sql_connector_core module is not available") | ||
|
|
||
| @pytest.mark.skipif(pytest.importorskip("sqlalchemy"), reason="SQLAlchemy is present") | ||
| def test_sqlalchemy(self): | ||
| with pytest.raises(DatabricksImportError, match="sqlalchemy module is not available"): | ||
| try: | ||
| import sqlalchemy | ||
| except ImportError: | ||
| raise DatabricksImportError("sqlalchemy module is not available") |
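One thing to check in the tests above: `pytest.importorskip` returns the imported module when it is available, so the `skipif` condition is truthy and the test is skipped; when the module is missing, `importorskip` raises a skip while the file is being collected. Either way the test body probably never runs. A hedged alternative is to simulate the missing module instead. The sketch below uses only the standard library; `require_module` and `raises_when_missing` are hypothetical helpers, not code from this PR.

```python
import importlib
import sys
from unittest import mock


class DatabricksImportError(Exception):
    """Raised when a required dependency cannot be imported."""


def require_module(name):
    """Import `name`, converting ImportError into DatabricksImportError."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise DatabricksImportError(f"{name} module is not available") from exc


def raises_when_missing(name):
    """Return True if require_module fails while `name` is masked as missing."""
    # A None entry in sys.modules makes the import system raise ImportError.
    with mock.patch.dict(sys.modules, {name: None}):
        try:
            require_module(name)
        except DatabricksImportError:
            return True
    return False
```

Inside a pytest test, the same masking can be done with `monkeypatch.setitem(sys.modules, name, None)`, which is undone automatically when the test finishes.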
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| [tool.poetry] | ||
| name = "databricks-sqlalchemy" | ||
| version = "1.0.0" | ||
| description = "Databricks SQLAlchemy plugin for Python" | ||
| authors = ["Databricks <[email protected]>"] | ||
| license = "Apache-2.0" | ||
| readme = "README.md" | ||
| packages = [{ include = "databricks_sqlalchemy", from = "src" }] | ||
| include = ["CHANGELOG.md"] | ||
|
|
||
| [tool.poetry.dependencies] | ||
| python = "^3.8.0" | ||
| sqlalchemy = { version = ">=2.0.21" } | ||
|
|
||
| [tool.poetry.dev-dependencies] | ||
| pytest = "^7.1.2" | ||
| mypy = "^1.10.1" | ||
| pylint = ">=2.12.0" | ||
| black = "^22.3.0" | ||
| pytest-dotenv = "^0.5.2" | ||
|
|
||
| [tool.poetry.urls] | ||
| "Homepage" = "https://github.com/databricks/databricks-sql-python" | ||
|
||
| "Bug Tracker" = "https://github.com/databricks/databricks-sql-python/issues" | ||
|
|
||
| [tool.poetry.plugins."sqlalchemy.dialects"] | ||
| "databricks" = "databricks_sqlalchemy:DatabricksDialect" | ||
|
|
||
| [build-system] | ||
| requires = ["poetry-core>=1.0.0"] | ||
| build-backend = "poetry.core.masonry.api" | ||
|
|
||
| [tool.mypy] | ||
| ignore_missing_imports = "true" | ||
| exclude = ['ttypes\.py$', 'TCLIService\.py$'] | ||
|
|
||
| [tool.black] | ||
| exclude = '/(\.eggs|\.git|\.hg|\.mypy_cache|\.nox|\.tox|\.venv|\.svn|_build|buck-out|build|dist|thrift_api)/' | ||
| # | ||
| [tool.pytest.ini_options] | ||
| markers = {"reviewed" = "Test case has been reviewed by Databricks"} | ||
| minversion = "6.0" | ||
| log_cli = "false" | ||
| log_cli_level = "INFO" | ||
| testpaths = ["tests", "databricks_sqlalchemy/src/databricks_sqlalchemy/test_local"] | ||
| env_files = ["test.env"] | ||
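The `[tool.poetry.plugins."sqlalchemy.dialects"]` table above is what lets SQLAlchemy resolve `databricks://` URLs: it registers an entry point in the `sqlalchemy.dialects` group whose value has the form `module:attribute`. As a small illustration (assuming Python 3.9 or newer for the `module` and `attr` properties), the standard library can show how that value is split; nothing here imports the dialect itself.

```python
from importlib.metadata import EntryPoint

# Mirrors the plugin line from pyproject.toml; constructed by hand for illustration.
ep = EntryPoint(
    name="databricks",
    value="databricks_sqlalchemy:DatabricksDialect",
    group="sqlalchemy.dialects",
)

# The part before the colon is the module to import,
# the part after it is the attribute looked up on that module.
module_name, attr_name = ep.module, ep.attr
```

Once the package is installed, calling `ep.load()` on the real entry point would import `databricks_sqlalchemy` and return its `DatabricksDialect` attribute.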
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,203 @@ | ||
| ## Databricks dialect for SQLAlchemy 2.0 | ||
|
|
||
| The Databricks dialect for SQLAlchemy serves as a bridge between [SQLAlchemy](https://www.sqlalchemy.org/) and the Databricks SQL Python driver. The dialect is included with `databricks-sql-connector==3.0.0` and above. A working example demonstrating usage can be found in `examples/sqlalchemy.py`. | ||
|
|
||
| ## Usage with SQLAlchemy < 2.0 | ||
| A SQLAlchemy 1.4 compatible dialect was first released in connector [version 2.4](https://github.com/databricks/databricks-sql-python/releases/tag/v2.4.0). Support for SQLAlchemy 1.4 was dropped from the dialect as part of `databricks-sql-connector==3.0.0`. To continue using the dialect with SQLAlchemy 1.x, install a 2.x release of the connector, for example `databricks-sql-connector>=2.4.0,<3.0.0`. | ||
|
|
||
|
|
||
| ## Installation | ||
|
|
||
| To install the dialect and its dependencies: | ||
|
|
||
| ```shell | ||
| pip install databricks-sql-connector[sqlalchemy] | ||
| ``` | ||
|
|
||
| If you also plan to use `alembic` you can alternatively run: | ||
|
|
||
| ```shell | ||
| pip install databricks-sql-connector[alembic] | ||
| ``` | ||
|
|
||
| ## Connection String | ||
|
|
||
| Every SQLAlchemy application that connects to a database needs to use an [Engine](https://docs.sqlalchemy.org/en/20/tutorial/engine.html#tutorial-engine), which you can create by passing a connection string to `create_engine`. The connection string must include these components: | ||
|
|
||
| 1. Host | ||
| 2. HTTP Path for a compute resource | ||
| 3. API access token | ||
| 4. Initial catalog for the connection | ||
| 5. Initial schema for the connection | ||
|
|
||
| **Note: Our dialect is built and tested on workspaces with Unity Catalog enabled. Support for the `hive_metastore` catalog is untested.** | ||
|
|
||
| For example: | ||
|
|
||
| ```python | ||
| import os | ||
| from sqlalchemy import create_engine | ||
|
|
||
| host = os.getenv("DATABRICKS_SERVER_HOSTNAME") | ||
| http_path = os.getenv("DATABRICKS_HTTP_PATH") | ||
| access_token = os.getenv("DATABRICKS_TOKEN") | ||
| catalog = os.getenv("DATABRICKS_CATALOG") | ||
| schema = os.getenv("DATABRICKS_SCHEMA") | ||
|
|
||
| engine = create_engine( | ||
| f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}" | ||
| ) | ||
| ``` | ||
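The f-string above interpolates the values as-is, which works as long as none of them contain characters that are special in a URL (for example `@`, `/`, `&` or `?`). A minimal sketch of a safer variant, assuming the dialect relies on SQLAlchemy's normal URL parsing (which decodes percent-encoded passwords and query values); the helper name `build_databricks_url` is hypothetical:

```python
from urllib.parse import quote, urlencode


def build_databricks_url(host, http_path, access_token, catalog, schema):
    """Assemble a databricks:// URL, percent-encoding the token and query values."""
    query = urlencode(
        {"http_path": http_path, "catalog": catalog, "schema": schema},
        safe="/",  # keep the slashes in http_path readable
    )
    return f"databricks://token:{quote(access_token, safe='')}@{host}?{query}"
```

The returned string can be passed to `create_engine` in place of the hand-written f-string.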
|
|
||
| ## Types | ||
|
|
||
| The [SQLAlchemy type hierarchy](https://docs.sqlalchemy.org/en/20/core/type_basics.html) contains backend-agnostic type implementations (represented in CamelCase) and backend-specific types (represented in UPPERCASE). The majority of SQLAlchemy's [CamelCase](https://docs.sqlalchemy.org/en/20/core/type_basics.html#the-camelcase-datatypes) types are supported. This means that a SQLAlchemy application using these types should "just work" with Databricks. | ||
|
|
||
| |SQLAlchemy Type|Databricks SQL Type| | ||
| |-|-| | ||
| [`BigInteger`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.BigInteger)| [`BIGINT`](https://docs.databricks.com/en/sql/language-manual/data-types/bigint-type.html) | ||
| [`LargeBinary`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.LargeBinary)| (not supported)| | ||
| [`Boolean`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Boolean)| [`BOOLEAN`](https://docs.databricks.com/en/sql/language-manual/data-types/boolean-type.html) | ||
| [`Date`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Date)| [`DATE`](https://docs.databricks.com/en/sql/language-manual/data-types/date-type.html) | ||
| [`DateTime`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.DateTime)| [`TIMESTAMP_NTZ`](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-ntz-type.html)| | ||
| [`Double`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Double)| [`DOUBLE`](https://docs.databricks.com/en/sql/language-manual/data-types/double-type.html) | ||
| [`Enum`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Enum)| (not supported)| | ||
| [`Float`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Float)| [`FLOAT`](https://docs.databricks.com/en/sql/language-manual/data-types/float-type.html) | ||
| [`Integer`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Integer)| [`INT`](https://docs.databricks.com/en/sql/language-manual/data-types/int-type.html) | ||
| [`Numeric`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Numeric)| [`DECIMAL`](https://docs.databricks.com/en/sql/language-manual/data-types/decimal-type.html)| | ||
| [`PickleType`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.PickleType)| (not supported)| | ||
| [`SmallInteger`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.SmallInteger)| [`SMALLINT`](https://docs.databricks.com/en/sql/language-manual/data-types/smallint-type.html) | ||
| [`String`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.String)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html)| | ||
| [`Text`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Text)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html)| | ||
| [`Time`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Time)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html)| | ||
| [`Unicode`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Unicode)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html)| | ||
| [`UnicodeText`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.UnicodeText)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html)| | ||
| [`Uuid`](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Uuid)| [`STRING`](https://docs.databricks.com/en/sql/language-manual/data-types/string-type.html) | ||
|
|
||
| In addition, the dialect exposes three UPPERCASE SQLAlchemy types which are specific to Databricks: | ||
|
|
||
| - [`databricks.sqlalchemy.TINYINT`](https://docs.databricks.com/en/sql/language-manual/data-types/tinyint-type.html) | ||
| - [`databricks.sqlalchemy.TIMESTAMP`](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-type.html) | ||
| - [`databricks.sqlalchemy.TIMESTAMP_NTZ`](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-ntz-type.html) | ||
|
|
||
|
|
||
| ### `LargeBinary()` and `PickleType()` | ||
|
|
||
| Databricks Runtime doesn't currently support binding of binary values in SQL queries, which is a prerequisite for this functionality in SQLAlchemy. | ||
|
|
||
| ### `Enum()` and `CHECK` constraints | ||
|
|
||
| Support for `CHECK` constraints is not implemented in this dialect. Support is planned for a future release. | ||
|
|
||
| SQLAlchemy's `Enum()` type depends on `CHECK` constraints and is therefore not yet supported. | ||
|
|
||
| ### `DateTime()`, `TIMESTAMP_NTZ()`, and `TIMESTAMP()` | ||
|
|
||
| Databricks Runtime provides two datetime-like types: `TIMESTAMP`, which is always timezone-aware, and `TIMESTAMP_NTZ`, which is timezone-agnostic. Both types can be imported from `databricks.sqlalchemy` and used in your models. | ||
|
|
||
| The SQLAlchemy documentation indicates that `DateTime()` is not timezone-aware by default. So our dialect maps this type to `TIMESTAMP_NTZ()`. In practice, you should never need to use `TIMESTAMP_NTZ()` directly. Just use `DateTime()`. | ||
|
|
||
| If you need your field to be timezone-aware, you can import `TIMESTAMP()` and use it instead. | ||
|
|
||
| _Note that SQLAlchemy documentation suggests that you can declare a `DateTime()` with `timezone=True` on supported backends. However, if you do this with the Databricks dialect, the `timezone` argument will be ignored._ | ||
|
|
||
| ```python | ||
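| from sqlalchemy import Column, DateTime | ||
| from databricks.sqlalchemy import TIMESTAMP | ||
|
|
||
| class SomeModel(Base): | ||
|     # `Base` is your declarative base; table name and primary key omitted for brevity | ||
|     some_date_without_timezone = Column(DateTime()) | ||
|     some_date_with_timezone = Column(TIMESTAMP()) | ||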
| ``` | ||
|
|
||
| ### `String()`, `Text()`, `Unicode()`, and `UnicodeText()` | ||
|
|
||
| Databricks Runtime doesn't support length limitations for `STRING` fields. Therefore `String()` or `String(1)` or `String(255)` will all produce identical DDL. Since `Text()`, `Unicode()`, `UnicodeText()` all use the same underlying type in Databricks SQL, they will generate equivalent DDL. | ||
|
|
||
| ### `Time()` | ||
|
|
||
| Databricks Runtime doesn't have a native time-like data type. To implement this type in SQLAlchemy, our dialect stores SQLAlchemy `Time()` values in a `STRING` field. Unlike `DateTime` above, this type can optionally support timezone awareness (since the dialect is in complete control of the strings that we write to the Delta table). | ||
|
|
||
| ```python | ||
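| from sqlalchemy import Column, Time | ||
|
|
||
| class SomeModel(Base): | ||
|     # `Base` is your declarative base; table name and primary key omitted for brevity | ||
|     time_tz = Column(Time(timezone=True)) | ||
|     time_ntz = Column(Time()) | ||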
| ``` | ||
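To make the idea of storing `Time()` values in a `STRING` column concrete, here is a minimal round-trip sketch using only the standard library. It assumes ISO 8601 text as the stored form; the string format the dialect actually writes is not shown in this diff, and the helper names are hypothetical.

```python
from datetime import time, timedelta, timezone


def time_to_string(value):
    """Serialize a datetime.time to ISO 8601 text, keeping any UTC offset."""
    return value.isoformat()


def string_to_time(text):
    """Parse ISO 8601 text back into a datetime.time."""
    return time.fromisoformat(text)


# One timezone-aware and one naive value, mirroring time_tz and time_ntz above.
ist = timezone(timedelta(hours=5, minutes=30))
aware = time(14, 30, 15, tzinfo=ist)
naive = time(9, 0)
```

Because the offset is part of the text, an aware value survives the round trip with its timezone, while a naive value comes back with `tzinfo` set to `None`.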
|
|
||
|
|
||
| # Usage Notes | ||
|
|
||
| ## `Identity()` and `autoincrement` | ||
|
|
||
| Identity and generated value support is currently limited in this dialect. | ||
|
|
||
| When defining models, SQLAlchemy types can accept an [`autoincrement`](https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.Column.params.autoincrement) argument. In our dialect, this argument is currently ignored. To create an auto-incrementing field in your model you can pass in an explicit [`Identity()`](https://docs.sqlalchemy.org/en/20/core/defaults.html#identity-ddl) instead. | ||
|
|
||
| Furthermore, in Databricks Runtime, only `BIGINT` fields can be configured to auto-increment. So in SQLAlchemy, you must use the `BigInteger()` type. | ||
|
|
||
| ```python | ||
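| from sqlalchemy import BigInteger, Column, Identity, String | ||
|
|
||
| class SomeModel(Base): | ||
|     __tablename__ = "some_model" | ||
|     # Identity() is passed to Column, not to the BigInteger type | ||
|     id = Column(BigInteger, Identity(), primary_key=True) | ||
|     value = Column(String()) | ||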
| ``` | ||
|
|
||
| When calling `Base.metadata.create_all()`, the executed DDL will include `GENERATED ALWAYS AS IDENTITY` for the `id` column. This is useful when using SQLAlchemy to generate tables. However, as of this writing, `Identity()` constructs are not captured when SQLAlchemy reflects a table's metadata (support for this is planned). | ||
|
|
||
| ## Parameters | ||
|
|
||
| `databricks-sql-connector` supports two approaches to parameterizing SQL queries: native and inline. Our SQLAlchemy 2.0 dialect always uses the native approach and is therefore limited to DBR 14.2 and above. If you are writing parameterized queries to be executed by SQLAlchemy, you must use the "named" paramstyle (`:param`). Read more about parameterization in `docs/parameters.md`. | ||
|
|
||
| ## Usage with pandas | ||
|
|
||
| Use [`pandas.DataFrame.to_sql`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) and [`pandas.read_sql`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql) to write and read from Databricks SQL. These methods both accept a SQLAlchemy connection to interact with Databricks. | ||
|
|
||
| ### Read from Databricks SQL into pandas | ||
| ```python | ||
| from sqlalchemy import create_engine | ||
| import pandas as pd | ||
|
|
||
| engine = create_engine("databricks://token:dapi***@***.cloud.databricks.com?http_path=***&catalog=main&schema=test") | ||
| with engine.connect() as conn: | ||
| # This will read the contents of `main.test.some_table` | ||
| df = pd.read_sql("some_table", conn) | ||
| ``` | ||
|
|
||
| ### Write to Databricks SQL from pandas | ||
|
|
||
| ```python | ||
| from sqlalchemy import create_engine | ||
| import pandas as pd | ||
|
|
||
| engine = create_engine("databricks://token:dapi***@***.cloud.databricks.com?http_path=***&catalog=main&schema=test") | ||
| squares = [(i, i * i) for i in range(100)] | ||
| df = pd.DataFrame(data=squares,columns=['x','x_squared']) | ||
|
|
||
| with engine.connect() as conn: | ||
| # This will write the contents of `df` to `main.test.squares` | ||
| df.to_sql('squares',conn) | ||
| ``` | ||
|
|
||
| ## [`PrimaryKey()`](https://docs.sqlalchemy.org/en/20/core/constraints.html#sqlalchemy.schema.PrimaryKeyConstraint) and [`ForeignKey()`](https://docs.sqlalchemy.org/en/20/core/constraints.html#defining-foreign-keys) | ||
|
|
||
| Unity Catalog workspaces in Databricks support PRIMARY KEY and FOREIGN KEY constraints. _Note that Databricks Runtime does not enforce the integrity of FOREIGN KEY constraints_. You can establish a primary key by setting `primary_key=True` when defining a column. | ||
|
|
||
| When building `ForeignKey` or `ForeignKeyConstraint` objects, you must specify a `name` for the constraint. | ||
|
|
||
| If your model definition requires a self-referential FOREIGN KEY constraint, you must include `use_alter=True` when defining the relationship. | ||
|
|
||
| ```python | ||
| from sqlalchemy import BigInteger, Column, ForeignKey, MetaData, String, Table | ||
|
|
||
| metadata_obj = MetaData() | ||
|
|
||
| users = Table( | ||
| "users", | ||
| metadata_obj, | ||
| Column("id", BigInteger, primary_key=True), | ||
| Column("name", String(), nullable=False), | ||
| Column("email", String()), | ||
| Column("manager_id", ForeignKey("users.id", name="fk_users_manager_id_x_users_id", use_alter=True)) | ||
| ) | ||
| ``` |
Do not use a personal repo; it should be databricks/databricks-sqlalchemy.