
get_column_schema_from_query macro #6986

Merged
merged 19 commits into main from CT-1919/get_column_schema_from_query_macro2 on Mar 3, 2023

Conversation

@MichelleArk (Contributor) commented Feb 15, 2023

resolves #6751

Description

  • adapters/base/impl.py
    • Adds get_column_schema_from_query(self, sql: str) -> List[Tuple[str, Any]] to BaseAdapter
    • calls the connection manager's get_column_schema_from_query method
  • adapters/sql/connection.py
    • Implements default get_column_schema_from_query on SQLConnectionManager, which depends on two new class methods:
      1. get_column_schema_from_cursor
      2. data_type_code_to_name
    • These class methods (as well as get_column_schema_from_query itself) can be overridden in adapter-specific connection manager implementations.
  • plugins/postgres/dbt/adapters/postgres/connections.py
    • PostgresConnectionManager overrides data_type_code_to_name using psycopg2.extensions.string_types. This is the only change needed to extend model contracts to data_types for Postgres (a rough sketch of these pieces appears after this list).
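
A rough, hypothetical sketch of the pieces described above (names and details are illustrative; the actual code in this PR may differ, e.g. in how the query is executed):

from typing import Any, List, Tuple, Union

import dbt.exceptions
from psycopg2.extensions import string_types


class SQLConnectionManager(BaseConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        """Map a DB-API type_code to a string data type name."""
        # https://peps.python.org/pep-0249/#type-objects
        raise dbt.exceptions.NotImplementedError(
            "`data_type_code_to_name` is not implemented for this adapter!"
        )

    @classmethod
    def get_column_schema_from_cursor(cls, cursor: Any) -> List[Tuple[str, Any]]:
        # per DB-API, cursor.description yields one sequence per column,
        # whose first two items are the column name and its type_code
        return [
            (column_name, cls.data_type_code_to_name(type_code))
            for column_name, type_code, *_ in cursor.description
        ]

    def get_column_schema_from_query(self, sql: str) -> List[Tuple[str, Any]]:
        _, cursor = self.add_query(sql, auto_begin=False)
        return self.get_column_schema_from_cursor(cursor)


class PostgresConnectionManager(SQLConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # psycopg2 maps each type OID to a typecaster object whose .name
        # attribute is the string representation of the data type
        return string_types[type_code].name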

get_column_schema_from_query Usage

  • Used in assert_columns_equivalent
    • get_column_schema_from_query is called twice:
      1. Once on the model SQL. This obtains the column schema for the actual model based on its SQL.
      2. Once on the empty_schema_sql, which is simply a select of cast(null as {{ col['data_type'] }}) for each column specified in the expected model schema yaml (a rough illustration follows this list).
    • Once the expected and actual column schemas are obtained, they are formatted and compared, raising a compiler error if any differences are detected.
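
As a rough illustration (not taken from the PR), for a contract declaring columns a int and b text, the generated empty_schema_sql would look something like:

select
    cast(null as int) as a,
    cast(null as text) as b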

Checklist

🎩

❯ dbt run --select test_schema
02:53:06  Running with dbt=1.5.0-a1
02:53:06  Found 6 models, 20 tests, 0 snapshots, 1 analysis, 503 macros, 0 operations, 3 seed files, 0 sources, 1 exposure, 1 metric
02:53:06  
02:53:07  Concurrency: 1 threads (target='dev')
02:53:07  
02:53:07  1 of 1 START sql table model test_arky.test_schematest_v2 ...................... [RUN]
02:53:07  1 of 1 ERROR creating sql table model test_arky.test_schematest_v2 ............. [ERROR in 0.04s]
02:53:07  
02:53:07  Finished running 1 table model in 0 hours 0 minutes and 0.46 seconds (0.46s).
02:53:07  
02:53:07  Completed with 1 error and 0 warnings:
02:53:07  
02:53:07  Compilation Error in model test_schema (models/test_schema.sql)
02:53:07    Please ensure the name, data_type, order, and number of columns in your `yml` file match the columns in your SQL file.
02:53:07    Schema File Columns: 
02:53:07      a INT, b TEXT
02:53:07    
02:53:07    SQL File Columns: 
02:53:07      a INT, c TEXT 
02:53:07    
02:53:07    > in macro assert_columns_equivalent (macros/materializations/models/table/columns_spec_ddl.sql)
02:53:07    > called by model test_schema (models/test_schema.sql)
02:53:07  
02:53:07  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

@github-actions (Contributor) commented:

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@MichelleArk changed the title from Ct 1919/get column schema from query macro2 to get_column_schema_from_query_macro on Feb 15, 2023
@MichelleArk marked this pull request as ready for review on February 22, 2023 18:34
@MichelleArk requested review from a team as code owners on February 22, 2023 18:34
@MichelleArk marked this pull request as draft on February 22, 2023 20:32
@MichelleArk marked this pull request as ready for review on February 28, 2023 01:03
select
{% for i in columns %}
{%- set col = columns[i] -%}
cast(null as {{ col['data_type'] }}) as {{ col['name'] }}{{ ", " if not loop.last }}
@VersusFacit (Contributor) commented Feb 28, 2023

We talked about this. This can possibly lead to some weird type resolution outcomes... we think. It is so far the "best" option and looks promising. I just know SQL's typing mechanisms can get a mind of their own, for the worse.

Contributor commented:

@VersusFacit I had the same concern, and discussed it with Michelle synchronously. On the plus side, this approach will automatically account for any new types which appear, so if it works in practice it will be a lot easier than trying to maintain our own list of type aliases. I like that. I was also reassured that this code path will only affect people using contracts, so there isn't much regression risk and we should hear pretty quickly if there are databases/drivers that this approach doesn't work for.

@MichelleArk (Contributor, Author) commented Feb 28, 2023

One more thing this approach has going for it is that there's a non-opaque definition of what data_type values should be - it's 'the value you'd write in SQL when casting to the desired type', as opposed to 'the value returned by mapping the connection cursor's type code to a string'.

Contributor commented:

This discussion should be seen as a context dump for posterity and is not blocking.

@VersusFacit (Contributor) left a review comment:

My thoughts are all design-minded as opposed to implementation-minded. This might be an error on my part, since Jinja is harder to track than Python. That said, I feel like deliberate design will help indicate/signpost what the implementation is expected to do.

@MichelleArk force-pushed the CT-1919/get_column_schema_from_query_macro2 branch from 0fbed83 to de39489 on March 1, 2023 00:36
@mikealfare (Contributor) commented:

@MichelleArk I could be naive to the context here, so take this with a grain of salt. What is the benefit gained from putting these methods largely on SQLConnectionManager? They rely very little on the connection attributes, mainly just the cursor. And then the result is pulled through in a pass-through method in BaseAdapter with no logic applied.

I'm concerned that we might be polluting the interface on SQLConnectionManager. If I'm thinking about it correctly, SQLConnectionManager deals with the general task of how we interrogate the database and BaseAdapter deals with what we're asking of the database. BaseAdapter effectively uses SQLConnectionManager to answer questions; SQLConnectionManager doesn't ask questions itself.

I would think some of this logic belongs on BaseAdapter. At first glance, I would expose a way to add a standard query on SQLConnectionManager without executing it, basically the body of get_column_schema_from_query. And then I'd probably also put the data_type map there because it has no knowledge of the question being asked:

class SQLConnectionManager(BaseConnectionManager):
    ...

    def add_select_query(self, sql: str) -> Tuple[Connection, Any]:
        sql = self._add_query_comment(sql, auto_begin=False)
        return self.add_query(sql)

    def add_begin_query(self):
        return self.add_query("BEGIN", auto_begin=False)

    ...

    @property
    def data_type_map(self) -> Dict[Union[str, int], str]:
        # build the dictionary somehow, needs to be a method to raise NotImplementedError
        # e.g., abstract
        raise dbt.exceptions.NotImplementedError(...)
        # e.g., Snowflake:
        return snowflake.connector.constants.FIELD_ID_TO_NAME

    ...

Then I'd put the logic from get_column_schema_from_cursor into get_column_schema_from_query on BaseAdapter, e.g.:

ColumnSchema = List[Tuple[str, str]]


class BaseAdapter(metaclass=AdapterMeta):
    ...

    @available.parse(lambda *a, **k: [])
    def get_column_schema_from_query(self, sql: str) -> ColumnSchema:
        _, cursor = self.connections.add_select_query(sql)
        data_type_map = self.connections.data_type_map
        columns: ColumnSchema = [
            (column_name, data_type_map.get(column_code, "UNDEFINED"))
            for column_name, column_code, *_ in cursor.description
        ]
        return columns

    ...

Am I thinking about this correctly?

@VersusFacit (Contributor) left a review comment:

I'm going to approve this so as not to hold things up. We all agree that the code is functional and generally "makes sense."

However, please do consider Mike's comment before hitting merge. Likewise, Mike (or any of my team colleagues), request changes to make this even more clear.

Our interfaces are in flux, and this is the right time, I think, to ask these questions. If it doesn't end up in this PR, we should consider what sorts of "contracts" we want these objects to observe. I admit I hadn't considered the specifics Mike raises until now, but they feel cogent, insofar as we should weigh these tradeoffs now (early).

@MichelleArk (Contributor, Author) commented:

@mikealfare - Thank you for this really thoughtful comment!

I'm concerned that we might be polluting the interface on SQLConnectionManager.

I had the same concern here and went back and forth for a bit on where this functionality should live. I agree with the overall responsibilities of SQLConnectionManager and BaseAdapter you've described, and that the get_column_schema_from_query functionality in BaseAdapter goes beyond the general task of how we interrogate the database.

Your proposal makes a lot of sense and should work across adapter implementations, but my main concern was that the SQLConnectionManager would be leaking the cursor object, which is somewhat standardized by DB-API but will vary across adapters and is tied to how the database runs queries. Should the BigQueryAdapter need to know about bigquery.job.query.QueryJob (and know that a query has a temp destination table?) or bigquery.table.RowIterator? But it looks like the bigquery adapter is already doing that, for a very similar purpose. I hadn't noticed that was already the case 🤦

All that said - I do think the BaseAdapter is still the better spot for the implementation for the reasons you described. It additionally is configured with the Column class and is better positioned to create a ColumnSchema = List[Column] result for end callers to use, eliminating the need for this messiness in jinja.

I'm going to work in your proposed structure - it shouldn't require a huge refactoring lift in adapter-specific implementations. Longer term, I think we'd want to explore introducing an abstraction around a Cursor or QueryResult (similar to what we've done with Connection).
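
As a rough sketch of that direction (illustrative only, assuming the add_select_query helper proposed above; the final implementation may differ), the BaseAdapter method could build Column objects directly from the cursor description via the adapter's configured Column class:

from typing import List

from dbt.adapters.base.column import Column as BaseColumn


class BaseAdapter(metaclass=AdapterMeta):
    ...

    @available.parse(lambda *a, **k: [])
    def get_column_schema_from_query(self, sql: str) -> List[BaseColumn]:
        """Get a list of Column objects describing the result set of `sql`."""
        _, cursor = self.connections.add_select_query(sql)
        return [
            self.Column.create(
                column_name,
                self.connections.data_type_code_to_name(column_type_code),
            )
            for column_name, column_type_code, *_ in cursor.description
        ]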

Comment on lines +135 to +142
@classmethod
def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
"""Get the string representation of the data type from the type_code."""
# https://peps.python.org/pep-0249/#type-objects
raise dbt.exceptions.NotImplementedError(
"`data_type_code_to_name` is not implemented for this adapter!"
)

Contributor commented:

I was interested in seeing a tangible example of how data type codes map to a string representation for a database connector.

This table from the Snowflake docs was useful to me:
https://docs.snowflake.com/en/user-guide/python-connector-api#label-python-connector-type-codes

type_code   String Representation   Data Type
0           FIXED                   NUMBER/INT
1           REAL                    REAL
2           TEXT                    VARCHAR/STRING
3           DATE                    DATE
4           TIMESTAMP               TIMESTAMP
5           VARIANT                 VARIANT
6           TIMESTAMP_LTZ           TIMESTAMP_LTZ
7           TIMESTAMP_TZ            TIMESTAMP_TZ
8           TIMESTAMP_NTZ           TIMESTAMP_TZ
9           OBJECT                  OBJECT
10          ARRAY                   ARRAY
11          BINARY                  BINARY
12          TIME                    TIME
13          BOOLEAN                 BOOLEAN

(Side note: I suspect there is a typo for code 8; the Data Type there should be TIMESTAMP_NTZ instead of TIMESTAMP_TZ.)
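
For example, a Snowflake-flavored data_type_code_to_name could lean on the connector's own mapping rather than hard-coding the table above (a hypothetical sketch, not code from this PR):

from typing import Union

from snowflake.connector.constants import FIELD_ID_TO_NAME


class SnowflakeConnectionManager(SQLConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # FIELD_ID_TO_NAME maps e.g. 0 -> "FIXED", 2 -> "TEXT", 13 -> "BOOLEAN"
        return FIELD_ID_TO_NAME[type_code]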

@MichelleArk closed this on Mar 3, 2023
@MichelleArk reopened this on Mar 3, 2023
return int_type

@pytest.fixture
def data_types(self, schema_int_type, int_type, string_type):
@MichelleArk (Contributor, Author) commented Mar 3, 2023

I would have loved to use pytest's parametrized functionality here (and did implement it that way to start), but unfortunately couldn't find a way to make it work with the testing inheritance pattern we have for adapter tests given the parametrized test case array would need to be provided dynamically - which pytest.parametrize does not seem to support.
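
A rough sketch of the fixture-based pattern described above (class and fixture names are illustrative; the real tests in this PR may be organized differently):

import pytest


class BaseConstraintsColumnsEqual:
    @pytest.fixture
    def data_types(self, schema_int_type, int_type, string_type):
        # each entry: (sql_column_value, schema_data_type, error_data_type);
        # adapter test suites override this fixture to supply their own types
        return [
            ["1", schema_int_type, int_type],
            ["'1'", "text", string_type],
        ]

    def test_constraints_columns_equal(self, project, data_types):
        for (sql_column_value, schema_data_type, error_data_type) in data_types:
            # write a model casting sql_column_value, run dbt, and assert on
            # the error message rendered with error_data_type
            ...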

):
for (sql_column_value, schema_data_type, error_data_type) in data_types:
# Write parametrized data_type to sql file
write_file(
Contributor commented:

It would be nice for this function to be a fixture instead, so that we can yield the file and then clean it up after. Maybe the project fixture does that if the file is set up in the project directory?

@MichelleArk (Contributor, Author) replied:

The project fixture drops the test schema on teardown, but from what I can tell, it does not clean up the temp project directory. I understand this might actually be by design, for ease of debugging failing tests - similar to intentionally preserving log files. Worth revisiting, but probably out of scope for this PR.

@MichelleArk changed the title from get_column_schema_from_query_macro to get_column_schema_from_query macro on Mar 3, 2023
@MichelleArk merged commit b681908 into main on Mar 3, 2023
@MichelleArk deleted the CT-1919/get_column_schema_from_query_macro2 branch on March 3, 2023 19:21
acurtis-evi pushed a commit to acurtis-evi/dbt-core that referenced this pull request Mar 7, 2023
Add adapter.get_column_schema_from_query
Successfully merging this pull request may close these issues:

[CT-1919] Create get_column_schema_from_query macro