
get_column_schema_from_query macro #6986

Merged
merged 19 commits into main from CT-1919/get_column_schema_from_query_macro2 on Mar 3, 2023

Conversation

@MichelleArk (Contributor) commented Feb 15, 2023

resolves #6751

Description

  • adapters/base/impl.py
    • Adds get_column_schema_from_query(self, sql: str) -> List[Tuple[str, Any]] to BaseAdapter
    • calls the connection manager's get_column_schema_from_query method
  • adapters/sql/connection.py
    • Implements default get_column_schema_from_query on SQLConnectionManager, which depends on two new class methods:
      1. get_column_schema_from_cursor
      2. data_type_code_to_name
    • These class methods (as well as get_column_schema_from_query itself) can be overridden in adapter-specific connection manager implementations.
  • plugins/postgres/dbt/adapters/postgres/connections.py
    • PostgresConnectionManager overrides data_type_code_to_name using psycopg2.extensions.string_types. This is the only change needed to extend model contracts to data_types for Postgres (a rough sketch of these pieces appears after this list).
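
A rough, hypothetical sketch of the pieces described above (names and details are illustrative; the actual code in this PR may differ, e.g. in how the query is executed):

from typing import Any, List, Tuple, Union

import dbt.exceptions
from psycopg2.extensions import string_types


class SQLConnectionManager(BaseConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        """Map a DB-API type_code to a string data type name."""
        # https://peps.python.org/pep-0249/#type-objects
        raise dbt.exceptions.NotImplementedError(
            "`data_type_code_to_name` is not implemented for this adapter!"
        )

    @classmethod
    def get_column_schema_from_cursor(cls, cursor: Any) -> List[Tuple[str, Any]]:
        # per DB-API, cursor.description yields one sequence per column,
        # whose first two items are the column name and its type_code
        return [
            (column_name, cls.data_type_code_to_name(type_code))
            for column_name, type_code, *_ in cursor.description
        ]

    def get_column_schema_from_query(self, sql: str) -> List[Tuple[str, Any]]:
        _, cursor = self.add_query(sql, auto_begin=False)
        return self.get_column_schema_from_cursor(cursor)


class PostgresConnectionManager(SQLConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # psycopg2 maps each type OID to a typecaster object whose .name
        # attribute is the string representation of the data type
        return string_types[type_code].name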

get_column_schema_from_query Usage

  • Used in assert_columns_equivalent
    • get_column_schema_from_query is called twice:
      1. Once on the model SQL. This obtains the column schema for the actual model based on its SQL.
      2. Once on the empty_schema_sql, which is simply a select of cast(null as {{ col['data_type'] }}) for each column specified in the expected model schema yaml (a rough illustration follows this list).
    • Once the expected and actual column schemas are obtained, they are formatted and compared, raising a compiler error if any differences are detected.
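
As a rough illustration (not taken from the PR), for a contract declaring columns a int and b text, the generated empty_schema_sql would look something like:

select
    cast(null as int) as a,
    cast(null as text) as b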

Checklist

🎩

❯ dbt run --select test_schema
02:53:06  Running with dbt=1.5.0-a1
02:53:06  Found 6 models, 20 tests, 0 snapshots, 1 analysis, 503 macros, 0 operations, 3 seed files, 0 sources, 1 exposure, 1 metric
02:53:06  
02:53:07  Concurrency: 1 threads (target='dev')
02:53:07  
02:53:07  1 of 1 START sql table model test_arky.test_schematest_v2 ...................... [RUN]
02:53:07  1 of 1 ERROR creating sql table model test_arky.test_schematest_v2 ............. [ERROR in 0.04s]
02:53:07  
02:53:07  Finished running 1 table model in 0 hours 0 minutes and 0.46 seconds (0.46s).
02:53:07  
02:53:07  Completed with 1 error and 0 warnings:
02:53:07  
02:53:07  Compilation Error in model test_schema (models/test_schema.sql)
02:53:07    Please ensure the name, data_type, order, and number of columns in your `yml` file match the columns in your SQL file.
02:53:07    Schema File Columns: 
02:53:07      a INT, b TEXT
02:53:07    
02:53:07    SQL File Columns: 
02:53:07      a INT, c TEXT 
02:53:07    
02:53:07    > in macro assert_columns_equivalent (macros/materializations/models/table/columns_spec_ddl.sql)
02:53:07    > called by model test_schema (models/test_schema.sql)
02:53:07  
02:53:07  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

@github-actions (Contributor) commented:

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@MichelleArk changed the title from Ct 1919/get column schema from query macro2 to get_column_schema_from_query_macro on Feb 15, 2023
@MichelleArk marked this pull request as ready for review on February 22, 2023 18:34
@MichelleArk requested review from a team as code owners on February 22, 2023 18:34
@MichelleArk marked this pull request as draft on February 22, 2023 20:32
@MichelleArk marked this pull request as ready for review on February 28, 2023 01:03
select
{% for i in columns %}
{%- set col = columns[i] -%}
cast(null as {{ col['data_type'] }}) as {{ col['name'] }}{{ ", " if not loop.last }}
@VersusFacit (Contributor) commented Feb 28, 2023

We talked about this. This can possibly lead to some weird type resolution outcomes... we think. It is so far the "best" option and looks promising. I just know SQL's typing mechanisms can get a mind of their own, for the worse.

Contributor commented:

@VersusFacit I had the same concern, and discussed it with Michelle synchronously. On the plus side, this approach will automatically account for any new types which appear, so if it works in practice it will be a lot easier than trying to maintain our own list of type aliases. I like that. I was also reassured that this code path will only affect people using contracts, so there isn't much regression risk and we should hear pretty quickly if there are databases/drivers that this approach doesn't work for.

@MichelleArk (Contributor, Author) commented Feb 28, 2023

One more thing this approach has going for it is that there's a non-opaque definition of what data_type values should be - it's 'the value you'd write in SQL when casting to the desired type', as opposed to 'the value returned by mapping the connection cursor's type code to a string'.

Contributor commented:

This discussion should be seen as a context dump for posterity and is not blocking.

@VersusFacit (Contributor) left a review comment:

My thoughts are all design-minded as opposed to implementation-minded. This might be an error on my part, since Jinja is harder to track than Python. That said, I feel like deliberate design will help indicate/signpost what the implementation is expected to do.

@MichelleArk force-pushed the CT-1919/get_column_schema_from_query_macro2 branch from 0fbed83 to de39489 on March 1, 2023 00:36
@mikealfare (Contributor) commented:

@MichelleArk I could be naive to the context here, so take this with a grain of salt. What is the benefit gained from putting these methods largely on SQLConnectionManager? They rely very little on the connection attributes, mainly just the cursor. And then the result is pulled through in a pass-through method in BaseAdapter with no logic applied.

I'm concerned that we might be polluting the interface on SQLConnectionManager. If I'm thinking about it correctly, SQLConnectionManager deals with the general task of how we interrogate the database and BaseAdapter deals with what we're asking of the database. BaseAdapter effectively uses SQLConnectionManager to answer questions; SQLConnectionManager doesn't ask questions itself.

I would think some of this logic belongs on BaseAdapter. At first glance, I would expose a way to add a standard query on SQLConnectionManager without executing it, basically the body of get_column_schema_from_query. And then I'd probably also put the data_type map there because it has no knowledge of the question being asked:

class SQLConnectionManager(BaseConnectionManager):
    ...

    def add_select_query(self, sql: str) -> Tuple[Connection, Any]:
        sql = self._add_query_comment(sql, auto_begin=False)
        return self.add_query(sql)

    def add_begin_query(self):
        return self.add_query("BEGIN", auto_begin=False)

    ...

    @property
    def data_type_map(self) -> Dict[Union[str, int], str]:
        # build the dictionary somehow, needs to be a method to raise NotImplementedError
        # e.g., abstract
        raise dbt.exceptions.NotImplementedError(...)
        # e.g., Snowflake:
        return snowflake.connector.constants.FIELD_ID_TO_NAME

    ...

Then I'd put the logic from get_column_schema_from_cursor into get_column_schema_from_query on BaseAdapter, e.g.:

ColumnSchema = List[Tuple[str, str]]


class BaseAdapter(metaclass=AdapterMeta):
    ...

    @available.parse(lambda *a, **k: [])
    def get_column_schema_from_query(self, sql: str) -> ColumnSchema:
        _, cursor = self.connections.add_select_query(sql)
        data_type_map = self.connections.data_type_map
        columns: ColumnSchema = [
            (column_name, data_type_map.get(column_code, "UNDEFINED"))
            for column_name, column_code, *_ in cursor.description
        ]
        return columns

    ...

Am I thinking about this correctly?

@VersusFacit (Contributor) left a review comment:

I'm going to approve this so as not to hold things up. We all agree that the code is functional and generally "makes sense."

However, please do consider Mike's comment before hitting merge. Likewise, Mike (or any of my team colleagues), request changes to make this even more clear.

Our interfaces are in flux, and this is the right time, I think, to ask these questions. If it doesn't end up in this PR, we should consider what sorts of "contracts" we want these objects to observe. I admit I hadn't considered the specifics Mike raises until now, but they feel cogent, insofar as we should weigh these tradeoffs now (early).

@MichelleArk (Contributor, Author) commented:

@mikealfare - Thank you for this really thoughtful comment!

I'm concerned that we might be polluting the interface on SQLConnectionManager.

I had the same concern here and went back and forth for a bit on where this functionality should live. I agree with the overall responsibilities of SQLConnectionManager and BaseAdapter you've described, and that the get_column_schema_from_query functionality in BaseAdapter goes beyond the general task of how we interrogate the database.

Your proposal makes a lot of sense and should work across adapter implementations, but my main concern was that the SQLConnectionManager would be leaking the cursor object, which is somewhat standardized by DB-API but will vary across adapters and is tied to how the database runs queries. Should the BigQueryAdapter need to know about bigquery.job.query.QueryJob (and know that a query has a temp destination table?) or bigquery.table.RowIterator? But it looks like the bigquery adapter is already doing that, for a very similar purpose. I hadn't noticed that was already the case 🤦

All that said - I do think the BaseAdapter is still the better spot for the implementation for the reasons you described. It additionally is configured with the Column class and is better positioned to create a ColumnSchema = List[Column] result for end callers to use, eliminating the need for this messiness in jinja.

I'm going to work in your proposed structure - it shouldn't require a huge refactoring lift in adapter-specific implementations. Longer term, I think we'd want to explore introducing an abstraction around a Cursor or QueryResult (similar to what we've done with Connection).
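
As a rough sketch of that direction (illustrative only, assuming the add_select_query helper proposed above; the final implementation may differ), the BaseAdapter method could build Column objects directly from the cursor description via the adapter's configured Column class:

from typing import List

from dbt.adapters.base.column import Column as BaseColumn


class BaseAdapter(metaclass=AdapterMeta):
    ...

    @available.parse(lambda *a, **k: [])
    def get_column_schema_from_query(self, sql: str) -> List[BaseColumn]:
        """Get a list of Column objects describing the result set of `sql`."""
        _, cursor = self.connections.add_select_query(sql)
        return [
            self.Column.create(
                column_name,
                self.connections.data_type_code_to_name(column_type_code),
            )
            for column_name, column_type_code, *_ in cursor.description
        ]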

Comment on lines +135 to +142
@classmethod
def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
"""Get the string representation of the data type from the type_code."""
# https://peps.python.org/pep-0249/#type-objects
raise dbt.exceptions.NotImplementedError(
"`data_type_code_to_name` is not implemented for this adapter!"
)

Contributor commented:

I was interested in seeing a tangible example of how data type codes map to a string representation for a database connector.

This table from the Snowflake docs was useful to me:
https://docs.snowflake.com/en/user-guide/python-connector-api#label-python-connector-type-codes

type_code   String Representation   Data Type
0           FIXED                   NUMBER/INT
1           REAL                    REAL
2           TEXT                    VARCHAR/STRING
3           DATE                    DATE
4           TIMESTAMP               TIMESTAMP
5           VARIANT                 VARIANT
6           TIMESTAMP_LTZ           TIMESTAMP_LTZ
7           TIMESTAMP_TZ            TIMESTAMP_TZ
8           TIMESTAMP_NTZ           TIMESTAMP_TZ
9           OBJECT                  OBJECT
10          ARRAY                   ARRAY
11          BINARY                  BINARY
12          TIME                    TIME
13          BOOLEAN                 BOOLEAN

(Side note: I suspect there is a typo for code 8; the Data Type there should be TIMESTAMP_NTZ instead of TIMESTAMP_TZ.)
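
For example, a Snowflake-flavored data_type_code_to_name could lean on the connector's own mapping rather than hard-coding the table above (a hypothetical sketch, not code from this PR):

from typing import Union

from snowflake.connector.constants import FIELD_ID_TO_NAME


class SnowflakeConnectionManager(SQLConnectionManager):
    ...

    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # FIELD_ID_TO_NAME maps e.g. 0 -> "FIXED", 2 -> "TEXT", 13 -> "BOOLEAN"
        return FIELD_ID_TO_NAME[type_code]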

@MichelleArk closed this on Mar 3, 2023
@MichelleArk reopened this on Mar 3, 2023
return int_type

@pytest.fixture
def data_types(self, schema_int_type, int_type, string_type):
@MichelleArk (Contributor, Author) commented Mar 3, 2023

I would have loved to use pytest's parametrized functionality here (and did implement it that way to start), but unfortunately couldn't find a way to make it work with the testing inheritance pattern we have for adapter tests given the parametrized test case array would need to be provided dynamically - which pytest.parametrize does not seem to support.
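
A rough sketch of the fixture-based pattern described above (class and fixture names are illustrative; the real tests in this PR may be organized differently):

import pytest


class BaseConstraintsColumnsEqual:
    @pytest.fixture
    def data_types(self, schema_int_type, int_type, string_type):
        # each entry: (sql_column_value, schema_data_type, error_data_type);
        # adapter test suites override this fixture to supply their own types
        return [
            ["1", schema_int_type, int_type],
            ["'1'", "text", string_type],
        ]

    def test_constraints_columns_equal(self, project, data_types):
        for (sql_column_value, schema_data_type, error_data_type) in data_types:
            # write a model casting sql_column_value, run dbt, and assert on
            # the error message rendered with error_data_type
            ...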

):
for (sql_column_value, schema_data_type, error_data_type) in data_types:
# Write parametrized data_type to sql file
write_file(
Contributor commented:

It would be nice for this function to be a fixture instead, so that we can yield the file and then clean it up after. Maybe the project fixture does that if the file is set up in the project directory?

@MichelleArk (Contributor, Author) replied:

The project fixture drops the test schema on teardown, but from what I can tell, it does not clean up the temp project directory. I understand this might actually be by design, for ease of debugging failing tests - similar to intentionally preserving log files. Worth revisiting, but probably out of scope for this PR.

@MichelleArk changed the title from get_column_schema_from_query_macro to get_column_schema_from_query macro on Mar 3, 2023
@MichelleArk merged commit b681908 into main on Mar 3, 2023
@MichelleArk deleted the CT-1919/get_column_schema_from_query_macro2 branch on March 3, 2023 19:21
acurtis-evi pushed a commit to acurtis-evi/dbt-core that referenced this pull request Mar 7, 2023
Add adapter.get_column_schema_from_query
Successfully merging this pull request may close these issues:

[CT-1919] Create get_column_schema_from_query macro