get_column_schema_from_query macro #6986
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.
```sql
select
{% for i in columns %}
  {%- set col = columns[i] -%}
  cast(null as {{ col['data_type'] }}) as {{ col['name'] }}{{ ", " if not loop.last }}
```
We talked about this. This can possibly lead to some weird type resolution outcomes... we think. So far this is the "best" option and it looks promising. I just know SQL's typing mechanisms can get a mind of their own for the worse.
@VersusFacit I had the same concern, and discussed it with Michelle synchronously. On the plus side, this approach will automatically account for any new types which appear, so if it works in practice it will be a lot easier than trying to maintain our own list of type aliases. I like that. I was also reassured that this code path will only affect people using contracts, so there isn't much regression risk, and we should hear pretty quickly if there are databases/drivers that this approach doesn't work for.
One more thing this approach has going for it is that there's a non-opaque definition of what `data_type` values should be - it's 'the value you'd write in SQL when casting to the desired type', as opposed to 'the value returned by mapping the connection cursor's type code to a string'.
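To make that concrete: the cast-null query never returns rows, but the driver still resolves each cast's type and reports it in the DB-API `cursor.description`. A minimal sketch, assuming a reachable Postgres database and `psycopg2` (the connection string is hypothetical; none of this is part of the PR diff):

```python
# Illustrative only; not part of this PR.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # hypothetical DSN
cur = conn.cursor()

# The same shape of SQL the macro above renders from the contract yaml:
cur.execute("select cast(null as integer) as id, cast(null as varchar) as name")

for name, type_code, *_ in cur.description:
    # type_code is the backend's type identifier; each driver has its own way
    # of turning it into a readable name (see data_type_code_to_name below)
    print(name, type_code)

conn.close()
```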
this discussion should be seen as a context dump for posterity and is not blocking.
My thoughts are all design-minded as opposed to implementation-minded. This might be an error on my part, since Jinja is harder to track than Python. That said, I feel like deliberate design will help indicate/signpost what the implementation is expected to do.
@MichelleArk I could be naive to the context here, so take this with a grain of salt. What is the benefit gained from putting these methods largely on the adapter? I'm concerned that we might be polluting the adapter's interface, and I would think some of this logic belongs on `SQLConnectionManager`:

```python
class SQLConnectionManager(BaseConnectionManager):
    ...

    def add_select_query(self, sql: str) -> Tuple[Connection, Any]:
        sql = self._add_query_comment(sql)
        return self.add_query(sql, auto_begin=False)

    def add_begin_query(self):
        return self.add_query("BEGIN", auto_begin=False)

    ...

    @property
    def data_type_map(self) -> Dict[Union[str, int], str]:
        # build the dictionary somehow; needs to be a method so the base class
        # can raise NotImplementedError, i.e. effectively abstract
        raise dbt.exceptions.NotImplementedError(...)
        # e.g., for Snowflake:
        return snowflake.connector.constants.FIELD_ID_TO_NAME

    ...
```

Then I'd put the logic from `get_column_schema_from_query` on `BaseAdapter`:

```python
ColumnSchema = List[Tuple[str, str]]


class BaseAdapter(metaclass=AdapterMeta):
    ...

    @available.parse(lambda *a, **k: [])
    def get_column_schema_from_query(self, sql: str) -> ColumnSchema:
        _, cursor = self.connections.add_select_query(sql)
        data_type_map = self.connections.data_type_map
        columns: ColumnSchema = [
            (column_name, data_type_map.get(column_code, "UNDEFINED"))
            for column_name, column_code, *_ in cursor.description
        ]
        return columns

    ...
```

Am I thinking about this correctly?
I'm going to approve this so as not to hold things up. We all agree that the code is functional and generally "makes sense."
However, please do consider Mike's comment before hitting merge. Likewise, Mike (or any of my team colleagues): feel free to request changes to make this even clearer.
Our interfaces are in flux, and this is the right time, I think, to ask these questions. If it doesn't end up in this PR, we should consider what sorts of "contracts" we want these objects to observe. I admit I hadn't considered the specific questions Mike raises until now, but they feel cogent, insofar as we should weigh these tradeoffs now (early).
@mikealfare - Thank you for this really thoughtful comment!
I had the same concern here and went back and forth for a bit between where this functionality should live, and I agree with the overall responsibility of each class as you lay it out. Your proposal makes a lot of sense and should work across adapter implementations, though I did have some reservations about the details. All that said, I'm going to work in your proposed structure - it shouldn't require a huge refactoring lift in adapter-specific implementations. Longer term, I think we'd want to explore introducing a more formal abstraction for this.
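For readers following along, a rough, hypothetical sketch of how the two class methods named in the PR description below (`get_column_schema_from_cursor` and `data_type_code_to_name`) could compose on a connection manager; the class name is made up and this is not the exact code that landed:

```python
# Illustrative sketch only, not the merged implementation.
from typing import Any, List, Tuple, Union

ColumnSchema = List[Tuple[str, str]]


class SketchConnectionManager:
    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        """Adapter-specific: map a DB-API type_code to a data type name."""
        raise NotImplementedError

    @classmethod
    def get_column_schema_from_cursor(cls, cursor: Any) -> ColumnSchema:
        # DB-API cursor.description yields (name, type_code, ...) per column
        return [
            (name, cls.data_type_code_to_name(type_code))
            for name, type_code, *_ in cursor.description
        ]
```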
```python
@classmethod
def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
    """Get the string representation of the data type from the type_code."""
    # https://peps.python.org/pep-0249/#type-objects
    raise dbt.exceptions.NotImplementedError(
        "`data_type_code_to_name` is not implemented for this adapter!"
    )
```
I was interested in seeing a tangible example of how data type codes map to a string representation for a database connector.
This table from the Snowflake docs was useful to me:
https://docs.snowflake.com/en/user-guide/python-connector-api#label-python-connector-type-codes
| type_code | String Representation | Data Type |
|---|---|---|
| 0 | FIXED | NUMBER/INT |
| 1 | REAL | REAL |
| 2 | TEXT | VARCHAR/STRING |
| 3 | DATE | DATE |
| 4 | TIMESTAMP | TIMESTAMP |
| 5 | VARIANT | VARIANT |
| 6 | TIMESTAMP_LTZ | TIMESTAMP_LTZ |
| 7 | TIMESTAMP_TZ | TIMESTAMP_TZ |
| 8 | TIMESTAMP_NTZ | TIMESTAMP_TZ |
| 9 | OBJECT | OBJECT |
| 10 | ARRAY | ARRAY |
| 11 | BINARY | BINARY |
| 12 | TIME | TIME |
| 13 | BOOLEAN | BOOLEAN |
(Side note: I suspect there is a typo for code 8, and the TIMESTAMP_TZ there should be TIMESTAMP_NTZ instead.)
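To make that mapping concrete, here is a minimal, hypothetical sketch of how a Snowflake connection manager could implement `data_type_code_to_name` on top of the connector's built-in dictionary (`snowflake.connector.constants.FIELD_ID_TO_NAME`, the same mapping referenced earlier in this thread). This is an illustration, not the actual dbt-snowflake code.

```python
# Sketch only - assumes the snowflake-connector-python package is installed.
from typing import Union

from snowflake.connector.constants import FIELD_ID_TO_NAME  # e.g. {0: "FIXED", 2: "TEXT", ...}


class SnowflakeConnectionManagerSketch:
    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # type_code is the integer from cursor.description, i.e. the
        # left-hand column of the table above
        return FIELD_ID_TO_NAME[int(type_code)]
```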
```python
        return int_type

    @pytest.fixture
    def data_types(self, schema_int_type, int_type, string_type):
```
I would have loved to use pytest's `parametrize` functionality here (and did implement it that way to start), but unfortunately couldn't find a way to make it work with the testing inheritance pattern we have for adapter tests, given the parametrized test case array would need to be provided dynamically - which `pytest.parametrize` does not seem to support.
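A minimal sketch of the pattern that does work with the adapter-test inheritance approach: the base class exposes the cases as a fixture, subclasses override it, and the test loops at runtime. Class, fixture, and case names below are made up for the sketch, not the actual test classes in this PR.

```python
# Illustrative only.
import pytest


class BaseColumnTypesTest:
    @pytest.fixture
    def data_types(self):
        # (sql_column_value, schema_data_type, error_data_type)
        return [
            ("1", "int", "integer"),
            ("'text'", "string", "character varying"),
        ]

    def test_data_types(self, data_types):
        # pytest.mark.parametrize can't consume cases supplied by a fixture,
        # so iterate inside the test body instead
        for sql_column_value, schema_data_type, error_data_type in data_types:
            ...  # write the model file, run dbt, and assert on the result


class TestColumnTypesOnMyAdapter(BaseColumnTypesTest):
    @pytest.fixture
    def data_types(self):
        # adapter-specific cases, provided dynamically by overriding the fixture
        return [("cast(1 as bigint)", "bigint", "bigint")]
```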
```python
    ):
        for (sql_column_value, schema_data_type, error_data_type) in data_types:
            # Write parametrized data_type to sql file
            write_file(
```
It would be nice for this function to be a fixture instead, so that we can yield the file and then clean it up after. Maybe the `project` fixture does that if the file is set up in the project directory?
The `project` fixture drops the test schema on teardown, but from what I can tell, it does not clean up the temp project directory. I understand this might actually be by design, for ease of debugging failing tests - similar to intentionally preserving log files. Worth revisiting, but probably out of scope for this PR.
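For reference, a minimal sketch of what a yielding fixture could look like here: write the file, hand its path to the test, and remove it on teardown. The file name and contents are placeholders, and `project` is assumed to be the standard dbt adapter-test fixture exposing `project_root`.

```python
# Sketch only; names and contents are placeholders.
import os

import pytest


@pytest.fixture
def my_model_sql_file(project):
    path = os.path.join(project.project_root, "models", "my_model.sql")
    with open(path, "w") as f:
        f.write("select 1 as id")
    yield path
    # teardown: remove the generated file, since the temp project directory
    # itself is intentionally left behind for debugging
    if os.path.exists(path):
        os.remove(path)
```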
Add adapter.get_column_schema_from_query
resolves #6751
Description
- `adapters/base/impl.py`: adds a `get_column_schema_from_query(self, sql: str) -> List[Tuple[str, Any]]` method to `BaseAdapter`.
- `adapters/sql/connection.py`: adds `get_column_schema_from_query` on `SQLConnectionManager`, which depends on two new class methods: `get_column_schema_from_cursor` and `data_type_code_to_name`. These (along with `get_column_schema_from_query`) can be overwritten in adapter-specific connection manager implementations.
- `plugins/postgres/dbt/adapters/postgres/connections.py`: `PostgresConnectionManager` overwrites `data_type_code_to_name` using `psycopg2.extensions.string_types`. This is the only change necessary to get model contracts extended to `data_type`s for Postgres.

get_column_schema_from_query Usage
`get_column_schema_from_query` is called twice: once with the `empty_schema_sql`, which is simply a `select cast(null as {{ col['data_type'] }})` for each column specified in the expected model schema yaml.
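As a concrete illustration of the Postgres bullet above, a minimal sketch of that override is below. It assumes `psycopg2.extensions.string_types` maps a Postgres type OID to a typecaster object with a readable `.name`; the class here is a simplified stand-in, not the exact code in `plugins/postgres/dbt/adapters/postgres/connections.py`.

```python
# Sketch only; see the PR's Postgres connection manager for the real change.
from typing import Union

from psycopg2.extensions import string_types


class PostgresConnectionManagerSketch:
    @classmethod
    def data_type_code_to_name(cls, type_code: Union[int, str]) -> str:
        # string_types maps the type OID reported in cursor.description to a
        # psycopg2 typecaster whose .name is a readable type (e.g. INTEGER)
        return string_types[type_code].name
```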
Checklist
- Run `changie new` to create a changelog entry

🎩