Prefer explicitly provided dataset over default dataset in lookup #53
Conversation
Two other cases that I came across that aren't addressed in either this PR or the current code:
If you have strong feelings on changing those two behaviors, I can take a crack at addressing that in this PR.
assert engine_using_test_dataset.has_table('sample_alt') is False
assert engine_using_test_dataset.has_table('sample_alt', 'test_pybigquery_alt') is True
assert engine_using_test_dataset.has_table('test_pybigquery_alt.sample_alt') is True
I ran the full test suite with these changes and it passed.
I think that the behavior for both cases should be as similar as possible to other SQLAlchemy dialects.
Can you check the behavior in one of the popular dialects?
table_name_prepared = table_name
project, dataset, table_name_prepared = self._split_table_name(table_name)
if dataset is None:
    if schema is not None:
Is giving precedence to the dataset embedded in the table name over the schema argument the behavior in other SQLAlchemy dialects?
Unlike BigQuery, Postgres has a two level hierarchy (schema and table).
Postgres doesn't seem to do any interpretation of schema from the tablename parameter: https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/base.py#L2599-L2635
Same situation with MySQL.
Arguably BigQuery is a special situation given the three level hierarchy (project/dataset/table). Other common databases let you run different database instances on the same server and with the same binaries, but BigQuery allows all relevant project_ids to be reached via the same connection and joined with each other, so it's more like a two-level schema name than anything else.
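To make the three-level split concrete, here is an illustrative sketch of how a BigQuery table identifier could be decomposed into its (project, dataset, table) parts. The `split_table_name` helper below is an assumption for illustration, not the dialect's actual `_split_table_name` implementation:

```python
# Illustrative sketch only; the name and exact behavior are assumptions,
# not the dialect's actual _split_table_name implementation.
def split_table_name(full_name):
    """Split 'project.dataset.table', 'dataset.table', or 'table' into
    a (project, dataset, table) tuple, using None for missing levels."""
    parts = full_name.split(".")
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return None, parts[0], parts[1]
    if len(parts) == 1:
        return None, None, parts[0]
    raise ValueError("Invalid BigQuery table name: %r" % full_name)
```

This treats the project as just the outermost level of a dotted name, which matches the "more like a two-level schema name" view above.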
Makes sense.
What do you think should be the behavior for BQ?
Raising an error in that case sounds legitimate to me; the problem is that we would be changing a fundamental behavior.
Anyway this should probably be documented.
I mean, ideally I think the parsing on the table name should be dropped entirely, and we should only parse the schema. Given you already parse on the table name, it's probably rude to stop doing so, as there are probably code bases that rely on it.
That said, if we maintain compatibility there, I do think raising an exception when the user gives you conflicting instructions is safer than arbitrarily choosing which one to obey, and wouldn't break any sane existing code.
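The "raise on conflicting instructions" idea could look roughly like this. This is a minimal sketch assuming a two-level `dataset.table` name; `resolve_dataset` is a hypothetical helper, not existing code in the dialect:

```python
# Hypothetical sketch of "keep parsing for compatibility, but raise on
# conflict" rather than silently picking a winner.
def resolve_dataset(table_name, schema=None):
    """Return (dataset, table), raising when the table name embeds a
    dataset that disagrees with an explicitly provided schema."""
    dataset_from_name, _, table = table_name.rpartition(".")
    dataset_from_name = dataset_from_name or None
    if dataset_from_name and schema and dataset_from_name != schema:
        raise ValueError(
            "Conflicting datasets: table name specifies %r but schema "
            "argument specifies %r" % (dataset_from_name, schema)
        )
    return dataset_from_name or schema, table
```

Existing code that passes a consistent dataset in either place keeps working; only genuinely contradictory calls start failing.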
# Table name shouldn't include dataset
with pytest.raises(Exception):
    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
It would probably be better to check that the expected result is returned.
Could we add an assertion on some expected property of the Table? In this example, we want to make sure that the dataset ID associated with the table == "test_pybigquery" and != whatever the default dataset ID is, right?
@@ -199,9 +199,7 @@ def test_engine_with_dataset(engine_using_test_dataset):
    rows = table_one_row.select().execute().fetchall()
    assert list(rows[0]) == ONE_ROW_CONTENTS_EXPANDED

    # Table name shouldn't include dataset
    with pytest.raises(Exception):
        table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
This is another change in behavior, however seems to me that this is just another bug fix.
That's correct. The lookup code is the same between Table() and has_table(), so the behavior changed in both places. I agree that it is also a bug fix.
On this point, I'd argue we should parse the schema and accept input in the form of
I think @tswast's or @mxmzdlv's opinions should be heard on the two issues.
Hi @tswast! Could you please review and let me know what you think about the current PR and the potential changes laid out? This change resolves an issue folks are seeing in https://github.com/bluelabsio/records-mover and I'd very much like to see a release with the fix inside.
Given that the SQLAlchemy and SQL connector community in general defines "schema" as the dataset/database name, I think it's most natural for BigQuery to parse and use the project ID if it is provided as part of the schema.
P.S. I have a really hard time with the term "schema", as I'm not used to it describing the dataset/database name. I'm much more used to how BigQuery uses that term to define the actual column types.
I concur that we don't want to break backwards compatibility for this change. Let's continue to attempt to parse the table name. If conflicting instructions are provided, let's throw a
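Parsing a project ID out of the schema argument, as suggested above, could be sketched as follows. The `parse_schema` helper and its default-project fallback are assumptions for illustration, not the library's API:

```python
# Hypothetical sketch: interpret the schema argument as either
# 'project.dataset' or plain 'dataset', falling back to the
# connection's default project when none is embedded.
def parse_schema(schema, default_project):
    """Return a (project, dataset) pair from a schema argument."""
    if schema is None:
        return default_project, None
    project, sep, dataset = schema.rpartition(".")
    if sep:
        return project, dataset
    return default_project, schema
```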
Honestly, I'd have merged this already if the test had been updated.
@tswast: The test update is made. If you're comfortable with the PR at this point without the additional scope, I'd prefer to get it merged and make a release.
Released in 0.4.15 https://pypi.org/project/pybigquery/0.4.15/
Fixes #51