
Prefer explicitly provided dataset over default dataset in lookup #53

Merged
merged 2 commits on Apr 23, 2020

Conversation

vinceatbluelabs
Contributor

Fixes #51

@vinceatbluelabs
Contributor Author

Two other cases that I came across that aren't addressed in either this PR or the current code:

  • What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)
  • What should the behavior be if the dataset provided in the schema argument conflicts with the dataset embedded in the table argument? (both master and this PR will use the one in the table name in preference to the schema argument)

If you have strong feelings on changing those two behaviors, I can take a crack at addressing that in this PR.

assert engine_using_test_dataset.has_table('sample_alt') is False

assert engine_using_test_dataset.has_table('sample_alt', 'test_pybigquery_alt') is True
assert engine_using_test_dataset.has_table('test_pybigquery_alt.sample_alt') is True
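The assertions above exercise the resolution rule this PR is about: an explicitly provided dataset, whether passed as the schema argument or embedded in the table name, should win over the connection's default dataset. A minimal standalone sketch of that rule (the function name and signature here are illustrative, not pybigquery's API):

```python
def resolve_dataset(table_name, schema=None, default_dataset=None):
    """Illustrative sketch: pick the dataset used for a table lookup.

    Precedence: dataset embedded in the table name, then the explicit
    schema argument, then the connection's default dataset.
    """
    if '.' in table_name:
        dataset, bare_name = table_name.split('.', 1)
        return dataset, bare_name
    if schema is not None:
        return schema, table_name
    return default_dataset, table_name
```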
Contributor Author

I ran the full test suite with these changes and it passed.

Contributor

@alonme alonme left a comment

I think that the behavior for both cases should be as similar as possible to other SQLAlchemy dialects.
Can you check for the behavior in one of the popular dialects?

table_name_prepared = table_name
project, dataset, table_name_prepared = self._split_table_name(table_name)
if dataset is None:
    if schema is not None:
Contributor

Is precedence for the 'table name dataset' over the schema the behavior in other SQLAlchemy dialects?

Contributor Author

Unlike BigQuery, Postgres has a two level hierarchy (schema and table).

Postgres doesn't seem to do any interpretation of the schema from the table_name parameter: https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/base.py#L2599-L2635

Same situation with MySQL:

https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/mysql/base.py#L2458-L2491

Arguably BigQuery is a special situation given the three level hierarchy (project/dataset/table). Other common databases let you run different database instances on the same server and with the same binaries, but BigQuery allows all relevant project_ids to be reached via the same connection and joined with each other, so it's more like a two-level schema name than anything else.

Contributor

Makes sense.
What do you think should be the behavior for BQ?
Raising an error in that case sounds legit to me; the problem is that we would be changing a fundamental behavior.
Anyway, this should probably be documented.

Contributor Author

I mean, ideally I think the parsing on the table name should be dropped entirely, and we should only parse the schema. Given you already parse on the table name, it's probably rude to stop doing so, as there are probably code bases that rely on it.

That said, if we maintain compatibility there, I do think raising an exception when the user gives you conflicting instructions is safer than arbitrarily choosing which one to obey, and wouldn't break any sane existing code.

# Table name shouldn't include dataset
with pytest.raises(Exception):
    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
Contributor

Probably better to check that the expected result is returned.

Collaborator

Could we add an assertion on some expected property of the Table? In this example, we want to make sure that the dataset ID associated with the table == "test_pybigquery" and != whatever the default dataset ID is, right?

@@ -199,9 +199,7 @@ def test_engine_with_dataset(engine_using_test_dataset):
rows = table_one_row.select().execute().fetchall()
assert list(rows[0]) == ONE_ROW_CONTENTS_EXPANDED

# Table name shouldn't include dataset
with pytest.raises(Exception):
    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
Contributor

This is another change in behavior, however seems to me that this is just another bug fix.

Contributor Author

That's correct. The lookup code is the same between Table() and has_table(), so the behavior changed in both places. I agree that it is also a bug fix.

@vinceatbluelabs
Contributor Author

What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)

On this point, I'd argue we should parse the schema and accept input in the form of project.dataset.

@alonme
Contributor

alonme commented Apr 16, 2020

@tswast @mxmzdlv - any opinion on the raised issues here?

@mxmzdlv
Contributor

mxmzdlv commented Apr 17, 2020

@tswast @mxmzdlv - any opinion on the raised issues here?

Heya, thanks for looking into this. I will review this as soon as possible, but unfortunately can't at the moment. If @tswast thinks this is good to merge, and it is passing tests, then I am all for it.

@vinceatbluelabs
Contributor Author

@tswast @alonme: Outside of the test update, what changes would you like me to make?

@vinceatbluelabs vinceatbluelabs requested a review from tswast April 20, 2020 17:00
@alonme
Contributor

alonme commented Apr 20, 2020

I think tswast's or mxmzdlv's opinions should be heard on the two issues:

What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)

On this point, I'd argue we should parse the schema and accept input in the form of project.dataset.

I mean, ideally I think the parsing on the table name should be dropped entirely, and we should only parse the schema. Given you already parse on the table name, it's probably rude to stop doing so, as there are probably code bases that rely on it.

That said, if we maintain compatibility there, I do think raising an exception when the user gives you conflicting instructions is safer than arbitrarily choosing which one to obey, and wouldn't break any sane existing code.

@vinceatbluelabs
Contributor Author

Hi @tswast! Could you please review and let me know what you think about the current PR and the potential changes laid out? This change resolves an issue folks are seeing in https://github.com/bluelabsio/records-mover and I'd very much like to see a release with the fix inside.

@tswast
Collaborator

tswast commented Apr 22, 2020

  1. What should the behavior be if the schema argument provides both the project and the dataset?

Given the SQLAlchemy and SQL connector community in general defines schema as the dataset/database name, I think it's most natural for BigQuery to parse and use the project ID if provided as part of the schema.

P.S. I have a really hard time with the term "schema" as I'm not used to it being used to describe the dataset / database name. I'm much more used to how BigQuery uses that term to define the actual column types.

  1. parsing on the table name should be dropped entirely

I concur that we don't want to break backwards compatibility for this change. Let's continue to attempt to parse the table name.

If conflicting instructions are provided, let's throw a ValueError("Only one of schema or table should supply a dataset ID. New code should prefer passing dataset via schema.")
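A minimal sketch of that suggestion (the function name and the exact raise condition are assumptions, not the merged implementation):

```python
def check_dataset_conflict(dataset_from_table, dataset_from_schema):
    # Hypothetical sketch: raise if the table name and the schema argument
    # supply different dataset IDs; otherwise return whichever was given.
    if (dataset_from_table is not None
            and dataset_from_schema is not None
            and dataset_from_table != dataset_from_schema):
        raise ValueError(
            "Only one of schema or table should supply a dataset ID. "
            "New code should prefer passing dataset via schema.")
    return dataset_from_table or dataset_from_schema
```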

Outside of the test update, what changes would you like me to make?

Honestly, I'd have merged this already if the test was updated.

@vinceatbluelabs
Contributor Author

@tswast: The test update is made. If you're comfortable with the PR at this point without the additional scope, I'd prefer to get it merged and make a release.

@tswast tswast merged commit 61f810c into googleapis:master Apr 23, 2020
@tswast
Collaborator

tswast commented Apr 23, 2020

Released in 0.4.15 https://pypi.org/project/pybigquery/0.4.15/

Successfully merging this pull request may close these issues.

has_table() ignores provided dataset if default dataset is set