Prefer explicitly provided dataset over default dataset in lookup #53

Merged: 2 commits, Apr 23, 2020
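In practical terms, the change means that a table name which already carries a dataset qualifier is looked up in that dataset, even when the engine was created with a different default dataset; previously the default dataset won, the whole qualified string was treated as a table name inside it, and the lookup failed. A hedged before/after sketch based on the updated tests below (the project id in the connection URL is a placeholder):

from sqlalchemy import create_engine, MetaData, Table

# Engine whose default dataset is test_pybigquery (placeholder project id).
engine_using_test_dataset = create_engine('bigquery://my-project/test_pybigquery')

# Before this PR, reflecting a dataset-qualified name on such an engine raised,
# because 'test_pybigquery.sample_one_row' was treated as a table id inside the
# default dataset. After this PR, the explicit qualifier wins and reflection works.
table_one_row = Table('test_pybigquery.sample_one_row',
                      MetaData(bind=engine_using_test_dataset), autoload=True)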
pybigquery/sqlalchemy_bigquery.py (19 changes: 8 additions, 11 deletions)
@@ -356,7 +356,9 @@ def _split_table_name(self, full_table_name):
         project = None
 
         table_name_split = full_table_name.split('.')
-        if len(table_name_split) == 2:
+        if len(table_name_split) == 1:
+            table_name = full_table_name
+        elif len(table_name_split) == 2:
             dataset, table_name = table_name_split
         elif len(table_name_split) == 3:
             project, dataset, table_name = table_name_split
@@ -367,17 +369,12 @@ def _get_table(self, connection, table_name, schema=None):
         if isinstance(connection, Engine):
             connection = connection.connect()
 
-        table_name_prepared = dataset = project = None
-        if self.dataset_id:
-            table_name_prepared = table_name
-            dataset = self.dataset_id
-            project = None
-
-        else:
-            project, dataset, table_name_prepared = self._split_table_name(table_name)
-            if dataset is None and schema is not None:
-                table_name_prepared = table_name
+        project, dataset, table_name_prepared = self._split_table_name(table_name)
+        if dataset is None:
+            if schema is not None:

Contributor:
Is precedence for the 'table name dataset' over the schema the behavior in other SQLAlchemy dialects?

Contributor Author:
Unlike BigQuery, Postgres has a two-level hierarchy (schema and table).

Postgres doesn't seem to do any interpretation of schema from the tablename parameter: https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/base.py#L2599-L2635

Same situation with MySQL:

https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/mysql/base.py#L2458-L2491

Arguably BigQuery is a special situation given its three-level hierarchy (project/dataset/table). Other common databases let you run different database instances on the same server and with the same binaries, but BigQuery allows all relevant project_ids to be reached via the same connection and joined with each other, so the project/dataset pair is more like a two-level schema name than anything else.

Contributor:
Makes sense.
What do you think the behavior should be for BQ?
Raising an error in that case sounds legitimate to me; the problem is that we would be changing a fundamental behavior.
Anyway, this should probably be documented.

Contributor Author:
Ideally, I think the parsing of the table name should be dropped entirely and we should only parse the schema. But given the table name is already parsed, it would probably be rude to stop doing so, as there are likely code bases that rely on it.

That said, if we maintain compatibility there, I do think raising an exception when the user gives conflicting instructions is safer than arbitrarily choosing which one to obey, and it wouldn't break any sane existing code.

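A hedged sketch of the guard floated in this thread (hypothetical, and not part of the merged diff): if the table name embeds a dataset and a conflicting schema argument is also supplied, raise rather than silently preferring one of them.

# Hypothetical check, written against the same variables _get_table uses;
# it is NOT part of this PR.
if dataset is not None and schema is not None and dataset != schema:
    raise ValueError(
        "Conflicting dataset: table name specifies %r but schema argument is %r"
        % (dataset, schema)
    )

The merged change, which continues below, keeps the simpler fallback chain and does not raise on a conflict.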
                 dataset = schema
+            elif self.dataset_id:
+                dataset = self.dataset_id
 
         table = connection.connection._client.dataset(dataset, project=project).table(table_name_prepared)
         try:
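Taken together, the two hunks make the lookup precedence: dataset embedded in the table name, then the schema argument, then the engine's default dataset. A minimal standalone sketch of that resolution order (illustrative only; the dialect keeps this logic inside _split_table_name and _get_table, and the helper name here is invented):

def resolve_table_reference(full_table_name, schema=None, default_dataset=None):
    # Split an optionally qualified BigQuery name into (project, dataset, table).
    parts = full_table_name.split('.')
    project = dataset = None
    if len(parts) == 1:
        table = parts[0]
    elif len(parts) == 2:
        dataset, table = parts
    elif len(parts) == 3:
        project, dataset, table = parts
    else:
        raise ValueError('Unexpected table name: %r' % full_table_name)
    # Fall back only when the name itself did not specify a dataset.
    if dataset is None:
        if schema is not None:
            dataset = schema
        elif default_dataset is not None:
            dataset = default_dataset
    return project, dataset, table

For example, resolve_table_reference('test_pybigquery.sample', default_dataset='other_dataset') returns (None, 'test_pybigquery', 'sample'), which is the behavior the updated tests below assert through Table() reflection and has_table().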
scripts/load_test_data.sh (3 additions, 0 deletions)
@@ -1,12 +1,15 @@
 bq mk test_pybigquery
 bq mk --data_location=asia-northeast1 test_pybigquery_location
+bq mk test_pybigquery_alt
 
 bq rm -f -t test_pybigquery.sample
+bq rm -f -t test_pybigquery_alt.sample_alt
 bq rm -f -t test_pybigquery.sample_one_row
 bq rm -f -t test_pybigquery.sample_dml
 bq rm -f -t test_pybigquery_location.sample_one_row
 
 bq mk --table --schema=$(dirname $0)/schema.json --time_partitioning_field timestamp --clustering_fields integer,string test_pybigquery.sample
+bq mk --table --schema=$(dirname $0)/schema.json --time_partitioning_field timestamp --clustering_fields integer,string test_pybigquery_alt.sample_alt
 bq load --source_format=NEWLINE_DELIMITED_JSON --schema=$(dirname $0)/schema.json test_pybigquery.sample $(dirname $0)/sample.json
 
 bq mk --table --schema=$(dirname $0)/schema.json --time_partitioning_field timestamp --clustering_fields integer,string test_pybigquery.sample_one_row
test/test_sqlalchemy_bigquery.py (21 changes: 16 additions, 5 deletions)
@@ -199,9 +199,12 @@ def test_engine_with_dataset(engine_using_test_dataset):
     rows = table_one_row.select().execute().fetchall()
     assert list(rows[0]) == ONE_ROW_CONTENTS_EXPANDED
 
-    # Table name shouldn't include dataset
-    with pytest.raises(Exception):
-        table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)

Contributor:
This is another change in behavior; however, it seems to me that this is just another bug fix.

Contributor Author:
That's correct. The lookup code is the same between Table() and has_table(), so the behavior changed in both places. I agree that it is also a bug fix.

+    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)

Contributor:
It would probably be better to check that the expected result is returned.

Collaborator:
Could we add an assertion on some expected property of the Table? In this example, we want to make sure that the dataset ID associated with the table == "test_pybigquery" and != whatever the default dataset ID is, right?

+    rows = table_one_row.select().execute().fetchall()
+    # verify that we are pulling from the specifically-named dataset,
+    # instead of pulling from the default dataset of the engine (which
+    # does not have this table at all)
+    assert list(rows[0]) == ONE_ROW_CONTENTS_EXPANDED
 
 
 def test_dataset_location(engine_with_location):
@@ -478,6 +481,14 @@ def test_has_table(engine, engine_using_test_dataset):
     assert engine.has_table('sample', 'test_pybigquery') is True
     assert engine.has_table('test_pybigquery.sample') is True
 
+    assert engine.has_table('sample_alt', 'test_pybigquery_alt') is True
+    assert engine.has_table('test_pybigquery_alt.sample_alt') is True
+
     assert engine_using_test_dataset.has_table('sample') is True
-    with pytest.raises(Exception):
-        assert engine_using_test_dataset.has_table('test_pybigquery.sample') is True
+    assert engine_using_test_dataset.has_table('sample', 'test_pybigquery') is True
+    assert engine_using_test_dataset.has_table('test_pybigquery.sample') is True
+
+    assert engine_using_test_dataset.has_table('sample_alt') is False
+
+    assert engine_using_test_dataset.has_table('sample_alt', 'test_pybigquery_alt') is True
+    assert engine_using_test_dataset.has_table('test_pybigquery_alt.sample_alt') is True
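A hedged usage sketch of what these assertions exercise, using the datasets created by load_test_data.sh (the project id in the URL is a placeholder):

from sqlalchemy import create_engine

# Engine whose default dataset is test_pybigquery (placeholder project id).
engine_using_test_dataset = create_engine('bigquery://my-project/test_pybigquery')

# The explicit dataset now takes precedence over the engine's default dataset,
# whether it is passed as the schema argument or embedded in the table name.
engine_using_test_dataset.has_table('sample_alt', 'test_pybigquery_alt')  # True
engine_using_test_dataset.has_table('test_pybigquery_alt.sample_alt')     # True

# An unqualified name still resolves against the default dataset.
engine_using_test_dataset.has_table('sample_alt')                         # False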

Contributor Author:
I ran the full test suite with these changes and it passed.