
Prefer explicitly provided dataset over default dataset in lookup #53

Merged
merged 2 commits on Apr 23, 2020

Conversation

vinceatbluelabs
Contributor

Fixes #51

@vinceatbluelabs
Contributor Author

Two other cases that I came across that aren't addressed in either this PR or the current code:

  • What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)
  • What should the behavior be if the dataset provided in the schema argument conflicts with the dataset embedded in the table argument? (both master and this PR will use the one in the table name in preference to the schema argument)

If you have strong feelings on changing those two behaviors, I can take a crack at addressing that in this PR.

assert engine_using_test_dataset.has_table('sample_alt') is False

assert engine_using_test_dataset.has_table('sample_alt', 'test_pybigquery_alt') is True
assert engine_using_test_dataset.has_table('test_pybigquery_alt.sample_alt') is True
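The assertions above exercise the resolution rule this PR is about: an explicitly provided dataset, whether passed as the schema argument or embedded in the table name, should win over the connection's default dataset. A minimal standalone sketch of that rule (the function name and signature here are illustrative, not pybigquery's API):

```python
def resolve_dataset(table_name, schema=None, default_dataset=None):
    """Illustrative sketch: pick the dataset used for a table lookup.

    Precedence: dataset embedded in the table name, then the explicit
    schema argument, then the connection's default dataset.
    """
    if '.' in table_name:
        dataset, bare_name = table_name.split('.', 1)
        return dataset, bare_name
    if schema is not None:
        return schema, table_name
    return default_dataset, table_name
```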
Contributor Author

I ran the full test suite with these changes and it passed.

Contributor

@alonme alonme left a comment

I think that the behavior for both cases should be as similar as possible to other SQLAlchemy dialects.
Can you check for the behavior in one of the popular dialects?

table_name_prepared = table_name
project, dataset, table_name_prepared = self._split_table_name(table_name)
if dataset is None:
    if schema is not None:
Contributor

Is precedence for the 'table name dataset' over the schema the behavior in other SQLAlchemy dialects?

Contributor Author

Unlike BigQuery, Postgres has a two level hierarchy (schema and table).

Postgres doesn't seem to do any interpretation of the schema from the table_name parameter: https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/base.py#L2599-L2635

Same situation with MySQL:

https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/mysql/base.py#L2458-L2491

Arguably BigQuery is a special situation given the three level hierarchy (project/dataset/table). Other common databases let you run different database instances on the same server and with the same binaries, but BigQuery allows all relevant project_ids to be reached via the same connection and joined with each other, so it's more like a two-level schema name than anything else.

Contributor

Makes sense.
What do you think should be the behavior for BQ?
Raising an error in that case sounds legit to me; the problem is that we would be changing a fundamental behavior.
Anyway, this should probably be documented.

Contributor Author

I mean, ideally I think the parsing on the table name should be dropped entirely, and we should only parse the schema. Given you already parse on the table name, it's probably rude to stop doing so, as there are probably code bases that rely on it.

That said, if we maintain compatibility there, I do think raising an exception when the user gives you conflicting instructions is safer than arbitrarily choosing which one to obey, and wouldn't break any sane existing code.

# Table name shouldn't include dataset
with pytest.raises(Exception):
    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
Contributor

Probably better to check that the expected result is returned.

Collaborator

Could we add an assertion on some expected property of the Table? In this example, we want to make sure that the dataset ID associated with the table == "test_pybigquery" and != whatever the default dataset ID is, right?

@@ -199,9 +199,7 @@ def test_engine_with_dataset(engine_using_test_dataset):
rows = table_one_row.select().execute().fetchall()
assert list(rows[0]) == ONE_ROW_CONTENTS_EXPANDED

# Table name shouldn't include dataset
with pytest.raises(Exception):
    table_one_row = Table('test_pybigquery.sample_one_row', MetaData(bind=engine_using_test_dataset), autoload=True)
Contributor

This is another change in behavior, however seems to me that this is just another bug fix.

Contributor Author

That's correct. The lookup code is the same between Table() and has_table(), so the behavior changed in both places. I agree that it is also a bug fix.

@vinceatbluelabs
Contributor Author

What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)

On this point, I'd argue we should parse the schema and accept input in the form of project.dataset.

@alonme
Contributor

alonme commented Apr 16, 2020

@tswast @mxmzdlv - any opinion on the raised issues here?

@mxmzdlv
Contributor

mxmzdlv commented Apr 17, 2020

@tswast @mxmzdlv - any opinion on the raised issues here?

Heya, thanks for looking into this. I will review this as soon as possible, but unfortunately can't at the moment. If @tswast thinks this is good to merge, and it is passing tests, then I am all for it.

@vinceatbluelabs
Contributor Author

@tswast @alonme: Outside of the test update, what changes would you like me to make?

@vinceatbluelabs vinceatbluelabs requested a review from tswast April 20, 2020 17:00
@alonme
Contributor

alonme commented Apr 20, 2020

I think tswast's or mxmzdlv's opinions should be heard on the two issues:

What should the behavior be if the schema argument provides both the project and the dataset? (neither master nor this PR tries to parse out a project from it)

On this point, I'd argue we should parse the schema and accept input in the form of project.dataset.

I mean, ideally I think the parsing on the table name should be dropped entirely, and we should only parse the schema. Given you already parse on the table name, it's probably rude to stop doing so, as there are probably code bases that rely on it.

That said, if we maintain compatibility there, I do think raising an exception when the user gives you conflicting instructions is safer than arbitrarily choosing which one to obey, and wouldn't break any sane existing code.

@vinceatbluelabs
Contributor Author

Hi @tswast! Could you please review and let me know what you think about the current PR and the potential changes laid out? This change resolves an issue folks are seeing in https://github.com/bluelabsio/records-mover and I'd very much like to see a release with the fix inside.

@tswast
Collaborator

tswast commented Apr 22, 2020

  1. What should the behavior be if the schema argument provides both the project and the dataset?

Given the SQLAlchemy and SQL connector community in general defines schema as the dataset/database name, I think it's most natural for BigQuery to parse and use the project ID if provided as part of the schema.

P.S. I have a really hard time with the term "schema" as I'm not used to it being used to describe the dataset / database name. I'm much more used to how BigQuery uses that term to define the actual column types.

  1. parsing on the table name should be dropped entirely

I concur that we don't want to break backwards compatibility for this change. Let's continue to attempt to parse the table name.

If conflicting instructions are provided, let's throw a ValueError("Only one of schema or table should supply a dataset ID. New code should prefer passing dataset via schema.")
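A minimal sketch of that suggestion (the function name and the exact raise condition are assumptions, not the merged implementation):

```python
def check_dataset_conflict(dataset_from_table, dataset_from_schema):
    # Hypothetical sketch: raise if the table name and the schema argument
    # supply different dataset IDs; otherwise return whichever was given.
    if (dataset_from_table is not None
            and dataset_from_schema is not None
            and dataset_from_table != dataset_from_schema):
        raise ValueError(
            "Only one of schema or table should supply a dataset ID. "
            "New code should prefer passing dataset via schema.")
    return dataset_from_table or dataset_from_schema
```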

Outside of the test update, what changes would you like me to make?

Honestly, I'd have merged this already if the test was updated.

@vinceatbluelabs
Contributor Author

@tswast: The test update is made. If you're comfortable with the PR at this point without the additional scope, I'd prefer to get it merged and make a release.

@tswast tswast merged commit 61f810c into googleapis:master Apr 23, 2020
@tswast
Collaborator

tswast commented Apr 23, 2020

Released in 0.4.15 https://pypi.org/project/pybigquery/0.4.15/

Successfully merging this pull request may close these issues.

has_table() ignores provided dataset if default dataset is set