@ncclementi
Contributor

@ncclementi ncclementi commented Dec 14, 2021

Replace use of pandas-gbq with pure BigQuery

@ncclementi ncclementi marked this pull request as ready for review December 14, 2021 22:01
{
    "name": random.choice(["fred", "wilma", "barney", "betty"]),
    "number": random.randint(0, 100),
    "timestamp": datetime.now(timezone.utc) - timedelta(days=i % 2),
Contributor Author

Before adding timezone.utc I was getting this assertion error:

E           AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="timestamp") are different
E           
E           Attribute "dtype" are different
E           [left]:  datetime64[ns, UTC]
E           [right]: datetime64[ns]

It seems that when the data is read back from BigQuery, the timestamps are automatically converted to UTC unless otherwise specified, which causes the error.
@tswast can you confirm this is the case? Any comments?
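
For illustration, the change to the test rows comes down to naive versus timezone-aware datetimes. The sketch below uses only the standard library and is not part of the PR. pandas maps naive values to datetime64[ns] and UTC-aware values to datetime64[ns, UTC], which is the mismatch reported in the assertion error above. The variable names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Naive datetime: no tzinfo attached (pandas stores these as datetime64[ns]).
naive = datetime.now()

# Aware datetime with explicit UTC, built the same way as the updated test rows
# (pandas stores these as datetime64[ns, UTC]).
aware = datetime.now(timezone.utc) - timedelta(days=1)

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC

# Python refuses to order a naive value against an aware one,
# which is the same kind of mismatch the dtype assertion surfaced.
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

print(comparable)  # False
```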

TIMESTAMP columns are intended to come back as datetime64[ns, UTC], yes.

DATETIME should come back as datetime64[ns].

See my answer here on the difference between the two: https://stackoverflow.com/a/47724366/101923

Also note: both will come back as object dtype if there's a date outside of the pandas representable range, e.g. 0001-01-01 or 9999-12-31.
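
As a rough sketch of why dates such as 0001-01-01 or 9999-12-31 fall outside the pandas representable range: datetime64[ns] stores a signed 64-bit count of nanoseconds since the Unix epoch, which spans roughly 292 years on either side of 1970. The helper `fits_in_datetime64_ns` below is hypothetical, uses only the standard library, and computes the bounds to microsecond precision, so it is an approximation rather than the exact pandas limits.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX_NS = 2**63 - 1  # largest nanosecond offset a signed 64-bit integer can hold

# Python datetimes only have microsecond precision, so truncate ns to microseconds.
SPAN = timedelta(microseconds=INT64_MAX_NS // 1000)
LOWER = EPOCH - SPAN  # approximate lower bound, in the year 1677
UPPER = EPOCH + SPAN  # approximate upper bound, in the year 2262


def fits_in_datetime64_ns(value: datetime) -> bool:
    """Return True if a naive datetime lies inside the approximate ns range."""
    return LOWER <= value <= UPPER


print(fits_in_datetime64_ns(datetime(2021, 12, 14)))  # True
print(fits_in_datetime64_ns(datetime(1, 1, 1)))       # False
print(fits_in_datetime64_ns(datetime(9999, 12, 31)))  # False
```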

I'm actually working on making the pandas-gbq dtypes consistent with google-cloud-bigquery as we speak in googleapis/python-bigquery-pandas#444

Contributor Author

It seems that if I don't provide a schema, BigQuery infers that the dataframe column named "timestamp" is a TIMESTAMP column, so it comes back as datetime64[ns, UTC]. That being said, to keep the test simple I think we can make the local dataframe timezone-aware and test that it comes back as it should.

cc: @jrbourbeau Does this convince you? If so this PR is ready for review.

Contributor

Sounds good 👍

@ncclementi
Contributor Author

It looks like pyarrow can't be found on macOS and the tests are failing. I've never seen anything like this. @jrbourbeau Do you know what could be happening?

Contributor

@jrbourbeau jrbourbeau left a comment

Thanks for fixing @ncclementi and reviewing @tswast! This is in

@jrbourbeau
Contributor

Forgot to mention the macOS CI failure. This looks like a totally unrelated packaging issue (we're seeing similar things over in Dask's CI). I'm not currently able to reproduce it locally -- let me rerun CI to see if the issue has already been resolved.

@jrbourbeau
Contributor

Hmm, unfortunately the macOS environment issue is still around. I'm highly confident this is unrelated to the changes in this PR (similar failures are being reported in distributed, see dask/distributed#5601). Let's go ahead and merge this in for now, and we can address any follow-ups in subsequent PRs. FWIW, I've opened conda-forge/conda-forge.github.io#1574 with the conda-forge maintainers.

@jrbourbeau jrbourbeau merged commit c372694 into main Dec 16, 2021
@jrbourbeau jrbourbeau deleted the remove_pd_gbq branch December 16, 2021 02:45
Development

Successfully merging this pull request may close these issues.

pandas-gbq 0.16 release broke dask-bigquery CI

4 participants