Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: STRUCT and ARRAY support #318

Merged
merged 37 commits into from
Sep 9, 2021
Merged

feat: STRUCT and ARRAY support #318

merged 37 commits into from
Sep 9, 2021

Conversation

jimfulton
Copy link
Contributor

@jimfulton jimfulton commented Aug 30, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #293
Fixes #314
Fixed #233
Fixes #37
🦕

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Aug 30, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 30, 2021
@jimfulton
Copy link
Contributor Author

FTR, WRT superset, once I finally got it working :), it behaves the same with and without these changes.

BTW, we have logic that tries to unpack sub-structs, I think so that there would eventually be scalars for superset to work with.

If you have an array of structs, we still create columns for the fields of the struct in the array. This causes superset to error, because it has no way to get at structs in an array. We should probably not unpack structs in arrays.

@snippet-bot
Copy link

snippet-bot bot commented Sep 1, 2021

Here is the summary of changes.

You are about to add 10 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • Refresh this comment

@jimfulton jimfulton marked this pull request as ready for review September 2, 2021 13:30
@jimfulton jimfulton requested review from a team as code owners September 2, 2021 13:30
@jimfulton jimfulton requested a review from tmatsuo September 2, 2021 13:30
@jimfulton
Copy link
Contributor Author

jimfulton commented Sep 2, 2021

Some notes for reviewers:

  • Heart of change is _struct.py, which isn't large, but also isn't obvious. :( I cribbed from the built-in JSON and ARRAY types. When reviewing, it's probably helpful to look at those. The "Comparator" framework is confusing, in large part because the name doesn't make sense.

    Having said that, the core logic is in _setop_getitem, which is a hook used by the base class of STRUCTs comparator.

    The __getattr__ method just delegates to (the inherited) __getitem__.

  • This PR also has 2 other small changes:

    • Machinery for mapping BQ types to SQLAlchemy types has been factored into a separate _types module, both to avoid cluttering base more and to partially avoid circular imports. (There's still a circular import issue that isn't fixable without a bigger refactoring that I deemed unwarranted.)
    • Implementation of ARRAY indexing, which wasn't implemented. A number of my tests (see test__struct.py in both unit and system tests) used an example that has nested structs and arrays.

@tswast tswast self-requested a review September 7, 2021 20:03
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't quite digested everything in this PR yet, but I figured I'd share the feedback I have so far.

docs/struct.rst Show resolved Hide resolved
setup.py Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Just a few things I think we should clarify before merging.

sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
Comment on lines 73 to 79
global type_compiler

try:
process = type_compiler.process
except AttributeError:
type_compiler = base.dialect.type_compiler(base.dialect())
process = type_compiler.process
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put this in a _get_type_compiler / _get_process function? I don't see anywhere else we initialize type_compiler, but I'd be more comfortable having this logic closer to the # We have to delay getting the type compiler, because of circular imports. :( comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored so this is combined and isolated in one place using a new, better named _get_subtype_col_spec function.

type_compiler = base.dialect.type_compiler(base.dialect())
process = type_compiler.process

fields = ", ".join(f"{name} {process(type_)}" for name, type_ in self.__fields)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume process is able to handle nested arrays/structs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
f"STRUCT fields can only be accessed with strings field names,"
f" not {name}."
)
subtype = self.expr.type._STRUCT__byname.get(name.lower())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does _STRUCT__byname come from? I'm assuming somewhere from SQLAlchemy, but I'm not getting any results when searching for byname.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think I figured it out: https://docs.python.org/3/tutorial/classes.html#private-variables

Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped.

Can we comment about this? I assume we have to do it because we know self.expr.type is a STRUCT, but it's not self.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to make name mangling more explicit and consistent, so I don't think comments are needed anymore. See if you agree. :)

I mainly use "private" variables, which aren't :), to avoid namespace conflicts when subclassing across responsibility boundaries. Arguably, explicit naming is better.

return operator, index, subtype

def __getattr__(self, name):
if name.lower() in self.expr.type._STRUCT__byname:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused why self.__byname doesn't work in this case.

Edit: I see now that it's part of the Comparator class. Still probably worth a similar comment to the one I recommend in _setup_getitem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my response on name mangling

sqlalchemy_bigquery/_struct.py Show resolved Hide resolved
for f in field.fields
]
results += _get_transitive_schema_fields(sub_fields, cur_fields)
cur_fields.pop()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we pop these off, does that mean we don't get the top-level struct field, just the leaf fields? I suspect this might hide some ARRAY columns if a parent node has mode REPEATED, but is not included.

Edit: I see the top field is added to results on line 83. Might be worth a comment as to why we pop here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of cur_fields. It wasn't needed (anymore).

for field in fields:
results += [field]
if field.field_type == "RECORD":
cur_fields.append(field)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what cur_fields is doing. Is there a better name we can pick for this? Maybe it's referring to ancestors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, it's not doing anything.

This is based on

def _get_columns_helper(self, columns, cur_columns):
"""
Recurse into record type and return all the nested field names.
As contributed by @sumedhsakdeo on issue #17
"""
results = []
for col in columns:
results += [
SchemaField(
name=".".join(col.name for col in cur_columns + [col]),
field_type=col.field_type,
mode=col.mode,
description=col.description,
fields=col.fields,
)
]
if col.field_type == "RECORD":
cur_columns.append(col)
results += self._get_columns_helper(col.fields, cur_columns)
cur_columns.pop()
return results
, which I inherited.

I've refactored it quite a bit and failed to notice that this wasn't needed any more. Fixed.

(_col().NAME, "`t`.`person`.NAME"),
(_col().children, "`t`.`person`.children"),
(
_col().children[0].label("anon_1"), # SQLAlchemy doesn't add the label
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we file an issue for this to investigate later? If so, let's add TODO and link to the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

global _get_subtype_col_spec

type_compiler = base.dialect.type_compiler(base.dialect())
_get_subtype_col_spec = type_compiler.process
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy! I didn't realize a function could replace itself. I like it.

def visit_getitem_binary(self, binary, operator_, **kw):
left = self.process(binary.left, **kw)
right = self.process(binary.right, **kw)
return f"{left}[OFFSET({right})]"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAFE_OFFSET support would be nice too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
3 participants