-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert Location to a namedtuple, and associated cleanup #205
Conversation
This change makes the `Location` dataclass, which does not change frequently, into a new `SourceLocation` namedtuple, and changes the `SourceLocation` serialization. As a result, with this change: * `embossc` runs about 25% faster on a large (7kLOC) input; `python3 -OO emboss` runs about 19% faster on the same input. * Serialized IR is about 45% smaller. Details: * Replace the `ir_data.Location` dataclass with a new `parser_types.SourceLocation` namedtuple. The rename helps clarify the difference between a location within source code (`SourceLocation`) and a location within a structure (`FieldLocation`). * Similarly, replace `ir_data.Position` with `parser_types.SourcePosition`. * Update any place that edits a `SourceLocation` with an appropriate assignment; e.g., `x.source_location.end = y` becomes `x.source_location = x.source_location._replace(end=y)`. In most cases, several fields were updated consecutively; those updates are been merged. * Update the JSON serialization to use the compact format. * Replace `format_location()` and `format_position()` with `__str__()` methods on `SourceLocation` and `SourcePosition`, respectively. * Replace `parse_location()` and `parse_position()` with `from_str()` class methods on `SourceLocation` and `SourcePosition`, respectively. * Move the `make_location()` functionality into `SourceLocation.__new__()`. * Update `_to_dict` and `_from_dict` in `IrDataSerializer` to stringify and destringify `SourceLocation`. It is tempting to try to do this during the JSON serialization step (with a `default=` parameter to `json.dumps` and an `object_hook=` parameter to `json.loads`), but it is tricky to get the `object_hook` to know when to convert.
The +/- on this PR shows a lot of lines removed, but they're mostly in IR fragments -- the actual code changes are roughly a wash in LOC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for doing this. A few questions around handing of default values
else: | ||
result.source_location = ir_data_utils.copy(parse_tree.source_location) | ||
result.source_location = parse_tree.source_location |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard for me to tell if we need a copy here. I guess b/c it's a tuple the data is frozen so we don't? Overall I wonder if modifying ir_data_utils.update
and ir_data_utils.copy
to special case namedtuple
would make sense. The overhead might not be worth it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, more or less the point of this change is to make it so that it is never necessary to deep copy SourceLocation
-- it's frozen, so it doesn't matter if multiple nodes point to the same SourceLocation
instance. About half of the time in _copy()
was just copying Location
.
It would be (relatively) cheap to special case update
and copy
to check if their arguments are tuple
(specifically checking for anything created by namedtuple()
is harder), but I don't really see the point? Right now, they just raise an exception if you give them something that isn't an IR dataclass, so any mistakes will be caught quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it's fine as-is, it just took me a second to grok why we weren't using the standard methods for this anymore.
compiler/front_end/module_ir_test.py
Outdated
result.append("{}.end missing".format(path)) | ||
else: | ||
end = source_location.end | ||
|
||
if start and end: | ||
if start.HasField("line") and end.HasField("line"): | ||
if start.line and end.line: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is checking that start.line
!= (0,0) and end.line != (0,0)
, is that actually what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tightened up the code in here, mostly to take advantage of the tuple
ordering. I think that the altered version addresses all of your comments.
This function is used to check the source_location
invariants after module_ir.build_ir()
is done -- basically, every node with a source_location
field must have a source location, the source location must be inside the parent's source location, and both line
and column
must be >= 1
.
I also added a number of tests to parser_types_test
to cover the internal invariants in SourcePosition
and SourceLocation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, so for a source location to be valid it must not be: (0,0), (0,1), (1,0) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in here, yes. (0, 0)
is "valid" in the sense that you might have a node that didn't "come from" something in the source text, but module_ir
shouldn't be creating any of those.
I just updated SourcePosition.__new__
to assert
-fail on (0, x) and (x, 0), and SourceLocation.__new__
to assert
-fail if start and not end
or end and not start
, and removed the checks here for only line
or column
being 0
.
compiler/front_end/module_ir_test.py
Outdated
and end.HasField("column") | ||
and start.column > end.column | ||
): | ||
if start.column and end.column and start.column > end.column: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we shouldn't have the start.column and end.column
portion here, they'll always be present, this is just checking that they're not zero now right?
compiler/front_end/module_ir_test.py
Outdated
result.append("{}.start missing".format(path)) | ||
else: | ||
start = source_location.start | ||
if not source_location.HasField("end"): | ||
if not source_location.end: | ||
result.append("{}.end missing".format(path)) | ||
else: | ||
end = source_location.end | ||
|
||
if start and end: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is checking what we want now? Maybe is not None
is more corect.
compiler/front_end/module_ir_test.py
Outdated
@@ -4057,12 +3984,12 @@ def _check_source_location(source_location, path, min_start, max_end): | |||
for name, field in (("start", start), ("end", end)): | |||
if not field: | |||
continue | |||
if field.HasField("line"): | |||
if field.line: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and again
compiler/front_end/module_ir_test.py
Outdated
if field.line <= 0: | ||
result.append("{}.{}.line <= 0 ({})".format(path, name, field.line)) | ||
else: | ||
result.append("{}.{}.line missing".format(path, name)) | ||
if field.HasField("column"): | ||
if field.column: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm as long as (0,0)
is always invalid things make sense to me.
It's invalid in the sense that it does not refer to an actual location (using 1-based numbering, not 0-based, because that's how lines and columns are customarily numbered). I added some explanation to the docstring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
This change makes the
Location
dataclass, which does not changefrequently, into a new
SourceLocation
namedtuple, and changes theSourceLocation
serialization. As a result, with this change:embossc
runs about 25% faster on a large (7kLOC) input;python3 -OO embossc
runs about 19% faster on the same input.Details:
ir_data.Location
dataclass with a newparser_types.SourceLocation
namedtuple. The rename helps clarifythe difference between a location within source code
(
SourceLocation
) and a location within a structure(
FieldLocation
).ir_data.Position
withparser_types.SourcePosition
.SourceLocation
with an appropriateassignment; e.g.,
x.source_location.end = y
becomesx.source_location = x.source_location._replace(end=y)
. In mostcases, several fields were updated consecutively; those updates are
been merged.
format_location()
andformat_position()
with__str__()
methods onSourceLocation
andSourcePosition
,respectively.
parse_location()
andparse_position()
withfrom_str()
class methods on
SourceLocation
andSourcePosition
,respectively.
make_location()
functionality intoSourceLocation.__new__()
._to_dict
and_from_dict
inIrDataSerializer
tostringify and destringify
SourceLocation
. It is tempting totry to do this during the JSON serialization step (with a
default=
parameter to
json.dumps
and anobject_hook=
parameter tojson.loads
), but it is tricky to get theobject_hook
to knowwhen to convert.
source_location
s intomerge_source_locations()
.