-
-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make {eol_}comments_re read-only and non-init arguments in ParserConfig
#352
Make {eol_}comments_re read-only and non-init arguments in ParserConfig
#352
Conversation
The difficult part in these complex changes is that Plese redirect the pull request to the following branch, and I will test it against existing parsers. https://github.com/neogeny/TatSu/tree/drop_comments_re |
Oof, I didn't realize I was going to have to address lint errors in code I didn't touch. I'll have to patch those up. What are your thoughts on Lines 166 to 172 in 1b632e8
I ran the unit tests the other day and they're never referenced. I wonder if we should be generating some coverage data to show how comprehensive the test suite is. Anyway I'll fix this up rq and change targets |
Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: neogeny#338 [1]: neogeny#351 (comment)
157937b
to
f922dad
Compare
Make the default eol_comments regex use multiline matching. Recent changes to `ParserConfig` [0] now use a precompiled regex (an `re.Pattern`) instead of compiling the `str` regex on the fly. The `Tokenizer` previously assumed `str` type regexes should all be `re.MULTILINE` regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern. [0]: neogeny#338
Previously, the `eol_comments_re` and `comments_re` attributes were public init arguments, were modifiable, and could thus become out of sync with the `eol_comments` and `comments` attributes. Also, with recent changes to `ParserConfig` [0], there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on `__post_init__` to compile the values which trumped the explicit *_re argument values. Now, the constructor interface has been simplified to not take either `eol_comments_re` or `comments_re` as arguments. Callers may only use `eol_comments` and `comments`. The `eol_comments_re` and `comments_re` attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor. [0]: neogeny#200
f922dad
to
fdad793
Compare
ParserConfig
updated the description and commits |
@apalala it sounded like you wanted the attributes dropped completely which i did not do. I have a new branch which does do this and all tests pass. I can push that if you would prefer. Dropping a diff here just in case: (venv) vfazio@Zephyrus:~/development/TatSu$ git diff vfazio-keep_comments_re vfazio-drop-comment-res
diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 5a2a91f..f2a2e77 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -49,6 +49,8 @@ class Buffer(Tokenizer):
self.text = self.original_text = text
self.whitespace_re = self.build_whitespace_re(config.whitespace)
+ self.comments_re = None if not config.comments else re.compile(config.comments)
+ self.eol_comments_re = None if not config.eol_comments else re.compile(config.eol_comments)
self.nameguard = (
config.nameguard
if config.nameguard is not None
@@ -269,11 +271,11 @@ class Buffer(Tokenizer):
return self._eat_regex(self.whitespace_re)
def eat_comments(self):
- comments = self._eat_regex_list(self.config.comments_re)
+ comments = self._eat_regex_list(self.comments_re)
self._index_comments(comments, lambda x: x.inline)
def eat_eol_comments(self):
- comments = self._eat_regex_list(self.config.eol_comments_re)
+ comments = self._eat_regex_list(self.eol_comments_re)
self._index_comments(comments, lambda x: x.eol)
def next_token(self):
diff --git a/tatsu/contexts.py b/tatsu/contexts.py
index 12873fd..425fdbf 100644
--- a/tatsu/contexts.py
+++ b/tatsu/contexts.py
@@ -163,14 +163,6 @@ class ParseContext:
def trace_filename(self):
return self.active_config.trace_filename
- @property
- def comments_re(self):
- return self.active_config.comments_re
-
- @property
- def eol_comments_re(self):
- return self.active_config.eol_comments_re
-
@property
def whitespace(self):
return self.active_config.whitespace
diff --git a/tatsu/infos.py b/tatsu/infos.py
index 3bba14f..5bc2100 100644
--- a/tatsu/infos.py
+++ b/tatsu/infos.py
@@ -2,7 +2,6 @@ from __future__ import annotations
import copy
import dataclasses
-import re
from collections.abc import Callable, MutableMapping
from itertools import starmap
from typing import Any, NamedTuple
@@ -30,9 +29,6 @@ class ParserConfig:
start_rule: str | None = None # FIXME
rule_name: str | None = None # Backward compatibility
- _comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
- _eol_comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
-
tokenizercls: type[Tokenizer] | None = None # FIXME
semantics: type | None = None
@@ -63,18 +59,6 @@ class ParserConfig:
def __post_init__(self): # pylint: disable=W0235
if self.ignorecase:
self.keywords = [k.upper() for k in self.keywords]
- if self.comments:
- self._comments_re = re.compile(self.comments)
- if self.eol_comments:
- self._eol_comments_re = re.compile(self.eol_comments)
-
- @property
- def comments_re(self) -> re.Pattern | None:
- return self._comments_re
-
- @property
- def eol_comments_re(self) -> re.Pattern | None:
- return self._eol_comments_re
@classmethod
def new(
@@ -109,20 +93,8 @@ class ParserConfig:
else:
return self.replace(**vars(other))
- # non-init fields cannot be used as arguments in `replace`, however
- # they are values returned by `vars` and `dataclass.asdict` so they
- # must be filtered out.
- # If the `ParserConfig` dataclass drops these fields, then this filter can be removed
- def _filter_non_init_fields(self, settings: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
- for field in [
- field.name for field in dataclasses.fields(self) if not field.init
- ]:
- if field in settings:
- del settings[field]
- return settings
-
def replace(self, **settings: Any) -> ParserConfig:
- overrides = self._filter_non_init_fields(self._find_common(**settings))
+ overrides = self._find_common(**settings)
result = dataclasses.replace(self, **overrides)
if 'grammar' in overrides:
result.name = result.grammar |
Dropping them completely from |
Those are all interesting ideas. We can take it one step at a time. I'm merging this pull request right now to start the testing against existing parsers. The final test is re-generating I gave you write access to this repo so we can work collaboratively on whatever comes next. |
The merge is to the branch, so full testing can be done before merging to master. |
#353) * Deprecate ` {eol_}comments_re` in `ParserConfig` (#352) * [buffering] drop forced multiline match for string patterns Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. --------- Co-authored-by: Vincent Fazio <[email protected]> Co-authored-by: Vincent Fazio <[email protected]>
There are numerous changes in this MR,
Update the default eol_comments pattern to use
(?m)
. Since bootstrap.py was not regenerated after Pre-compile regexps #338, tests didnt fail like they should have, however users trying to generate code in the new version got bit by a behavior change.One of the reasons this got missed is because the code generator uses
repr
and there's apparently no test to compare the current bootstrap.py with a newly generated file fromgrammar/tatsu.ebnf
so this test may be worth adding at some point.Drop the forced
re.MULTILINE
injection tostr
type regex patterns per discussion ParserConfig type changes may cause a regression in pattern matching #351 (comment)Document the behavior change regarding
re.MULTILINE
match no longer being injected into regex string patterns implicitlyDrop
eol_comments_re
andcomments_re
fromParserConfig
's arguments list. They are deprecated in favor of theeol_comments
andcomments
which already provide the same purpose and are parsed from directives. These arguments will be compiled as part of the dataclass's__post_init__
and set private *_re variables that have a read only property interface for other infrastructure to query. The *_re values cannot be mutated so the two values stay in sync.Update code generators to use new arguments
Regenerate bootstrap.py
Document the changes in arguments in the syntax section
The first three ensure consistent behavior between string patterns and compiled patterns. Users will still potentially need to update their patterns, but at least it's documented in some way. As far as I can tell, there is no published changelog or deprecations list that users can easily consult where this can be listed
The last 4 are to migrate away from the *_re values as init arguments to
ParserConfig
so the regex values are only ever specified in one place. The only real value here, now that the values are treated the same, is that there's only one way to set the regex, and it's via the string argument, not a precompiled regex value.After working on this MR, I think there are a couple of things that should maybe be spawned off into separate issues:
bootstrap.py
does not need to be regeneratedbuffering.py::Buffer.build_whitespace_re
only ever receives strings based on the typing ofParserConfig
but it has a work around forRETYPE
.buffering.py::Buffer.build_whitespace_re
is automagically prepending(?m)
to the whitespace pattern and maybe shouldn't be if the expectation is that users configure this properly. TatSu tests pass fine with this dropped.self._pattern
withpattern.regex
so that code can be simplified to only ever expect anre.Pattern
and not deal with compile on the fly unless that's required. Otherwise, I think there's an optimization to have a pattern map at the beginning of the generated file so that all of the patterns are compiled once and are simply referenced in theself._pattern
argument to avoid constantly compiling patterns over and over each timeself._string_
is called for example. I don't see use of functools lrucache so i'm guessing these calls aren't cached in anyway and they are being compiled continuously.Closes #351