Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make {eol_}comments_re read-only and non-init arguments in ParserConfig #352

Merged
merged 11 commits into from
Dec 29, 2024

Conversation

vfazio
Copy link
Collaborator

@vfazio vfazio commented Dec 27, 2024

There are numerous changes in this MR,

  1. Update the default eol_comments pattern to use (?m). Since bootstrap.py was not regenerated after Pre-compile regexps #338, tests didnt fail like they should have, however users trying to generate code in the new version got bit by a behavior change.
    One of the reasons this got missed is because the code generator uses repr and there's apparently no test to compare the current bootstrap.py with a newly generated file from grammar/tatsu.ebnf so this test may be worth adding at some point.

  2. Drop the forced re.MULTILINE injection to str type regex patterns per discussion ParserConfig type changes may cause a regression in pattern matching #351 (comment)

  3. Document the behavior change regarding re.MULTILINE match no longer being injected into regex string patterns implicitly

  4. Drop eol_comments_re and comments_re from ParserConfig's arguments list. They are deprecated in favor of the eol_comments and comments which already provide the same purpose and are parsed from directives. These arguments will be compiled as part of the dataclass's __post_init__ and set private *_re variables that have a read only property interface for other infrastructure to query. The *_re values cannot be mutated so the two values stay in sync.

  5. Update code generators to use new arguments

  6. Regenerate bootstrap.py

  7. Document the changes in arguments in the syntax section

The first three ensure consistent behavior between string patterns and compiled patterns. Users will still potentially need to update their patterns, but at least it's documented in some way. As far as I can tell, there is no published changelog or deprecations list that users can easily consult where this can be listed

The last 4 are to migrate away from the *_re values as init arguments to ParserConfig so the regex values are only ever specified in one place. The only real value here, now that the values are treated the same, is that there's only one way to set the regex, and it's via the string argument, not a precompiled regex value.

After working on this MR, I think there are a couple of things that should maybe be spawned off into separate issues:

  1. There should be a coverage report/badge/analysis for this repo as there is not 100% coverage on the code
  2. There should be a test or a CI check that bootstrap.py does not need to be regenerated
  3. buffering.py::Buffer.build_whitespace_re only ever receives strings based on the typing of ParserConfig but it has a work around for RETYPE.
  4. buffering.py::Buffer.build_whitespace_re is automagically prepending (?m) to the whitespace pattern and maybe shouldn't be if the expectation is that users configure this properly. TatSu tests pass fine with this dropped.
  5. ngcodegen should consider switching filling self._pattern with pattern.regex so that code can be simplified to only ever expect an re.Pattern and not deal with compile on the fly unless that's required. Otherwise, I think there's an optimization to have a pattern map at the beginning of the generated file so that all of the patterns are compiled once and are simply referenced in the self._pattern argument to avoid constantly compiling patterns over and over each time self._string_ is called for example. I don't see use of functools lrucache so i'm guessing these calls aren't cached in anyway and they are being compiled continuously.

Closes #351

@apalala
Copy link
Collaborator

apalala commented Dec 28, 2024

The difficult part in these complex changes is that bootstrap.py is a generated file. It is so so there's a consistency check for all of TatSu.

Plese redirect the pull request to the following branch, and I will test it against existing parsers.

https://github.com/neogeny/TatSu/tree/drop_comments_re

@vfazio
Copy link
Collaborator Author

vfazio commented Dec 28, 2024

Oof, I didn't realize I was going to have to address lint errors in code I didn't touch. I'll have to patch those up.

What are your thoughts on ParseContext? Should I drop those proxied values for the compiled REs?

TatSu/tatsu/contexts.py

Lines 166 to 172 in 1b632e8

@property
def comments_re(self):
return self.active_config.comments_re
@property
def eol_comments_re(self):
return self.active_config.eol_comments_re

I ran the unit tests the other day and they're never referenced. I wonder if we should be generating some coverage data to show how comprehensive the test suite is.

Anyway I'll fix this up rq and change targets

Previously, when scanning for matches to a regex, if the type of the
pattern was `str`, the pattern was always compiled with `re.MULTILINE`.

Recent changes to `ParserConfig` [0] changed the type used for regex
matches in generated code from `str` to `re.Pattern` which could lead to
a difference in behavior from previous versions where a defined comments
or eol_comments may have been implicitly relying on the `re.MULTILINE`
flag.

After discussion [1], it has been determined that usage of `re` flags
within TatSu should be deprecated in favor of users specifying the
necessary flags within patterns.

As such, drop the `re.MULTILINE` flag for strings compiled on the fly.

[0]: neogeny#338
[1]: neogeny#351 (comment)
@vfazio vfazio changed the base branch from master to drop_comments_re December 29, 2024 00:21
@vfazio vfazio force-pushed the vfazio-drop-comment-res branch from 157937b to f922dad Compare December 29, 2024 00:21
vfazio and others added 10 commits December 28, 2024 18:32
Make the default eol_comments regex use multiline matching.

Recent changes to `ParserConfig` [0] now use a precompiled regex (an
`re.Pattern`) instead of compiling the `str` regex on the fly.

The `Tokenizer` previously assumed `str` type regexes should all be
`re.MULTILINE` regardless of options defined in the regex itself when
compiling the pattern. This behavior has since changed to no longer
automatically apply and thus requires configurations to specify the
option in the pattern.

[0]: neogeny#338
Previously, the `eol_comments_re` and `comments_re` attributes were
public init arguments, were modifiable, and could thus become out of
sync with the `eol_comments` and `comments` attributes.

Also, with recent changes to `ParserConfig` [0], there were two ways to
initialize the regex values for comments and eol_comments directives;
either via the constructor using the *_re variables or by using the
sister string arguments and relying on `__post_init__` to compile the
values which trumped the explicit *_re argument values.

Now, the constructor interface has been simplified to not take either
`eol_comments_re` or `comments_re` as arguments. Callers may only use
`eol_comments` and `comments`.

The `eol_comments_re` and `comments_re` attributes are still
public, but are read-only so they are always a reflection of their
sister string values passed into the constructor.

[0]: neogeny#200
@vfazio vfazio force-pushed the vfazio-drop-comment-res branch from f922dad to fdad793 Compare December 29, 2024 00:37
@vfazio vfazio changed the title Drop usage of {eol_}comments_re from ParserConfig Make {eol_}comments_re read-only and non-init arguments in ParserConfig Dec 29, 2024
@vfazio vfazio marked this pull request as ready for review December 29, 2024 01:53
@vfazio
Copy link
Collaborator Author

vfazio commented Dec 29, 2024

updated the description and commits

@vfazio
Copy link
Collaborator Author

vfazio commented Dec 29, 2024

@apalala it sounded like you wanted the attributes dropped completely which i did not do.

I have a new branch which does do this and all tests pass. I can push that if you would prefer.

Dropping a diff here just in case:

(venv) vfazio@Zephyrus:~/development/TatSu$ git diff vfazio-keep_comments_re vfazio-drop-comment-res 
diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 5a2a91f..f2a2e77 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -49,6 +49,8 @@ class Buffer(Tokenizer):
         self.text = self.original_text = text
 
         self.whitespace_re = self.build_whitespace_re(config.whitespace)
+        self.comments_re = None if not config.comments else re.compile(config.comments)
+        self.eol_comments_re = None if not config.eol_comments else re.compile(config.eol_comments)
         self.nameguard = (
             config.nameguard
             if config.nameguard is not None
@@ -269,11 +271,11 @@ class Buffer(Tokenizer):
         return self._eat_regex(self.whitespace_re)
 
     def eat_comments(self):
-        comments = self._eat_regex_list(self.config.comments_re)
+        comments = self._eat_regex_list(self.comments_re)
         self._index_comments(comments, lambda x: x.inline)
 
     def eat_eol_comments(self):
-        comments = self._eat_regex_list(self.config.eol_comments_re)
+        comments = self._eat_regex_list(self.eol_comments_re)
         self._index_comments(comments, lambda x: x.eol)
 
     def next_token(self):
diff --git a/tatsu/contexts.py b/tatsu/contexts.py
index 12873fd..425fdbf 100644
--- a/tatsu/contexts.py
+++ b/tatsu/contexts.py
@@ -163,14 +163,6 @@ class ParseContext:
     def trace_filename(self):
         return self.active_config.trace_filename
 
-    @property
-    def comments_re(self):
-        return self.active_config.comments_re
-
-    @property
-    def eol_comments_re(self):
-        return self.active_config.eol_comments_re
-
     @property
     def whitespace(self):
         return self.active_config.whitespace
diff --git a/tatsu/infos.py b/tatsu/infos.py
index 3bba14f..5bc2100 100644
--- a/tatsu/infos.py
+++ b/tatsu/infos.py
@@ -2,7 +2,6 @@ from __future__ import annotations
 
 import copy
 import dataclasses
-import re
 from collections.abc import Callable, MutableMapping
 from itertools import starmap
 from typing import Any, NamedTuple
@@ -30,9 +29,6 @@ class ParserConfig:
     start_rule: str | None = None  # FIXME
     rule_name: str | None = None  # Backward compatibility
 
-    _comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
-    _eol_comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
-
     tokenizercls: type[Tokenizer] | None = None  # FIXME
     semantics: type | None = None
 
@@ -63,18 +59,6 @@ class ParserConfig:
     def __post_init__(self):  # pylint: disable=W0235
         if self.ignorecase:
             self.keywords = [k.upper() for k in self.keywords]
-        if self.comments:
-            self._comments_re = re.compile(self.comments)
-        if self.eol_comments:
-            self._eol_comments_re = re.compile(self.eol_comments)
-
-    @property
-    def comments_re(self) -> re.Pattern | None:
-        return self._comments_re
-
-    @property
-    def eol_comments_re(self) -> re.Pattern | None:
-        return self._eol_comments_re
 
     @classmethod
     def new(
@@ -109,20 +93,8 @@ class ParserConfig:
         else:
             return self.replace(**vars(other))
 
-    # non-init fields cannot be used as arguments in `replace`, however
-    # they are values returned by `vars` and `dataclass.asdict` so they
-    # must be filtered out.
-    # If the `ParserConfig` dataclass drops these fields, then this filter can be removed
-    def _filter_non_init_fields(self, settings: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
-        for field in [
-            field.name for field in dataclasses.fields(self) if not field.init
-        ]:
-            if field in settings:
-                del settings[field]
-        return settings
-
     def replace(self, **settings: Any) -> ParserConfig:
-        overrides = self._filter_non_init_fields(self._find_common(**settings))
+        overrides = self._find_common(**settings)
         result = dataclasses.replace(self, **overrides)
         if 'grammar' in overrides:
             result.name = result.grammar

@vfazio
Copy link
Collaborator Author

vfazio commented Dec 29, 2024

Dropping them completely from ParserConfig and letting Buffer compile the patterns may save a small amount of time considering the patterns get recompiled every time ParserConfig.replace, ParserConfig.replace_config, or ParserConfig.new get called.

@apalala
Copy link
Collaborator

apalala commented Dec 29, 2024

Those are all interesting ideas. We can take it one step at a time.

I'm merging this pull request right now to start the testing against existing parsers. The final test is re-generating bootstrap.py to check that everything is consistent.

I gave you write access to this repo so we can work collaboratively on whatever comes next.

@apalala
Copy link
Collaborator

apalala commented Dec 29, 2024

The merge is to the branch, so full testing can be done before merging to master.

@apalala apalala merged commit 7e4f8f4 into neogeny:drop_comments_re Dec 29, 2024
2 checks passed
apalala added a commit that referenced this pull request Jan 3, 2025
#353)

* Deprecate ` {eol_}comments_re` in `ParserConfig` (#352)
* [buffering] drop forced multiline match for string patterns

Previously, when scanning for matches to a regex, if the type of the
pattern was `str`, the pattern was always compiled with `re.MULTILINE`.

Recent changes to `ParserConfig` [0] changed the type used for regex
matches in generated code from `str` to `re.Pattern` which could lead to
a difference in behavior from previous versions where a defined comments
or eol_comments may have been implicitly relying on the `re.MULTILINE`
flag.

After discussion [1], it has been determined that usage of `re` flags
within TatSu should be deprecated in favor of users specifying the
necessary flags within patterns.

As such, drop the `re.MULTILINE` flag for strings compiled on the fly.

---------

Co-authored-by: Vincent Fazio <[email protected]>
Co-authored-by: Vincent Fazio <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ParserConfig type changes may cause a regression in pattern matching
2 participants