Make {eol_}comments_re read-only and non-init arguments in `ParserConfig` #352

vfazio · 2024-12-27T20:44:55Z

There are numerous changes in this MR,

Update the default eol_comments pattern to use (?m). Since bootstrap.py was not regenerated after Pre-compile regexps #338, tests didnt fail like they should have, however users trying to generate code in the new version got bit by a behavior change.
One of the reasons this got missed is because the code generator uses repr and there's apparently no test to compare the current bootstrap.py with a newly generated file from grammar/tatsu.ebnf so this test may be worth adding at some point.
Drop the forced re.MULTILINE injection to str type regex patterns per discussion ParserConfig type changes may cause a regression in pattern matching #351 (comment)
Document the behavior change regarding re.MULTILINE match no longer being injected into regex string patterns implicitly
Drop eol_comments_re and comments_re from ParserConfig's arguments list. They are deprecated in favor of the eol_comments and comments which already provide the same purpose and are parsed from directives. These arguments will be compiled as part of the dataclass's __post_init__ and set private *_re variables that have a read only property interface for other infrastructure to query. The *_re values cannot be mutated so the two values stay in sync.
Update code generators to use new arguments
Regenerate bootstrap.py
Document the changes in arguments in the syntax section

The first three ensure consistent behavior between string patterns and compiled patterns. Users will still potentially need to update their patterns, but at least it's documented in some way. As far as I can tell, there is no published changelog or deprecations list that users can easily consult where this can be listed

The last 4 are to migrate away from the *_re values as init arguments to ParserConfig so the regex values are only ever specified in one place. The only real value here, now that the values are treated the same, is that there's only one way to set the regex, and it's via the string argument, not a precompiled regex value.

After working on this MR, I think there are a couple of things that should maybe be spawned off into separate issues:

There should be a coverage report/badge/analysis for this repo as there is not 100% coverage on the code
There should be a test or a CI check that bootstrap.py does not need to be regenerated
buffering.py::Buffer.build_whitespace_re only ever receives strings based on the typing of ParserConfig but it has a work around for RETYPE.
buffering.py::Buffer.build_whitespace_re is automagically prepending (?m) to the whitespace pattern and maybe shouldn't be if the expectation is that users configure this properly. TatSu tests pass fine with this dropped.
ngcodegen should consider switching filling self._pattern with pattern.regex so that code can be simplified to only ever expect an re.Pattern and not deal with compile on the fly unless that's required. Otherwise, I think there's an optimization to have a pattern map at the beginning of the generated file so that all of the patterns are compiled once and are simply referenced in the self._pattern argument to avoid constantly compiling patterns over and over each time self._string_ is called for example. I don't see use of functools lrucache so i'm guessing these calls aren't cached in anyway and they are being compiled continuously.

Closes #351

apalala · 2024-12-28T16:43:36Z

The difficult part in these complex changes is that bootstrap.py is a generated file. It is so so there's a consistency check for all of TatSu.

Plese redirect the pull request to the following branch, and I will test it against existing parsers.

https://github.com/neogeny/TatSu/tree/drop_comments_re

vfazio · 2024-12-28T19:59:01Z

Oof, I didn't realize I was going to have to address lint errors in code I didn't touch. I'll have to patch those up.

What are your thoughts on ParseContext? Should I drop those proxied values for the compiled REs?

TatSu/tatsu/contexts.py

Lines 166 to 172 in 1b632e8

    
           @property 
        
           def comments_re(self): 
        
               return self.active_config.comments_re 
        
           @property 
        
           def eol_comments_re(self): 
        
               return self.active_config.eol_comments_re

I ran the unit tests the other day and they're never referenced. I wonder if we should be generating some coverage data to show how comprehensive the test suite is.

Anyway I'll fix this up rq and change targets

Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: neogeny#338 [1]: neogeny#351 (comment)

Make the default eol_comments regex use multiline matching. Recent changes to `ParserConfig` [0] now use a precompiled regex (an `re.Pattern`) instead of compiling the `str` regex on the fly. The `Tokenizer` previously assumed `str` type regexes should all be `re.MULTILINE` regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern. [0]: neogeny#338

Previously, the `eol_comments_re` and `comments_re` attributes were public init arguments, were modifiable, and could thus become out of sync with the `eol_comments` and `comments` attributes. Also, with recent changes to `ParserConfig` [0], there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on `__post_init__` to compile the values which trumped the explicit *_re argument values. Now, the constructor interface has been simplified to not take either `eol_comments_re` or `comments_re` as arguments. Callers may only use `eol_comments` and `comments`. The `eol_comments_re` and `comments_re` attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor. [0]: neogeny#200

vfazio · 2024-12-29T01:53:51Z

updated the description and commits

vfazio · 2024-12-29T14:38:48Z

@apalala it sounded like you wanted the attributes dropped completely which i did not do.

I have a new branch which does do this and all tests pass. I can push that if you would prefer.

Dropping a diff here just in case:

(venv) vfazio@Zephyrus:~/development/TatSu$ git diff vfazio-keep_comments_re vfazio-drop-comment-res 
diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 5a2a91f..f2a2e77 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -49,6 +49,8 @@ class Buffer(Tokenizer):
         self.text = self.original_text = text
 
         self.whitespace_re = self.build_whitespace_re(config.whitespace)
+        self.comments_re = None if not config.comments else re.compile(config.comments)
+        self.eol_comments_re = None if not config.eol_comments else re.compile(config.eol_comments)
         self.nameguard = (
             config.nameguard
             if config.nameguard is not None
@@ -269,11 +271,11 @@ class Buffer(Tokenizer):
         return self._eat_regex(self.whitespace_re)
 
     def eat_comments(self):
-        comments = self._eat_regex_list(self.config.comments_re)
+        comments = self._eat_regex_list(self.comments_re)
         self._index_comments(comments, lambda x: x.inline)
 
     def eat_eol_comments(self):
-        comments = self._eat_regex_list(self.config.eol_comments_re)
+        comments = self._eat_regex_list(self.eol_comments_re)
         self._index_comments(comments, lambda x: x.eol)
 
     def next_token(self):
diff --git a/tatsu/contexts.py b/tatsu/contexts.py
index 12873fd..425fdbf 100644
--- a/tatsu/contexts.py
+++ b/tatsu/contexts.py
@@ -163,14 +163,6 @@ class ParseContext:
     def trace_filename(self):
         return self.active_config.trace_filename
 
-    @property
-    def comments_re(self):
-        return self.active_config.comments_re
-
-    @property
-    def eol_comments_re(self):
-        return self.active_config.eol_comments_re
-
     @property
     def whitespace(self):
         return self.active_config.whitespace
diff --git a/tatsu/infos.py b/tatsu/infos.py
index 3bba14f..5bc2100 100644
--- a/tatsu/infos.py
+++ b/tatsu/infos.py
@@ -2,7 +2,6 @@ from __future__ import annotations
 
 import copy
 import dataclasses
-import re
 from collections.abc import Callable, MutableMapping
 from itertools import starmap
 from typing import Any, NamedTuple
@@ -30,9 +29,6 @@ class ParserConfig:
     start_rule: str | None = None  # FIXME
     rule_name: str | None = None  # Backward compatibility
 
-    _comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
-    _eol_comments_re: re.Pattern | None = dataclasses.field(default=None, init=False, repr=False)
-
     tokenizercls: type[Tokenizer] | None = None  # FIXME
     semantics: type | None = None
 
@@ -63,18 +59,6 @@ class ParserConfig:
     def __post_init__(self):  # pylint: disable=W0235
         if self.ignorecase:
             self.keywords = [k.upper() for k in self.keywords]
-        if self.comments:
-            self._comments_re = re.compile(self.comments)
-        if self.eol_comments:
-            self._eol_comments_re = re.compile(self.eol_comments)
-
-    @property
-    def comments_re(self) -> re.Pattern | None:
-        return self._comments_re
-
-    @property
-    def eol_comments_re(self) -> re.Pattern | None:
-        return self._eol_comments_re
 
     @classmethod
     def new(
@@ -109,20 +93,8 @@ class ParserConfig:
         else:
             return self.replace(**vars(other))
 
-    # non-init fields cannot be used as arguments in `replace`, however
-    # they are values returned by `vars` and `dataclass.asdict` so they
-    # must be filtered out.
-    # If the `ParserConfig` dataclass drops these fields, then this filter can be removed
-    def _filter_non_init_fields(self, settings: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
-        for field in [
-            field.name for field in dataclasses.fields(self) if not field.init
-        ]:
-            if field in settings:
-                del settings[field]
-        return settings
-
     def replace(self, **settings: Any) -> ParserConfig:
-        overrides = self._filter_non_init_fields(self._find_common(**settings))
+        overrides = self._find_common(**settings)
         result = dataclasses.replace(self, **overrides)
         if 'grammar' in overrides:
             result.name = result.grammar

vfazio · 2024-12-29T14:42:44Z

Dropping them completely from ParserConfig and letting Buffer compile the patterns may save a small amount of time considering the patterns get recompiled every time ParserConfig.replace, ParserConfig.replace_config, or ParserConfig.new get called.

apalala · 2024-12-29T14:52:03Z

Those are all interesting ideas. We can take it one step at a time.

I'm merging this pull request right now to start the testing against existing parsers. The final test is re-generating bootstrap.py to check that everything is consistent.

I gave you write access to this repo so we can work collaboratively on whatever comes next.

apalala · 2024-12-29T14:53:52Z

The merge is to the branch, so full testing can be done before merging to master.

#353) * Deprecate ` {eol_}comments_re` in `ParserConfig` (#352) * [buffering] drop forced multiline match for string patterns Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. --------- Co-authored-by: Vincent Fazio <[email protected]> Co-authored-by: Vincent Fazio <[email protected]>

vfazio changed the base branch from master to drop_comments_re December 29, 2024 00:21

vfazio force-pushed the vfazio-drop-comment-res branch from 157937b to f922dad Compare December 29, 2024 00:21

vfazio and others added 10 commits December 28, 2024 18:32

[codegen] migrate to {eol_}comments

4bdd4a5

[ngcodegen] migrate to {eol_}comments

42cb810

[bootstrap] migrate to {eol_}comments

9160c08

[lint] resolve errors

03d4b7f

[docs] note {eol_}comments directive behavior changes

923a678

[docs] update syntax to reflect {eol_}comments arguments

ddbe27f

[test] fix test_parse_hash to use eol_comments

dacb978

[test] explicitly use multiline match in test_patterns_with_newlines

fdad793

vfazio force-pushed the vfazio-drop-comment-res branch from f922dad to fdad793 Compare December 29, 2024 00:37

vfazio changed the title ~~Drop usage of {eol_}comments_re from ParserConfig~~ Make {eol_}comments_re read-only and non-init arguments in ParserConfig Dec 29, 2024

vfazio marked this pull request as ready for review December 29, 2024 01:53

apalala merged commit 7e4f8f4 into neogeny:drop_comments_re Dec 29, 2024
2 checks passed

apalala mentioned this pull request Dec 29, 2024

Make {eol_}comments_re read-only and non-init arguments in `ParserCon… #353

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make {eol_}comments_re read-only and non-init arguments in `ParserConfig` #352

Make {eol_}comments_re read-only and non-init arguments in `ParserConfig` #352

vfazio commented Dec 27, 2024 •

edited

Loading

apalala commented Dec 28, 2024

vfazio commented Dec 28, 2024

vfazio commented Dec 29, 2024

vfazio commented Dec 29, 2024

vfazio commented Dec 29, 2024

apalala commented Dec 29, 2024

apalala commented Dec 29, 2024

Make {eol_}comments_re read-only and non-init arguments in ParserConfig #352

Make {eol_}comments_re read-only and non-init arguments in ParserConfig #352

Conversation

vfazio commented Dec 27, 2024 • edited Loading

apalala commented Dec 28, 2024

vfazio commented Dec 28, 2024

vfazio commented Dec 29, 2024

vfazio commented Dec 29, 2024

vfazio commented Dec 29, 2024

apalala commented Dec 29, 2024

apalala commented Dec 29, 2024

Make {eol_}comments_re read-only and non-init arguments in `ParserConfig` #352

Make {eol_}comments_re read-only and non-init arguments in `ParserConfig` #352

vfazio commented Dec 27, 2024 •

edited

Loading