@rocky rocky commented Apr 14, 2025

Refactor the scanner and remove the prescanner. Escape sequences are no longer handled in a separate phase; they are now handled as part of the scanning phase.

Fixes #125
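
To illustrate what folding escape handling into the scanner can look like, here is a minimal Python sketch. It is not the code from this PR: `decode_escape`, `scan_string_body`, `EscapeError`, and the two-entry `NAMED_CHARACTERS` table are hypothetical names chosen for the example. It assumes the Wolfram Language escape forms `\.xx`, `\:xxxx`, `\|xxxxxx`, `\nnn` (octal), and `\[Name]`.

```python
import re
from typing import Tuple

# Hypothetical two-entry table for this sketch only; the project keeps its
# real table in named-characters.yml.
NAMED_CHARACTERS = {"Mu": "\u03bc", "Theta": "\u03b8"}

# Number of hex digits expected after each hex-escape introducer.
HEX_WIDTHS = {".": 2, ":": 4, "|": 6}


class EscapeError(ValueError):
    """Hypothetical error type used only in this sketch."""


def decode_escape(text: str, pos: int) -> Tuple[str, int]:
    """Decode one escape sequence; text[pos] is the character right after
    the backslash. Return the decoded character and the next position."""
    if pos >= len(text):
        raise EscapeError("incomplete escape sequence")
    ch = text[pos]
    if ch in HEX_WIDTHS:
        width = HEX_WIDTHS[ch]
        digits = text[pos + 1 : pos + 1 + width]
        if len(digits) != width or not re.fullmatch(r"[0-9a-fA-F]+", digits):
            raise EscapeError(f"expected {width} hex digits after \\{ch}")
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            raise EscapeError("code point out of range")
        return chr(code_point), pos + 1 + width
    if ch == "[":
        end = text.find("]", pos)
        if end == -1:
            raise EscapeError("incomplete named character")
        name = text[pos + 1 : end]
        if name not in NAMED_CHARACTERS:
            raise EscapeError(f"unknown named character {name!r}")
        return NAMED_CHARACTERS[name], end + 1
    if ch in "01234567":
        digits = text[pos : pos + 3]
        if not re.fullmatch(r"[0-7]{3}", digits):
            raise EscapeError("expected 3 octal digits")
        return chr(int(digits, 8)), pos + 3
    raise EscapeError(f"unknown escape sequence \\{ch}")


def scan_string_body(text: str) -> str:
    """Scan the inside of a string literal in a single pass, decoding each
    escape sequence at the point where the scanner meets it."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] == "\\":
            decoded, pos = decode_escape(text, pos + 1)
            out.append(decoded)
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

With this sketch, `scan_string_body(r"\.78\.79")` returns `"xy"`. Because decoding happens inside the scanning loop, an error can be reported at the position where the bad escape occurs rather than after a separate rewriting pass.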

@rocky rocky marked this pull request as draft April 14, 2025 15:28
@rocky rocky requested a review from mmatera April 14, 2025 15:31
rocky commented Apr 14, 2025

@mmatera There is still a lot to do regarding writing tests and ensuring we handle errors and everything correctly.

But since this is getting large, this is a heads-up as to what's on the horizon.

@rocky rocky force-pushed the revise-escape-sequence-scanning branch 3 times, most recently from a5b13d0 to 1ab77fa Compare May 15, 2025 23:38
rocky added 20 commits May 28, 2025 19:19
Escape sequences other than named characters have been removed from the
prescanner and put in the scanner.
handle syntax errors in mathics3-tokens.
Tokenizer.code -> Tokenizer.source_text
Tokenizer.incomplete -> Tokenizer.get_more_input
Start to show syntax errors.
In particular errors with octal digits and incomplete named errors.
Go over docstrings in escape_sequences.py
and add more tests.
named-characters.yml: \[Mu] is letterlike
tokeniser.py: Correct identifier or pattern for those having letterlike escape sequences
and also add Theta to the list of letterlike symbols
Replace .format() with f-strings. Add comments around Symbol pattern.

sntx_message() Exception now saves name, tag, and args
Not sure how this worked before, but it did.
* "$\" is a thing
* Correct EscapeSyntaxError error message
* Better Symbol tokenization for things like a\[Mu]1. More in next
  commit though.
for things like \.78\.79

Improve comments around DRYing identifier/symbol_name extension
This PR has gotten out of hand in size, we'll break it up into smaller chunks.
NamedCharacterSyntax should be a new-style TranslateError
self.code -> self.source_text
misc sntx_message() fixes. Document better.
@rocky rocky force-pushed the revise-escape-sequence-scanning branch from ef9b7c5 to 53b1402 Compare May 29, 2025 16:48
@rocky rocky force-pushed the revise-escape-sequence-scanning branch from 53b1402 to 74587cc Compare May 29, 2025 16:58
TranslateError, TranslateErrorNew, ScanError now become ScannerError
rocky added 2 commits May 29, 2025 17:18
it should be just a little bit faster (and it is more modern)
@rocky rocky marked this pull request as ready for review May 29, 2025 22:19
@mmatera mmatera left a comment

LGTM. Thanks!

rocky commented May 31, 2025

@mmatera merging needs to be coordinated with Mathics3/mathics-core#1403 since exception handling has been revised. (In a previous PR, you commented on the use of TranslateErrorNew.)

After this PR, we'll still need to handle boxing expressions inside strings. Boxing expressions outside strings, I think, work. But I haven't been able to get to galatea to understand what is expected versus not.

@rocky rocky force-pushed the revise-escape-sequence-scanning branch from fa1155d to 2422c60 Compare May 31, 2025 15:41
rocky added 4 commits June 1, 2025 07:08
An invalid escape sequence inside a string, like "\(a \+\)", is not
an error. Instead, the sequence stays the same, e.g. "\(a \+\)".
If the escape sequence in a string can be a boxing construct, then this
is not an error in the escape sequence. Otherwise, it is.

For example

"\(" is not an error in a string while "\g" is.

Yes, this is a bit involved. But that's the way WA works.
Also, flatten values in box operators for BOXING_CONSTRUCT_SUFFIXES
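
The string rule described in these commit messages can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `ASSUMED_BOXING_SUFFIXES` is a guessed set standing in for the project's `BOXING_CONSTRUCT_SUFFIXES` (whose real contents are not shown here), and `ORDINARY_ESCAPES`, `StringEscapeError`, and `check_string_escapes` are hypothetical names. Hex, octal, and named-character escapes are left out to keep the example short.

```python
# Assumed set of characters that may follow a backslash in a boxing
# construct such as \( ... \). This is a guess for the sketch only.
ASSUMED_BOXING_SUFFIXES = frozenset("()!*+_^&%/@`")

# A few ordinary string escapes, included so the sketch does not reject them.
ORDINARY_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


class StringEscapeError(ValueError):
    """Hypothetical error type used only in this sketch."""


def check_string_escapes(body: str) -> str:
    """Walk the inside of a string literal. Escapes that could start a
    boxing construct are kept unchanged; unknown escapes are errors."""
    out = []
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch != "\\":
            out.append(ch)
            pos += 1
            continue
        if pos + 1 >= len(body):
            raise StringEscapeError("incomplete escape sequence")
        suffix = body[pos + 1]
        if suffix in ASSUMED_BOXING_SUFFIXES:
            # Possibly a boxing construct: not an error, keep it verbatim.
            out.append(body[pos : pos + 2])
        elif suffix in ORDINARY_ESCAPES:
            out.append(ORDINARY_ESCAPES[suffix])
        else:
            raise StringEscapeError(f"unknown escape sequence \\{suffix}")
        pos += 2
    return "".join(out)
```

Under these assumptions, `check_string_escapes(r"\(a \+\)")` returns the input unchanged, while `check_string_escapes(r"\g")` raises `StringEscapeError`.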
@rocky rocky merged commit 2424653 into master Jun 3, 2025
14 checks passed
@rocky rocky deleted the revise-escape-sequence-scanning branch June 3, 2025 13:04
Development

Successfully merging this pull request may close these issues.

Hex escape sequence in string literal doesn't work
