Merged
34 commits
44c6e05  Reduce prescanner use (rocky, Apr 11, 2025)
ae4aa63  Test workarounds.. for now. (rocky, Apr 12, 2025)
1248255  Isolate tokenizing escape sequences (rocky, Apr 12, 2025)
95bd105  Split out escape_sequence parsing. (rocky, Apr 13, 2025)
f1a06e1  Handle escape sequences outside of strings. (rocky, Apr 14, 2025)
f6846a2  Remove prescanner and .. (rocky, Apr 14, 2025)
ccfe943  Rename some variables (rocky, Apr 14, 2025)
3d0a2f7  Bang more on mathics3-tokens (rocky, Apr 14, 2025)
1c03e8b  Start going over error messages... (rocky, Apr 15, 2025)
3c1b977  Improve error handling... (rocky, Apr 17, 2025)
ded8885  Improve scanner... (rocky, May 14, 2025)
41fdc74  Handle EscapSequence errors better (rocky, May 16, 2025)
fa9b1a9  Handle embedded escape sequences in Symbols... (rocky, May 17, 2025)
8c582f5  WIP - bang on Symbol tokenization with backslash (rocky, May 18, 2025)
c1c015c  Be able to whether we are in a RowBox (rocky, May 18, 2025)
68346c0  Handle no-meaning operators (rocky, May 19, 2025)
3fe6a2b  WIP misc fixes... (rocky, May 19, 2025)
1719292  Better Symbol-name extension test... (rocky, May 19, 2025)
42a3e8d  WIP - small tweaks before moving master forward (rocky, May 20, 2025)
9c596be  Small bugs related to escape-character handling (rocky, May 29, 2025)
74587cc  Use git branch for testing Mathics (rocky, May 29, 2025)
25f5672  Revise Scanner error exception class (rocky, May 29, 2025)
e503b3a  Let's use 3.12 in CI testing (rocky, May 29, 2025)
e1b27fa  Small tidying changes to comments (rocky, May 29, 2025)
c440e42  ScannerError -> SyntaxError (rocky, May 29, 2025)
5fce8a0  More tests (rocky, May 29, 2025)
a568063  One more escape test (rocky, May 30, 2025)
36d85a7  Allow escape space "\ " + more string tests (rocky, May 31, 2025)
00cbb48  Start unit test for comments (rocky, May 31, 2025)
2422c60  Fix a doc spelling typo + minor doc tweak (rocky, May 31, 2025)
7582e6b  invalid escape sequences inside strings... (rocky, Jun 1, 2025)
a49e453  Escape sequences in strings, yet again... (rocky, Jun 1, 2025)
1d10b18  Add LineSeparator, and \* (rocky, Jun 1, 2025)
0f0418d  Remove duplicate test (rocky, Jun 3, 2025)
4 changes: 2 additions & 2 deletions .github/workflows/mathics.yml
@@ -11,7 +11,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.11']
python-version: ['3.12']
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
@@ -33,7 +33,7 @@ jobs:
git clone --depth 1 https://github.com/Mathics3/mathics-scanner.git
(cd mathics-scanner && pip install -e .)
# Until next Mathics3/mathics-core release is out...
git clone --depth 1 https://github.com/Mathics3/mathics-core.git
git clone --depth 1 --branch revise-escape-sequence-scanning https://github.com/Mathics3/mathics-core.git
cd mathics-core/
make PIP_INSTALL_OPTS='[full]'
# pip install Mathics3[full]
2 changes: 1 addition & 1 deletion docs/source/api.rst
@@ -10,7 +10,7 @@ Tokenization

Tokenization is performed by the ``Tokeniser`` class. The ``next`` method
consumes characters from a feeder and returns a token if the tokenization
succeeds. If the tokenization fails an instance of ``TranslateError`` is
succeeds. If the tokenization fails an instance of ``SyntaxError`` is
raised.

.. autoclass:: Tokeniser(object)
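
For orientation, here is a minimal sketch of how the documented pieces fit together. Only the ``next`` method and the ``SyntaxError`` behaviour are taken from the text above; the ``Tokeniser(feeder)`` constructor call is an assumption for illustration.

```python
# Minimal sketch, assuming Tokeniser is constructed from a LineFeeder.
from mathics_scanner import SingleLineFeeder
from mathics_scanner.errors import SyntaxError as MathicsSyntaxError
from mathics_scanner.tokeniser import Tokeniser

feeder = SingleLineFeeder(r"1 + \[Theta]", filename="<example>")
tokeniser = Tokeniser(feeder)

try:
    token = tokeniser.next()  # returns the next token on success
    print(token)
except MathicsSyntaxError as err:
    # On failure, the SyntaxError described above carries a message tag,
    # e.g. "sntufn" for an unknown named character.
    print("tokenization failed:", err.tag)
```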
8 changes: 2 additions & 6 deletions mathics_scanner/__init__.py
@@ -15,9 +15,7 @@
from mathics_scanner.errors import (
IncompleteSyntaxError,
InvalidSyntaxError,
ScanError,
TranslateError,
TranslateErrorNew,
SyntaxError,
)
from mathics_scanner.feed import (
FileLineFeeder,
@@ -36,12 +34,10 @@
"InvalidSyntaxError",
"LineFeeder",
"MultiLineFeeder",
"ScanError",
"SyntaxError",
"SingleLineFeeder",
# "Token",
# "Tokeniser",
"TranslateError",
"TranslateErrorNew",
"__version__",
"aliased_characters",
# "is_symbol_name",
12 changes: 11 additions & 1 deletion mathics_scanner/data/named-characters.yml
@@ -68,7 +68,7 @@
# the named character. If it is the same as unicode-equivalent
# it should be omitted
#
# wl-unicode-name: The name of the character corresponding to `wl-unicode`, if it exists. If it is the same as unicode-equivalent-name it can be omitted.
# wl-unicode-name: The name of the character corresponding to `wl-unicode`, if it exists.
# It will be mentioned in Wolfram Language docs if it exists.
#
# Sources:
@@ -6628,6 +6628,16 @@ LightBulb:
wl-reference: https://reference.wolfram.com/language/ref/character/LightBulb.html
wl-unicode: "\uF723"

LineSeparator:
has-unicode-inverse: false
is-letter-like: false
unicode-equivalent: "\u2028"
unicode-equivalent-name: LINE SEPARATOR
unicode-reference: https://www.compart.com/en/unicode/U+2028
wl-reference: https://reference.wolfram.com/language/ref/character/LineSeparator.html
wl-unicode: "\u2028"
wl-unicode-name: LINE SEPARATOR

LongDash:
esc-alias: --
has-unicode-inverse: false
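
A small Python sketch of what the new ``LineSeparator`` entry should make possible once the generated character tables include it; the ``named_characters`` lookup and the call shape of ``parse_escape_sequence`` follow the code added later in this PR.

```python
# Illustrative only; assumes the regenerated JSON tables pick up this entry.
from mathics_scanner.characters import named_characters
from mathics_scanner.escape_sequences import parse_escape_sequence

# \[LineSeparator] should resolve to U+2028 (LINE SEPARATOR).
assert named_characters.get("LineSeparator") == "\u2028"

# pos is the offset just after the backslash, so it points at "[".
value, next_pos = parse_escape_sequence("[LineSeparator]", 0)
assert value == "\u2028" and next_pos == 15
```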
29 changes: 11 additions & 18 deletions mathics_scanner/errors.py
@@ -1,46 +1,39 @@
# -*- coding: utf-8 -*-


class TranslateErrorNew(Exception):
class SyntaxError(Exception):
"""Some sort of error in the scanning or tokenization phase parsing Mathics3.

There are more specific kinds of exceptions subclassed from this
exception class.
"""

def __init__(self, tag: str, *args):
super().__init__()
self.name = "Syntax"
self.tag = tag
self.args = args


class TranslateError(Exception):
"""
A generic class of tokenization errors. This exception is subclassed by other
tokenization errors
"""


class EscapeSyntaxError(TranslateErrorNew):
class EscapeSyntaxError(SyntaxError):
"""Escape sequence syntax error"""

pass


class IncompleteSyntaxError(TranslateErrorNew):
class IncompleteSyntaxError(SyntaxError):
"""More characters were expected to form a valid token"""

pass


class InvalidSyntaxError(TranslateErrorNew):
class InvalidSyntaxError(SyntaxError):
"""Invalid syntax"""

pass


class NamedCharacterSyntaxError(TranslateError):
class NamedCharacterSyntaxError(EscapeSyntaxError):
"""Named character syntax error"""

pass


class ScanError(TranslateErrorNew):
"""A generic scanning error"""

pass
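
The net effect of this rewrite is a single root exception, ``SyntaxError``, carrying a Wolfram-style message ``tag``, with more specific subclasses beneath it; ``except`` clauses therefore have to be ordered from most to least specific. A compact sketch (the alias only avoids shadowing Python's builtin ``SyntaxError``; the message arguments are illustrative):

```python
from mathics_scanner.errors import (
    EscapeSyntaxError,
    NamedCharacterSyntaxError,
)
from mathics_scanner.errors import SyntaxError as MathicsSyntaxError

# Hierarchy introduced by this change.
assert issubclass(NamedCharacterSyntaxError, EscapeSyntaxError)
assert issubclass(EscapeSyntaxError, MathicsSyntaxError)

try:
    raise NamedCharacterSyntaxError("sntufn", "NoSuchCharacterName")
except NamedCharacterSyntaxError as err:
    # This clause must come before any EscapeSyntaxError / SyntaxError
    # handler, otherwise the more general clause would swallow it.
    print(err.name, err.tag, err.args)  # Syntax sntufn ('NoSuchCharacterName',)
```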
148 changes: 148 additions & 0 deletions mathics_scanner/escape_sequences.py
@@ -0,0 +1,148 @@
"""
Helper Module for tokenizing character escape sequences.
"""

from typing import Optional, Tuple

from mathics_scanner.characters import named_characters
from mathics_scanner.errors import (
EscapeSyntaxError,
NamedCharacterSyntaxError,
SyntaxError,
)


def parse_base(source_text: str, start_shift: int, end_shift: int, base: int) -> str:
r"""
See if characters start_shift .. end_shift
can be converted to an integer in base ``base``.

If so, chr(integer value converted from base) is returned.

However, if the conversion fails, SyntaxError is raised.
"""
last = end_shift - start_shift
if last == 2:
tag = "sntoct2"
elif last == 3:
assert base == 8, "Only octal requires 3 digits"
tag = "sntoct1"
elif last in (4, 6):
tag = "snthex"
else:
raise ValueError()

if end_shift > len(source_text):
raise SyntaxError("Syntax", tag)

assert start_shift <= end_shift
text = source_text[start_shift:end_shift]
try:
result = int(text, base)
except ValueError:
raise SyntaxError(tag, source_text[start_shift:].rstrip("\n"))

return chr(result)


def parse_named_character(source_text: str, start: int, finish: int) -> Optional[str]:
r"""
Find the unicode-equivalent symbol for a string named character.

Before calling we have matched the text between "\[" and "]" of the input.

The named character is thus in source_text[start:finish].

Match this string with the known named characters,
e.g. "Theta". If we can match this, then we return the unicode equivalent from the
`named_characters` map (which is read in from JSON but stored in a YAML file).

If we can't find the named character, raise NamedCharacterSyntaxError.
"""
named_character = source_text[start:finish]
if named_character.isalpha():
char = named_characters.get(named_character)
if char is None:
raise NamedCharacterSyntaxError("sntufn", named_character)
else:
return char


def parse_escape_sequence(source_text: str, pos: int) -> Tuple[str, int]:
"""Given some source text in `source_text` starting at offset
`pos`, return the escape-sequence value for this text and the
follow-on offset position.
"""
result = ""
c = source_text[pos]
if c == "\\":
return "\\", pos + 1

# https://www.wolfram.com/language/12/networking-and-system-operations/use-the-full-range-of-unicode-characters.html
# describes hex encoding.
if c == ".":
# see if we have a 2-digit hexadecimal number.
# for example, \.42 is "B"
result += parse_base(source_text, pos + 1, pos + 3, 16)
pos += 3
elif c == ":":
# see if we have a 4-digit hexadecimal number.
# for example, \:03b8 is Unicode small letter theta: θ.
result += parse_base(source_text, pos + 1, pos + 5, 16)
pos += 5
elif c == "|":
# see if we have a 6-digit hexadecimal number.
result += parse_base(source_text, pos + 1, pos + 7, 16)
pos += 7
elif c == "[":
pos += 1
i = pos + 1
while i < len(source_text):
if source_text[i] == "]":
break
i += 1
if i == len(source_text):
# Note: named characters do not have \n's in them. (Is this right)?
# FIXME: decide what to do here.
raise NamedCharacterSyntaxError("Syntax", "sntufn", source_text[pos:])

named_character = parse_named_character(source_text, pos, i)
if named_character is None:
raise NamedCharacterSyntaxError("Syntax", "sntufn", source_text[pos:i])

result += named_character
pos = i + 1
elif c in "01234567":
# See if we have a 3-digit octal number.
# For example \065 = "5"
result += parse_base(source_text, pos, pos + 3, 8)
pos += 3

# WMA escape characters \n, \t, \b, \r.
# Note that these are similar to Python's, but are different.
# In particular, Python defines "\a" to be ^G (control G),
# but in WMA, this is invalid.
elif c in "ntbfr $\n":
if c in "n\n":
result += "\n"
elif c == " ":
result += " "
elif c == "t":
result += "\t"
elif c == "b":
result += "\b"
elif c == "f":
result += "\f"
elif c in '$"':
# I don't know why \$ is defined, but it is!
result += rf"\{c}"
else:
assert c == "r"
result += "\r"
pos += 1
elif c in '!"':
result += c
pos += 1
else:
raise EscapeSyntaxError("stresc", rf"\{c}")
return result, pos
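
A short usage sketch of the new helper, based on the code above. Per the backslash branch, ``pos`` is the offset of the character that follows the backslash; the ``Theta`` result assumes the bundled character data maps it to U+03B8.

```python
# Illustrative usage; offsets follow from the code above.
from mathics_scanner.escape_sequences import parse_escape_sequence

# Named character: pos points just past the backslash, at "[".
value, next_pos = parse_escape_sequence("[Theta]+1", 0)
assert value == "\u03b8" and next_pos == 7  # next_pos is the offset of "+"

# Two-digit hexadecimal escape, as in \.42
value, next_pos = parse_escape_sequence(".42 rest", 0)
assert value == "B" and next_pos == 3
```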
6 changes: 3 additions & 3 deletions mathics_scanner/feed.py
@@ -130,22 +130,22 @@ def empty(self) -> bool:
class SingleLineFeeder(LineFeeder):
"A feeder that feeds all the code as a single line."

def __init__(self, code: str, filename=""):
def __init__(self, source_text: str, filename=""):
"""
:param source_text: The source of the feeder (a string).
:param filename: A string that describes the source of the feeder, i.e.
the filename that is being fed.
"""
super().__init__(filename)
self.code = code
self.source_text = source_text
self._empty = False

def feed(self) -> str:
if self._empty:
return ""
self._empty = True
self.lineno += 1
return self.code
return self.source_text

def empty(self) -> bool:
return self._empty
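
A small sketch of the renamed interface, following the code above (the ``filename`` keyword is shown only for clarity):

```python
from mathics_scanner.feed import SingleLineFeeder

feeder = SingleLineFeeder("x + y", filename="<example>")
assert not feeder.empty()
line = feeder.feed()      # the whole source text comes back as one "line"
assert line == "x + y"
assert feeder.empty()     # a second feed() would return ""
```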
23 changes: 14 additions & 9 deletions mathics_scanner/mathics3_tokens.py
@@ -10,7 +10,7 @@
from mathics_scanner.errors import (
EscapeSyntaxError,
NamedCharacterSyntaxError,
ScanError,
SyntaxError,
)
from mathics_scanner.feed import FileLineFeeder, LineFeeder, SingleLineFeeder
from mathics_scanner.tokeniser import Tokeniser
@@ -162,25 +162,30 @@ def interactive_eval_loop(shell: TerminalShell, code_tokenize_format: bool):
try:
source_text = shell.feed()
tokens(source_text, code_tokenize_format)
except ScanError:
shell.errmsg(
"Syntax",
"sntxi",
"Expression error",
)
pass
except NamedCharacterSyntaxError:
shell.errmsg(
"Syntax",
"sntufn",
"Unknown unicode longname",
)
# This has to come after NamedCharacterSyntaxError
# since that is a subclass of EscapeSyntaxError
except EscapeSyntaxError:
shell.errmsg(
"Syntax",
"sntufn",
"Unknown unicode longname",
)
# This has to come after NamedCharacterSyntaxError and
# EscapeSyntaxError since those are subclasses of
# SyntaxError
except SyntaxError:
shell.errmsg(
"Syntax",
"sntxi",
"Expression error",
)
pass
except KeyboardInterrupt:
print("\nKeyboardInterrupt. Type Ctrl-D (EOF) to exit.")
except EOFError:
@@ -199,7 +204,7 @@ def tokens(code, code_tokenize_format: bool):
while True:
try:
token = tokeniser.next()
except ScanError as scan_error:
except SyntaxError as scan_error:
mess = ""
if scan_error.tag == "sntoct1":
mess = r"3 octal digits are required after \ to construct an 8-bit character"