Missing mention to "rep" in operators list in README #193

Closed
mingodad opened this issue May 2, 2022 · 5 comments
@mingodad
Contributor

mingodad commented May 2, 2022

I'm still trying to extract the peglib grammar (why isn't it already available?) and found that the rep operator it uses is not listed in the README like the others.
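For readers unfamiliar with the operator, here is a minimal sketch of what a bounded repetition such as `[0-9a-fA-F]{4,5}` usually means: match the item greedily, at most `max` times, and succeed only if at least `min` matches were found. This is not peglib's implementation; `Matcher`, `repeat_match`, `hex_digit`, and `kFail` are hypothetical names used only for illustration, and the sketch assumes each item consumes at least one character.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string_view>

// A matcher returns the number of characters consumed at pos, or kFail.
using Matcher = std::function<std::size_t(std::string_view, std::size_t)>;
constexpr std::size_t kFail = std::string_view::npos;

// Greedy bounded repetition: apply `item` up to `max` times and require
// at least `min` successes. Assumes `item` never matches the empty string.
std::size_t repeat_match(const Matcher& item, std::string_view s, std::size_t pos,
                         std::size_t min, std::size_t max) {
    std::size_t count = 0;
    std::size_t i = pos;
    while (count < max) {
        const std::size_t n = item(s, i);
        if (n == kFail || n == 0) break;  // stop on failure (or an empty match)
        i += n;
        ++count;
    }
    return count >= min ? i - pos : kFail;
}

// Example item: a single hexadecimal digit, i.e. [0-9a-fA-F].
std::size_t hex_digit(std::string_view s, std::size_t pos) {
    const bool ok = pos < s.size() &&
                    std::isxdigit(static_cast<unsigned char>(s[pos]));
    return ok ? 1 : kFail;
}
```

Under this reading, `?` corresponds to the bounds (0, 1), while `*` and `+` correspond to (0, unbounded) and (1, unbounded).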

@mingodad
Contributor Author

mingodad commented May 2, 2022

While testing with my extracted grammar, I noticed that the parser rejects a grammar that ends with a comment without a trailing newline. See culebra.peg, or try the snippet below from the README on the playground (note that there is no newline after the last line). Removing the EndOfLine from the Comment rule fixes the problem and doesn't seem to have negative side effects (see below).

KEYWORD   <- 'keyword'
KEYWORDI  <- 'case_insensitive_keyword'
WORD      <-  < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDNET     <-  < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.

Output:

4:83 syntax error

Actual hardcoded grammar:

    g["Comment"] <=
        seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())), g["EndOfLine"]);

Fixed to handle comments not ending in newline:

    g["Comment"] <=
        seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())));
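The difference between the two rules can be sketched without peglib. The hand-written matcher below is only an illustration under that assumption: `match_eol` and `match_comment` are hypothetical helpers, and the `require_eol` flag switches between the original rule (a trailing EndOfLine is mandatory) and the relaxed one.

```cpp
#include <cstddef>
#include <string_view>

// Length of an end-of-line sequence at s[pos], or 0 if there is none.
// Mirrors EndOfLine <- "\r\n" / '\n' / '\r'.
std::size_t match_eol(std::string_view s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    if (s[pos] == '\r') return (pos + 1 < s.size() && s[pos + 1] == '\n') ? 2 : 1;
    if (s[pos] == '\n') return 1;
    return 0;
}

// Matches '#' (!EndOfLine .)* and, when require_eol is true, a trailing
// EndOfLine. Returns the number of characters consumed, or npos on failure.
std::size_t match_comment(std::string_view s, std::size_t pos, bool require_eol) {
    if (pos >= s.size() || s[pos] != '#') return std::string_view::npos;
    std::size_t i = pos + 1;
    while (i < s.size() && match_eol(s, i) == 0) ++i;  // (!EndOfLine .)*
    if (require_eol) {
        const std::size_t n = match_eol(s, i);
        if (n == 0) return std::string_view::npos;     // comment ran into end of input
        i += n;
    }
    return i - pos;
}
```

With `require_eol` set, a comment that runs to the end of the input fails, which matches the syntax error reported at the end of line 4. With the relaxed rule the comment stops just before the newline, and since Space already includes EndOfLine, the surrounding Spacing rule can still consume it.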

@mingodad
Contributor Author

mingodad commented May 2, 2022

OBS.: I edited this message with the latest fully working manually extracted grammar and the EBNF.

Here is the latest extracted grammar. It has trouble parsing Sum ← List(Product, SumOpe), List(I, D) ← I (D I)* and IdentStart <- !"↑" !"⇑" ([a-zA-Z_%] / [\u0080-\uFFFF]). Any help fixing it is appreciated.

# Setup PEG syntax parser
Grammar <-  Spacing  Definition+  EndOfFile

Definition <-
	Ignore  IdentCont  Parameters  LEFTARROW Expression  Instruction?
	/ Ignore  Identifier  LEFTARROW  Expression Instruction?

Expression <-  Sequence  (SLASH  Sequence)*

Sequence <-  (CUT /  Prefix)*

Prefix <-  (AND /  NOT)?  SuffixWithLabel

SuffixWithLabel <- Suffix  (LABEL  Identifier)?

Suffix <-  Primary  Loop?

Loop <-  QUESTION /  STAR /  PLUS /  Repetition

Primary <-
	Ignore  IdentCont  Arguments !LEFTARROW
	/ Ignore  Identifier !(Parameters?  LEFTARROW)
	/ OPEN  Expression  CLOSE
	/ BeginTok  Expression  EndTok
	/ BeginCapScope  Expression  EndCapScope
	/ BeginCap  Expression  EndCap
	/ BackRef
	/ LiteralI
	/ Dictionary
	/ Literal
	/ NegatedClass
	/ Class
	/  DOT

Identifier <-  IdentCont  Spacing

IdentCont <- IdentStart  IdentRest*

IdentStart <-  !"↑"  !"⇑" ([a-zA-Z_%] /  [\u0080-\uFFFF])

IdentRest <-  IdentStart /  [0-9]

Dictionary <-  LiteralD  (PIPE  LiteralD)+

lit_ope <-
	[']  <(![']  Char)*> [']  Spacing
	/ ["]  <(!["]  Char)*> ["]  Spacing

Literal <-  lit_ope

LiteralD <-  lit_ope

LiteralI <-
	[']  <(![']  Char)*>  "'i" Spacing
	/ ["]  <(!["]  Char)*>  '"i' Spacing

# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class <-  '['  !'^' <(!']'  Range)+>  ']' Spacing
NegatedClass <-  "[^" <(!']'  Range)+>  ']' Spacing

Range <-  (Char  '-'  Char) /  Char

Char <-
	'\\'  [nrt'\"[\]\\^]
	/ '\\'  [0-3]  [0-7]  [0-7]
	/ '\\'  [0-7]  [0-7]?
	/ "\\x"  [0-9a-fA-F]  [0-9a-fA-F]?
	/ "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5})
	/ !'\\'   .

Repetition <- BeginBlacket  RepetitionRange  EndBlacket

RepetitionRange <-
	Number  COMMA  Number
	/ Number  COMMA
	/  Number
	/ COMMA  Number

Number <-  [0-9]+  Spacing

LEFTARROW <-  ("<-" / "←")  Spacing

~SLASH <-  '/'  Spacing
~PIPE <-  '|'  Spacing
AND <-  '&'  Spacing
NOT <-  '!'  Spacing
QUESTION <- '?'  Spacing
STAR <-  '*'  Spacing
PLUS <-  '+'  Spacing
~OPEN <-  '('  Spacing
~CLOSE <- ')'  Spacing
DOT <-  '.'  Spacing

CUT <-  "↑"  Spacing
~LABEL <-  ('^' /  "⇑")  Spacing

~Spacing <-  (Space /  Comment)*
Comment <- '#'  (!EndOfLine   . )*
Space <-  ' ' /  '\t' /  EndOfLine
EndOfLine <-  "\r\n" /  '\n' /  '\r'
EndOfFile <-  ! .

~BeginTok <-  '<'  Spacing
~EndTok <-  '>'  Spacing

~BeginCapScope <-  '$'  '('  Spacing
~EndCapScope <-  ')'  Spacing

BeginCap <-  '$'  <IdentCont>  '<'  Spacing
~EndCap <-  '>'  Spacing

BackRef <-  '$'  <IdentCont>  Spacing

IGNORE <-  '~'

Ignore <-  IGNORE?
Parameters <-  OPEN  Identifier (COMMA  Identifier)*  CLOSE
Arguments <-  OPEN  Expression (COMMA  Expression)*  CLOSE
~COMMA <-  ','  Spacing

# Instruction grammars
Instruction <-
	BeginBlacket (InstructionItem  (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem <- PrecedenceClimbing /  ErrorMessage /  NoAstOpt
~InstructionItemSeparator <-  ';'  Spacing

~SpacesZom <-  Space*
~SpacesOom <-  Space+
~BeginBlacket <-  '{'  Spacing
~EndBlacket <-  '}'  Spacing

# PrecedenceClimbing instruction
PrecedenceClimbing <- "precedence"  SpacesOom  PrecedenceInfo (SpacesOom  PrecedenceInfo)*  SpacesZom
PrecedenceInfo <- PrecedenceAssoc (~SpacesOom  PrecedenceOpe)+
PrecedenceOpe <-
	['] <(!(Space /  ['])  Char)*> [']
	/ ["] <(!(Space /  ["])  Char)*> ["]
	/ <(!(PrecedenceAssoc /  Space /  '}')  . )+>
PrecedenceAssoc <-  [LR]

# Error message instruction
ErrorMessage <- "message"  SpacesOom  LiteralD  SpacesZom

# No Ast node optimization instruction
NoAstOpt <-  "no_ast_opt"  SpacesZom

And here it is converted to EBNF, to be viewed at https://www.bottlecaps.de/rr/ui:

//# Setup PEG syntax parser
Grammar::=  Spacing  Definition+  EndOfFile

Definition::=
	Ignore  IdentCont  Parameters  LEFTARROW Expression  Instruction?
	| Ignore  Identifier  LEFTARROW  Expression Instruction?

Expression::=  Sequence  (SLASH  Sequence)*

Sequence::=  (CUT |  Prefix)*

Prefix::=  (AND |  NOT)?  SuffixWithLabel

SuffixWithLabel::= Suffix  (LABEL  Identifier)?

Suffix::=  Primary  Loop?

Loop::=  QUESTION |  STAR |  PLUS |  Repetition

Primary::=
	Ignore  IdentCont  Arguments _NOT_ LEFTARROW
	| Ignore  Identifier _NOT_ (Parameters?  LEFTARROW)
	| OPEN  Expression  CLOSE
	| BeginTok  Expression  EndTok
	| BeginCapScope  Expression  EndCapScope
	| BeginCap  Expression  EndCap
	|  BackRef
	| LiteralI
	|  Dictionary
	|  Literal
	|  NegatedClass
	| Class
	|  DOT

Identifier::=  IdentCont  Spacing

IdentCont::= IdentStart  IdentRest*

IdentStart::=  _NOT_ "↑"  _NOT_ "⇑" ([a-zA-Z_%] |  [\u0080-\uFFFF])

IdentRest::=  IdentStart |  [0-9]

Dictionary::=  LiteralD  (PIPE  LiteralD)+

lit_ope::=
	['] _TKOPEN_ (_NOT_ [']  Char)* _TKCLOSE_ [']  Spacing
	| ["] _TKOPEN_ (_NOT_ ["]  Char)* _TKCLOSE_ ["]  Spacing

Literal::=  lit_ope

LiteralD::=  lit_ope

LiteralI::=
	['] _TKOPEN_ (_NOT_ [']  Char)* _TKCLOSE_  "'i" Spacing
	| ["] _TKOPEN_ (_NOT_ ["]  Char)* _TKCLOSE_  '"i' Spacing

//# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class::=  '['  _NOT_ '^' _TKOPEN_ ( _NOT_ ']'  Range)+ _TKCLOSE_  ']' Spacing
NegatedClass::=  "[^" _TKOPEN_ ( _NOT_ ']'  Range)+ _TKCLOSE_  ']' Spacing

Range::=  (Char  '-'  Char) |  Char

Char::=
	'\\'  [nrt'\"#x5B\#x5d\\^]
	| '\\'  [0-3]  [0-7]  [0-7]
	| '\\'  [0-7]  [0-7]?
	| "\\x"  [0-9a-fA-F]  [0-9a-fA-F]?
	| "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]'{4,4}') / [0-9a-fA-F]'{4,5}')
	| _NOT_ '\\'   .

Repetition::= BeginBlacket  RepetitionRange  EndBlacket

RepetitionRange::=
	Number  COMMA  Number
	| Number  COMMA
	|  Number
	| COMMA  Number

Number::=  [0-9]+  Spacing

LEFTARROW::=  ("<-" | "←")  Spacing

/*~*/SLASH::=  '/'  Spacing
/*~*/PIPE::=  '|'  Spacing
AND::=  '&'  Spacing
NOT::=  '!'  Spacing
QUESTION::= '?'  Spacing
STAR::=  '*'  Spacing
PLUS::=  '+'  Spacing
/*~*/OPEN::=  '('  Spacing
/*~*/CLOSE::= ')'  Spacing
DOT::=  '.'  Spacing

CUT::=  "↑"  Spacing
/*~*/LABEL::=  ('^' |  "⇑")  Spacing

/*~*/Spacing::=  (Space |  Comment)*
Comment::= '#'  (_NOT_ EndOfLine   . )*
Space::=  ' ' |  '\t' |  EndOfLine
EndOfLine::=  "\r\n" |  '\n' |  '\r'
EndOfFile::=  _NOT_  .

/*~*/BeginTok::=  '<'  Spacing
/*~*/EndTok::=  '>'  Spacing

/*~*/BeginCapScope::=  '$'  '('  Spacing
/*~*/EndCapScope::=  ')'  Spacing

BeginCap::=  '$' _TKOPEN_ IdentCont _TKCLOSE_  '<'  Spacing
/*~*/EndCap::=  '>'  Spacing

BackRef::=  '$' _TKOPEN_ IdentCont _TKCLOSE_  Spacing

IGNORE::=  '~'

Ignore::=  IGNORE?
Parameters::=  OPEN  Identifier (COMMA  Identifier)*  CLOSE
Arguments::=  OPEN  Expression (COMMA  Expression)*  CLOSE
/*~*/COMMA::=  ','  Spacing

//# Instruction grammars
Instruction::=
	BeginBlacket (InstructionItem  (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem::= PrecedenceClimbing |  ErrorMessage |  NoAstOpt
/*~*/InstructionItemSeparator::=  ';'  Spacing

/*~*/SpacesZom::=  Space*
/*~*/SpacesOom::=  Space+
/*~*/BeginBlacket::=  '{'  Spacing
/*~*/EndBlacket::=  '}'  Spacing

//# PrecedenceClimbing instruction
PrecedenceClimbing::= "precedence"  SpacesOom  PrecedenceInfo (SpacesOom  PrecedenceInfo)*  SpacesZom
PrecedenceInfo::= PrecedenceAssoc (/*~*/SpacesOom  PrecedenceOpe)+
PrecedenceOpe::=
	['] _TKOPEN_ (_NOT_ (Space |  ['])  Char)* _TKCLOSE_ [']
	| ["] _TKOPEN_ (_NOT_ (Space |  ["])  Char)* _TKCLOSE_ ["]
	| _TKOPEN_ (_NOT_ (PrecedenceAssoc |  Space |  '}')  . )+ _TKCLOSE_
PrecedenceAssoc::=  [LR]

//# Error message instruction
ErrorMessage::= "message"  SpacesOom  LiteralD  SpacesZom

//# No Ast node optimization instruction
NoAstOpt::=  "no_ast_opt"  SpacesZom

//Tokens added for EBNF
_NOT_ ::= '!'
_TKOPEN_ ::= '<'
_TKCLOSE_ ::= '>'

@mingodad
Contributor Author

mingodad commented May 2, 2022

One of the problematic rules is this one (which is hardcoded):

"\\u" ('0'  [0-9a-fA-F] /  "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}

When replaced by:

"\\u" [0-9a-fA-F]{4,5}

Then it passes parsing IdentStart <- [\u0080-\uFFFF]. Another problem I found, which was my fault, was COMMA <- ' ' Spacing: in one of my search-and-replace passes I wiped out the ,.

I fixed it and updated my previous post with the working grammar and EBNF, but I'm still puzzled by this expression: "\\u" ('0' [0-9a-fA-F] / "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}

@mingodad
Contributor Author

mingodad commented May 2, 2022

Looking carefully again, I found my mistake in manually converting this expression: "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5}) (shown here fixed and correct).
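The grouping matters because in PEG a sequence binds tighter than `/`. Without the outer parentheses, the expression reads as ("\\u" prefix hex{4,4}) / hex{4,5}, so the second alternative no longer starts with "\\u", which is presumably why [\u0080-\uFFFF] failed to parse. The sketch below compares both readings with hand-written matchers; it is an illustration only, not peglib code, and all function names are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Counts consecutive hex digits starting at pos, stopping after max_count.
std::size_t count_hex(std::string_view s, std::size_t pos, std::size_t max_count) {
    std::size_t n = 0;
    while (n < max_count && pos + n < s.size() &&
           std::isxdigit(static_cast<unsigned char>(s[pos + n]))) {
        ++n;
    }
    return n;
}

// (('0' hex) / "10") hex{4,4}: returns 6 on success, 0 on failure.
std::size_t match_long_form(std::string_view s, std::size_t pos) {
    if (pos + 1 >= s.size()) return 0;
    const bool zero_hex = s[pos] == '0' &&
                          std::isxdigit(static_cast<unsigned char>(s[pos + 1]));
    const bool ten = s[pos] == '1' && s[pos + 1] == '0';
    if (!zero_hex && !ten) return 0;
    return count_hex(s, pos + 2, 4) == 4 ? 6 : 0;
}

// hex{4,5}: returns 4 or 5 on success, 0 on failure.
std::size_t match_short_form(std::string_view s, std::size_t pos) {
    const std::size_t n = count_hex(s, pos, 5);
    return n >= 4 ? n : 0;
}

// Intended grouping: "\\u" (long_form / short_form)
std::size_t match_grouped(std::string_view s) {
    if (s.substr(0, 2) != "\\u") return 0;
    std::size_t n = match_long_form(s, 2);
    if (n == 0) n = match_short_form(s, 2);  // ordered choice: try the second alternative
    return n ? n + 2 : 0;
}

// Reading without the outer parentheses: ("\\u" long_form) / short_form
std::size_t match_ungrouped(std::string_view s) {
    if (s.substr(0, 2) == "\\u") {
        const std::size_t n = match_long_form(s, 2);
        if (n) return n + 2;
    }
    return match_short_form(s, 0);  // no "\\u" prefix in this alternative
}
```

Here `match_grouped` accepts the six characters of `\u0080`, while `match_ungrouped` rejects them: the long form needs six hex digits after the prefix, and the fallback alternative expects the input to begin with a hex digit rather than a backslash.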

So this manual conversion of the hardcoded grammar in peglib.h turned up two problems: a Comment without a newline at the end of the grammar is rejected, and the rep operator is missing from the README.

Thanks for all the help!

@yhirose
Owner

yhirose commented May 2, 2022

Thanks for the report. I just added rep to the operator table in the README.
