Skip to content

Fix antlr/antlr4#1890 Standalone CR should be recognized as line separator#58

Closed
nixel2007 wants to merge 2 commits intotunnelvisionlabs:masterfrom
nixel2007:feature/standalone-cr
Closed

Fix antlr/antlr4#1890 Standalone CR should be recognized as line separator#58
nixel2007 wants to merge 2 commits intotunnelvisionlabs:masterfrom
nixel2007:feature/standalone-cr

Conversation

@nixel2007
Copy link

@nixel2007 nixel2007 commented Nov 22, 2019

Hello!

I've made changes to treat stand-alone CR as line separator.
This changes were tested on millions of LOC with-out any noticable perfomance regress. Also I use my fork to parse files everyday as a part of static analysis process.

P.S. Link to my PR to original ANTLR4 repo opened half an year ago...

@daniellansun
Copy link
Contributor

The builds on CI servers failed

@nixel2007
Copy link
Author

nixel2007 commented Nov 22, 2019

Hi, @danielsun1106

I see an error in build logs:

testCharSetRange(org.antlr.v4.test.tool.TestLexerExec)  Time elapsed: 0.078 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<...0]
[@1,4:5='34',<1>,[1:4]
[@2,7:8='a2',<2>,1:7]
[@3,10:12='abc',<2>,1:10]
[@4,18:17='<EOF>',<-1>,2]:3]
> but was:<...0]
[@1,4:5='34',<1>,[2:1]
[@2,7:8='a2',<2>,2:4]
[@3,10:12='abc',<2>,2:7]
[@4,18:17='<EOF>',<-1>,3]:3]
>
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.antlr.v4.test.tool.TestLexerExec.testCharSetRange(TestLexerExec.java:497)

But I don't understand what exactly numbers after the second comma mean. Could you guide me about this a little?

The first number is a token type, i suggest. And the second and the third ones?

@daniellansun
Copy link
Contributor

@nixel2007 I can not remember its meaning either... @sharwell should be able to tell us ;-)

@sharwell
Copy link
Member

sharwell commented Nov 22, 2019

[@2,7:8='a2',<2>,2:4]
Item Meaning
@2 Token index (this in the third token)
7:8 Token span (character position) in the original input
a2 Token text
<2> Token type
2:4 Starting line:column for the token

Note that I'm not planning to take this change. It should be possible to derive from the existing types in the library if you need to customize the token positioning mechanism. (It would have been better if tokens didn't track this information at all, but it's too late to remove now.)

@nixel2007
Copy link
Author

nixel2007 commented Nov 22, 2019

a-ha, test string contains standalone CR, so the second 34 token is placed to the 2nd line. so, new numbers is correct.

String found = execLexer("L.g4", grammar, "L", "34\r 34 a2 abc \n ");

It should be possible to derive from the existing types in the library if you need to customize the token positioning mechanism.

could you point me to this classes/methods? I haven't digged it a lot, but I suppose that if I'll try to change token position somewhere else than LexerATNSimulator#consume, I will have to store line/character offsets and change them for every token rather than one-time change in consume. Not good for RAM neither CPU.

@sharwell
Copy link
Member

@nixel2007 LexerATNSimulator.consume is a virtual method. Create your own type derived from that, and override consume to handle this case.

@nixel2007
Copy link
Author

nixel2007 commented Jan 19, 2020

Create your own type derived from that, and override consume to handle this case.

Did as you said, thank you.
One small question - is there any option to override default constructor from g4 file?
The only way I've found for this is a constructor with another signature:

1c-syntax/bsl-parser@86bbfc5#diff-9f09662df115bb4cb45e8ad2afc23bbdR25

@sharwell
Copy link
Member

You can generate your grammar as abstract if you wish to provide a different constructor in code. I forget the exact command line syntax but it's handled by this code:

public boolean isAbstract() {
return Boolean.parseBoolean(getOptionString("abstract"));
}

@nixel2007
Copy link
Author

Ok, i'll try it. thank you!

@nixel2007 nixel2007 closed this Jan 19, 2020
@nrmancuso
Copy link

@nixel2007 did you end up successfully overriding the LexerATNSimulator constructor by generating your grammar as abstract to solve this issue? Do you have a working example that you can point me to?

@nixel2007
Copy link
Author

@nmancus1 yep. I've created CRAwareLexerATNSimulator (https://github.com/1c-syntax/bsl-parser/blob/master/src/main/java/com/github/_1c_syntax/bsl/parser/CRAwareLexerATNSimulator.java) and pass it to lexer in members block - https://github.com/1c-syntax/bsl-parser/blob/855158d4e7948326f76cf32b33ebd16c580896be/src/main/antlr/BSLLexer.g4#L29-L35. no need to make grammar abstract.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants