ANTLR grammar based C++ language module #937

SirYwell · 2023-02-15T22:09:04Z

This Pull Request adds an alternative C++ language module based on the ANTLR grammar from https://github.com/antlr/grammars-v4/tree/master/cpp.

To avoid naming clashes, this language module and its base package are currently called cpp2.

Notes:

The current minimumTokenMatch is 12, I'm not sure if that should be changed.
The ELSE token is currently extracted before the body of the if is visited - I added a disabled test that shows this issue. If you know a good solution for that, please let me know.
I used multiple single-line comments, as spotless rearranges multi-line comments. I'd like to keep the formatting the way it is.

brodmo

Die vielen else if etwa in enterJumpStatement könnte man durch ein eleganteres Konstrukt ersetzen. Mein Vorschlag wäre, dass man in den enter/exit Methoden wenn es geht nur noch Maps erstellt und die eigentliche Logik in eine Hilfsmethode auslagert. Sähe dann so aus:

    public void enterJumpStatement(CPP14Parser.JumpStatementContext context) {
        processContextEnterOrExit(context, Map.of(
                context.Break(), CPPTokenType.BREAK,
                context.Continue(), CPPTokenType.CONTINUE,
                context.Goto(), CPPTokenType.GOTO,
                context.Return(), CPPTokenType.RETURN
        ));
    }

    private void processContextEnterOrExit(ParserRuleContext context, Map<TerminalNode, CPPTokenType> tokenTypeMap) {
        for (Map.Entry<TerminalNode, CPPTokenType> tokenTypeEntry: tokenTypeMap.entrySet()) {
            if (tokenTypeEntry.getKey() != null) {
                addEnter(tokenTypeEntry.getValue(), context.getStart());
                return;
            }
        }
    }

Methoden wie enterSelectionStatement könnte man so wohl auch entschachteln.

Sonst denke ich es würde zur Leserlichkeit beitragen wenn man die Contexts einzeln importieren würde, sodass man im Code statt etwa CPP14Parser.FunctionDefinitionContext nur noch FunctionDefinitionContext stehen hat. Dann wäre es auch konsistent mit dem TokenGeneratingTreeScanner im Java Sprachmodul, da wird es auch so gemacht.

tsaglam

Regarding the default minimum token match, I would say leave it as is, we may adapt that in the future. I will have a look at the if-else problem and see if I can find a solution.

languages/cpp2/src/main/java/de/jplag/cpp2/CPPLanguage.java

languages/cpp2/src/main/java/de/jplag/cpp2/CPPParserAdapter.java

languages/cpp2/src/main/java/de/jplag/cpp2/CPPTokenListener.java

languages/cpp2/src/main/java/de/jplag/cpp2/CPPTokenType.java

languages/cpp2/src/test/java/de/jplag/cpp2/LanguageTest.java

tsaglam · 2023-02-20T15:20:04Z

languages/cpp2/src/main/java/de/jplag/cpp2/CPPTokenListener.java

+    @Override
+    public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {
+        if (context.Switch() != null) {
+            addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());
+        } else if (context.If() != null) {
+            addEnter(CPPTokenType.IF_BEGIN, context.getStart());
+            if (context.Else() != null) {
+                addEnter(CPPTokenType.ELSE, context.Else().getSymbol());
+            }
+        }
+    }
+
+    @Override
+    public void exitSelectionStatement(CPP14Parser.SelectionStatementContext context) {
+        if (context.Switch() != null) {
+            addEnter(CPPTokenType.SWITCH_END, context.getStop());
+        } else if (context.If() != null) {
+            addEnter(CPPTokenType.IF_END, context.getStop());
+        }
+    }


The grammar is the problem that leads to the weird token order. In the grammar, an if-else combination is represented by a single selection statement, IF statement1 ELSE statement2. Thus, something like this:

if(...) a else if (...) b else c

is depicted as two nested selection statements:
IF a ELSE (IF b ELSE c)

Now, this is where things get interresting. The visitor order is not how we read this statement. The visit order is like this:

enter outer selection statement

enter inner statement

exit inner selection statement

exit outer selection statement

However, as the else is part of the selection statement, it will be either visited on entering the statement (with the if keyword and thus before the if body) or when exiting the selection statement (after the else body).

The trick to solving this is utilizing the enter/exitBlockDeclaration method and tracking the last relevant keyword. This is done in some of the newer language modules. This solves not only the token order issue but also the issue that your if-end tokens are currently not placed correctly! They are all at the end after both selection statements. They are not closing off their if blocks.

Rough solution (without the required imports or changes to token types), fixing token order, if-end placement and missing else_end:

Click to show...

Suggested change

@Override

public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {

if (context.Switch() != null) {

addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());

} else if (context.If() != null) {

addEnter(CPPTokenType.IF_BEGIN, context.getStart());

if (context.Else() != null) {

addEnter(CPPTokenType.ELSE, context.Else().getSymbol());

}

}

}

@Override

public void exitSelectionStatement(CPP14Parser.SelectionStatementContext context) {

if (context.Switch() != null) {

addEnter(CPPTokenType.SWITCH_END, context.getStop());

} else if (context.If() != null) {

addEnter(CPPTokenType.IF_END, context.getStop());

}

}

@Override

public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {

if (context.Switch() != null) {

addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());

trackedState.add(CPPTokenType.SWITCH_END);

} else if (context.If() != null) {

addEnter(CPPTokenType.IF_BEGIN, context.getStart());

trackedState.add(CPPTokenType.IF_END);

if (context.Else() != null) {

trackedState.add(CPPTokenType.ELSE);

trackedState.add(CPPTokenType.ELSE_END);

lastElseToken = context.Else().getSymbol();

}

}

}

static Deque<CPPTokenType> trackedState = new ArrayDeque<>();

static Token lastElseToken = null;

@Override

public void enterStatement(StatementContext context) {

if (context.getParent().getRuleIndex() == CPP14Parser.RULE_selectionStatement) {

if (trackedState.peekFirst() == CPPTokenType.ELSE) {

addEnter(trackedState.removeFirst(), lastElseToken);

}

}

}

@Override

public void exitStatement(StatementContext context) {

if (context.getParent().getRuleIndex() == CPP14Parser.RULE_selectionStatement) {

if (trackedState.peekFirst() == CPPTokenType.ELSE_END) {

addExit(trackedState.removeFirst(), context.getStop());

}

if (trackedState.peekFirst() == CPPTokenType.IF_END) {

addExit(trackedState.removeFirst(), context.getStop());

}

}

}

This is how the tokens are then extracted:

1 void a(int a, int b, int x, int y) { |FUNCTION{ |VARDEF|VARDEF|VARDEF|VARDEF 2 if (a < b) { |IF{ 3 {x = 5;} |ASSIGN 4 } else if (a > b) { |}IF |ELSE{ |IF{ 5 y = 10; |ASSIGN 6 x = y + b; |ASSIGN 7 } else { |}ELSE |}IF |ELSE{ 8 y = -20; |ASSIGN 9 } |}ELSE 10 } |}FUNCTION |EOF

Thanks, I hoped to avoid additional state.

This solves not only the token order issue but also the issue that your if-end tokens are currently not placed correctly!

I assume you're referring to the fact that multiple }if tokens are always placed at the very end? This is also the case in the Java language module currently.

In your code snippet, you're extracting a token for the end of an else, which currently isn't extracted in the Java module, but I assume it makes sense to change that too if the }if token is changed?

Ah, even more issues in the Java module unearthed. Okay, then let me investigate first if the if-end and the else-end have an impact on the detection quality. If not, we do not need to fix it. I just assumed it would be correct in the Java module. I will have a look at it. Interesting!

@SirYwell you are right, java does not place the if tokens correctly either:

7 if (args == null) { |IF{ 9 } else if (args.length > 1) { |ELSE |IF{ 11 } else { |ELSE 13 } |}if |}if

As a result, leave it as the Java module does it now, I can always change it in the future. However, please ensure the else tokens are correctly placed. If the else token comes before the tokens of the if block, then we cannot differ between code in the if block and code in the else block.

EDIT: Also, sonar found one code smell. Just rename the local variable tokens to tokenStream.

SirYwell · 2023-02-20T20:40:38Z

Die vielen else if etwa in enterJumpStatement könnte man durch ein eleganteres Konstrukt ersetzen.

Der konkrete Vorschlag würde so nicht funktionieren, weil die Map keine null Keys erlaubt (und auch sonst maximal ein mal null vorkommen dürfte), aber man kann key und value natürlich tauschen.

Mein Vorschlag wäre stattdessen jedoch folgender:

    record Extraction<T>(Predicate<T> nonNullTest, TokenType toExtract) {

        static <T> Extraction<T> of(Function<T, ?> contextToAnything, TokenType toExtract) {
            return new Extraction<>(t -> Objects.nonNull(contextToAnything.apply(t)), toExtract);
        }
    }

    private static final List<Extraction<JumpStatementContext>> JUMP_STATEMENT_TOKENS = List.of(
            Extraction.of(JumpStatementContext::Break, BREAK),
            Extraction.of(JumpStatementContext::Continue, CONTINUE), 
            Extraction.of(JumpStatementContext::Goto, GOTO),
            Extraction.of(JumpStatementContext::Return, RETURN)
    );
    
    @Override
    public void enterJumpStatement(JumpStatementContext context) {
        extractFirstNonNull(context, context.getStart(), JUMP_STATEMENT_TOKENS);
    }
    
    private <T> void extractFirstNonNull(T context, Token token, List<Extraction<T>> extractions) {
        for (Extraction<? super T> extraction : extractions) {
            if (extraction.nonNullTest().test(context)) {
                addEnter(extraction.toExtract(), token);
                return;
            }
        }
    }

Dadurch muss man nicht jedes Mal eine Collection konstruieren. Was hältst du davon @mbrdl?

Sonst denke ich es würde zur Leserlichkeit beitragen wenn man die Contexts einzeln importieren würde

Das habe ich mal gemacht.
(Im Java-Sprachmodul sind die Elemente top level interfaces, wodurch dieses Problem gar nicht auftritt. Andere ANTLR basierten Sprachmodule nutzen nur teilweise statischen Importe für die Klassen)

SirYwell · 2023-03-14T12:49:21Z

I updated the PR and included the requested changes. The if/else extraction should now behave like the Java language module.

tsaglam · 2023-03-14T15:00:41Z

@SirYwell can you fix the two sonar issues, then I can merge.

sonarcloud · 2023-03-14T16:29:55Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

SirYwell · 2023-03-14T16:33:30Z

@tsaglam Sonarcloud seems to be happy now :)

SirYwell added 26 commits February 4, 2023 12:51

Initial work on new C++ module

6ee744c

Mark new C++ module as Experimental

215202d

Current LM state

170b39a

Extract more tokens

4297d37

Extract assignments for ++ and --

e6f1d34

Extract assign in initialization contexts

8102700

Improve extraction of VARDEF and APPLY

9d9f718

Extract VARDEF for function parameters

eb9a76f

Fix APPLY extraction

ccde4d4

Clean up extraction code and tests

df0234a

Add more APPLY/VARDEF tests and fixes

415dc3c

Apply spotless

ceb98c6

Remove unneeded code

d61af59

Update comment

5f31cd3

Initial README design based on other language modules

950913b

Cleanup imports

08472c1

Address review comments

58cb140

Apply spotless

8a88b1d

Extract union type declarations

9c84d42

Extract braced inits and fix token positions

b5714fb

Do not extract block start/end tokens

be976f3

Increase CLI language counter

faf0833

Clean up no longer needed workaround

26f3df9

Remove unneeded special case for column selection

fa6fe71

Clean up tests

c3a56c4

Import de.jplag.Language

0e25055

tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change language PR / Issue deals (partly) with new and/or existing languages for JPlag labels Feb 16, 2023

tsaglam self-requested a review February 16, 2023 07:07

tsaglam requested a review from brodmo February 16, 2023 09:53

brodmo reviewed Feb 19, 2023

View reviewed changes

tsaglam requested changes Feb 20, 2023

View reviewed changes

tsaglam reviewed Feb 20, 2023

View reviewed changes

SirYwell added 2 commits February 20, 2023 20:56

Add more Javadocs and import token types statically

a258c59

Import ANTLR rule context classes statically

5d25116

SirYwell added 3 commits March 14, 2023 10:12

Rework extraction design to decrease duplicated code

f1e6a47

Rename shadowing variable

979e4ff

Extract else at the end of the if block

8c57c80

Merge nested ifs

315e492

tsaglam approved these changes Mar 14, 2023

View reviewed changes

tsaglam merged commit 8659517 into jplag:develop Mar 14, 2023

tsaglam mentioned this pull request May 16, 2023

Merge side branch of develop into main #1083

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ANTLR grammar based C++ language module #937

ANTLR grammar based C++ language module #937

SirYwell commented Feb 15, 2023

brodmo left a comment

tsaglam left a comment

tsaglam Feb 20, 2023 •

edited

Loading

SirYwell Feb 20, 2023

tsaglam Feb 20, 2023

tsaglam Feb 24, 2023 •

edited

Loading

SirYwell commented Feb 20, 2023

SirYwell commented Mar 14, 2023

tsaglam commented Mar 14, 2023

sonarcloud bot commented Mar 14, 2023

SirYwell commented Mar 14, 2023

ANTLR grammar based C++ language module #937

ANTLR grammar based C++ language module #937

Conversation

SirYwell commented Feb 15, 2023

brodmo left a comment

Choose a reason for hiding this comment

tsaglam left a comment

Choose a reason for hiding this comment

tsaglam Feb 20, 2023 • edited Loading

Choose a reason for hiding this comment

SirYwell Feb 20, 2023

Choose a reason for hiding this comment

tsaglam Feb 20, 2023

Choose a reason for hiding this comment

tsaglam Feb 24, 2023 • edited Loading

Choose a reason for hiding this comment

SirYwell commented Feb 20, 2023

SirYwell commented Mar 14, 2023

tsaglam commented Mar 14, 2023

sonarcloud bot commented Mar 14, 2023

SirYwell commented Mar 14, 2023

tsaglam Feb 20, 2023 •

edited

Loading

tsaglam Feb 24, 2023 •

edited

Loading