Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANTLR grammar based C++ language module #937

Merged
merged 32 commits into from
Mar 14, 2023

Conversation

SirYwell
Copy link
Contributor

This Pull Request adds an alternative C++ language module based on the ANTLR grammar from https://github.com/antlr/grammars-v4/tree/master/cpp.

To avoid naming clashes, this language module and its base package are currently called cpp2.

Notes:

  • The current minimumTokenMatch is 12, I'm not sure if that should be changed.
  • The ELSE token is currently extracted before the body of the if is visited - I added a disabled test that shows this issue. If you know a good solution for that, please let me know.
  • I used multiple single-line comments, as spotless rearranges multi-line comments. I'd like to keep the formatting the way it is.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change language PR / Issue deals (partly) with new and/or existing languages for JPlag labels Feb 16, 2023
@tsaglam tsaglam self-requested a review February 16, 2023 07:07
@tsaglam tsaglam requested a review from brodmo February 16, 2023 09:53
Copy link
Contributor

@brodmo brodmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Die vielen else if etwa in enterJumpStatement könnte man durch ein eleganteres Konstrukt ersetzen. Mein Vorschlag wäre, dass man in den enter/exit Methoden wenn es geht nur noch Maps erstellt und die eigentliche Logik in eine Hilfsmethode auslagert. Sähe dann so aus:

    public void enterJumpStatement(CPP14Parser.JumpStatementContext context) {
        processContextEnterOrExit(context, Map.of(
                context.Break(), CPPTokenType.BREAK,
                context.Continue(), CPPTokenType.CONTINUE,
                context.Goto(), CPPTokenType.GOTO,
                context.Return(), CPPTokenType.RETURN
        ));
    }

    private void processContextEnterOrExit(ParserRuleContext context, Map<TerminalNode, CPPTokenType> tokenTypeMap) {
        for (Map.Entry<TerminalNode, CPPTokenType> tokenTypeEntry: tokenTypeMap.entrySet()) {
            if (tokenTypeEntry.getKey() != null) {
                addEnter(tokenTypeEntry.getValue(), context.getStart());
                return;
            }
        }
    }

Methoden wie enterSelectionStatement könnte man so wohl auch entschachteln.

Sonst denke ich es würde zur Leserlichkeit beitragen wenn man die Contexts einzeln importieren würde, sodass man im Code statt etwa CPP14Parser.FunctionDefinitionContext nur noch FunctionDefinitionContext stehen hat. Dann wäre es auch konsistent mit dem TokenGeneratingTreeScanner im Java Sprachmodul, da wird es auch so gemacht.

Copy link
Member

@tsaglam tsaglam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the default minimum token match, I would say leave it as is, we may adapt that in the future. I will have a look at the if-else problem and see if I can find a solution.

Comment on lines 98 to 117
@Override
public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {
if (context.Switch() != null) {
addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());
} else if (context.If() != null) {
addEnter(CPPTokenType.IF_BEGIN, context.getStart());
if (context.Else() != null) {
addEnter(CPPTokenType.ELSE, context.Else().getSymbol());
}
}
}

@Override
public void exitSelectionStatement(CPP14Parser.SelectionStatementContext context) {
if (context.Switch() != null) {
addEnter(CPPTokenType.SWITCH_END, context.getStop());
} else if (context.If() != null) {
addEnter(CPPTokenType.IF_END, context.getStop());
}
}
Copy link
Member

@tsaglam tsaglam Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grammar is the problem that leads to the weird token order. In the grammar, an if-else combination is represented by a single selection statement, IF statement1 ELSE statement2. Thus, something like this:

if(...)
    a
else if (...)
    b
else
    c

is depicted as two nested selection statements:
IF a ELSE (IF b ELSE c)

Now, this is where things get interresting. The visitor order is not how we read this statement. The visit order is like this:

  1. enter outer selection statement
  2. enter inner statement
  3. exit inner selection statement
  4. exit outer selection statement

However, as the else is part of the selection statement, it will be either visited on entering the statement (with the if keyword and thus before the if body) or when exiting the selection statement (after the else body).

The trick to solving this is utilizing the enter/exitBlockDeclaration method and tracking the last relevant keyword. This is done in some of the newer language modules. This solves not only the token order issue but also the issue that your if-end tokens are currently not placed correctly! They are all at the end after both selection statements. They are not closing off their if blocks.

Rough solution (without the required imports or changes to token types), fixing token order, if-end placement and missing else_end:

Click to show...
Suggested change
@Override
public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {
if (context.Switch() != null) {
addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());
} else if (context.If() != null) {
addEnter(CPPTokenType.IF_BEGIN, context.getStart());
if (context.Else() != null) {
addEnter(CPPTokenType.ELSE, context.Else().getSymbol());
}
}
}
@Override
public void exitSelectionStatement(CPP14Parser.SelectionStatementContext context) {
if (context.Switch() != null) {
addEnter(CPPTokenType.SWITCH_END, context.getStop());
} else if (context.If() != null) {
addEnter(CPPTokenType.IF_END, context.getStop());
}
}
@Override
public void enterSelectionStatement(CPP14Parser.SelectionStatementContext context) {
if (context.Switch() != null) {
addEnter(CPPTokenType.SWITCH_BEGIN, context.getStart());
trackedState.add(CPPTokenType.SWITCH_END);
} else if (context.If() != null) {
addEnter(CPPTokenType.IF_BEGIN, context.getStart());
trackedState.add(CPPTokenType.IF_END);
if (context.Else() != null) {
trackedState.add(CPPTokenType.ELSE);
trackedState.add(CPPTokenType.ELSE_END);
lastElseToken = context.Else().getSymbol();
}
}
}
static Deque<CPPTokenType> trackedState = new ArrayDeque<>();
static Token lastElseToken = null;
@Override
public void enterStatement(StatementContext context) {
if (context.getParent().getRuleIndex() == CPP14Parser.RULE_selectionStatement) {
if (trackedState.peekFirst() == CPPTokenType.ELSE) {
addEnter(trackedState.removeFirst(), lastElseToken);
}
}
}
@Override
public void exitStatement(StatementContext context) {
if (context.getParent().getRuleIndex() == CPP14Parser.RULE_selectionStatement) {
if (trackedState.peekFirst() == CPPTokenType.ELSE_END) {
addExit(trackedState.removeFirst(), context.getStop());
}
if (trackedState.peekFirst() == CPPTokenType.IF_END) {
addExit(trackedState.removeFirst(), context.getStop());
}
}
}

This is how the tokens are then extracted:

1       void a(int a, int b, int x, int y) {
        |FUNCTION{
               |VARDEF|VARDEF|VARDEF|VARDEF

2           if (a < b) {
            |IF{

3               {x = 5;}
                   |ASSIGN

4           } else if (a > b) {
            |}IF
              |ELSE{
                   |IF{

5               y = 10;
                  |ASSIGN

6               x = y + b;
                  |ASSIGN

7           } else {
            |}ELSE
            |}IF
              |ELSE{

8               y = -20;
                  |ASSIGN

9           }
            |}ELSE

10      }
        |}FUNCTION
        |EOF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I hoped to avoid additional state.

This solves not only the token order issue but also the issue that your if-end tokens are currently not placed correctly!

I assume you're referring to the fact that multiple }if tokens are always placed at the very end? This is also the case in the Java language module currently.

In your code snippet, you're extracting a token for the end of an else, which currently isn't extracted in the Java module, but I assume it makes sense to change that too if the }if token is changed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, even more issues in the Java module unearthed. Okay, then let me investigate first if the if-end and the else-end have an impact on the detection quality. If not, we do not need to fix it. I just assumed it would be correct in the Java module. I will have a look at it. Interesting!

Copy link
Member

@tsaglam tsaglam Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SirYwell you are right, java does not place the if tokens correctly either:

7               if (args == null) {
                |IF{

9               } else if (args.length > 1) {
                       |ELSE
                       |IF{

11              } else {
                       |ELSE

13              }
                |}if
                |}if

As a result, leave it as the Java module does it now, I can always change it in the future. However, please ensure the else tokens are correctly placed. If the else token comes before the tokens of the if block, then we cannot differ between code in the if block and code in the else block.

EDIT: Also, sonar found one code smell. Just rename the local variable tokens to tokenStream.

@SirYwell
Copy link
Contributor Author

Die vielen else if etwa in enterJumpStatement könnte man durch ein eleganteres Konstrukt ersetzen.

Der konkrete Vorschlag würde so nicht funktionieren, weil die Map keine null Keys erlaubt (und auch sonst maximal ein mal null vorkommen dürfte), aber man kann key und value natürlich tauschen.

Mein Vorschlag wäre stattdessen jedoch folgender:

    record Extraction<T>(Predicate<T> nonNullTest, TokenType toExtract) {

        static <T> Extraction<T> of(Function<T, ?> contextToAnything, TokenType toExtract) {
            return new Extraction<>(t -> Objects.nonNull(contextToAnything.apply(t)), toExtract);
        }
    }

    private static final List<Extraction<JumpStatementContext>> JUMP_STATEMENT_TOKENS = List.of(
            Extraction.of(JumpStatementContext::Break, BREAK),
            Extraction.of(JumpStatementContext::Continue, CONTINUE), 
            Extraction.of(JumpStatementContext::Goto, GOTO),
            Extraction.of(JumpStatementContext::Return, RETURN)
    );
    
    @Override
    public void enterJumpStatement(JumpStatementContext context) {
        extractFirstNonNull(context, context.getStart(), JUMP_STATEMENT_TOKENS);
    }
    
    private <T> void extractFirstNonNull(T context, Token token, List<Extraction<T>> extractions) {
        for (Extraction<? super T> extraction : extractions) {
            if (extraction.nonNullTest().test(context)) {
                addEnter(extraction.toExtract(), token);
                return;
            }
        }
    }

Dadurch muss man nicht jedes Mal eine Collection konstruieren. Was hältst du davon @mbrdl?

Sonst denke ich es würde zur Leserlichkeit beitragen wenn man die Contexts einzeln importieren würde

Das habe ich mal gemacht.
(Im Java-Sprachmodul sind die Elemente top level interfaces, wodurch dieses Problem gar nicht auftritt. Andere ANTLR basierten Sprachmodule nutzen nur teilweise statischen Importe für die Klassen)

@SirYwell
Copy link
Contributor Author

I updated the PR and included the requested changes. The if/else extraction should now behave like the Java language module.

@tsaglam
Copy link
Member

tsaglam commented Mar 14, 2023

@SirYwell can you fix the two sonar issues, then I can merge.

@sonarcloud
Copy link

sonarcloud bot commented Mar 14, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@SirYwell
Copy link
Contributor Author

@tsaglam Sonarcloud seems to be happy now :)

@tsaglam tsaglam merged commit 8659517 into jplag:develop Mar 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR that involves features, improvements and other changes language PR / Issue deals (partly) with new and/or existing languages for JPlag major Major issue/feature/contribution/change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants