LLVM IR min token and README adjustments #1322

Merged on Oct 4, 2023 (1 commit)
14 changes: 14 additions & 0 deletions languages/llvmir/README.md
@@ -22,6 +22,20 @@ These include binary and bitwise instructions (like addition and or), memory ope

To use the LLVM IR module, add the `-l llvmir` flag in the CLI, or use a `JPlagOption` object with `new de.jplag.llvmir.LLVMIRLanguage()` as `language` in the Java API as described in the usage information in the [readme of the main project](https://github.com/jplag/JPlag#usage) and [in the wiki](https://github.com/jplag/JPlag/wiki/1.-How-to-Use-JPlag).

We recommend using the [LLVM optimizer](https://llvm.org/docs/CommandGuide/opt.html) to optimize the LLVM IR code before using JPlag.
In our tests, optimization level 1 showed the best results in plagiarism detection quality and should therefore be used.

### Minimum Token Match

It can be difficult to find a good value for the minimum token match because the range of suitable values is much larger for low-level languages like LLVM IR than for high-level languages.
Suitable values range from 60 to 70 for code compiled from C to more than 1000 for code compiled from C++.
From our tests, we derived a formula that depends on the average lines of code (avg. loc) to determine a value that should provide good results:

min_token_match(x) = 48.2055162 * e^(0.000333593799 * x)

with x = (avg. loc of the LLVM IR code) - (avg. loc of the source code), <br>
where the source code is the code from which the IR code was generated, for example, the C or C++ code.
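
As an illustration, the formula above could be evaluated with a small Java helper like the following. This is only a sketch and not part of JPlag: the class name `MinTokenMatchEstimate`, the method `estimate`, and the choice to round to the nearest integer are assumptions made for this example, and the line counts in `main` are hypothetical.

```java
// Hypothetical helper, not part of JPlag: evaluates the formula from this README.
final class MinTokenMatchEstimate {
    // Coefficients copied verbatim from the formula above.
    private static final double SCALE = 48.2055162;
    private static final double RATE = 0.000333593799;

    private MinTokenMatchEstimate() {
    }

    /**
     * @param avgLocIr average lines of code of the LLVM IR files
     * @param avgLocSource average lines of code of the original source files (e.g. C or C++)
     * @return the estimated minimum token match, rounded to the nearest integer (rounding is an assumption)
     */
    static int estimate(double avgLocIr, double avgLocSource) {
        double x = avgLocIr - avgLocSource;
        return (int) Math.round(SCALE * Math.exp(RATE * x));
    }

    public static void main(String[] args) {
        // Hypothetical averages: 1500 lines of IR generated from 300 lines of source.
        System.out.println(estimate(1500, 300));
    }
}
```

The resulting value would then be passed to JPlag as the minimum token match in place of the language default.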

<br>

#### Footnotes
@@ -13,7 +13,7 @@ public class LLVMIRLanguage extends AbstractAntlrLanguage {

private static final String NAME = "LLVMIR Parser";
private static final String IDENTIFIER = "llvmir";
- private static final int DEFAULT_MIN_TOKEN_MATCH = 40;
+ private static final int DEFAULT_MIN_TOKEN_MATCH = 70;
private static final String[] FILE_EXTENSIONS = {".ll"};

public LLVMIRLanguage() {