Skip to content

File-based Similarity in the Comparison View #1516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Feb 13, 2024

Conversation

Kr0nox
Copy link
Member

@Kr0nox Kr0nox commented Jan 30, 2024

Displays the percentage of tokens of a file that are part of a match.
Also displays the total number of tokens in each submission.

The submission file index now holds data about the token count of each file

relates to #1109

@Kr0nox Kr0nox added the report-viewer PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies label Jan 30, 2024
@Kr0nox
Copy link
Member Author

Kr0nox commented Feb 1, 2024

We should discuss how the EOF Token should be handled. Currently it is excluded in the percent calculation but not in the tooltip, which may be confusing for the user.
If we would include it in the percent calculation a file that is entirely copied could not achive a 100% match. So one option would be to not count it in both numbers

Copy link
Contributor

@TwoOfTwelve TwoOfTwelve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, except for the one thing I commented on.

Copy link
Member

@sebinside sebinside left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the UI, I would argue for not displaying decimal numbers to minimize visual clutter. Also, I would argue for a dark gray instead of black text color.

@Kr0nox Kr0nox requested a review from sebinside February 6, 2024 12:07
@Kr0nox
Copy link
Member Author

Kr0nox commented Feb 6, 2024

The performance impact of this is pretty heavy.
Zipping was increased from 13 seconds to 61 seconds for 64 copies of JPlag (Comparison time was about 3m 40s)

Because of this the token calculation should probably be moved somewhere in the core where it is more efficient. We should discus where (if there even is a place where it is possible to speed it up) and how this should be done

After merging develop with the immediat zip wirting, the time was increased from ~13 seconds to ~16 seconds for 64 copies of JPlag (comparison time about 3m 40s). This also alligned with a time mesuarement for the the counting loop which accounted for 3 seconds.
Using a parralel stream over the submissions we could reduce this to under a second. This would reduce readibility tho in my opinion. What is your opinion on that?

We should discuss whether this is fine or whether it should be moved. We should factor in how difficult it would be to optimize that. since parsing of submissions is not done in parrallel there would be no perfomance improvement when moving this into the submission class and iterating over the list of tokens returned form the Language:parse method. So the only option I see to do this, is to do this directly in the parser of each language. This would lead to some big changes to the Language and Parser interface in my opinion tho.
@tsaglam What is your opinion?

@tsaglam
Copy link
Member

tsaglam commented Feb 8, 2024

@Kr0nox

Using a parallel stream over the submissions we could reduce this to under a second. This would reduce readibility tho in my opinion. What is your opinion on that?

I would say performance > readability here.

We should discuss whether this is fine or whether it should be moved. We should factor in how difficult it would be to optimize that. since parsing of submissions is not done in parallel there would be no performance improvement when moving this into the submission class and iterating over the list of tokens returned form the Language:parse method. So the only option I see to do this, is to do this directly in the parser of each language. This would lead to some big changes to the Language and Parser interface in my opinion tho. What is your opinion?

Doing it in each language is not something I am in favor of.
Another option would be to do it when creating the result object. e.g. either add a method to JPlagResult that returns the number of tokens per file, which is pre-calculated in the constructor and put in a map. Or add a method for that information to the submission class, but only calculate the info lazily when the method is first called. Thus, when writing the report, you can iterate over the submissions in parallel and call that method. This last solution is something I would actually prefer.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change labels Feb 13, 2024
Copy link
Member

@tsaglam tsaglam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small thing.

Copy link

Quality Gate Passed Quality Gate passed for 'JPlag Report Viewer'

Issues
0 New issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

Copy link

Quality Gate Passed Quality Gate passed for 'JPlag Plagiarism Detector'

Issues
0 New issues

Measures
0 Security Hotspots
90.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

@sebinside sebinside merged commit b91395a into develop Feb 13, 2024
@sebinside sebinside deleted the report-viewer/single-file-similarity branch February 13, 2024 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change report-viewer PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants