Speed up comparison #1005

Merged
merged 1 commit into from
Apr 6, 2023
brodmo
Contributor

@brodmo brodmo commented Apr 2, 2023

The greedy string tiling comparison algorithm marks certain tokens. These marks are currently tracked with sets of indices. Replacing the sets with boolean arrays that record whether each token is marked cuts JPlag's runtime roughly in half.
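
To illustrate the difference between the two bookkeeping approaches, here is a minimal sketch. It is not taken from the JPlag code base: the class name `MarkingSketch`, both method names, and the sample indices are hypothetical and exist only for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; none of these names come from the JPlag code base.
class MarkingSketch {

    // Set-based marks: every lookup boxes the index and hashes it.
    static int countUnmarkedWithSet(int tokenCount, Set<Integer> marked) {
        int unmarked = 0;
        for (int i = 0; i < tokenCount; i++) {
            if (!marked.contains(i)) {
                unmarked++;
            }
        }
        return unmarked;
    }

    // Array-based marks: every lookup is a plain index into one contiguous array.
    static int countUnmarkedWithArray(boolean[] marked) {
        int unmarked = 0;
        for (boolean isMarked : marked) {
            if (!isMarked) {
                unmarked++;
            }
        }
        return unmarked;
    }

    public static void main(String[] args) {
        int tokenCount = 10;
        int[] matchedIndices = {2, 3, 4, 7}; // made-up sample data

        Set<Integer> markedSet = new HashSet<>();
        boolean[] markedArray = new boolean[tokenCount];
        for (int index : matchedIndices) {
            markedSet.add(index);
            markedArray[index] = true;
        }

        // Both representations answer the same question identically.
        System.out.println(countUnmarkedWithSet(tokenCount, markedSet));
        System.out.println(countUnmarkedWithArray(markedArray));
    }
}
```

Both variants return the same count; the array version simply avoids boxing and hashing on each lookup, at the cost of allocating one boolean per token up front.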

@sonarqubecloud

sonarqubecloud bot commented Apr 2, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

100.0% Coverage
0.0% Duplication

@tsaglam added the labels enhancement (Issue/PR that involves features, improvements and other changes) and minor (Minor issue/feature/contribution/change) Apr 3, 2023
@tsaglam requested a review from a team April 3, 2023 09:32
@tsaglam merged commit e66f678 into jplag:develop Apr 6, 2023
@brodmo
Contributor Author

brodmo commented Apr 14, 2023

Some more measurements, each row being 100 runs:
[Image: Screenshot 2023-04-14 at 18 23 33]
Legend: color = task, dark = without comparison speedup, light = with comparison speedup

I tested some other variations, but they don't really make a difference. What does make a difference is the comparison speedup; the factor of 2 I gave may even be a bit of an underestimate. Notably, the standard deviation is much bigger for the measurements without the speedup, which hints at the reason for it. I think it has to do with caching: maybe not all of the set entries fit into the data cache, so they have to be loaded again individually on every lookup. The array, by contrast, is contiguous in memory and can be loaded much faster. @tsaglam in case you're interested
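
For readers who want to reproduce this kind of comparison, the following is a rough sketch of how the mean and sample standard deviation over repeated runs could be computed. It is an assumption about a possible setup, not the harness used for the measurements above; `RunStatisticsSketch`, its methods, and the dummy workload are hypothetical names.

```java
// Hypothetical timing helper; not the setup used for the measurements in this thread.
class RunStatisticsSketch {

    // Runs the task the given number of times and returns each duration in milliseconds.
    static double[] timeRuns(Runnable task, int runs) {
        double[] durations = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            durations[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return durations;
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two values.
    static double standardDeviation(double[] values) {
        double average = mean(values);
        double squaredDeviations = 0;
        for (double value : values) {
            squaredDeviations += (value - average) * (value - average);
        }
        return Math.sqrt(squaredDeviations / (values.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the comparison to be measured.
        Runnable workload = () -> {
            long total = 0;
            for (int i = 0; i < 1_000_000; i++) {
                total += i;
            }
            if (total < 0) {
                throw new IllegalStateException("unreachable");
            }
        };

        double[] durations = timeRuns(workload, 100);
        System.out.printf("mean = %.3f ms, std dev = %.3f ms%n",
                mean(durations), standardDeviation(durations));
    }
}
```

A plain `System.nanoTime()` loop like this ignores JIT warm-up and garbage collection effects, so it is only suitable for coarse comparisons such as a factor-of-two difference.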
