-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate GCD for longs more efficiently #140
Conversation
Replaces the GCD implementation for long values with code that is several times faster. See https://medium.com/@m.langer798/stein-vs-stein-on-the-jvm-c911809bfce1 for details.
Interesting analysis on your blog. It would be helpful if you apply the same changes to |
Makes sense - I just pushed another commit that does exactly that. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #140 +/- ##
============================================
+ Coverage 99.23% 99.24% +0.01%
+ Complexity 1828 1802 -26
============================================
Files 70 70
Lines 4808 4779 -29
Branches 896 881 -15
============================================
- Hits 4771 4743 -28
Misses 10 10
+ Partials 27 26 -1 ☔ View full report in Codecov by Sentry. |
I did some further benchmarks, and it seems that the implementation for int is actually better left as is. I'll look more closely tomorrow or after tomorrow & share the results with you. |
After adding some benchmarks, I found out that the existing GCD implementation for ints is also very performant for longs, if only one small change is made to it. Thus I adapted the implementation for ints, and replaced the version for longs with the same code, but for 64 bits. Here are the benchmark results in 1000 GCDs per second on my laptop (see https://github.com/apache/commons-numbers/pull/140/files#diff-61d1811860900830accad2a21d17e8bcd905486d5c91ed6e43bef31e11e27147):
As you can verify, the performance for ints is not really affected by the change, however for longs, there is a big difference, as you can see here (see https://colab.research.google.com/drive/11uz20qhFhUgv_-2YewzR--SDRD4_swYr#scrollTo=1M4k9mlEbvab&line=32&uniqifier=1) |
It seems that commit history changed the int version in NUMBERS-132. This introduced the use of I'll review the code with some feedback. I expect the benchmark code to fail the build due to uncommented public methods. If these are made private it should be OK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the code and benchmark. I am fine with the code changes as they copy the int implementation from NUMBERS-132 to long, and improve the performance.
I have given some ideas to change the benchmark to allow more flexibility in testing. I also think it will fail the build without some form of comments on public classes and methods. You can test this by running mvn
from the JMH module directory. Comments are always helpful so a future visitor will not have to look at this GH PR or your informative blog post on the topic.
@State(Scope.Benchmark) | ||
public static class Ints { | ||
final int[] values; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double new lines should be changed to single new lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be gone.
private static final long SEED = 42; | ||
|
||
@State(Scope.Benchmark) | ||
public static class Ints { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may fail the build due to lack of a comment on a public class. You can test the build by running mvn
from the command line. The JMH module skips a lot of QA plugins but checkstyle is still enabled which flags some javadoc issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a commit, that hopefully fixes all checkstyle warnings.
@State(Scope.Benchmark) | ||
@Fork(value = 1, jvmArgs = {"-server", "-Xms512M", "-Xmx512M"}) | ||
public class GcdPerformance { | ||
private static final int NUM_PAIRS = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this constant into the State classes so it can be configured on the command line using JMH parameter injection. The same can be done for the seed.
@State(Scope.Benchmark)
public static class Longs {
/** Number of pairs. */
@Param({"1000"})
private int pairs;
/** Seed. */
@Param({"42"})
private long seed;
private long[] values;
/**
* @return the data sample
*/
long[] getValues() {
return values;
}
/** Create the data. */
@Setup(Level.Iteration)
public void setup() {
values = getRandomProvider(seed).longs().filter(i -> i != Long.MIN_VALUE).limit(pairs << 1).toArray();
// Different seed next time; just reuse one of the random values
seed = values[0];
}
}
Allows:
mvn package -Pexamples-jmh
java -jar target/examples-jmh.jar GcdPerformance -ppairs=100000 -pseed=123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - done.
return -negatedGcd; | ||
} | ||
|
||
public static long oldGcdLong(final long p, final long q) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make private. I would add a comment about the origin of the method, e.g.
This is a copy of the original method in {@code o.a.c.numbers.core.ArithmeticUtils} v1.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 done
return -u * (1L << k); // gcd is u*2^k | ||
} | ||
|
||
public static int oldGcdInt(int p, int q) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make private. Add comment detailing the origin (as before).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 done
|
||
@Benchmark | ||
public void gcdInt(Ints ints, Blackhole blackhole) { | ||
for (int i = 0; i < ints.values.length; i += 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the values are non-final (due to creation with parameterized sizing) you will have to copy a reference here:
final int[] a = ints.getValues();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will check if it makes any difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the whole GCD benchmark and included this change as well. The difference is minimal, and I'm not sure it's even significant. However, it seems that if I copy the reference to a local variable, the old GCD implementation for int is slightly faster then the new one, while without the copy, it seems to be the other way round. Before recent changes, when values
was final, there was no difference between the implementations, so I think that this might be a benchmarking artifact.
What values do you get if you execute the benchmark?
@Benchmark | ||
public void gcdInt(Ints ints, Blackhole blackhole) { | ||
for (int i = 0; i < ints.values.length; i += 2) { | ||
blackhole.consume(ArithmeticUtils.gcd(ints.values[i], ints.values[i + 1])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the use of a Blackhole does have some overhead. When the method being benchmarked is very fast then this should either be baselined (i.e. how long does it take to consume 1000 ints), or an alternative with less overhead used. For cases with primitives I add them up and return the sum (which JMH will then consume):
public int gcdInt(Ints ints) {
int sum = 0;
for (int i = 0; i < ints.values.length; i += 2) {
sum += ArithmeticUtils.gcd(ints.values[i], ints.values[i + 1]);
}
return sum;
}
This may be worth investigating to see if the throughput numbers reported are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the remark - I will check if there are any differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that and didn't observe a significant difference, so I prefer to leave it as is.
Installed the numbers-core module locally (forgot to do this at first). MacBook Pro M2:
I did also check JDK 11 and saw the same results so using JDK 21 is not the reason for the difference. With benchmarking it is recommended to avoid I changed the benchmark to build new data each iteration: @Setup(Level.Iteration)
public void setup() {
values = getRandomProvider(seed).ints()
.filter(i -> i != Integer.MIN_VALUE).
limit(numPairs * 2)
.toArray();
seed = (((long) values[0]) << Integer.SIZE) | values[1];
}
@Setup(Level.Iteration)
public void setup() {
values = getRandomProvider(seed).longs()
.filter(i -> i != Long.MIN_VALUE)
.limit(numPairs * 2)
.toArray();
seed = values[0];
} I find you get more stable timings with more than 1000 pairs. Here are 100000 pairs reseeded each time:
I see the
Same. So on this machine it seems that the If this is repeatable on other machines then I think the new
I thought it may be due to the fact that the the loop is executed more often for 64-bit numbers and the change aided in pipelining as the delta is required more often. So I recompiled with the long data limited to 32-bit (int) data. However the difference is still there:
Note that int data make BigInteger 4x faster (it has a smaller representation in memory) and the long implementations 2x faster. So the loop execution count is approximately halved. The new long gcd version approaches the speed of the int versions. Since the speed difference is still there it may be that equality comparison between longs is slower than comparison to zero. But I did not investigate further. Note that to compare a long is not zero you can fold the upper and lower bits to a single 32-bit integer and test that is not zero. So perhaps comparison of long to zero is faster due to such optimisation. But I'm not familiar with what goes on at the hardware level. I will repeat the benchmark on another machine when I go back to work on Monday. I find the M2 processor can often fail to show differences that are seen on my old Xeon workstation processor, i.e. make sure to test on the worst processor(s) you have access to. |
Thanks a lot for the detailed analysis. I've adapted the benchmark according to your suggestions: I've changed the default number of pairs to 100_000 and generate new pairs after each iteration. Interestingly, now the new GCD implementation for
with a
|
Intel(R) Xeon(R) CPU E5-1680 v3 @ 3.20GHz
Here the two int versions are the same on JDK 11 and the old version faster on JDK 21. The new long version is again faster than the oldGcdIntAdaptedForLong. So on two machines the int versions are roughly the same speed (your machine has a 2% advantage to the new version). On the M2 processor the old int version is 9% faster, and 3.5% faster on the Xeon processor on JDK 21. I would be in favour of keeping the old int version and updating to the new long version (for a 3-4x speed-up). This would require the previously mentioned comment in the code as to why we have two slightly different versions. It would also require a change to transfer the core implementation to the benchmark as Thoughts? |
Sounds good 👍 , I can look into that in the coming days. |
As discussed, I've reverted my change to the GCD implementation for ints, and adapted the benchmarks. Inspired by one of your earlier comments, I also added a benchmark that tests the long implementation with ints. Here are my results for reference from my MacBoo with a
|
Confirmed performance of current code on MacBook Pro M2 using JDK 21:
|
Thanks for the contribution. |
Replaces the GCD implementation for long values with code that is several times faster. See
https://medium.com/@m.langer798/stein-vs-stein-on-the-jvm-c911809bfce1 for details.