Calculate GCD for longs more efficiently #140

Merged: 9 commits, Mar 6, 2024

mlangc
Contributor

@mlangc mlangc commented Feb 25, 2024

Replaces the GCD implementation for long values with code that is several times faster. See
https://medium.com/@m.langer798/stein-vs-stein-on-the-jvm-c911809bfce1 for details.

@aherbert
Contributor

Interesting analysis on your blog. It would be helpful if you apply the same changes to public static int gcd(int p, int q) too.

@mlangc
Contributor Author

mlangc commented Feb 26, 2024

> Interesting analysis on your blog. It would be helpful if you apply the same changes to public static int gcd(int p, int q) too.

Makes sense - I just pushed another commit that does exactly that.

@codecov-commenter

codecov-commenter commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.24%. Comparing base (27ab685) to head (6f9ebd4).
Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #140      +/-   ##
============================================
+ Coverage     99.23%   99.24%   +0.01%     
+ Complexity     1828     1802      -26     
============================================
  Files            70       70              
  Lines          4808     4779      -29     
  Branches        896      881      -15     
============================================
- Hits           4771     4743      -28     
  Misses           10       10              
+ Partials         27       26       -1     


@mlangc
Contributor Author

mlangc commented Feb 26, 2024

I ran some further benchmarks, and it seems that the implementation for int is actually better left as is. I'll look more closely tomorrow or the day after and share the results with you.

@mlangc
Contributor Author

mlangc commented Mar 1, 2024

After adding some benchmarks, I found out that the existing GCD implementation for ints is also very performant for longs, if only one small change is made to it. Thus I adapted the implementation for ints, and replaced the version for longs with the same code, but for 64 bits.

Here are the benchmark results from my laptop. Each operation computes 1000 GCDs, so the scores are in units of 1000 GCDs per second (see https://github.com/apache/commons-numbers/pull/140/files#diff-61d1811860900830accad2a21d17e8bcd905486d5c91ed6e43bef31e11e27147):

Benchmark                                Mode  Cnt      Score      Error  Units
GcdPerformance.gcdBigInteger            thrpt   10   1504.116 ±  166.441  ops/s
GcdPerformance.gcdInt                   thrpt   10  21050.470 ± 1057.856  ops/s
GcdPerformance.gcdLong                  thrpt   10  10352.825 ±  272.360  ops/s
GcdPerformance.oldGcdInt                thrpt   10  21003.386 ±  849.209  ops/s
GcdPerformance.oldGcdIntAdaptedForLong  thrpt   10   4741.043 ±  562.422  ops/s
GcdPerformance.oldGcdLong               thrpt   10   2500.723 ±   79.173  ops/s

As the table shows, the performance for ints is not really affected by the change. For longs, however, there is a big difference, as this chart shows (see https://colab.research.google.com/drive/11uz20qhFhUgv_-2YewzR--SDRD4_swYr#scrollTo=1M4k9mlEbvab&line=32&uniqifier=1):
[Figure: gcd-long-performance]

@aherbert
Contributor

aherbert commented Mar 1, 2024

The commit history shows that the int version was changed in NUMBERS-132. This introduced the use of numberOfTrailingZeros for fast division by powers of 2. The long version was not updated at the time.
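As a rough illustration of the technique mentioned here, the following sketch uses `Integer.numberOfTrailingZeros` to strip all factors of two in a single shift inside a binary (Stein) GCD. It is a simplified version under stated assumptions: the class and method names are hypothetical, inputs are restricted to non-negative values, and it is not the code from NUMBERS-132 or from `ArithmeticUtils`.

```java
/** Hypothetical sketch class; not part of Commons Numbers. */
final class TrailingZerosGcdSketch {
    private TrailingZerosGcdSketch() {}

    /**
     * Binary GCD for non-negative ints (assumption: negative input is rejected
     * here to keep the sketch free of overflow handling).
     */
    static int gcd(int a, int b) {
        if (a < 0 || b < 0) {
            throw new IllegalArgumentException("non-negative inputs only");
        }
        if (a == 0) {
            return b;
        }
        if (b == 0) {
            return a;
        }
        // Remove every factor of two with one shift instead of a divide-by-2 loop.
        final int aTwos = Integer.numberOfTrailingZeros(a);
        final int bTwos = Integer.numberOfTrailingZeros(b);
        a >>= aTwos;
        b >>= bTwos;
        // The common power of two is restored at the end.
        final int shift = Math.min(aTwos, bTwos);

        // Both values are odd here, so their difference is even and non-zero.
        while (a != b) {
            final int delta = Math.abs(a - b);
            b = Math.min(a, b);
            a = delta >> Integer.numberOfTrailingZeros(delta);
        }
        return a << shift;
    }
}
```

For example, `TrailingZerosGcdSketch.gcd(12, 18)` reduces the inputs to 3 and 9 with a common shift of 1, and returns 6.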

I'll review the code with some feedback. I expect the benchmark code to fail the build due to uncommented public methods. If these are made private it should be OK.

@aherbert aherbert left a comment

Thanks for the code and benchmark. I am fine with the code changes as they copy the int implementation from NUMBERS-132 to long, and improve the performance.

I have given some ideas to change the benchmark to allow more flexibility in testing. I also think it will fail the build without some form of comments on public classes and methods. You can test this by running mvn from the JMH module directory. Comments are always helpful so a future visitor will not have to look at this GH PR or your informative blog post on the topic.

@State(Scope.Benchmark)
public static class Ints {
    final int[] values;

Contributor

Double new lines should be changed to single new lines

Contributor Author

Should be gone.

private static final long SEED = 42;

@State(Scope.Benchmark)
public static class Ints {

Contributor

This may fail the build due to lack of a comment on a public class. You can test the build by running mvn from the command line. The JMH module skips a lot of QA plugins but checkstyle is still enabled which flags some javadoc issues.

Contributor Author

I just pushed a commit, that hopefully fixes all checkstyle warnings.

@State(Scope.Benchmark)
@Fork(value = 1, jvmArgs = {"-server", "-Xms512M", "-Xmx512M"})
public class GcdPerformance {
    private static final int NUM_PAIRS = 1000;

Contributor

I would move this constant into the State classes so it can be configured on the command line using JMH parameter injection. The same can be done for the seed.

    @State(Scope.Benchmark)
    public static class Longs {
        /** Number of pairs. */
        @Param({"1000"})
        private int pairs;
        /** Seed. */
        @Param({"42"})
        private long seed;
        
        private long[] values;

        /**
         * @return the data sample
         */
        long[] getValues() {
            return values;
        }

        /** Create the data. */
        @Setup(Level.Iteration)
        public void setup() {
            values = getRandomProvider(seed).longs().filter(i -> i != Long.MIN_VALUE).limit(pairs << 1).toArray();
            // Different seed next time; just reuse one of the random values
            seed = values[0];
        }
    }

Allows:

mvn package -Pexamples-jmh
java -jar target/examples-jmh.jar GcdPerformance -ppairs=100000 -pseed=123

Contributor Author

Good idea - done.

return -negatedGcd;
}

public static long oldGcdLong(final long p, final long q) {

Contributor

Make private. I would add a comment about the origin of the method, e.g.

This is a copy of the original method in {@code o.a.c.numbers.core.ArithmeticUtils} v1.0.

Contributor Author

👍 done

return -u * (1L << k); // gcd is u*2^k
}

public static int oldGcdInt(int p, int q) {

Contributor

Make private. Add comment detailing the origin (as before).

Contributor Author

👍 done


@Benchmark
public void gcdInt(Ints ints, Blackhole blackhole) {
    for (int i = 0; i < ints.values.length; i += 2) {

Contributor

If the values are non-final (due to creation with parameterized sizing) you will have to copy a reference here:

final int[] a = ints.getValues();

Contributor Author

OK - I will check if it makes any difference.

Contributor Author

I refactored the whole GCD benchmark and included this change as well. The difference is minimal, and I'm not sure it's even significant. However, it seems that if I copy the reference to a local variable, the old GCD implementation for int is slightly faster than the new one, while without the copy it seems to be the other way round. Before the recent changes, when values was final, there was no difference between the implementations, so I think that this might be a benchmarking artifact.

What values do you get if you execute the benchmark?

@Benchmark
public void gcdInt(Ints ints, Blackhole blackhole) {
    for (int i = 0; i < ints.values.length; i += 2) {
        blackhole.consume(ArithmeticUtils.gcd(ints.values[i], ints.values[i + 1]));

Contributor

Note that the use of a Blackhole does have some overhead. When the method being benchmarked is very fast then this should either be baselined (i.e. how long does it take to consume 1000 ints), or an alternative with less overhead used. For cases with primitives I add them up and return the sum (which JMH will then consume):

public int gcdInt(Ints ints) {
    int sum = 0;
    for (int i = 0; i < ints.values.length; i += 2) {
        sum += ArithmeticUtils.gcd(ints.values[i], ints.values[i + 1]);
    }
    return sum;
}

This may be worth investigating to see if the throughput numbers reported are different.

Contributor Author

Thanks for the remark - I will check if there are any differences.

Contributor Author

I tried that and didn't observe a significant difference, so I prefer to leave it as is.

@aherbert
Contributor

aherbert commented Mar 2, 2024

Installed the numbers-core module locally (forgot to do this at first).

MacBook Pro M2:

mvn -v
Apache Maven 3.9.4 (dfbb324ad4a7c8fb0bf182e6d91b0ae20e3d2dd9)
Maven home: /Users/ah403/mvn/mvn
Java version: 21.0.1, vendor: Eclipse Adoptium, runtime: /Library/Java/JavaVirtualMachines/temurin-21.jdk/Contents/Home
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "14.2.1", arch: "aarch64", family: "mac"

I did also check JDK 11 and saw the same results so using JDK 21 is not the reason for the difference.

With benchmarking it is recommended to avoid final input (as the JVM may recognise this and learn what to do). So I would prefer the benchmark without final or fixed data.

I changed the benchmark to build new data each iteration:

@Setup(Level.Iteration)
public void setup() {
    values = getRandomProvider(seed).ints()
            .filter(i -> i != Integer.MIN_VALUE)
            .limit(numPairs * 2)
            .toArray();
    seed = (((long) values[0]) << Integer.SIZE) | values[1];
}

@Setup(Level.Iteration)
public void setup() {
    values = getRandomProvider(seed).longs()
            .filter(i -> i != Long.MIN_VALUE)
            .limit(numPairs * 2)
            .toArray();
    seed = values[0];
}

I find you get more stable timings with more than 1000 pairs. Here are 100000 pairs reseeded each time:

Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   21.217 ± 0.117  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  231.625 ± 2.014  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10  115.497 ± 0.446  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  253.902 ± 2.531  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10   68.111 ± 0.909  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   29.181 ± 0.037  ops/s

I see the oldGcdInt is faster than the new gcdInt. The oldGcdIntAdaptedForLong is outperformed by the new gcdLong. This is interesting given the trivial difference between them.

# average time
java -jar target/examples-jmh.jar Gcd -pnumPairs=100000 -bm avgt -tu ns

Benchmark                               (numPairs)  (seed)  Mode  Cnt         Score        Error  Units
GcdPerformance.gcdBigInteger                100000      42  avgt   10  47321374.441 ± 316079.389  ns/op
GcdPerformance.gcdInt                       100000      42  avgt   10   4365814.289 ±  34222.618  ns/op
GcdPerformance.gcdLong                      100000      42  avgt   10   8711690.051 ±  25766.036  ns/op
GcdPerformance.oldGcdInt                    100000      42  avgt   10   3945841.489 ±  35975.472  ns/op
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  avgt   10  14703364.077 ± 539905.259  ns/op
GcdPerformance.oldGcdLong                   100000      42  avgt   10  34265752.087 ± 206487.358  ns/op

Same.
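The throughput and average-time tables report the same measurement in different units, which is why the ranking does not change. A minimal sketch of the conversion follows, assuming one benchmark operation processes 100000 pairs as in these runs; the class and method names are hypothetical. With the gcdInt average time above, the conversion gives roughly 229 ops/s, close to the 231.625 ops/s measured in throughput mode.

```java
/** Hypothetical helper for reading JMH scores; not part of the benchmark code. */
final class JmhUnitsSketch {
    private JmhUnitsSketch() {}

    /** Converts an average time per operation (ns/op) to a throughput (ops/s). */
    static double opsPerSecond(double nanosPerOp) {
        return 1e9 / nanosPerOp;
    }

    /** Average time for a single GCD when one operation processes numPairs pairs. */
    static double nanosPerGcd(double nanosPerOp, int numPairs) {
        return nanosPerOp / numPairs;
    }

    public static void main(String[] args) {
        // gcdInt score taken from the average-time table in this comment.
        final double gcdIntNanosPerOp = 4365814.289;
        System.out.printf("gcdInt: %.2f ops/s, %.2f ns per GCD%n",
                opsPerSecond(gcdIntNanosPerOp),
                nanosPerGcd(gcdIntNanosPerOp, 100000));
    }
}
```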

So on this machine it seems that the oldGcdInt is OK, but adapting that method for long requires the change to not use while (a != b).

If this is repeatable on other machines then I think the new gcdLong implementation should have a comment to state why it is different from the gcdInt, e.g.

// This method is intentionally different from the int gcd implementation.
// Benchmarking shows the test for long inequality (a != b) is slow compared to
// testing for equality of the delta to zero. The same change on the int gcd
// reduces performance there, hence we have two variants of the method.
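To make the difference between the two loop shapes concrete, here is a simplified sketch of both for long values. It is an illustration under stated assumptions, not the code merged in this PR: the class and method names are hypothetical, and inputs are restricted to non-negative values so that no overflow handling is needed. Both variants compute the same result and differ only in how the loop terminates.

```java
/** Hypothetical sketch class contrasting the two loop terminations. */
final class LongGcdLoopSketch {
    private LongGcdLoopSketch() {}

    /** Variant 1: the loop ends when the two operands compare equal (a != b). */
    static long gcdCompareOperands(long p, long q) {
        checkNonNegative(p, q);
        if (p == 0 || q == 0) {
            return p | q; // gcd(0, x) = x
        }
        final int pTwos = Long.numberOfTrailingZeros(p);
        final int qTwos = Long.numberOfTrailingZeros(q);
        long a = p >> pTwos; // odd
        long b = q >> qTwos; // odd
        while (a != b) {
            final long delta = a - b;
            b = Math.min(a, b);
            a = Math.abs(delta);
            a >>= Long.numberOfTrailingZeros(a);
        }
        return a << Math.min(pTwos, qTwos);
    }

    /** Variant 2: the loop ends when the already computed delta is zero. */
    static long gcdTestDelta(long p, long q) {
        checkNonNegative(p, q);
        if (p == 0 || q == 0) {
            return p | q; // gcd(0, x) = x
        }
        final int pTwos = Long.numberOfTrailingZeros(p);
        final int qTwos = Long.numberOfTrailingZeros(q);
        long a = p >> pTwos; // odd
        long b = q >> qTwos; // odd
        while (true) {
            final long delta = a - b;
            if (delta == 0) {
                break; // a == b, tested via the delta instead of a != b
            }
            b = Math.min(a, b);
            a = Math.abs(delta);
            a >>= Long.numberOfTrailingZeros(a);
        }
        return a << Math.min(pTwos, qTwos);
    }

    private static void checkNonNegative(long p, long q) {
        if (p < 0 || q < 0) {
            throw new IllegalArgumentException("non-negative inputs only");
        }
    }
}
```

Which variant is faster is an empirical question; the benchmark numbers in this thread are the evidence, and the sketch only shows where the two versions differ.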

I thought it may be due to the fact that the loop is executed more often for 64-bit numbers and the change aided in pipelining as the delta is required more often. So I recompiled with the long data limited to 32-bit (int) data. However the difference is still there:

Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   96.059 ± 2.658  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  230.165 ± 0.501  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10  218.651 ± 0.516  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  252.595 ± 0.822  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10  131.465 ± 0.451  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   58.857 ± 0.123  ops/s

Note that int data make BigInteger 4x faster (it has a smaller representation in memory) and the long implementations 2x faster. So the loop execution count is approximately halved. The new long gcd version approaches the speed of the int versions.

Since the speed difference is still there it may be that equality comparison between longs is slower than comparison to zero. But I did not investigate further. Note that to test that a long is not zero you can fold the upper and lower bits into a single 32-bit integer and test that it is not zero. So perhaps comparison of a long to zero is faster due to such an optimisation. But I'm not familiar with what goes on at the hardware level.
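The folding idea mentioned here can be sketched as follows. This only illustrates the bit manipulation; whether the JIT compiler or the hardware does anything like it is not established in this thread, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of testing a long against zero via a 32-bit fold. */
final class LongZeroTestSketch {
    private LongZeroTestSketch() {}

    /**
     * ORs the upper 32 bits into the lower 32 bits and tests the resulting int.
     * The folded int is zero only if every bit of x is zero.
     */
    static boolean isNonZeroFolded(long x) {
        return ((int) (x | (x >>> 32))) != 0;
    }
}
```

For any input this agrees with the direct test `x != 0`, including values such as `1L << 32` where only the upper half has a bit set.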

I will repeat the benchmark on another machine when I go back to work on Monday. I find the M2 processor can often fail to show differences that are seen on my old Xeon workstation processor, i.e. make sure to test on the worst processor(s) you have access to.

@mlangc
Contributor Author

mlangc commented Mar 3, 2024

Thanks a lot for the detailed analysis. I've adapted the benchmark according to your suggestions: I've changed the default number of pairs to 100_000 and generate new pairs after each iteration. Interestingly, now the new GCD implementation for int wins for me:

Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   16.422 ± 0.217  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  211.602 ± 0.962  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10  104.997 ± 0.392  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  207.989 ± 2.261  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10   50.588 ± 0.502  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   24.867 ± 0.099  ops/s

with a 2.7 GHz Quad-Core Intel Core i7 and

$ mvn -v
Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /usr/local/Cellar/maven/3.9.6/libexec
Java version: 21.0.2, vendor: Homebrew, runtime: /usr/local/Cellar/openjdk/21.0.2/libexec/openjdk.jdk/Contents/Home
Default locale: en_001, platform encoding: UTF-8
OS name: "mac os x", version: "14.2.1", arch: "x86_64", family: "mac"

@aherbert
Contributor

aherbert commented Mar 4, 2024

Intel(R) Xeon(R) CPU E5-1680 v3 @ 3.20GHz

JDK 11.0.22, OpenJDK 64-Bit Server VM, 11.0.22+7-post-Ubuntu-0ubuntu220.04.1

Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   14.922 ± 0.164  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  149.786 ± 0.470  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10   82.026 ± 0.437  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  149.918 ± 3.116  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10   45.550 ± 0.068  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   24.903 ± 0.085  ops/s

JDK 21, OpenJDK 64-Bit Server VM, 21+35-2513

Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   15.186 ± 0.055  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  171.424 ± 0.222  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10   84.787 ± 1.057  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  177.836 ± 0.261  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10   56.996 ± 0.155  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   24.625 ± 0.341  ops/s

Here the two int versions are the same on JDK 11 and the old version is faster on JDK 21. The new long version is again faster than the oldGcdIntAdaptedForLong.

So on two machines the int versions are roughly the same speed (your machine has a 2% advantage to the new version). On the M2 processor the old int version is 9% faster, and 3.5% faster on the Xeon processor on JDK 21.
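The relative differences quoted here follow from the ratio of the throughput scores. A small sketch of that arithmetic, with hypothetical names: using the scores reported in this thread, the ratios work out to roughly 9.6% (M2, old over new), 3.7% (Xeon on JDK 21, old over new) and 1.7% (Intel Core i7, new over old), in line with the rounded figures above.

```java
/** Hypothetical helper for comparing two throughput scores. */
final class RelativeSpeedSketch {
    private RelativeSpeedSketch() {}

    /** Returns how much faster the first score is than the second, in percent. */
    static double percentFaster(double fasterScore, double slowerScore) {
        return (fasterScore / slowerScore - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/s copied from the benchmark tables in this thread.
        System.out.printf("M2, oldGcdInt vs gcdInt:      %.1f%%%n", percentFaster(253.902, 231.625));
        System.out.printf("Xeon JDK 21, oldGcdInt vs gcdInt: %.1f%%%n", percentFaster(177.836, 171.424));
        System.out.printf("Core i7, gcdInt vs oldGcdInt: %.1f%%%n", percentFaster(211.602, 207.989));
    }
}
```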

I would be in favour of keeping the old int version and updating to the new long version (for a 3-4x speed-up). This would require the previously mentioned comment in the code as to why we have two slightly different versions. It would also require a change to transfer the core implementation to the benchmark as gcdLongAdaptedForInt and rename the int complement to gcdIntAdaptedForLong.

Thoughts?

@mlangc
Contributor Author

mlangc commented Mar 4, 2024

Sounds good 👍 , I can look into that in the coming days.

@mlangc
Contributor Author

mlangc commented Mar 6, 2024

As discussed, I've reverted my change to the GCD implementation for ints, and adapted the benchmarks. Inspired by one of your earlier comments, I also added a benchmark that tests the long implementation with ints. Here are my results for reference from my MacBook with an Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz running JDK 21.0.2, OpenJDK 64-Bit Server VM, 21.0.2+13-58:

Benchmark                            (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger             100000      42  thrpt   10   16.166 ± 0.510  ops/s
GcdPerformance.gcdInt                    100000      42  thrpt   10  205.558 ± 4.193  ops/s
GcdPerformance.gcdIntAdaptedForLong      100000      42  thrpt   10   49.751 ± 0.846  ops/s
GcdPerformance.gcdLong                   100000      42  thrpt   10  105.440 ± 1.035  ops/s
GcdPerformance.gcdLongAdaptedForInt      100000      42  thrpt   10  207.047 ± 0.848  ops/s
GcdPerformance.gcdLongWithInts           100000      42  thrpt   10  194.136 ± 2.390  ops/s
GcdPerformance.oldGcdLong                100000      42  thrpt   10   25.128 ± 0.106  ops/s

@aherbert
Contributor

aherbert commented Mar 6, 2024

Confirmed performance of current code on MacBook Pro M2 using JDK 21:

Benchmark                            (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger             100000      42  thrpt   10   20.600 ± 0.082  ops/s
GcdPerformance.gcdInt                    100000      42  thrpt   10  247.326 ± 1.044  ops/s
GcdPerformance.gcdIntAdaptedForLong      100000      42  thrpt   10   66.630 ± 0.316  ops/s
GcdPerformance.gcdLong                   100000      42  thrpt   10  111.815 ± 0.586  ops/s
GcdPerformance.gcdLongAdaptedForInt      100000      42  thrpt   10  222.632 ± 2.511  ops/s
GcdPerformance.gcdLongWithInts           100000      42  thrpt   10  217.358 ± 2.724  ops/s
GcdPerformance.oldGcdLong                100000      42  thrpt   10   29.178 ± 0.059  ops/s

@aherbert aherbert merged commit 7cd4b43 into apache:master Mar 6, 2024
4 checks passed
@aherbert
Contributor

aherbert commented Mar 6, 2024

Thanks for the contribution.


3 participants