memory mapped files, branchless parsing, bit-twiddle magic #5

royvanrijn · 2024-01-01T20:42:44Z

I ran it using:
sdk use java 21.0.1-graal

The latest GraalVM seems to give the best performance.

I've added AppCDS in a very simple way: the archive is dumped on the first run and used on the subsequent runs.

I've also implemented memory-mapped files, based on spullara/bjhara's submission, which is awesome and gave me even more ideas for improvements.
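A minimal sketch of the memory-mapping step, assuming the file is split into read-only `MappedByteBuffer` windows that each end right after a `\n`, so every thread can process its window independently. This is not the code from this PR or from spullara's/bjhara's submission; `MappedChunks`, `mapChunks`, `concat`, and the `maxChunk` parameter are hypothetical names. Each window has to fit into an `int`, because `FileChannel.map` rejects sizes above `Integer.MAX_VALUE`.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

final class MappedChunks {
    /**
     * Maps the file read-only in windows of at most maxChunk bytes (hypothetical helper).
     * Every window except the last gets its limit trimmed so it ends right after a '\n'.
     */
    static List<MappedByteBuffer> mapChunks(Path file, int maxChunk) throws IOException {
        List<MappedByteBuffer> chunks = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long start = 0;
            while (start < size) {
                long len = Math.min(maxChunk, size - start);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, start, len);
                int end = (int) len;
                if (start + len < size) {
                    // Walk back to the last newline so no line straddles two windows.
                    while (end > 0 && buf.get(end - 1) != '\n') {
                        end--;
                    }
                    if (end == 0) {
                        throw new IllegalArgumentException("a line is longer than maxChunk");
                    }
                }
                buf.limit(end);
                chunks.add(buf);
                start += end;
            }
        }
        // The mappings stay valid after the channel is closed.
        return chunks;
    }

    /** Decodes all windows back into one string; only useful for checking the split. */
    static String concat(List<MappedByteBuffer> chunks) {
        StringBuilder sb = new StringBuilder();
        for (MappedByteBuffer chunk : chunks) {
            byte[] bytes = new byte[chunk.remaining()];
            chunk.duplicate().get(bytes);
            sb.append(new String(bytes, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```

Each buffer's `limit()` marks where its last complete line ends, so the windows can be handed to separate threads without any line being cut in half.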

Most of the improvements come from optimizing the innermost loop (of course), mainly by writing my own branchless parser for the numbers, which probably looks like magic.
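One way such a branchless parser can work, as a hypothetical sketch rather than the parser in this PR, assuming every reading matches `-?d?d.d` (exactly the input assumption questioned later in this thread). The sign comes from bit 4 of the first byte, which is clear for `-` (0x2D) and set for the digits 0x30 to 0x39, and the length decides whether a tens digit exists, so no `if` depends on the data. `BranchlessTemperature` and `parseTenths` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

final class BranchlessTemperature {
    /**
     * Parses buf[off, off + len) of the form -?d?d.d into tenths,
     * e.g. "-1.2" -> -12 and "40.1" -> 401, without data-dependent branches.
     */
    static int parseTenths(byte[] buf, int off, int len) {
        int neg = ~(buf[off] >> 4) & 1;             // 1 for '-', 0 for a digit
        int start = off + neg;                      // index of the first digit
        int twoDigits = ((len - neg) >> 2) & 1;     // 1 for "dd.d", 0 for "d.d"
        int tens = (buf[start] - '0') * twoDigits;  // zeroed out when there is no tens digit
        int ones = buf[start + twoDigits] - '0';
        int tenths = buf[start + twoDigits + 2] - '0';
        int value = tens * 100 + ones * 10 + tenths;
        // Conditional negation: (v ^ -1) + 1 == -v and (v ^ 0) + 0 == v.
        return (value ^ -neg) + neg;
    }

    /** Convenience overload for a whole ASCII string. */
    static int parseTenths(String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        return parseTenths(b, 0, b.length);
    }
}
```

Inputs such as `1`, `0`, or `2.00` would be parsed incorrectly; the sketch relies entirely on the fixed one-fractional-digit format.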

Finally, I've added a custom hash table implementation just for this specific use case... speeeed.
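A sketch of what such a use-case-specific table might look like, assuming open addressing with linear probing, a fixed power-of-two capacity comfortably above the number of distinct stations, and readings already parsed into tenths. `StationTable` and its methods are hypothetical. Names are compared as raw bytes and only turned into `String`s when the sorted result is built, so the hot path allocates only on the first sighting of a station.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeMap;

/** Hypothetical open-addressing table: station-name bytes -> min/max/sum/count in tenths. */
final class StationTable {
    // Power of two, assumed to be well above the number of distinct stations.
    private static final int CAPACITY = 1 << 15;
    private static final int MASK = CAPACITY - 1;

    private final byte[][] keys = new byte[CAPACITY][];
    private final int[] min = new int[CAPACITY];
    private final int[] max = new int[CAPACITY];
    private final long[] sum = new long[CAPACITY];
    private final int[] count = new int[CAPACITY];

    /** Records one reading (in tenths) for the name stored in buf[off, off + len). */
    void add(byte[] buf, int off, int len, int tenths) {
        int h = 0;
        for (int i = off; i < off + len; i++) {
            h = 31 * h + buf[i];
        }
        int slot = (h ^ (h >>> 16)) & MASK;
        while (true) {
            byte[] key = keys[slot];
            if (key == null) {
                // First time this station shows up: copy its name bytes once.
                keys[slot] = Arrays.copyOfRange(buf, off, off + len);
                min[slot] = tenths;
                max[slot] = tenths;
                sum[slot] = tenths;
                count[slot] = 1;
                return;
            }
            if (Arrays.equals(key, 0, key.length, buf, off, off + len)) {
                min[slot] = Math.min(min[slot], tenths);
                max[slot] = Math.max(max[slot], tenths);
                sum[slot] += tenths;
                count[slot]++;
                return;
            }
            slot = (slot + 1) & MASK; // collision: probe the next slot
        }
    }

    /** Creates Strings only at the end; each value is {min, max, sum, count}. */
    TreeMap<String, long[]> toSortedMap() {
        TreeMap<String, long[]> result = new TreeMap<>();
        for (int i = 0; i < CAPACITY; i++) {
            if (keys[i] != null) {
                result.put(new String(keys[i], StandardCharsets.UTF_8),
                        new long[] { min[i], max[i], sum[i], count[i] });
            }
        }
        return result;
    }
}
```

Because the capacity is fixed, `add` would loop forever once every slot is taken; the sketch only works when the station count is known to stay well below `CAPACITY`.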

On the MacBook Pro M2 (32GB) it runs in:
real 0m2.873s

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

lobaorn · 2024-01-02T17:01:53Z

Hey @royvanrijn, since you're already very engaged: would you like to check whether Lilliput builds using Generational ZGC could yield benefits? https://twitter.com/gunnarmorling/status/1742227887745376300

If I try it myself, it will probably be by the end of the week, and only if there's another feasible approach to compare against the already opened PRs. Beyond that, it would just be JVM and GC tweaking...

swaechter · 2024-01-02T17:50:09Z

@royvanrijn Nice solution, I had something similar in mind. Maybe using FFI for mmap and pthreads would improve performance (fewer JVM byte arrays for buffering, etc., though it somewhat bends the JNI rule). Maybe I'll find the time :)

suchwerk · 2024-01-02T18:04:17Z

What about using integer arithmetic instead of floats?

gunnarmorling · 2024-01-02T18:30:04Z

Hey @royvanrijn, are you planning to do further changes to this one? If so, wanna put it into "Draft" state until it's good to go from your side?

gunnarmorling · 2024-01-02T19:56:26Z

@royvanrijn, could you rebase this one to current main and squash everything into a single commit?

calculate_average_royvanrijn.sh

gunnarmorling · 2024-01-02T20:03:33Z

Preliminarily updated the leaderboard with the result of the latest version here: 00:23.366 on the evaluation environment. You've taken the lead again, @royvanrijn :)

AlexanderYastrebov · 2024-01-02T23:53:21Z

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

+     * Branchless parser, goes from String to int (10x):
+     * "-1.2" to -12
+     * "40.1" to 401


@gunnarmorling Is this assumption acceptable? E.g. 1, 0, 2.00 are all valid doubles.

Also, there should be an acceptance test suite for the implementations; I'm pretty sure this implementation does not produce the same output as the baseline.

I'm pretty sure it produces the exact same output, I check regularly with each change.

The input and output have one decimal place precision (as stated by the website "rounded to one fractional digit").

Internally, I store the values as integers scaled by 10, because the precision is just a single digit, and I make sure the rounding of the average is correct afterwards.
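A minimal sketch of rounding such a 10x integer average, assuming ties should go toward positive infinity like `Math.round` (which mode is actually expected is what the rest of this discussion is about). For a sum `s` of tenths over `c > 0` readings, `floor((2s + c) / (2c))` equals `s / c` rounded to the nearest integer. `TenthsMean`, `meanTenths`, and `formatTenths` are hypothetical names.

```java
final class TenthsMean {
    /** Mean of tenths-scaled values, rounded to a whole tenth; ties go toward +infinity. */
    static long meanTenths(long sumTenths, long count) {
        // round(s / c) == floor((2s + c) / (2c)) for c > 0; floorDiv keeps negatives correct.
        return Math.floorDiv(2 * sumTenths + count, 2 * count);
    }

    /** Formats tenths as a one-decimal string, e.g. 255 -> "25.5", -5 -> "-0.5". */
    static String formatTenths(long tenths) {
        long abs = Math.abs(tenths);
        return (tenths < 0 ? "-" : "") + (abs / 10) + "." + (abs % 10);
    }
}
```

For the readings 33.6, 31.7, 21.9, and 14.6 the tenths sum to 1018, and `meanTenths(1018, 4)` returns 255, which formats as `25.5`.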

The README only talks about the output format, not the input:

1brc/README.md

Lines 27 to 28 in e7e7deb

The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this

(i.e. sorted alphabetically by station name, and the result values per station in the format `<min>/<mean>/<max>`, rounded to one fractional digit):

so it's worth clarifying.

There are many rounding modes, and this isn't specified either; e.g. Go's https://pkg.go.dev/math#Round is not the same as Java's.
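To make the difference concrete: Java's `Math.round` rounds ties toward positive infinity, `Math.rint` rounds ties to the even neighbor, and `RoundingMode.HALF_UP` rounds ties away from zero, which is also how Go documents `math.Round`. The hypothetical helper below rounds the same tie value all three ways.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class RoundingModes {
    /** Rounds one decimal string to an integer with three different tie-breaking rules. */
    static String describe(String value) {
        double d = Double.parseDouble(value);
        long javaRound = Math.round(d);  // ties toward positive infinity
        double rint = Math.rint(d);      // ties to the even neighbor
        BigDecimal halfUp = new BigDecimal(value).setScale(0, RoundingMode.HALF_UP); // ties away from zero
        return "Math.round=" + javaRound + " rint=" + rint + " HALF_UP=" + halfUp.toPlainString();
    }
}
```

For `-2.5` this returns `Math.round=-2 rint=-2.0 HALF_UP=-3`, so two implementations that both claim to "round half up" can disagree on negative temperatures.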

Also, the reference implementation does ad-hoc rounding with Math.round(value * 10.0) / 10.0, but the actual output depends on string concatenation, which performs another rounding; see https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Double.html#toString(double)

Due to the unfortunate choice of output format (see #14), one has to use a word diff.

Created #36 to address this

Sorry for not being explicit enough here. Can the behavior of the reference implementation be described using any of the existing values of RoundingMode?

I think the exact mode does not really matter; what matters is that the baseline is correct.

I propose changing the baseline to use BigDecimal for accumulating results, with scale 1 (instead of round(x*10)/10) and the HALF_UP rounding mode (as most commonly taught at school) at the final step:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

BigDecimal value = new BigDecimal("12.34");
BigDecimal rounded = value.setScale(1, RoundingMode.HALF_UP);
System.out.println("==" + rounded.toString() + "=="); // prints ==12.3==
```
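Extending the proposal to the mean, a sketch assuming readings arrive as decimal strings: summing with `BigDecimal` keeps the sum exact, and `divide(divisor, scale, roundingMode)` rounds only once, at the end. `BigDecimalMean` is a hypothetical name, not part of the reference implementation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

final class BigDecimalMean {
    /** Exact decimal sum, then a single HALF_UP rounding of the mean to one fractional digit. */
    static BigDecimal mean(List<String> readings) {
        BigDecimal sum = BigDecimal.ZERO;
        for (String r : readings) {
            sum = sum.add(new BigDecimal(r)); // no binary floating point involved
        }
        return sum.divide(BigDecimal.valueOf(readings.size()), 1, RoundingMode.HALF_UP);
    }
}
```

With the readings 33.6, 31.7, 21.9, and 14.6 the exact mean is 25.45, which this rounds to `25.5`.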

But that's the thing: I don't think we can change the behavior of the reference implementation at this point, as it would render existing submissions invalid if they implement a different behavior. So I'd rather make the behavior of the RI explicit, even if it's not the most natural one (agreed that HALF_UP behavior would have been better).

I don't think it's possible to fix the RI, because it uses double division and rounds twice.
Since there are no acceptance tests, I bet many implementations (those that don't parse and calculate values the same way) won't match the RI anyway.

I think the RI should favor correctness over performance; then it can be used to build an acceptance test suite.

OK, I've logged #49 to sort this out separately, so we can get this PR merged. Let's continue the rounding topic over there. Thx!

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

franz1981 · 2024-01-03T14:04:04Z

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

+            return toAdd.measurement;
+        }
+
+        private static int hashCode(byte[] a, int length) {


The hash code here has a data dependency: either unroll the loop manually, or relax the hash code by using a VarHandle and getLong to amortize the data dependency in 8-byte batches, handling only the last 7 (or fewer) bytes separately from the array.
This way, most of the computation would likely resolve in far fewer loop iterations too, similar to https://github.com/apache/activemq-artemis/blob/25fc0342275b29cd73123523a46e6e94582597cd/artemis-commons/src/main/java/org/apache/activemq/artemis/utils/ByteUtil.java#L299
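A sketch of the suggested relaxation, assuming a `VarHandle` from `MethodHandles.byteArrayViewVarHandle` to read 8 bytes at a time and a simple xor-multiply mix. `WordHash` and the mixing constant are illustrative choices, not taken from the PR or from Artemis' `ByteUtil`. Each step still depends on the previous `h`, but there is now one dependent step per 8 bytes instead of one per byte.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class WordHash {
    // Views a byte[] as little-endian longs at arbitrary (also unaligned) byte offsets.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    private static final long MIX = 0x9E3779B97F4A7C15L; // arbitrary odd multiplier

    /** Hashes a[0, length): whole 8-byte words first, then the remaining 0..7 bytes. */
    static int hash(byte[] a, int length) {
        long h = 0;
        int i = 0;
        for (; i + Long.BYTES <= length; i += Long.BYTES) {
            long word = (long) LONG_VIEW.get(a, i);
            h = (h ^ word) * MIX;
        }
        long tail = 0;
        for (int shift = 0; i < length; i++, shift += 8) {
            tail |= (a[i] & 0xFFL) << shift;
        }
        h = (h ^ tail) * MIX;
        return (int) (h ^ (h >>> 32));
    }
}
```

Only the first `length` bytes take part, so the bytes after the station name (the separator and the temperature) do not change the hash.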

gunnarmorling · 2024-01-03T15:16:47Z

@royvanrijn, so what should we do with this one, and all the pending discussions? Wanna submit it as is and keep honing in follow-up PRs? I think it would be nice to be able to update the leaderboard with the current status (fastest right now is @spullara). For that, could you rebase it to resolve the merge conflict?

Added SWAR (SIMD Within A Register) code to increase ByteBuffer processing throughput. Delaying the creation of the String by comparing hashes, segmenting like spullara, improved EOL finding.

Squashing for merge.

Also fixing millisecond separator Co-authored-by: Gunnar Morling <[email protected]>

lobaorn · 2024-01-03T15:39:05Z

Shamelessly sharing this idea for JVM/GC tuning from another PR/discussion: #15 (comment)

royvanrijn · 2024-01-03T15:44:56Z

@gunnarmorling Git is failing me so hard haha, such a mess; let's get this merged and start working on a v2.

gunnarmorling · 2024-01-03T18:17:54Z

@royvanrijn, could you add a line like this to your launch script for making sure the evaluation is done with GraalVM? I'll squash everything then and evaluate. Thx!

gunnarmorling · 2024-01-03T19:51:33Z

@royvanrijn dang, so I've done what I shouldn't have done and merged it before running. But it seems to take much longer / hang now actually. Any idea what's wrong?

franz1981 · 2024-01-03T20:37:22Z

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

+        long mask = match - 0x0101010101010101L;
+        mask &= ~match;
+        mask &= 0x8080808080808080L;
+        return Long.numberOfTrailingZeros(mask) >>> 3;


Shouldn't this be the number of leading ones here?
@royvanrijn @gunnarmorling

Weird, I thought that explicitly setting the byte order to LE on line 105 would make the compatibility issues disappear. So if it runs on my machine, it should automatically work on the target too, although perhaps with a performance hit.

afk atm, I’ll check tomorrow, if somebody wants to fix it and tell me, be my guest 😂

I am not sure actually, for these things I need an old school paper and a pencil :)

I have some local test classes; the problem is that I believe it works on my machine, just not on the target machine. I'll check soon.

Yeah, annoying, it runs fine locally (the code that was pushed), sigh. Kind of debugging in the dark haha... a challenge!

Thinking about it again, I was wrong at #5.

Let's say we have `byte[] data = { 0x01, 0x03 }`, and assume a short-based version of SWAR.

Reading the content of `data` as a little-endian short gives:

0x0301

which has the less significant byte at the lower address. Hence the SWAR result (I'm using the Netty algorithm, but it should be the same here) will be:

0x8000

and, to find 0x03, we have to count the trailing zeros (here 8 + 7 = 15 -> 15 / 8 = 1).

Which means that is fine as it is!
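The same reasoning with a 64-bit word, as a hedged sketch: `firstMatch` applies the zero-byte trick from the diff above to `word ^ pattern` and, because the word is loaded little-endian, counts trailing zeros to find the lowest-addressed match. `SwarSearch` and its helpers are hypothetical names; the load goes through `ByteBuffer` with an explicit `ByteOrder.LITTLE_ENDIAN`, so the result does not depend on the platform's native byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarSearch {
    /** Copies the target byte into all eight byte lanes. */
    static long broadcast(byte target) {
        return (target & 0xFFL) * 0x0101010101010101L;
    }

    /** Reads 8 bytes so that buf[off] becomes the least significant byte of the long. */
    static long loadLittleEndian(byte[] buf, int off) {
        return ByteBuffer.wrap(buf, off, Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    /** Index (0..7) of the first lane equal to the pattern's byte, or 8 if there is none. */
    static int firstMatch(long word, long pattern) {
        long match = word ^ pattern;                 // matching lanes become 0x00
        long mask = match - 0x0101010101010101L;     // a 0x00 lane borrows and gets its high bit set
        mask &= ~match;                              // ignore lanes whose high bit was already set
        mask &= 0x8080808080808080L;                 // keep one flag bit per lane
        // Little-endian: the lowest address is the lowest byte, so count trailing zeros.
        return Long.numberOfTrailingZeros(mask) >>> 3;
    }
}
```

For franz1981's example, padded to eight bytes as `{ 0x01, 0x03, 0, 0, 0, 0, 0, 0 }`, searching for `0x03` returns 1. The trick can only flag spurious lanes above a real match, which is why taking the lowest flagged lane is safe.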

franz1981 · 2024-01-03T20:38:12Z

@gunnarmorling added a comment to help

ddimtirov · 2024-01-04T11:35:51Z

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

-        // System.out.println("Took: " + (System.currentTimeMillis() - before));
+                    // Simple is faster for '\n' (just three options)
+                    int endPointer;
+                    if (bb.get(separatorPointer + 4) == '\n') {


For my input I get IOOBE here:

```
Exception in thread "main" java.lang.IndexOutOfBoundsException
    at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
    at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
    at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at dev.morling.onebrc.CalculateAverage_royvanrijn.run(CalculateAverage_royvanrijn.java:144)
    at dev.morling.onebrc.CalculateAverage_royvanrijn.main(CalculateAverage_royvanrijn.java:92)
Caused by: java.lang.IndexOutOfBoundsException
    at java.base/java.nio.Buffer$1.apply(Buffer.java:757)
    at java.base/java.nio.Buffer$1.apply(Buffer.java:754)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
    at java.base/java.nio.Buffer.checkIndex(Buffer.java:768)
    at java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:358)
    at dev.morling.onebrc.CalculateAverage_royvanrijn.lambda$run$0(CalculateAverage_royvanrijn.java:118)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
    at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
```

Let me know if you need the file.

Ah yes please, perhaps there is some other bug that's platform dependent; do share.

Can you specify what platform you're running it on, and could you please also check this (improved) version:
https://github.com/royvanrijn/1brc/blob/8db31e6a36fbc305765a2393efb06ba6bff23f42/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

Running on Windows. Will check the new version later (it is quite late here now and starting from tomorrow I won't have access to PC for the weekend).

Let me know if you can suggest a way to send you the data file. I started bzipping it and it takes forever, but even partway through, the archive is already 2 GB (you can mail me upload coordinates at dimitar.dimitrov at gmail dot com).

If possible, can you narrow it down? Perhaps run a very small test? Do they all crash, or just this one?

@ddimtirov for future reference, the default compression level on zstd will be a lot faster and offer reasonable compression:

On a MacBook Pro 2020 - 2 GHz Quad-Core Intel Core i5:

```
# time zstd -z measurements.txt
measurements.txt     : 28.24%   ( 12.8 GiB => 3.63 GiB, measurements.txt.zst)
zstd -z measurements.txt  92.10s user 8.05s system 107% cpu 1:33.54 total
```

See related #61

We've also added some basic samples within #82

Added two test cases from the discussion gunnarmorling#5 (comment)

Not all implementations match the baseline, nor do they match the precise value, i.e. `(33.6+31.7+21.9+14.6)/4 = 25.45`, rounded half up to `25.5`:

```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null | tee /tmp/rounding-baseline.log
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive
$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null | tee /tmp/rounding-precise.log
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
FAIL itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive
$ diff -y /tmp/rounding-baseline.log /tmp/rounding-precise.log
FAIL armandino              | PASS armandino
FAIL artsiomkorzun          | PASS artsiomkorzun
PASS baseline               | FAIL baseline
PASS bjhara                 | FAIL bjhara
PASS criccomini             | FAIL criccomini
FAIL ddimtirov                FAIL ddimtirov
FAIL ebarlas                | PASS ebarlas
FAIL filiphr                | PASS filiphr
FAIL itaske                   FAIL itaske
PASS jgrateron              | FAIL jgrateron
FAIL khmarbaise             | PASS khmarbaise
FAIL kuduwa-keshavram         FAIL kuduwa-keshavram
PASS lawrey                 | FAIL lawrey
PASS moysesb                | FAIL moysesb
FAIL nstng                  | PASS nstng
PASS padreati               | FAIL padreati
FAIL palmr                  | PASS palmr
PASS richardstartin         | FAIL richardstartin
FAIL royvanrijn             | PASS royvanrijn
PASS seijikun               | FAIL seijikun
PASS spullara               | FAIL spullara
PASS truelive               | FAIL truelive
```

It's also interesting that e.g. `itaske` produces different results between runs and thus may pass or fail sporadically.

For gunnarmorling#49

…board in local testing using evaluate2.sh] (#209) * Linear probe for city indexing. Beats current leader spullara 2.2 vs 3.8 elapsed time. * Straightforward impl using bytebuffers. Turns out memorysegments were slower than used mappedbytebuffers. * A initial submit-worthy entry Comparison to select entries (averaged over 3 runs) * spullara 1.66s [5th on leaderboard currently] * vemana (this submission) 1.65s * artsiomkorzun 1.64s [4th on leaderboard currently] Tests: PASS Impl Class: dev.morling.onebrc.CalculateAverage_vemana Machine specs * 16 core Ryzen 7950X * 128GB RAM Description * Decompose the full file into Shards of memory mapped files and process each independently, outputting a TreeMap: City -> Statistics * Compose the final answer by merging the individual TreeMap outputs * Select 1 Thread per available processor as reported by the JVM * Size to fit all datastructure in 0.5x L3 cache (4MB/core on the evaluation machines) * Use linear probing hash table, with identity of city name = byte[] and hash code computed inline * Avoid all allocation in the hot path and instead use method parameters. So, instead of passing a single Object param called Point(x, y, z), pass 3 parameters for each of its components. It is ugly, but this challenge is so far from Java's idioms anyway * G1GC seems to want to interfere; use ParallelGC instead (just a quick and dirty hack) Things tried that did not work * MemorySegments are actually slower than MappedByteBuffers * Trying to inline everything: not needed; the JIT compiler is pretty good * Playing with JIT compiler flags didn't yield clear wins. In particular, was surprised that using a max level of 3 and reducing compilation threshold did nothing.. when the jit logs print that none of the methods reach level 4 and stay there for long * Hand-coded implementation of Array.equals(..) using readLong(..) 
& bitmask_based_on_length from a bytebuffer instead of byte by byte * Further tuning to compile loop methods: timings are now consistenctly ahead of artsiomkorzun in 4th place. There are methods on the data path that were being interpreted for far too long. For example, the method that takes a byte range and simply calls one method per line was taking a disproportionate amount of time. Using `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%. ============= vemana =============== [20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana; done; Using java version 21.0.1-graal in this shell. real 0m1.581s user 0m34.166s sys 0m1.435s Using java version 21.0.1-graal in this shell. real 0m1.593s user 0m34.629s sys 0m1.470s Using java version 21.0.1-graal in this shell. real 0m1.632s user 0m35.893s sys 0m1.340s Using java version 21.0.1-graal in this shell. real 0m1.596s user 0m33.074s sys 0m1.386s Using java version 21.0.1-graal in this shell. real 0m1.611s user 0m35.516s sys 0m1.438s ============= artsiomkorzun =============== [20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh artsiomkorzun; done; Using java version 21.0.1-graal in this shell. real 0m1.669s user 0m38.043s sys 0m1.287s Using java version 21.0.1-graal in this shell. real 0m1.679s user 0m37.840s sys 0m1.400s Using java version 21.0.1-graal in this shell. real 0m1.657s user 0m37.607s sys 0m1.298s Using java version 21.0.1-graal in this shell. real 0m1.643s user 0m36.852s sys 0m1.392s Using java version 21.0.1-graal in this shell. real 0m1.644s user 0m36.951s sys 0m1.279s ============= spullara =============== [20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara; done; Using java version 21.0.1-graal in this shell. real 0m1.676s user 0m37.404s sys 0m1.386s Using java version 21.0.1-graal in this shell. real 0m1.652s user 0m36.509s sys 0m1.486s Using java version 21.0.1-graal in this shell. 
real 0m1.665s user 0m36.451s sys 0m1.506s Using java version 21.0.1-graal in this shell. real 0m1.671s user 0m36.917s sys 0m1.371s Using java version 21.0.1-graal in this shell. real 0m1.634s user 0m35.624s sys 0m1.573s ========================== Running Tests ====================== [21:17:57] [lsv@vemana]$ ./runTests.sh vemana Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt Using java version 21.0.1-graal in this shell. real 0m0.150s user 0m1.035s sys 0m0.117s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt Using java version 21.0.1-graal in this shell. real 0m0.114s user 0m0.789s sys 0m0.116s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt Using java version 21.0.1-graal in this shell. real 0m0.115s user 0m0.948s sys 0m0.075s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.926s sys 0m0.066s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt Using java version 21.0.1-graal in this shell. real 0m0.110s user 0m0.734s sys 0m0.078s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt Using java version 21.0.1-graal in this shell. real 0m0.114s user 0m0.870s sys 0m0.095s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.843s sys 0m0.084s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt Using java version 21.0.1-graal in this shell. real 0m0.121s user 0m0.852s sys 0m0.171s * Improve by a few % more; now, convincingly faster than 6th place submission. So far, only algorithms and tuning; no bitwise tricks yet. 
Improve chunking implementation to avoid allocation and allow finegrained chunking for the last X% of work. Work now proceeds in two stages: big chunk stage and small chunk stage. This is to avoid straggler threads holding up result merging. Tests pass [07:14:49] [lsv@vemana]$ ./test.sh vemana Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt Using java version 21.0.1-graal in this shell. real 0m0.152s user 0m0.973s sys 0m0.107s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.840s sys 0m0.060s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt Using java version 21.0.1-graal in this shell. real 0m0.107s user 0m0.681s sys 0m0.085s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt Using java version 21.0.1-graal in this shell. real 0m0.105s user 0m0.894s sys 0m0.068s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt Using java version 21.0.1-graal in this shell. real 0m0.099s user 0m0.895s sys 0m0.068s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt Using java version 21.0.1-graal in this shell. real 0m0.098s user 0m0.813s sys 0m0.050s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt Using java version 21.0.1-graal in this shell. real 0m0.095s user 0m0.777s sys 0m0.087s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt Using java version 21.0.1-graal in this shell. real 0m0.112s user 0m0.904s sys 0m0.069s * Merge results from finished threads instead of waiting for all threads to finish. Not a huge difference overall but no reason to wait. Also experiment with a few other compiler flags and attempt to use jitwatch to understand what the jit is doing. 
* Move to prepare_*.sh format and run evaluate2.sh locally. Shows 7th place in leaderboard | # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes | |---|-----------------|--------------------|-----|---------------|-----------| | 1 | 00:01.588 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary | | 2 | 00:01.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | | | 3 | 00:01.904 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | | | | 00:02.398 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | | | | 00:02.724 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | | | | 00:02.771 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Ra__ius](https://github.com/algirdasrascius) | | | | 00:02.842 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Vemana](https://github.com/vemana) | | | | 00:02.902 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | | | | 00:02.906 | 
[link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [artsiomkorzun](https://github.com/artsiomkorzun) | | | | 00:02.970 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | | * Tune chunksize to get another 2% improvement for 8 processors as used by the evaluation script. * Read int at a time for city name length detection; speeds up by 2% in local testing. * Improve reading temperature double by exiting loop quicker; no major tricks (like reading an int) yet, but good for 5th place on leaderboard in local testing. This small change has caused a surprising gain in performance by about 4%. I didn't expect such a big change, but perhaps in combination with the earlier change to read int by int for the city name, temperature reading is dominating that aspect of the time. Also, perhaps the quicker exit (as soon as you see '.' instead of reading until '\n') means you get to simply skip reading the '\n' across each of the lines. Since the lines are on average like 15 characters, it may be that avoiding reading the \n is a meaningful saving. Or maybe the JIT found a clever optimization for reading the temperature. Or maybe it is simply the case that the number of multiplications is now down to 2 from the previous 3 is what's causing the performance gain? 
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| 1 | 00:01.531 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.794 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | |
| 3 | 00:01.956 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
| | 00:02.346 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | |
| | 00:02.673 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Subrahmanyam](https://github.com/vemana) | |
| | 00:02.689 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | |
| | 00:02.785 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) | |
| | 00:02.926 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | |
| | 00:02.928 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [Artsiom Korzun](https://github.com/artsiomkorzun) | |
| | 00:02.932 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | |

* Reduce one multiplication when the temperature is positive.
* Linear probe for city indexing. Beats current leader spullara: 2.2 vs 3.8 elapsed time.
* Straightforward impl using ByteBuffers. Turns out MemorySegments were slower than MappedByteBuffers.
* An initial submit-worthy entry

Comparison to select entries (averaged over 3 runs)
* spullara 1.66s [5th on leaderboard currently]
* vemana (this submission) 1.65s
* artsiomkorzun 1.64s [4th on leaderboard currently]

Tests: PASS

Impl Class: dev.morling.onebrc.CalculateAverage_vemana

Machine specs
* 16 core Ryzen 7950X
* 128GB RAM

Description
* Decompose the full file into shards of memory mapped files and process each independently, outputting a TreeMap: City -> Statistics
* Compose the final answer by merging the individual TreeMap outputs
* Select 1 thread per available processor, as reported by the JVM
* Size all data structures to fit in 0.5x of the L3 cache (4MB/core on the evaluation machines)
* Use a linear probing hash table, with the identity of a city name = byte[] and the hash code computed inline
* Avoid all allocation in the hot path and use method parameters instead. So, instead of passing a single Object param called Point(x, y, z), pass 3 parameters, one for each of its components. It is ugly, but this challenge is so far from Java's idioms anyway.
* G1GC seems to want to interfere; use ParallelGC instead (just a quick and dirty hack)

Things tried that did not work
* MemorySegments are actually slower than MappedByteBuffers
* Trying to inline everything: not needed; the JIT compiler is pretty good
* Playing with JIT compiler flags didn't yield clear wins. In particular, I was surprised that using a max level of 3 and reducing the compilation threshold did nothing, given that the JIT logs show that none of the methods reach level 4 and stay there for long
* Hand-coded implementation of Array.equals(..) using readLong(..) & bitmask_based_on_length from a ByteBuffer instead of byte by byte

* Further tuning to compile loop methods: timings are now consistently ahead of artsiomkorzun in 4th place. There are methods on the data path that were being interpreted for far too long. For example, the method that takes a byte range and simply calls one method per line was taking a disproportionate amount of time. Using the `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%.

============= vemana ===============
[20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana; done;
Using java version 21.0.1-graal in this shell.
real 0m1.581s user 0m34.166s sys 0m1.435s
Using java version 21.0.1-graal in this shell.
real 0m1.593s user 0m34.629s sys 0m1.470s
Using java version 21.0.1-graal in this shell.
real 0m1.632s user 0m35.893s sys 0m1.340s
Using java version 21.0.1-graal in this shell.
real 0m1.596s user 0m33.074s sys 0m1.386s
Using java version 21.0.1-graal in this shell.
real 0m1.611s user 0m35.516s sys 0m1.438s
============= artsiomkorzun ===============
[20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh artsiomkorzun; done;
Using java version 21.0.1-graal in this shell.
real 0m1.669s user 0m38.043s sys 0m1.287s
Using java version 21.0.1-graal in this shell.
real 0m1.679s user 0m37.840s sys 0m1.400s
Using java version 21.0.1-graal in this shell.
real 0m1.657s user 0m37.607s sys 0m1.298s
Using java version 21.0.1-graal in this shell.
real 0m1.643s user 0m36.852s sys 0m1.392s
Using java version 21.0.1-graal in this shell.
real 0m1.644s user 0m36.951s sys 0m1.279s
============= spullara ===============
[20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara; done;
Using java version 21.0.1-graal in this shell.
real 0m1.676s user 0m37.404s sys 0m1.386s
Using java version 21.0.1-graal in this shell.
real 0m1.652s user 0m36.509s sys 0m1.486s
Using java version 21.0.1-graal in this shell.
real 0m1.665s user 0m36.451s sys 0m1.506s
Using java version 21.0.1-graal in this shell.
real 0m1.671s user 0m36.917s sys 0m1.371s
Using java version 21.0.1-graal in this shell.
real 0m1.634s user 0m35.624s sys 0m1.573s
========================== Running Tests ======================
[21:17:57] [lsv@vemana]$ ./runTests.sh vemana
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt
Using java version 21.0.1-graal in this shell.
real 0m0.150s user 0m1.035s sys 0m0.117s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt
Using java version 21.0.1-graal in this shell.
real 0m0.114s user 0m0.789s sys 0m0.116s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt
Using java version 21.0.1-graal in this shell.
real 0m0.115s user 0m0.948s sys 0m0.075s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt
Using java version 21.0.1-graal in this shell.
real 0m0.113s user 0m0.926s sys 0m0.066s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt
Using java version 21.0.1-graal in this shell.
real 0m0.110s user 0m0.734s sys 0m0.078s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt
Using java version 21.0.1-graal in this shell.
real 0m0.114s user 0m0.870s sys 0m0.095s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt
Using java version 21.0.1-graal in this shell.
real 0m0.113s user 0m0.843s sys 0m0.084s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt
Using java version 21.0.1-graal in this shell.
real 0m0.121s user 0m0.852s sys 0m0.171s

* Improve by a few % more; now convincingly faster than the 6th place submission. So far, only algorithms and tuning; no bitwise tricks yet. Improve the chunking implementation to avoid allocation and allow fine-grained chunking for the last X% of work. Work now proceeds in two stages: a big-chunk stage and a small-chunk stage. This is to avoid straggler threads holding up result merging.

Tests pass
[07:14:49] [lsv@vemana]$ ./test.sh vemana
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt
Using java version 21.0.1-graal in this shell.
real 0m0.152s user 0m0.973s sys 0m0.107s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt
Using java version 21.0.1-graal in this shell.
real 0m0.113s user 0m0.840s sys 0m0.060s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt
Using java version 21.0.1-graal in this shell.
real 0m0.107s user 0m0.681s sys 0m0.085s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt
Using java version 21.0.1-graal in this shell.
real 0m0.105s user 0m0.894s sys 0m0.068s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt
Using java version 21.0.1-graal in this shell.
real 0m0.099s user 0m0.895s sys 0m0.068s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt
Using java version 21.0.1-graal in this shell.
real 0m0.098s user 0m0.813s sys 0m0.050s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt
Using java version 21.0.1-graal in this shell.
real 0m0.095s user 0m0.777s sys 0m0.087s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt
Using java version 21.0.1-graal in this shell.
real 0m0.112s user 0m0.904s sys 0m0.069s

* Merge results from finished threads instead of waiting for all threads to finish. Not a huge difference overall, but no reason to wait. Also experiment with a few other compiler flags and attempt to use JITWatch to understand what the JIT is doing.
* Move to the prepare_*.sh format and run evaluate2.sh locally. Shows 7th place on the leaderboard:

| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| 1 | 00:01.588 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
| 3 | 00:01.904 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | |
| | 00:02.398 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | |
| | 00:02.724 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | |
| | 00:02.771 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) | |
| | 00:02.842 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Vemana](https://github.com/vemana) | |
| | 00:02.902 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | |
| | 00:02.906 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [artsiomkorzun](https://github.com/artsiomkorzun) | |
| | 00:02.970 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | |

* Tune chunk size to get another 2% improvement for 8 processors, as used by the evaluation script.
* Read an int at a time for city name length detection; speeds things up by 2% in local testing.
* Improve reading the temperature double by exiting the loop sooner; no major tricks (like reading an int) yet, but good for 5th place on the leaderboard in local testing. This small change caused a surprising performance gain of about 4%. I didn't expect such a big change, but perhaps, combined with the earlier change to read the city name int by int, temperature reading now dominates that part of the time. Also, perhaps the quicker exit (as soon as you see '.' instead of reading until '\n') means you simply skip reading the '\n' on each line. Since lines average about 15 characters, avoiding reading the '\n' may be a meaningful saving. Or maybe the JIT found a clever optimization for reading the temperature. Or maybe the gain simply comes from the number of multiplications dropping from 3 to 2?
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| 1 | 00:01.531 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.794 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | |
| 3 | 00:01.956 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
| | 00:02.346 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | |
| | 00:02.673 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Subrahmanyam](https://github.com/vemana) | |
| | 00:02.689 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | |
| | 00:02.785 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) | |
| | 00:02.926 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | |
| | 00:02.928 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [Artsiom Korzun](https://github.com/artsiomkorzun) | |
| | 00:02.932 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | |

* Reduce one multiplication when the temperature is positive.
* Added some documentation on the approach.

---------

Co-authored-by: vemana
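The description in this commit message mentions a linear probing hash table keyed on the city name's `byte[]`, with the hash computed inline. Below is a minimal sketch of that idea, for readers who want to see it concretely. It is not vemana's code. The class name `LinearProbeStats`, the table size and the hash function are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of an open-addressing table with linear probing, keyed by the raw
// station-name bytes. Hypothetical names and sizes; not the submission's code.
public class LinearProbeStats {
    // Power of two so "& (CAPACITY - 1)" can replace a modulo. Assumes far fewer
    // distinct stations than slots (the probe loop never ends on a full table).
    private static final int CAPACITY = 1 << 14;

    private final byte[][] keys = new byte[CAPACITY][];
    private final int[] min = new int[CAPACITY];
    private final int[] max = new int[CAPACITY];
    private final long[] sum = new long[CAPACITY];
    private final int[] count = new int[CAPACITY];

    private static int slotFor(int hash) {
        return (hash ^ (hash >>> 16)) & (CAPACITY - 1); // mix high bits into the index
    }

    /** Records one reading (in tenths of a degree) for the name buf[from, to). */
    public void add(byte[] buf, int from, int to, int tenths) {
        int hash = 0;
        for (int i = from; i < to; i++) {
            hash = 31 * hash + buf[i]; // in a real parser this runs while scanning for ';'
        }
        int slot = slotFor(hash);
        while (true) {
            byte[] key = keys[slot];
            if (key == null) { // empty slot: first time this station is seen
                keys[slot] = Arrays.copyOfRange(buf, from, to);
                min[slot] = tenths;
                max[slot] = tenths;
                sum[slot] = tenths;
                count[slot] = 1;
                return;
            }
            if (Arrays.equals(key, 0, key.length, buf, from, to)) { // same station
                min[slot] = Math.min(min[slot], tenths);
                max[slot] = Math.max(max[slot], tenths);
                sum[slot] += tenths;
                count[slot]++;
                return;
            }
            slot = (slot + 1) & (CAPACITY - 1); // collision: probe the next slot
        }
    }

    /** Returns {min, max, sum, count} for a station, or null if it was never added. */
    public long[] get(String name) {
        byte[] b = name.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte x : b) {
            hash = 31 * hash + x;
        }
        for (int slot = slotFor(hash); keys[slot] != null; slot = (slot + 1) & (CAPACITY - 1)) {
            if (Arrays.equals(keys[slot], b)) {
                return new long[] {min[slot], max[slot], sum[slot], count[slot]};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbeStats table = new LinearProbeStats();
        byte[] line1 = "Hamburg;12.0".getBytes(StandardCharsets.UTF_8);
        byte[] line2 = "Hamburg;-3.5".getBytes(StandardCharsets.UTF_8);
        table.add(line1, 0, 7, 120); // name is bytes [0, 7) = "Hamburg"
        table.add(line2, 0, 7, -35);
        System.out.println(Arrays.toString(table.get("Hamburg"))); // [-35, 120, 85, 2]
    }
}
```

Keeping min/max/sum/count in parallel primitive arrays avoids a per-station object. The name bytes are copied only once, when a station is first inserted.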

* Latest snapshot (#1)
  preparing initial version
* Improved performance to 20 seconds (-9 seconds from the previous version) (#2)
  improved performance a bit
* Improved performance to 14 seconds (-6 seconds) (#3)
  improved performance to 14 seconds
* sync branches (#4)
  * initial commit
  * some refactoring of methods
  * some fixes for partitioning
  * some fixes for partitioning
  * fixed hacky getcode for utf8 bytes
  * simplified getcode for partitioning
  * temp solution with syncing
  * temp solution with syncing
  * new stream processing
  * new stream processing
  * some improvements
  * cleaned stuff
  * run configuration
  * round buffer for the stream to pages
  * not using compute since it's slower than straightforward get/put. using own byte array equals.
  * using parallel gc
  * avoid copying bytes when creating a station object
  * formatting
* Copy fewer arrays. Improved performance to 12.7 seconds (-2 seconds) (#5)
  * initial commit
  * some refactoring of methods
  * some fixes for partitioning
  * some fixes for partitioning
  * fixed hacky getcode for utf8 bytes
  * simplified getcode for partitioning
  * temp solution with syncing
  * temp solution with syncing
  * new stream processing
  * new stream processing
  * some improvements
  * cleaned stuff
  * run configuration
  * round buffer for the stream to pages
  * not using compute since it's slower than straightforward get/put. using own byte array equals.
  * using parallel gc
  * avoid copying bytes when creating a station object
  * formatting
  * some tuning to increase performance
  * some tuning to increase performance
  * avoid copying data; fast hashCode with slightly more collisions
  * avoid copying data; fast hashCode with slightly more collisions
* cleanup (#6)
  * tidy up
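Several of these commits fit together into one pattern:

- avoiding byte copies when creating a station object;
- a cheap hashCode that accepts more collisions;
- a hand-written byte array equals;
- plain get/put, which the commit log reports as faster than `compute`.

Below is a hedged sketch of that pattern, using hypothetical names (`SliceKeyAggregation`, `StationKey`, `record`). The HashMap key is a view over a slice of the read buffer, and the name bytes are copied only when a new station is inserted. It is not the code from these commits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: HashMap aggregation with a key that views a slice of the read buffer,
// plain get/put instead of compute(), and a hand-written byte comparison.
// Hypothetical names; not the code from these commits.
public class SliceKeyAggregation {

    /** Station name as (buffer, offset, length); compared byte by byte. */
    static final class StationKey {
        final byte[] buf;
        final int offset;
        final int length;
        final int hash;

        StationKey(byte[] buf, int offset, int length) {
            this.buf = buf;
            this.offset = offset;
            this.length = length;
            // Cheap hash: length plus at most the first 4 bytes. Faster to
            // compute than hashing every byte, at the cost of more collisions.
            int h = length;
            for (int i = 0; i < Math.min(length, 4); i++) {
                h = 31 * h + buf[offset + i];
            }
            this.hash = h;
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof StationKey other) || other.length != length) {
                return false;
            }
            for (int i = 0; i < length; i++) { // own byte-range equals
                if (buf[offset + i] != other.buf[other.offset + i]) {
                    return false;
                }
            }
            return true;
        }
    }

    static final class Stats {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        long sum;
        int count;

        void add(int tenths) {
            min = Math.min(min, tenths);
            max = Math.max(max, tenths);
            sum += tenths;
            count++;
        }
    }

    /** Adds one reading (tenths of a degree) for the name buf[offset, offset + length). */
    static void record(Map<StationKey, Stats> map, byte[] buf, int offset, int length, int tenths) {
        Stats stats = map.get(new StationKey(buf, offset, length)); // lookup view, no byte copy
        if (stats == null) {
            // Copy the name only once, for a new station, so the stored key
            // stays valid if the read buffer is later reused.
            byte[] owned = Arrays.copyOfRange(buf, offset, offset + length);
            stats = new Stats();
            map.put(new StationKey(owned, 0, length), stats);
        }
        stats.add(tenths);
    }

    public static void main(String[] args) {
        byte[] buf = "Abha;5.0\nOslo;-1.0\nAbha;7.0\n".getBytes(StandardCharsets.UTF_8);
        Map<StationKey, Stats> map = new HashMap<>();
        record(map, buf, 0, 4, 50);  // "Abha"
        record(map, buf, 9, 4, -10); // "Oslo"
        record(map, buf, 19, 4, 70); // "Abha" again
        System.out.println(map.size()); // 2
    }
}
```

The lookup still allocates a small `StationKey` per call. Whether the JIT eliminates it through escape analysis is not guaranteed, and that is one reason some entries switch to a custom open-addressing table instead.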

As the number conversion for the original problem was somewhat odd and arguably Java-specific (see the discussion in gunnarmorling/1brc#5), I've decided to move away from it; instead I use the sample results and run checks against my own base implementation.
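One common way to keep floating-point conversion out of the hot loop altogether is to parse each reading into integer tenths of a degree. The value is converted back to one decimal only when printing. The sketch below assumes every value has an optional '-', one or two integer digits, a '.', and exactly one fractional digit; that input format is an assumption, not something stated in this thread. Like the early exit described in the commit message above, it stops the digit loop at the '.', so the trailing '\n' is never read. `TenthsParser` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Sketch: parse "-12.3" into integer tenths (-123) without Double.parseDouble.
// Assumes an optional '-', one or two integer digits, '.', and exactly one
// fractional digit. Hypothetical names; not code from this PR.
public class TenthsParser {

    /** Parses the temperature starting at buf[pos] and returns it in tenths. */
    public static int parseTenths(byte[] buf, int pos) {
        int sign = 1;
        if (buf[pos] == '-') {
            sign = -1;
            pos++;
        }
        int value = 0;
        byte b;
        // Accumulate integer digits, stopping as soon as '.' is seen.
        while ((b = buf[pos++]) != '.') {
            value = value * 10 + (b - '0');
        }
        // Exactly one fractional digit follows; the trailing '\n' is never read.
        return sign * (value * 10 + (buf[pos] - '0'));
    }

    /** Formats tenths with one fractional digit, e.g. -123 -> "-12.3". */
    public static String format(long tenths) {
        long abs = Math.abs(tenths);
        return (tenths < 0 ? "-" : "") + abs / 10 + "." + abs % 10;
    }

    public static void main(String[] args) {
        byte[] line = "Hamburg;-12.3\n".getBytes(StandardCharsets.US_ASCII);
        int t = parseTenths(line, 8); // the value starts right after "Hamburg;"
        System.out.println(t);         // -123
        System.out.println(format(t)); // -12.3
    }
}
```

With integer tenths, min, max and sum are exact, and the only rounding decision left is how to round the mean.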

qtxo reviewed Jan 2, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java Outdated

keshavram-apptware reviewed Jan 2, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java Outdated

royvanrijn commented Jan 2, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java Outdated

qtxo reviewed Jan 2, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java Outdated

royvanrijn changed the title ~~Caved in and created a version that partitions the file~~ memory mapped files, branchless parsing, bitwiddle magic Jan 2, 2024

lobaorn reviewed Jan 2, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

gunnarmorling reviewed Jan 2, 2024

calculate_average_royvanrijn.sh Outdated

royvanrijn closed this Jan 2, 2024

royvanrijn reopened this Jan 2, 2024

AlexanderYastrebov reviewed Jan 2, 2024

franz1981 reviewed Jan 3, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

franz1981 reviewed Jan 3, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

franz1981 reviewed Jan 3, 2024

src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

franz1981 reviewed Jan 3, 2024

gunnarmorling mentioned this pull request Jan 3, 2024

fatroom's initial attempt #15

Merged

royvanrijn closed this Jan 3, 2024

royvanrijn and others added 5 commits January 3, 2024 16:30

Squashed everything into a single commit

7ca2a69

Added SWAR (SIMD Within A Register) code to increase bytebuffer processing/throughput
Delaying the creation of the String by comparing hash, segmenting like spullara, improved EOL finding

Faster version of the data generator

58f01c3

Squashing for merge.

0d02423

Faster version of the data generator

0f58035

Squashing for merge.

581f099

Squashing for merge.
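The first commit in this list mentions SWAR (SIMD Within A Register). As a rough illustration of the technique, and not the code in this PR, the sketch below checks eight bytes at a time for ';'. It uses the well-known "has a zero byte" bit trick on a little-endian `long` read from a `ByteBuffer`. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch of SWAR delimiter search: test 8 bytes per step for ';'.
// Illustrates the technique only; hypothetical names, not this PR's code.
public class SwarSemicolon {
    private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' (0x3B) in every byte
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Index (0-7) of the first ';' in a little-endian word, or 8 if there is none. */
    static int firstSemicolon(long word) {
        long x = word ^ SEMICOLONS;               // bytes equal to ';' become 0x00
        long zeros = (x - ONES) & ~x & HIGH_BITS; // high bit set for zero bytes; exact for the lowest one
        return Long.numberOfTrailingZeros(zeros) >>> 3; // bit index -> byte index (64 -> 8)
    }

    /** Returns the index of the first ';' at or after start, or -1. */
    public static int indexOfSemicolon(ByteBuffer buffer, int start) {
        // duplicate() leaves the caller's buffer untouched; read longs little-endian
        // so that the first byte in memory is the lowest-order byte of each word.
        ByteBuffer buf = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int i = start;
        for (; i + Long.BYTES <= buf.limit(); i += Long.BYTES) {
            int idx = firstSemicolon(buf.getLong(i));
            if (idx < Long.BYTES) {
                return i + idx;
            }
        }
        for (; i < buf.limit(); i++) { // fewer than 8 bytes left: plain byte scan
            if (buf.get(i) == ';') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] line = "Kuala Lumpur;27.4\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(indexOfSemicolon(ByteBuffer.wrap(line), 0)); // 12
    }
}
```

The bit trick can flag false positives only in bytes above a genuine match. Taking the lowest set bit with `numberOfTrailingZeros` therefore always finds the first ';'. Non-ASCII UTF-8 bytes are all >= 0x80 and can never equal 0x3B, so multi-byte station names are safe.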

royvanrijn reopened this Jan 3, 2024

Add note about sharing non-Java solutions on GH discussions;

bfb72f7

Also fixing millisecond separator

Co-authored-by: Gunnar Morling

rmoff and others added 4 commits January 3, 2024 16:39

Add note about sharing non-Java solutions on GH discussions;

4079ba6

Also fixing millisecond separator

Co-authored-by: Gunnar Morling

Squashing for merge.

a27b369

Squashing for merge.

Merge branch 'main' of https://github.com/royvanrijn/1brc

d85b38f

Merge branch 'gunnarmorling:main' into main

b65fec6

gunnarmorling mentioned this pull request Jan 3, 2024

Clarify rounding semantics #49

Closed

gunnarmorling merged commit 5570f1b into gunnarmorling:main Jan 3, 2024
1 check passed

franz1981 reviewed Jan 3, 2024

ddimtirov reviewed Jan 4, 2024

AlexanderYastrebov mentioned this pull request Jan 4, 2024

wip,test: add rounding test case #115

Closed

	The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this
	(i.e. sorted alphabetically by station name, and the result values per station in the format `<min>/<mean>/<max>`, rounded to one fractional digit):
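Below is a straightforward, unoptimized sketch of that computation. A simple version like this can serve as a reference to check fast implementations against. It relies on several assumptions not stated in the excerpt above:

- Every input line is `station;temperature`, with exactly one fractional digit.
- The mean is rounded half up toward positive infinity via `Math.round`.
- The output uses `TreeMap.toString()`, which may not match the challenge's exact output format.

`MinMeanMax` and its members are hypothetical names.

```java
import java.util.List;
import java.util.TreeMap;

// Reference sketch: per-station min/mean/max, sorted by station name.
// Assumptions: "station;temperature" lines with exactly one fractional digit,
// mean rounded half up via Math.round, output via TreeMap.toString().
public class MinMeanMax {

    static final class Stats {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        long sum;
        long count;

        void add(int tenths) {
            min = Math.min(min, tenths);
            max = Math.max(max, tenths);
            sum += tenths;
            count++;
        }

        @Override
        public String toString() {
            long mean = Math.round((double) sum / count); // mean in tenths, rounded half up
            return fmt(min) + "/" + fmt(mean) + "/" + fmt(max);
        }
    }

    /** Formats tenths with one fractional digit, e.g. -34 -> "-3.4". */
    static String fmt(long tenths) {
        long abs = Math.abs(tenths);
        return (tenths < 0 ? "-" : "") + abs / 10 + "." + abs % 10;
    }

    /** Aggregates lines such as "Hamburg;12.0"; the TreeMap keeps stations sorted. */
    public static TreeMap<String, Stats> aggregate(List<String> lines) {
        TreeMap<String, Stats> result = new TreeMap<>();
        for (String line : lines) {
            int sep = line.indexOf(';');
            String station = line.substring(0, sep);
            // "-12.3" -> "-123": only valid because there is exactly one fractional digit
            int tenths = Integer.parseInt(line.substring(sep + 1).replace(".", ""));
            result.computeIfAbsent(station, k -> new Stats()).add(tenths);
        }
        return result;
    }

    public static void main(String[] args) {
        var result = aggregate(List.of("Hamburg;12.0", "Bulawayo;8.9", "Hamburg;-3.4", "Bulawayo;9.1"));
        System.out.println(result); // {Bulawayo=8.9/9.0/9.1, Hamburg=-3.4/4.3/12.0}
    }
}
```

The rounding choice matters for negative halves. Here a mean of -0.15 (sum -3 tenths over 2 readings) becomes `Math.round(-1.5) == -1`, printed as "-0.1". A different rounding mode would print "-0.2".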

memory mapped files, branchless parsing, bitwiddle magic #5

Conversation

royvanrijn commented Jan 1, 2024 • edited

lobaorn commented Jan 2, 2024

swaechter commented Jan 2, 2024 • edited

suchwerk commented Jan 2, 2024

gunnarmorling commented Jan 2, 2024

gunnarmorling commented Jan 2, 2024

gunnarmorling commented Jan 2, 2024

royvanrijn Jan 3, 2024 • edited

AlexanderYastrebov Jan 3, 2024 • edited

franz1981 Jan 3, 2024 • edited

gunnarmorling commented Jan 3, 2024

lobaorn commented Jan 3, 2024

royvanrijn commented Jan 3, 2024

gunnarmorling commented Jan 3, 2024

gunnarmorling commented Jan 3, 2024

franz1981 Jan 3, 2024 • edited

royvanrijn Jan 3, 2024 • edited

franz1981 Jan 3, 2024 • edited

franz1981 commented Jan 3, 2024

royvanrijn Jan 4, 2024 • edited

DamienOReilly Jan 4, 2024 • edited
