The Challenge

#1BRC https://github.com/gunnarmorling/1brc

Input

1 Billion rows 13 GB on Disk

Output

Constraints

JDK21
No native Code
GraalVM allowed
No third-party libraries

How to do that in Java

We need to

process all lines
Parse line into Station / Value
Aggreagte Value / Store with Station
Sort by Station

Solutions

Solution: Baseline

https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_baseline.java

Records for Measurement/ResultRow Custom Collector Files#lines (...) Streams / Grouping TreeMap for Sorting

time ./calculate_average_baseline.sh Runtime: 4:07*

*) on my Machine

Solution: Ales Justin

CalculateAverage_alesj

28 Lines!

Use built-in DoubleSummaryStatistics
Parallel Streams

time ./calculate_average_alesj.sh Runtime: 2:10*

Solution: Karl Heinz Marbaise

CalculateAverage_khmarbaise

Parallel Streams
Stream Pipeline
Sorting in Stream
Double summarize statistics

time ./calculate_average_khmarbaise.sh Runtime: 1:45*

How to improve?

Load data into Memory (max chunck size 2GB!) -> int index
Process data chunks in parallel

How:

FileChannel
- via RandomAccessFile#getChannel(..)
- FileChannel.open(..)
- FileInputStream#getChannel)
File MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, start, Math.min(CHUNK_SIZE, size - start));

Solution: Hampus Ram

CalculateAverage_bjhara

Runtime: 0:40*

Can we do better? Flame Graph

jbang --javaagent=ap-loader@jvm-profiling-tools/ap-loader=start,event=cpu,file=profile.html src/main/java/dev/morling/onebrc/CalculateAverage_bjhara.java

open profile.html

How about tuning Double.parseDouble(...) ?

time ./calculate_average_bjhara.sh Runtime: 0:32*

Solution: Anita SV

JEP 454: Foreign Function & Memory API

https://openjdk.org/jeps/454

import java.lang.foreign.Arena; import java.lang.foreign.MemorySegment; import java.lang.foreign.ValueLayout;

CalculateAverage_anitasv

Map whole File into memory! -> no 2GB limit anymore yay!

MemorySegment mmapMemory = fileChannel.map( FileChannel.MapMode.READ_ONLY, 0, fileSize, Arena.global());

Specialized custom HashMap -> FastHashMap

Parse "double" from raw bytes encoded as long into an int

time ./calculate_average_anitasv.sh Runtime: 0:12*

Solution: Arjen Wisse

CalculateAverage_arjenw

Custom HashMap
Custom double parsing as int from bytes

Short solution, very readable, no UNSAFE so far!

time ./calculate_average_arjenw.sh Runtime 0:09*

Mechanical Sympathy

"You don't have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy." Jackie Stewart, racing driver

The "Mechanical Sympathy" movement calls for developing this understanding before it's too late and to upfront design the applications taking into account how the machine is going to work on it.

Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer

Jitwatch

Tool to Analyze Machine Code generated by the JIT compiler

java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar

Sandbox simple inling

Understand the CPU

SIMD
ILP
Branch Prediction

SIMD

Single Instruction Multiple Data

JEP 448: Vector API (Sixth Incubator)

Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.

SIMD operations

A vector computation consists of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the scalar values correspond to the number of hardware-defined vector lanes.

Vectorized assembly instructions movq -> vmovq

Solution: Chris Bellew

SIMD Operations via Java Vector API

CalculateAverage_chrisbellew

Show vector operations:

java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar

Runtime: 0:07*

time ./calculate_average_ChrisBellew.sh

ILP

Instruction Level Parallelism

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

ILP must not be confused with concurrency. In ILP there is a single specific thread of execution of a process. On the other hand, concurrency involves the assignment of multiple threads to a CPU's core in a strict alternation, or in true parallelism if there are enough CPU cores, ideally one core for each runnable thread.

Example: https://en.wikipedia.org/wiki/Instruction-level_parallelism

Solution: Dr. Ian Preston

dev.morling.onebrc.CalculateAverage_ianopolousfast#parseStats

lineSize1, lineSize2 ...

time ./calculate_average_ianopolousfast.sh

Runtime: 0:05*

Solution: Samuel Yuvon

Helping Branch Prediction -> Brancheless code

dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMax dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMin

Now to the esoteric stuff

Low level bit hacks

Finding the ";" splitter

String line = "Los Angeles;14.0"; String[] items = line.split(";"); String station = items[0]; double value = Double.parseDouble(items[1]);

Can we do better?

Solution: Van Puh DO

Find character bytes within long byte sequence dev.morling.onebrc.CalculateAverage_gonix.Aggregator#valueSepMark dev.morling.onebrc.CalculateAverage_abeobk#getSemiCode dev.morling.onebrc.CalculateAverage_abeobk#getLFCode
Parse number as long from raw long bytes (originally from merykitty) dev.morling.onebrc.CalculateAverage_abeobk#num

time ./calculate_average_abeobk.sh

Runtime: 0:03*

Solution: Quan Ahn Mai

Unsafe
Vector API
Foreign Memory API
Bit level hacks

Unsafe Mechanics

private static final Unsafe UNSAFE;
static {
    try {
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        UNSAFE = (Unsafe) theUnsafe.get(Unsafe.class);
    }
    catch (NoSuchFieldException | IllegalAccessException e) {
        throw new RuntimeException(e);
    }
}

time ./calculate_average_merykittyunsafe.sh Runtime: 0.02*

Squeezing out the last performance bits

GraalVM

./prepare_thomaswue.sh

Runtime: 0:01*

Honorable Mentions

Alternative Implementations
AWK gunnarmorling/1brc#171
Go gunnarmorling/1brc#67
Oracle Database (gunnarmorling/1brc#707)
Cobol: gunnarmorling/1brc#481
Implementation in C with AVX2 instructions (Fastest solution! 0.5s!!!) gunnarmorling/1brc#710

Lessons learned

You can already quite far with clean idiomatic Java Code
New APIs open up more possibilities Foreign Memory API, Vector API
Measure, don't guess!
Strive the balance between optimization and maintainability
Java is not slow!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

notes.md

notes.md

The Challenge

Input

Output

Constraints

How to do that in Java

Solutions

Solution: Baseline

Solution: Ales Justin

Solution: Karl Heinz Marbaise

How to improve?

Solution: Hampus Ram

Solution: Anita SV

Solution: Arjen Wisse

Mechanical Sympathy

Jitwatch

Understand the CPU

SIMD

Solution: Chris Bellew

ILP

Solution: Dr. Ian Preston

Solution: Samuel Yuvon

Now to the esoteric stuff

Low level bit hacks

Solution: Van Puh DO

Solution: Quan Ahn Mai

Squeezing out the last performance bits

GraalVM

Honorable Mentions

Lessons learned

Files

notes.md

Latest commit

History

notes.md

File metadata and controls

The Challenge

Input

Output

Constraints

How to do that in Java

Solutions

Solution: Baseline

Solution: Ales Justin

Solution: Karl Heinz Marbaise

How to improve?

Solution: Hampus Ram

Solution: Anita SV

Solution: Arjen Wisse

Mechanical Sympathy

Jitwatch

Understand the CPU

SIMD

Solution: Chris Bellew

ILP

Solution: Dr. Ian Preston

Solution: Samuel Yuvon

Now to the esoteric stuff

Low level bit hacks

Solution: Van Puh DO

Solution: Quan Ahn Mai

Squeezing out the last performance bits

GraalVM

Honorable Mentions

Lessons learned