
The Challenge

#1BRC https://github.com/gunnarmorling/1brc

Input

1 billion rows, about 13 GB on disk

Output

Constraints

  • JDK 21
  • No native code
  • GraalVM allowed
  • No third-party libraries

How to do that in Java

We need to

  • process all lines
  • Parse line into Station / Value
  • Aggregate Value / Store with Station
  • Sort by Station
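The four steps above can be sketched in plain, single-threaded Java. This is a minimal sketch, not the official baseline: the class `NaiveAggregator` and its `Stats` helper are hypothetical, and the lines come from a `List<String>` instead of the 13 GB file so the example stays self-contained.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class NaiveAggregator {

    // Running min/max/sum/count for one station (hypothetical helper)
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;
        long count;

        void add(double value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        @Override
        public String toString() {
            // min/mean/max with one decimal place; Locale.ROOT forces '.' as separator
            return String.format(Locale.ROOT, "%.1f/%.1f/%.1f", min, sum / count, max);
        }
    }

    static Map<String, Stats> aggregate(List<String> lines) {
        // TreeMap keeps the stations sorted by name
        Map<String, Stats> result = new TreeMap<>();
        for (String line : lines) {                                 // 1. process all lines
            int sep = line.indexOf(';');                            // 2. parse station / value
            String station = line.substring(0, sep);
            double value = Double.parseDouble(line.substring(sep + 1));
            result.computeIfAbsent(station, k -> new Stats()).add(value); // 3. aggregate
        }
        return result;                                              // 4. already sorted
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of("Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2")));
    }
}
```

Running `main` prints `{Bulawayo=8.9/8.9/8.9, Hamburg=12.0/23.1/34.2}`.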

Solutions

Solution: Baseline

https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_baseline.java

  • Records for Measurement/ResultRow
  • Custom Collector
  • Files#lines(...)
  • Streams / grouping
  • TreeMap for sorting

time ./calculate_average_baseline.sh
Runtime: 4:07*

*) on my machine
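The records + custom collector + TreeMap combination can be sketched like this. It is an illustration, not the actual baseline code: the names `CollectorSketch`, `Measurement`, `Result`, `Acc` and `collect` are hypothetical, and a `Stream<String>` stands in for `Files.lines(...)`.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CollectorSketch {

    // One parsed input line (hypothetical record)
    record Measurement(String station, double value) {
        static Measurement parse(String line) {
            int sep = line.indexOf(';');
            return new Measurement(line.substring(0, sep), Double.parseDouble(line.substring(sep + 1)));
        }
    }

    // Final per-station result (hypothetical record)
    record Result(double min, double mean, double max) { }

    // Mutable accumulation container used by the collector
    static final class Acc {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
        long count;
    }

    static final Collector<Measurement, Acc, Result> STATS = Collector.of(
            Acc::new,                                   // supplier: new empty container
            (a, m) -> {                                 // accumulator: fold in one measurement
                a.min = Math.min(a.min, m.value());
                a.max = Math.max(a.max, m.value());
                a.sum += m.value();
                a.count++;
            },
            (a, b) -> {                                 // combiner: merges partial results
                a.min = Math.min(a.min, b.min);
                a.max = Math.max(a.max, b.max);
                a.sum += b.sum;
                a.count += b.count;
                return a;
            },
            a -> new Result(a.min, a.sum / a.count, a.max)); // finisher

    static Map<String, Result> collect(Stream<String> lines) {
        // TreeMap::new makes groupingBy return a map sorted by station name
        return lines.map(Measurement::parse)
                .collect(Collectors.groupingBy(Measurement::station, TreeMap::new, STATS));
    }
}
```

With a real file, `collect(Files.lines(path))` would feed the same pipeline (the stream should then be closed, e.g. with try-with-resources).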

Solution: Ales Justin

CalculateAverage_alesj

28 Lines!

  • Use built-in DoubleSummaryStatistics
  • Parallel Streams

time ./calculate_average_alesj.sh
Runtime: 2:10*
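A sketch of the DoubleSummaryStatistics + parallel stream idea, assuming the input is already available as a `List<String>`; the class name `SummaryStatsSketch` is hypothetical and this is not the alesj code. `Collectors.summarizingDouble` keeps min, max, sum and count per station, and the parallel stream merges the per-thread partial maps.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class SummaryStatsSketch {

    static Map<String, DoubleSummaryStatistics> aggregate(List<String> lines) {
        return lines.parallelStream()
                .map(line -> line.split(";"))
                // group by station name into a TreeMap (sorted output);
                // each group collects one DoubleSummaryStatistics
                .collect(Collectors.groupingBy(
                        parts -> parts[0],
                        TreeMap::new,
                        Collectors.summarizingDouble(parts -> Double.parseDouble(parts[1]))));
    }
}
```

`getMin()`, `getAverage()` and `getMax()` on each value then provide the three numbers of the output.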

Solution: Karl Heinz Marbaise

CalculateAverage_khmarbaise

  • Parallel Streams
  • Stream Pipeline
  • Sorting in Stream
  • DoubleSummaryStatistics

time ./calculate_average_khmarbaise.sh
Runtime: 1:45*

How to improve?

  • Load data into memory (max. chunk size 2 GB, because buffers use an int index)
  • Process data chunks in parallel

How:

  • FileChannel
    • via RandomAccessFile#getChannel()
    • FileChannel.open(..)
    • FileInputStream#getChannel()
  • Map a chunk of the file: MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, start, Math.min(CHUNK_SIZE, size - start));
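The chunked mapping can be sketched as follows. The chunk size is passed as a parameter (hypothetical; real solutions use values up to `Integer.MAX_VALUE`) so it can be tiny for testing. The sketch only counts newline bytes; a real solution would also align chunk boundaries to line starts and hand the chunks to several threads.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedMmap {

    // Counts '\n' bytes by mapping the file in chunks of at most chunkSize bytes.
    // A single MappedByteBuffer is indexed with an int, hence the 2 GB limit per chunk.
    static long countLines(Path file, int chunkSize) throws IOException {
        long lines = 0;
        try (FileChannel fileChannel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = fileChannel.size();
            for (long start = 0; start < size; start += chunkSize) {
                MappedByteBuffer buffer = fileChannel.map(
                        FileChannel.MapMode.READ_ONLY, start, Math.min(chunkSize, size - start));
                while (buffer.hasRemaining()) {
                    if (buffer.get() == '\n') {
                        lines++;
                    }
                }
            }
        }
        return lines;
    }
}
```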

Solution: Hampus Ram

CalculateAverage_bjhara

Runtime: 0:40*

Can we do better? Flame Graph

jbang --javaagent=ap-loader@jvm-profiling-tools/ap-loader=start,event=cpu,file=profile.html src/main/java/dev/morling/onebrc/CalculateAverage_bjhara.java

open profile.html

How about tuning Double.parseDouble(...)?

time ./calculate_average_bjhara.sh
Runtime: 0:32*
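In 1BRC every value lies between -99.9 and 99.9 and has exactly one fractional digit, so `Double.parseDouble(...)` can be replaced by parsing the bytes directly into an `int` counting tenths. A minimal sketch assuming that fixed format and doing no validation; `TenthsParser` is a hypothetical name, not the bjhara implementation.

```java
final class TenthsParser {

    // Parses e.g. "-12.3" starting at bytes[offset] into -123 (tenths of a degree).
    // Assumes the 1BRC format: optional '-', one or two digits, '.', one digit.
    static int parseTenths(byte[] bytes, int offset) {
        int i = offset;
        boolean negative = bytes[i] == '-';
        if (negative) {
            i++;
        }
        int value = bytes[i++] - '0';                  // first integer digit
        if (bytes[i] != '.') {
            value = value * 10 + (bytes[i++] - '0');   // optional second integer digit
        }
        i++;                                           // skip '.'
        value = value * 10 + (bytes[i] - '0');         // the single fractional digit
        return negative ? -value : value;
    }
}
```

Min, max and sum can then stay integers, with the division by 10.0 happening only when the result is printed.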

Solution: Anita SV

JEP 454: Foreign Function & Memory API

https://openjdk.org/jeps/454

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

CalculateAverage_anitasv

Map the whole file into memory: no 2 GB limit anymore, yay!

MemorySegment mmapMemory = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSize, Arena.global());

Specialized custom HashMap -> FastHashMap

Parse the "double" from raw bytes (read as a long) into an int

time ./calculate_average_anitasv.sh
Runtime: 0:12*
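A specialized map can avoid creating a `String` per line by hashing the raw station bytes and probing flat arrays. The sketch below uses open addressing with linear probing and assumes a fixed power-of-two capacity well above the 1BRC limit of 10,000 distinct stations, so it never resizes. It is not the actual `FastHashMap`; `StationTable` and its methods are hypothetical.

```java
import java.util.Arrays;

final class StationTable {

    // Power of two, so the table stays at most about one-third full for 10,000 stations
    private static final int CAPACITY = 1 << 15;

    private final byte[][] keys = new byte[CAPACITY][];
    private final int[] min = new int[CAPACITY];
    private final int[] max = new int[CAPACITY];
    private final long[] sum = new long[CAPACITY];
    private final int[] count = new int[CAPACITY];

    // Returns the slot holding this name, or the empty slot where it belongs
    private int findSlot(byte[] data, int start, int end) {
        int hash = 0;
        for (int i = start; i < end; i++) {
            hash = 31 * hash + data[i];
        }
        int slot = (hash ^ (hash >>> 16)) & (CAPACITY - 1);
        while (keys[slot] != null
                && !Arrays.equals(keys[slot], 0, keys[slot].length, data, start, end)) {
            slot = (slot + 1) & (CAPACITY - 1); // linear probing
        }
        return slot;
    }

    // Records one measurement (in tenths) for the station name in data[start, end)
    void add(byte[] data, int start, int end, int tenths) {
        int slot = findSlot(data, start, end);
        if (keys[slot] == null) {
            // First occurrence: copy the name bytes once
            keys[slot] = Arrays.copyOfRange(data, start, end);
            min[slot] = tenths;
            max[slot] = tenths;
        } else {
            min[slot] = Math.min(min[slot], tenths);
            max[slot] = Math.max(max[slot], tenths);
        }
        sum[slot] += tenths;
        count[slot]++;
    }

    // Returns {min, max, sum, count} for a station, or null if it was never added
    long[] stats(byte[] name) {
        int slot = findSlot(name, 0, name.length);
        if (keys[slot] == null) {
            return null;
        }
        return new long[] {min[slot], max[slot], sum[slot], count[slot]};
    }
}
```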

Solution: Arjen Wisse

CalculateAverage_arjenw

  • Custom HashMap
  • Custom double parsing as int from bytes

Short solution, very readable, no UNSAFE so far!

time ./calculate_average_arjenw.sh
Runtime: 0:09*

Mechanical Sympathy

"You don't have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy." Jackie Stewart, racing driver

The "Mechanical Sympathy" movement calls for developing this understanding before it's too late, and for designing applications up front with the way the hardware works in mind.

Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer

JITWatch

Tool to analyze the machine code generated by the JIT compiler

java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar

Sandbox: simple inlining

Understand the CPU

  • SIMD
  • ILP
  • Branch Prediction

SIMD

Single Instruction Multiple Data

JEP 448: Vector API (Sixth Incubator)

Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.

SIMD operations

A vector computation consists of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the scalar values correspond to the number of hardware-defined vector lanes.

Vectorized assembly instructions movq -> vmovq
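Even without the Vector API, HotSpot's C2 compiler can auto-vectorize simple counted loops over primitive arrays. Whether it actually emits SIMD instructions depends on the JVM version, flags and CPU, so treat the loop below only as a typical candidate (class name hypothetical) whose generated assembly can be inspected in JITWatch.

```java
final class AutoVectorize {

    // Element-wise addition: independent iterations, no branches in the body.
    // Once the method is hot, C2 may compile this loop to packed SIMD additions
    // that process several ints per instruction.
    static void add(int[] a, int[] b, int[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = a[i] + b[i];
        }
    }
}
```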

Solution: Chris Bellew

SIMD Operations via Java Vector API

CalculateAverage_chrisbellew

Show vector operations:

java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar

Runtime: 0:07*

time ./calculate_average_ChrisBellew.sh

ILP

Instruction Level Parallelism

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

ILP must not be confused with concurrency. In ILP there is a single specific thread of execution of a process. On the other hand, concurrency involves the assignment of multiple threads to a CPU's core in a strict alternation, or in true parallelism if there are enough CPU cores, ideally one core for each runnable thread.

Example: https://en.wikipedia.org/wiki/Instruction-level_parallelism
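A common way to expose ILP is to split one long dependency chain into several independent ones. In this sketch (hypothetical names), `sumSingle` has every addition wait for the previous one, while `sumFourWay` keeps four independent accumulators that the CPU can advance in parallel. Real speedups depend on the CPU and on what the JIT already does, so measure.

```java
final class IlpSum {

    // One accumulator: each addition depends on the result of the previous one
    static long sumSingle(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Four accumulators: the four additions per iteration are independent
    static long sumFourWay(long[] values) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < values.length; i += 4) {
            s0 += values[i];
            s1 += values[i + 1];
            s2 += values[i + 2];
            s3 += values[i + 3];
        }
        for (; i < values.length; i++) { // leftover elements
            s0 += values[i];
        }
        return s0 + s1 + s2 + s3;
    }
}
```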

Solution: Dr. Ian Preston

dev.morling.onebrc.CalculateAverage_ianopolousfast#parseStats

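lineSize1, lineSize2 ... (several lines are handled in an interleaved fashion, giving the CPU independent instructions to overlap)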

time ./calculate_average_ianopolousfast.sh

Runtime: 0:05*

Solution: Samuel Yvon

  • Helping Branch Prediction -> Branchless code

dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMax
dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMin
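One classic branchless formulation (not necessarily the one used in CalculateAverage_SamuelYvon) turns the sign bit of the difference into an all-ones or all-zeros mask, so the result is selected without a conditional jump. It assumes `a - b` does not overflow, which holds for temperatures stored as tenths (-999..999). Note that `Math.min`/`Math.max` are JIT intrinsics and are often compiled to branch-free code already.

```java
final class Branchless {

    // diff >> 31 is -1 (all bits set) if a < b, otherwise 0
    static int min(int a, int b) {
        int diff = a - b;
        return b + (diff & (diff >> 31)); // a < b: b + (a - b) = a, else b
    }

    static int max(int a, int b) {
        int diff = a - b;
        return a - (diff & (diff >> 31)); // a < b: a - (a - b) = b, else a
    }
}
```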

Now to the esoteric stuff

Low-level bit hacks

  • Finding the ";" separator

String line = "Los Angeles;14.0";
String[] items = line.split(";");
String station = items[0];
double value = Double.parseDouble(items[1]);

Can we do better?

Solution: Van Phu Do

  • Find character bytes within 8 bytes read as one long
    • dev.morling.onebrc.CalculateAverage_gonix.Aggregator#valueSepMark
    • dev.morling.onebrc.CalculateAverage_abeobk#getSemiCode
    • dev.morling.onebrc.CalculateAverage_abeobk#getLFCode
  • Parse the number from raw bytes read as a long (originally from merykitty)
    • dev.morling.onebrc.CalculateAverage_abeobk#num
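Both tricks operate on eight input bytes loaded into one `long` in little-endian order. The sketch below is a from-scratch reconstruction under that assumption, so names and constants may differ from the abeobk and merykitty code: `semicolonIndex` applies the well-known SWAR "has zero byte" test to `word ^ 0x3B3B...` (0x3B is ';'), and `parseTenths` follows the branchless number-parsing idea credited to merykitty. It assumes 8 readable bytes at every offset used (e.g. a padded buffer).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarTricks {

    private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // Reads data[offset..offset+7] as a little-endian long (data[offset] = lowest byte)
    static long wordAt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    // Index (0..7) of the first ';' in the word, or 8 if there is none
    static int semicolonIndex(long word) {
        long x = word ^ SEMICOLONS;           // bytes equal to ';' become 0x00
        long t = (x - ONES) & ~x & HIGHS;     // high bit marks zero bytes; the lowest mark is exact
        return Long.numberOfTrailingZeros(t) >>> 3;
    }

    // Parses "-99.9".."99.9" (word starts at the number) into tenths without branches
    static int parseTenths(long word) {
        // Digits '0'..'9' have bit 4 set, '.' does not: the first such byte among 1..3 is the dot
        int dotPos = Long.numberOfTrailingZeros(~word & 0x10101000L);
        int shift = 28 - dotPos;
        // -1 if the first byte is '-' (bit 4 clear), otherwise 0
        long sign = (~word << 59) >> 63;
        long clearMinus = ~(sign & 0xFF);
        // Align the digits to fixed byte positions and turn ASCII into digit values
        long digits = ((word & clearMinus) << shift) & 0x0F000F0F00L;
        // One multiplication sums hundreds*100 + tens*10 + units into bits 32..41
        long abs = ((digits * 0x640A0001L) >>> 32) & 0x3FF;
        return (int) ((abs ^ sign) - sign);   // conditional negation
    }
}
```

For "Hamburg;-12.3\n" the first word yields a separator index of 7, and the word starting right after it parses to -123.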

time ./calculate_average_abeobk.sh

Runtime: 0:03*

Solution: Quan Anh Mai

  • Unsafe
  • Vector API
  • Foreign Memory API
  • Bit level hacks

Unsafe Mechanics

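import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Obtain the Unsafe singleton by reading its private static field "theUnsafe"
private static final Unsafe UNSAFE;
static {
    try {
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        // Static field: the instance argument is ignored, so null is passed
        UNSAFE = (Unsafe) theUnsafe.get(null);
    }
    catch (NoSuchFieldException | IllegalAccessException e) {
        throw new RuntimeException(e);
    }
}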

time ./calculate_average_merykittyunsafe.sh
Runtime: 0:02*

Squeezing out the last performance bits

GraalVM

./prepare_thomaswue.sh

Runtime: 0:01*


Honorable Mentions

Lessons learned

  • You can already get quite far with clean, idiomatic Java code
  • New APIs open up more possibilities: Foreign Memory API, Vector API
  • Measure, don't guess!
  • Strike a balance between optimization and maintainability
  • Java is not slow!