1BRC 0.145s default dataset, 0.281s 10K dataset, accept all valid inputs (C++, SIMD, hash, tricks) #138

lehuyduc · 2024-01-05T13:55:24Z

lehuyduc
Jan 5, 2024

https://github.com/lehuyduc/1brc-simd

main.cpp and main_small.cpp follows all the requirements specified by the challenges (no preprocessing, has to work with all valid inputs length <= 100, no extra input assumption, single-file no libraries, etc).

Tested on file size 13795495846 bytes, generated by ./create_measurements.sh 1000000000. See my repo for link to both default and 10K datasets. Tested on many PCs. No HugeTLB.

Dual EPYC 9354 32c64t = 64c128t total, 1TB DDR5 unknown speed
This setup is actually slower than Threadripper PRO 5995WX when <= 32 threads are used. Likely because 5995WX has higher clock speed, and at <= 32 threads RAM bandwidth doesn't matter as much compared to running at 128 threads.
Ubuntu 20.04, virtual machine, g++ 9.4. All very detailed results are in benchmark_results folder in my repo.

Bandwidth = 1.4585e+11 byte/s tested using test_bandwidth.sh. To ensure benchmark are comparable, you must check not just the CPU but also RAM bandwidth. Also the dataset used can make a big difference.

Version: 1brc_valid23_small

To get the best number possible, you will need luck. Run your code a bunch of time and pray for a good result (it's just like playing the slot machine 😆 ). hyperfine will make the code slower, so just run manually.

_small suffix means it will still work with all inputs, but is slower on the 10K dataset (1.5-2x slower than 1brc_valid23). This is because the hash table size is only 16384, which is the smallest size > number of unique keys.

Version: 1brc_valid23, default dataset. In this test, Dual EPYC 9354 has same speed as Threadripper 5995WX despite having 2.25x the bandwidth, likely because all the hash table elements fit in cache.

[email protected]:~/1brc-simd$ ./run.sh 128 128
Using 128 threads
PC has 128 physical cores
Malloc cost = 0.024697
init mmap file cost = 0.040501ms
n_threads = 128
Gather key stats cost = 0.001181
Parallel process file cost = 176.391ms
Aggregate stats cost = 6.1021ms
Output stats cost = 0.615634ms
Runtime inside main = 183.267ms

real    0m0.186s
user    0m0.000s
sys     0m0.003s
a55d0d9d02661c33538f2e11bb86f1825a5f015d6dd3645416ec71bc50099ee5  result.txt

//---------------------
[email protected]:~/1brc-simd$ ./run.sh 64 64
Using 64 threads
PC has 64 physical cores
Malloc cost = 0.027221
init mmap file cost = 0.048513ms
n_threads = 64
Gather key stats cost = 0.001201
Parallel process file cost = 188.083ms
Aggregate stats cost = 3.45419ms
Output stats cost = 0.622074ms
Runtime inside main = 192.324ms

real    0m0.196s
user    0m0.003s
sys     0m0.000s
a55d0d9d02661c33538f2e11bb86f1825a5f015d6dd3645416ec71bc50099ee5  result.txt

//---------------------
[email protected]:~/1brc-simd$ ./run.sh 32 64
Using 32 threads
PC has 64 physical cores
Malloc cost = 0.015834
init mmap file cost = 0.018037ms
n_threads = 32
Gather key stats cost = 0.001022
Parallel process file cost = 313.758ms
Aggregate stats cost = 2.55168ms
Output stats cost = 0.614231ms
Runtime inside main = 317.043ms
real    0m0.319s
user    0m0.002s
sys     0m0.000s
a55d0d9d02661c33538f2e11bb86f1825a5f015d6dd3645416ec71bc50099ee5  result.txt

//---------------------
[email protected]:~/1brc-simd$ ./run.sh 8 64
Using 8 threads
PC has 64 physical cores
Malloc cost = 0.031908
init mmap file cost = 0.055012ms
n_threads = 8
Gather key stats cost = 0.002414
Parallel process file cost = 1159.25ms
Aggregate stats cost = 11.2678ms
Output stats cost = 0.652966ms
Runtime inside main = 1171.4ms
real    0m1.174s
user    0m0.002s
sys     0m0.000s
a55d0d9d02661c33538f2e11bb86f1825a5f015d6dd3645416ec71bc50099ee5  result.txt

Version: 1brc_valid23, 10K dataset. This version shows noticeable improvement (0.319s -> 0.281s)

Using 128 threads
PC has 128 physical cores
Malloc cost = 0.052008
init mmap file cost = 0.083825ms
n_threads = 128
Gather key stats cost = 0.003015
Parallel process file cost = 257.49ms
Aggregate stats cost = 35.0254ms
Output stats cost = 11.2243ms
Runtime inside main = 304.057ms
real    0m0.308s
user    0m0.003s
sys     0m0.000s

//---------------------
Using 64 threads
PC has 64 physical cores
Malloc cost = 0.030827
init mmap file cost = 0.054462ms
n_threads = 64
Gather key stats cost = 0.002964
Parallel process file cost = 240.938ms
Aggregate stats cost = 26.3041ms
Output stats cost = 11.1043ms
Runtime inside main = 278.586ms
real    0m0.281s <=== min
user    0m0.002s
sys     0m0.000s

//---------------------
[email protected]:~/1brc-simd$ ./run.sh 8 64
Using 8 threads
PC has 64 physical cores
Malloc cost = 0.023595
init mmap file cost = 0.04012ms
n_threads = 8
Gather key stats cost = 0.001373
Parallel process file cost = 1441.94ms
Aggregate stats cost = 9.04803ms
Output stats cost = 11.111ms
Runtime inside main = 1462.27ms
real    0m1.466s
user    0m0.001s
sys     0m0.002s
e99d23f6fa210b0d9c43a63e335d8d49f4a247ca7cc237bea0fe4c8b64b1933e  result.txt

At 128 threads, the program finishes the challenge faster than the OS can munmap the input file. munmap alone takes up over 50% of the time, so it cost more than the program's allocation, initialization, processing, output, and freeing resources all combined. So, I use the subprocess trick used by top Java solutions, basically bypassing munmap and memory deallocation. It's possible to do this the correct way by either using fork() instead of threads, or let each thread process different input sizes and munmap at different times.

Running using hyperfine is slower than running the command manually, as the previous munmap hasn't finished.

With this many threads, aggregating the results from all threads take a noticeable amount of time. So we use parallel processing to speed up that too, saving ~28ms at 128 threads. By using 2 layers of parallel aggregation, we save an extra ~2.5ms lol.

Results for Java version are currently outdated so I've removed them. In my repo you can see the old results in benchmark_results/old_post2.txt, and files other_artsi_7502p, other_thomas_7502p, other_royvanrijn_7502p, valid20_7502p, valid20_7502p_10k

Other submission
There's a super fast non-compliant solution at: https://curiouscoding.nl/posts/1brc/
It doesn't work with all inputs like required, but is extremely fast and has very creative ideas.

There's a super fast compliant version using .NET at: https://github.com/noahfalk/1brc/tree/main
I tested it on Dual EPYC 9354 (5995WX PC was not available at the time). This code doesn't use mmap, so its hyperfine result is similar to running manually. See files: other_noahfalk_9354, other_noahfalk_9354_10k, valid23_9354.txt, valid23_9354_10k.txt

Below is a lot of raw data, so here's the summary (unit: second)
Original dataset

Name/Threads	1	8	16	32	64	128
lehuyduc	9.096	1.174	0.593	0.319	0.196	0.184 / 0.145
noahfalk	5.531	0.723	0.388	0.236	0.196	0.242

10K dataset

Name/Threads	1	8	16	32	64	128
lehuyduc	11.440	1.465	0.758	0.423	0.281	0.283
noahfalk	17.674	2.311	1.243	0.784	0.761	1.075

@noahfalk solution use an extreme amount of manual SIMD and ILP tricks, so it performs much better at lower thread counts. I'm not sure why it performs worse at higher thread counts yet. On the 10K dataset for example, it plateaus completely around 32 threads.

Main ideas:

Unsigned int overflow hashing: cheapest hash method possible.
SIMD to find multiple separator ; and loop through them
SIMD hashing
SIMD for string comparison in hash table probing
ILP trick (each thread process 2 different lines at a time)
Parse number as int instead of float
Notice properties of actual data
- 99% of station names has length <= 16, use compiler hint + implement SIMD for this specific case. If length > 16, use a fallback => still meet requirements of MAX_KEY_LENGTH = 100
- -99.9 <= temperature <= 99.9 guaranteed, use special code using this property
Use mmap for fast file reading
Use multithreading for both parsing the file, and aggregating the data
Other random tricks (intentional ordering of variable assignments)
Automatic RAM disk: the contest assumes the file is in RAM, not disk. This is done automatically on Linux if you read the file using mmap once, then run the program again. Without this rule, all solutions will be completely different.

Others
~~There's a potential out-of-range-access exploit in the code. It's left as exercise for the reader.~~

I optimize the code for hyper threading. For example with a 16c32t CPU, if a change improves performance when running 32 threads but slightly decreases performance at 16, I will keep that change. Disabling HT will increase performance at the same thread count (maybe even 20-25%), so there are some situations where it's the correct choice.

BuslikDrev · 2024-01-06T10:07:41Z

BuslikDrev
Jan 6, 2024

Автор челленджа должен был выставить ссылку на базу:
https://drive.google.com/file/d/1HEyNw4M453n0tnuaAm9nwaCiLydQYnpo/view?usp=sharing

Теперь непонятно, все-ли участники сгенерировали одинаковый файл.

1 reply

lehuyduc Jan 6, 2024
Author

The file was generated by ./create_measurement.sh 1000000000, so it should represent the average case

sharpobject · 2024-01-07T19:59:40Z

sharpobject
Jan 7, 2024

You may get a better idea of the minimum time to read the file and do some trivial calculation on it by calculating a checksum of the file rather than by copying data. In problems similar to this one I've found that the cost of writing data to some other buffer is pretty large.

1 reply

lehuyduc Jan 8, 2024
Author

Yeah calculating checksum is a better way to shown max RAM read bandwidth. But since we have to write into different places in RAM (they're mostly in cache though), I chose to compare it with RAM copy instead.

There's probably a better way to compare, but I'm too lazy for now :D

AlexanderYastrebov · 2024-01-08T12:53:01Z

AlexanderYastrebov
Jan 8, 2024

@lehuyduc Great! I assume its faster than @dannyvankooten #46 isn't it?
Did you also try:

integer arithmetic instead of floats (see branch-less integer parseNumber 1BRC in C (< 1.7 seconds) #46 (reply in thread))
pinning threads to cores https://stackoverflow.com/questions/24645880/set-cpu-affinity-when-create-a-thread

5 replies

AlexanderYastrebov Jan 8, 2024

I assume its faster than

I see "Comparison to other submission" in the top post now.

lehuyduc Jan 8, 2024
Author

Some solutions treat the temperature values as integer. This version intentionally uses float because there are separate hardware units for int and float, which practically mean operations on int/float can happen in parallel in a single core. If I use int instead, total runtime is ~20% slower (on earlier 1brc_x.cpp version, untested on latest code)

I think float is just faster in my case, idk about others. About thread/core pinning, I don't have true sudo access on those PC, so it's hard to setup. Maybe next time :D

lehuyduc Jan 9, 2024
Author

@AlexanderYastrebov Huh, in my latest version, using int actually improves performance by quite a lot (like 3%). Thanks for suggesting me to try it again! Finally reached 2 seconds with 8 threads!

AlexanderYastrebov Jan 9, 2024

@lehuyduc Would you also try branchless version? Your's have branches and several multiplications afaict https://godbolt.org/z/4zvejKEhd

lehuyduc Jan 9, 2024
Author

~~Could you send me the link to the branchless version?~~ Remember it now. ~~Another 3-4% speedup!!~~ Thanks again!

Btw, down to 0.158 second now after many small changes :D

~~Soon to be 0.155 thanks to this suggestion. At 1 thread, the difference is a whooping 7% !!~~ Nevermind, I made a typo. It performs around the same, 1% better in best case. That's still great though! I'll add it in next patch, after I've found more improvements.

sharpobject · 2024-01-08T13:13:30Z

sharpobject
Jan 8, 2024

Why not skip munmap by calling _exit()

1 reply

lehuyduc Jan 8, 2024
Author

Even if you exit(1); early, munmap will still be measured by time command.

buybackoff · 2024-01-08T21:55:55Z

buybackoff
Jan 8, 2024

Added this to a stable comparison across languages https://github.com/buybackoff/1brc?tab=readme-ov-file#native.

Maybe I missed something faster, but so far it's the top result.

1 reply

lehuyduc Jan 9, 2024
Author

Thanks! Can you run again but with N_THREADS set to 12? N_THREADS = 128 in my code, but your test PC only has 12 threads

buybackoff · 2024-01-09T00:04:52Z

buybackoff
Jan 9, 2024

I would expect automatic detection. Otherwise every host will have to manually adjust the code.

…

On Tue, 9 Jan 2024, 1:02 am lehuyduc, ***@***.***> wrote: Thanks! Can you run again but with N_THREADS set to 12? N_THREADS = 128 in my code, but your test PC only has 12 threads — Reply to this email directly, view it on GitHub <#138 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADXS7H4ZK2MZDG5Y5T7MYTYNSCKFAVCNFSM6AAAAABBOO22F2VHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4DANJXHAYTO> . You are receiving this because you commented.Message ID: ***@***.***>

1 reply

lehuyduc Jan 9, 2024
Author

I've updated the repo so that run_cpp.sh automatically finds number of threads to use, and compile with that number as constexpr. Could you pull and run it again? Thanks!

merykitty · 2024-01-09T04:18:21Z

merykitty
Jan 9, 2024

Correct me if I'm wrong, this solution assumes the number of keys is not larger than around 17000, which is not really a valid assumption. Fixing it may not affect performance much, though.

2 replies

lehuyduc Jan 9, 2024
Author

The FAQ says the max number of key is 10k, so 16384 is enough for the worse case (just lower performance). I want to focus on the average case.

If we set it to 16384 * 2, then total time is increased by ~0.47ms * N_THREADS, so at 12 threads it'll probably add 5ms.

I changed it from 16384 * 2 to 16384 to chase the lowest possible time on the 128 thread servers.

merykitty Jan 9, 2024

@lehuyduc Yes my bad just noticed that assumption.

RagnarGrootKoerkamp · 2024-01-10T14:23:41Z

RagnarGrootKoerkamp
Jan 10, 2024

Continuing from my blog (https://curiouscoding.nl/posts/1brc/). I ran run_cpp.sh {1,6,12} at max processor speed (4.6GHz, but a bit lower in practice):

no hyperthreading (so only 6 cores/threads available)

Using 1 threads
init mmap file cost = 0.010562ms
Parallel process file cost = 11219.8ms
Aggregate stats cost = 0.113284ms
Output stats cost = 0.481684ms
Runtime inside main = 11220.4ms
Time to munmap = 138.373

real	0m11.360s
user	0m10.910s
sys	0m0.447s

Using 6 threads
init mmap file cost = 0.019057ms
Parallel process file cost = 2066.4ms
Aggregate stats cost = 0.513666ms
Output stats cost = 0.550436ms
Runtime inside main = 2067.48ms
Time to munmap = 141.872

real	0m2.212s
user	0m11.538s
sys	0m0.475s

Using 12 threads
init mmap file cost = 0.02661ms
Parallel process file cost = 1984.15ms
Aggregate stats cost = 1.07135ms
Output stats cost = 0.513624ms
Runtime inside main = 1985.75ms
Time to munmap = 140.398

real	0m2.130s
user	0m11.453s
sys	0m0.490s

with hyperthreading (so 12 threads over 6 cores)

Using 1 threads
init mmap file cost = 0.010041ms
Parallel process file cost = 11469.9ms
Aggregate stats cost = 0.118554ms
Output stats cost = 0.492019ms
Runtime inside main = 11470.5ms
Time to munmap = 198.781

real	0m11.671s
user	0m11.114s
sys	0m0.513s

Using 6 threads
init mmap file cost = 0.018979ms
Parallel process file cost = 2092.99ms
Aggregate stats cost = 0.530193ms
Output stats cost = 0.510248ms
Runtime inside main = 2094.05ms
Time to munmap = 165.116

real	0m2.262s
user	0m11.996s
sys	0m0.536s

Using 12 threads
init mmap file cost = 0.023873ms
Parallel process file cost = 1509.64ms
Aggregate stats cost = 1.15ms
Output stats cost = 0.514586ms
Runtime inside main = 1511.32ms
Time to munmap = 143.84

real	0m1.659s
user	0m17.091s
sys	0m0.646s

So you're still a bit slower than my ~1.55s on my machine. But note that I'm cheating (participating out-of-competition, let's say) by assuming that all city names appear in the first 100k chars, and that lines are at most 33 chars long.

6 replies

RagnarGrootKoerkamp Jan 10, 2024

I suspect hyperthreading details are very processor specific. For my code is was also 10% speedup when enabling hyperthreading.

Hmm yeah not sure why the 7s->9s slowdown with hyperthreading was so big. Might be because the file wasn't in cache yet or something like that. From other things I've seen I'd expect the single threaded version with hyperthreading on to be maybe 5-10% slower at most.

lehuyduc Jan 10, 2024
Author

For my code is was also 10% speedup when enabling hyperthreading.

For my code: on your PC, HT gives ~38% speedup. But on all the other machines I tested, HT gives ~62% speedup!!

So the interesting thing is, why HT speedup your program 10%, but speedup mine 38% or 62%. All my test PCs use AMD, maybe I should test on some intel CPU.

Update: on my AMD cpu, HT gives your program ~43% speedup. Actually it's not true because I'm measuring 16t with HT vs 32t with HT, while the correct comparison is 16t without HT vs 32t with HT. But I can't turn off HT on my test PC. Benchmark result

RagnarGrootKoerkamp Jan 10, 2024

Yeah idk... there's so many thins at play here and I have no idea about HT works on low level. It's probably very dependent on both hardware and the specific code.

lehuyduc Jan 10, 2024
Author

So you're still a bit slower than my ~1.55s on my machine. But note that I'm cheating (participating out-of-competition, let's say) by assuming that all city names appear in the first 100k chars, and that lines are at most 33 chars long.

Just checking, is 1.55s the output of time command (includes munmap and freeing resources) or from inside your program? I think in your blog post, it's equal to runtime inside main ?

Edit: found some bugs and wrong results in the blog post. Also added benchmark for your repo on my PC, but the results are much worse than in your blog. Could you share your measurements.txt? Update: fixed
RagnarGrootKoerkamp/1brc#1

RagnarGrootKoerkamp Jan 10, 2024

Oh right yeah that's inside the program, without munmap time.

MK-2012 · 2024-01-16T21:36:33Z

MK-2012
Jan 16, 2024

Hi. Great code. Small idea for optimization: have you tried replacing -O3 with -Ofast and perhaps playing with some other compiler arguments? Also, perhaps replacing GCC with clang might provide measurable improvement.

2 replies

lehuyduc Jan 17, 2024
Author

I think there's no reason to use -Ofast because there's zero float processing in the code.

If you have clang, please download the repo + dataset and try it :D I wonder what'll be the difference. I can't install clang because it's all borrowed PC lol

MK-2012 Jan 17, 2024

Tested it on my 4C/8T laptop. Changing -O3 to -Ofast improved running times from ~2,42s to ~2,38s. Using clang++ instead of GCC slowed the program down to ~2,63 s.

lehuyduc · 2024-01-17T12:12:05Z

lehuyduc
Jan 17, 2024
Author

Just uploaded the final version of my 1BRC (except if I find bugs). It's ~4x faster on the 10K keys dataset compared to before (see valid20_tr2950x_10k.txt vs valid16_tr2950x_10k.txt).

@buybackoff has a blog that contains a lot of different results here. valid20 should be ~2.5x than the last tested commit. Thanks for spending time testing everyone's code! ~~Might need to change the title though :D~~
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/

0 replies

lehuyduc · 2024-01-18T10:18:29Z

lehuyduc
Jan 18, 2024
Author

Hi, I found a low-overhead physical server with the same CPU (AMD EPYC 7502P) as the official test server. Everyone's results are faster there, and some are massively 78% faster!! I'm not asking to change the test server or anything, I just want to share the results, because it's interesting how benchmark results can be wildly different even on the same hardware when running on different cloud providers.

Tagging relevant people @gunnarmorling @artsiomkorzun @thomaswue @royvanrijn

Edit: it might be because the results on the main page is slightly oudated 😢

11 replies

lehuyduc Jan 22, 2024
Author

@artsiomkorzun So there's a problem with the evaluation method. numactl --physcpubind=0-7 ./run_your_code.sh can cause your code to be ~20% slower in worst case (tested on EPYC 7502), compared to setting number of threads to 8 manually in your code. For my code it's around ~5-10% loss (tested on different AMD PCs). I'm not sure if this is related to AMD's core clusters design or something else (my code on Zen 1 for example can randomly take 50% more time to finish 10K dataset when number of threads >= number of physical cores, but on Zen 2 it's stable).

So at the end of the contest, I think Java solutions can add num_threads as a command line argument, then the judge will use that to test instead of using numactl. But then the code might use num_threads + 1 instead of num_threads threads due to some unintentional reason, I guess we'll worry about that later :D

I compare 2 cases:

num_threads = hardware_concurrency() in code => num_threads = 64 on test PC
num_threads = 8

Commit: f435d64

// 10K DATASET
...
[email protected]:~/1brc$ hyperfine --warmup 1 --runs 10 "./calculate_average_artsiomkorzun.sh 2>&1"
Benchmark 1: ./calculate_average_artsiomkorzun.sh 2>&1
  Time (mean ± σ):     937.7 ms ±   3.7 ms    [User: 43003.2 ms, System: 7718.6 ms]
  Range (min … max):   932.3 ms … 944.4 ms    10 runs

 [email protected]:~/1brc$ numactl --physcpubind=0-7 hyperfine --warmup 1 --runs 10 "./calculate_average_artsiomkorzun.sh 2>&1"
Benchmark 1: ./calculate_average_artsiomkorzun.sh 2>&1
  Time (mean ± σ):      3.143 s ±  0.106 s    [User: 19.932 s, System: 4.894 s]
  Range (min … max):    3.008 s …  3.341 s    10 runs

...
// 10K dataset, set thread to 8 manually in code
[email protected]:~/1brc$ hyperfine --warmup 1 --runs 10 "./calculate_average_artsiomkorzun.sh 2>&1"
Benchmark 1: ./calculate_average_artsiomkorzun.sh 2>&1
  Time (mean ± σ):      2.392 s ±  0.006 s    [User: 17.259 s, System: 1.584 s]
  Range (min … max):    2.382 s …  2.400 s    10 runs
 
[email protected]:~/1brc$ numactl --physcpubind=0-7 hyperfine --warmup 1 --runs 10 "./calculate_average_artsiomkorzun.sh 2>&1"
Benchmark 1: ./calculate_average_artsiomkorzun.sh 2>&1
  Time (mean ± σ):      3.158 s ±  0.124 s    [User: 19.829 s, System: 5.102 s]
  Range (min … max):    3.015 s …  3.368 s    10 runs

https://github.com/lehuyduc/1brc-simd/blob/main/benchmark_results/other_artsi_7502p.txt

artsiomkorzun Jan 22, 2024

@lehuyduc please drop the output of: lscpu -p

lehuyduc Jan 22, 2024
Author

AMD 7502P

[email protected]:~/1brc-simd$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,1
5,5,0,0,,5,5,5,1
6,6,0,0,,6,6,6,1
7,7,0,0,,7,7,7,1
8,8,0,0,,8,8,8,2
9,9,0,0,,9,9,9,2
10,10,0,0,,10,10,10,2
11,11,0,0,,11,11,11,2
12,12,0,0,,12,12,12,3
13,13,0,0,,13,13,13,3
14,14,0,0,,14,14,14,3
15,15,0,0,,15,15,15,3
16,16,0,0,,16,16,16,4
17,17,0,0,,17,17,17,4
18,18,0,0,,18,18,18,4
19,19,0,0,,19,19,19,4
20,20,0,0,,20,20,20,5
21,21,0,0,,21,21,21,5
22,22,0,0,,22,22,22,5
23,23,0,0,,23,23,23,5
24,24,0,0,,24,24,24,6
25,25,0,0,,25,25,25,6
26,26,0,0,,26,26,26,6
27,27,0,0,,27,27,27,6
28,28,0,0,,28,28,28,7
29,29,0,0,,29,29,29,7
30,30,0,0,,30,30,30,7
31,31,0,0,,31,31,31,7
32,0,0,0,,0,0,0,0
33,1,0,0,,1,1,1,0
34,2,0,0,,2,2,2,0
35,3,0,0,,3,3,3,0
36,4,0,0,,4,4,4,1
37,5,0,0,,5,5,5,1
38,6,0,0,,6,6,6,1
39,7,0,0,,7,7,7,1
40,8,0,0,,8,8,8,2
41,9,0,0,,9,9,9,2
42,10,0,0,,10,10,10,2
43,11,0,0,,11,11,11,2
44,12,0,0,,12,12,12,3
45,13,0,0,,13,13,13,3
46,14,0,0,,14,14,14,3
47,15,0,0,,15,15,15,3
48,16,0,0,,16,16,16,4
49,17,0,0,,17,17,17,4
50,18,0,0,,18,18,18,4
51,19,0,0,,19,19,19,4
52,20,0,0,,20,20,20,5
53,21,0,0,,21,21,21,5
54,22,0,0,,22,22,22,5
55,23,0,0,,23,23,23,5
56,24,0,0,,24,24,24,6
57,25,0,0,,25,25,25,6
58,26,0,0,,26,26,26,6
59,27,0,0,,27,27,27,6
60,28,0,0,,28,28,28,7
61,29,0,0,,29,29,29,7
62,30,0,0,,30,30,30,7
63,31,0,0,,31,31,31,7

artsiomkorzun Jan 22, 2024

@lehuyduc can you try different cores, e.g. 8-16 or 16-24?

lehuyduc Jan 22, 2024
Author

Sure! But I'll test those details at the end of the contest. Each time I use the machine I have to setup everything again so I'm lazy T_T

austindonisan · 2024-01-19T01:08:13Z

austindonisan
Jan 19, 2024

Did attempt using SIMD to processes the chunks in parallel within a thread? I'm able able to get a ~50% speedup over this implementation with that strategy.

32 replies

austindonisan Jan 29, 2024

Well I fixed the scaling problem (slightly cheating with orphaned children, reaping them raises it to 82ms):

Using minimal shared memory, recursive fork(), and poll() on child process to merge the results as they come in. Also I learned that testing on multi-socket servers requires CPU pinning and flushing the page cache before changing thread counts to try to get everything is loaded in the right NUMA node. And even then it still is weird sometimes and it's hard to do things repeatably.

But otherwise at 64+ cores the testing becomes way more repeatable since the runs are so short that the CPUs don't thermally throttle. At lower core counts hyperfine runs fast enough that the repeated runs get slower, and physcpubind makes a huge difference to performance if you can't move cores between runs.

So here's an "official" run hyperfine test bound to 8 cores:

lehuyduc Jan 30, 2024
Author

Nice! The new results look to be scaling much better.

Did you get better result with physcpubind ?

austindonisan Jan 30, 2024

I guess I wasn't clear, but physcpubind causes much worse performance when combined with rapid testing (e.g. hyperfine). Without physcpubind and 8 threads it will average sub-600ms, but with physcpubind the first run is sub-600ms and it gradually rises to 660ms

lehuyduc Jan 30, 2024
Author

physcpubind causes much worse performance when combined with rapid testing (e.g. hyperfine)

Yeah that's what I found too, about 10% slower, 20% in the worst case.

charlielye Jan 31, 2024

This is next level stuff. I see me now sinking more time into understanding all this SIMD magic 😅

austindonisan · 2024-01-19T08:33:19Z

austindonisan
Jan 19, 2024

A couple of ideas that have improved performance for me:

Fork child processes instead of using threads. Each process then only mmaps() a portion of the file, and now the munmap() work can actually happen in parallel. Time spent should decrease linearly with cores.

Populate the page cache for the measurements file before parsing. There should be a way to do this efficiently with flags and system calls, but I can't figure it out. But simply having each thread walk the file and touch every page before starting parsing increases your code's performance 3%. Front loading the page faults stops the caches from being polluted with kernel code while parsing.

10 replies

dzaima Jan 19, 2024

Another idea might be mapping some small chunk (a couple megabytes?) at the start, and mapping the next chunk exactly over the previous one via MAP_FIXED, thus effectively unmapping the previous chunk. No clue if that'd actually result in anything better though, haven't yet tried.
edit: this is now in my solution; didn't make any noticable difference on my 2-thread 1.38GB test, but perhaps it does on the full input

austindonisan Jan 19, 2024

mmap() and munmap() hold a process-wide address space lock, which is why you can't effectively use them in parallel with threads. But by forking you get around this limitation.

My fork solution has the parent wait() for all the children, so it's not hiding any munmap() time (although you could).

MAP_POPULATE would be nice, but it unnecessarily holds onto the address-space lock the whole time.

Just add something like this this at the top of handle_line_raw();

long sum = 0;
for (i = chunk_start; i < chunk_end; += 4096) {
  sum += data[i];
}
asm volatile( "" : : [dummy] "r" (dummy));

austindonisan Jan 19, 2024

I thought the subprocess approach would be impossible for the Java challenge because of JVM startup time, but I forgot that about Graal. ASLR could still hurt the caches though, and with hyperthreading you'd want to spawn 2 threads per process and hope they get scheduled to the same CPU.

lehuyduc Jan 20, 2024
Author

mmap() and munmap() hold a process-wide address space lock, which is why you can't effectively use them in parallel with threads. But by forking you get around this limitation.

Huh, pretty neat detail. But if we're allowed to use fork() then we can just create a new process, do everything there, and return as soon as the output is printed. Then munmap time will be hidden completely.

austindonisan Jan 26, 2024

I've realized that while filling the page cache improves the throughput per cycle by 10-20%, that's a bad metric to aim for. I was testing with TurboBoost disabled for repeatability, but that doesn't translate to real life. The correct metric to aim for here is throughput for second, and that's achieved by keeping the clock speed high enough to more than compensate for the wasted work of the context switches and subsequent CPU cache mises.

Rough math estimates ahead that are probably somewhat wrong:

Without the page cache prefilled (and assuming Linux isn't doing any preloading) my code will page fault every ~5500 cycles, or 2.25 microseconds. That's substantially less than time interval that frequency throttling is calculated over (20us - 1ms according to what I've found). These chunks of kernel time are little breaks for the CPU where it stops doing hard math with full pipelines and just chases pointers, and they're spread finely enough that it just looks like a nice smooth power draw. So rather than giving the CPU an easy workload and then a hard workload, it's giving it a consistent workload which is ideal.

This revelation provides a explanation for why mmap()ing the entire file not be the best thing to do (with TurboBoost at least). The munmap() step is basically the reverse of what I found out was bad, which is having a big contiguous chunk of kernel time where the CPU is taking it relatively easy.

charlielye · 2024-01-19T18:26:37Z

charlielye
Jan 19, 2024

I notice I'm behind on recent developments in this discussion, but still, here's another solution that's similar:
#495

3 replies

lehuyduc Jan 19, 2024
Author

Wow that's a lot of AVX512, I don't have any PC with AVX512 so I didn't bother learning T_T But also the contest's evaluation server doesn't have AVX512 (EPYC 7502)

main.cpp:2:10: fatal error: crc32intrin.h: No such file or directory
    2 | #include <crc32intrin.h>

Is this from clang? I get this error with g++

lehuyduc Jan 19, 2024
Author

struct MinMaxAvg {
    std::string_view name;
    uint32_t key;
    int16_t min;
    int16_t max;
    int sum;
    unsigned int count;

    MinMaxAvg() : min(std::numeric_limits<int16_t>::max()), max(std::numeric_limits<int16_t>::min()), sum(0), count(0), key(0) {}

    inline void update(std::string_view const& key_str, int _key, int16_t value) {
        name = key_str;
        key = _key;
        min = std::min(min, value);
        max = std::max(max, value);
        sum += value;
        ++count;
    }
    ... 
};

constexpr size_t HashMapSize = 1024 * 16;
using MapIndex = uint64_t;
using TheMap = MinMaxAvg[HashMapSize];
inline MinMaxAvg* lookup(TheMap& map, MapIndex key) {
    auto lookup_key = key % HashMapSize;
    auto* entry = &map[lookup_key];
    while (nay(entry->key && entry->key != key)) {
      lookup_key = (lookup_key + 1) % HashMapSize;
      entry = &map[lookup_key];
    }
    return entry;
}

lookup(result, key)->update(key_str, key, value);

Maybe I'm missing something, could you explain how your code handle hash collision? Like when 2 station names have the same hash. Thanks! For example, these 2 strings will cause your function to have hash collision:

string 1: 000111100100000000001011110
string 2: 111001110
hash of both:1169095459

~~Also there's a problem with your string_view~~

while (main_loop) {
   ...
  std::string_view key_str(line_start, sc_pos - line_start);
  ...
  lookup(result, key)->update(key_str, key, value);
}

~~Basically you store a string_view in your map, but when the loop goes to the next iteration, that string doesn't exist anymore. Is this intended?~~ I never use string_view so idk. Edit: oh, it's string_view key_str not string key_str, so that isn't a problem.

dzaima Jan 19, 2024

@lehuyduc indeed it works on clang but not gcc; can replace the #include <crc32intrin.h> with #include <immintrin.h> and then it works with g++ too. On hash collision - Changing hash_name's return key; to return key&255; results in wrong results, probably meaning that there isn't (sufficient) collision checking? @charlielye

lehuyduc · 2024-01-23T05:47:06Z

lehuyduc
Jan 23, 2024
Author

noahfalk @noahfalk I've tested your code on a server similar to the official test PC. At this point CPU and dataset difference can be a big factor, so I need your help to test on your dataset/PC too :D

It's definitely my final final version, the parse 4 temperatures at once code looks too long so I'll give up 😭

0 replies

noahfalk · 2024-01-24T07:39:11Z

noahfalk
Jan 24, 2024

Hey @lehuyduc, hope this helps! These numbers are coming from the Hetzer CCX33 I have access to. My data sets were generated on this machine using the create_measurements.sh and create_measurements3.sh scripts. I just used the first one I got and its up to random chance if it winds up being good or bad for either of our hash functions :)

The results for my entry look pretty similar to what I posted on the README at my repo. Your entry looks improved and I assume the difference is that this benchmark run is using a more recent version of your code.

Machine info

root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC-Milan Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4890.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm r
                         ep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervi
                         sor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbas
                         e bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku o
                         spke rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Lehuyduc benchmarking

I compiled your app 3 different times with different thread settings and tested each on my default and 10K data. Let me know if there is some other specific config you are looking for.

root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# git log
commit 995e89ba84feeecdf1084bcde45a484461f25e7c (HEAD -> main, origin/main, origin/HEAD)
Author: lehuyduc <[email protected]>
Date:   Wed Jan 24 12:42:29 2024 +0700

    Add old post

root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# g++ -o main1 main.cpp -O3 -std=c++17 -march=native -m64 -lpthread -DN_THREADS_PARAM=1 -DN_CORES_PARAM=1 -g
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# g++ -o main4 main.cpp -O3 -std=c++17 -march=native -m64 -lpthread -DN_THREADS_PARAM=4 -DN_CORES_PARAM=4 -g
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# g++ -o main8 main.cpp -O3 -std=c++17 -march=native -m64 -lpthread -DN_THREADS_PARAM=8 -DN_CORES_PARAM=4 -g

root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main1 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./main1 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      9.121 s ±  0.059 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    9.052 s …  9.194 s    5 runs
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main4 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./main4 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      2.412 s ±  0.012 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    2.394 s …  2.424 s    5 runs
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main8 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./main8 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      1.768 s ±  0.021 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.734 s …  1.786 s    5 runs
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main1 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./main1 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):     12.746 s ±  0.150 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):   12.637 s … 13.006 s    5 runs
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main4 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./main4 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):      3.374 s ±  0.007 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    3.362 s …  3.378 s    5 runs
root@ubuntu-32gb-hil-1:~/git/lehuyduc_1brc# hyperfine -w 1 -r 5 "./main8 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./main8 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):      3.478 s ±  0.036 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    3.440 s …  3.532 s    5 runs

noahfalk entry

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc# git log
commit 9e22459211889c78ee09a6c513301e6691f2423d (HEAD -> main, origin/main, origin/HEAD)
Author: Noah Falk <[email protected]>
Date:   Mon Jan 22 02:40:07 2024 -0800

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 1 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./1brc --threads 1 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      5.975 s ±  0.058 s    [User: 4.888 s, System: 1.098 s]
  Range (min … max):    5.916 s …  6.066 s    5 runs

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 4 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./1brc --threads 4 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      1.554 s ±  0.009 s    [User: 4.967 s, System: 1.231 s]
  Range (min … max):    1.542 s …  1.563 s    5 runs

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 8 ~/git/1brc_data/measurements.txt"
Benchmark 1: ./1brc --threads 8 ~/git/1brc_data/measurements.txt
  Time (mean ± σ):      1.395 s ±  0.012 s    [User: 9.479 s, System: 1.532 s]
  Range (min … max):    1.375 s …  1.405 s    5 runs

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 1 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./1brc --threads 1 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):     20.211 s ±  0.161 s    [User: 18.770 s, System: 1.451 s]
  Range (min … max):   19.984 s … 20.374 s    5 runs

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 4 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./1brc --threads 4 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):      5.180 s ±  0.037 s    [User: 18.926 s, System: 1.698 s]
  Range (min … max):    5.144 s …  5.243 s    5 runs

root@ubuntu-32gb-hil-1:~/git/noahfalk_1brc/1brc/bin/release/net8.0/linux-x64/publish# hyperfine -w 1 -r 5 "./1brc --threads 8 ~/git/1brc_data/measurements-10K.txt"
Benchmark 1: ./1brc --threads 8 ~/git/1brc_data/measurements-10K.txt
  Time (mean ± σ):      3.744 s ±  0.022 s    [User: 27.428 s, System: 2.088 s]
  Range (min … max):    3.713 s …  3.768 s    5 runs

2 replies

lehuyduc Jan 24, 2024
Author

Thanks for adding more data points! The results look similar to my test on EPYC 7502, actually lot faster even. 8 thread case is slower because it's 4c8t vs 8 physical cores, so no problem.

Glad to see that the result can be replicated across different PCs!

Name/Test	Original, 1 thread	Original, 4 thread	Original, 8c vs 4c8t	10K, 1 thread	10K, 4 thread	10K, 8c vs 4c8t
EPYC 7502	10.720s	2.676s	1.358s	14.095s	3.589s	1.810s
CCX33	9.121	2.412	1.768	12.746	3.374	3.478

charlielye Jan 31, 2024

I've pushed some final work to: #495
It attempts to avoid forks and amortise the cost of unmapping over the runtime (which will be a losing battle against the fork single-run time, but I think will provide better average time when using hyperfine.
I'd be curious to see how the default dataset performs on the hardware you guys have been using.

1BRC 0.145s default dataset, 0.281s 10K dataset, accept all valid inputs (C++, SIMD, hash, tricks) #138

Main ideas:

Replies: 16 comments · 78 replies

lehuyduc Jan 6, 2024 Author

lehuyduc Jan 8, 2024 Author

lehuyduc Jan 8, 2024 Author

lehuyduc Jan 9, 2024 Author

lehuyduc Jan 9, 2024 Author

lehuyduc Jan 8, 2024 Author

lehuyduc Jan 9, 2024 Author

lehuyduc Jan 9, 2024 Author

lehuyduc Jan 9, 2024 Author

lehuyduc Jan 10, 2024 Author

lehuyduc Jan 10, 2024 Author

lehuyduc Jan 17, 2024 Author

lehuyduc Jan 17, 2024 Author

lehuyduc Jan 18, 2024 Author

lehuyduc Jan 22, 2024 Author

lehuyduc Jan 22, 2024 Author

lehuyduc Jan 22, 2024 Author

lehuyduc Jan 30, 2024 Author

lehuyduc Jan 30, 2024 Author

lehuyduc Jan 20, 2024 Author

lehuyduc Jan 19, 2024 Author

lehuyduc Jan 19, 2024 Author

Replies: 16 comments 78 replies

lehuyduc Jan 6, 2024
Author

lehuyduc Jan 8, 2024
Author

lehuyduc Jan 8, 2024
Author

lehuyduc Jan 9, 2024
Author

lehuyduc Jan 9, 2024
Author

lehuyduc Jan 8, 2024
Author

lehuyduc Jan 9, 2024
Author

lehuyduc Jan 9, 2024
Author

lehuyduc Jan 9, 2024
Author

lehuyduc Jan 10, 2024
Author

lehuyduc Jan 10, 2024
Author

lehuyduc Jan 17, 2024
Author

lehuyduc
Jan 17, 2024
Author

lehuyduc
Jan 18, 2024
Author

lehuyduc Jan 22, 2024
Author

lehuyduc Jan 22, 2024
Author

lehuyduc Jan 22, 2024
Author

lehuyduc Jan 30, 2024
Author

lehuyduc Jan 30, 2024
Author

lehuyduc Jan 20, 2024
Author

lehuyduc Jan 19, 2024
Author

lehuyduc Jan 19, 2024
Author