
Algorithmic Tweaks, parallel stream, memory mapped file #3

Merged
merged 21 commits into gunnarmorling:main
Jan 5, 2024

Conversation

twobiers
Contributor

@twobiers twobiers commented Dec 30, 2023

Thank you for the interesting challenge.
I set myself a limit of finishing this evening so as not to invest too much time, and this is what I came up with: basically just algorithmic improvements on hot code paths, plus a parallel stream.

Current results on my machine (AMD Ryzen 7 PRO 4750G 16 core, 48GB RAM) - Latest Temurin JDK:

# Result (m:s.ms) Implementation
1. 00:26.06 CalculateAverage_twobiers.java
2. 02:57.77 CalculateAverage.java (baseline)

I'm curious what others find.

I thought about caching some parts as it is most likely static data. However, I think this would not be in the spirit of the challenge.
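For illustration, here is a minimal sketch of the parallel-stream shape this kind of submission uses (class and method names are hypothetical, not the PR's actual code): lines of `station;value` are split and aggregated concurrently into per-station statistics.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelStreamSketch {
    // Aggregates "station;value" lines into per-station statistics on a parallel stream.
    static Map<String, DoubleSummaryStatistics> aggregate(Stream<String> lines) {
        return lines.parallel()
                .map(line -> line.split(";", 2))
                .collect(Collectors.groupingByConcurrent(
                        parts -> parts[0],
                        Collectors.summarizingDouble(parts -> Double.parseDouble(parts[1]))));
    }

    public static void main(String[] args) {
        var stats = aggregate(Stream.of("Hamburg;12.0", "Hamburg;8.0", "Bulawayo;8.9"));
        DoubleSummaryStatistics h = stats.get("Hamburg");
        // Prints min/mean/max for one station.
        System.out.printf(Locale.ROOT, "Hamburg=%.1f/%.1f/%.1f%n",
                h.getMin(), h.getAverage(), h.getMax());
    }
}
```

`groupingByConcurrent` lets the parallel stream collect into one shared concurrent map instead of merging per-thread maps, which is usually the cheaper choice for this workload.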

@gunnarmorling
Owner

Hey, wow, that's awesome, thanks a lot for this submission! I'll merge and evaluate it once I've officially launched and announced this challenge (planned for tomorrow).

@twobiers
Contributor Author

twobiers commented Dec 31, 2023

Oh, I saw it in my GitHub feed and thought it was already open for submissions. I actually misread the deadline and assumed the challenge would end today. I'm sorry for the inconvenience.

In that case I will convert the PR to a draft and take a look in January again to find further optimizations.

@twobiers twobiers marked this pull request as draft December 31, 2023 11:22
@gunnarmorling
Owner

LOL, no worries, it's no inconvenience whatsoever. On the contrary, it's very encouraging :)

In that case I will convert the PR to a draft and take a look in January again to find further optimizations.

+1. You'll have time until Jan 31. Note that I'll do one more tweak: asking implementations to also emit the min and max value per station. This is to prevent somebody from cheating by only processing part of the dataset (which should be obvious from looking at the code, but might also be easy to miss).
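The min/max requirement can be met with a small per-station accumulator along these lines (a hypothetical sketch, not the challenge's reference code):

```java
import java.util.Locale;

// Hypothetical per-station accumulator for the tweaked output format:
// every station reports min/mean/max, so skipping part of the input shows up.
public class StationStats {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum;
    long count;

    void add(double v) {
        if (v < min) min = v;
        if (v > max) max = v;
        sum += v;
        count++;
    }

    @Override
    public String toString() {
        return String.format(Locale.ROOT, "%.1f/%.1f/%.1f", min, sum / count, max);
    }

    public static void main(String[] args) {
        StationStats s = new StationStats();
        s.add(12.0);
        s.add(8.0);
        s.add(10.0);
        System.out.println(s); // 8.0/10.0/12.0
    }
}
```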

@twobiers twobiers changed the title Add implementation using simple tweaks Algorithmic Tweaks, parallel stream Jan 1, 2024
@twobiers twobiers marked this pull request as ready for review January 2, 2024 20:30
@twobiers
Contributor Author

twobiers commented Jan 2, 2024

I think I'm done for now, lacking more ideas. Might take a look again in 1-2 weeks

@twobiers twobiers changed the title Algorithmic Tweaks, parallel stream Algorithmic Tweaks, parallel stream, memory mapped file Jan 2, 2024
@lobaorn

lobaorn commented Jan 3, 2024

Shamelessly sharing this idea for JVM/GC tuning in another PR/discussion? #15 (comment)

@gunnarmorling
Owner

gunnarmorling commented Jan 5, 2024

Could you please run test.sh twobiers and make sure all the tests pass? Thanks!

@gunnarmorling
Owner

Issue seems to be that you configure Shenandoah GC. Which JDK distro should this be run on?

@twobiers
Contributor Author

twobiers commented Jan 5, 2024

I used the latest Temurin distribution that is available in sdkman.

@gunnarmorling
Owner

Still seeing test failures. Can you also please adjust your launch script to set the right JDK. See @royvanrijn's one as an example. Thanks.

@twobiers
Contributor Author

twobiers commented Jan 5, 2024

@gunnarmorling never change a running system... Tests should pass now.

@gunnarmorling
Owner

51.678sec. Thx for being the first participant to this one!

@gunnarmorling gunnarmorling merged commit d617039 into gunnarmorling:main Jan 5, 2024
vemana added a commit to vemana/1brc that referenced this pull request Jan 17, 2024
…me further by

10%. As the JVM exits with the exit(0) syscall, the kernel reclaims the
memory mappings via munmap() calls. Prior to this change, all the munmap()
calls were happening right at the end as the JVM exited. This led to
serial execution of about 350 ms out of 2500 ms right at the end after
each shard completed its work. We can parallelize it by exposing the
Cleaner from MappedByteBuffer and then ensuring truly parallel
execution of munmap() by using a non-blocking lock (SeqLock). The
optimal strategy for when each thread must call munmap() is an interesting
math problem with an exact solution, and this code roughly reflects it.
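As background for the "exposing the Cleaner" point above: eagerly unmapping a MappedByteBuffer before JVM exit is commonly done through the JDK-internal sun.misc.Unsafe.invokeCleaner (JDK 9+). The sketch below shows that mechanism only; it is an assumption-laden illustration, not the commit's actual code, which also coordinates threads with a SeqLock.

```java
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class EagerUnmap {
    // Reflectively reach sun.misc.Unsafe.invokeCleaner (JDK 9+) to release a
    // mapping immediately, rather than waiting for munmap() at JVM exit.
    static void unmap(MappedByteBuffer buffer) throws ReflectiveOperationException {
        Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
        Field f = unsafeClass.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Object unsafe = f.get(null);
        unsafeClass.getMethod("invokeCleaner", java.nio.ByteBuffer.class)
                .invoke(unsafe, buffer);
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("unmap-demo", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3});
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, 3);
            System.out.println(mapped.get(0)); // 1
            unmap(mapped); // mapping released here; 'mapped' must not be touched afterwards
        }
        Files.delete(tmp);
    }
}
```

Touching the buffer after invokeCleaner is undefined behavior (the pages are gone), which is why each thread may only unmap a buffer it has fully finished with.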

Commit gunnarmorling#3: Tried out reading a long at a time from the ByteBuffer and
checking for the presence of ';'; it was slower compared to just reading int().
Removed the code for reading longs; just retaining the
hasSemicolonByte(..) check code.

Commit gunnarmorling#2: Introduce processLineSlow() and processRangeSlow() for the
tail part.

Commit gunnarmorling#1: Create a separate tail piece of work for the last few lines to be
processed separately from the main loop. This allows the main loop to
read past its allocated range (by a 'long') if we reserve at least 8 bytes
for the tail piece of work.
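A hasSemicolonByte(..) check like the one mentioned above is commonly implemented with the classic SWAR zero-byte test: XOR the word with ';' replicated into every byte, then detect whether any byte became zero. This sketch mirrors the well-known technique, not the commit's exact code.

```java
public class SemicolonScan {
    private static final long SEMIS = 0x3B3B3B3B3B3B3B3BL; // ';' (0x3B) in every byte

    // Classic SWAR "has zero byte" test after XOR-ing with the pattern:
    // returns true iff some byte of 'word' equals ';'.
    static boolean hasSemicolonByte(long word) {
        long x = word ^ SEMIS; // bytes equal to ';' become 0x00
        return ((x - 0x0101010101010101L) & ~x & 0x8080808080808080L) != 0;
    }

    public static void main(String[] args) {
        long noSemi = 0x6162636465666768L;   // bytes of "abcdefgh"
        long withSemi = 0x6162633B65666768L; // bytes of "abc;efgh"
        System.out.println(hasSemicolonByte(noSemi) + " " + hasSemicolonByte(withSemi));
    }
}
```

This tests 8 input bytes per iteration with a handful of ALU operations, which is why reading a long at a time is attractive even when, as the commit found, the surrounding code can still end up slower overall.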
gunnarmorling pushed a commit that referenced this pull request Jan 17, 2024
…m 16th based on local testing; no Unsafe; no bitwise tricks yet (#465)

* Squashing a bunch of commits together.

Commit#2; Uplift of 7% using native byteorder from ByteBuffer.
Commit#1: Minor changes to formatting.

* Commit #4: Parallelize munmap() and reduce completion time further by
10%. As the JVM exits with the exit(0) syscall, the kernel reclaims the
memory mappings via munmap() calls. Prior to this change, all the munmap()
calls were happening right at the end as the JVM exited. This led to
serial execution of about 350 ms out of 2500 ms right at the end after
each shard completed its work. We can parallelize it by exposing the
Cleaner from MappedByteBuffer and then ensuring truly parallel
execution of munmap() by using a non-blocking lock (SeqLock). The
optimal strategy for when each thread must call munmap() is an interesting
math problem with an exact solution, and this code roughly reflects it.

Commit #3: Tried out reading a long at a time from the ByteBuffer and
checking for the presence of ';'; it was slower compared to just reading int().
Removed the code for reading longs; just retaining the
hasSemicolonByte(..) check code.

Commit #2: Introduce processLineSlow() and processRangeSlow() for the
tail part.

Commit #1: Create a separate tail piece of work for the last few lines to be
processed separately from the main loop. This allows the main loop to
read past its allocated range (by a 'long') if we reserve at least 8 bytes
for the tail piece of work.
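The "native byteorder from ByteBuffer" uplift mentioned in this squash is a one-line configuration: ByteBuffer defaults to big-endian, so on little-endian hardware every multi-byte read pays for a byte swap unless the buffer's order is set to the CPU's. A minimal illustration (not the commit's code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NativeOrderDemo {
    public static void main(String[] args) {
        // ByteBuffer defaults to BIG_ENDIAN; matching the CPU's native order
        // lets getLong()/getInt() skip byte swapping on little-endian hardware.
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        buf.putLong(0, 0x0102030405060708L);
        System.out.println(buf.getLong(0) == 0x0102030405060708L); // true
    }
}
```

Values written and read through the same buffer round-trip identically either way; the byte order only matters for raw-byte tricks (like the semicolon scan) and for performance.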
jincongho added a commit to jincongho/1brc-jho that referenced this pull request Jan 18, 2024
gunnarmorling pushed a commit that referenced this pull request Jan 19, 2024
gunnarmorling pushed a commit that referenced this pull request Jan 28, 2024
* Latest snapshot (#1)

preparing initial version

* Improved performance to 20 seconds (-9 seconds from the previous version) (#2)

improved performance a bit

* Improved performance to 14 seconds (-6 seconds) (#3)

improved performance to 14 seconds

* sync branches (#4)

* initial commit

* some refactoring of methods

* some fixes for partitioning

* some fixes for partitioning

* fixed hacky getcode for utf8 bytes

* simplified getcode for partitioning

* temp solution with syncing

* temp solution with syncing

* new stream processing

* new stream processing

* some improvements

* cleaned stuff

* run configuration

* round buffer for the stream to pages

* not using compute since it's slower than straightforward get/put. using own byte array equals.

* using parallel gc

* avoid copying bytes when creating a station object

* formatting

* Copy less arrays. Improved performance to 12.7 seconds (-2 seconds) (#5)

* initial commit

* some refactoring of methods

* some fixes for partitioning

* some fixes for partitioning

* fixed hacky getcode for utf8 bytes

* simplified getcode for partitioning

* temp solution with syncing

* temp solution with syncing

* new stream processing

* new stream processing

* some improvements

* cleaned stuff

* run configuration

* round buffer for the stream to pages

* not using compute since it's slower than straightforward get/put. using own byte array equals.

* using parallel gc

* avoid copying bytes when creating a station object

* formatting

* some tuning to increase performance

* some tuning to increase performance

* avoid copying data; fast hashCode with slightly more collisions

* avoid copying data; fast hashCode with slightly more collisions

* cleanup (#6)

* tidy up
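The "own byte array equals" bullets above refer to keying a hash table by raw station bytes instead of Strings, with a hand-rolled hash and comparison. A sketch of that shape with linear probing (all names hypothetical; the commit's table differs in detail):

```java
import java.util.Arrays;

public class ByteKeyTable {
    // Minimal open-addressing table keyed by byte[] station names, using a
    // hand-rolled hash and Arrays.equals instead of String wrapping (sketch only;
    // no resizing, so it must stay well below 1024 distinct keys).
    private final byte[][] keys = new byte[1024][];
    private final int[] counts = new int[1024];

    static int hash(byte[] b) {
        int h = 1;
        for (byte v : b) h = 31 * h + v;
        return h;
    }

    private int slot(byte[] key) {
        int idx = (hash(key) & 0x7fffffff) % keys.length;
        while (keys[idx] != null && !Arrays.equals(keys[idx], key)) {
            idx = (idx + 1) % keys.length; // linear probing on collision
        }
        return idx;
    }

    void increment(byte[] key) {
        int idx = slot(key);
        if (keys[idx] == null) keys[idx] = key;
        counts[idx]++;
    }

    int count(byte[] key) {
        int idx = slot(key);
        return keys[idx] == null ? 0 : counts[idx];
    }

    public static void main(String[] args) {
        var t = new ByteKeyTable();
        t.increment("Hamburg".getBytes());
        t.increment("Hamburg".getBytes());
        System.out.println(t.count("Hamburg".getBytes())); // 2
    }
}
```

This also reflects the "get/put over compute" bullet: the probe does a plain lookup and a plain store, avoiding the per-call lambda that Map.compute would allocate on the hot path.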