Algorithmic Tweaks, parallel stream, memory mapped file #3
Conversation
Hey, wow, that's awesome, thanks a lot for this submission! I'll merge and evaluate it once I've officially launched and announced this challenge (planned for tomorrow).
Oh, I saw it in my GitHub feed and thought it was already open for submissions. Actually, I misread the deadline and assumed the challenge would end today. I'm sorry for the inconvenience. In that case I will convert the PR to a draft and take another look in January to find further optimizations.
LOL, no worries, it's no inconvenience whatsoever. To the contrary, it's very encouraging :)
+1. You'll have time until Jan 31. Note I'll make one more tweak: I'll also ask submissions to emit the min and max value per station. This is to prevent somebody from cheating by only processing part of the dataset (which should be obvious from looking at the code, but it might also be easy to miss).
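Emitting min and max alongside the mean means each station needs a small accumulator rather than just a running sum. A minimal sketch of what that per-station state could look like (class and method names are illustrative, not taken from the challenge code):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Sketch of the tweaked output: min/mean/max per station.
// Names here are invented for illustration.
public class StationStats {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum;
    long count;

    void add(double value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
        sum += value;
        count++;
    }

    @Override
    public String toString() {
        // min/mean/max with one decimal place
        return String.format(Locale.ROOT, "%.1f/%.1f/%.1f", min, sum / count, max);
    }

    public static void main(String[] args) {
        Map<String, StationStats> stats = new TreeMap<>();
        Stream.of("Hamburg;12.0", "Hamburg;8.0", "Bulawayo;8.9").forEach(line -> {
            int sep = line.indexOf(';');
            stats.computeIfAbsent(line.substring(0, sep), k -> new StationStats())
                 .add(Double.parseDouble(line.substring(sep + 1)));
        });
        System.out.println(stats); // {Bulawayo=8.9/8.9/8.9, Hamburg=8.0/10.0/12.0}
    }
}
```

Tracking min and max is nearly free per row, so it blocks partial-dataset shortcuts without changing the performance profile much.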
I think I'm done for now, lacking more ideas. Might take a look again in 1-2 weeks.
Shamelessly sharing this idea for JVM/GC tuning in another PR/discussion: #15 (comment)
Could you please run
The issue seems to be that you configure the Shenandoah GC. Which JDK distro should this be run on?
I used the latest Temurin distribution that is available in sdkman. |
Still seeing test failures. Can you please also adjust your launch script to set the right JDK? See @royvanrijn's as an example. Thanks.
@gunnarmorling never change a running system... Tests should pass now. |
51.678sec. Thx for being the first participant to this one! |
…me further by 10%. As the JVM exits with the exit(0) syscall, the kernel reclaims the memory mappings via munmap() calls. Prior to this change, all the munmap() calls happened serially at the very end as the JVM exited, accounting for about 350 ms out of 2500 ms after each shard completed its work. We can parallelize this by exposing the Cleaner from MappedByteBuffer and ensuring truly parallel execution of munmap() using a non-blocking lock (SeqLock). The optimal strategy for when each thread should call munmap() is an interesting math problem with an exact solution, and this code roughly reflects it.

Commit gunnarmorling#3: Tried reading a long at a time from the ByteBuffer and checking for the presence of ';'; it was slower than just reading int(). Removed the code for reading longs, retaining only the hasSemicolonByte(..) check.

Commit gunnarmorling#2: Introduce processLineSlow() and processRangeSlow() for the tail part.

Commit gunnarmorling#1: Create a separate tail piece of work so the last few lines are processed separately from the main loop. This allows the main loop to read past its allocated range (by a 'long') if we reserve at least 8 bytes for the tail piece of work.
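For context, a check like the hasSemicolonByte(..) mentioned here is typically the classic SWAR zero-byte trick applied to ';': XOR the word with ';' repeated in every byte, then detect a zero byte. A hedged sketch of that technique (the PR's actual implementation may differ in detail):

```java
// SWAR check for a ';' byte anywhere in an 8-byte word.
// Illustrative sketch; not the PR's exact code.
public class Swar {
    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' in all 8 bytes

    static boolean hasSemicolonByte(long word) {
        long x = word ^ SEMI_PATTERN; // a byte becomes 0x00 exactly where ';' was
        // Classic zero-byte detection: nonzero iff x contains a 0x00 byte.
        return ((x - 0x0101010101010101L) & ~x & 0x8080808080808080L) != 0;
    }

    public static void main(String[] args) {
        long withSemi = 0x3B_61_62_63_64_65_66_67L; // top byte is ';' (0x3B)
        long without  = 0x61_62_63_64_65_66_67_68L; // plain ASCII letters
        System.out.println(hasSemicolonByte(withSemi)); // true
        System.out.println(hasSemicolonByte(without));  // false
    }
}
```

As the commit message notes, on this workload the byte-at-a-time int() path still won; a branchy 8-byte scan is not automatically faster than a tight scalar loop the JIT already vectorizes well.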
…m 16th based on local testing; no Unsafe; no bitwise tricks yet (#465)

* Squashing a bunch of commits together. Commit #2: Uplift of 7% using native byte order from ByteBuffer. Commit #1: Minor changes to formatting.
* Commit #4: Parallelize munmap() and reduce completion time further by 10%. As the JVM exits with the exit(0) syscall, the kernel reclaims the memory mappings via munmap() calls. Prior to this change, all the munmap() calls happened serially at the very end as the JVM exited, accounting for about 350 ms out of 2500 ms after each shard completed its work. We can parallelize this by exposing the Cleaner from MappedByteBuffer and ensuring truly parallel execution of munmap() using a non-blocking lock (SeqLock). The optimal strategy for when each thread should call munmap() is an interesting math problem with an exact solution, and this code roughly reflects it.
* Commit #3: Tried reading a long at a time from the ByteBuffer and checking for the presence of ';'; it was slower than just reading int(). Removed the code for reading longs, retaining only the hasSemicolonByte(..) check.
* Commit #2: Introduce processLineSlow() and processRangeSlow() for the tail part.
* Commit #1: Create a separate tail piece of work so the last few lines are processed separately from the main loop. This allows the main loop to read past its allocated range (by a 'long') if we reserve at least 8 bytes for the tail piece of work.
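The SeqLock mentioned above is a non-blocking coordination primitive: a sequence counter that is odd while a thread is inside the critical section and even when it is free, so a busy thread can skip the section instead of blocking. This is a rough sketch of that shape under my own naming; the submission's actual SeqLock may differ:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative seqlock-style non-blocking lock: counter odd = busy, even = free.
// A guess at the general technique, not the PR's exact class.
public class SeqLock {
    private final AtomicLong seq = new AtomicLong(0);

    /** Runs the critical section if the lock is free; returns false if busy. */
    public boolean tryRun(Runnable criticalSection) {
        long s = seq.get();
        if ((s & 1L) != 0 || !seq.compareAndSet(s, s + 1)) {
            return false; // another thread is inside; caller can do other work
        }
        try {
            criticalSection.run();
        } finally {
            seq.incrementAndGet(); // back to even: released
        }
        return true;
    }

    public static void main(String[] args) {
        SeqLock lock = new SeqLock();
        boolean[] nested = new boolean[1];
        boolean ran = lock.tryRun(() -> nested[0] = lock.tryRun(() -> {}));
        System.out.println(ran);       // true: outer section ran
        System.out.println(nested[0]); // false: lock was busy inside
    }
}
```

The point of a try-style lock here is that a shard finishing early can attempt its unmap immediately and fall back to useful work if another thread holds the lock, keeping all cores busy through the final munmap() phase.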
* Latest snapshot (#1): preparing initial version
* Improved performance to 20 seconds (-9 seconds from the previous version) (#2)
* Improved performance to 14 seconds (-6 seconds) (#3)
* Sync branches (#4): initial commit; some refactoring of methods; some fixes for partitioning; fixed hacky getcode for utf8 bytes; simplified getcode for partitioning; temp solution with syncing; new stream processing; some improvements; cleaned stuff; run configuration; round buffer for the stream to pages; not using compute since it's slower than straightforward get/put; using own byte array equals; using parallel GC; avoid copying bytes when creating a station object; formatting
* Copy fewer arrays. Improved performance to 12.7 seconds (-2 seconds) (#5): same history as above, plus some tuning to increase performance; avoid copying data; fast hashCode with slightly more collisions
* Cleanup (#6): tidy up
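The "own byte array equals" and "fast hashCode with slightly more collisions" items can be pictured as a raw-bytes map key that avoids decoding every station name into a String. A hedged sketch under invented names (not the submission's actual code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: wrap raw UTF-8 station bytes as a map key instead of copying to a
// String. Cheap multiplicative hash tolerates a few more collisions in
// exchange for touching each byte only once. Names are illustrative.
final class ByteKey {
    final byte[] bytes;
    final int hash;

    ByteKey(byte[] bytes) {
        this.bytes = bytes;
        int h = 1;
        for (byte b : bytes) {
            h = 31 * h + b; // single pass, no extra mixing
        }
        this.hash = h;
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        return o instanceof ByteKey other && Arrays.equals(bytes, other.bytes);
    }

    public static void main(String[] args) {
        Map<ByteKey, Integer> counts = new HashMap<>();
        counts.merge(new ByteKey("Hamburg".getBytes()), 1, Integer::sum);
        counts.merge(new ByteKey("Hamburg".getBytes()), 1, Integer::sum);
        System.out.println(counts.size()); // 1: equal bytes map to the same key
    }
}
```

A slightly worse hash is usually a good trade here: with only a few hundred distinct stations, extra collisions cost little, while skipping String decoding on every one of a billion rows saves real time.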
Thank you for the interesting challenge.
I set myself a limit of finishing this evening so as not to invest too much time, and this is what I came up with. Basically just algorithmic improvements on hot code paths and utilizing a parallel stream.
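The parallel-stream shape described above can be sketched with toy in-memory lines instead of the memory-mapped file the submission actually reads (all names here are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch: aggregate per station across chunks in parallel, merging
// with a combiner. The real submission partitions a memory-mapped file;
// this toy version just uses a list of lines.
public class ParallelAggregate {
    public static void main(String[] args) {
        List<String> lines = List.of("Hamburg;12.0", "Bulawayo;8.9", "Hamburg;8.0");
        Map<String, Double> maxPerStation = lines.parallelStream()
                .map(line -> line.split(";"))
                .collect(Collectors.toConcurrentMap(
                        parts -> parts[0],
                        parts -> Double.parseDouble(parts[1]),
                        Math::max)); // merge function: keep the max reading
        System.out.println(maxPerStation.get("Hamburg")); // 12.0
    }
}
```

The merge function makes the reduction order-independent, which is what lets the common pool combine per-thread results safely.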
Current results on my machine (AMD Ryzen 7 PRO 4750G, 16 cores, 48 GB RAM), latest Temurin JDK:
I'm curious what others find.
I thought about caching some parts as it is most likely static data. However, I think this would not be in the spirit of the challenge.