Doesn't scale properly with the number of threads used. #12
Comments
The threads need to synchronize to do file IO, so that becomes a bottleneck as the number of threads increases. On my machine with 8 cores, I found the optimal setting to be 8 threads. Thanks for running the benchmark though, it's very informative.
@camel-cdr
@Daniel-Liu-c0deb0t I didn't dig into the code, but technically, if the length of the data does not change (i.e. because you never turn a multi-byte char into a single-byte char or vice versa), you can pre-allocate the output file (i.e. seek(length, SEEK_SET)) and then have each thread use its own file pointer, writing only its portion of the data. That way you wouldn't need locking.
Also, if you are not doing it already, CPU pinning helps a lot with the SSE4.1 instructions because you will not end up asking the same units to process the data. When you do the pinning you will need to identify the CPU id via something like ...
Swapping data out from L1/L2/L3 during the data processing is fine; what is very bad is going out to main memory, so it would be worth processing chunks of an appropriate length (you can fetch the CPU cache sizes dynamically and calculate the chunk size at runtime).
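The lock-free write scheme suggested in this comment could be sketched as follows. This is only an illustration, not the project's code: it assumes the transform preserves byte length, and `transform`, `pwrite_all` and `parallel_rewrite` are hypothetical names. It relies on the POSIX-only `os.pread`/`os.pwrite` calls, and Python threads will not show real CPU scaling here; the point is only the offset arithmetic.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def transform(chunk: bytes) -> bytes:
    # Hypothetical stand-in: uppercases ASCII bytes only, so the length
    # never changes and multi-byte UTF-8 sequences are left untouched.
    return chunk.upper()


def pwrite_all(fd: int, data: bytes, offset: int) -> None:
    # os.pwrite may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def parallel_rewrite(src_path, dst_path, num_threads=4, chunk_size=64 * 1024):
    size = os.path.getsize(src_path)
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Pre-size the output so every chunk has a fixed destination.
        os.ftruncate(dst_fd, size)

        def work(offset: int) -> None:
            # Positional I/O: no shared file cursor, hence no lock.
            data = os.pread(src_fd, chunk_size, offset)
            out = transform(data)
            if len(out) != len(data):
                raise ValueError("transform changed the chunk length")
            pwrite_all(dst_fd, out, offset)

        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # list() forces completion and re-raises worker exceptions.
            list(pool.map(work, range(0, size, chunk_size)))
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

Each worker touches a disjoint byte range, which is why no synchronization is needed. The scheme stops working as soon as any chunk can grow or shrink, because later offsets would then depend on earlier results.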
Unfortunately, pre-allocating an output file isn't possible because stuff like adding emojis can change the length of the data.
CPU pinning is a good idea. I'm not doing it right now (the user passes in the number of threads), but it could be added. Though in theory the OS should be able to spread threads evenly between cores if you pass in a reasonable thread count.
I do try to keep stuff in cache. Each thread has its own buffer that should fit in L2 cache. It's a fixed size of 64 KB right now, as I did not implement checking CPU cache sizes.
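For reference, the two ideas discussed here (pinning each worker to a core and sizing buffers from the detected L2 cache) might look roughly like this on Linux. It is a sketch under assumptions, not what the project does: `parse_size`, `l2_cache_bytes` and `run_pinned` are hypothetical helpers, the sysfs cache directory is Linux-specific, and the code falls back to the fixed 64 KB mentioned above (or to no pinning) when the information is unavailable.

```python
import os
import threading

DEFAULT_BUF = 64 * 1024  # fallback: the fixed buffer size mentioned above


def parse_size(text):
    """Parse a sysfs cache size such as '512K' or '8M'; None if unparsable."""
    units = {"K": 1024, "M": 1024 ** 2}
    text = text.strip()
    if not text:
        return None
    try:
        if text[-1] in units:
            return int(text[:-1]) * units[text[-1]]
        return int(text)
    except ValueError:
        return None


def l2_cache_bytes(cpu=0, default=DEFAULT_BUF):
    """Best-effort L2 size from Linux sysfs, else the default."""
    base = f"/sys/devices/system/cpu/cpu{cpu}/cache"
    try:
        entries = sorted(os.listdir(base))
    except OSError:
        return default
    for name in entries:
        if not name.startswith("index"):
            continue
        try:
            with open(os.path.join(base, name, "level")) as f:
                if f.read().strip() != "2":
                    continue
            with open(os.path.join(base, name, "size")) as f:
                size = parse_size(f.read())
        except OSError:
            continue
        if size:
            return size
    return default


def run_pinned(worker, num_threads):
    """Run worker(index, cpu) in threads, each pinned to one allowed CPU."""
    has_affinity = hasattr(os, "sched_setaffinity")
    cpus = sorted(os.sched_getaffinity(0)) if has_affinity else []
    results = [None] * num_threads

    def body(i):
        cpu = None
        if cpus:
            cpu = cpus[i % len(cpus)]
            try:
                # On Linux, pid 0 refers to the calling thread.
                os.sched_setaffinity(0, {cpu})
            except OSError:
                cpu = None  # pinning refused; run unpinned
        results[i] = worker(i, cpu)

    threads = [threading.Thread(target=body, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A worker would typically allocate a buffer of at most `l2_cache_bytes()` bytes, possibly smaller if two hardware threads share one L2. Whether pinning actually beats the OS scheduler is something only a benchmark on the target machine can answer.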
I ran some benchmarks and noticed that the throughput reaches an optimum with 4 threads and starts decreasing afterwards:
(Ran on an AMD Ryzen 5 1600X Six-Core Processor CPU)