Settings > Performance > Threads: Automatic setting works poorly on CPUs with many threads #7160

Open
chaimav opened this issue Jul 28, 2024 · 21 comments
Labels
scope: performance Performance issues and improvements

Comments

@chaimav

chaimav commented Jul 28, 2024

The default setting of 0 (Automatic) does not perform well on modern Intel CPUs with high thread counts.
Tested on:

Test: Enable Wavelets > Sharp-mask and clarity, then pan while zoomed in. Raising the thread count from 0 (Automatic) to a higher number greatly reduces the processing time after each pan.

Processor: 12th Gen Intel(R) Core™ i7-12700H (20 CPUs), ~2.7GHz
Memory: 32768MB RAM
Card name: NVIDIA GeForce RTX 3070 Ti Laptop GPU (RawTherapee does not take advantage of the GPU…)
SSD: 1 terabyte - NVMe SAMSUNG MZVL21T0HCLR-00BT7

video here: https://discuss.pixls.us/t/how-to-optimize-rawtherapee/44786/27?u=chaimav

Also tested on:
Processor: 13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz
Memory: 16 GB
Card name: None (integrated graphics)
SSD: 1 TB NVMe Micron_2400_MTFDKBA1T0QFM

Can the automatic setting be improved to detect higher processors?

@Lawrence37
Collaborator

I checked the code. The automatic detection works fine, but there are differences in what the code does depending on whether the number of threads is set to 0 or not. Oddly enough, performance is actually worse for CPUs with few cores. Can you enable verbose mode and see what gets printed in the terminal/command prompt? Here's what I see.

Automatic:

Ip Wavelet uses 1 main thread(s) and up to 4 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 4 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

Manual (4 threads, the maximum for my computer):

Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

For me, automatic is about 30% faster.

@Lawrence37 Lawrence37 added the scope: performance Performance issues and improvements label Jul 28, 2024
@chaimav
Author

chaimav commented Jul 28, 2024

> Can you enable verbose mode and see what gets printed in the terminal/command prompt?

How do I do that?

@Lawrence37
Collaborator

To enable verbose mode, find your options file (see https://rawpedia.rawtherapee.com/File_Paths#Config). Make sure RawTherapee is closed, then open the options file in a text editor such as Notepad. Find the line that says Verbose=false and change it to Verbose=true. Save it.

Open the terminal or command prompt. Run RawTherapee from there. You may need to add the -w option as indicated here: https://rawpedia.rawtherapee.com/Command-Line_Options#RawTherapee_GUI
Example: rawtherapee.exe -w
You may first need to navigate to where RawTherapee is installed. For example: cd /D "C:\Program Files\RawTherapee\5.10"
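
For anyone who would rather script the edit than do it by hand, here is a minimal sketch of the same change. It only assumes what is described above: the options file contains a line starting with Verbose=. The function name enable_verbose and the sample text are made up for illustration, and you still have to read and write the real file yourself while RawTherapee is closed.

```python
import re


def enable_verbose(options_text: str) -> str:
    """Return the options text with the Verbose key set to true."""
    # Match a whole-line "Verbose=..." entry; stop before any \r or \n so
    # Windows line endings are preserved and other keys are left untouched.
    new_text, count = re.subn(r"(?m)^Verbose=[^\r\n]*", "Verbose=true", options_text)
    if count == 0:
        raise ValueError("no Verbose= line found in the options text")
    return new_text


# Illustrative sample only, not a complete options file.
sample = "[General]\nTheme=RawTherapee - Legacy\nVerbose=false\nCropsleep=50\n"
print(enable_verbose(sample))
```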

@chaimav
Author

chaimav commented Jul 29, 2024

Is the command terminal supposed to show something as soon as I change the threads option?

@Lawrence37
Collaborator

If it does show something, you can ignore it. We are only interested in what it shows when the preview updates.

@chaimav
Author

chaimav commented Jul 29, 2024

It's not showing anything. Am I doing something wrong?
(I had to use .\rawtherapee.exe -w because rawtherapee.exe -w gave an error)

@Lawrence37
Collaborator

I have encountered this issue before, but I don't remember how to get it to show messages because it's been a while since I debugged in Windows. Maybe @Desmis can tell us how to see the verbose messages.

@Desmis
Collaborator

Desmis commented Jul 31, 2024

@Lawrence37
I am not at all a specialist...

The options file, in C:\Users\jdesm\AppData\Local\RawTherapee5-dev:
[General]
TabbedEditor=true
StoreLastProfile=true
StartupDirectory=last
StartupPath=D:\Coutest
DateFormat=%y-%m-%d
AdjusterMinDelay=100
AdjusterMaxDelay=200
MultiUser=true
Language=English (US)
LanguageAutoDetect=false
Theme=RawTherapee - Legacy
Version=5.10-452-g1a418552a
DarkFramesPath=
FlatFieldsPath=
CameraProfilesPath=
LensProfilesPath=
Verbose=true
Cropsleep=50
Reduchigh=0.84999999999999998
Reduclow=0.84999999999999998
Detectshape=true
Fftwsigma=true

[External Editor]
EditorKind=1
GimpDir=

and afterwards, in the MinGW64 console:
./rawtherapee w

@chaimav
Author

chaimav commented Aug 1, 2024

@Desmis that worked, apparently the dash is what threw it off. I needed to type .\rawtherapee.exe w and not .\rawtherapee.exe -w

Here is the output, I hope it is useful (because I don't really understand it)
Manually set:

Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

Automatic:

Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

@Lawrence37
Collaborator

Interesting. It says it uses one thread when you manually set it, but uses 24 threads when it's automatic. Theoretically, it should be much faster with 24 threads (automatic) which is the opposite of what you observe.
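
To compare the two modes without reading the log by eye, a small sketch like the following pulls the thread counts out of the verbose output. The regular expression is based only on the "Ip Wavelet uses ..." lines posted in this thread; the function name is made up, and treating main x nested as an upper bound on threads is my reading of the message, not something confirmed from the code.

```python
import re

# Pattern taken from the verbose lines quoted in this thread.
PATTERN = re.compile(
    r"Ip Wavelet uses (\d+) main thread\(s\) and up to (\d+) nested thread\(s\)"
)


def wavelet_thread_counts(log_text):
    """Return a list of (main_threads, nested_threads) pairs found in the log."""
    return [(int(m), int(n)) for m, n in PATTERN.findall(log_text)]


log = """Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread
Level decomp L=1
Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
"""

for main, nested in wavelet_thread_counts(log):
    # Assumption: "up to N nested per main" means at most main * nested threads.
    print(f"main={main} nested={nested} at most {main * nested} thread(s)")
```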

@chaimav
Author

chaimav commented Aug 1, 2024

I was puzzled by that as well. I guess it's possible I mixed them up?

@Benitoite
Contributor

Benitoite commented Aug 1, 2024

I am puzzled by both settings only using one main thread and a difference in nested threads. The GUI has an algorithm to calculate an optimum setting, but the efficiency depends on memory and wavelet levels.

We ran a controlled experiment using the -cli on some different CPUs, including @chaimav's.

================================
Available threads = 24  /  CPU = 13th Gen Intel(R) Core(TM) i7-13700  /  2100 MHz  /  Target = Processor: generic x86
27082 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 2
18057 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 4
14663 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 8
14928 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 16
================================

I believe we maxed out the efficiency by offering OMP threads that closely matched the wavelet levels. Moving up to 16 threads brought no further gain: the average run took a few hundred milliseconds longer than with 8, on a pretty long and duplicative routine.
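
For reference, a timing loop of this kind can be sketched as below. This is not the script that produced the numbers above, just an assumed equivalent: it sets OMP_NUM_THREADS in the child environment, runs a command several times, and averages the wall time. The function name average_runtime_ms is hypothetical, and the placeholder command does nothing; substitute your own command-line invocation.

```python
import os
import subprocess
import sys
import time


def average_runtime_ms(cmd, threads, runs=5):
    """Run cmd `runs` times with OMP_NUM_THREADS=threads; return mean wall time in ms."""
    # Copy the current environment and override only the OpenMP thread count.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)  # raises if the command fails
        total += time.perf_counter() - start
    return 1000.0 * total / runs


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; replace it with the
    # real processing command you want to measure.
    cmd = [sys.executable, "-c", "pass"]
    for n in (2, 4, 8, 16):
        ms = average_runtime_ms(cmd, n)
        print(f"{ms:.0f} total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = {n}")
```

Averaging several runs smooths out disk caching and background load, which matters when the differences between thread counts are only a few percent.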

A similar data point measured by @SilvioGrosso shows a similar optimization around 8 threads:

================================
Available threads = 20 / CPU = 12th Gen Intel(R) Core(TM) i7-12700H /  2700 MHz / Target = Processor: generic x86
48426 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 2
28899 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 4
23573 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 8
25115 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 16
================================

Here, the increase from 8 to 16 threads made the run about 6.5% slower.
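
The relative change between neighbouring thread counts can be recomputed from the two tables above with a few lines. The timings are copied from this comment; the dictionary layout and the function name step_changes are just for illustration.

```python
# Average wall times in milliseconds, copied from the two tables above.
timings = {
    "i7-13700 (24 threads)": {2: 27082, 4: 18057, 8: 14663, 16: 14928},
    "i7-12700H (20 threads)": {2: 48426, 4: 28899, 8: 23573, 16: 25115},
}


def step_changes(times):
    """Percent change in run time between consecutive thread counts (negative = faster)."""
    counts = sorted(times)
    return {
        (a, b): 100.0 * (times[b] - times[a]) / times[a]
        for a, b in zip(counts, counts[1:])
    }


for cpu, times in timings.items():
    print(cpu)
    for (a, b), pct in step_changes(times).items():
        print(f"  {a:>2} -> {b:>2} threads: {pct:+.1f}% run time")
```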

@Lawrence37
Collaborator

It's probably not mixed up. My results show the same behavior. The two specific things I find interesting are (1) the use of 24 threads when the number of cores you have is 20 (the code puts a limit equal to the number of cores) and (2) why the thread count is still 1 after manually setting the threads so high (the threads used is calculated with a formula that should result in a number greater than 1).

@Benitoite
Contributor

@Lawrence37
@chaimav's machine, the 24-thread CPU, has 8 hyperthreaded Performance cores and 8 Efficiency cores, for a total of 16 + 8 = 24 threads.

@SilvioGrosso's computer has 20 threads (6 P-cores, 8 E-cores).
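
The arithmetic for these hybrid CPUs, as a quick sketch: each P-core contributes two hardware threads and each E-core one. The function name is made up for illustration. Note that Python's os.cpu_count() reports logical CPUs, which is why a tool can see 24 or 20 "CPUs" on these machines rather than the physical core count.

```python
import os


def logical_threads(p_cores, e_cores, threads_per_p_core=2):
    """Logical CPU count for a hybrid CPU whose E-cores run one thread each."""
    return p_cores * threads_per_p_core + e_cores


print(logical_threads(8, 8))  # i7-13700: 8 P-cores + 8 E-cores
print(logical_threads(6, 8))  # i7-12700H: 6 P-cores + 8 E-cores

# Logical CPUs visible on the machine running this sketch (may be None).
print(os.cpu_count())
```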

@Lawrence37
Collaborator

Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.

@Benitoite
Contributor

> Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.

I think @chaimav might have the two systems, but has only provided data from the 24-thread one so far.

@chaimav
Author

chaimav commented Aug 4, 2024

> Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.

> I think @chaimav might have the two systems, but has only provided data from the 24-thread one so far.

Correct, I have only benchmarked one computer, an i7 13700 (full specs here https://www.amazon.com/dp/B0CFBDRMXT ). My previous computer was an i3 8100, which rotated to my wife. If it will be of value, I can run the scripts on it as well.

@Lawrence37
Collaborator

@chaimav I created a branch which respects the number of threads set in preferences when using wavelets. I'm interested in knowing what the performance is like for different manually-set values. Executables will be available for download at the bottom of this page in a few minutes: https://github.com/Beep6581/RawTherapee/actions/runs/10238785142

@chaimav
Author

chaimav commented Aug 5, 2024

@Lawrence37 I just tested RawTherapee_wavelet-thread-num_5.10-383-gf9bcf594b_win64_release with different numbers set for threads and found no discernible difference in processing time after scrolling**. Performance was similar to 0 (Automatic) on the standard dev build.

**Tested with a stopwatch, so some error is to be expected, but on the regular dev build, nonzero numbers show a noticeable improvement.

@Lawrence37
Collaborator

It's also slow with 1 thread? I expected it to have the same performance as dev with manual threads since both use 1 thread.

@chaimav
Author

chaimav commented Aug 6, 2024

I didn't try with 1, but I tried other numbers like 8 and 24. On the dev version, those are noticeably faster than 0. On this version I saw no difference between 0 and those numbers. I can try 1 when I get home tonight.
