Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge performance regression in 2.0.17 #13899

Open
miwelc opened this issue Apr 15, 2021 · 40 comments
Open

Huge performance regression in 2.0.17 #13899

miwelc opened this issue Apr 15, 2021 · 40 comments

Comments

@miwelc
Copy link
Contributor

miwelc commented Apr 15, 2021

We've experienced a drastic performance regression (between 3x - 8x) in our app by updating from 2.0.16 to 2.017.
On the other hand, size has decreased 14%.
I suspect it's due to LLVM's new pass manager. Is there any flag we can use to select the legacy LLVM pass manager? This new behavior is preventing us from updating as performance is mission critical for our use case and we don't really care about size.
Thanks.

@sbc100
Copy link
Collaborator

sbc100 commented Apr 15, 2021

I think you should be able to just pass -flegacy-pass-manager to the compiler. We should probably have included that in the ChangeLog.

@miwelc
Copy link
Contributor Author

miwelc commented Apr 15, 2021

Would this flag affect system libraries compilation (libc, libc++...)?

@miwelc
Copy link
Contributor Author

miwelc commented Apr 15, 2021

Passing -flegacy-pass-manager at compilation and --lto-legacy-pass-manager at link time doesn't change much.

@sbc100
Copy link
Collaborator

sbc100 commented Apr 15, 2021

Ah yes, the system libraries won't effected by -flegacy-pass-manager on the command line. To fully revert you would need to essentially revert #13818 (at least the emcc.py and building.py parts) then run ./embuilder --clear-cache, then rebuild your project.

@miwelc
Copy link
Contributor Author

miwelc commented Apr 15, 2021

I should have said I was testing on Safari 14.0.3. On Chrome 90 performance doesn't seem to have any regression, regardless of flags.

After further testing on Safari:

  • No flags: very bad performance, as in 8x worse.
  • -flegacy-pass-manager and --lto-legacy-pass-manager: same very bad performance
  • -flegacy-pass-manager, --lto-legacy-pass-manager and -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1: Much better performance but not as good as with 2.0.16
  • Only -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1: Same performance as with 2.0.16 😄.

Regardless of flags: ~14% size reduction. Mostly from system libraries?

Seems like the new LLVM pass manager combined with the old Binaryen inlining behavior gives the better performance for our use case on both browsers.

Anyway, I think it would be nice when there are changes to defaults if changelog included the flags needed to revert to old behavior 😅.

@sbc100
Copy link
Collaborator

sbc100 commented Apr 15, 2021

Thats somewhat counter intuitive.. it sounds like the only way to match the old performance is to use the new pass manager.

I guess we may want to revisit the binaryen inlining change and/or file a bug with apple about that.

@miwelc
Copy link
Contributor Author

miwelc commented Apr 15, 2021

I know... I don't get it but I've run each (flag combination x version) multiple times and those are the results I get...
Anyway, I'm going to close this now. Thanks!

@kripken
Copy link
Member

kripken commented Apr 15, 2021

Good point about the changelog, thanks, I'll open a PR to update it for both those changes.

@kripken
Copy link
Member

kripken commented Apr 15, 2021

As for the perf issue, perhaps more inlining ends up hitting a worse case with Safari register allocation or something like that? If you can provide a testcase that could be interesting to look at. Otherwise let's see if there are more reports of issues.

@mithrendal
Copy link

I see performance is dropping when going from 2.0.16 to 2.0.17. it drops from 50fps to 8fps. Strange thing is that it drops only when using Firefox. Chrome and Safari are doing fine on 2.0.17 builds.

code is at https://github.com/dirkwhoffmann/virtualc64web

@sbc100
Copy link
Collaborator

sbc100 commented Apr 19, 2021

Did you try with -s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1.. if that solution doesn't work then likely you are no hitting the same issue.

@mithrendal
Copy link

mithrendal commented Apr 19, 2021

thank you @sbc100 that setting brings back the performance of vc64web build with 2.0.17 on firefox to 50fps. Strangely in opposite to @miwelc 's project it does affect only firefox here. Safari runs fine without the setting whereas his project showed the slowdown only on Safari 🤔...

@sbc100
Copy link
Collaborator

sbc100 commented Apr 19, 2021

@kripken I wonder if we can find a reasonable a value to set --one-caller-inline-max-function-size to get most of the benefits without getting these negative side effects on certain engines?

@mithrendal
Copy link

Just tell me if you would like me to retest with another value than 1. 🤓

@kripken
Copy link
Member

kripken commented Apr 19, 2021

@mithrendal @miwelc

It would be good if both of you can test with more values of --one-caller-inline-max-function-size, yes. Perhaps you can bisect to find the specific value where change happens, then we can consider what to do with the default based on that.

Also, these are specific browser limitations or bugs that we need to investigate and file. Can you please create testcases showing the issue, that is, a build with the 'bad' and the 'good' settings? Then we can file bugs for them to look into. (I can help with the filing part.) Without filing bugs for them things may not improve.

@mithrendal
Copy link

mithrendal commented Apr 20, 2021

ok first some testing for vc64web ...

still fast values=1,1024,2048,4096,16384,18k,19k,19250,19280,19296,19304,19306
bad values = 1048576, 65535, 32768,20480,20k,19500,19375,19312,19308,19307

@mithrendal
Copy link

mithrendal commented Apr 20, 2021

machine is a mbp 2007 with macos10.11 and firefox 78.10.0esr (64-Bit)

logoutput for --one-caller-inline-max-function-size=19307

time[ms]=1520, audio_samples=65536, frames [executed=12, rendered=3] avg_fps=3
lost sync target=39, total_executed=6
lost sync target=38, total_executed=6
time[ms]=1565, audio_samples=45056, frames [executed=12, rendered=4] avg_fps=3
lost sync target=37, total_executed=6
lost sync target=37, total_executed=6
time[ms]=1509, audio_samples=90112, frames [executed=12, rendered=4] avg_fps=3

logoutput for --one-caller-inline-max-function-size=19306

time[ms]=1010, audio_samples=45056, frames [executed=51, rendered=51] avg_fps=45
time[ms]=1015, audio_samples=45056, frames [executed=50, rendered=51] avg_fps=45
time[ms]=1002, audio_samples=43008, frames [executed=51, rendered=46] avg_fps=45

any thoughts?

is it serious? or is it high enough to ignore it ? BTW what is the meaning of the number ? Is the unit of the number bytes or instructions ?

should I still provide the two wasm files?

@kripken
Copy link
Member

kripken commented Apr 20, 2021

@mithrendal

Thanks!

Yes, the wasm files would be very helpful. I'd like to see the change between them. And we can file a bug on the relevant browser with them.

The meaning of the number is the number of IR nodes in binaryen, which is almost exactly the number of instructions in wasm.

That the number is 19306/7 is interesting - this is not a small function that is being inlined. The next question is into what does it get inlined (I can see that with the wasm files).

@mithrendal
Copy link

mithrendal commented Apr 20, 2021

@kripken I made the two builds and put them into the zip file ... is that maybe already enough and you are looking only for those .. or do you need the other js files which are needed to run the whole thing too?

wasm_builds.zip

see the whole thing running (currently with a 2.0.16 build) here https://dirkwhoffmann.github.io/virtualc64web/

the source files (and also the other js files which are needed to run it) are here https://github.com/dirkwhoffmann/virtualc64web

@kripken
Copy link
Member

kripken commented Apr 22, 2021

Thanks @mithrendal !

Yes, the other JS files will be needed for filing a browser bug. What are the steps to plug in the js/wasm combos from the archive, to get it running with that repo?

Meanwhile, inspecting the wasm files, it looks like the difference is func $215 is inlined into func $327. The inlined function is indeed of size 19307 as expected. Raw data before:

func: 215
 [binary-bytes] : 44778   
 [total]        : 19307   
 [vars]         : 6       
 Binary         : 4441    
 Block          : 265     
 Break          : 415     
 Call           : 144     
 Const          : 3493    
 Drop:          : 0
 GlobalGet      : 1       
 GlobalSet      : 2       
 If             : 36      
 Load           : 2601    
 LocalGet       : 4599    
 LocalSet       : 1075    
 Loop           : 0
 Select         : 288     
 Store          : 1280    
 Switch         : 1       
 Unary          : 665     
 Unreachable    : 1       

func: 327
 [binary-bytes] : 1306    
 [total]        : 558     
 [vars]         : 13      
 Binary         : 101     
 Block          : 29      
 Break          : 25      
 Call           : 14      
 Const          : 100     
 Drop           : 4       
 If             : 11      
 Load           : 56      
 LocalGet       : 137     
 LocalSet       : 36      
 Loop           : 1       
 Select         : 1       
 Store          : 30      
 Switch         : 1       
 Unary          : 10      
 Unreachable    : 2       

and after:

func: 326
 [binary-bytes] : 46075   
 [total]        : 19863   
 [vars]         : 13      
 Binary         : 4542    
 Block          : 293     
 Break          : 440     
 Call           : 157     
 Const          : 3593    
 Drop           : 4       
 GlobalGet      : 1       
 GlobalSet      : 2       
 If             : 47      
 Load           : 2657    
 LocalGet       : 4735    
 LocalSet       : 1112    
 Loop           : 1       
 Select         : 289     
 Store          : 1310    
 Switch         : 2       
 Unary          : 675     
 Unreachable    : 3       

(the combined new function is index 326 since 215 was removed, and then all subsequent indexes decreased by 1.

I don't think there is much we can do on the tools side since this looks like something we should inline:

  • Inlining reduces total code size.
  • The inlined call was in a loop - those are often good to inline for speed.
  • The inlining does not increase the total vars in the target (still 13; so there is no obvious signal of register pressure).

@kripken
Copy link
Member

kripken commented Apr 22, 2021

@lars-t-hansen

To summarize the last few comments in this issue, we have a testcase that dropped from 50fps to 8fps on Firefox, after binaryen started to do more inlining. There is no change on Chrome or Safari, so this is Firefox specific.

Investigating the wasm file, I don't see a reason not to perform the inlining (see last comment) - it looks like something the toolchain should be doing as best I can tell (that is, the new inlining heuristics look broadly better, and there is no obvious tweak to the heuristics to avoid this particular testcase).

Perhaps this is hitting a bad case in the Firefox register allocator, or some internal limit on the optimizing compiler, or tiering? We can file a bug for further investigation if that would be useful (still need some details from @mithrendal for how to run the code).

@lars-t-hansen
Copy link

@kripken, thanks, I'll investigate next week. I assume this is x64?

@mithrendal
Copy link

mithrendal commented Apr 23, 2021

as vc64web will soon move to its new home address vc64web.github.io (we wait first to get the CORS related http-response headers set at related cross sites from which vc64web currently can fetch its retro content from the demo scene) I took this opportunity to publish the latest copy of it into it, (effectively use the new address for this issue) that is the build emsdk_2.0.17 with the problematic parameter --one-caller-inline-max-function-size=19307 the same that you analysed

so you will currently get that 2.0.17 version up and running here at its future address http://vc64web.github.io

(the current official address with the 2.0.16 build is still https://dirkwhoffmann.github.io/virtualc64web/)

when you start it, it asks you for ROM-chips. You may just hit install open-roms or press ESC now. In both case it will install three system roms from the mega65-open-roms project. But if you only! do this the slowdown will not yet! occur. This is because the open-roms do not include the romchip of the Commodore 1541 floppy drive and I found out that on the C++ side the slowdown occurs only! when emulating the 1541 floppy drive.

so the long function with 19307 WASM instructions seems to be this one in

/Emulator/C64.cpp
line 1092
if (drive8.isActive()) drive8.execute(durationOfOneCycle);

see link to code here
https://github.com/dirkwhoffmann/virtualc64web/blob/e913978fb06534843e48b9d05a8e06b805326764/Emulator/C64.cpp#L1092

@dirkwhoffmann can you confirm ? @lars-t-hansen had the question whether the code is x64 ? Can you say something on that matter too?

to get complete build (including the js-files) just head to https://github.com/vc64web/vc64web.github.io and click green button 'code' -> Download ZIP

to get the floppy drive8 activated you will have to click the missing 1541-rom-chip icon here.

Ohne Titel

try this file https://sourceforge.net/p/vice-emu/code/HEAD/tree/trunk/vice/data/DRIVES/d1541II?format=raw

as soon you loaded the missing floppy drive rom, the emulator will activate the emulation code for the 1541 floppy drive and begin to slow down dramatically when using firefox and when being compiled with --one-caller-inline-max-function-size=19307 or when the compiler option `--one-caller-inline-max-function-size' is not set at all

you can spot the perfstats in the web console they will be outputted every second. (there is also a live-debug-console which you can enable in the settings of vc64web)

if there is still need to build and compile then you will find the source code currently here https://github.com/dirkwhoffmann/virtualc64web
and the how-to-make instruction here https://github.com/dirkwhoffmann/virtualc64web/wiki/build_and_run

@kripken
Copy link
Member

kripken commented Apr 23, 2021

lars-t-hansen had the question whether the code is x64 ?

I think the question was where you saw the regression - is your machine x64 (as opposed to 32-bit, or ARM)? Info about the browser, CPU, and OS can be helpful in figuring out why it became slower on a particular combination.

@dirkwhoffmann
Copy link

I think the question was where you saw the regression - is your machine x64

Mithrendal was using Firefox 78.10.0esr (64-Bit) on a MacBook Pro from 2007 (Intel core2duo 2.2 GHz) with macOS 10.11 El Capitan.

@mithrendal
Copy link

today I changed the current app at
https://dirkwhoffmann.github.io/virtualc64web/

to a 2.0.17 --one-caller-inline-max-function-size=19306 build

the firefox problematic 2.0.17 / 19307 build is still at
https://vc64web.github.io

for building I used an intel macmini on bigsur latest patch level with emsdk 2.0.17.

test running was at a slow 2007 mbp machine as @dirkwhoffmann already descibed

@lars-t-hansen
Copy link

I have hardware like that but the latest Mac OS I have for it is 10.6.8, on which esr78 is not supported. Updating to 10.11 is a little bit of a project, as it's not exclusively my computer.

@mithrendal, I'm wondering, can you rerun the experiment but first go into about:config and set javascript.options.wasm_baselinejit (it may be called something slightly different than that but it should be obvious) to false? The most obvious explanation for the problem is that you're being bitten by tiering / being stuck in slow code, and it would be good to rule that out. (Load times may increase.)

@mithrendal
Copy link

mithrendal commented Apr 28, 2021

@lars-t-hansen I did set the baslinejit to false and it changed not anything, I spotted between 2 and 4 FPS on the old machine with that setting...

then I did a test on a newer machine a intel mac mini 2014 2.6 Ghz Big Sur 11.2.3 Firefox 88 64Bit:
with that newer machine

I got full 60 FPS when running the emsdk 2.0.17 build with --one-caller-inline-max-function-size=19306 and below
https://dirkwhoffmann.github.io/virtualc64web/

and around 12 FPS when running with--one-caller-inline-max-function-size=19307and above ...
https://vc64web.github.io/

@lars-t-hansen
Copy link

Thanks. It'll be easier to test on the newer hardware with newer Firefox, I'll try to repro.

@lars-t-hansen
Copy link

OK, I can repro on my current Linux desktop system, life is good. https://bugzilla.mozilla.org/show_bug.cgi?id=1708381.

@mithrendal
Copy link

@lars-t-hansen vc64web now finally moved to its new home address http://vc64web.github.io , which you used as a test case for the aggressive inline problem on firefox bug 1708381

that means the address serves no longer the build with the problematic inlining for firefox. As far as I understood from the bugzilla ticket you and your colleges have already isolated a small testcase which shows the same problem right? Anyway, if you still need that problematic build I can make one for you on another address. Good luck !

@lars-t-hansen
Copy link

@mithrendal, we do indeed have a separate test case for this, thanks for letting us know. The bug referenced above should have all the details.

@julian-seward1
Copy link

julian-seward1 commented Jun 2, 2021

@mithrendal I've been working on a fix for this (https://bugzilla.mozilla.org/show_bug.cgi?id=1712078). I'd like to verify that it really fixes the performance problem on the problematic build. So .. could you please make it available on another URL, per your comment above?

@mithrendal
Copy link

mithrendal commented Jun 2, 2021

Hi @julian-seward1

sure 🤓


build with maximal inlining (parameter --one-caller-inline-max-function-size is omitted)

published to
https://vc64web.github.io/devbuild/


build with limited inlining (--one-caller-inline-max-function-size=19307)

published to
https://vc64web.github.io

@julian-seward1
Copy link

@mithrendal Thanks. With the patch at https://bugzilla.mozilla.org/show_bug.cgi?id=1712078#c9 I got the following results:

Limited inlining, with patch 56 fps
Max inlining, with patch 56 fps
Max inlining, without patch 12 fps

In other words .. the patch fixes this problem.

kripken added a commit to WebAssembly/binaryen that referenced this issue Jul 1, 2021
We have seen some cases in both Chrome and Firefox where
extremely large modules cause overhead,

#3730 (comment) (and link therein)
emscripten-core/emscripten#13899 (comment)

There is no "right" value to use as a limit here, but pick an
arbitrary one that is very high. (This setting is verified to have
no effect on the emscripten benchmark suite.)
@andrewevstyukhin
Copy link
Contributor

Hi,
after I upgraded to emscripten 2.0.17, browser started to crash with long running application by out of memory condition (7GB for Chrome).
-s BINARYEN_EXTRA_PASSES=--one-caller-inline-max-function-size=1 helped to return stability but increased wasm size.
I'm confused why inlining tends to OOM.

@kripken
Copy link
Member

kripken commented Nov 9, 2021

It sounds like the OOM is of the browser itself? Inlining leads to larger function sizes, which can require more work by the VM. For example, if the VM does work that is O(N^2) in the number of locals (to do register allocation, for example), then inlining that adds more locals (usually the case) can raise memory usage by a lot.

@andrewevstyukhin
Copy link
Contributor

Yes, browser eats memory till crash. Problem stably happens in -s ASYNCIFY=1 builds which have no precise settings.
Probably I have to test the well-tuned asyncify project to see the global impact.

imho Amount of locals can be reduced by splitting big functions into subfunctions (each outer loop for example). It's like a research to measure all browsers and derive good enough ranges.

Maybe isUnderSizeLimit should be configurable to further reduce long compilation time under Safari.

@kripken
Copy link
Member

kripken commented Nov 10, 2021

You can try -Os (instead of -O3) as well as the various wasm-opt inlining tuning flags, like --flexible-inline-max-function-size (in addition to --one-caller-inline-max-function-size that is already mentioned above). That should reduce the amount of large functions in the output.

@andrewevstyukhin
Copy link
Contributor

I already use -Os -flto. Anyway --flexible-inline-max-function-size seems a handy parameter for upcoming tuning.
A nice hint, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants