Questions about status of dgNewtonSse #215

Open
JayFoxRox opened this issue May 1, 2020 · 2 comments

Comments

@JayFoxRox
Contributor

For running Newton on the original Xbox (Pentium 3, 733MHz, MMX and SSE; specifically no SSE2 or higher) I wondered about the status of dgNewtonSse. That is still mentioned here:

#option("NEWTON_WITH_SSE_PLUGIN" "adding sse parallel solver" OFF)

and here:

if (NEWTON_WITH_SSE_PLUGIN)
add_subdirectory(dgNewtonSse)
endif()

(and potentially elsewhere)

However, the actual folder newton-dynamics/sdk/dgNewtonSse is nowhere to be found.
To find out when / why it was deleted, I checked the git history.

The first hint I found was in 4290f25, which mentions them as "unfinished", implying they are still planned / wanted.

I then kept looking if they were finished in the past (before being unfinished by bitrot), and found the last revision with dgNewtonSse: https://github.com/MADEAPPS/newton-dynamics/tree/ce423a44e3d0e6b075d84195aec29081ce9acb66/sdk/dgNewtonSse
After that, it was renamed / moved to dgNewtonGL: https://github.com/MADEAPPS/newton-dynamics/tree/159b8b469b3f8cea55c69a8dfd93ceb1e76a68af/sdk/dgNewtonGL

Tracking the changes became tedious, so I started looking at the state of the current implementations for AVX and SSE4.2.

So my questions are:

  • When and why was dgNewtonSse removed / turned into dgNewtonGL? What was dgNewtonGL? Did SSE continue to exist elsewhere (where I didn't see it)?
  • Does auto-vectorization of modern compilers like Clang do a good enough job to use the reference solvers with good performance?
  • Would there be any expected performance benefit when adding explicit legacy SSE support (specifically SSE without SSE2+)?
  • What makes each vectorization plugin code (Sse4.2, Avx, Avx2) incompatible with other vectorization instruction sets, other than the little code in dgSolver.h?
  • The code in dgNewtonAvx2/dgSolver.cpp and dgNewtonSse4.2/dgSolver.cpp is identical with the exception of a single line. Couldn't the code be reorganized to make it easier to add support for other instruction sets without so much redundancy?
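
To illustrate the kind of reorganization the last question has in mind, here is a minimal sketch, not code from the Newton repository. It assumes the shared solver body could be written once as a template over a vector type, with each instruction set supplying only a small backend. `ScalarVec4` and `axpyLanes` are hypothetical names invented for this example; a real backend would wrap `__m128` or `__m256` behind the same interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical backend: a plain 4-lane float vector.
// An SSE or AVX backend would expose the same three operations via intrinsics.
struct ScalarVec4 {
    static constexpr std::size_t lanes = 4;
    std::array<float, lanes> v{};

    static ScalarVec4 load(const float* p) {
        ScalarVec4 r;
        for (std::size_t i = 0; i < lanes; ++i) r.v[i] = p[i];
        return r;
    }

    void store(float* p) const {
        for (std::size_t i = 0; i < lanes; ++i) p[i] = v[i];
    }

    // Computes a * b + c per lane; a backend with FMA could use a single instruction here.
    static ScalarVec4 mulAdd(const ScalarVec4& a, const ScalarVec4& b, const ScalarVec4& c) {
        ScalarVec4 r;
        for (std::size_t i = 0; i < lanes; ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
        return r;
    }
};

// Shared loop body written once: y[i] = a[i] * x[i] + y[i].
// Only the Vec type changes between instruction sets.
template <class Vec>
void axpyLanes(const float* a, const float* x, float* y, std::size_t count) {
    assert(count % Vec::lanes == 0);  // this sketch does not handle a partial tail
    for (std::size_t i = 0; i < count; i += Vec::lanes) {
        Vec r = Vec::mulAdd(Vec::load(a + i), Vec::load(x + i), Vec::load(y + i));
        r.store(y + i);
    }
}
```

Calling `axpyLanes<ScalarVec4>(a, x, y, n)` would then select the backend at compile time. Whether this fits the real solvers depends on how much the per-plugin driver code differs, which the sketch does not model.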
@JulioJerez
Contributor

There is no dgNewtonSse plugin.
The parallel solver is the default and uses the most basic version of SSE and SSE2.
This solver is part of the engine, so it works whether the engine is built statically or as a DLL.
All other plugins use more advanced versions of SSE, with instructions like gather/scatter, muladd, and some others, but they can only be loaded as DLLs.

The code template in the DLL solvers may look identical, but the driver functions are very different, and that is what makes them incompatible. For example, on a CPU that does not support AVX2 the AVX2 plugin will not load, but that CPU may still load the AVX plugin.
The AVX2 plugin in theory executes twice as many flops because of the muladd instruction; it also supports gathering, which simplifies some operations. In practice AVX2 is only about 5 to 10% faster.
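
The fallback behaviour described above can be sketched as a simple priority check. This is an illustration only, not the engine's actual plugin loader: `CpuFeatures`, `SolverKind` and `selectSolver` are hypothetical names, and the feature flags are assumed to have been filled in elsewhere (for example from CPUID).

```cpp
#include <cassert>

// Hypothetical description of what the host CPU supports.
struct CpuFeatures {
    bool sse42 = false;
    bool avx = false;
    bool avx2 = false;
};

// Hypothetical identifiers for the built-in solver and the three plugins.
enum class SolverKind { Default, Sse42, Avx, Avx2 };

// Picks the most capable solver the CPU can run, falling back to the
// built-in default solver when no plugin is usable.
SolverKind selectSolver(const CpuFeatures& cpu) {
    // Sanity check: on real x86 CPUs, AVX2 support implies AVX support.
    assert(!cpu.avx2 || cpu.avx);

    if (cpu.avx2) return SolverKind::Avx2;
    if (cpu.avx) return SolverKind::Avx;
    if (cpu.sse42) return SolverKind::Sse42;
    return SolverKind::Default;
}
```

With all flags false, as on a Pentium 3, this returns `SolverKind::Default`; a CPU with AVX but no AVX2 gets `SolverKind::Avx`.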

In general the plugins are faster because they use the SIMD vectors as if they were GPU compute units. For example, the AVX2 solver solves 16 joints per iteration, where the default solver solves one per iteration. This requires some overhead to transpose the data from array of structures to structure of arrays, so the true benefit comes when thousands of joints are resolved and the cost of transposing is amortized by the gain of the solver.
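
As a rough illustration of that transposition step, the sketch below converts an array of structures into batches laid out as structure of arrays, updates one batch lane by lane, and copies the result back. It is not taken from the engine: `JointRow`, `JointBatch`, the three fields, the batch width of 8 and the update formula are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-joint data in array-of-structures layout.
struct JointRow {
    float jacobian;
    float force;
    float rhs;
};

constexpr std::size_t kLanes = 8;  // assumed batch width for this sketch

// Structure-of-arrays batch: lane i of each array belongs to the same joint.
struct JointBatch {
    std::array<float, kLanes> jacobian{};
    std::array<float, kLanes> force{};
    std::array<float, kLanes> rhs{};
};

// Transposes AoS rows into SoA batches; unused lanes in the last batch stay zero.
std::vector<JointBatch> transposeToSoA(const std::vector<JointRow>& rows) {
    std::vector<JointBatch> batches((rows.size() + kLanes - 1) / kLanes);
    for (std::size_t i = 0; i < rows.size(); ++i) {
        JointBatch& b = batches[i / kLanes];
        const std::size_t lane = i % kLanes;
        b.jacobian[lane] = rows[i].jacobian;
        b.force[lane] = rows[i].force;
        b.rhs[lane] = rows[i].rhs;
    }
    return batches;
}

// Placeholder update applied to every lane of a batch. With this layout the
// loop maps directly onto one SIMD multiply-add over contiguous floats.
void relaxBatch(JointBatch& b) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        b.force[lane] += b.jacobian[lane] * b.rhs[lane];
    }
}

// Copies the updated forces back into the AoS rows.
void scatterForces(const std::vector<JointBatch>& batches, std::vector<JointRow>& rows) {
    assert(rows.size() <= batches.size() * kLanes);
    for (std::size_t i = 0; i < rows.size(); ++i) {
        rows[i].force = batches[i / kLanes].force[i % kLanes];
    }
}
```

The two copy passes are the overhead mentioned above: they cost the same per joint regardless of how much work the solver does in between, so they only pay off when the batched work is large.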

Auto-vectorization does not translate to big gains because the engine predates these compiler optimizations, but since the engine supports scalar math, you can just define the USE SCALAR preprocessor option and let the compiler do all the optimization.
The times I tried it, I have never seen it do a better job than the hand-made optimizations.

@JulioJerez
Contributor

JulioJerez commented May 2, 2020

If you are using an older Intel Pentium 3 (733MHz, MMX and SSE), the SSE mode is not really very good, because internally it is still a 64-bit bus.
So SSE is kind of a placeholder for that CPU; it was not until the Intel Core Duo that SSE became a real factor in float throughput. In your case the default solver is your best option, because you avoid the extra work of the plugin. Remember that the plugins are faster because they capitalize on one or more features of the instruction set, be that the 8-way float vectors of AVX, muladd, the 256-bit wide internal bus, and so on.
Your CPU does not support any of them, so the overhead of the plugin would be a waste that can't be recovered by special instructions.
