-
-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed improvements for discussion #4138
base: 0_15
Are you sure you want to change the base?
Conversation
no real difference in FPS but code is faster. also 160bytes smaller, meaning it is actually faster
tested and working, also tested video
its not faster but cleaner (and uses less flash)
-calculations for virtual strips are done on each call, which is unnecessary. moved them into the if statement.
uses less flash so it should be faster (did not notice any FPS difference though) also cleaned code in ColorFromPaletteWLED (it is not faster, same amount of code)
…alette inlining getMappedPixelIndex gets rid of function entry instructions (hopefully) so it should be faster. also added the 'multi color math' trick to color_add function (it will not make much difference but code shrinks by a few bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
There are some formatting issues (spaces) but that's minor.
wled00/FX_fcn.cpp
Outdated
@@ -1816,7 +1820,7 @@ bool WS2812FX::deserializeMap(uint8_t n) { | |||
return (customMappingSize > 0); | |||
} | |||
|
|||
uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const { | |||
__attribute__ ((always_inline)) inline uint16_t IRAM_ATTR WS2812FX::getMappedPixelIndex(uint16_t index) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non obligatory: I would prefer __attribute__
at the end but [[...]]
in front.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly have no Idea what the precompiler instructions mean in detail, I copied this from what they use in fastled...
the idea behind this is to get rid of the 'function entry' instructions that are added when a function is called. When I added the inline
flash size increased by a few bytes, telling me that it is actually inlined. Since this short function is only called from two places and is called A LOT this may be faster. I have no way to check (would need a proper debugger that shows assembly instructions being executed line by line).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've recently learned these are compiler attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 not sure if always_inline plays well with IRAM_ATTR .... the first tells the compiler to always inline the function, the latter says "put the function into IRAM" which means that a real function is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is what I was wondering too, this is just a suggestion, i.e. to inline this for speed but how to tell the compiler to inline it to the functions that are in ram... not sure how it will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding:
inline
is a hint/suggestion to the compiler. So it might get inlined, or not.__attribute__((always_inline))
is a directive. So the compiler must inline this function, no matter if its efficient or not.
If you want to optimize function calls, its sometime useful to add __attribute__((pure))
or __attribute__((const))
to the function declaration. But only do this after double-checking that the code is actually "pure" (no side-effects) or "const" (solely depends on arguments). I did this in the MoonModules fork, but honestly it does not give you more than 1 or 2 fps even if you apply it to lots of functions.
one more thing I was thinking about: |
I had the same idea! I expect that using flat temporary render buffer of the virtual() sizes, then applying all mapping operations as a post-processing step, would yield a substantial average-case performance improvement. That said, if you try to do mirror and transpose during the render, I think you'll give up most of the benefits as every write still ends up being a pile of variable lookups and conditionals instead of just straight local pointer arithmetic -- I think it'd be better to perform all mapping operations after the render. We'd also get some advantages with code layout and IRAM caching, as each FX function wouldn't need to make so many external function calls. The big tradeoff is RAM usage. The catch is that the temporary buffer needs contiguous RAM. If it's held allocated, we lose the ability to share that memory (in highly constrained systems) with other tasks like the web server... but if it's not, we risk not being able to get a big enough buffer when it's needed for a render. There may also be systems where there just isn't enough RAM to hold two copies of every pixel state at once. That said, I'd still love to see a PR that implements a temporary render buffer and pulls the mapping operation out to a separate pass after the FX function call. I think the performance and code size improvements are likely to be compelling, even if it reduces the maximum LED count that can be supported on memory constrained hardware. |
How are buffers currently? What is the global buffer vs no global buffer? |
Let's start at the bottom and make our way up. NeoPixelBus (in 2.7.9) uses 3 buffer: 2 render buffers and 1 transmit buffer. Render buffers are used in "double buffer" fashion. Transmit buffer may be small, but render buffers are of size LED count * pixel size where pixel size can be 24bits up to 64bits. (2.8.0 or later may change that as @Makuna is working on optimisations). Different hardware (RMT, I2S, ...) may have transmit buffers of different size. WLED creates its own global buffer of size LED count * 32bit (if so selected in settings) for its Segment does not create any buffers of its own (though in the past we had This (lossy reconstruction) is why mirroring, reversing and transposing have to be done in the Newer versions of NPB are going to do brightness scaling only during transmission of data to LEDs so And then, there is a CCT support. NOTE: Note the difference between 'GetPixelColor() |
Thank you for detailing this, much clearer now. I was just looking at 'soap' FX as an example that uses
A solution could be to replace the global buffer with individual segment buffers that are not mapped nor brightness adjusted but do that in one go when it is transferred to NeoPixelBus buffer (saving a lot of if statements and back and forth mapping). Would that be a good way to try and solve it or is there something fundamentally flawed with this approach? |
I am not going to rewrite all of the effects. 😄
This was used in first iteration of 0.14 while we used |
true, lookup is fast.
yes, the main difference being the color correction, right? If I disable global buffer, colors get worse but speed stays (almost) the same.
Do you mind to quickly elaborate what problems these were? |
- there already is a method to calculate the table on the fly, there is no need to store it in flash, it can just be calculated at bootup (or cfg change)
Overlapping segments. |
- gamma correction only where needed - paletteIndex should be uint8_t (it is only used as that) note: integrating the new `ColorFromPaletteWLED()` into this would require a whole lot of code rewrite and would result in more color conversions from 32bit to CRGB. It would be really useful only if CRGB is replaced with native 32bit colors.
wled00/FX_fcn.cpp
Outdated
|
||
unsigned paletteIndex = i; | ||
uint8_t paletteIndex = i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is only used as an uint8_t down the line and not involved in any calculation.
edit:
correction: I have index as an unsigned in the ColorFromPaletteWLED()
and do a byte()
cast there anyway to prevent overflows. so it may be better to revert back to unsigned for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. ATM. what about in the future?
if it is unsigned, compiler will trim it down when passing to a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok reverted to unsigned locally. saves 6 bytes of code. I thought compiler would do better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here, int
or unsigned
is almost always less code (and slightly faster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XTensa doesn't have non-word arithmetic instructions -- any operations done on a 8-bit or 16-bit types have to be performed by upcasting, operating, and downcasting afterwards to restrict the output range. At least here,
int
orunsigned
is almost always less code (and slightly faster).
in general: absolutely yes. But if a function is performed purely in 8bit (like most of the fastled stuff) sometimes the compiler optimizes in a way that makes it actually faster (smaller code) when using uint8_t
, don't ask me why exactly, I have been playing with types to improve code quite a bit and sometimes its just not really clear what is happening.
though so ;) but glad it is 'only that' |
It is not. wait for "blending styles" |
future improvements to buffers aside: should i squash this draft and make a PR? any changes required before I do that? |
Please, and do address points discussed. |
@blazoncek one more general question: |
One thing to keep in mind regarding existing effects is that most were ported from FastLED type effects. Some taken directly from FastLED, some from places like soulmate.com. FastLED operated on its CRGB (or CHSV) and so most effects disregarded the W channel. I guess this was the reason to introduce "Automatic White Calculation" for strips like SK6812 where white channel could be used to reduce power draw. Any new effect should, if possible, take into account white channel as well IMO though most will really not benefit at all. It all comes down to the "artist" how he/she envisions an effect on RGB and RGBW strips. I do not know if stripping W from calculations is wise or not without prior notice to users. |
Agreed. These are just ideas. |
I would avoid CRGB. Why? Because WLED uses NeoPixelBus and not FastLED (well, palettes are exception). NeoPixelBus has its own classes comparable to CRGB but also extended to include CCT (or WW & CW) information which does not exist in FastLED. Unfortunately the world is not as simple as that. CRGB is very common and a lot of people know what it represents. To top that some people want to control white channel independently of effects. |
just saw the commit, looks ok for me, too 👍- it achieves the same thing, just making "_colorScaled" a private variable (aka sideeffect) instead of using an explicit function parameter. |
wled00/FX_fcn.cpp
Outdated
@@ -437,7 +440,21 @@ uint32_t IRAM_ATTR_YN Segment::currentColor(uint8_t slot) const { | |||
#endif | |||
} | |||
|
|||
void Segment::setCurrentPalette() { | |||
// pre-calculate drawing parameters for faster access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blazoncek I have the feeling you took an idea that I implemented previously in the MM fork - pre-calculate drawing parameters for faster access. It is indeed a big speedup (my test in MM were up to 15% faster).
I don't want to go pricky about this, but it would be nice if you state where you took inspiration from. Something like "based on an idea from @softhack007 in the MM fork". Please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blazoncek thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to go pricky about this, but it would be nice if you state where you took inspiration from.
None of this would be needed if you'd provide similar optimisations upstream or not adopt GPL in MM. So I had to come up with something similar. You have the liberty (due to MIT) to pick anything from upstream while I need to beg people or figure out something.
Just my 2¢.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPLv3 is not the problem. The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences. I am still the author/owner of my own code, so I may "provide" it to any other repo licensed with MIT, but you're not allowed to "take" without explicit permission. I know this is not optimal and we must find a better approach soon. In general, I'm willing to provide optimizations back - making "something similar" is not a good way, but a waste of resources.
@Aircoookie You have proposed a while ago to harmonize the licensing between both repo's, and remove this "rift" in the community, by using EUPL-1.2. I'm not sure if there was further discussion on discord.
The MM team is willing to go this way. If you also agree to change your repo to EUPL-1.2, we (MM team @troyhacks @ewoudwijma @netmindz @lost-hope @Brandon502) will start collecting the necessary permissions from contributors since our change to GPLv3, and move to the same license. Please give me a sign :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is MIT licensing in the AC repo, which has become a real no-go for me (and a few others) after some very frustrating experiences.
NOTE: Emphasis mine.
Can you explain? Can others mentioned explain as well? I've added about 48k lines of code to WLED (not counting 0_15 branch) but have yet to encounter "frustrating experience". What does "frustrating experience" even mean?
Is that a stolen or "borrowed" code? It inevitably happens with GPL as well.
I am not against license change, I just want to know/understand true reasons for it. Is MIT too permissive? Everyone knew/knows WLED uses MIT so there should be no "frustrating experiences" as a consequence of that IMO. I would really love to hear everyone's perspective, not for WLED project but my personal insights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand you wanting to know more, but I'm not sure that a public discussion as a side topic on a PR is the appropriate place.
Can we arrange a call/meeting with relevant parties?
swap if statements in color_fade
IMO this is ready for broader testing. |
I have some additional modifications that are not strictly speed related but do decrease binary size for another kB or two. |
as long as we keep the commit history, this PR could be a collection of code improvements. |
- removing WS2812FX::setMode() - removing WS2812FX::setColor() - removing floating point in transition - color handling modification in set.cpp - replaced uint8_t with unsigned in function parameters - inlined WS2812FX::isUpdating() - (MAY BE BREAKING) alexa & smartnest update
- changes to `setPixelColorXY` give an extra FPS, some checks and the loops are only done when needed, additional function call is still faster (force inlining it gives negligible speed boost but eats more flash) - commented out the unused `boxBlur` function - code size improvemnts (also faster) in `moveX()` and `moveY()` by only copying whats required and avoiding code duplications - consolidated the `blur()` functions by enabling asymmetrical blur2D() to replace `blurRow` and `blurCol` - compiler warning fixes (explicit unsigned casts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few thoughts.
wled00/FX_2Dfcn.cpp
Outdated
else strip.setPixelColorXY(start + width() - xX - 1, startY + yY, col); | ||
unsigned groupLen = groupLength(); | ||
|
||
if(groupLen > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reduces the processing if group length is 1 by removing:
- four assignments (
yY
&&xX
,W
&H
)- hidden two subtractions and a comparison (
W
&H
)
- hidden two subtractions and a comparison (
- two multiplications
- two comparisons (
j
&g
,uint8_t
!) - two comparisons (
yY>=H
&&xX>=W
) - two increments (
xX
&yY
)
For each setPixelColorXY()
call.
What may seem irrelevant with low number of pixels actually produces a noticeable slowdown on large pixel count.
} | ||
} | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is unused ATM but it produces quite different blur and I put this function in intentionally.
I would not like it removed.
uint32_t newPxCol[vW]; | ||
int newDelta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see the point in modifying move()
functions, especially with convoluted math that is hard to follow.
How much speed does it gain?
wled00/FX_2Dfcn.cpp
Outdated
@@ -676,7 +652,10 @@ void Segment::drawCharacter(unsigned char chr, int16_t x, int16_t y, uint8_t w, | |||
case 60: bits = pgm_read_byte_near(&console_font_5x12[(chr * h) + i]); break; // 5x12 font | |||
default: return; | |||
} | |||
col = ColorFromPalette(grad, (i+1)*255/h, 255, NOBLEND); | |||
uint32_t col = ColorFromPaletteWLED(grad, (i+1)*255/h, 255, NOBLEND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a "shadow" of the col
from a few linea above.
FYI I am now running these improvements on my test set up (8266,32,S2,C3) for about a week and see no pitfalls. Code is smaller and faster. |
As I am moving away from WLED (at least for a while if not permanently) I would urge you to implement this PR ASAP as it benefits WLED tremendously. |
I can rebase and merge any time. It would require another beta release as there are many changes in this PR that are not well tested. If we do that #4181 should go along with it though. |
- correctly clear segment spacing change - renamed Segment::setUp() to Segment::setGeometry() - removed WS2812FX::setSegment() - removed obsolete/unfunctional word clock usermod .cpp file
|
||
stateChanged = true; // send UDP/WS broadcast | ||
|
||
if (stop) fill(BLACK); // turn old segment range off (clears pixels if changing spacing) | ||
if (stop || spc != spacing || m12 != map1D2D) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "if (stop)" do the same as isActive()
, or is there a special reason to avoid isActive() here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK, it is an old code. Perhaps if you shorten a segment. Most likely on second thought.
_vHeight = virtualHeight(); | ||
_vLength = virtualLength(); | ||
_segBri = currentBri(); | ||
fill(BLACK); // turn old segment range off or clears pixels if changing spacing (requires _vWidth/_vHeight/_vLength/_segBri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use beginDraw()
, instead of updating all cached values explicitly? It would make the code easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beginDraw()
does a lot of other things too which are unnecessary.
I did consider calling it, but decided against and added comment.
- code is a bit cleaner and faster as well - chaning array access to pointer access in bus_manager makes it a few instructions faster - changed getNumberOfPins and getNumberOfChannels to return 32bit values, saving the unnecessary 8bit conversion
Added a lot of improvements to core functions making them slightly faster and saving some flash.
I tested all functions to work the same as before. I tested speed by looking at FPS and it's hard to say how much faster it actually is but I see some improvement, maybe 1%.
The inline of getMappedPixelIndex() is just an idea, not sure that is correct nor do I know if it improves anything.
Did not test on ESP8266.