replaced repeated progress() calculation calls with a variable #4256

Open: 3 commits to merge into base: 0_15

DedeHai
Collaborator

@DedeHai DedeHai commented Nov 7, 2024

progress() is called in setPixelColor() re-calculating the transition progress for each pixel.
Replaced that call with an inline function to get the new segment variable. The progress is updated in service() when handleTransition() is called. The new variable is in a spot where padding is added, so this should not use more RAM.

Result: over 10% increase in FPS on 16x16 matrix during transitions
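
A minimal sketch of the idea in this description, not the actual patch: the progress value is computed once per frame, and the per-pixel path only reads the stored result. `CachedSegment`, its fields and `computeProgress()` are hypothetical names, and the millisecond timing fields are assumptions; only `progress()` and `updateTransitionProgress()` are names taken from this PR.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a segment; not the WLED Segment struct.
struct CachedSegment {
  uint32_t transitionStart = 0;      // ms timestamp when the transition began
  uint16_t transitionDuration = 0;   // ms; 0 means no transition is running
  uint16_t cachedProgress = 0xFFFF;  // 0..0xFFFF, refreshed once per frame

  // The comparatively costly part: subtract, compare, multiply, divide.
  uint16_t computeProgress(uint32_t nowMs) const {
    if (transitionDuration == 0) return 0xFFFF;
    uint32_t elapsed = nowMs - transitionStart;  // unsigned math tolerates timer wrap
    if (elapsed >= transitionDuration) return 0xFFFF;
    // elapsed < 65536 here, so the product fits into 32 bits
    return static_cast<uint16_t>(elapsed * 0xFFFFu / transitionDuration);
  }

  // Intended to run once per frame, before any pixel is written.
  void updateTransitionProgress(uint32_t nowMs) {
    cachedProgress = computeProgress(nowMs);
  }

  // Cheap accessor for the per-pixel path: a plain member read.
  inline uint16_t progress() const { return cachedProgress; }
};
```

On a 16x16 matrix this replaces 256 progress calculations per frame with one calculation and 256 member reads, which is where the saving would come from under these assumptions.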

@DedeHai DedeHai requested a review from softhack007 November 7, 2024 18:11
@@ -317,12 +317,12 @@ void Segment::stopTransition() {
 }

 // transition progression between 0-65535
-uint16_t IRAM_ATTR Segment::progress() const {
+void IRAM_ATTR Segment::updateTransitionProgress() {
Collaborator

I think you could use IRAM_ATTR_YN here - it means that esp32 puts the function into IRAM, while 8266 doesn't. We'll save some IRAM space especially on the "_compat" builds.

As the function is only called once per frame in WS2812FX::service() - via seg.handleTransition() - it might even be better to remove the IRAM_ATTR as this call is not performance critical any more.
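
A rough sketch of how such a conditional attribute can work, assuming a macro that expands to `IRAM_ATTR` on some targets and to nothing on others. `SKETCH_IRAM_ATTR_YN`, `SKETCH_SAVE_IRAM`, `blendChannel()` and `frameProgress()` are hypothetical names for illustration; this is not the definition used in the WLED sources. The `ESP8266` define is assumed to be supplied by the build environment.

```cpp
#include <cstdint>

// On a host compiler IRAM_ATTR does not exist; make it a no-op so this compiles.
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

// Hypothetical "yes/no" attribute: nothing on ESP8266 or when IRAM should be
// saved, IRAM_ATTR everywhere else.
#if defined(ESP8266) || defined(SKETCH_SAVE_IRAM)
#define SKETCH_IRAM_ATTR_YN
#else
#define SKETCH_IRAM_ATTR_YN IRAM_ATTR
#endif

// Per-pixel hot path: a candidate for the conditional attribute.
uint8_t SKETCH_IRAM_ATTR_YN blendChannel(uint8_t from, uint8_t to, uint16_t progress) {
  int32_t diff = static_cast<int32_t>(to) - static_cast<int32_t>(from);
  return static_cast<uint8_t>(from + diff * static_cast<int32_t>(progress) / 0xFFFF);
}

// Once-per-frame helper: no attribute, so it stays in flash on every target.
uint16_t frameProgress(uint32_t elapsedMs, uint16_t durationMs) {
  if (durationMs == 0 || elapsedMs >= durationMs) return 0xFFFF;
  return static_cast<uint16_t>(elapsedMs * 0xFFFFu / durationMs);
}
```

The split mirrors the argument above: only code that runs for every pixel is worth the scarce IRAM, while a function that runs once per frame is not.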

Collaborator Author

I thought about removing the attribute but left it as is since I have no way to check the difference. I once as a test removed all IRAM_ATTR and on my setup there was zero performance change.
I think removing it here is safe, as you say, this is only called once per frame.

Collaborator

esp8266 will thank you ;-)

Collaborator Author

@DedeHai DedeHai Nov 10, 2024

Maybe WLED_SAVE_IRAM should also be defined on ESP32 C3: it is not as performant as the ESP32 anyway, and putting stuff in IRAM uses a lot more flash for some reason. If I enable WLED_SAVE_IRAM on the C3, that saves 1.6k of flash. On the ESP32 it only saves 68 bytes of flash.
Any suggestions for performance tests that would show if this is a valid option?

Collaborator

Any function marked with IRAM_ATTR will always be kept in fast SRAM and will never be fetched from flash. IRAM_ATTR is primarily intended for ISRs and for functions that may access (write to) flash directly.
The benefit of using it elsewhere is faster calls to such a function, as it never goes through the flash cache hit/miss logic.

Contrary to what @softhack007 is saying I still think adding IRAM_ATTR to functions that are called very often is beneficial. I am saying this from experience with over 50 installed ESP8266s with various options and usermods installed.

Collaborator

Contrary to what @softhack007 is saying I still think adding IRAM_ATTR to functions that are called very often is beneficial.

It might be beneficial, however we talked about the new progress() that's only called a few hundred times per second (max) now. The function is not time critical any more with this PR, so why use IRAM_ATTR for it?

Collaborator

@softhack007 softhack007 Nov 11, 2024

(side-topic)

I once as a test removed all IRAM_ATTR and on my setup there was zero performance change.

This aligns with my own experiments on -S3 and esp32 with 80 MHz flash: no noticeable performance impact, though IRAM_ATTR sometimes increases program size. This can be explained by the compiler not being able to inline such a function, even when inlining would reduce program size.

Maybe it also depends a lot on the CPU caches. In fact, a function that's called really often has a good chance of being cached by the CPU already. Also, a board with fast flash (qio 80 MHz) is around 4x faster at flash reading than one with slow flash (dout 40 MHz).

Many cheap 8266 boards still have 40 MHz dout flash plus smaller caches, so it makes sense that IRAM_ATTR still has some benefit on boards with slow flash.

Collaborator

I have no doubt that any ESP32 performs adequately without IRAM_ATTR.
However, ESP8266 is another thing, and while it may be old and lacking, it is still used by many users (including me) who keep attaching plenty of peripherals to it while running WLED. Hence I strongly urge keeping IRAM_ATTR in as many places as possible.

Collaborator Author

can you test if this PR with latest commit has any impact on ESP8266? i.e. removed IRAM_ATTR from updateTransitionProgress()

The other question was to add WLED_SAVE_IRAM to C3 builds as it would save on flash size but may have negative performance impacts in certain situations / setups. If there is a way to test that on the C3 I could check. After all, the C3 is more of an upgraded ESP8266 compared to other ESP32 variants.

wled00/FX_fcn.cpp Outdated
`updateTransitionProgress()` is called only once per frame, no need to put it in RAM.
wled00/FX.h Outdated
@@ -363,6 +363,7 @@ typedef struct Segment {
 };
 uint8_t startY; // start Y coodrinate 2D (top); there should be no more than 255 rows
 uint8_t stopY; // stop Y coordinate 2D (bottom); there should be no more than 255 rows
+uint16_t transitionprogress; // current transition progress 0 - 0xFFFF
Collaborator

This could be a static variable as it will be modified in each handleTransition() call at the start of service() loop.

Collaborator Author

what is the current behaviour if for example the palette of one segment is changed with 10s transition, and after 5s palette of a second segment is changed? Will that stop transition of the first segment? i.e. are transition start/stop per segment or global?

Collaborator

The transition time is bound to strip ATM but each segment may start transition at its own (point in) time. So each segment should transition on its own (that was my goal when implementing current transitions, compared to previous, limited transitions) independent from others.

Collaborator

@softhack007 softhack007 Nov 11, 2024

This could be a static variable as it will be modified in each handleTransition() call at the start of service() loop.

Rather than make it static, I'd say put it into strip.

Static member attributes of a class are always a good source of confusion when trying to read code written by someone else, because a 'static' member is technically not even part of the object you work with. It survives deleting a segment, and changing it through one object instance also changes the value seen by all other instances. Copying one segment to another will not create a copy of static member attributes; there will still be only one value.
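
The three properties listed above can be checked with a small sketch. `StaticDemoSegment` and its members are hypothetical names chosen for illustration, not the WLED `Segment` struct.

```cpp
#include <cstdint>

struct StaticDemoSegment {
  uint16_t ownProgress = 0;        // one value per object
  static uint16_t sharedProgress;  // one value for the whole class
};
uint16_t StaticDemoSegment::sharedProgress = 0;

// Returns true if the static member behaves as described in the comment above.
inline bool staticMemberIsShared() {
  StaticDemoSegment a, b;
  a.ownProgress = 100;
  a.sharedProgress = 200;  // same as StaticDemoSegment::sharedProgress = 200
  bool seenByOthers = (b.sharedProgress == 200) && (b.ownProgress == 0);

  StaticDemoSegment copy = a;  // copies ownProgress only
  copy.sharedProgress = 300;   // also changes what 'a' and 'b' see
  bool copyShares = (a.sharedProgress == 300) && (copy.ownProgress == 100);

  StaticDemoSegment* temp = new StaticDemoSegment();
  temp->sharedProgress = 400;
  delete temp;  // the static value outlives the object
  bool survivesDelete = (StaticDemoSegment::sharedProgress == 400);

  return seenByOthers && copyShares && survivesDelete;
}
```

Note that `sizeof(StaticDemoSegment)` covers only `ownProgress`; the static member occupies no space inside any instance.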

Collaborator

Rather than make it static, I'd say put it into strip.

That would be a worse choice IMO. It can be considered in the same way as _vLength, etc. in the speed improvements branch by @DedeHai. It is a pre-calculated value used as a speedup.
If you think this will cause confusion, keep it as an instance member rather than strip member.
But that is just my opinion, no need to take it into account.

Collaborator

progress() determines how far the transition has advanced. It should not be "frame" based. It only depends on the time since the transition (for that segment) started. BTW it is possible to have multiple segments in transition with differing progress() values.

Collaborator Author

what is the reason progress() should depend on the order of segment processing or when does this give an advantage over calculating it once per frame? If for example there are lots of segments with heavy FX calculation (>1ms), later segments would already be further in the transition in the same frame, resulting in a different palette or brightness for example. That does not appear to be the correct approach IMHO. It would only be right if startTransition() is called with the same time delay, which I don't think it is. Using the 'per segment' update is more consistent code though, as you said. I think your suggestion of following the Segment::beginDraw() is a good compromise.

Collaborator

@blazoncek blazoncek Nov 15, 2024

what is the reason progress() should depend on the order of segment processing

It should not be dependent on the order of segment processing. Each "segment" (if you want it) has its own "next time" for update in the form of the effect function's return value. So some effects may run at an independent frame rate. Look at Android and a few others (e.g. Game of life).

What matters is the time difference between the start of transition and current time (in regards to the transition duration). This is not frame or segment based. It is true that segment holds both start time and transition duration and these two are unique to each segment. This was the prime reason to have progress() as a function that calculates how far transition has progressed. If you impose a limit that progress is calculated at the start of frame (for optimisation for speed) then it needs to be pre-calculated for each segment separately. Hence the suggestion for beginDraw().

Why do you think that having a different progress value for each segment is wrong? Each segment operates independently and displays effect and/or palette independently of other segments. Even if overlaid. It is totally acceptable for two segments to be at different progress levels.

And it is quite easy to achieve that.
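
One way to read this suggestion, as a sketch under assumptions: every segment keeps its own start time and duration and pre-calculates its own progress value before it is drawn. `SegSketch`, its fields and `prepareSegments()` are hypothetical; only the `beginDraw()` name comes from this thread, and what the real `Segment::beginDraw()` does is not shown here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-segment state; field names are illustrative, not WLED's.
struct SegSketch {
  uint32_t startMs;     // when this segment's transition began
  uint16_t durationMs;  // transition length in ms
  uint16_t progress;    // pre-calculated 0..0xFFFF value for this draw

  // Assumed hook that runs once before this segment's effect is drawn.
  void beginDraw(uint32_t nowMs) {
    uint32_t elapsed = nowMs - startMs;
    progress = (durationMs == 0 || elapsed >= durationMs)
                   ? 0xFFFF
                   : static_cast<uint16_t>(elapsed * 0xFFFFu / durationMs);
  }
};

// Each segment gets its own value, so segments that started at different
// times end up at different progress levels.
inline void prepareSegments(SegSketch* segs, std::size_t count, uint32_t nowMs) {
  for (std::size_t i = 0; i < count; ++i) segs[i].beginDraw(nowMs);
}
```

For the case raised earlier in this thread (a 10 s transition on one segment, a second segment starting 5 s later), sampling at 7.5 s puts the first segment at about 75% and the second at about 25%.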

Collaborator Author

There are different scenarios, your reasoning seems sound to me. My scenario was this: have 10 segments, all displaying the same FX, change palette on all of them at the same time -> a transition starts on all segments. The palette change may now not be simultaneous on all segments, as the last ones may already have progressed further. Is my thinking here correct for current code? I am not arguing against your suggestion to keep it 'per segment', which I think is the way to go, just trying to understand where a 'once per frame' would be better and where it would be worse.

Collaborator

It is correct and in most cases this will be the case (same start for all segments). But not in all cases (and not all segments may have same effect and/or palette).

So, if you have an exception, treat it as a regular behaviour (not all segments may start transition at the same time).
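
The 10-segment scenario discussed above can be put into numbers with a small simulation. This is a sketch with made-up timings, assuming a linear 0..0xFFFF progress and a fixed render time per segment; `progressAt()` and `sampleFrame()` are hypothetical helpers, not WLED functions.

```cpp
#include <cstdint>
#include <vector>

// Linear progress in the range 0..0xFFFF.
inline uint16_t progressAt(uint32_t elapsedMs, uint16_t durationMs) {
  if (durationMs == 0 || elapsedMs >= durationMs) return 0xFFFF;
  return static_cast<uint16_t>(elapsedMs * 0xFFFFu / durationMs);
}

// All segments started their transition together; drawing one segment takes
// renderMs. perSegmentClock = true samples the clock when each segment is
// processed; false reuses the timestamp taken at the start of the frame.
inline std::vector<uint16_t> sampleFrame(unsigned segments,
                                         uint32_t frameStartElapsedMs,
                                         uint32_t renderMs,
                                         uint16_t durationMs,
                                         bool perSegmentClock) {
  std::vector<uint16_t> out;
  for (unsigned i = 0; i < segments; ++i) {
    uint32_t elapsed = frameStartElapsedMs + (perSegmentClock ? i * renderMs : 0);
    out.push_back(progressAt(elapsed, durationMs));
  }
  return out;
}
```

With 10 segments, 2 ms per segment and a 1000 ms transition sampled 500 ms in, the per-segment clock yields 32767 for the first segment and 33947 for the last, a spread of roughly 1.8% of the full range; with a single frame timestamp all ten segments get 32767.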

@blazoncek
Collaborator

I've tested the implementation that uses static transitionProgress variable but otherwise uses exactly the same logic as this PR and everything seems to work as expected.

@netmindz
Collaborator

netmindz commented Dec 5, 2024

Did we reach an agreement on the final implementation? @DedeHai

@DedeHai
Collaborator Author

DedeHai commented Dec 5, 2024

we did, as just (wrongly) mentioned in the other PR: @blazoncek has already implemented and tested the changes

@netmindz
Collaborator

netmindz commented Dec 5, 2024

Are these changes in another PR, already merged, or just local on your machine at the moment @blazoncek ?

@blazoncek
Collaborator

All of the changes are in my fork, combined with other updates. Probably not cherry-pickable, unfortunately. The implementation there uses a static member.
IMO a better place for introducing this would be the "speed improvements" PR, as it is just that.

Speed improvements and progress() improvements are running on my set-ups for over a month now without issues or glitches. But with some of my additions on top.

@DedeHai
Collaborator Author

DedeHai commented Dec 5, 2024

I am not sure it is the best idea to put all changes in the same "speed improvements" PR. If you feel confident about that, I am fine with closing this one. We just have to remember that the speed improvements PR must not be squash-merged or debugging will be hard.

@softhack007
Collaborator

softhack007 commented Dec 5, 2024

I am not sure it is the best idea to put all changes in the same "speed improvements" PR

I agree with you - mashing up everything into a "speed improvement" PR which will not be cherry-pickable (as bugfixes and enhancements are all squashed up) does not seem like a good solution. We need a way to keep the codebase understandable and maintainable, so smaller PRs are better imho.

It might be possible to create a temporary branch of 0_15, and then merge all optimizations into this 0_15_optimized branch. Finally create a new PR where everything is consistent, but commit history is preserved. This may require a few "git merge" steps from the command line, though.

Edit: If I get a list of PRs to combine, I can give it a try next week.

@softhack007 softhack007 added the "optimization" label (re-working an existing feature to be faster, or use less memory) Dec 5, 2024
@DedeHai
Collaborator Author

DedeHai commented Dec 6, 2024

@softhack007 I have most of my PRs running here already, merging them all is a bit of a pain as there are many conflicts to resolve. I merged them in this branch: https://github.com/DedeHai/WLED/tree/0_15_PS_potpourri
The open PRs in question besides this one are:
#4245 #4225 #4138
also #4145 (adds new FX features)

@blazoncek are your changes public? I can manually add them to this PR if not cherry-pickable, it's just a few lines that changed, right?

@blazoncek
Collaborator

Anything I do is public when stable enough. Never hid anything. 😉

@DedeHai
Collaborator Author

DedeHai commented Dec 6, 2024

@softhack007 I think merging into a new branch is not necessary, I can do the (non squashed) merge of the PRs in the correct order to have the least conflicts and resolve any that remain. What I probably will do is a test-run of the merges in my repo to figure out what the best order is: in that potpourri branch I had to start over twice...

this is now more aligned with other variables using the same logic
@DedeHai
Collaborator Author

DedeHai commented Dec 7, 2024

@blazoncek are the changes in the latest commit ok?

@blazoncek
Collaborator

I think yes, though I also moved updateTransitionProgress() into inline handleTransition() as it is pointless to force a function call since the function is only called once.

@DedeHai
Collaborator Author

DedeHai commented Dec 7, 2024

pointless yes but I find it much more readable than jamming it all in that inline.
