Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fiber] Adds a fiber processing library #439

Closed
wants to merge 8 commits into from

Conversation

henrikssn
Copy link
Contributor

@henrikssn henrikssn commented Jul 29, 2020

Adds support for fibers to modm, based on ideas from boost::context, RIOT-OS and FreeRTOS. It started as a solution to #432 (how did that escalate so quickly?)

The main difference to other fiber/coroutine/thread approaches is that this library does all allocation statically (no heap needed) and it has a proper C++ API, where I tried to mimic std::thread as much as possible.

This is almost as fast as ProtoThreads and the memory overhead is very small (20 bytes per thread if no stack space is needed). It is in my opinion better, as you can define local variables and there is no macro magic needed. Even an ATTiny85 with its 512B of RAM should be able to support >10 concurrent fibers and a SAMD21 CPU with 30KB RAM can handle thousands of fibers.

For the API, I was trying to not have to create stacks directly by passing stack size as template argument (e.g. modm::Fiber<512>(&f, arg1, arg2)) but it got to messy inside of the implementation (as I no longer can have generic pointers to fibers). I might give it a second attempt by creating a non-templated fiber base class.

Task status in rough priority order:

  • Supports yield()
  • Round-robin scheduler
  • Low overhead (~50 CPU cycles context switch and 20 bytes RAM / fiber on Cortex-M0)
  • Separates non-portable code and keeps it to a minimum to make it easy to port the library to other platforms
  • ISRs on their own stack
  • Support other Cortex-M/F CPUs (should be really simple, I have a blue pill I can use for testing)
  • Support hosted x64 CPUs
  • Add unit tests
  • Support yield from subroutines (properly store the LR in context switches)
  • Support fiber function return (remove it from scheduling)
  • main() function is a fiber
  • Fiber handler arguments (something like modm::Fiber(&stack, &f, arg1, arg2)
  • Mutex / Semaphore / Go-style channels
  • Delay
  • Support AVR CPUs
  • More exotic scheduling options

@salkinium salkinium marked this pull request as draft July 29, 2020 17:54
src/modm/processing/fiber/fiber.hpp Outdated Show resolved Hide resolved
src/modm/processing/fiber/fiber.hpp Outdated Show resolved Hide resolved
src/modm/processing/fiber/fiber.hpp Outdated Show resolved Hide resolved
src/modm/processing/fiber/fiber.hpp Outdated Show resolved Hide resolved
src/modm/processing/fiber/asm/arm_cm0.S Outdated Show resolved Hide resolved
@salkinium
Copy link
Member

The issue with Protothreads and also C++20 coroutines is that they all change the function signature.
Fibers don't affect the function signature, which makes compositing async drivers much easier.
But this means you can only "upgrade" from Protothreads/Resumbles to Fibers and not downgrade.

Even an ATTiny85 with its 512B of RAM should be able to support >10 concurrent fibers

With a <50B stack each? I'm not so sure about that, especially if you want to use stack hungry functions like printf in each of them. The protothreads overlay the maximum call stacks of each thread on the main stack, that's not possible with fibers, so memory requirement is ~linearly scaling with fiber count.

So if you add fiber based APIs for basic peripherals like UART/SPI/I2C you're basically excluding AVRs and low-memory Cortex-M devices. That's not acceptable I'm afraid.

Perhaps there is a way to have the Protothread/Resumables *_YIELD() and *_WAIT_*() macros call modm::yield() directly, either as an lbuild option or by checking at runtime in what context they run (if that's even possible).
Then the downsides of Protothread/Resumables would be reduced somewhat, at least from a user perspective.

So Fibers need to be kept optional in modm:platform and thus also modm:driver.

@henrikssn
Copy link
Contributor Author

The issue with Protothreads and also C++20 coroutines is that they all change the function signature.
Fibers don't affect the function signature, which makes compositing async drivers much easier.
But this means you can only "upgrade" from Protothreads/Resumbles to Fibers and not downgrade.

Agree, this is why I am looking for early feedback :) A second advantage is that you can yield from subroutines, which isn't possible with protothreads; nested resumeables cannot resume without rebuilding the call stack (which is slow).

Even an ATTiny85 with its 512B of RAM should be able to support >10 concurrent fibers

With a <50B stack each? I'm not so sure about that, especially if you want to use stack hungry functions like printf in each of them. The protothreads overlay the maximum call stacks of each thread on the main stack, that's not possible with fibers, so memory requirement is ~linearly scaling with fiber count.

Yes, if I am understanding the AVR calling convention properly, then minimum stack is 20 bytes on AVR (18 callee-saved registers + 2 bytes for PC).

You have a point in reducing usage of stack hungry functions. I see two options to this:

  1. Only use them in user-facing methods, letting them run on the user thread, which can have more stack as there is only one of them.
  2. Spawn a separate "printf" fiber and run the stack hungry methods there. We still need memory somewhere for the printf stack, but it doesn't necessarily need to be allocated statically or in the platform/drivers (e.g. have user allocate it on main stack and pass it as arg / do something equivalent with template magic).
  3. Rewrite printf so that it doesn't use so much stack, possibly by relying more on statically allocated memory.

So if you add fiber based APIs for basic peripherals like UART/SPI/I2C you're basically excluding AVRs and low-memory Cortex-M devices. That's not acceptable I'm afraid.

I think it is possible to implement lightweight fibers as a replacement for ProtoThreads and still support AVR / low-memory MCUs. For example, RIOT and FreeRTOS both has AVR ports and they use much more stack space than I do.

We should probably find a concrete use case in the current modm code base and compare the RAM requirements for that between fibers and resumeables. Do you have something in mind for that?

Perhaps there is a way to have the Protothread/Resumables *_YIELD() and *_WAIT_*() macros call modm::yield() directly, either as an lbuild option or by checking at runtime in what context they run (if that's even possible).
Then the downsides of Protothread/Resumables would be reduced somewhat, at least from a user perspective.

Hmm, won't the user then get the worst of all worlds? The coding style differs quite a bit if you have one huge main stack where stack space is "unlimited" but you have no persistence (protothreads) vs small stack where stack space is limited but you have persistence (fibers).

So Fibers need to be kept optional in modm:platform and thus also modm:driver.

If we are seriously worried about memory space inside of modm:platform / modm:driver, then we can mandate that all those operations run on a single IO fiber. Then the total overhead would be 20 bytes + as little as possible extra for local variables, which should be manageable on all platforms.

The main benefit of this is that:

  • It would fully abstract away the IO loop from the user

Drawbacks:

  • It would be less CPU efficient from having separate fibers as there is no way for the scheduler to maintain a ready list, but the IO fiber instead has to poll each state separately.
  • It is not very sleep friendly, as it would not be easy to understand when it is safe for the IO thread to idle (thus allowing the scheduler to put the CPU in standby). Ironically, that is the problem I tried to solve in the first place (:

@salkinium
Copy link
Member

then we can mandate that all those operations run on a single IO fiber.

No, since IO operations can run at different frequencies, and that won't work with just one thread. It that were possible to do, we wouldn't need Protothreads either, since you could just run in the main while(1) loop.

For example, RIOT and FreeRTOS both has AVR ports and they use much more stack space than I do.

RIOT and FreeRTOS don't support 2000+ targets out-of-box, particularly not small memory devices. modm works on every shrivelled up potato!1!! I'm contemplating ripping out newlib in favor of compiling our own modularized version of picolibc to save even more static memory. Though that's probably just an obsession on my part… 😆

Hmm, won't the user then get the worst of all worlds?

Well… do we have a choice? We're not rewriting all HAL implementations and all 60+ Protothread/Resumble-based drivers to use fibers. There must be a some compatibility or some optimized fiber-module via lbuild (like modm:platform:uart.fiber perhaps). Just thinking about this makes me nervous…

We should probably find a concrete use case in the current modm code base and compare the RAM requirements for that between fibers and resumeables. Do you have something in mind for that?

Well basically all of the examples should work fine with fibers. That's probably not a good test since most of them don't even use Protothreads ;-P

I don't actually have any numbers of what the runtime/stack size overhead for resumbables actually is. It's probably more than I think.

Off the top of my head, stack hungry functionality in modm is:

  • printf and IOStreams
  • Hardware I2C interrupts due to its complicated state machine.
  • Complicated I2C drivers with nested Resumables.
  • Our GUI library (oh boy).

I think it is possible to implement lightweight fibers as a replacement for ProtoThreads and still support AVR / low-memory MCUs.

I definitely want to have a Fiber implementation in modm, you cannot beat the comfort of it. But it should be optional, with alternative lbuild modules for fiber-enabled platform drivers in parallel to the existing ones. I think that would be very interesting for experimentation and benchmarking and very easy to not break backward compatiblity for now.

@salkinium
Copy link
Member

There are also a few other challenges we have to face in the next few years:

  • The GCC AVR backend being deprecated and removed in GCC11.
  • We may have to switch entirely to LLVM, since it is prioritized by ARM over GCC.
  • My unwillingness to support the new AVR devices by Microchip, due to their completely weird peripherals.

AVRs support in modm is not a priority for me anymore, and I'd much rather focus modm only on Cortex-M. However, there are Cortex-M devices with very low memory (STM32L011 has 2kB RAM), so there should still be a focus on memory.

@rleh
Copy link
Member

rleh commented Jul 30, 2020

excluding AVRs and low-memory Cortex-M devices

But who uses them (with >10 threads/fibres)?


So Fibers need to be kept optional in modm:platform and thus also modm:driver.

I don't agree. modm also has some drivers that rely on modm:debug (logger), that doesn't bother anyone.
I would rather prefer to drop Protothread/Resumables completely in favor of fibres.


The GCC AVR backend being deprecated and removed in GCC11. We may have to switch entirely to LLVM, since it is prioritized by ARM over GCC.

https://www.bountysource.com/issues/84630749-avr-convert-the-backend-to-mode_cc-so-it-can-be-kept-in-future-releases

@chris-durand
Copy link
Member

@salkinium I'd also like to encourage you to replace protothreads/resumables with a different concurrency concept. Last year I had to debug some innocent looking code with resumable macros, which behaved vastly different from what the reader would expect. It's not fun to look at this with a debugger, especially not very close to a deadline. Sadly, I don't remember many details. I think it was a nested function call within an RF_CALL expression.

In my opinion we should focus on having sane abstractions that are easy to use and hard to misuse. Also, the lack of local variables in resumable functions makes code less clear and harder to reason about.

It is rather surprising that resumables work so well in many cases because the GCC docs explicitly state our use of case labels in gnu statement expressions is not permitted and undefined behaviour. I wonder what happens when you compile resumables with clang/llvm. Its optimizer tends to be more aggressive than GCC's in some cases.

For me correctness and code readability are more important than optimizing the number of threads you can run on the tiniest L0.
Besides the academic exercise optimizing the memory footprint to the absolute minimum is mostly important for low-cost, high-volume products and I don't think there is someone using modm for that.

@salkinium
Copy link
Member

salkinium commented Jul 31, 2020

Ok, then let's fiberize modm. But Protothreads/Resumables will still have to be supported until the last driver has been rewritten into fibers and external code has had a chance to switch.

@salkinium
Copy link
Member

salkinium commented Jul 31, 2020

how did that escalate so quickly?

Good question…

modm also has some drivers that rely on modm:debug (logger), that doesn't bother anyone.

Yes, but that's technically optional. You can rip out modm:debug out of the drivers and the API doesn't change.
You cannot remove fibers anymore. That's what I meant with "you can only 'upgrade' from Protothreads/Resumbles to Fibers and not downgrade."

This is a major breaking API change, as well as a major change of the modm concurrency model and a major change in stack management (and overflow protection).

This isn't a bad thing, but stackful multitasking isn't without pitfalls either and certainly involves a lot of magic too. A few probable pain points:

  • stackful threads may not be as debuggable as you think, we may need to implement a GDB or OpenOCD plugin to properly recognize our thread format (FreeRTOS already has an OpenOCD plugin btw).
  • stack overflow can only be detected (using watermarks) not prevented. To make use of the MPU, we have to run in unpriviledged mode, which however cannot access certain always priviledged registers that we use (NVIC etc). Currently the main stack is "protected" by growing downwards towards the RAM origin, causing a bus fault -> hard fault on overflow. Not pretty, but detectable particularly in combination with modm:platform:fault.
  • It would be nice to have a stack size tool like https://github.com/japaric/cargo-call-stack. Altough it's not necessary.

I'd also like to encourage you to replace protothreads/resumables with a different concurrency concept

I would object that C++20 coroutines may fix the big pain points:

  • "lack of local variables"
  • "our use of case labels in gnu statement expressions is not permitted and undefined behaviour"
  • "It's not fun to look at this with a debugger"

But this PR is available now rather than C++20 in >1y so I guess we're going with fibers then. I'm quite frankly also not entirely sure how the guarantees on static memory management are on C++20 coroutines, so perhaps it's actually not that good of a solution either (requiring a heap would not help).

Besides the academic exercise optimizing the memory footprint to the absolute minimum is mostly important for low-cost, high-volume products and I don't think there is someone using modm for that.

It's not just low-cost, high-volume, it's also just physically small systems. For example the only STM32 in 3x3mm QFN package is the STM32L021, with only 2kB of RAM and 16kB of Flash. However, as I discovered, linking against newlib can allocate 2kB of static memory (if you just look at the reent struct), so the fiber stack size isn't actually my main concern. (picolibc has apparently removed the reent struct).

For me correctness and code readability are more important than optimizing the number of threads you can run on the tiniest L0.

Is there a consensus on dropping AVRs yet? It would make a lot of things easier for me (especially in modm-devices), and I don't personally use them anymore since cheap blue pill-kind of boards exist. There are a number of very interesting interface options that open up with a more flexible architecture like Cortex-M (callbacks, const data, C++ standard library).

@salkinium
Copy link
Member

Btw, I absolutely hate the fact that C++20 coroutines requires a change to the function signature. It poisons the entire call tree and you basically have to make every (template) function async now since you just don't know what could be async.

This is an issue for example in the GpioExtender class which tries to wrap an Resumable of the external IO driver (usually I2C or SPI) into a modm::Gpio API which is of course… blocking. 🤦‍♂️ This would not be a problem for fibers at all.

So we should switch to fibers. It's the grown-up solution.

@henrikssn
Copy link
Contributor Author

I am very interested in c++20 coroutines, but I don't get how they could possibly be implemented without heap. Take this quote as an example (from cppreference):

Coroutines are stackless: they suspend execution by returning to the caller and the data that is required to resume execution is stored separately from the stack.

As the coroutines support function arguments, those need to be stored somewhere, and I can't see how that could be implemented statically. This is why I am pushing for fibers instead.

Don't stay up all night waiting for this PR to get merged, it will take some time to get this right. If anyone wants to help out, let me know and I find a way to split the work into smaller PRs.

@chris-durand
Copy link
Member

As the coroutines support function arguments, those need to be stored somewhere, and I can't see how that could be implemented statically. This is why I am pushing for fibers instead.

Generally yes, but in certain cases the compiler can elide heap use. This can happen when it can prove the lifetime of the coroutine frame does not exceed the scope of the caller. Also, you can fully control allocations by overriding operator new in the coroutine promise type. You could force a linker error when heap elision doesn't happen by just declaring operator new, but not defining it. The standard does not guarantee the optimization to happen, but clang is said to be quite good at this.

@chris-durand
Copy link
Member

It's not just low-cost, high-volume, it's also just physically small systems. For example the only STM32 in 3x3mm QFN package is the STM32L021, with only 2kB of RAM and 16kB of Flash. However, as I discovered, linking against newlib can allocate 2kB of static memory (if you just look at the reent struct), so the fiber stack size isn't actually my main concern. (picolibc has apparently removed the reent struct).

If you like something tiny with some RAM go for an STM32G071EB, 2.3mm x 2.5mm package and 32 kByte usable SRAM 😉
Certainly more expensive than the L0 option.

@chris-durand
Copy link
Member

Is there a consensus on dropping AVRs yet? It would make a lot of things easier for me (especially in modm-devices), and I don't personally use them anymore since cheap blue pill-kind of boards exist. There are a number of very interesting interface options that open up with a more flexible architecture like Cortex-M (callbacks, const data, C++ standard library).

Personally, I wouldn't mind dropping AVR support. I wouldn't use a midsize or big AVR in a new design with the vast amount of affordable Cortex-M available.

@henrikssn
Copy link
Contributor Author

henrikssn commented Aug 7, 2020

I have changed the approach a bit, instead of pushing all registers to stack inside of a "jump subroutine", I now inline the assembler code into the caller and mark all registers as clobbered. That allows the compiler to only push registers to the stack which are actually in use (or simply re-read them from memory where possible), which is usually much more efficient than pushing all registers to the stack. This also meant that the assembly got simpler, which should allow the current code to run on all ARM-v6/v7 based processors (although I haven't tested it on any other than M0+).

The cool thing about this is that the minimum fiber stack size is now 4 bytes / 40 bytes (depending on if ISRs are used or not, by default it is on due to the modm SysTick ISR). It also means we can do context switching in ~25 CPU cycles (0.5us at 48MHz), which is pretty good if you ask me.

Finally, I got process/main stack switching to work to keep ISRs away from the fiber stacks. Next step is to create a port for x64 so that I can write unit tests.

@henrikssn
Copy link
Contributor Author

I have changed the approach a bit, instead of pushing all registers to stack inside of a "jump subroutine", I now inline the assembler code into the caller and mark all registers as clobbered. That allows the compiler to only push registers to the stack which are actually in use (or simply re-read them from memory where possible), which is usually much more efficient than pushing all registers to the stack. This also meant that the assembly got simpler, which should allow the current code to run on all ARM-v6/v7 based processors (although I haven't tested it on any other than M0+).

Turns out this wasn't compatible with the calling procedure, which becomes evident when yielding from subroutines (compiler doesn't properly push/pop registers from the caller). Falling back to pushing/popping all callee saved registers, and still inlining the context switching code to improve latency.

Finally, I got process/main stack switching to work to keep ISRs away from the fiber stacks. Next step is to create a port for x64 so that I can write unit tests.

This is now done, I would consider this ready for a pass from a second pair of eyes. The last commit has been tested on Cortex-M0+ (SAMD21), Cortex-M3 (blue pill) and X86_64 (hosted).

@henrikssn henrikssn force-pushed the feature/fiber branch 2 times, most recently from be3d1d2 to 441a691 Compare September 6, 2020 21:37
@henrikssn henrikssn marked this pull request as ready for review September 6, 2020 21:37
@salkinium
Copy link
Member

salkinium commented Sep 9, 2020

Ok, so I'm running this successfully on the STM32F469 (Cortex-M4F). A couple of changes:

  • Reformatting so I can read your mess ;-P
  • Too much use of modm_always_inline, it doesn't make sense to me to inline everything not even yield. I'm not even sure if modm_jumpcontext should be inlined, but I didn't delve deep enough to make that call yet.
  • I've added a .faststack section for the fiber stacks (or any static stacks in general) that is NOLOAD, thus won't increase binary size.
  • Optimized the push/pop for ARMv7-M high registers.
  • No need to find the prev fiber when registering.

There are a few general observations:

  • This is small and fast. I like this a lot.
  • I'm uncertain about using the FPU in hardfloat mode. I think we also need to save the FPU registers, but perhaps the registers get invalidated by a function call. I have to check the AACPS and run some tests.
  • modm::done() should simply be replaced by returning from the fiber. This requires placing a suitable LR value on the top of stack during modm_startcontext that calls modm_endcontext(). Not sure about return values, normally returned by r0, but unsure about returning large values that are passed as references to a struct on stack. Sounds complicated.
  • Haven't found a way to place fiber::Stack automatically into the .faststack section. Currently needs to be done by caller. I'm thinking this may require a macro MODM_FIBER(stack_size, function, args…) for static allocation.
  • main() should be the idle fiber, and the scheduler should automatically start on first yield from main. This way you can call complext initialization functions that depend on other fibers completing. This requires some changes to the startup script. Not sure about how to handle main stack size…

@salkinium
Copy link
Member

Another optimization could be to pass a function to yield that checks a condition. The scheduler could call this function and only if it returns true, switch to this fiber. Could be useful for peripherals, but not certain how annoying that would be.

modm::yield([]() { return UART->SR & UART_SR_DR; });

@salkinium
Copy link
Member

( I may have broken a few things during my wild refactorings, I'll fix it on the weekend )

src/modm/processing/fiber/context.hpp Outdated Show resolved Hide resolved
static Scheduler& scheduler() {
return scheduler_instance_;
}
static inline Scheduler scheduler;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but I believe this is wrong because initialization order of static variables across compilation units is undefined.

That said, we are probably safe because Scheduler can be compile time instantiated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wanted to remove the Scheduler completely and only have function calls, but lost motivation in the middle ;-P

Copy link
Member

@salkinium salkinium Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, you wanted delayed static initialization, but then you need to move the static Scheduler scheduler_instance_ inside the function I think? I added the guards for that in #346.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you are right, delayed initialization was my intention. We still need somewhere to store the ready list and the current running fiber, given that this is not FreeRTOS I decided to use C++ object ;)

Btw I was considering permanently storing the scheduler address in R9, if I understand the ABI correctly then that register is intended for RTOS context state. It might give us a few tens of nS lower context switch latency

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. I was aware of R9's role in PIC and TLS, but didn't know you could use it for anything you want.

The role of register r9 is platform specific. A virtual platform may assign any role to this register and must document this usage. For example, it may designate it as the static base (SB) in a position-independent data model, or it may designate it as the thread register (TR) in an environment with thread-local storage. The usage of this register may require that the value held is persistent across all calls. A virtual platform that has no need for such a special register may designate r9 as an additional callee-saved variable register, v6.

But it's not clear to me how to enable this function in the toolchain? I only found these two options in the ARM Options.

-msingle-pic-base
Treat the register used for PIC addressing as read-only, rather than loading it in the prologue for each function. The runtime system is responsible for initializing this register with an appropriate value before execution begins.

-mpic-register=reg
Specify the register to be used for PIC addressing. For standard PIC base case, the default is any suitable register determined by compiler. For single PIC base case, the default is ‘R9’ if target is EABI based or stack-checking is enabled, otherwise the default is ‘R10’.

It may be useful to use R11 for this purpose, since that makes the push/pop code smaller.

@henrikssn
Copy link
Contributor Author

henrikssn commented Sep 10, 2020

Very cool thanks for the contributions!

Ok, so I'm running this successfully on the STM32F469 (Cortex-M4F). A couple of changes:

  • Reformatting so I can read your mess ;-P

;) I have more mess incoming, what formatting tool are you using?

  • Too much use of modm_always_inline, it doesn't make sense to me to inline everything not even yield. I'm not even sure if modm_jumpcontext should be inlined, but I didn't delve deep enough to make that call yet.

True but it makes a huge difference in context switch latency. I wonder if keeping yield noinline and then aggressively inline everything inside of it is the best compromise?

  • I've added a .faststack section for the fiber stacks (or any static stacks in general) that is NOLOAD, thus won't increase binary size.
  • Optimized the push/pop for ARMv7-M high registers.
  • No need to find the prev fiber when registering.

There are a few general observations:

  • This is small and fast. I like this a lot.
  • I'm uncertain about using the FPU in hardfloat mode. I think we also need to save the FPU registers, but perhaps the registers get invalidated by a function call. I have to check the AACPS and run some tests.

Yes they need to be stored (if FPU is in use). The increase in context switch latency should be carefully weighted against the floating point performance improvement....

  • modm::done() should simply be replaced by returning from the fiber. This requires placing a suitable LR value on the top of stack during modm_startcontext that calls modm_endcontext(). Not sure about return values, normally returned by r0, but unsure about returning large values that are passed as references to a struct on stack. Sounds complicated.

What does "returning from a fiber" mean? We need to let scheduler do some cleanup, e.g. remove it from ready list and jump to next fiber, that has to be done somewhere. It is already handled by branch to LR on ARM.

  • Haven't found a way to place fiber::Stack automatically into the .faststack section. Currently needs to be done by caller. I'm thinking this may require a macro MODM_FIBER(stack_size, function, args…) for static allocation.

Could you ELI5 why this improves performance?

  • main() should be the idle fiber, and the scheduler should automatically start on first yield from main. This way you can call complext initialization functions that depend on other fibers completing. This requires some changes to the startup script. Not sure about how to handle main stack size…

Hmm, I would need to think about this. The main problem is that ISRs use the main stack, which is currently being exploited hard (note that we don't need to disable ISRs during yield). I would probably prefer to jump early into the scheduler and never return to the main stack again. We can make the main function be a fiber if that makes things easier to think about.

Another optimization could be to pass a function to yield that checks a condition. The scheduler could call this function and only if it returns true, switch to this fiber. Could be useful for peripherals, but not certain how annoying that would be.

modm::yield([]() { return UART->SR & UART_SR_DR; });

I was thinking about solving this use case with condition variables, where the fiber is moved out of the ready queue while it is blocked. It won't work for state that can change under your feet (like your example) but you can make it work if you set the CV from the corresponding ISR instead.

Re the cleanups, give me some time to merge jn my latest work (channel / semaphore / mutex and benchmark example) first. I'm sure that doesn't work either :) Depending what I do, I still get some segfaults so I think the context code is still not 100% ABI compliant.

@henrikssn
Copy link
Contributor Author

henrikssn commented Sep 10, 2020

Status update:

  1. The current state compiles and runs at least on armv7, I decided to undo the template you added because the code generation doesn't work well with my IDE and it isn't needed in this case. I also fixed the comment on the top to properly describe the register order used. I need to find my feather board so I can make sure it also works there, but it is probably ok. Would also be useful to test this properly on x86, it seems at least like the CI has some problems with it.

  2. I am still thinking about endcontext/startcontext, it might look a bit different soon so please keep that in mind.

  3. Last, I probably undid some of your nice formatting work, sorry about that, if you do it again then I promise to base future commits from your nice and tidy state.

Finally, if you do any changes to context / yield, consider rerunning the benchmark to make sure it doesn't affect performance negatively (or use it to show improvement!): modm/examples/generic/fiber

@chris-durand
Copy link
Member

Why is env[":target"].identifier.family = "darwin" in the linux CI and on my linux machine? I don't see where this PR could have changed that.

The linux example project.xml has <option name="modm:target">hosted-darwin</option>.

@rleh
Copy link
Member

rleh commented Jan 16, 2021

Ohhh. Thanks!

@chris-durand
Copy link
Member

CI fails because the hal compilation test exceeds maximum ram size on STM32L100C6U6, most likely also because of the 2kB main fiber stack.

 x   stm32l100c6u6          11.8s
/opt/gcc-arm-none-eabi-10-2020-q4-major/bin/../lib/gcc/arm-none-eabi/10.2.1/../../../../arm-none-eabi/bin/ld:modm/link/linkerscript.ld:201 cannot move location counter backwards (from 00000000200010a0 to 0000000020001000)
collect2: error: ld returned 1 exit status
scons: *** [build/release/all.elf] Error 1

@salkinium It is very suspicious that STM32L100C6U6-A compiles fine, it should also be 4kB of SRAM as the non-A variant. Its datasheet only specifies "up to 16kB SRAM" for any L100x6, but CubeMX and Digikey agree on 4kB. It seems to be a device file issue:

<memory device-pin="c" device-variant="" name="sram1" access="rwx" start="0x20000000" size="4096"/>
<memory device-size="6|8|b" device-variant="a" name="sram1" access="rwx" start="0x20000000" size="10240"/>

@rleh
Copy link
Member

rleh commented Jan 16, 2021

https://github.com/modm-io/modm-devices/blob/develop/tools/generator/dfg/stm32/stm.py#L572-L601:

'l1': {
    'start': {
        'flash': 0x08000000,
        'eeprom': 0x08080000,
        'sram': 0x20000000
    },
    'model': [
        {
            # CAT1 & 2
            'name': ['00x6', '00x8', '00xb', '51x6', '51x8', '51xb', '52x6', '52x8', '52xb'],
            'size': ['6', '8', 'b'],
            'memories': {'flash': 0, 'sram1': 0, 'eeprom': 4*1024}
        },{
            # CAT3
            'name': ['00xc', '51xc', '52xc', '62xc'],
            'size': ['c'],
            'memories': {'flash': 0, 'sram1': 0, 'eeprom': 8*1024}
        },{
            # CAT4
            'name': ['51xd', '52xd', '62xd'],
            'size': ['d'],
            'memories': {'flash': 0, 'sram1': 0, 'eeprom': 12*1024}
        },{
            # CAT5 & 6
            'name': ['51xe', '52xe', '62xe'],
            'size': ['e'],
            'memories': {'flash': 0, 'sram1': 0, 'eeprom': 16*1024}
        },
    ]
},

As far as I understand the information in dfg/stm32/stm.py is used to fix wrong or missing information from STCubeMX is the (hand-curated) source of information for the memory sizes.
@salkinium Could you please be so kind as to explain to me the meaning of 'flash': 0, 'sram1': 0,?

@salkinium
Copy link
Member

salkinium commented Jan 16, 2021

the meaning of 'flash': 0, 'sram1': 0,?

The CubeMX XML files only contain the total amount of Flash and total amound of RAM. So the zero is a token to indicate that this is the remainder of SRAM after subtracting all specified memories in the list. The algorithm for this is here. However, there is nothing to subtract here. On a side note: The size key in the filter is unused, instead the 00x8 is used. This could probably be refactored to allow abitrary keys, like the stm32_groups.py does it.

@salkinium
Copy link
Member

salkinium commented Jan 16, 2021

I'm not sure if variants with differing memory sizes are properly parsed in modm-devices. There is a mapping from DeviceID (did) to XML file, and perhaps the variant XML file is mismapped. (I'm checking, brb)

@salkinium
Copy link
Member

salkinium commented Jan 16, 2021

Nope, it's choosing the right files

[INFO] dfg.stm.reader: Parsing 'stm32l100c6u6'
[DEBUG] dfg.input.xml: Opening XML file 'STM32L100C6Ux.xml'
...
[INFO] dfg.stm.reader: Generating Device Tree for 'stm32l100c6u6'

[INFO] dfg.stm.reader: Parsing 'stm32l100c6u6a'
[DEBUG] dfg.input.xml: Opening XML file 'STM32L100C6UxA.xml'
...
[INFO] dfg.stm.reader: Generating Device Tree for 'stm32l100c6u6a'

@rleh
Copy link
Member

rleh commented Jan 16, 2021

CubeMX data says 10k RAM for the -A variants (see db/mcu/STM32L100C6UxA.xml):
grafik
grafik


But CubeMX GUI does not:

grafik


ST marketing also says 4kB:

grafik
(https://www.st.com/en/microcontrollers-microprocessors/stm32l100-value-line.html)


Datasheet (p.10) also states it's 4kB...

@chris-durand
Copy link
Member

We can probably assume 4 kB is correct. So we add a patch in the device file generator?

@chris-durand
Copy link
Member

There is still an issue with the ARM v7 implementation since modm uses hardware floating point. Even for the softfp ABI (FPU used, but floats passed in integer register across function call boundaries) it seems like we have to preserve d8-d15 fp registers.

In this example I clobbered all floating point registers with an inline asm statement and the compiler preserves d8-15 even with -mfloat-abi=softfp. I could also trigger this behavior by compiling a modified random math code excerpt from newlib. The compiler saves d8 in this one.

If I am not overlooking something, we have to save 64 byte of fp registers on a context switch when we use the FPU. If that's really true, we should have an lbuild option to choose between hardware floating point or faster context switching on Cortex M4F/M7 platforms.

@salkinium salkinium mentioned this pull request Jun 9, 2021
@chris-durand
Copy link
Member

I would like to continue to get this to a state where we can merge. I have rebased this here. There were some conflicts to resolve due to recent linker script refactorings.

@salkinium Should we continue here and I force-push or would you prefer a new PR?

@salkinium
Copy link
Member

salkinium commented Jul 18, 2021

Thank you for looking at this, I'm busy with exams for the next ~6w.

I'd prefer keeping this here, to preserve the full context, incl. previous reviews. Force-push away!

@@ -137,7 +137,7 @@ def run(self):
lbuild_command += ["avr.xml"]
shutil.copyfile("avr.cpp", os.path.join(tempdir, "main.cpp"))
else:
lbuild_command += ["cortex-m.xml", "-D:::main_stack_size=512"]
lbuild_command += ["cortex-m.xml", "-D:::main_stack_size=512", "-D:::main_fiber_stack_size=512"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was kinda intended to reuse the main_stack_size, since that is now ignored (since there is no more non-fiber main()).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's ignored. Isn't there still a main stack where the ISRs execute and all code before fibers run?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, at least on Cortex-M, AVRs don't have that option…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only thought about cortex-m. There is hosted as well and potential other platforms we might support in the future, so this has to go to some platform specific setting. It is still useful to be able to configure both stack sizes on cortex-m. What do you suggest?

Copy link
Member

@salkinium salkinium Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain about this feature, since we cannot make AVR and hosted main a fiber, since we don't control how their main() is called. It would be inconsistent and confusing if the main function is a fiber only on Cortex-M. The only issue is when modm::yield() is called before the scheduler is initialized, but perhaps we can just spin in that case or modm_assert on it. I would remove the main fiber and make everything explicit, even if it will require refactoring some examples in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure as well and would opt for consistency and removing the main fiber. That would also have the benefit of improved backwards compatibility for existing code with tiny 2-4kB RAM devices where splitting the stack to main + main fiber stack could be problematic.

docs generation is still broken: Error generating documentation for device at90can64-16mu: 'NoneType' object is not subscriptable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens when lbuild has an error, usually when module dependencies are not satifed. I need to improve error reporting here…

@salkinium
Copy link
Member

salkinium commented Jul 18, 2021

we have to preserve d8-d15 fp registers.

We can at least determine if the FPU has been used. There's also lazy stacking, but it's… convoluted.

@chris-durand
Copy link
Member

chris-durand commented Jul 18, 2021

we have to preserve d8-d15 fp registers.

I remembered that the FPU has lazy stacking and indeed it can be used in context switching as well. This is the shit that makes me love Cortex-M so much ;-P

Doesn't lazy stacking only happen on interrupts? Wouldn't we need to implement something like this manually for fibers? We could check the fpu used flag (FPCA) and set some bit in the stored context. Whenever this bit was set once, the fiber saves fpu registers. On context switched this is reset to the current value from the context. Not sure how efficient that would be.

EDIT: Didn't see you edit because of bad mobile internet.

@salkinium
Copy link
Member

I'm closing this in favor of #743.

@salkinium salkinium closed this Oct 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants