delay_ns off by a factor of two #495
On the hunch that it must have something to do with instruction fetching, and may be dependent on instruction alignment, I did a little experiment in which I inserted nops before the loop. Here, with six nops:

With the alignment above, the delay loop appears to take 7 cycles. If I add or subtract a nop, it goes back to taking 6 cycles. It seems likely that having the branch instruction and its target instruction in separate 64-bit words causes the extra delay. Not that this does anything to explain why the delay is double the expected value, but it suggests that the delay will vary based on the alignment of the delay_ns function.
Adding the modm_fastcode attribute to delay_ns brings it back to 3 cycles per loop.
Nanosecond delay is a difficult topic… I left the current implementation a little frustrated due to the limitations you discovered.
It won't be inline-able anymore, which means it will be significantly worse for very small delays in very fast bitbangs (400kHz I2C, for example). We could add an alignment restriction (.align 8) to the ASM and fix the loop count, but then the overhead may still be comparable to jumping to a RAM routine.

So... should delay_ns be moved to the .fastcode section?

I think we should; the loop/latency correctness is worth more than small delay optimizations. Want to open a PR? However, .fastcode is only placed into RAM if a fast instruction RAM exists. On the F303 there is a Core-Coupled Memory specifically for instruction bus access, so there's no additional access delay there, unlike the word-alignment delay you discovered with the Flash Cache (aka the ST ART Accelerator). We should probably move the .fastcode section into RAM at all times; it may not always be faster, but the latency may be more consistent. I can do this on your PR, since it's a bit involved.
Why not use the DWT cycle counter?
It's not available on the Cortex-M0 (and I wasn't able to get it to work on the Cortex-M7 yet), and you have to load the address into a register and periodically poll it, which costs you even more cycles per loop. So you're not getting better resolution, only better overall latency. It would perhaps be interesting to have an inline check that jumps to a fast inline implementation for very small (but inaccurate) delays, and otherwise to a RAM routine that has a bit more overhead but is more accurate?
Sorry, I didn't come back to this project for a while. The PR above is the simple fix that has been working for me on the F303.
I think this must be a bug, though I haven't yet put my finger on what's going on.
Board: Nucleo-F303K8 (I'm using the modm board config and haven't messed with the clock setup, so it's running from HSI at 64MHz)
My test:
I'm observing with scope that the delay is double the requested value. This is consistent for values ranging from 1500 to 15000 ns; it always runs twice as long as requested.
The delay_ns_per_loop appears to have been calculated correctly for 64MHz:
I have a timer PWM output that has the expected timing, so I believe that I am actually running with the correct clock frequency.
I've tried stepping through the delay loop in the debugger, monitoring DWT.CYCCNT. It only increments by 2 per loop iteration, not the expected 3, which is an error in the wrong direction to explain this problem. I'm not sure, though, whether the debugger is affecting branch timing.

The disassembled function (from gdb) looks good to me:
According to https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions, the branch should take 1+P cycles, where P is "The number of cycles required for a pipeline refill. This ranges from 1 to 3 depending on the alignment and width of the target instruction, and whether the processor manages to speculate the address early.". Even at the max value of 3 here, it should still be only 5 cycles per loop and I'm seeing 6 pretty clearly.
Any ideas?