🐛 Fix sync. vs. async. exception collision #327
Bad news.. My FreeRTOS application crashes on #327, but runs on #326. I haven't had much time yet to dive into the reason why this is so. FreeRTOS runs into an assert; telling me that the scheduler was not suspended while trying to unsuspend it. So there is a problem with the order in which things happen. |
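For context on the assert: a scheduler that allows nested suspension usually keeps a counter of outstanding suspend calls and checks it on resume. The following is a minimal sketch of that idea, not FreeRTOS source; `sched_model_t`, `model_suspend_all` and `model_resume_all` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a scheduler suspend-nesting counter.
 * All names are hypothetical; this is NOT the FreeRTOS implementation. */
typedef struct {
    uint32_t suspend_depth;  /* number of outstanding "suspend all" calls */
    bool     assert_failed;  /* set when a resume has no matching suspend */
} sched_model_t;

/* Each suspend call increases the nesting depth. */
static void model_suspend_all(sched_model_t *s)
{
    s->suspend_depth++;
}

/* A resume with depth 0 is the unpaired case that an assert would flag. */
static void model_resume_all(sched_model_t *s)
{
    if (s->suspend_depth == 0u) {
        s->assert_failed = true;
        return;
    }
    s->suspend_depth--;
}
```

In this model a correctly paired suspend/resume never trips the check. It only fires if the counter is changed behind the program's back, for example by a corrupted register or a lost memory access.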
Hi Stephan.. I looked into this a bit further. I tried to run my test program in VHDL simulation again, and interestingly I cannot see any difference between my fix (#326) and your fix (#327). And yet, I tried once more to simply build the FPGA after a checkout of #326 and with #327, and I can confirm that one runs into an assert and the other doesn't with the exact same .elf file. With your fix the assert happens kind of right away, and with the other one I can 'ping' my device over the network for hours without hangup. (Both timer and network Rx cause interrupts). I am still in doubt how much time I am willing to put into this to figure out what the differences really are. How do you think I can be of best help? |
Hey @GideonZ. First of all, I have to say thank you again. You put so much time and extra work in finding this bug and I am very grateful for this. 👍
That's right. For some reason I messed up the priority of certain traps. I have just fixed that. Furthermore, I did some tests where all "types" of traps (interrupt, exception, debug-mode entry) kick in at the same time to verify they are handled in the right way. This seems to work now. ✔️
The new version of this PR now also shows this behavior (#326 (comment)). There is no chance of stepping over a hardcoded EBREAK instruction. I thought this was a bug in the hardware, but now I think this is a "feature" of GDB. I found this conversation https://sourceware.org/pipermail/gdb/2021-January/049125.html and it seems that GDB cannot handle hardcoded EBREAKs out of the box. You need to declare it as a breakpoint to be able to step over it. I hope the new version is finally working as expected. It would be great if you could also check this using your setup. |
You are very welcome, @stnolting... Don't forget that it is in my own interest to get this CPU to work. Due to component shortages and some nasty tricks that Intel pulled on me, I had to change the FPGA on a product. I switched from Cyclone IV to Lattice ECP5, which also means that the application had to be ported from Nios-II to something else. RISC-V was a logical choice. I looked at other cores, but 1) I am not very fond of Verilog and the resulting mixed-language simulation I would get myself into when choosing one of those cores, and 2) your project looks solid. It stands out in many ways: documentation, ecosystem (although I am not using that part), and the whole paradigm of 'strictness' you are following.
Unfortunately it doesn't. The behavior is different. And frankly, looking at the commit you made, I cannot see why it would behave differently, because the changes only seem to affect the debugger? I am not having any issues with the debugger; the program just fails to run. But as said, it behaves differently now. Before, FreeRTOS would tell me that a "ResumeAll" did not pair correctly with a "SuspendAll" call. Now, it's running into an unaligned data trap. Seems to be a corrupt stack frame that causes a data fetch to fail (mcause = 0x04). Allow me some time to dig in deeper, so we can properly bring this to a good conclusion. |
Oh damn! 😅
I was expecting there was a problem with the debugger interfering with some exceptions, but it seems like this is not the case. Anyway, I had to fix that because that was also violating the debug-mode-related trap priority given by the RISC-V debug specs.
I do not think this is caused by the application itself, but by some exceptions triggering right after each other (which they might not do) corrupting the trap stack. I tried many different trap combinations and they all seem to work, so there must be something special about your setup - or maybe I just forgot something obvious. By the way, what kind of traps is your setup using at all?
|
Btw, mcause = 0x4 is a "load address misaligned" exception, so a data access, not an instruction fetch. |
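As a quick reference for the `mcause` value mentioned above, here is a small sketch that maps the standard machine-mode exception codes of the RISC-V privileged specification to names on RV32. The function name `mcause_exception_name` is an illustrative assumption and is not part of the NEORV32 software framework.

```c
#include <stdint.h>

/* Map a 32-bit mcause value (RV32) to a human-readable name.
 * Bit 31 set means "interrupt"; otherwise the value is an exception code.
 * The function name is hypothetical and used for illustration only. */
static const char *mcause_exception_name(uint32_t mcause)
{
    if (mcause & 0x80000000u) {
        return "interrupt";
    }
    switch (mcause) {
        case 0x0: return "instruction address misaligned";
        case 0x1: return "instruction access fault";
        case 0x2: return "illegal instruction";
        case 0x3: return "breakpoint";
        case 0x4: return "load address misaligned";
        case 0x5: return "load access fault";
        case 0x6: return "store/AMO address misaligned";
        case 0x7: return "store/AMO access fault";
        case 0x8: return "environment call from U-mode";
        case 0xB: return "environment call from M-mode";
        default:  return "other/reserved";
    }
}
```

Calling `mcause_exception_name(0x4)` yields "load address misaligned", which matches the data-access interpretation given above.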
I am not excluding the possibility that there is a problem with my application / setup, that's why I am also looking for the root cause and don't leave it all up to you. :-)
That is it. There is one interrupt, and this level-triggered interrupt is connected to the One thing that is probably different from your setup is that I am running from DDR2 memory. The latency of the memory is not great, so there are usually quite a few clocks between instructions. Interestingly, I tried printing a dot to the console in the interrupt. One of the debug lines in the application that usually says "Called xxx for 1 objects" now said "Called xxx for .6261982 objects". Note the dot just before printing the %d value? Interrupt comes, register value is broken and a completely different value prints.
I am currently suspecting an interrupt occurring during a branch. Going to build a test case for that later tonight or tomorrow. Did you notice that in instruction stepping mode you cannot take a jump? Just like ebreak, the debugger stays on the jump. |
Okay... Update.. Until now I had not managed to get a very simple FreeRTOS demo to crash, but now it does. It consists of two tasks, each printing an incrementing number to the console. So I'll load this demo into the VHDL simulator and give it a good look. :-) I tried it twice. Once it ended with an assert from FreeRTOS, the second time it ended up in the endless unhandled trap loop. I suspect both cases are due to corrupted registers. Output with a high IRQ rate (2 kHz); task 2 has a higher priority, so task 1 never gets to print anything:
|
Does that mean you are using machine and user mode for your application?
That should not be a problem here. Btw, do you use the instruction cache?
in my test program even that works without problems. But I have not tested special "branch conditions" for example using |
I think I can confirm the "data corruption issue"! I just saw the corruption of one variable in my test program: a global variable is fetched from memory for evaluation but never reaches the register file. Seems like the memory access is incorrectly aborted if there is an interrupt pending. I will check that. |
abort memory access only if there is memory-access-related trap pending (access error or alignment error, both for load and store)!
There is indeed a bug in the memory abort logic! The modifications to the trap controller also require modifying the abort condition(s). I just fixed that and now my variables are happy again. 😉 Hopefully, this also fixes your broken variable problem. |
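The abort condition described here can be sketched as a pair of predicates. This is a software model written under assumptions, not the actual VHDL of the core: the struct `trap_state_t` and both function names are hypothetical, and the "any trap" variant is only a guess at the over-eager behavior suggested by the observation that accesses were aborted while an interrupt was pending.

```c
#include <stdbool.h>

/* Pending-trap flags of a hypothetical, simplified trap controller model. */
typedef struct {
    bool irq_pending;         /* any asynchronous interrupt is pending */
    bool load_misaligned;     /* load address misaligned */
    bool load_access_fault;   /* bus error on load */
    bool store_misaligned;    /* store address misaligned */
    bool store_access_fault;  /* bus error on store */
} trap_state_t;

/* Over-eager condition (assumed): any pending trap cancels the access,
 * so a pending interrupt can drop an otherwise valid load. */
static bool abort_mem_access_any_trap(const trap_state_t *t)
{
    return t->irq_pending || t->load_misaligned || t->load_access_fault ||
           t->store_misaligned || t->store_access_fault;
}

/* Condition as described in the fix: only memory-access-related
 * exceptions (alignment or access error, load or store) cancel the access. */
static bool abort_mem_access_fixed(const trap_state_t *t)
{
    return t->load_misaligned || t->load_access_fault ||
           t->store_misaligned || t->store_access_fault;
}
```

With only `irq_pending` set, the first predicate aborts the access and the second does not, which is the difference between a load that never reaches the register file and one that completes before the interrupt is taken.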
Excellent find! Does this also explain why my fix did seem to work? Because I did not add an extra state to the state machine? |
It is not about the extra state but about the handling of the
and
The trap controller from this PR updates Your version only updates The latest fix ensures that a memory access is only canceled if there is a memory-access-related exception: a bus error or a misaligned address, both for loads and stores.
Awesome! Fingers crossed! 😉 I will run all the test suites I have to ensure I did not break anything else... |
This fix seems to work fine! Been pinging the machine for hours, without any issue. This tough nut has been cracked! 😉 👍 |
Yeay! Finally!! 😄 🎉 This was really a hard one! Thanks again for all your help and effort! :) |
I still have one thing to nag about... the debugger doesn't step past a jump.... |
Right, this also needs to be fixed. When talking about "branches" do you mean all branches (conditional + unconditional) or just some specific cases like taken conditional branches? I just tested single-stepping using a very simple loop (unconditional branch):
And in this example single-stepping works without problems. |
I also tested a program with a conditional branch:

    int cnt = 0;
    while (1) {
      neorv32_cpu_store_unsigned_word((uint32_t)(&NEORV32_GPIO.OUTPUT_LO), cnt);
      cnt++;
      if (cnt >= 12) {
        cnt = 0;
      }
    }

The loop looks like this in the assembly output:

    0x00000198 <+16>: li a5,0
    0x0000019a <+18>: nop
    0x0000019c <+20>: sw a5,0(a4)
    0x0000019e <+22>: addi a5,a5,1
    0x000001a0 <+24>: bne a5,a3,0x19c <blink_led_c+20>
    0x000001a4 <+28>: j 0x198 <blink_led_c+16>

And this is what GDB looks like when single-stepping through the code (I have removed the "uninteresting" console outputs):

    (gdb) stepi
    => 0x19c <blink_led_c+20>: sw a5,0(a4)
    (gdb) stepi
    => 0x19e <blink_led_c+22>: addi a5,a5,1
    (gdb) stepi
    => 0x1a0 <blink_led_c+24>: bne a5,a3,0x19c <blink_led_c+20>
    (gdb) stepi
    => 0x19c <blink_led_c+20>: sw a5,0(a4)
    (gdb) stepi
    => 0x19e <blink_led_c+22>: addi a5,a5,1
    (gdb) stepi
    => 0x1a0 <blink_led_c+24>: bne a5,a3,0x19c <blink_led_c+20>
    (gdb) stepi
    => 0x1a4 <blink_led_c+28>: j 0x198 <blink_led_c+16>
    (gdb) stepi
    => 0x198 <blink_led_c+16>: li a5,0

This also works without problems. |
Maybe it only occurs when you are inside of an interrupt? I tested it with this fragment:
Hangs on 302c4. |
Ah I see. When being in a trap environment interrupts are globally disabled - except for the debug-mode "entry interrupts" (like the single-stepping command). Maybe there is still a bug in the "interrupt enable logic". I will relocate my loop into a trap environment and test that. |
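The "interrupts are globally disabled inside a trap" behavior follows from how `mstatus` is updated on trap entry and on `mret` in the RISC-V privileged specification. Below is a minimal sketch of just the MIE/MPIE handling; it ignores MPP and every other field, and the helper names are assumptions made for illustration.

```c
#include <stdint.h>

#define MSTATUS_MIE  (1u << 3)  /* machine interrupt enable */
#define MSTATUS_MPIE (1u << 7)  /* previous MIE, saved on trap entry */

/* Trap entry: MPIE <- MIE, then MIE <- 0 (interrupts globally disabled).
 * Hypothetical helper; models only these two bits. */
static uint32_t mstatus_on_trap_entry(uint32_t mstatus)
{
    uint32_t saved = (mstatus & MSTATUS_MIE) ? MSTATUS_MPIE : 0u;
    mstatus &= ~(MSTATUS_MIE | MSTATUS_MPIE);
    return mstatus | saved;
}

/* mret: MIE <- MPIE, then MPIE <- 1 (interrupt enable is restored). */
static uint32_t mstatus_on_mret(uint32_t mstatus)
{
    uint32_t restored = (mstatus & MSTATUS_MPIE) ? MSTATUS_MIE : 0u;
    mstatus &= ~MSTATUS_MIE;
    return mstatus | restored | MSTATUS_MPIE;
}
```

Debug-mode entry requests such as single-stepping are not gated by MIE, which is why stepping inside a trap handler exercises a different path in the enable logic than ordinary interrupts do.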
🙈 😅
Sure, but I would prefer to understand the bug first. I have tried single-stepping the loop in a trap handler. You are right, there is something going wrong. My program does not hang, but the |
You might be surprised by my configuration:
|
Quite minimalistic 😉 But yeah, I just figured out this is a general problem and not related to the compressed extension. |
I am sorry to give you so much pain. |
Haha, no worries! Actually, I love debugging the CPU core and stepping through endless logic waveforms (even if I always tell the opposite). 😉 Right now, I am doing GDB-like single-stepping through my test program. It takes ages, but I have found the cause of this bug. I will do a PR - and let's hope this does not re-introduce the concurrent exception-IRQ problem... |
I have opened #329. Single stepping now works as expected. And as far as I can see the concurrent trap issue did not resurrect 😉. Let's continue this discussion over there. |
OK! Allow me some days. Bit busy at work. ;-) |
Sure, no hurry 😉 |
This PR aims to fix the collision of synchronous exceptions and asynchronous exceptions (interrupts) as described in #325 by @GideonZ. PR #326 might also fix that issue.