Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix sync. vs. async. exception collision #327

Merged
merged 9 commits into from
May 29, 2022

Conversation

stnolting
Copy link
Owner

@stnolting stnolting commented May 22, 2022

This PR aims to fix the collision of synchronous exceptions and asynchronous exceptions (interrupt) as described in #325 by @GideonZ. PR #326 might also fix that issue.

@stnolting stnolting added bug Something isn't working as expected HW Hardware-related labels May 22, 2022
@stnolting stnolting self-assigned this May 22, 2022
@stnolting stnolting changed the title 🐛 Fix sync async exc collision 🐛 Fix sync. vs. async. exception collision May 22, 2022
@GideonZ
Copy link
Collaborator

GideonZ commented May 25, 2022

Bad news.. My FreeRTOS application crashes on #327, but runs on #326. I haven't had much time yet to dive into the reason why this is so. FreeRTOS runs into an assert; telling me that the scheduler was not suspended while trying to unsuspend it. So there is a problem with the order in which things happen.

@GideonZ
Copy link
Collaborator

GideonZ commented May 27, 2022

Hi Stephan.. I looked into this a bit further. I tried to run my test program in VHDL simulation again, and interestingly I cannot see any difference between my fix (#326) and your fix (#327). And yet, I tried once more to simply build the FPGA after a checkout of #326 and with #327, and I can confirm that one runs into an assert and the other doesn't with the exact same .elf file. With your fix the assert happens kind of right away, and with the other one I can 'ping' my device over the network for hours without hangup. (Both timer and network Rx cause interrupts). I am still in doubt how much time I am willing to put into this to figure out what the differences really are. How do you think I can be of best help?

@stnolting
Copy link
Owner Author

stnolting commented May 28, 2022

Hey @GideonZ.

First of all, I have to say thank you again. You put so much time and extra work in finding this bug and I am very grateful for this. 👍

So there is a problem with the order in which things happen.

That's right. For some reason I messed up the priority of certain traps. I will have just fixed that. Furthermore, I did some tests where all "types" of traps (interrupt, exception, debug-mode entry) kick in at the same time to verify they are handled in the right way. This seems to work now. ✔️

Interesting. I tried this, and you are right, you cannot step over the ebreak. I tried this in instruction stepping mode the debugger comes back on the same address every time and you can step as many times as you like, it doesn't hang per se. I see in SignalTap that it never leaves debug mode; PC stays in the FFFFF800 range, too.

The new version of this PR now also shows this behavior (#326 (comment)). There is no change of stepping over a hardcoded EBREAK instruction. I thought this was a bug in the hardware, but now I think this is a "feature" of GDB. I found this conversation https://sourceware.org/pipermail/gdb/2021-January/049125.html and it seems like that GDB cannot handle hardcoded EBREAKS out of the box. You need to declare it as a breakpoint to be able to step over it.

I hope the new version is finally working as expected. It would be great if you could also check this using your setup.

@GideonZ
Copy link
Collaborator

GideonZ commented May 28, 2022

First of all, I have to say thank you again. You put so much time and extra work in finding this bug and I am very grateful for this. +1

You are very welcome, @stnolting... Don't forget that it is in my own interest to get this CPU to work. Due to component shortages and some nasty tricks that Intel pulled on me, I had to change FPGA on a product. I switched to Lattice ECP5 from Cyclone IV; which also means that the application had to be ported from Nios-II to something else. Risc-V was a logical choice. I looked at other cores, but 1) I am not very fond of Verilog and the resulting mixed language simulation I would get myself into when choosing one of those cores, and 2) your project looks solid. It stands out in many ways: documentation, ecosystem (although I am not using that part), and the whole paradigm of 'strictness' you are following.

That's right. For some reason I messed up the priority of certain traps. I will have just fixed that.
I hope the new version is finally working as expected. It would be great if you could also check this using your setup.

Unfortunately it doesn't. The behavior is different. And frankly, looking at the commit you made, I cannot see why it would behave differently, because the changes only seem to have effect on the debugger? I am not having any issues with the debugger; the program just fails to run. But as said, it behaves differently now. Before, FreeRTOS would tell me that a "ResumeAll" did not pair correctly with a "SuspendAll" call. Now, it's running into an unaligned data trap. Seems to be a corrupt stack frame that causes a data fetch to fail (mcause = 0x04).

Allow me to some time to dig in deeper, so we can properly bring this to a good conclusion.

@stnolting
Copy link
Owner Author

stnolting commented May 28, 2022

Unfortunately it doesn't.

Oh damn! 😅

And frankly, looking at the commit you made, I cannot see why it would behave differently, because the changes only seem to have effect on the debugger?

I was expecting there was a problem with the debugger interfering with some exceptions, but it seems like this is not the case. Anyway, I had to fix that because that was also violating the debug-mode-related trap priority given by the RISC-V debug specs.

Now, it's running into an unaligned data trap. Seems to be a corrupt stack frame that causes a data fetch to fail (mcause = 0x04).

I do not think this is caused by the application itself, but by some exceptions triggering right after each other (which they might not do) corrupting the trap stack.

I tried many different trap combinations and they all seem to work, so there must be something special about your setup - or maybe I just forgot something obvious.

By the way, what kind of traps is your setup using at all?

  • the ecall instruction
  • the machine external interrupt MEI
  • and...?

@stnolting
Copy link
Owner Author

Now, it's running into an unaligned data trap. Seems to be a corrupt stack frame that causes a data fetch to fail (mcause = 0x04).

Btw, mcause = 0x4 is a "load address misaligned" exception, so a data access, not an instruction fetch.

@GideonZ
Copy link
Collaborator

GideonZ commented May 28, 2022

I tried many different trap combinations and they all seem to work, so there must be something special about your setup - or maybe I just forgot something obvious.

I am not excluding the possibility that there is a problem with my application / setup, that's why I am also looking for the root cause and don't leave it all up to you. :-)

By the way, what kind of traps is your setup using at all?

  • the ecall instruction
  • the machine external interrupt MEI
  • and...?

That is it. There is one interrupt, and this level-triggered interrupt is connected to the mext_irq_i signal. The user interrupt handler dispatches the interrupt further to the handlers per I/O device. I have my own timer that triggers 200 times per second, and other IRQs include network Rx, USB transmission completion and some other stuff that is not firing before the crash. Even with the Network Rx and USB interrupts disabled, the problem still occurs. I even eliminated the "vTaskSwitchContext" call in the interrupt altogether, which basically turns the interrupt in a bit of code that does almost nothing; it doesn't matter.

One thing that is probably different from your setup is that I am running from DDR2 memory. The latency of the memory is not great, so there are usually quite a few clocks between instructions.

Interestingly, I tried printing a dot to the console in the interrupt. One of the debug lines in the application that usually says "Called xxx for 1 objects" now said "Called xxx for .6261982 objects". Note the dot just before printing the %d value? Interrupt comes, register value is broken and a completely different value prints.

Btw, mcause = 0x4 is a "load address misaligned" exception, so a data access, not an instruction fetch.
Yes, as I mentioned, it is a failing data fetch. It is a load relative to an address in register a5, and it happens right after returning from IRQ. I have a SignalTap triggered on the PC reaching the endless unhandled trap loop.

I am currently suspecting an interrupt occurring during a branch. Going to build a test case for that later tonight or tomorrow.

Did you notice that in instruction stepping mode you cannot take a jump? Just like ebreak, the debugger stays on the jump.

@GideonZ
Copy link
Collaborator

GideonZ commented May 28, 2022

Okay... Update.. So far, I didn't manage to get a very very simple FreeRTOS demo to crash, but now it does. Consists of two tasks, printing an incrementing number onto the console. So I'll load up this demo into the VHDL simulator and give it a good look. :-) I tried it twice. Once it came with an assert from FreeRTOS, the second time it ended up in the endless unhandled trap loop. I suspect both cases due to corrupted registers.

Output with a high IRQ rate (2 kHz), task 2 has a higher priority, to task 1 never gets to print anything:

.Tas.k 2. 59.
Ta.Task 2 60
.Task. 2 .61
sk .Tas.k 2. 62.
1 .Tas.k 2. 63.
10.Tas.k 2. 64.

.Tas.k 2. 65.
.Task .2 6.6
.Task. 2 .67

@stnolting
Copy link
Owner Author

The user interrupt handler dispatches [...]

Does that mean you are using machine and user mode for your application?

One thing that is probably different from your setup is that I am running from DDR2 memory. The latency of the memory is not great, so there are usually quite a few clocks between instructions.

That should not be a problem here. Btw, do you use the instruction cache?

I am currently suspecting an interrupt occurring during a branch

in my test program even that works without problems. But I have not tested special "branch conditions" for example using mret to switch to user mode.

@stnolting
Copy link
Owner Author

I think I can confirm the "data corruption issue"!

I just saw the corruption of one variable in my test program: a global variable is fetched from memory for evaluation but never reaches the register file. Seems like the memory access is incorrectly aborted if there is an interrupt pending. I will check that.

abort memory access only if there is memory-access-related trap pending (access error or alignment error, both for load and store)!
@stnolting
Copy link
Owner Author

There is indeed a bug in the memory abort logic! The modifications of the trap controller also requires to modify the abort condition(s). I just fixed that and now my variables are happy again. 😉 Hopefully, this also fixes your broken variable problem.

@GideonZ
Copy link
Collaborator

GideonZ commented May 28, 2022

There is indeed a bug in the memory abort logic! The modifications of the trap controller also requires to modify the abort condition(s). I just fixed that and now my variables are happy again. wink Hopefully, this also fixes your broken variable problem.

Excellent find! Does this also explain why my fix did seem to work? Because I did not add an extra state to the state machine?
One of the reasons I'd like to pursue the fix of your fix, is that I don't like to leave things unexplained. I'll build an FPGA with your latest commit tomorrow and let you know the result.

@stnolting
Copy link
Owner Author

stnolting commented May 28, 2022

Does this also explain why my fix did seem to work? Because I did not add an extra state to the state machine?

It is not about the extra state but about the handling of the trap_ctrl.cause signal. Before the latest fix in this PR, the memory access got canceled when...

  • trap_ctrl.env_start is set; so the trap environment (triggered by an exception or an interrupt) is waiting to get started

and

  • trap_ctrl.cause defines a cause that is NOT an interrupt and NOT a debug-mode-related trap

The trap controller from this PR updates trap_ctrl.cause ALL the time until the trap environment is finally started by setting trap_ctrl.env_start_ack. So trap_ctrl.cause can still contain outdated data from a previous trap.

Your version only updates trap_ctrl.cause exactly ONCE when the trap environment is started by setting trap_ctrl.env_start_ack. There is no change of having invalid data in here.

The latest fix ensures that a memory access is only canceled if there is memory-access-related exception: a bus error or a misaligned address - both for loads and stores.

I'll build an FPGA with your latest commit tomorrow and let you know the result.

Awesome! Fingers crossed! 😉

I will run all the test suites I have to ensure I did not break anything else...

@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

This fix seems to work fine! Been pinging the machine for hours, without any issue. This tough nut has been cracked! 😉 👍

@stnolting
Copy link
Owner Author

Yeay! Finally!! 😄 🎉 This was really a hard one!

Thanks again for all your help and effort! :)

@stnolting stnolting marked this pull request as ready for review May 29, 2022 11:10
@stnolting stnolting merged commit b1bbe94 into main May 29, 2022
@stnolting stnolting deleted the fix_sync_async_exc_collision branch May 29, 2022 11:23
@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

I still have one thing to nag about... the debugger doesn't step past a jump....
And I can confirm that it got broken with this fix.
To reproduce: start up a debugger, do instruction stepping, and see that it keeps hanging on a jump. It doesn't go past that instruction.

@stnolting
Copy link
Owner Author

stnolting commented May 29, 2022

Right, this also needs to be fixed. When talking about "branches" do you mean all branches (conditional + unconditional) or just some specific cases like taken conditional branches?

I just tested single-stepping using a very simple loop (unconditional branch):

   0x00000198 <+16>:    andi    a4,a5,255
   0x0000019c <+20>:    sw      a4,0(a3)
=> 0x0000019e <+22>:    addi    a5,a5,1
   0x000001a0 <+24>:    j       0x198 <blink_led_c+16>

And in this example single-stepping works without problems.

@stnolting
Copy link
Owner Author

stnolting commented May 29, 2022

I also tested a program with a conditional branch:

  int cnt = 0;

  while (1) {
    neorv32_cpu_store_unsigned_word((uint32_t)(&NEORV32_GPIO.OUTPUT_LO), cnt);
    cnt++;
    if (cnt >= 12) {
      cnt = 0;
    }
  }

The loop looks like this in the assembly output:

   0x00000198 <+16>:    li      a5,0
   0x0000019a <+18>:    nop
   0x0000019c <+20>:    sw      a5,0(a4)
   0x0000019e <+22>:    addi    a5,a5,1
   0x000001a0 <+24>:    bne     a5,a3,0x19c <blink_led_c+20>
   0x000001a4 <+28>:    j       0x198 <blink_led_c+16>

And this is what GDB looks like when single-stepping through the code (I have removed the "uninteresting" console outputs):

(gdb) stepi
=> 0x19c <blink_led_c+20>:      sw      a5,0(a4)
(gdb) stepi
=> 0x19e <blink_led_c+22>:      addi    a5,a5,1
(gdb) stepi
=> 0x1a0 <blink_led_c+24>:      bne     a5,a3,0x19c <blink_led_c+20>
(gdb) stepi
=> 0x19c <blink_led_c+20>:      sw      a5,0(a4)
(gdb) stepi
=> 0x19e <blink_led_c+22>:      addi    a5,a5,1
(gdb) stepi
=> 0x1a0 <blink_led_c+24>:      bne     a5,a3,0x19c <blink_led_c+20>
(gdb) stepi
=> 0x1a4 <blink_led_c+28>:      j       0x198 <blink_led_c+16>
(gdb) stepi
=> 0x198 <blink_led_c+16>:      li      a5,0

This also works without problems.

@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

Maybe it only occurs when you are inside of an interrupt? I tested it with this fragment:

000302b0 <test_if_environment_call>:
   302b0:	00b00293          	li	t0,11
   302b4:	00551a63          	bne	a0,t0,302c8 <is_exception>
   302b8:	00108117          	auipc	sp,0x108
   302bc:	59c12103          	lw	sp,1436(sp) # 138854 <xISRStackTop>
   302c0:	715030ef          	jal	ra,341d4 <vTaskSwitchContext>
   302c4:	01c0006f          	j	302e0 <processed_source>

Hangs on 302c4.

@stnolting
Copy link
Owner Author

Ah I see. When being in a trap environment interrupts are globally disabled - except for the debug-mode "entry interrupts" (like the single-stepping command). Maybe there is still a bug in the "interrupt enable logic". I will relocate my loop into a trap environment and test that.

@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

I tested it with #326... and I hate to say this, but there it does work... :)
What'd you say if I remove the pain in your eye by rewriting #326 into an if/else construction?

@stnolting
Copy link
Owner Author

stnolting commented May 29, 2022

I tested it with #326... and I hate to say this, but there it does work... :)

🙈 😅

What'd you say if I remove the pain in your eye by rewriting #326 into an if/else construction?

Sure, but I would prefer to understand the bug first.

I have tried single-stepping the loop in a trap handler. You are, right there is something going wrong. My program does not hang, but the bne instruction from my example above never shows up in the debugger - so it does not halt there. Interestingly, this only happens when the C ISA extension is enabled. When using only 32-bit instructions everything runs smooth.

@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

You might be surprised by my configuration:

    i_cpu: entity neorv32.neorv32_top
    generic map(
        CLOCK_FREQUENCY              => g_frequency,
        HW_THREAD_ID                 => 0,
        INT_BOOTLOADER_EN            => true,
        ON_CHIP_DEBUGGER_EN          => g_jtag_debug,
        CPU_EXTENSION_RISCV_B        => false,
        CPU_EXTENSION_RISCV_C        => false,
        CPU_EXTENSION_RISCV_E        => false,
        CPU_EXTENSION_RISCV_M        => false,
        CPU_EXTENSION_RISCV_U        => false,
        CPU_EXTENSION_RISCV_Zfinx    => false,
        CPU_EXTENSION_RISCV_Zicsr    => true, -- for Interrupts
        CPU_EXTENSION_RISCV_Zicntr   => false,
        CPU_EXTENSION_RISCV_Zihpm    => false,
        CPU_EXTENSION_RISCV_Zifencei => g_jtag_debug,
        CPU_EXTENSION_RISCV_Zmmul    => false,
        CPU_EXTENSION_RISCV_Zxcfu    => false,
        FAST_MUL_EN                  => false,
        FAST_SHIFT_EN                => false,
        CPU_CNT_WIDTH                => 32,
        CPU_IPB_ENTRIES              => 2,
        PMP_NUM_REGIONS              => 0,
        HPM_NUM_CNTS                 => 0,
        MEM_INT_IMEM_EN              => false,
        MEM_INT_DMEM_EN              => true,   -- implement processor-internal data memory
        MEM_INT_DMEM_SIZE            => 2*1024, -- size of processor-internal data memory in bytes
        ICACHE_EN                    => true,
        ICACHE_NUM_BLOCKS            => 256,
        ICACHE_BLOCK_SIZE            => 8,
        ICACHE_ASSOCIATIVITY         => 1,
        MEM_EXT_EN                   => true,
        MEM_EXT_TIMEOUT              => 255,
        MEM_EXT_PIPE_MODE            => true,
        MEM_EXT_BIG_ENDIAN           => false,
        MEM_EXT_ASYNC_RX             => false,
        SLINK_NUM_TX                 => 0,
        SLINK_NUM_RX                 => 0,
        SLINK_TX_FIFO                => 0,
        SLINK_RX_FIFO                => 0,
        XIRQ_NUM_CH                  => 0,
        IO_GPIO_EN                   => false,
        IO_MTIME_EN                  => false,
        IO_UART0_EN                  => false,
        IO_UART1_EN                  => false,
        IO_SPI_EN                    => false,
        IO_TWI_EN                    => false,
        IO_PWM_NUM_CH                => 0,
        IO_WDT_EN                    => false,
        IO_TRNG_EN                   => false,
        IO_CFS_EN                    => false,
        IO_NEOLED_EN                 => false,
        IO_GPTMR_EN                  => false,
        IO_XIP_EN                    => false
    )
    port map (

@stnolting
Copy link
Owner Author

You might be surprised by my configuration:

Quite minimalistic 😉

But yeah, I just figured out this is a general problem and not related to the compressed extension.

@GideonZ
Copy link
Collaborator

GideonZ commented May 29, 2022

But yeah, I just figured out this is a general problem and not related to the compressed extension.

I am sorry to give you so much pain.

@stnolting
Copy link
Owner Author

stnolting commented May 29, 2022

I am sorry to give you so much pain.

Haha, no worries! Actually, I love debugging the CPU core and stepping through endless logic waveforms (even if I always tell the opposite). 😉

Right now, I doing GDB-like single-stepping through my test program. It takes ages, but I have found the cause of this bug. I will do a PR - and let's hope this does not re-introduce the concurrent exception-IRQ problem...

@stnolting
Copy link
Owner Author

stnolting commented May 29, 2022

I have opened #329. Single stepping now works as expected. And as far as I can see the concurrent trap issue did not resurrect 😉. Let's continue this discussion over there.

@GideonZ
Copy link
Collaborator

GideonZ commented May 30, 2022

Let's continue this discussion over there.

OK! Allow me some days. Bit busy at work. ;-)

@stnolting
Copy link
Owner Author

Sure, no hurries 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected HW Hardware-related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instruction 'ecall' becomes 'nop' when external IRQ is active at the same time
2 participants