Possible issue with FIRQ pending interrupt clearing #818

robhancocksed · 2024-02-20T17:13:12Z

Describe the bug
I am working on some changes to add support for the NEORV32 XIRQ interrupt controller to Zephyr, and have a setup that is mostly working. (I am planning to submit it upstream once it's debugged.) However, I'm running into an issue where XIRQ interrupts seem to stop being processed, and have narrowed it down what I think is an issue with how the CSRC instruction is being handled. I added some debug probes into the synthesized design in Vivado to try to see what is going on, and I suspect the CPU is not handling the FIRQ pending interrupt clearing properly.

The PC value shown corresponds to the following instructions which are part of the XIRQ interrupt handler:

35774: 010007b7 lui a5,0x1000
35778: 3447b073 csrc mip,a5

which should be clearing bit 24 in the MIP register, corresponding to FIRQ 8 for XIRQ, as far as I can tell. However, the logic capture seems to be showing that this results in ALL pending FIRQs being cleared out. So I suspect what is happening is that when another FIRQ handler runs, such as for the UART, it can sometimes end up clearing an XIRQ interrupt before it has a chance to be processed, resulting in an XIRQ interrupt hang.

In the Vivado ILA capture above, the neorv32_SystemTop_ax_0/U0/neorv32_top_inst/core_complex.neorv32_cpu_inst/neorv32_cpu_control_inst/csr[mip_firq_nclr][8]_i_1_n_0 signal corresponds (I think) to the state of the mip_firq_nclr value for FIRQ 8 which this instruction should be clearing, and it does, but you can see that mip_firq_nclr is also being pulsed low for the other used FIRQs (0, 2, 3, 12, 15) as well, and presumably shouldn't be.

Environment:

Zephyr RTOS (main branch with extra patches)
Zephyr SDK 0.16.5 with RISC-V GCC 12.2.0

Hardware:

NEORV32 version: Git commit beab81e from Feb. 16
Using top-level AXI Lite instantiation
Extensions A, B, C, M, Zicntr enabled, fast multiply, fast shift
GPTMR, MTIME, TRNG, UART0, WDT, on-chip debug enabled
XIRQ enabled with 8 channels
Building with AMD Vivado 2023.2 for a Kintex-7 FPGA

The text was updated successfully, but these errors were encountered:

robhancocksed · 2024-02-20T18:15:17Z

I suspect this code, in neorv32_cpu_control.vhd inside csr_write_data, may not be doing what was intended:


    -- tiny ALU to compute CSR write data --
    case execute_engine.ir(instr_funct3_msb_c-1 downto instr_funct3_lsb_c) is
      when "10"   => csr.wdata <= csr.rdata or tmp_v; -- set
      when "11"   => csr.wdata <= csr.rdata and (not tmp_v); -- clear
      when others => csr.wdata <= tmp_v; -- write
    end case;

It seems csr.rdata is set to 0 on cycles when no CSR read operation is in progress, so it's possible that it does not have the right data when this is executing. The symptoms would match a value of 0 being used rather than the actual MIP register content, causing the entire register to be zeroed.

robhancocksed · 2024-02-20T19:20:37Z

It looks like doing a csr_write for the MIP register with the bit for the FIRQ being cleared as 0 and all other bits 1, rather than using csr_clear, can work around the problem in software, since writing 1 to bits in MIP has no effect. With this change I am only seeing the mip_firq_nclr bit for the desired interrupt going to 0 rather than all of them.

stnolting · 2024-02-20T20:25:12Z

Hey @robhancocksed

I am working on some changes to add support for the NEORV32 XIRQ interrupt controller to Zephyr, and have a setup that is mostly working. (I am planning to submit it upstream once it's debugged.)

That sounds great - thank you very much!

which should be clearing bit 24 in the MIP register, corresponding to FIRQ 8 for XIRQ, as far as I can tell. However, the logic capture seems to be showing that this results in ALL pending FIRQs being cleared out.

I will have a closer look at this, but I think you might have found a severe bug here 🙈

fixing #818

stnolting · 2024-02-20T22:42:40Z

This is really a mean bug... Thanks for identifying this!

stnolting · 2024-02-21T16:16:24Z

This issue should be fixed now.

Thanks again for finding this bug!

robhancocksed · 2024-02-21T23:28:50Z

@stnolting It looks like the problem is not solved unfortunately - if I remove my previous workaround to use csr_write rather than csr_clear, and update the core, the same XIRQ hang eventually occurred, whereas with the workaround the system survived a test run for a significant period of time. And the change would actually break the workaround I was using as well, since writing 1 bits into MIP would now actually trigger those interrupts..

stnolting · 2024-02-22T17:32:25Z

I've just tested the read/write/clear/set capabilities of mip's FIRQ bits and that seems to work fine.

Maybe this is just a software issue?

CSR write

neorv32_cpu_csr_write(CSR_MIP, data):

mip <= data;

This will override all writable bits of mip with data potentially clearing all pending FIRQs (if the according bits of data are zero).

CSR clear

neorv32_cpu_csr_clr(CSR_MIP, data):

mip <= mip & (~data);

This will clear bit i of mip if the according bit of data is 1. This can be used to clear specific pending FIRQs.

CSR set

neorv32_cpu_csr_set(CSR_MIP, data):

mip <= mip | data;

This will set bit i of mip if the according bit of data is 1.

⚠️ Non-atomic CSR access

uint32_t tmp;
tmp = neorv32_cpu_csr_read(CSR_MIP);
tmp &= ~(1 << some_firq_bit); // clear pending FIRQ
neorv32_cpu_csr_write(CSR_MIP, tmp);

Performing such a "non-atomic" access here is no good idea as mip might be updated with new pending IRQs "in background" (by the hardware). So the final write access here (neorv32_cpu_csr_write) could potentially override these background changes.

robhancocksed · 2024-02-22T17:42:03Z

The code was using the csr_clear function in Zephyr which generates the assembly:

35774: 010007b7 lui a5,0x1000
35778: 3447b073 csrc mip,a5

which should be correct in terms of using an atomic clear instruction. It seems like the problem might be fairly timing dependent however, so a basic sequential test might not pick up whatever is happening?

stnolting · 2024-02-22T17:59:57Z

35778: 3447b073 csrc mip,a5

This seems fine as this is an atomic CSR access.

Even if the according FIRQ source is firing right in that moment where mip is updated by software the FIRQ request should still be registered:

neorv32/rtl/core/neorv32_cpu_control.vhd

Lines 1455 to 1457 in 4155b72

    
           if (csr.mip_firq_we = '1') then -- write access to MIP(.FIRQ) CSR 
        
             trap_ctrl.irq_pnd(irq_firq_0_c+i) <= firq_i(i) or csr.mip_firq_wdata(i); -- keep buffering incoming FIRQs 
        
           else

What is the actual "symptom" of your use case? Does the CPU miss an incoming FIRQ trigger?

stnolting · 2024-02-22T18:44:00Z

I made some more tests and now I can confirm that a FIRQ trigger gets lost when it arrives exactly 1 or 2 cycles right before doing any mip write/set/clear access. This is because a CSR clear/set instruction has 2 cycles latency (read CSR -> modify the data -> write back CSR).

I'll try to come up with another (better) fix 😅

I'll also check all further CSRw that have implicit updates (by the hardware).

robhancocksed · 2024-02-22T18:52:07Z

@stnolting Basically yes, it appears that an XIRQ interrupt gets raised very close to the moment that a CSR clear to clear a different interrupt is happening, and that write incorrectly clears the XIRQ interrupt as well. That means the XIRQ ISR never runs, and so because the first interrupt source is never cleared in XIRQ, it never raises any more interrupts. I think the latest issue you discovered likely explains what is going on there.

stnolting · 2024-02-23T20:30:17Z

This was harder to fix then I have expected... 🙈 But I think mip works now as expected. Btw, mip is now clear-only again - so writing ones has no effect.

-> #829

robhancocksed · 2024-02-26T16:13:59Z

My stress test ran over the weekend with no errors without the software workaround in place, so I think it's safe to say this is now resolved.

stnolting · 2024-02-26T16:20:16Z

Thanks for the feedback! I'm very glad to hear that 😅

stnolting added a commit that referenced this issue Feb 20, 2024

🐛 [cpu] fix mip.firq CSR write access

8629aad

fixing #818

stnolting mentioned this issue Feb 20, 2024

🐛 Fix write access to mip.firq CSR bits #821

Merged

stnolting linked a pull request Feb 20, 2024 that will close this issue

🐛 Fix write access to mip.firq CSR bits #821

Merged

stnolting self-assigned this Feb 20, 2024

stnolting added bug Something isn't working as expected HW Hardware-related labels Feb 20, 2024

stnolting closed this as completed in #821 Feb 21, 2024

stnolting reopened this Feb 22, 2024

stnolting mentioned this issue Feb 23, 2024

🐛 fix atomic write/clear/set accesses of clear-only CSR bits #829

Merged

stnolting closed this as completed in #829 Feb 24, 2024

stnolting mentioned this issue May 4, 2024

Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash #893

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible issue with FIRQ pending interrupt clearing #818

Possible issue with FIRQ pending interrupt clearing #818

robhancocksed commented Feb 20, 2024

robhancocksed commented Feb 20, 2024

robhancocksed commented Feb 20, 2024

stnolting commented Feb 20, 2024

stnolting commented Feb 20, 2024

stnolting commented Feb 21, 2024

robhancocksed commented Feb 21, 2024

stnolting commented Feb 22, 2024 •

edited

Loading

robhancocksed commented Feb 22, 2024

stnolting commented Feb 22, 2024

stnolting commented Feb 22, 2024

robhancocksed commented Feb 22, 2024

stnolting commented Feb 23, 2024 •

edited

Loading

robhancocksed commented Feb 26, 2024

stnolting commented Feb 26, 2024

Possible issue with FIRQ pending interrupt clearing #818

Possible issue with FIRQ pending interrupt clearing #818

Comments

robhancocksed commented Feb 20, 2024

robhancocksed commented Feb 20, 2024

robhancocksed commented Feb 20, 2024

stnolting commented Feb 20, 2024

stnolting commented Feb 20, 2024

stnolting commented Feb 21, 2024

robhancocksed commented Feb 21, 2024

stnolting commented Feb 22, 2024 • edited Loading

CSR write

CSR clear

CSR set

⚠️ Non-atomic CSR access

robhancocksed commented Feb 22, 2024

stnolting commented Feb 22, 2024

stnolting commented Feb 22, 2024

robhancocksed commented Feb 22, 2024

stnolting commented Feb 23, 2024 • edited Loading

robhancocksed commented Feb 26, 2024

stnolting commented Feb 26, 2024

stnolting commented Feb 22, 2024 •

edited

Loading

stnolting commented Feb 23, 2024 •

edited

Loading