✨ [CFU] add support for custom R5-type instructions #452

Merged
merged 14 commits into from
Dec 10, 2022
3 changes: 2 additions & 1 deletion CHANGELOG.md
Expand Up @@ -32,8 +32,9 @@ mimpid = 0x01040312 => Version 01.04.03.12 => v1.4.3.12

| Date (*dd.mm.yyyy*) | Version | Comment |
|:-------------------:|:-------:|:--------|
| 09.12.2022 | 1.7.8.4 | :sparkles: new option to add custom **R5-type** (4 source registers, 1 destination register) instructions to **Custom Functions Unit (CFU)**; [#452](https://github.com/stnolting/neorv32/pull/452) |
| 08.12.2022 | 1.7.8.3 | :bug: fix interrupt behavior when in user-mode; minor core rtl fixes; do not check registers specifiers in CFU instructions (i.e. using registers above `x15` when `E` ISA extension is enabled); [#450](https://github.com/stnolting/neorv32/pull/450) |
| 03.12.2022 | 1.7.8.2 | :sparkles: new option to add custom R4-type RISC-V instructions to **CFU**; rework CFU hardware module, intrinsic library and example program; [#449](https://github.com/stnolting/neorv32/pull/449) |
| 03.12.2022 | 1.7.8.2 | :sparkles: new option to add custom **R4-type** RISC-V instructions to **Custom Functions Unit (CFU)**; rework CFU hardware module, intrinsic library and example program; [#449](https://github.com/stnolting/neorv32/pull/449) |
| 01.12.2022 | 1.7.8.1 | package cleanup; [#447](https://github.com/stnolting/neorv32/pull/447) |
| 28.11.2022 | [**:rocket:1.7.8**](https://github.com/stnolting/neorv32/releases/tag/v1.7.8) | **New release** |
| 14.11.2022 | 1.7.7.9 | minor rtl edits and code optimizations; [#442](https://github.com/stnolting/neorv32/pull/442) |
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -140,7 +140,7 @@ and *Privileged Architecture Specification* ([pdf](https://github.com/stnolting/
* implements **all** standard RISC-V exceptions and interrupts (including MTI, MEI & MSI)
* 16 fast interrupt request channels as NEORV32-specific extension
* custom functions unit ([CFU](https://stnolting.github.io/neorv32/#_custom_functions_unit_cfu) as `Zxcfu` ISA extension)
for up to 1024 R3-type and up to 8 R4-type _custom RISC-V instructions_
for _custom RISC-V instructions_ (R3-type, R4-type and R5-type)
* _intrinsic_ library for the `Zfinx` extension as it is not yet supported by upstream GCC

**Memory**
8 changes: 3 additions & 5 deletions docs/datasheet/cpu.adoc
Expand Up @@ -659,13 +659,11 @@ generic, this ISA extensions adds the <<_custom_functions_unit_cfu>> to the CPU
allows to add **custom RISC-V instructions** to the processor core.

The CFU is implemented as an additional ALU co-processor and is integrated right into the CPU's pipeline, providing minimal
data transfer latency as it has direct access to the core's register file. The CFU supports **RISC-V R3-type** instructions
as well as **RISC-V R4-type** instructions. Up to 1024 custom R3-type instructions and up to 8 custom R4-type instruction
can be implemented within the CFU. These instructions are mapped to an opcode space that has been explicitly reserved by
the RISC-V spec for custom extensions.
data transfer latency as it has direct access to the core's register file. The CFU utilizes the RISC-V `custom` opcodes
that have been explicitly reserved by the RISC-V spec for custom extensions.

Software can utilize the custom instructions by using _intrinsics_, which are basically inline assembly functions that
behave like regular C functions and thqat evaluate to a single custom instruction word (not calling overhead at all).
behave like regular C functions but that evaluate to a single custom instruction word (no calling overhead at all).

[TIP]
For more detailed information regarding the CFU, its hardware and the corresponding software interface
151 changes: 97 additions & 54 deletions docs/datasheet/cpu_cfu.adoc
Expand Up @@ -10,41 +10,42 @@ The CFU is intended for operations that are inefficient in terms of performance,
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
use-cases might include:

* **AI:** sub-word / vector / SIMD operations like adding all four bytes of a 32-bit data word
* **AI:** sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to gray-code; multiply-add operations
* **Image processing:** look-up-tables for color space transformations
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32

[NOTE]
The CFU is not intended for complex and CPU-independent functional units that implement complete accelerators
(like block-based AES encryption). These kind of accelerators should be better implemented within the
The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators
(like block-based AES encryption). These kinds of accelerators should be implemented as memory-mapped
<<_custom_functions_subsystem_cfs>>.
A comparison of all chip-internal hardware extension options is provided in the user guide section
A comparison of all NEORV32-specific chip-internal hardware extension options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].


:sectnums:
==== CFU Instruction Formats

The custom instructions executed by the CFU utilize a specific instruction space in the total `rv32` 32-bit instruction
The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction
space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("_Guaranteed Non-Standard
Encoding Space_"). The NEORV32 CFU uses the `custom-0` and `custom-1` opcodes to identify the custom instructions implemented
by the CFU and to differentiate between two instruction formats (note: these formats are common RISC-V instruction format types).
The custom-0 opcode is used to implement custom **R3-type** instructions while the custom-1 opcode is used to
implement custom **R4-type** instructions. The according binary encoding of these opcodes is shown below:
Encoding Space_"). The NEORV32 CFU uses the `custom-x` opcodes to identify the instructions implemented
by the CFU and to differentiate between the different instruction formats.
The corresponding binary encodings of these opcodes are shown below:

* `custom-0`: `0001011` (R3-type instructions)
* `custom-1`: `0101011` (R4-type instructions)
* `custom-0`: `0001011` (R3-type instructions, RISC-V standard)
* `custom-1`: `0101011` (R4-type instructions, RISC-V standard)
* `custom-2`: `1011011` (R5-type instruction A, NEORV32-specific)
* `custom-3`: `1111011` (R5-type instruction B, NEORV32-specific)
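
The opcode-to-format mapping listed above can be sketched in C as follows. This is only an illustrative host-side helper, not part of the NEORV32 sources: the names `cfu_format_t` and `cfu_classify` are hypothetical, and the only facts taken from the text are the four 7-bit opcode values.

```c
#include <stdint.h>

/* Instruction formats selected by the custom-x opcodes listed above.
 * (Hypothetical type, for illustration only.) */
typedef enum {
    CFU_FMT_NONE = 0, /* not a CFU instruction */
    CFU_FMT_R3,       /* custom-0 */
    CFU_FMT_R4,       /* custom-1 */
    CFU_FMT_R5_A,     /* custom-2 */
    CFU_FMT_R5_B      /* custom-3 */
} cfu_format_t;

/* Hypothetical helper: map an instruction word to its CFU format by
 * looking at the 7-bit opcode field (bits 6:0) only. All other bits
 * are ignored, mirroring the "opcode only" check described in the text. */
static cfu_format_t cfu_classify(uint32_t instr_word)
{
    switch (instr_word & 0x7Fu) {
        case 0x0Bu: return CFU_FMT_R3;   /* 0001011 */
        case 0x2Bu: return CFU_FMT_R4;   /* 0101011 */
        case 0x5Bu: return CFU_FMT_R5_A; /* 1011011 */
        case 0x7Bu: return CFU_FMT_R5_B; /* 1111011 */
        default:    return CFU_FMT_NONE;
    }
}
```

Because only bits 6:0 are inspected, any value in the upper 25 bits classifies the same way, which matches the statement that the remaining bit fields are not checked.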

.CFU Instructions - Exceptions
[IMPORTANT]
The CPU control logic will only analyze the opcode of the custom instructions to check if the
instruction word is valid. All remaining bit-fields are **not checked** by the CPU instruction decoding logic.
This also means that the MSB of the register fields is not evaluated even if the `E` ISA extension is enabled
(for standard RISC-V instructions this would cause an exception).
The CPU control logic only analyzes the opcode of the custom instructions to check if the _entire_
instruction word is valid. All remaining bit-fields are **not checked** at all.
This also means that the MSBs of the register fields are **not checked** even if the `E` ISA extension
is enabled (for standard RISC-V instructions this would cause an exception).
Hence, a custom CFU instruction can never raise an illegal instruction exception. If the CFU is not
implemented at all (`Zxcfu` ISA extension is not enabled) any instruction with opcode custom-0 or custom-1
implemented at all (`Zxcfu` ISA extension is not enabled) any instruction with `custom-x` opcode
will raise an illegal instruction exception.


Expand All @@ -61,11 +62,11 @@ Example operation: `rd <= rs1 xnor rs2`
.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]

* `funct7`: 7-bit immediate
* `rs2`: address of second source register
* `rs1`: address of first source register
* `funct3`: 3-bit immediate
* `rd`: address of destination register
* `funct7`: 7-bit immediate (further operand data or function select)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0001011` (RISC-V "custom-0" opcode)

.RISC-V compatibility
Expand All @@ -74,15 +75,15 @@ The CFU R3-type instruction format is compliant to the RISC-V ISA specification.

.Instruction encoding space
[NOTE]
By using the `funct7` and `funct3` entirely for selecting the actual operation a total of 1024 custom R3-type instructions
can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values).
By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation a total of 1024 custom R3-type
instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values).
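
To make the R3-type bit layout concrete, here is a minimal sketch that assembles an instruction word from its fields. It assumes the standard RISC-V R-type field positions (`funct7` in bits 31:25, `rs2` in 24:20, `rs1` in 19:15, `funct3` in 14:12, `rd` in 11:7); the function name `cfu_encode_r3` is hypothetical and not part of the NEORV32 software framework.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM0 0x0Bu /* 0001011 */

/* Hypothetical helper: build a CFU R3-type instruction word.
 * Register arguments are register addresses (0..31), not register values. */
static uint32_t cfu_encode_r3(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                              uint32_t funct3, uint32_t rd)
{
    assert(funct7 < 128u && funct3 < 8u);        /* 7-bit and 3-bit immediates */
    assert(rs1 < 32u && rs2 < 32u && rd < 32u);  /* 5-bit register addresses */
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM0;
}
```

The 10 bits contributed by `funct7` and `funct3` are what yield the 1024 distinct encodings mentioned in the note above.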


:sectnums:
==== CFU R4-Type Instructions

The R4-type CFU instructions operate on three source registers and return the processing result to the destination register.
The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediates can also be used to
The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to
pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual
functionality is entirely user-defined.

Expand All @@ -91,11 +92,11 @@ Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`
.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]

* `rs3`: address of third source register
* `rs2`: address of second source register
* `rs1`: address of first source register
* `funct3`: 3-bit immediate
* `rd`: address of destination register
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0101011` (RISC-V "custom-1" opcode)

.RISC-V compatibility
Expand All @@ -104,52 +105,86 @@ The CFU R4-type instruction format is compliant to the RISC-V ISA specification.

.Unused instruction bits
[NOTE]
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bit are ignored
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits are ignored
by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve compatibility with
future implementations.

.Instruction encoding space
[NOTE]
By using the `funct3` entirely for selecting the actual operation a total of 8 custom R4-type instructions
By using the `funct3` bit field entirely for selecting the actual operation a total of 8 custom R4-type instructions
can be implemented (3-bit -> 8 different values).
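
The R4-type layout can be sketched the same way. The example below assumes the standard RISC-V R4-type field positions (`rs3` in bits 31:27, bits 26:25 left at zero as recommended above, then `rs2`, `rs1`, `funct3`, `rd` and the opcode); `cfu_encode_r4` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM1 0x2Bu /* 0101011 */

/* Hypothetical helper: build a CFU R4-type instruction word.
 * Bits 26:25 are kept all-zero for compatibility with future implementations. */
static uint32_t cfu_encode_r4(uint32_t rs3, uint32_t rs2, uint32_t rs1,
                              uint32_t funct3, uint32_t rd)
{
    assert(funct3 < 8u);                                       /* 3-bit immediate */
    assert(rs1 < 32u && rs2 < 32u && rs3 < 32u && rd < 32u);   /* 5-bit addresses */
    return (rs3 << 27) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM1;
}
```

Only `funct3` is left for selecting an operation, which is where the limit of 8 R4-type instructions comes from.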

.Hardware resource requirements
[WARNING]
Enabling the CFU and actually implementing R4-type instruction (or more precisely, using `rs3` inside the CFU hardware
module) will add another read port to the core's register file increasing resource requirements. For example, on a
FPGA platform that supports dual-port RAMs this will _double_ the number of required BRAMs for implementing the register
file.

:sectnums:
==== CFU R5-Type Instructions

The R5-type CFU instructions operate on four source registers and return the processing result to the destination register.
As all bits of the instruction word are used to encode the five registers and the opcode, no further immediate bits
are available to specify the actual operation. There are two different R5-type instructions with two different opcodes
available. Hence, only two R5-type operations can be implemented out of the box.

Example operation: `rd <= rs1 & rs2 & rs3 & rs4`

.CFU R5-type instruction A format
image::cfu_r5type_instruction_a.png[align=center]

.CFU R5-type instruction B format
image::cfu_r5type_instruction_b.png[align=center]

* `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data)
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `1011011` (RISC-V "custom-2" opcode) and/or `1111011` (RISC-V "custom-3" opcode)

.RS4 bit field
[NOTE]
The `rs4` bit-field is split into two instruction word fields `rs4.hi` and `rs4.lo`. This allows a simple
decoding logic as the location of the remaining register fields is identical to other R-type instructions.
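
The split `rs4` field can be illustrated with a small encode/decode sketch. The exact bit positions are given only in the figures, so the following is an assumption: `rs4.hi` is taken to be `rs4[4:3]` in bits 26:25 and `rs4.lo` to be `rs4[2:0]` in bits 14:12, which are the only bits left once the other register fields keep their usual R-type positions. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM2 0x5Bu /* 1011011, R5-type instruction A */
#define CFU_OPCODE_CUSTOM3 0x7Bu /* 1111011, R5-type instruction B */

/* Hypothetical helper. ASSUMPTION: rs4.hi = rs4[4:3] in bits 26:25 and
 * rs4.lo = rs4[2:0] in bits 14:12 (the slots that R4-type instructions
 * use for the all-zero bits and funct3). */
static uint32_t cfu_encode_r5(uint32_t opcode, uint32_t rs1, uint32_t rs2,
                              uint32_t rs3, uint32_t rs4, uint32_t rd)
{
    assert(opcode == CFU_OPCODE_CUSTOM2 || opcode == CFU_OPCODE_CUSTOM3);
    assert(rs1 < 32u && rs2 < 32u && rs3 < 32u && rs4 < 32u && rd < 32u);
    uint32_t rs4_hi = (rs4 >> 3) & 0x3u; /* upper two address bits */
    uint32_t rs4_lo = rs4 & 0x7u;        /* lower three address bits */
    return (rs3 << 27) | (rs4_hi << 25) | (rs2 << 20) | (rs1 << 15) |
           (rs4_lo << 12) | (rd << 7) | opcode;
}

/* Reassemble the 5-bit rs4 address from its two fields (same assumption). */
static uint32_t cfu_decode_r5_rs4(uint32_t instr_word)
{
    uint32_t rs4_hi = (instr_word >> 25) & 0x3u;
    uint32_t rs4_lo = (instr_word >> 12) & 0x7u;
    return (rs4_hi << 3) | rs4_lo;
}
```

With this layout `rs1`, `rs2`, `rs3` and `rd` sit at the same positions as in the R4-type format, so a decoder can share the field extraction for all CFU formats.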

.RISC-V compatibility
[IMPORTANT]
The RISC-V ISA specification does not define an R5-type instruction format. Hence, this instruction
layout is NEORV32-specific.

.Instruction encoding space
[IMPORTANT]
There are no immediate fields in the CFU R5-type instruction so the actual operation is specified entirely
by the opcode resulting in just two different operations out of the box. However, another CFU instruction
(like a R3-type instruction) can be used to "program" the actual operation of a R5-type instruction by
writing operation information to a CFU-internal "command" register.


:sectnums:
==== Using Custom Instructions in Software

The custom instructions provided by the CFU are included into plain C code by using **intrinsics**. Intrinsics
The custom instructions provided by the CFU can be used in plain C code by using **intrinsics**. Intrinsics
behave like "normal" functions but under the hood they are a set of macros that hide the complexity of inline assembly.
Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when including custom
instructions.
instructions. Each intrinsic will result in a single 32-bit instruction word providing maximum code efficiency.

The NEORV32 software framework provides two pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h` - one for R3-type instruction and one for R4-type instructions:
The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`:

.CFU instruction prototypes
[source,c]
----
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instruction
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instruction
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instructions
neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) // R5-type instruction A
neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) // R5-type instruction B
----

The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
when not needed. Each intrinsic function requires several arguments depending on the instruction type:

* `funct7` - 7-bit immediate (r3-type)
* `funct3` - 3-bit immediate (r3-type, r4-type)
* `rs3` - source operand 2, 32-bit (r4-type)
* `rs2` - source operand 2, 32-bit (r3-type, r4-type)
* `rs1` - source operand 1, 32-bit (r3-type, r4-type)
when not needed. Each intrinsic function requires several arguments depending on the instruction type/format:

[NOTE]
The literals (immediate bit-fields `funct3` and `funct7`) have to be **static at compile time**.
* `funct7` - 7-bit immediate (R3-type only)
* `funct3` - 3-bit immediate (R3-type, R4-type)
* `rs1` - source operand 1, 32-bit (R3-type, R4-type, R5-type)
* `rs2` - source operand 2, 32-bit (R3-type, R4-type, R5-type)
* `rs3` - source operand 3, 32-bit (R4-type, R5-type)
* `rs4` - source operand 4, 32-bit (R5-type only)

The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2`, `rs3` and `rs4`
arguments pass the actual data to the CFU. These register arguments can be populated with variables or literals.
Expand All @@ -159,14 +194,16 @@ The following example shows how to pass arguments when executing both CFU instru
[source,c]
----
uint32_t tmp = some_function();
...
uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);
uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, some_array[i]);
uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp);
----
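
When testing application code on a host machine without the CFU, it can help to have plain C reference models for the custom operations. The sketch below models only the three "Example operation" lines from the format sections above (`xnor`, multiply-add, 4-input `and`); it does not describe what the default CFU hardware module actually implements, and all function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical host-side models of the example operations from the
 * instruction format sections. These are NOT the NEORV32 intrinsics. */

/* R3-type example: rd <= rs1 xnor rs2 */
static uint32_t cfu_model_r3_xnor(uint32_t rs1, uint32_t rs2)
{
    return ~(rs1 ^ rs2);
}

/* R4-type example: rd <= (rs1 * rs2 + rs3)[31:0]
 * Unsigned 32-bit arithmetic wraps modulo 2^32, which keeps exactly
 * the lower 32 bits of the full result. */
static uint32_t cfu_model_r4_madd(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    return rs1 * rs2 + rs3;
}

/* R5-type example: rd <= rs1 & rs2 & rs3 & rs4 */
static uint32_t cfu_model_r5_and4(uint32_t rs1, uint32_t rs2,
                                  uint32_t rs3, uint32_t rs4)
{
    return rs1 & rs2 & rs3 & rs4;
}
```

Such models take the same 32-bit operands and return the same `uint32_t` result type as the intrinsics, so results from the target can be compared against them directly.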

.CFU Example Program
[TIP]
There is a simple example program for the CFU, which shows how to use the _default_ CFU hardware module.
The example program is located in `sw/example/demo_cfu`.
There is an example program for the CFU, which shows how to use the _default_ CFU hardware module.
This example program is located in `sw/example/demo_cfu`.


:sectnums:
Expand All @@ -181,6 +218,12 @@ The default CFU hardware module already implement some exemplary instructions th
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals and the handshake with the CPU pipeline.

.CFU hardware resource requirements
[WARNING]
Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using
the according operands for the CFU hardware) will add one or two additional read ports to the core's
register file increasing resource requirements.

CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal controller unit takes care of
Binary file added docs/figures/cfu_r5type_instruction_a.png
Binary file added docs/figures/cfu_r5type_instruction_b.png
11 changes: 8 additions & 3 deletions rtl/core/neorv32_cpu.vhd
Expand Up @@ -121,14 +121,16 @@ architecture neorv32_cpu_rtl of neorv32_cpu is
-- ----------------------------------------------------------------------------------------------

-- local constants --
constant regfile_rs3_en_c : boolean := CPU_EXTENSION_RISCV_Zxcfu or CPU_EXTENSION_RISCV_Zfinx; -- third register file read port (rs3)
constant regfile_rs3_en_c : boolean := CPU_EXTENSION_RISCV_Zxcfu or CPU_EXTENSION_RISCV_Zfinx; -- 3rd register file read port (rs3)
constant regfile_rs4_en_c : boolean := CPU_EXTENSION_RISCV_Zxcfu; -- 4th register file read port (rs4)

-- local signals --
signal ctrl : std_ulogic_vector(ctrl_width_c-1 downto 0); -- main control bus
signal imm : std_ulogic_vector(XLEN-1 downto 0); -- immediate
signal rs1 : std_ulogic_vector(XLEN-1 downto 0); -- source register 1
signal rs2 : std_ulogic_vector(XLEN-1 downto 0); -- source register 2
signal rs3 : std_ulogic_vector(XLEN-1 downto 0); -- source register 3
signal rs4 : std_ulogic_vector(XLEN-1 downto 0); -- source register 4
signal alu_res : std_ulogic_vector(XLEN-1 downto 0); -- alu result
signal alu_add : std_ulogic_vector(XLEN-1 downto 0); -- alu address result
signal alu_cmp : std_ulogic_vector(1 downto 0); -- comparator result
Expand Down Expand Up @@ -345,7 +347,8 @@ begin
generic map (
XLEN => XLEN, -- data path width
CPU_EXTENSION_RISCV_E => CPU_EXTENSION_RISCV_E, -- implement embedded RF extension?
RS3_EN => regfile_rs3_en_c -- enable third read port
RS3_EN => regfile_rs3_en_c, -- enable 3rd read port
RS4_EN => regfile_rs4_en_c -- enable 4th read port
)
port map (
-- global control --
Expand All @@ -359,7 +362,8 @@ begin
-- data output --
rs1_o => rs1, -- operand 1
rs2_o => rs2, -- operand 2
rs3_o => rs3 -- operand 3
rs3_o => rs3, -- operand 3
rs4_o => rs4 -- operand 4
);


Expand Down Expand Up @@ -387,6 +391,7 @@ begin
rs1_i => rs1, -- rf source 1
rs2_i => rs2, -- rf source 2
rs3_i => rs3, -- rf source 3
rs4_i => rs4, -- rf source 4
pc_i => curr_pc, -- current PC
imm_i => imm, -- immediate
-- data output --