# Boosting the efficiency of RISC-V cores: Fine-grain multi-threading and custom instructions, from concepts to implementation

Riadh Ben Abdelhamid<sup>1</sup>

<sup>1</sup> Postdoctoral researcher at the Novel Computing Technologies group, Heidelberg University, Germany

FPGA Ignite Summer School 2024

## Outline

- Introduction
- General Concepts
- BRISKI Barrel Processor
- Adding Custom Instructions to the tool chain
- RTL support of Custom Instructions
- Testing it all
- Another one

# Lecture outline

#### What I will teach

- Issues facing the efficiency of FPGA softcores.
- Fine-grain multi-threading and how it improves the efficiency of FPGA softcores.
- Adding custom instructions to riscv-gnu-toolchain and how they can improve the efficiency of the processor.
- Adding support for custom instructions at the RTL level.

#### What you will learn

Well, that is up to you :)

# RISC-V processors

#### Benefits of using RISC-V

- The "Linux of hardware": an open-source ISA (Instruction Set Architecture).
- Rich and growing ecosystem and user base.
- Modular ISA that allows adding your own custom instructions.

#### RISC-V on FPGAs

- Mapping on FPGAs is tricky.
- Conventional micro-architectures underperform.

#### Possible Solutions

- A barrel processor architecture may yield high compute density.
- Context storage can be handled by on-chip memories.
- Simpler, deeper pipeline $\Rightarrow$ higher throughput with less logic.

# Efficiency of RISC-V softcores?

#### What can we improve?

Multiple aspects:

- **Speed**: clock speed, throughput, peak performance, sustained performance.
- **Area**: LUTs, BRAMs, etc. $\Rightarrow$ compute density.
- **Power**: power efficiency.

#### How can we improve?

- Architecture (ISA, memory architecture, etc.).
- **Micro-architecture** (efficient ISA implementation, efficient mapping to target hardware, **deep pipelining**, etc.).
- Tools (efficient use of tools like Vivado, Quartus, etc.).

# Challenges of Deep Pipelining on FPGAs

To achieve a high  $F_{MAX}$ , a softcore processor requires a deep pipeline.

**Deep pipelining** is a technique used to **improve throughput** by breaking down instruction execution into **smaller stages**.

However, deep pipelining faces some challenges on FPGAs:

#### Branch Prediction Penalty

- Deep pipelines suffer from branch prediction penalties, where pipeline stages are flushed when branches are mispredicted.
- FPGAs have **limited resources** to spend on complex branch prediction structures, making efficient prediction challenging.

#### Increased Forwarding Logic

Deep pipelines require extensive forwarding logic to propagate data between pipeline stages, which increases resource utilization and limits scalability on FPGAs.

# Branch Prediction Penalty and Impact on CPI

To express the cost of branch mispredictions in terms of cycles per instruction (CPI), we need to account for the additional cycles incurred by each misprediction.

#### CPI cost

The overall CPI, including the penalty, can be expressed as:

$$CPI_{overall} = CPI_{correct} + P(misprediction) \times P_m \tag{1}$$

For example, take $CPI_{correct}=1$, a misprediction probability $P(misprediction)=0.2$ (20% misprediction rate) and a misprediction penalty of $P_m=5$ wasted cycles. The resulting $CPI_{overall}$ evaluates to:

$$CPI_{overall} = 1 + 0.2 \times 5 = 1 + 1 = 2 \tag{2}$$

This means that, on average, **it takes 2 clock cycles to execute each instruction**, considering both correctly predicted branches and the penalty for mispredictions. **The CPI doubles (a 100% increase in cycles per instruction), so instruction throughput is halved.**

#### Important Note

The higher the misprediction rate $P(misprediction)$ and/or the penalty $P_m$, the greater the impact on the overall CPI, indicating decreased performance due to branch mispredictions.
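As a quick numerical check of Equation (1), the following Python sketch evaluates the overall CPI. The function name `overall_cpi` and the swept misprediction rates are illustrative assumptions; only the 20% rate with a 5-cycle penalty comes from the example above.

```python
def overall_cpi(cpi_correct: float, p_mispredict: float, penalty_cycles: float) -> float:
    """Equation (1): CPI_overall = CPI_correct + P(misprediction) * P_m."""
    return cpi_correct + p_mispredict * penalty_cycles


# Worked example from the text: CPI_correct = 1, 20% misprediction, 5-cycle penalty.
print(overall_cpi(1.0, 0.2, 5))  # 2.0

# Illustrative sweep (assumed rates): the penalty term grows linearly with the rate.
for rate in (0.0, 0.05, 0.1, 0.2):
    print(f"P(misprediction)={rate:.2f} -> CPI_overall={overall_cpi(1.0, rate, 5):.2f}")
```

The deeper the pipeline, the larger `penalty_cycles` tends to be, which is why the same misprediction rate hurts a deep pipeline more.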

# Data Forwarding (Register Bypassing)

#### Read After Write (RAW) Hazard (Data Hazard)

This occurs when an instruction needs to read a register that a previous instruction is writing to, and the read would otherwise happen before the write completes. Without bypassing, the pipeline would need to stall until the write completes, as the needed data would not yet be available.

- Instruction 1: ADD x3, x1, x2 $\implies$ x3 = x1 + x2
- Instruction 2: SUB x4, x3, x5 $\implies$ x4 = x3 - x5

In this example, Instruction 2 needs the result of Instruction 1 for the SUB operation. If the processor waits until Instruction 1 writes the result to the register file before allowing Instruction 2 to proceed, this would create a pipeline stall.

#### Solution with Register Bypassing

Register bypassing allows the result from Instruction 1 (which will be available at the end of the execute stage) to be forwarded directly to the input of the execute stage of Instruction 2, without waiting for the result to be written back to the register file.
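The dependency in the ADD/SUB example can be detected mechanically. The sketch below is a minimal software model, not hardware: the `Instr` tuple, the `raw_hazards` function and the `window` parameter (how many following instructions can still see the stale register) are all hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    rd: str   # destination register
    rs1: str  # first source register
    rs2: str  # second source register


def raw_hazards(program: List[Instr], window: int = 1) -> List[Tuple[int, int, str]]:
    """Return (producer_index, consumer_index, register) for every RAW
    dependency whose consumer is at most `window` instructions after the producer."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.rd == "x0":  # x0 is hard-wired to zero in RISC-V, writes are discarded
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.rd in (consumer.rs1, consumer.rs2):
                hazards.append((i, j, producer.rd))
            if consumer.rd == producer.rd:  # value overwritten, later readers depend on j
                break
    return hazards


program = [
    Instr("ADD", "x3", "x1", "x2"),  # x3 = x1 + x2
    Instr("SUB", "x4", "x3", "x5"),  # x4 = x3 - x5
]
print(raw_hazards(program))  # [(0, 1, 'x3')]
```

In hardware, the comparison inside the loop corresponds to the control logic that decides whether an operand must be taken from a forwarding path instead of the register file.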

# Why is register bypassing good?

- Addresses **RAW hazards** by reducing or eliminating pipeline stalls.
- Provides an immediate path for data from the output of one instruction to the input of the next.
- Improves performance by allowing subsequent instructions to use the results of earlier instructions as soon as they are computed.
- Helps maintain high instruction throughput and efficient pipeline utilization.

# Why is register bypassing not good?

Data forwarding requires additional hardware in the processor design. Specifically:

- **Multiplexers**: these select the correct data source (either the register file or the result of an earlier instruction still in the pipeline) for each operand of an instruction.
- **Control logic**: extra control logic is needed to detect when data forwarding should occur and to control the multiplexers accordingly.

# Barrel Processing (Fine-grain multi-threading)



#### NOTE

- $N_{\text{Hardware Threads}} \geq N_{\text{physical pipeline stages}}$.
- A new hart (hardware thread) is fetched each clock cycle.
- With 16 harts, each hart issues one instruction every 16 cycles.
- By the time the same hart is fetched again, all of its branches and data hazards are resolved $\implies$ no need for branch prediction or register forwarding $\implies$ better MIPS/LUT, and a higher number of cores is possible.
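The condition in the first bullet can be checked with a small cycle-level model. This is a sketch under stated assumptions: harts are issued strictly round-robin, and every instruction occupies the pipeline for exactly `pipeline_depth` cycles. The function name and the depth values are hypothetical, since the notes above only fix the number of harts at 16.

```python
def max_in_flight_per_hart(num_harts: int, pipeline_depth: int, cycles: int = 256) -> int:
    """Worst-case number of instructions of one hart that are in the pipeline at once."""
    in_flight = []  # (hart_id, retire_cycle) for every instruction currently in the pipeline
    worst = 0
    for cycle in range(cycles):
        # Drop instructions that have left the pipeline by this cycle.
        in_flight = [(h, r) for (h, r) in in_flight if r > cycle]
        hart = cycle % num_harts  # round-robin: a new hart is fetched each cycle
        in_flight.append((hart, cycle + pipeline_depth))
        worst = max(worst, sum(1 for (h, _) in in_flight if h == hart))
    return worst


print(max_in_flight_per_hart(num_harts=16, pipeline_depth=16))  # 1
print(max_in_flight_per_hart(num_harts=4, pipeline_depth=16))   # 4
```

A result of 1 means a hart never has two instructions in flight, so it cannot hazard against itself. With fewer harts than pipeline stages the result exceeds 1, and forwarding or stalling would be needed again.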

# Barrel processor (Fine-Grain Multi-Threading) Advantage

To increase instruction throughput, you must aim for :

- low CPI ( $\leq 1$ )
- high maximum clock speed F<sub>MAX</sub>

A branch prediction accuracy close to 100% contributes to a better CPI. Additionally, attaining a high $F_{MAX}$ requires a deep pipeline, which typically necessitates register forwarding, even though the forwarding logic can itself constrain $F_{MAX}$.

Here, the barrel processor comes into play. Interleaving hardware threads every clock cycle :

- eliminates the need for branch prediction
- eliminates the need for register forwarding

This effectively allows:

- a deeper pipeline without paying increased branch and forwarding costs
- a higher clock speed while maintaining a low CPI

By removing the need for branch prediction and register forwarding, a barrel processor saves logic and results in a more compact implementation  $\implies$  Higher compute density (MIPS/LUT).

# IPS as a performance metric

#### Instructions Per Second (IPS)

- The CPI metric is agnostic to the operating clock speed of a processor.
- The actual processor performance (instruction throughput) can be measured by Equation (3), where Instructions Per Second (IPS) is the actual instruction throughput, Instructions Per Cycle (IPC) is the inverse of CPI ($IPC = 1/CPI$) and $F_{MAX}$ is the maximum operating clock speed of the processor.

$$IPS = IPC \times F_{MAX} \tag{3}$$

#### Important Note

In a barrel processor there is **no need** for register forwarding or branch prediction, even as the pipeline goes deeper, because **by the time a hart (hardware thread) is scheduled again**, all of its data hazards and branches are **already resolved**.
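Equation (3) is easy to evaluate numerically. In the sketch below, the function name `ips` is an illustrative choice. The 650 MHz clock is the figure this lecture reports for BRISKI, and the CPI of 2 reuses the branch-misprediction example.

```python
def ips(f_max_hz: float, cpi: float) -> float:
    """Equation (3): IPS = IPC * F_MAX, with IPC = 1 / CPI."""
    return (1.0 / cpi) * f_max_hz


# Same clock speed, different CPI: throughput scales with IPC.
print(ips(650e6, 1.0) / 1e6)  # 650.0 (MIPS)
print(ips(650e6, 2.0) / 1e6)  # 325.0 (MIPS)
```

This is why CPI alone is not enough: a design with a worse CPI can still win if its clock is proportionally faster, and the reverse also holds.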

# Design of BRISKI (Barrel RISC-V for Kilo-core Implementations)



Figure - BRISKI Barrel Processor Architecture.

### NOTE

- 16 RegisterFiles/ProgramCounters for 16 Hardware Threads.
- Fewer than 800 LUTs and fewer than 1K FFs (near 1-to-1 ratio).
- 650+ MHz (elastic pipeline) ⇒ 650 MIPS (CPI=1) ⇒ ~0.82 MIPS/LUT

# FPGA Resource Layout and the Need for an Elastic Pipeline



Figure – 12 Columns containing BlockRAM resources (180 BRAMs/Col for a total of 2160 BRAM (2160 RAMB36 or 4320 RAMB18)).



Figure – 4 Columns containing UltraRAM resources (240 URAMs/Col for a total of 960 UltraRAMs).



Figure – 19 Columns containing DSP resources (360 DSPs/Col for a total of 6840 DSPs).



Figure – Columns containing PCIe resources

# Design of BRISKI (Barrel RISC-V for Kilo-core Implementations)

### NOTE

- One BRAM for Data / Instructions.
- One BRAM for 16 register files.
- Memory Mapped Interface to translate between load/store and control signals.



Figure – BRISKI CoreTop Interface wrapper.

# Design of BRISKI (Barrel RISC-V for Kilo-core Implementations)



Figure - Register File implementation using two RAMB18 primitives.

### NOTE

- 16 register files (2 RAMB18 in SDP mode).
- Each RAMB18 provides a 512 by 32-bit space (16 register files of 32 registers each), which is fully utilized.
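The "fully utilized" claim follows from 16 harts times 32 registers equalling 512 entries. The sketch below shows one plausible way to index such a shared memory. The address layout (hart ID in the upper bits, register index in the lower five bits) is an assumption made for illustration, not something the slides confirm, and `regfile_address` is a hypothetical helper.

```python
NUM_HARTS = 16
REGS_PER_HART = 32  # RV32I integer registers x0..x31


def regfile_address(hart_id: int, reg_index: int) -> int:
    """Assumed layout: address = {hart_id, reg_index} = hart_id * 32 + reg_index."""
    if not 0 <= hart_id < NUM_HARTS:
        raise ValueError("hart_id out of range")
    if not 0 <= reg_index < REGS_PER_HART:
        raise ValueError("reg_index out of range")
    return (hart_id << 5) | reg_index


print(NUM_HARTS * REGS_PER_HART)  # 512 entries, matching the 512 x 32-bit RAMB18 space
print(regfile_address(0, 0))      # 0
print(regfile_address(15, 31))    # 511
```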

# Compute density

#### Why Compute density matters

- Compute density: the ratio of Millions of Instructions Per Second per LUT (MIPS/LUT).
- The short-term goal is to increase MIPS and reduce LUTs $\implies$ improve local compute density.
- The long-term goal is to flood a single FPGA with hundreds of cores without compromising MIPS $\implies$ improve global compute density.
- The end result would deliver 100s of GIPS on a single FPGA chip.

$$(MIPS/LUT)_{FPGA} = (MIPS/LUT)_{CORE} \times \#CORES \tag{4}$$

In simpler terms, **increasing** compute density means: more LUTs $\implies$ **increasingly** more MIPS.

# State-of-the-art softcore implementations

Table I: Comparison with recent RISC-V related works.

|                              | OoO (out-of-order) | Bit-serial   | In-order              | Barrel processor  |
|------------------------------|--------------------|--------------|-----------------------|-------------------|
| Ref                          | [1]                | [2]          | [3]                   | BRISKI [4]        |
| Year                         | 2019               | NA           | 2016                  | 2024              |
| FPGA                         | XC7Z020            | Artix7       | VU9P                  | VU9P              |
| LUT                          | $\sim 15k$         | 125          | 320                   | 789               |
| FlipFlop (FF)                | $\sim 8k$          | 164          | Not reported          | 855               |
| BRAM                         | 6                  | Not reported | 0.5-1\*\*             | 2 (RAMB18)        |
| ISA                          | RV32IM             | RV32I        | RV32I+lr/sc-bshift\*  | RV32I+lr/sc+csrrs |
| Fmax (MHz)                   | 95.3               | 220          | 375                   | 650               |
| CPI                          | NA                 | >32          | 1.6                   | 1                 |
| MIPS (Fmax/CPI)              | 194                | ~7           | ~234                  | 650               |
| Compute density (MIPS/LUT)   | 0.012              | 0.055        | 0.73                  | 0.82              |

\* The work [3] reports that Multiply-shift and load/store byte-align sign-extension logic are implemented but shared by core pairs in a cluster.

\*\* The work [3] reports a BRAM utilization of 4 to 8 in a cluster of 8 cores which leads to 0.5 to 1 BRAM for a single core.

[1] S. Mashimo, A. Fujita, R. Matsuo, S. Akaki, A. Fukuda, T. Koizumi, J. Kadomoto, H. Irie, M. Goshima, K. Inoue, and R. Shioya, "An open source FPGA-optimized out-of-order RISC-V soft processor," in 2019 International Conference on Field-Programmable Technology (ICFPT), 2019, pp. 63–71.

[2] O. Kindgren. Bit-serial RISC-V. [Online]: https://github.com/olofk/serv

[3] J. Gray. (2017) GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework. A 1680-core, 26 MB Parallel Processor Overlay for Xilinx UltraScale+ VU9P. [Online]: https://carrv.github.io/2017/papers/gray-phalanx-carrv2017.pdf

[4] https://github.com/riadhbenabdelhamid/BRISKI
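The MIPS and compute-density rows of Table I can be re-derived from the Fmax, CPI and LUT rows. The sketch below does so for the two VU9P entries. The helper names are illustrative, and the input numbers are copied from the table.

```python
def mips(f_max_mhz: float, cpi: float) -> float:
    """MIPS = Fmax (MHz) / CPI."""
    return f_max_mhz / cpi


def compute_density(f_max_mhz: float, cpi: float, luts: int) -> float:
    """Compute density in MIPS per LUT."""
    return mips(f_max_mhz, cpi) / luts


# (Fmax in MHz, CPI, LUTs) taken from Table I.
rows = {
    "GRVI Phalanx [3]": (375.0, 1.6, 320),
    "BRISKI [4]": (650.0, 1.0, 789),
}
for name, (f_max, cpi, luts) in rows.items():
    print(f"{name}: {mips(f_max, cpi):.0f} MIPS, "
          f"{compute_density(f_max, cpi, luts):.2f} MIPS/LUT")
# GRVI Phalanx [3]: 234 MIPS, 0.73 MIPS/LUT
# BRISKI [4]: 650 MIPS, 0.82 MIPS/LUT
```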

# BRISKI enables single-FPGA Kilo Core designs

#### **BRISKI\* Barrel Processor Core**

- BRISKI implements the full RV32I user-mode ISA, an atomic extension subset (LR.W/SC.W) and CSRRS.
- 650+ MHz on a VU9P FPGA.
- Fewer than 800 LUTs in most implementations (fewer than 700 LUTs with the area-optimized directive on a VU9P).
- Current implementation interleaves **16 Hardware Threads**.
- More than 0.8 MIPS/LUT.

[\*] https://github.com/riadhbenabdelhamid/BRISKI.

## SPARKLE (Scalable Parallel Architecture for RISC-V Kernel-Level Execution)



Figure – SPARKLE floorplan on a VU9P FPGA.



Figure – SPARKLE's Fully placed and routed design, on a VU9P FPGA, with 1,024 BRISKI cores (16,384 Hardware Threads) @400 MHz.

#### SPARKLE\*\* : 1,024 BRISKI cores @ 400 MHz on a VU9P

- SPARKLE is a scalable many-core architecture (scales up and down).
- Currently running on a VU9P with 1,024 BRISKI cores @ 400 MHz and delivering 400 RV32I GIPS.
- This implementation uses around 800K LUTs, 2,085 BRAMs, 60 URAMs and 1,150K FFs.
- More than 0.5 MIPS/LUT.

[\*\*] Riadh Ben Abdelhamid, Vladislav Valek, and Dirk Koch. SPARKLE : A 1024-Core/16,384-Thread single FPGA many-core RISC-V barrel processor Overlay. ASAP 2024.
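The SPARKLE figures can be cross-checked with a few lines of arithmetic. This sketch assumes each BRISKI core sustains CPI = 1 at 400 MHz, as stated for the standalone core, and takes "around 800K LUTs" as exactly 800,000. The function name is illustrative.

```python
def aggregate_mips(cores: int, f_max_mhz: float, cpi: float = 1.0) -> float:
    """Total MIPS of `cores` identical cores, each delivering f_max_mhz / cpi MIPS."""
    return cores * f_max_mhz / cpi


total = aggregate_mips(cores=1024, f_max_mhz=400.0)  # assumes CPI = 1 per core
luts = 800_000                                       # "around 800K LUTs"

print(total / 1000)            # 409.6 (GIPS), reported as 400 GIPS
print(round(total / luts, 3))  # 0.512 (MIPS/LUT), consistent with "more than 0.5"
```

The per-LUT figure is lower than for a standalone core because the clock drops from 650 MHz to 400 MHz and the many-core design spends LUTs outside the cores.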

# PreLab preparation

#### (Optional) A recommended read

This is a nice guide to adding custom instructions; however, some of its contents are outdated: https://pcotret.gitlab.io/riscv-custom/sw_toolchain.html

#### Set Up the Tools

- The provided Virtual Machine comes with verilator and riscv-gnu-toolchain pre-installed.
- If you prefer not to use the VM, or cannot use it, make sure these tools are downloaded and installed.

#### Clone the BRISKI core repo and switch to the FPGAIgnite24 branch

- git clone https://github.com/riadhbenabdelhamid/BRISKI.git
- cd BRISKI
- git switch FPGAIgnite24



## **BRISKI Repo structure**



# Convert lower cases to upper cases with fine-grain multi-threading

#### Open the file ../BRISKI/software/assembly/lower\_upper\_byte.s

Listing - Data section

```
.section .data
.align 4
# Data sections for each hart, each containing an array of 32 ASCII characters
hart0_data: .ascii "AbcDefGhijKlmNop@#%$&*!()+=012"
hart1_data: .ascii "zXyWVutsrQpOnMLkjihgfEDCBA987654"
hart2_data: .ascii "PqRsTuvWXYzabcdefghij!@#$%^&*()_"
hart3_data: .ascii "lmnOpQrStUvWxYzABCDEFGHIJ{} ;:<>"
hart4_data: .ascii "KLMNoPQrStUVWXyz0123456789~`-=_+"
hart5_data: .ascii "abcdefghijKLMNOPQR2345678901*&~%"
hart6_data: .ascii "1234567890abcdefGHIJKLMnoPQRSTuv"
hart7_data: .ascii "yzABCDEFghijKLMNOpqrst0123456789"
hart8_data: .ascii "ghijKLMNOPQRSTuvwxyZ!@#$%^&*()12"
hart9_data: .ascii "ABCDefghijklmnopQRSTuvWXYZ012345"
hart10_data: .ascii "mnopQRSTUVWXyzab@#%$&*!()+=012"
hart11_data: .ascii "xyz1234567890ABCD%$&*(!)_+=-{} ["
hart12_data: .ascii "wxyZABCDEfghijklmnopQRSTuvWXYZ12"
hart13_data: .ascii "abcdefghijklmNOPQRSTUVWXyz012345"
hart14_data: .ascii "PQRSTuvWXYzabcdefghijklmnop!@#$%"
hart15_data: .ascii "1234567890ABCDXYZefghijklmnopQRS"

.align 4
shared_counter: .word 0 # Shared counter for barrier synchronization
```

# Convert lower cases to upper cases with fine-grain multi-threading

Listing - Text section: Initialization

```
.section .text
.globl _start

_start:
    li t0, 32               # Length of the ASCII array (32 characters)
    la t1, hart0_data       # Load address of hart0 data
    la t2, shared_counter   # Load address of shared counter

    # Determine hart id (for simplicity, using a fixed base register)
    csrr a0, mhartid        # Read the hart ID
    slli a0, a0, 5          # Each hart's data starts 32 bytes apart
    add t1, t1, a0          # Calculate start of this hart's data section
```

# Understanding ASCII characters

| CHAR  | DEC | HEX | CHAR | DEC | HEX | CHAR | DEC | HEX | CHAR  | DEC | HEX | CHAR      | DEC | HEX | CHAR   | DEC | HEX | CHAR | DEC | HEX | CHAR | DEC | HEX |
|-------|-----|-----|------|-----|-----|------|-----|-----|-------|-----|-----|-----------|-----|-----|--------|-----|-----|------|-----|-----|------|-----|-----|
| [NUL] | 0   | 00  | [SP] | 32  | 20  | @    | 64  | 40  | \`    | 96  | 60  | €         | 128 | 80  | [NBSP] | 160 | A0  | À    | 192 | C0  | à    | 224 | E0  |
| [SOH] | 1   | 01  | !    | 33  | 21  | A    | 65  | 41  | a     | 97  | 61  | (n/a)     | 129 | 81  | ¡      | 161 | A1  | Á    | 193 | C1  | á    | 225 | E1  |
| [STX] | 2   | 02  | "    | 34  | 22  | B    | 66  | 42  | b     | 98  | 62  | ‚         | 130 | 82  | ¢      | 162 | A2  | Â    | 194 | C2  | â    | 226 | E2  |
| [ETX] | 3   | 03  | #    | 35  | 23  | C    | 67  | 43  | c     | 99  | 63  | ƒ         | 131 | 83  | £      | 163 | A3  | Ã    | 195 | C3  | ã    | 227 | E3  |
| [EOT] | 4   | 04  | \$   | 36  | 24  | D    | 68  | 44  | d     | 100 | 64  | „         | 132 | 84  | ¤      | 164 | A4  | Ä    | 196 | C4  | ä    | 228 | E4  |
| [ENQ] | 5   | 05  | %    | 37  | 25  | E    | 69  | 45  | e     | 101 | 65  | …         | 133 | 85  | ¥      | 165 | A5  | Å    | 197 | C5  | å    | 229 | E5  |
| [ACK] | 6   | 06  | &    | 38  | 26  | F    | 70  | 46  | f     | 102 | 66  | †         | 134 | 86  | ¦      | 166 | A6  | Æ    | 198 | C6  | æ    | 230 | E6  |
| [BEL] | 7   | 07  | '    | 39  | 27  | G    | 71  | 47  | g     | 103 | 67  | ‡         | 135 | 87  | §      | 167 | A7  | Ç    | 199 | C7  | ç    | 231 | E7  |
| [BS]  | 8   | 08  | (    | 40  | 28  | H    | 72  | 48  | h     | 104 | 68  | ˆ         | 136 | 88  | ¨      | 168 | A8  | È    | 200 | C8  | è    | 232 | E8  |
| [HT]  | 9   | 09  | )    | 41  | 29  | I    | 73  | 49  | i     | 105 | 69  | ‰         | 137 | 89  | ©      | 169 | A9  | É    | 201 | C9  | é    | 233 | E9  |
| [LF]  | 10  | 0A  | \*   | 42  | 2A  | J    | 74  | 4A  | j     | 106 | 6A  | Š         | 138 | 8A  | ª      | 170 | AA  | Ê    | 202 | CA  | ê    | 234 | EA  |
| [VT]  | 11  | 0B  | +    | 43  | 2B  | K    | 75  | 4B  | k     | 107 | 6B  | ‹         | 139 | 8B  | «      | 171 | AB  | Ë    | 203 | CB  | ë    | 235 | EB  |
| [FF]  | 12  | 0C  | ,    | 44  | 2C  | L    | 76  | 4C  | l     | 108 | 6C  | Œ         | 140 | 8C  | ¬      | 172 | AC  | Ì    | 204 | CC  | ì    | 236 | EC  |
| [CR]  | 13  | 0D  | -    | 45  | 2D  | M    | 77  | 4D  | m     | 109 | 6D  | (n/a)     | 141 | 8D  | [SHY]  | 173 | AD  | Í    | 205 | CD  | í    | 237 | ED  |
| [SO]  | 14  | 0E  | .    | 46  | 2E  | N    | 78  | 4E  | n     | 110 | 6E  | Ž         | 142 | 8E  | ®      | 174 | AE  | Î    | 206 | CE  | î    | 238 | EE  |
| [SI]  | 15  | 0F  | /    | 47  | 2F  | O    | 79  | 4F  | o     | 111 | 6F  | (n/a)     | 143 | 8F  | ¯      | 175 | AF  | Ï    | 207 | CF  | ï    | 239 | EF  |
| [DLE] | 16  | 10  | 0    | 48  | 30  | P    | 80  | 50  | p     | 112 | 70  | (n/a)     | 144 | 90  | °      | 176 | B0  | Ð    | 208 | D0  | ð    | 240 | F0  |
| [DC1] | 17  | 11  | 1    | 49  | 31  | Q    | 81  | 51  | q     | 113 | 71  | ‘         | 145 | 91  | ±      | 177 | B1  | Ñ    | 209 | D1  | ñ    | 241 | F1  |
| [DC2] | 18  | 12  | 2    | 50  | 32  | R    | 82  | 52  | r     | 114 | 72  | ’         | 146 | 92  | ²      | 178 | B2  | Ò    | 210 | D2  | ò    | 242 | F2  |
| [DC3] | 19  | 13  | 3    | 51  | 33  | S    | 83  | 53  | s     | 115 | 73  | “         | 147 | 93  | ³      | 179 | B3  | Ó    | 211 | D3  | ó    | 243 | F3  |
| [DC4] | 20  | 14  | 4    | 52  | 34  | T    | 84  | 54  | t     | 116 | 74  | ”         | 148 | 94  | ´      | 180 | B4  | Ô    | 212 | D4  | ô    | 244 | F4  |
| [NAK] | 21  | 15  | 5    | 53  | 35  | U    | 85  | 55  | u     | 117 | 75  | •         | 149 | 95  | µ      | 181 | B5  | Õ    | 213 | D5  | õ    | 245 | F5  |
| [SYN] | 22  | 16  | 6    | 54  | 36  | V    | 86  | 56  | v     | 118 | 76  | [en dash] | 150 | 96  | ¶      | 182 | B6  | Ö    | 214 | D6  | ö    | 246 | F6  |
| [ETB] | 23  | 17  | 7    | 55  | 37  | W    | 87  | 57  | w     | 119 | 77  | [em dash] | 151 | 97  | ·      | 183 | B7  | ×    | 215 | D7  | ÷    | 247 | F7  |
| [CAN] | 24  | 18  | 8    | 56  | 38  | X    | 88  | 58  | x     | 120 | 78  | ˜         | 152 | 98  | ¸      | 184 | B8  | Ø    | 216 | D8  | ø    | 248 | F8  |
| [EM]  | 25  | 19  | 9    | 57  | 39  | Y    | 89  | 59  | y     | 121 | 79  | ™         | 153 | 99  | ¹      | 185 | B9  | Ù    | 217 | D9  | ù    | 249 | F9  |
| [SUB] | 26  | 1A  | :    | 58  | 3A  | Z    | 90  | 5A  | z     | 122 | 7A  | š         | 154 | 9A  | º      | 186 | BA  | Ú    | 218 | DA  | ú    | 250 | FA  |
| [ESC] | 27  | 1B  | ;    | 59  | 3B  | [    | 91  | 5B  | {     | 123 | 7B  | ›         | 155 | 9B  | »      | 187 | BB  | Û    | 219 | DB  | û    | 251 | FB  |
| [FS]  | 28  | 1C  | <    | 60  | 3C  | \\   | 92  | 5C  | \|    | 124 | 7C  | œ         | 156 | 9C  | ¼      | 188 | BC  | Ü    | 220 | DC  | ü    | 252 | FC  |
| [GS]  | 29  | 1D  | =    | 61  | 3D  | ]    | 93  | 5D  | }     | 125 | 7D  | (n/a)     | 157 | 9D  | ½      | 189 | BD  | Ý    | 221 | DD  | ý    | 253 | FD  |
| [RS]  | 30  | 1E  | >    | 62  | 3E  | ^    | 94  | 5E  | ~     | 126 | 7E  | ž         | 158 | 9E  | ¾      | 190 | BE  | Þ    | 222 | DE  | þ    | 254 | FE  |
| [US]  | 31  | 1F  | ?    | 63  | 3F  | \_   | 95  | 5F  | [DEL] | 127 | 7F  | Ÿ         | 159 | 9F  | ¿      | 191 | BF  | ß    | 223 | DF  | ÿ    | 255 | FF  |

Figure - Example encoding of ASCII characters.

Source: https://www.ascii-code.net/

# Convert lowercase to uppercase with fine-grain multi-threading

Listing – Text section : Convert loop

```
    # Character Conversion Loop
convert_loop:
    lb a1, 0(t1)             # Load character from array
    #beqz a1, finish         # End of string (null character), exit loop

    li a2, 'a'               # Load 'a'
    li a3, 'z'               # Load 'z'
    blt a1, a2, next_char    # If char < 'a', not a lowercase letter
    bgt a1, a3, next_char    # If char > 'z', not a lowercase letter

    # Convert to uppercase
    li a4, 32                # ASCII difference between upper and lower case
    sub a1, a1, a4           # Convert to uppercase
    sb a1, 0(t1)             # Store back converted character

next_char:
    addi t1, t1, 1           # Move to next character
    addi t0, t0, -1          # Decrease character count
    bnez t0, convert_loop    # Continue loop if more characters
```


# Convert lowercase to uppercase with fine-grain multi-threading

Listing - Text section : Barrier and Termination

| # Barrier Synchronization                                               |
|-------------------------------------------------------------------------|
| finish:                                                                 |
| li t6, 16 # Total number of harts                                       |
| li t3, 1 # Atomic increment value                                       |
| barrier:                                                                |
| lr.w t4, 0(t2) # Load current counter value                             |
| add t4, t4, t3 # Increment counter                                      |
| sc.w t5, t4, 0(t2) # Store conditionally                                |
| bnez t5, barrier # Retry if SC failed                                   |
| exit_barrier:                                                           |
| lw t4, 0(t2) # Reload the barrier counter                               |
| bne t4, t6, exit_barrier # Wait until all harts have reached this point |
|                                                                         |
| # Termination                                                           |
| ecall # End program (simulated halt for each hart)                      |

### Using a custom instruction in the convert loop

Open the file ../BRISKI/software/assembly/lower\_upper\_byte\_custom.s

Listing - Text section : Convert loop with custom instruction

```
    # Character Conversion Loop
convert_loop:
    lb a1, 0(t1)             # Load character from array
    #beqz a1, finish         # End of string (null character), exit loop

    #li a2, 'a'              # Load 'a'
    #li a3, 'z'              # Load 'z'
    #blt a1, a2, next_char   # If char < 'a', not a lowercase letter
    #bgt a1, a3, next_char   # If char > 'z', not a lowercase letter

    # Convert to uppercase
    #li a4, 32               # ASCII difference between upper and lower case
    #sub a1, a1, a4          # Convert to uppercase
    lotoupcase a1, a1, x0    # Custom instruction: a1 = lotoupcase(a1)
    sb a1, 0(t1)             # Store back converted character

next_char:
    addi t1, t1, 1           # Move to next character
    addi t0, t0, -1          # Decrease character count
    bnez t0, convert_loop    # Continue loop if more characters
```
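As a cross-check for the assembly listings, the behaviour of the convert loop can be modelled in a few lines of C. This is only a sketch: the function names `lotoupcase_model` and `convert_buffer` are hypothetical and not part of the BRISKI sources, and the range check simply mirrors the `blt`/`bgt` pair of the original loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the custom instruction:
 * returns c - 32 if c is in 'a'..'z', otherwise c unchanged. */
uint32_t lotoupcase_model(uint32_t c)
{
    return (c < 'a' || c > 'z') ? c : c - 32u;
}

/* Models the convert loop: one load, one conversion and one store
 * per character, for exactly `count` characters. */
void convert_buffer(char *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        buf[i] = (char)lotoupcase_model((unsigned char)buf[i]);
    }
}
```

Calling `convert_buffer(text, length)` on a character array plays the role of one hart walking through its share of the string; characters just outside the range, such as '`' (96) and '{' (123), are left untouched.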


## Linker Description Script (.lds file)

Listing - Defining Memory layout for text and data (lower\_upper\_byte\_custom.lds)

```
/* Define memory regions */
MEMORY
{
    /* Define RAM and ROM memory regions with specific addresses and sizes */
    RAM (rwx) : ORIGIN = 0x00000200, LENGTH = 3072
    /* ROM region definition (not legible in this listing) */
}

/* Define the sections and their placement */
SECTIONS
{
    /* Place the .text section in ROM */
    .text : {
        *(.text) /* All .text sections from input files */
    } >ROM

    /* Place the .data section in RAM */
    .data : {
        *(.data) /* All .data sections from input files */
    } >RAM

    /* Additional sections can be added here */
}
```

# Makefile commands to generate executable instructions (.inst file)

### Custom path of your configured toolchain

- Open BRISKI/software/Makefile
- Update USR\_BIN to the custom install path of your riscv-gnu-toolchain (\$(HOME)/summer\_school/riscv-custom/newlib/bin)

Listing - Makefile commands to generate executable instructions (.inst file)

```
PROG?=lower_upper_byte
RUN_DIR?=runs
#USR_BIN?=/usr/bin
USR_BIN?=/home/riadh/tools/riscv-newlib-installpath/bin

hex_gen: clean compile_link objdump_elf
	python3 hexgen.py $(RUN_DIR)/$(PROG).asm $(RUN_DIR)/$(PROG).inst

compile_link:
	mkdir -p $(RUN_DIR)
	cd $(RUN_DIR) && $(USR_BIN)/riscv64-unknown-elf-gcc -march=rv32iazicsr -mabi=ilp32 -ffreestanding -nostdlib \
		-o $(PROG).elf -T ../assembly/$(PROG).lds ../assembly/$(PROG).s

objdump_elf: compile_link
	cd $(RUN_DIR) && $(USR_BIN)/riscv64-unknown-elf-objdump -mriscv:rv32 -d -j .text -s -j .data $(PROG).elf > $(PROG).asm
```

# Cloning and configuring the riscv-gnu-toolchain

#### Important Note


Skip this if you are using the provided Virtual Machine!

Listing - Pre-requisite packages

sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev device-tree-compiler

Listing - Cloning the riscv-gnu-toolchain

git clone --recurse-submodules https://github.com/riscv/riscv-gnu-toolchain.git

# Cloning and configuring the riscv-gnu-toolchain

Skip this if you are using the provided Virtual Machine! (prefix has already been configured to /home/user/summer\_school/riscv-custom/newlib)

Listing - Configuring and building the toolchain with the install prefix /home/user/summer\_school/riscv-custom/newlib :

cd riscv-gnu-toolchain

./configure --prefix=/home/user/summer\_school/riscv-custom/newlib

make -j\$(nproc)

Listing - Check the cross-compiler version

/home/user/summer\_school/riscv-custom/newlib/bin/riscv64-unknown-elf-gcc --version

Listing - The riscv-opcodes directory should contain all opcodes

git clone https://github.com/riscv/riscv-opcodes


# Understanding RISC-V base instruction formats

| Bits 31:25               | Bits 24:20 | Bits 19:15 | Bits 14:12 | Bits 11:7 | Bits 6:0 | Format |
|--------------------------|------------|------------|------------|-----------|----------|--------|
| funct7                   | rs2        | rs1        | funct3     | rd        | opcode   | R-type |
| imm[11:0] (bits 31:20)   |            | rs1        | funct3     | rd        | opcode   | I-type |
| imm[11:5]                | rs2        | rs1        | funct3     | imm[4:0]  | opcode   | S-type |
| imm[31:12] (bits 31:12)  |            |            |            | rd        | opcode   | U-type |

Figure 2.2: RISC-V base instruction formats. Each immediate subfield is labeled with the bit position (imm[x]) in the immediate value being produced, rather than the bit position within the instruction's immediate field as is usually done.


<sup>2.</sup> https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf


# Understanding RISC-V base instruction formats

| Bits 31:25                                             | Bits 24:20 | Bits 19:15 | Bits 14:12 | Bits 11:7          | Bits 6:0 | Format |
|--------------------------------------------------------|------------|------------|------------|--------------------|----------|--------|
| funct7                                                 | rs2        | rs1        | funct3     | rd                 | opcode   | R-type |
| imm[11:0] (bits 31:20)                                 |            | rs1        | funct3     | rd                 | opcode   | I-type |
| imm[11:5]                                              | rs2        | rs1        | funct3     | imm[4:0]           | opcode   | S-type |
| imm[12], imm[10:5]                                     | rs2        | rs1        | funct3     | imm[4:1], imm[11]  | opcode   | B-type |
| imm[31:12] (bits 31:12)                                |            |            |            | rd                 | opcode   | U-type |
| imm[20], imm[10:1], imm[11], imm[19:12] (bits 31:12)   |            |            |            | rd                 | opcode   | J-type |

Figure 2.3: RISC-V base instruction formats showing immediate variants.


<sup>3.</sup> https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf


### Understanding RISC-V custom instruction encoding

| inst[6:5] / inst[4:2] | 000    | 001      | 010          | 011      | 100    | 101      | 110            | 111 (> 32b) |
|-----------------------|--------|----------|--------------|----------|--------|----------|----------------|-------------|
| 00                    | LOAD   | LOAD-FP  | **custom-0** | MISC-MEM | OP-IMM | AUIPC    | OP-IMM-32      | 48b         |
| 01                    | STORE  | STORE-FP | custom-1     | AMO      | OP     | LUI      | OP-32          | 64b         |
| 10                    | MADD   | MSUB     | NMSUB        | NMADD    | OP-FP  | reserved | custom-2/rv128 | 48b         |
| 11                    | BRANCH | JALR     | reserved     | JAL      | SYSTEM | reserved | custom-3/rv128 | ≥ 80b       |

Table 24.1: RISC-V base opcode map, inst[1:0]=11


<sup>4.</sup> https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf
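To read the opcode map as a number: the 7-bit major opcode is the row (inst[6:5]), followed by the column (inst[4:2]), followed by the fixed bits inst[1:0]=11. The small C sketch below assembles it; the helper name `major_opcode` is hypothetical and only serves this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the 7-bit major opcode from the row (inst[6:5]) and the column
 * (inst[4:2]) of the base opcode map; inst[1:0] is fixed to 0b11. */
uint32_t major_opcode(uint32_t inst_6_5, uint32_t inst_4_2)
{
    assert(inst_6_5 < 4 && inst_4_2 < 8);
    return (inst_6_5 << 5) | (inst_4_2 << 2) | 0x3u;
}
```

For custom-0 (row 00, column 010) this gives 0b0001011 = 0x0B, which is the opcode value used for lotoupcase in the rest of the lecture; the regular OP group (row 01, column 100) gives 0x33.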


Listing - Some of the opcodes in /home/user/summer\_school/riscv-custom/riscv\_opcodes/rv\_i :

| # rv_i                                                       |
|--------------------------------------------------------------|
| lui rd imm20 6..2=0x0D 1..0=3                                |
| auipc rd imm20 6..2=0x05 1..0=3                              |
| jal rd jimm20 6..2=0x1b 1..0=3                               |
| jalr rd rs1 imm12 14..12=0 6..2=0x19 1..0=3                  |
| beq bimm12hi rs1 rs2 bimm12lo 14..12=0 6..2=0x18 1..0=3      |
| bne bimm12hi rs1 rs2 bimm12lo 14..12=1 6..2=0x18 1..0=3      |
| blt bimm12hi rs1 rs2 bimm12lo 14..12=4 6..2=0x18 1..0=3      |
| bge bimm12hi rs1 rs2 bimm12lo 14..12=5 6..2=0x18 1..0=3      |
| bltu bimm12hi rs1 rs2 bimm12lo 14..12=6 6..2=0x18 1..0=3     |
| bgeu bimm12hi rs1 rs2 bimm12lo 14..12=7 6..2=0x18 1..0=3     |
|                                                              |
| add rd rs1 rs2 31..25=0 14..12=0 6..2=0x0C 1..0=3            |
| sub rd rs1 rs2 31..25=32 14..12=0 6..2=0x0C 1..0=3           |
| sll rd rs1 rs2 31..25=0 14..12=1 6..2=0x0C 1..0=3            |
| slt rd rs1 rs2 31..25=0 14..12=2 6..2=0x0C 1..0=3            |
| sltu rd rs1 rs2 31..25=0 14..12=3 6..2=0x0C 1..0=3           |

We will follow the example of the add instruction, which takes three operands (rd, rs1 and rs2):

Listing - Adding a custom instruction in /home/user/summer\_school/riscv-custom/riscv\_opcodes/rv\_i :

#custom 0

lotoupcase rd rs1 rs2 31..25=1 14..12=0 6..2=2 1..0=3

We have to generate MASK and MATCH for the custom instruction.

Listing - Regenerating the encodings from riscv\_opcodes/rv\_i :

make

This will generate /home/user/summer\_school/riscv-custom/riscv\_opcodes/encoding.out.h. Check that file for:

Listing - from /home/user/summer\_school/riscv-custom/riscv\_opcodes/encoding.out.h :

#define MATCH\_LOTOUPCASE 0x200000b

#define MASK\_LOTOUPCASE 0xfe00707f
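The two generated constants follow directly from the field constraints in the rv\_i entry (31..25=1, 14..12=0, 6..2=2, 1..0=3). The C sketch below recomputes them by hand; it is not the riscv-opcodes generator itself, and the names `encoding_t`, `add_field` and `lotoupcase_encoding` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Accumulates MATCH and MASK from "hi..lo=value" field constraints. */
typedef struct {
    uint32_t match;
    uint32_t mask;
} encoding_t;

/* Adds one constraint: bits hi..lo of the instruction must equal value. */
void add_field(encoding_t *e, unsigned hi, unsigned lo, uint32_t value)
{
    assert(hi < 32 && lo <= hi);
    unsigned width = hi - lo + 1;
    uint32_t field_mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    assert((value & ~field_mask) == 0);  /* value must fit in the field */
    e->mask  |= field_mask << lo;        /* these bits are significant */
    e->match |= value << lo;             /* and must hold this value */
}

/* Constraints of: lotoupcase rd rs1 rs2 31..25=1 14..12=0 6..2=2 1..0=3 */
encoding_t lotoupcase_encoding(void)
{
    encoding_t e = {0, 0};
    add_field(&e, 31, 25, 1);  /* funct7 = 1 */
    add_field(&e, 14, 12, 0);  /* funct3 = 0 */
    add_field(&e, 6, 2, 2);    /* inst[6:2] = 00010 (custom-0) */
    add_field(&e, 1, 0, 3);    /* inst[1:0] = 11 */
    return e;
}
```

The result is match = 0x0200000b and mask = 0xfe00707f, the same values as in encoding.out.h. The rd, rs1 and rs2 fields contribute nothing because they carry no constraint.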


#### MASK

- Bits set to 1 in the MASK indicate positions that are significant and should be matched exactly, while bits set to 0 indicate positions that can vary.
- For example, consider a 32-bit instruction encoding. A MASK of 0xFFFFF000 means that the upper 20 bits of the instruction are significant for matching, while the lower 12 bits can vary without affecting the recognition of the instruction.

#### MATCH

- When decoding an instruction, the relevant bits (as indicated by the MASK) are extracted from the instruction, and if they match the bits specified by the MATCH value, the instruction is recognized as a specific operation.
- For example, MASK\_LOTOUPCASE = 0xfe00707f selects funct7, funct3 and the opcode. An instruction is recognized as lotoupcase only if those bits equal MATCH\_LOTOUPCASE = 0x200000b, whatever its rd, rs1 and rs2 fields contain.

#### How does the cross-compiler recognize a matching instruction?

When an instruction is encountered, its relevant bits (as filtered by the MASK) are compared with the MATCH value. If (instruction & MASK) == MATCH, then the instruction is recognized as the specific custom instruction.
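The test `(instruction & MASK) == MATCH` can be tried out on concrete instruction words. The sketch below is a stand-alone illustration, not binutils code; `encode_rtype` and `is_lotoupcase` are hypothetical helper names, and the register numbers assume the standard ABI mapping a1 = x11.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MATCH_LOTOUPCASE 0x0200000bu
#define MASK_LOTOUPCASE  0xfe00707fu

/* Hypothetical helper: places rd, rs1 and rs2 into an R-type word
 * whose fixed bits (funct7, funct3, opcode) are given by `match`. */
uint32_t encode_rtype(uint32_t match, unsigned rd, unsigned rs1, unsigned rs2)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return match | ((uint32_t)rd << 7) | ((uint32_t)rs1 << 15)
                 | ((uint32_t)rs2 << 20);
}

/* The recognition test: compare only the bits selected by the MASK. */
bool is_lotoupcase(uint32_t insn)
{
    return (insn & MASK_LOTOUPCASE) == MATCH_LOTOUPCASE;
}
```

With these helpers, `lotoupcase a1, a1, x0` encodes to 0x0205858b and is recognized, while `add a1, a1, x0` (0x000585b3) is rejected because its masked bits reduce to 0x33.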

Let's modify the binutils files. Add the following lines to /home/user/summer\_school/riscv-custom/riscv-gnu-toolchain/binutils/include/opcode/riscv-opc.h (the + sign marks added lines and must not be copied into your file):

Listing - adding the instruction to the riscv-opc.h

/\* Instruction opcode macros. \*/

\+ #define MATCH\_LOTOUPCASE 0x200000b

\+ #define MASK\_LOTOUPCASE 0xfe00707f

#define MATCH\_SLLI\_RV32 0x1013

// [...]

#endif /\* RISCV\_ENCODING\_H \*/

#ifdef DECLARE\_INSN

\+ DECLARE\_INSN(lotoupcase, MATCH\_LOTOUPCASE, MASK\_LOTOUPCASE)


The related C source file (/home/user/summer\_school/riscv-custom/riscv-gnu-toolchain/binutils/opcodes/riscv-opc.c) needs to be updated too (the + sign marks added lines and must not be copied into your file):

Listing - adding the instruction to the riscv-opc.c (in the riscv\_opcodes array)

/\* name, xlen, isa, operands, match, mask, match\_func, pinfo. \*/

\+ {"lotoupcase", 0, INSN\_CLASS\_I, "d,s,t", MATCH\_LOTOUPCASE, MASK\_LOTOUPCASE, match\_opcode, 0 },


# Implementing the custom instruction in the cross-compiler

Listing - rerun make

cd /home/user/summer\_school/riscv-custom/riscv-gnu-toolchain

make clean

make -j\$(nproc)

If you assigned 4 processors to your VM, you can run: make -j4

This will take a while. Grab a coffee!


# Checking the custom instruction using the updated cross-compiler

Listing - Sample test

```
// Use this sample code to test your custom instruction:
#include <stdio.h>

int main() {
    int a, b, c;
    a = 'a';
    b = 0;
    asm volatile (
        "lotoupcase %[z], %[x], %[y]\n\t"
        : [z] "=r" (c)              // output: rd
        : [x] "r" (a), [y] "r" (b)  // inputs: rs1, rs2
    );
    return 0;
}
```

Listing - compile using the newly added custom instruction

```
/home/user/summer_school/riscv-custom/newlib/bin/riscv64-unknown-elf-gcc prog.c -o prog
file prog
```

#### Congratulations! You just compiled a program using your first custom instruction!


### Hands-on Lab

Add another custom instruction and recompile the toolchain

Repeat the previous steps to implement a custom instruction that performs the opposite conversion: ASCII characters from upper to lower case.

Recompiling the toolchain will take about 30 minutes, depending on your machine.

Let's launch the recompilation before the coffee break!

#### Files that you will use/modify

- ../riscv-custom/riscv\_opcodes/rv\_i
- ../riscv-custom/riscv-opcodes/encoding.out.h
- ../riscv-custom/riscv-gnu-toolchain/binutils/include/opcode/riscv-opc.h
- ../riscv-custom/riscv-gnu-toolchain/binutils/opcodes/riscv-opc.c


# Remember BRISKI? (Barrel RISC-V for Kilo-core Implementations)



Figure - BRISKI Barrel Processor Architecture.

#### RTL modules to be updated

- control\_unit.sv, alu\_control.sv and alu.sv
- Do not forget riscv-pkg.sv, where all parameters reside.


# Modifying riscv-pkg.sv

Listing - riscv\_pkg.sv

| // ALU specific params                                                            |
|-----------------------------------------------------------------------------------|
| parameter int ALUOP_WIDTH = 4;                                                    |
| parameter logic [ALUOP_WIDTH-1:0] ADD_OP = 4'b0000;                               |
| parameter logic [ALUOP_WIDTH-1:0] SUB_OP = 4'b0001;                               |
| parameter logic [ALUOP_WIDTH-1:0] OR_OP = 4'b1000;                                |
| parameter logic [ALUOP_WIDTH-1:0] AND_OP = 4'b1001;                               |
| parameter logic [ALUOP_WIDTH-1:0] XOR_OP = 4'b0101;                               |
| parameter logic [ALUOP_WIDTH-1:0] PASS_OP = 4'b1010;                              |
| parameter logic [ALUOP_WIDTH-1:0] SLT_OP = 4'b0011;                               |
| parameter logic [ALUOP_WIDTH-1:0] SLTU_OP = 4'b0100;                              |
| parameter logic [ALUOP_WIDTH-1:0] SLL_OP = 4'b0010;                               |
| parameter logic [ALUOP_WIDTH-1:0] SRL_OP = 4'b0110;                               |
| parameter logic [ALUOP_WIDTH-1:0] SRA_OP = 4'b0111;                               |
| parameter logic [ALUOP_WIDTH-1:0] LOTOUPC_OP = 4'b1011; // Lower to Upper case OP |


# Modifying control\_unit.sv

Listing - control\_unit.sv

```
// custom-0-type instructions
7'b0001011: begin
  o_WBSel   = 2'b01;  // We need to select the output from the ALU.
  o_rregWE  = 1'b1;   // We need to enable writes to the register file.
  o_ALUctrl = 3'b100; // We need to select the desired custom instruction in the alu_control decoder.
end
```


# Modifying alu\_control.sv

Listing - alu\_control.sv

```
3'b100: // custom-0
  case (i_funct3)
    3'b000:
      case (i_funct7)
        7'b0000001: o_ALUOp <= LOTOUPC_OP; // lotoupcase
        default:    o_ALUOp <= '0;
      endcase
    default: o_ALUOp <= '0; // Undefined operation
  endcase
default: o_ALUOp <= '0; // Undefined operation
```
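The decode path for the custom instruction (opcode 0001011 in the control unit, then funct3 and funct7 in the ALU control) can be summarised by a small C model. This is a sketch for reasoning about the decode only: the function names are hypothetical, and the model collapses the control unit and the ALU control into a single function.

```c
#include <assert.h>
#include <stdint.h>

/* ALU operation codes, mirroring the 4-bit parameters of the package.
 * The '0 default of the RTL is modelled as UNDEFINED_OP (all zeros). */
enum { UNDEFINED_OP = 0x0, LOTOUPC_OP = 0xB /* 4'b1011 */ };

/* Field extraction for a 32-bit R-type instruction word. */
uint32_t get_opcode(uint32_t insn) { return insn & 0x7Fu; }
uint32_t get_funct3(uint32_t insn) { return (insn >> 12) & 0x7u; }
uint32_t get_funct7(uint32_t insn) { return (insn >> 25) & 0x7Fu; }

/* Models the nested case statements for the custom-0 opcode only. */
uint32_t decode_aluop(uint32_t insn)
{
    if (get_opcode(insn) != 0x0Bu)  /* not custom-0 (7'b0001011) */
        return UNDEFINED_OP;
    if (get_funct3(insn) != 0x0u)   /* only funct3 = 3'b000 is used */
        return UNDEFINED_OP;
    if (get_funct7(insn) != 0x01u)  /* lotoupcase has funct7 = 7'b0000001 */
        return UNDEFINED_OP;
    return LOTOUPC_OP;
}
```

For the word 0x0205858b (`lotoupcase a1, a1, x0`) the model returns LOTOUPC\_OP. Leaving funct3 and funct7 in the decode keeps room for further custom-0 instructions, such as the upper-to-lower variant of the hands-on lab.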


### Modifying alu.sv

Listing – alu.sv

```
logic [DWIDTH-1:0] o_result_lotoupc;

always_comb begin
  shamt = i_op2[4:0];
end

always_comb begin
  o_result_add     = (i_aluop == ADD_OP) ? i_op1 + i_op2 : 0;
  o_result_sub     = (i_aluop == SUB_OP) ? i_op1 - i_op2 : 0;
  o_result_sll     = (i_aluop == SLL_OP) ? i_op1 << shamt : 0;
  o_result_xor     = (i_aluop == XOR_OP) ? i_op1 ^ i_op2 : 0;
  o_result_or      = (i_aluop == OR_OP) ? i_op1 | i_op2 : 0;
  o_result_and     = (i_aluop == AND_OP) ? i_op1 & i_op2 : 0;
  o_result_pass    = (i_aluop == PASS_OP) ? i_op2 : 0;
  o_result_srl_sra = (i_aluop == SRL_OP || i_aluop == SRA_OP) ? temp : 0;
  // Lower to upper case: subtract 32 only if i_op1 is in 'a' (97) .. 'z' (122)
  o_result_lotoupc = (i_aluop == LOTOUPC_OP) ? (((i_op1 < 97) || (i_op1 > 122)) ? i_op1 : i_op1 - 32) : 0;
end

always_ff @(posedge clk) begin
  // Only the selected result is non-zero, so the XOR chain acts as a multiplexer
  o_result <= o_result_add ^ o_result_sub ^ o_result_sll ^ o_result_xor ^ o_result_srl_sra ^ o_result_or ^
              o_result_and ^ o_result_pass ^ o_result_lotoupc;
end
```
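The ALU computes one gated result per operation and merges them with XOR. Since every result except the selected one is forced to zero, the XOR chain behaves like a multiplexer. The C sketch below models this with three of the operations only; `alu_model` is a hypothetical name and the operation codes are copied from the package listing.

```c
#include <assert.h>
#include <stdint.h>

enum { ADD_OP = 0x0, SUB_OP = 0x1, LOTOUPC_OP = 0xB };

/* Models three of the gated ALU results and their XOR merge. */
uint32_t alu_model(uint32_t aluop, uint32_t op1, uint32_t op2)
{
    uint32_t r_add = (aluop == ADD_OP) ? op1 + op2 : 0;
    uint32_t r_sub = (aluop == SUB_OP) ? op1 - op2 : 0;
    uint32_t r_lotoupc = (aluop == LOTOUPC_OP)
        ? ((op1 < 97 || op1 > 122) ? op1 : op1 - 32)
        : 0;
    /* At most one term is non-zero, so XOR simply forwards it. */
    return r_add ^ r_sub ^ r_lotoupc;
}
```

Adding a new operation to this scheme only requires one more gated term and one more XOR input, which is exactly what the o\_result\_lotoupc line does in the RTL.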


### Adding the custom instruction to the software simulator for testing

Listing - Where to add a custom instruction in the BRISKI software simulator.

```
switch (opcode) {
1
                // -- This a custom instruction that reads the contents of register[rs1], check if it is a lower case
2
                      letter then
                // subtracts 32 to make it upper case
3
            case OxOB: // custom-0 type (opcode = 0b0001011)
4
                    switch (funct3) {
5
                           case 0x0: //(funct3 = 0b000)
6
7
                                   if (funct7=0x01) \int \frac{1}{(funct7=0b0000000)} lower to upper case byte
                                       //if ((registers[hart_id][rs1] > 'z') && (registers[hart_id][rs1] <'a')) {// not a</pre>
8
                                             lower case
                                       if ((registers[hart id][rs1] > 122) || (registers[hart id][rs1] < 97)) {// not a
9
                                             lower case
10
                                       } else {
                                           registers[hart_id][rd] = registers[hart_id][rs1] - 32;
11
12
                                       3
13
14
                                   break:
                           default : : break:
15
                    3
16
17
                pc[hart_id]+=4:
18
                break:
```


### Adding the custom instruction to the software simulator for testing

Listing - running the automated check

| \$ pwd                           |
|----------------------------------|
| /[BRISKI]                        |
| \$ cd hardware/simul/verilator/  |
| \$ make check\_all               |

#### How is it checked?

- If everything compiles correctly, the last command generates a memory dump of the RTL design using Verilator (rtl\_memory.txt) and, after g++ compilation, another memory dump from the software simulator (./simulation\_model/memory.txt).
- A simple diff command then compares both memory dumps.
- If they match, you will get an OK; otherwise, a failure message is displayed.
- If it fails, try vimdiff to check which memory addresses differ. This can give you hints for debugging. GOOD LUCK!
- If it succeeds, you have added your first custom instruction. CONGRATULATIONS!
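The comparison step can also be reproduced by hand. The C sketch below is not the check used by the BRISKI Makefile (which calls diff); it is a hypothetical helper, `first_mismatch`, that reports the first line on which two text dumps differ, under the assumption that every line of a dump is shorter than 256 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 if both text files are identical, the 1-based number of the
 * first differing line otherwise, or -1 if a file cannot be opened. */
long first_mismatch(const char *path_a, const char *path_b)
{
    assert(path_a != NULL && path_b != NULL);
    FILE *fa = fopen(path_a, "r");
    FILE *fb = fopen(path_b, "r");
    long result = 0;

    if (fa == NULL || fb == NULL) {
        result = -1;
    } else {
        char la[256], lb[256];
        long line = 0;
        for (;;) {
            char *ra = fgets(la, sizeof la, fa);
            char *rb = fgets(lb, sizeof lb, fb);
            line++;
            if (ra == NULL && rb == NULL)
                break;              /* both files ended: identical */
            if (ra == NULL || rb == NULL || strcmp(la, lb) != 0) {
                result = line;      /* length or content differs */
                break;
            }
        }
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

Called on rtl\_memory.txt and ./simulation\_model/memory.txt, a non-zero result points at the first line to inspect, which corresponds to a memory address if the dumps list one word per line.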


# Advanced follow-up Lab

#### A more challenging example

Take this challenge if :

- Adding a custom instruction was a piece of cake for you.
- You crave challenges and enjoy struggles in your life.

Your next task would be to add a more efficient custom instruction. This instruction allows you to :

- Select either upper-to-lower or lower-to-upper case conversion. You can use the second source register, rs2, to specify the desired behavior.
- Convert up to four bytes in one go. You can also use rs2 to specify how many bytes to convert, starting from the word-aligned address provided in rs1.
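Before touching the RTL, it can help to fix the intended semantics in a software reference model. The C sketch below is one possible starting point and makes several assumptions that the lab leaves open: it works on a 32-bit word that has already been loaded, it uses bit 0 of rs2 as the direction (0 = lower to upper, 1 = upper to lower) and bits [3:1] as the byte count, and it converts bytes starting from the least significant one. The name `convert_word_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model: converts up to four bytes of `word`.
 * Assumed rs2 layout: bit 0 = direction, bits [3:1] = number of bytes. */
uint32_t convert_word_model(uint32_t word, uint32_t rs2)
{
    uint32_t to_lower = rs2 & 0x1u;
    uint32_t count = (rs2 >> 1) & 0x7u;
    if (count > 4)
        count = 4;                      /* a word holds four bytes at most */

    for (uint32_t i = 0; i < count; i++) {
        uint32_t shift = 8u * i;
        uint32_t byte = (word >> shift) & 0xFFu;
        if (to_lower) {
            if (byte >= 'A' && byte <= 'Z')
                byte += 32u;
        } else {
            if (byte >= 'a' && byte <= 'z')
                byte -= 32u;
        }
        /* write the converted byte back into its lane */
        word = (word & ~(0xFFu << shift)) | (byte << shift);
    }
    return word;
}
```

With this layout, the word 0x64636261 ("abcd" in little-endian order) and rs2 = 4 (two bytes, lower to upper) yields 0x64634241. The same model can later be dropped into the software simulator so that the automated check compares it against your RTL.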

Happy Hacking!


# Thank you for your attention!