

#### **Practical System-on-Chip**

#### OSHUG #17: 29 March 2012 Julius Baxter, Jeremy Bennett, opencores.org



#### Overview

- Introduction
  - Systems On A Chip, IP cores, HDL, Implementation technologies (FPGA)
- OpenCores & OpenRISC Project
- Using ORPSoC
- Compiling software for OpenRISC bare metal



#### System On A Chip



## System on a Chip (SoC)

- Integrate many functions onto a single silicon die
- Result of increased IC consolidation
- Enabled by improvements in VLSI process
- Modern high-end SoCs integrate:
  - µ/DSP/graphics processors
  - Memory controllers
    - DRAM, flash
  - Communications
    - USB, Ethernet, I2C
  - Bespoke processing, I/O



Image source: http://spectrum.ieee.org/semiconductors/design/crossroads-for-mixedsignal-chips



### Chip Implementation Process

- ASIC: Application Specific Integrated Circuit
- The modern process for purely digital ASICs (no major analog circuitry on chip) is relatively straightforward
- Most chip design houses are *fabless* – they do not own or operate their own manufacturing facilities





Copyright © 2012 Julius Baxter and Jeremy Bennett. Freely available under a Creative Commons license.

Source: http://www.geek.com/images/procspecs/p4/p4-13wafer.jpg



#### FPGA Development Process, In Comparison





#### IP Cores in SoCs



#### System Design with IP cores

- Decide on function and rough breakdown of how to achieve that
- Select units of semiconductor intellectual property (*IP cores*) for the job at hand



A simplified block diagram of Fujitsu's Mobile WiMAX baseband SoC.

IP cores are units intended to provide a specific functionality and:

- usually provide a control mechanism via a standardised protocol (typically over a memory-mapped bus)
- are delivered in an electronic format (normally files with hardware description language, or HDL, code) which can be built and tested with the rest of the system
- are ultimately synthesised (combined) into the overall chip design so become part of the whole chip



#### **IP** Cores

- IP core design houses aim to design IP for maximum reuse (an industry mantra). This means:
  - Very configurable (heavy use of parameterised options)
  - Try to be application-generic

- Along with reuse and configurability are requirements for:
  - a demonstrably verified and 'proven' design (a bug that forces a re-fab can cost anywhere from \$10k to over \$1M)
  - Maximum area and power efficiency



### Modern Implementations

- Typically microprocessor based, with on-chip communication done via an internal bus



- Many on-chip bus standards:
  - Wishbone
  - ARM AMBA
  - OCP
  - IBM CoreConnect
- A growing number of network-on-chip (NoC) interconnects, too





### SoC Interconnect

- The SoC interconnect is what ties the IP blocks together
- Usually a memory-mapped bus providing access to control registers and memories





#### SoC Development Summary

- A large part of SoC design work is combination and verification (testing) of a set of IP
  - Confirm it's implementable
  - Confirm it works in your particular configuration
  - Confirm it has sufficient capacity for the intended application



Source: http://www.eurekatech.com/



#### **Describing The Hardware**



## Describing the design

- IP blocks for implementation in modern digital VLSI process are usually designed using a *hardware description language*
  - VHDL
  - Verilog
  - SystemVerilog increasingly



From Computer Desktop Encyclopedia © 2004 The Computer Language Co. Inc.



### Hardware Description Languages

- Crucially allow the description of
  - Synchronous elements
  - Combinatorial logic



- Synchronous elements such as registers (flip-flops) store the state of the incoming signal based on the rising/falling of the clock
- Combinatorial elements comprise the logical functions between synchronous elements



## A synchronous element in Verilog

- Samples the 'D' input on each rising edge of the clock; the sampled value appears on the output 'Q' a short time afterward and remains valid until the next rising clock edge
- The simplest memory element, also considered a 1-cycle delay

```
wire logic_a_output; /* assigned elsewhere */
wire logic_b_input;
reg q;

/* q is updated only on the rising edge of the clock */
always @(posedge clock)
  q <= logic_a_output; /* the 'D' input */

assign logic_b_input = q;
```
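The register behaviour can also be pictured in software, much as a cycle-based simulator would model it. The following is only a sketch in C: the type `dff_t` and the function `dff_posedge` are hypothetical names invented for this illustration, not part of any OpenCores tool.

```c
#include <stdbool.h>

/* Hypothetical software model of a single D-type flip-flop. */
typedef struct {
    bool d; /* input, driven by the upstream combinatorial logic */
    bool q; /* stored state, seen by the downstream logic */
} dff_t;

/* Model one rising clock edge: q takes the value d held just before
 * the edge. Between edges, changes on d have no effect on q. */
void dff_posedge(dff_t *ff)
{
    ff->q = ff->d;
}
```

Changing `d` without calling `dff_posedge` leaves `q` untouched, which is the 1-cycle delay described above.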





#### Combinatorial logic in Verilog

- Essentially any logic, arithmetic or conditional operator

      assign d = !(a | b);
      assign e = b & c;
      assign q = d | e;
      assign f = e ? a : q;
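For readers more at home in software, the four continuous assignments above can be mirrored as a pure function of the inputs. This is a sketch only, assuming all signals are single-bit; `comb_out_t` and `comb_eval` are hypothetical names chosen for the illustration.

```c
#include <stdbool.h>

/* Hypothetical container for the four single-bit outputs. */
typedef struct {
    bool d, e, q, f;
} comb_out_t;

/* Evaluate the combinatorial network once for a given set of inputs.
 * There is no stored state: the outputs depend only on a, b and c. */
comb_out_t comb_eval(bool a, bool b, bool c)
{
    comb_out_t o;
    o.d = !(a || b);     /* assign d = !(a | b);    */
    o.e = b && c;        /* assign e = b & c;       */
    o.q = o.d || o.e;    /* assign q = d | e;       */
    o.f = o.e ? a : o.q; /* assign f = e ? a : q;   */
    return o;
}
```

Unlike the C function, which runs only when called, the hardware re-evaluates these expressions continuously whenever an input changes.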





## Verilog HDL

- Provides ways of describing the propagation of time and signal dependencies (sensitivity)
- Block-based structure





### Verilog Module Instantiations

- As design for re-use is emphasised, modular design is important.
- Abstraction and organisation are achieved by organising smaller, repeatable parts of a design into modules, much like functions in any other language

```
module oshug_instantiatons (
    input         clock,
    input         reset,
    input         enable,
    input  [15:0] operand_a,
    input  [15:0] operand_b,
    output [15:0] combined_output,
    output        output_valid
    );

   wire [15:0] result0, result1;
   wire        valid0, valid1;

   /* Two instances of the same module, each with its own operands */
   oshug oshug0 (.clock(clock),
                 .reset(reset),
                 .enable(enable),
                 .operand_a(operand_a),
                 .operand_b(16'h5555),
                 .calculated_output(result0),
                 .output_valid(valid0));

   oshug oshug1 (.clock(clock),
                 .reset(reset),
                 .enable(enable),
                 .operand_a(operand_b),
                 .operand_b(16'hcccc),
                 .calculated_output(result1),
                 .output_valid(valid1));

   /* Upper byte from instance 1, lower byte from instance 0 */
   assign combined_output = {result1[15:8], result0[7:0]};
   assign output_valid = valid0 & valid1;

endmodule // oshug_instantiatons
```
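The concatenation `{result1[15:8], result0[7:0]}` in the module above may be unfamiliar to software developers. A rough C equivalent, using masks in place of Verilog bit slices, is sketched below; `combine_output` is a hypothetical helper name used only here.

```c
#include <stdint.h>

/* Take the upper byte of result1 and the lower byte of result0,
 * mirroring the Verilog concatenation {result1[15:8], result0[7:0]}. */
uint16_t combine_output(uint16_t result0, uint16_t result1)
{
    return (uint16_t)((result1 & 0xFF00u) | (result0 & 0x00FFu));
}
```

For example, `combine_output(0x1234, 0xABCD)` yields `0xAB34`.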



#### Levels of Abstraction

- Verilog can be used to describe a design at a low or high level of abstraction
- An example of quite low level Verilog is a gate-level netlist, which describes each and every atomic cell in a design and the interconnections between them
- Higher level design is achieved through the use of Verilog's arithmetic operators, which can infer rather complex logic (multipliers, adders), or case statements on wide busses, which can also infer large amounts of logic (multiplexors)
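The gap between the two levels can be illustrated with a 16-bit adder. The sketch below is in C rather than Verilog and is only an analogy: `full_adder`, `add16_gate_level` and `add16_high_level` are hypothetical names. The first two functions spell out the individual gates, roughly as a gate-level netlist would; the third uses a single arithmetic operator and leaves the structure to be inferred.

```c
#include <stdint.h>

/* One full-adder cell built only from primitive gates (xor, and, or).
 * a, b and cin must each be 0 or 1. */
unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout)
{
    unsigned axb = a ^ b;
    *cout = (a & b) | (axb & cin);
    return axb ^ cin;
}

/* Low level of abstraction: a 16-bit ripple-carry adder assembled
 * from 16 full-adder cells, one per bit. */
uint16_t add16_gate_level(uint16_t a, uint16_t b)
{
    uint16_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 16; i++) {
        unsigned bit = full_adder((a >> i) & 1u, (b >> i) & 1u, carry, &carry);
        sum |= (uint16_t)(bit << i);
    }
    return sum;
}

/* High level of abstraction: one operator describes the same function. */
uint16_t add16_high_level(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

Both functions compute the same result; a synthesis tool given the high-level form is free to choose whichever adder structure suits the target technology.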



### Register Transfer Level

- Most common level at which design is described is the *register transfer level* (RTL)
- Synthesisable Verilog is commonly referred to as RTL
- RTL? A description of the values signals should take on when the clock, or other signals, change their values
- Synchronous, or clock-based, behaviour results in flip-flops/registers being used to implement the design
- Combinatorial logic, or descriptions of logical functions, is implemented in fundamental logic components in hardware (not, and, or, xor etc.)





#### Implementation



# Getting the design onto the silicon

 Code in a synthesisable subset of the language supported by the synthesis tool in



- You are ultimately intending to implement your logic with a particular *technology*
- Technology in this case refers to the ASIC process or FPGA which will have the design 'put on it' – each provides a library of cells which can be used to implement the design



### FPGA Technology

- Field Programmable Gate Array
- An array of multi-purpose logic which can be configured to create (within reason) arbitrary logical functions and interconnections between them

- American FPGA vendor Xilinx builds its devices from thousands of lookup tables (LUTs), through which signals are routed to emulate the logic functions the designer describes in the HDL



### FPGAs

- FPGAs contain arrays of the bread-and-butter of digital logic (combinatorial and sequential logic elements in the LUTs) as well as:
  - Routing interconnect
  - RAMs
  - I/O hardware
  - Clock generation

- They are reconfigurable, so can be applied to a wide variety of designs
- Have area, power and operating frequency disadvantages when compared to ASICs
- Often used for ASIC prototyping



### FPGAs

- Are now big enough to implement systems-on-chip (consisting of processor, memory, I/O, accelerators) capable of running Linux distros
- How do we get from Verilog to an FPGA configuration file?





### HDL Synthesis

- Each ASIC process, and each FPGA generation and family, provides implementation components in the form of gates and macro cells (RAMs, adders, multipliers)
- The synthesis tool must be aware of this and optimise and map the design described in the HDL to the targeted technology
- The synthesised design is described in the synthesisable subset of the HDL



# SoC design with HDL source

- The technology an IP block will be used on is determined by SoC designers
- IP designers don't know this and must design accordingly

- IP blocks have a lot of 'tunability' for synthesis (removing unwanted, unnecessary or unimplementable features)
  - Verilog has a preprocessor, so features can be selected with a set of defines, equivalent to C #defines
  - Verilog parameters are similar and are a 'compile-time' (synthesis-time) option



#### **Open Source Digital Design**



## **OpenCores and OpenRISC**



#### Overview of OpenCores

 147,001 registered users reported as of 28 March 2012

919 projects as of 28 March 2012





### The OpenRISC 1000 Project

- Objective to develop a *family* of open source RISC designs
  - 32 and 64-bit architectures
  - floating point support
  - vector operation support
- Key features
  - fully free and open source
  - linear address space
  - register-to-register ALU operations
  - two addressing modes
  - delayed branches
  - Harvard or Stanford memory MMU/cache architecture
  - fast context switch
- Looks rather like MIPS or DLX



## The OpenRISC 1200



- 32-bit Harvard RISC architecture
  - MIPS/DLX like instruction set
  - first in OpenRISC 1000 family
  - originally developed 1999-2001
- Open source under the
  - GNU Lesser General Public License
  - allows reuse as a component
- Configurable design
  - caches and MMUs optional
  - core instruction set
- Source code Verilog 2001
  - approx 32k lines of code
- Full GNU tool chain and Linux port
  - various RTOS ported as well





- Combined reference implementation and board adaptations
- Reference implementation minimal SoC for processor testing, development
  - compilable into cycle-accurate model
- Board ports target multiple technologies
- Lowers barrier to entry for OpenRISC-based SoC design
  - Push-button compile flow
  - Largely utilises open-source EDA tools



### Hardware Development

- Objective is to use an open source EDA tool chain
  - back end tools for FPGA all proprietary
    - free (as in beer) versions available
  - front end tools now have open source alternatives
- OpenRISC 1000 simulation models
  - Or1ksim: golden reference ISS
    - C/SystemC interpreting ISS, 2-5 MIPS
  - Verilator cycle accurate model from the Verilog RTL
    - 130kHz in C++ or SystemC
  - Icarus Verilog event driven simulation
    - 1.4kHz, 50x slower than commercial alternatives
- All OpenRISC 1000 simulation models suitable for SW use
  - all support GDB debug interface



#### The OpenRISC 1000 Tool Chain



# The Software Tool Chain

- A standard GNU tool chain
  - binutils 2.20.1
  - gcc 4.5.1
  - gdb 7.3 (for BCS use only!)
  - C and C++ language support
- Library support
  - static libraries only
  - newlib 1.18.0 for bare metal (or32-elf-\*)
  - uClibc 0.9.32 for Linux applications (or32-linux-\*)
- Testing
  - regression tested using Or1ksim (both tool chains)
  - or32-linux-\* regression tested on hardware
  - or32-elf-\* regression tested on a Verilator model



### Board and OS Support

- Boards with BSP implementations
  - Or1ksim
  - Xilinx ML501, Terasic Altera DE-2, DE0-nano, ...
- RTOS support
  - FreeRTOS, RTEMS and eCos all ported
- Linux support
  - adopted into Linux 3.1 kernel mainline
  - some limitations (kernel debug, ptrace)
  - BusyBox as application environment
- Debug interfaces
  - JTAG for bare metal
  - *gdbserver* over Ethernet for Linux applications



#### Software Development Remote Connection to GDB





    (gdb) target remote :51000



# Building the Tool Chain

- Download the source:

```
svn co http://opencores.org/ocsvn/openrisc/openrisc/trunk/or1ksim
svn co http://opencores.org/ocsvn/openrisc/openrisc/trunk/gnu-src
cd gnu-src; git clone git://git.openrisc.net/jonas/uClibc; cd ..
cd gnu-src; git clone git://git.openrisc.net/jonas/linux; cd ..
```

- Build and install Or1ksim

```
cd or1ksim; mkdir bd; cd bd
../configure --target=or32-elf32 --prefix=/opt/or1ksim-new
make; make install; make pdf; cd ../..
```

- Build and install the tool chains into /opt/or32-new

- You can then use the tools to build BusyBox and Linux
  - see http://opencores.org/or1k/OR1K:Community\_Portal



- Stefan Wallentowitz at TUM
  - multicore version of OpenRISC 1200
  - student working on LLVM
- Pete Gavin
  - bringing the GNU tool chain up to date
- Ruben Diez
  - automated nightly builds
  - common test platform for models and HW



#### ORPSoC OpenRISC Reference Platform System-on-Chip



## Two Sides Of ORPSoC

- "Reference build"
  - Processor testing platform
  - Not technology targeted
  - Can build fast cycle-accurate model

- "Board builds"
  - Targeted at particular FPGA boards
  - Live in own subproject
  - Intended to provide "push-button" synthesis flows



### Intended Audience And Uses Of ORPSoC

- Provides framework for users to experience digital design and, hopefully, get commodity FPGA development boards running an open source system
  - FPGA majors provide non-free (libre and beer) SoC implementations :(
- Potential uses are numerous but use cases for bespoke processing or I/O are common
  - Develop an IP for SHA256/DSP/motor control and instantiate multiple on FPGA along with OR1200 running Linux, connected via ethernet



## **ORPSoC** Reference Build

- Simplest useful system for processor verification
  - "On-chip" memory
  - Debug interface
    - Can master bus
  - UART
  - Interrupt generator






#### Using ORPSoC



# Running the reference design

- Can execute a basic test, running the CPU through bootup, into a main() loop and immediately exiting

      orpsocv2$ cd sim/run
      run$ make rtl-test TEST=or1200-simple VCD=1

- Setting VCD=1 creates a dump of the internal signals, which can be viewed in a waveform viewer such as GTKWave

# GTKWave viewing or1200-simple.vcd

Figure: GTKWave displaying out/or1200-simple.vcd (trace from 0 to 222630 ns, cursor at 221277 ns). Signals shown: clk\_pad\_i, wbm\_i\_or12\_ack\_i, wbm\_i\_or12\_adr\_o[31:0], wbm\_i\_or12\_dat\_i[31:0], wbm\_i\_or12\_stb\_o, id\_pc[31:0], ex\_pc[31:0], wb\_pc[31:0], a[31:0], b[31:0], ex\_insn[31:0], alu\_dataout[31:0], flag\_we\_alu and wb\_freeze.



### ORPSoC software

- Running a test will cause software to be built. This includes:
  - A simple C library containing basically rand() and printf()
  - Low-level CPU functions for features like interrupts and timers
  - The boot code (OR1K assembly, crt0.S)
  - Some application code
- It is all compiled and converted into an appropriate format for loading into the simulator.



# ORPSoC simulation directories

sim\$ tree -L 2

- All sims launched from sim/run but output generated in sim/out
- Some intermediate files generated in sim/run
- VCD in sim/out
- The memory image of the program that was executed is in sim/run/sram.vmem (a symlink to it)





#### **ORPSoC Board Ports**

- A sub-project of ORPSoC intended to be built and run on a specific FPGA system
- Contained under boards/ directory, then sorted by FPGA vendor







### Adapting ORPSoC

- Inherent modularity of SoC designs makes it relatively straightforward to add or remove features
- Removing features is usually as simple as commenting out a `define

```
`define JTAG_DEBUG
// `define RAM_WB
// `define XILINX_SSRAM
`define CFI_FLASH
`define XILINX_DDR2
`define UART0
`define GPIO0
// `define SPI0
`define I2C0
`define I2C1
`define ETH0
`define ETH0_PHY_RST
```

A defines file in boards/xilinx/ml501/rtl/verilog/include/orpsoc-defines.v



# Configuring IP Cores

- IP blocks usually have their own configuration information
- A Verilog `defines header is usually used to store config
  - Parameters on the instantiations are preferred as it allows multiple instances with differing configurations

        // Do not implement Data cache
        //`define OR1200_NO_DC

        // Do not implement Insn cache
        //`define OR1200_NO_IC

        // Do not implement Data MMU
        //`define OR1200_NO_DMMU

        // Do not implement Insn MMU
        //`define OR1200_NO_IMMU

        // Size/type of insn/data cache if implemented
        // (consider available FPGA memory resources)
        //`define OR1200_IC_1W_16KB
        `define OR1200_DC_1W_16KB
        `define OR1200_DC_1W_32KB

        // Implement optional l.div/l.divu instructions
        // By default divide instructions are not implemented
        // to save area.
        `define OR1200_DIV_IMPLEMENTED



# Adding new cores is a little more involved

- Instantiate the core in top-level
- Attach to bus
- Provide some software (driver/test)
- Attach I/O (if any)
  - Add constraints to back-end scripts





# Adapting ORPSoC for new boards

- Be aware of FPGA technology (family, variant) and whether existing clocking and memory components can be used
- Ensure the new pin mapping is applied (which signals from ORPSoC go to which pins on the device/board)
- Check design fits on device (quick synthesis check)



# Debug infrastructure on boards

- Two debug options
  - "Mohor" SoC debug IF
  - Advanced debug IF
- Both use JTAG physical layer
  - Mohor adds own JTAG TAP and needs 4 extra pins
  - adv\_debug\_if can use FPGA's TAP and save pins

 Be sure to determine debug solution!



Diagram of Mohor Debug Interface Connecting to ORPSoC



### **ORPSoC** synthesis flow

 Example flow based on Xilinx tools





### Compiling Software For The Bare Metal



### Tool chain

- GNU tool chain for both bare metal and Linux userspace programs
- Bare metal tool chain relies on newlib for its C library

- Newlib's libgloss handles low-level interaction (it is supposed to implement syscall support)
- OR1K libgloss is designed for bare metal usage



# Adding new port to or32 libgloss

- A single object file must be compiled which contains some symbols defining, e.g.
  - Clock frequency of design
  - UART address on bus

        $ cat gnu-src/newlib-1.18.0/libgloss/or32/ml501.S
        /*
         * Define symbols to be used during startup
         * file is linked at compile time
         */
        .global _board_mem_base
        .global _board_mem_size
        .global _board_clk_freq

        _board_mem_base: .long 0x0
        _board_mem_size: .long 0x800000

        _board_clk_freq: .long 66666666

        /* Peripheral information - Set base to 0 if not present */
        .global _board_uart_base
        .global _board_uart_baud
        .global _board_uart_IRQ

        _board_uart_base: .long 0x90000000
        _board_uart_baud: .long 115200
        _board_uart_IRQ:  .long 2
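To show why startup code needs both the clock frequency and the baud rate, here is a sketch of how a driver might derive a UART divisor from them. It assumes a 16550-style UART with 16x oversampling, which these slides do not state, and `uart16550_divisor` is a hypothetical helper, not a libgloss function.

```c
#include <stdint.h>

/* Assumption: 16550-style UART, where baud = clk_hz / (16 * divisor).
 * Returns the divisor rounded to the nearest integer. */
uint32_t uart16550_divisor(uint32_t clk_hz, uint32_t baud)
{
    return (clk_hz + 8u * baud) / (16u * baud);
}
```

With the ML501 values above (66666666 Hz and 115200 baud) this gives a divisor of 36. A board with a different clock would need a different divisor, which is why the values are supplied per board at link time.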



### Compiling with new board library

- Once the file is compiled with the correct values, it should be archived and placed alongside the rest of the newlib board support files:

  \$TOOLCHDIR/or32-elf/lib/boards/<boardname>/libboard.a

- The -mboard=boardname switch can now be passed to the compiler, and software should initialise correctly for the board



# Run "helloworld" in the simulator

- Create a basic helloworld C file:

      #include <stdio.h>

      int main(void)
      {
          printf("Hello world!\n");
          return 0;
      }

- Compile it:

      or32-elf-gcc hello.c -o hello

- Run it in Or1ksim:

      $ or32-elf-sim -m8M hello
      Seeding random generator with value ...
      Or1ksim 2012-03-23
      Building automata... done
      Section: .jcr, vaddr: 0x000089bc,...
      Section: .data, vaddr: 0x000089c0, ...
      Hello world!
      exit(0)
      @reset : cycles 0, insn #0
      @exit : cycles 3692, insn #2842
      diff : cycles 3692, insn #2842

- Note: GCC defaults to using the "or1ksim" board