CS250 VLSI Systems Design
Fall 2020
The University of California, Berkeley
This design was built with OpenLane tag mpw-one-a, a.k.a. rc5, and likewise the user_project_wrapper was hardened with Caravel tag mpw-one-a (the one-a branch of this repo). However, the master branch of this repo is rebased on mpw-one-b of the Caravel repo for submission to the first shuttle run. If you're going to re-harden the blocks, please do so in the one-a branch.
FPGA250 is an FPGA designed by UC Berkeley's CS250 class of Fall 2020:
The FPGA comprises three basic tile types:
The current FPGA design includes a 4x3 grid of CLBs. Larger fabrics, which include the MAC and SRAM tiles, should be possible with tooling help to improve layout density.
To reduce strain on the interconnect, as few wires as possible were added to the SLICEM.
{NUM_LUTS*2{addr[3:0]}}
The Verilator testbenches for the slicel module can be found in sim/behavioral and can be run with make slicel_tb.vcd. The slicel tests themselves can be found in sim/behavioral/slicel_tb.cpp. There are two slicel tests, test_slicel_crand and test_slicel_directed.
test_slicel_crand(argc, argv, env, mode, seed, configs, iterations, verbosity): A constrained random test that runs configs independent tests for iterations clock cycles each.
test_slicel_directed(argc, argv, env, mode, seed, iterations, verbosity): A constrained random test that runs a single test for iterations clock cycles. (Used for debugging once a failing seed has been found.)

Options
argc, argv, env: The arguments to the main function, which should be forwarded to Verilator.
mode: The style of test to run.
    RAND: Fully random configurations and inputs.
    BASIC_S44: LUTs are configured in S44 mode with random lookup tables. Higher-order muxes and the carry chain are disabled. Inputs are random.
    BASIC_FRAC: LUTs are configured as two 4-LUTs with random lookup tables. Higher-order muxes and the carry chain are disabled. Inputs are random.
    ADDER: The slicel is configured to add two 4-bit numbers. Inputs are constrained to add two 4-bit numbers per cycle.
seed: The random seed from which tests are generated. The crand test prints the seed for each individual test for use in directed.
configs: crand only: The number of tests to run. Each run has a different configuration.
iterations: How long to run the test for a single configuration, i.e. the number of inputs to pass to the slicel.
verbosity: The level of detail of the output when running a test (min = 100, max = 400).

In directed tests, the dut is manually configured and passed inputs, and the outputs can be read directly. This is useful for verifying cross-communication with other blocks. The basic options when writing a directed test can be found in the basic skeleton/example: sim/behavioral/slicel_directed.cpp.
This can be run from the sim/behavioral directory with make slicel_directed.vcd.
There are a few ways to configure the slicel module in the testbench framework. The configuration options available to the slicel module are:
lut[0-3]: A 32-bit truth table for each S44 LUT. The highest bit corresponds to the highest bit of the upper 4-LUT.
soft: A 4-bit value giving whether each LUT is configured as two fractured 4-LUTs or an S44 LUT (0=s44, 1=fractured). The upper bit corresponds to lut3.
cc: A single-bit value giving whether the carry chain should be used to drive slicel outputs (1=carry_chain enabled).
inter_lut_muxes: A two-bit value {f8_enable, f7_enable} that determines whether the f7 and f8 muxes should be treated as muxes or simply pass their lower input (0=pass_through, 1=enabled).
register_reset: An eight-bit value giving the reset values of each register. The upper bit corresponds to reg7.

A nonzero value is returned if configuration fails. This should only occur with invalid configurations (e.g. cc=1 && inter_lut_muxes!=0).

To use an automatically generated configuration, use test->generate_config(mode), where mode is as described in #Testbench.
You can alternatively generate a specific configuration without having to manually construct a bitstream with this option.
test->assemble_config(lut0, lut1, lut2, lut3, soft, cc, inter_lut_muxes, register_reset)
Use this option to configure the slicel with a bitstream. The bitstream is given as an argument as an int cfg[5] array, where cfg[4] gives the upper bits of the bitstream (and has its upper bits unused).
int cfg[5] = {0,0,0,0,0};
// set cfg here
dut->configure(cfg, 5);
The configuration bitstream is defined as follows (big-endian).
{register_reset, cc, f8_enable, f7_enable, lut3_s, lut3, lut2_s, lut2, lut1_s, lut1, lut0_s, lut0}
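The field order above can be turned into the five-word cfg array with a small packing helper. The following is a minimal sketch, not the repository's assemble_config: it assumes that "big-endian" means register_reset lands in the most significant bits and lut0 starts at bit 0, and that cfg[0] holds the lowest 32 bits. The names SlicelConfig, put_bits, pack_config and config_is_valid are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Hypothetical container for the fields listed in the configuration options.
struct SlicelConfig {
    uint32_t lut[4] = {0, 0, 0, 0};  // 32-bit truth table per S44 LUT
    uint8_t soft = 0;                // bit i is the lut<i>_s flag (1 = fractured)
    bool cc = false;                 // 1 = carry chain enabled
    uint8_t inter_lut_muxes = 0;     // {f8_enable, f7_enable}
    uint8_t register_reset = 0;      // bit 7 corresponds to reg7
};

// Append the low `width` bits of `value` at bit position `pos`, LSB first.
void put_bits(std::array<uint32_t, 5>& words, unsigned& pos,
              uint64_t value, unsigned width) {
    for (unsigned i = 0; i < width; ++i, ++pos) {
        if ((value >> i) & 1u) {
            words[pos / 32] |= (1u << (pos % 32));
        }
    }
}

// The source states that cc=1 together with inter_lut_muxes!=0 is invalid.
bool config_is_valid(const SlicelConfig& c) {
    return !(c.cc && c.inter_lut_muxes != 0);
}

// Assumed layout, from bit 0 upward:
// lut0, lut0_s, lut1, lut1_s, lut2, lut2_s, lut3, lut3_s,
// f7_enable, f8_enable, cc, register_reset  (143 bits in total).
std::array<uint32_t, 5> pack_config(const SlicelConfig& c) {
    std::array<uint32_t, 5> words{};
    unsigned pos = 0;
    for (int i = 0; i < 4; ++i) {
        put_bits(words, pos, c.lut[i], 32);
        put_bits(words, pos, (c.soft >> i) & 1u, 1);
    }
    put_bits(words, pos, c.inter_lut_muxes & 1u, 1);         // f7_enable
    put_bits(words, pos, (c.inter_lut_muxes >> 1) & 1u, 1);  // f8_enable
    put_bits(words, pos, c.cc ? 1u : 0u, 1);
    put_bits(words, pos, c.register_reset, 8);
    return words;
}
```

With this layout the bitstream is 143 bits long, so only the low 15 bits of the last word are used, which matches the note that cfg[4] has its upper bits unused.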
There are two ways to generate inputs for the slicel module in the testbench framework. The inputs are:
lut_inputs: A 32-bit value giving the inputs to each S44 LUT: {lut3_in, lut2_in, lut1_in, lut0_in}, where each LUT's input is {upper_lut_in, lower_lut_in}. Bit 3 is discarded in S44 mode and replaced with the output of the upper_lut.
carry_in: A 1-bit value giving the carry_in for the carry chain.
reg_ce: A 1-bit value serving as the write_enable for the output registers of the slicel.
ho_addr: A 2-bit value {f8_select, f7_select} serving as the select signal for the inter_lut_muxes when they are enabled.

Randomly generates inputs based on the mode. See (Testbench) for details on each mode.
int lut_inputs;
bool carry_in, reg_ce;
char ho_addr;
test->generate_inputs(RAND, lut_inputs, carry_in, reg_ce, ho_addr); // Options: RAND, ADDER
dut->input_set(lut_inputs, carry_in, reg_ce, ho_addr);
Manually sets inputs.
dut->input_set(lut_inputs, carry_in, reg_ce, ho_addr);
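When setting inputs manually, the 32-bit lut_inputs word has to be assembled from the per-LUT values. The helper below is a small sketch of that packing. It assumes each LUT gets an 8-bit group with upper_lut_in in the high nibble and lower_lut_in in the low nibble, and that lut0_in sits in the lowest byte; pack_lut_inputs is a hypothetical name, not part of the testbench framework.

```cpp
#include <array>
#include <cstdint>

// Build {lut3_in, lut2_in, lut1_in, lut0_in}, where each lutN_in is assumed
// to be {upper_lut_in[3:0], lower_lut_in[3:0]}.
uint32_t pack_lut_inputs(const std::array<uint8_t, 4>& upper,
                         const std::array<uint8_t, 4>& lower) {
    uint32_t word = 0;
    for (int i = 3; i >= 0; --i) {
        uint32_t lut_in = ((upper[i] & 0xFu) << 4) | (lower[i] & 0xFu);
        word = (word << 8) | lut_in;  // lut3_in ends up in the top byte
    }
    return word;
}
```

The result could then be passed as the first argument of dut->input_set.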
It is recommended to use ticktock to progress the simulation, but if you want to check intermediate values or change inputs at a finer resolution, you can use tick and tock. These do not affect the configuration clock cclk.
dut->ticktock(); // @(posedge clk); @(negedge clk);
dut->tick(); // @(posedge clk);
dut->tock(); // @(negedge clk);
It is recommended to use dut->info for all prints. This behaves exactly as printf but with additional simulation information at the beginning.

The slicel has 3 outputs:
Co: the carry out of the carry chain. Get it with dut->get_carry_out()
out: the combinational outputs of the slicel as an 8-bit word. Get it with dut->get_out()
sync_out: the registered output of the slicel as an 8-bit word. Get it with dut->get_sync_out()
This is a width-configurable MAC “block” that can calculate four 8-bit x 8-bit, two 16-bit x 16-bit, or one 32-bit x 32-bit MAC or multiply operations in a single cycle. The operation and operation bitwidth can be configured at runtime.
Below is a basic block diagram of the inputs and outputs of the MAC from an external point of view. For a closer look at the MAC, please refer to the section MAC Cluster Block Diagram.
For IO, the whole MAC has 64 bits of total input (32 bits for each input) and 128 bits of total output. As the smallest input bitwidth is 8, the input wires are divided as such. To account for larger input bitwidths (16, 32), the inputs will span more wires. Similarly, the output is divided into 4 32-bit values as there will be 4 different accumulate outputs in the smallest bitwidth case. For larger bitwidths, the output will span more wires.
Below is a table of the mapping of inputs/outputs to bitwidth configuration. The wires are specified in the above block diagram.
The whole MAC cluster takes in 132 bits of configuration. The first 128 bits are used for initial accumulator values (4x32) and the last 4 bits are the function configuration that sets the signed operation (unsigned or signed), function (MAC or multiply) and the bitwidth (8-single, 16-dual, 32-quad). The bit layout is as follows: {32'acc3_init, 32'acc2_init, 32'acc1_init, 32'acc0_init, 1'signed, 1'function, 2'bitwidth}
For the signed operation configuration bit, a 0 encodes unsigned while a 1 encodes signed operations. For the function configuration bit, a 0 encodes multiply only and a 1 encodes multiply-accumulate. For the bitwidth, a 00 encodes an operation bitwidth of 8 (single), a 01 encodes an operation bitwidth of 16 (dual), and a 10 encodes an operation bitwidth of 32 (quad).
We also supply a testbench to verify the functionality of the MAC.
To build, run: make clean && make
To run the testbench, run: ./simulator-mac_test_harness +cfg=<val> +num_tests=<val>
where cfg is the 4-bit function configuration and num_tests is the number of tests. The different configuration codes are as follows:
0 = 4 8x8 mults, multiply only (unsigned)
1 = 2 16x16 mults, multiply only (unsigned)
2 = 1 32x32 mult, multiply only (unsigned)
4 = 4 8x8 mults, MAC (unsigned)
5 = 2 16x16 mults, MAC (unsigned)
6 = 1 32x32 mult, MAC (unsigned)
8 = 4 8x8 mults, multiply only (signed)
9 = 2 16x16 mults, multiply only (signed)
10 = 1 32x32 mult, multiply only (signed)
12 = 4 8x8 mults, MAC (signed)
13 = 2 16x16 mults, MAC (signed)
14 = 1 32x32 mult, MAC (signed)
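The configuration codes follow directly from the {1'signed, 1'function, 2'bitwidth} layout given earlier. The sketch below decodes a code into its fields; MacFunction and decode_mac_cfg are hypothetical names used only for illustration, and the handling of the unused bitwidth value 11 as invalid is an assumption.

```cpp
#include <cstdint>

// Decoded view of the 4-bit function configuration.
struct MacFunction {
    bool valid;        // false for codes the table does not list
    bool is_signed;    // bit 3: 0 = unsigned, 1 = signed
    bool accumulate;   // bit 2: 0 = multiply only, 1 = multiply-accumulate
    unsigned width;    // operand bitwidth: 8, 16 or 32
    unsigned lanes;    // parallel operations: 4, 2 or 1
};

MacFunction decode_mac_cfg(uint8_t cfg) {
    MacFunction f{};
    unsigned bitwidth = cfg & 0x3u;          // bits [1:0]
    f.valid = (cfg < 16) && (bitwidth != 3); // 11 is assumed to be unused
    if (!f.valid) {
        return f;
    }
    f.is_signed = (cfg >> 3) & 1u;
    f.accumulate = (cfg >> 2) & 1u;
    f.width = 8u << bitwidth;                // 00 -> 8, 01 -> 16, 10 -> 32
    f.lanes = 32u / f.width;                 // 4, 2 or 1 operations per cycle
    return f;
}
```

For example, code 13 is binary 1101, which decodes to signed, MAC, two 16x16 operations.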
For those who are interested in the design of the MAC, the below diagram briefly details the different components. The MAC “block” is actually designed as a cluster, where the operations are done in a distributed fashion across multiple blocks. This design was chosen as it allows the blocks to be moved around and custom placed to avoid any wire congestion, as the MAC cluster has a large number of inputs and outputs.
Regarding the main components, the multiply, combiner, and accumulator blocks provide the MAC cluster's core functionality. Each multiply block performs unsigned 8x8 multiply operations, with up to 4 parallel multiplies at a time. In the single-width (8-bit) configuration, each multiply block performs its own independent 8x8 multiply. In the larger configurations (16-bit and 32-bit inputs), the multiply blocks perform multiple operations in parallel to compute the cross-products needed to multiply the larger bitwidths. The combiner block then takes those cross-products and assembles them into the correct result. Once the inputs have been multiplied, the products are passed into the accumulator, which either accumulates or forwards the values depending on the configured function.
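The cross-product scheme can be illustrated in software for the 16-bit case. This is only an arithmetic sketch of the idea, not a description of how the actual combiner is wired: four unsigned 8x8 products are computed and then shifted and summed into the 32-bit result. The function names are hypothetical.

```cpp
#include <cstdint>

// One unsigned 8x8 multiply, standing in for a single multiplier lane.
uint32_t mul8(uint8_t a, uint8_t b) {
    return static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
}

// 16x16 unsigned multiply built from four 8x8 cross-products:
// a*b = (a_hi*b_hi << 16) + ((a_hi*b_lo + a_lo*b_hi) << 8) + a_lo*b_lo
uint32_t mul16_from_8x8(uint16_t a, uint16_t b) {
    uint8_t a_lo = static_cast<uint8_t>(a & 0xFFu);
    uint8_t a_hi = static_cast<uint8_t>(a >> 8);
    uint8_t b_lo = static_cast<uint8_t>(b & 0xFFu);
    uint8_t b_hi = static_cast<uint8_t>(b >> 8);

    uint32_t ll = mul8(a_lo, b_lo);
    uint32_t lh = mul8(a_lo, b_hi);
    uint32_t hl = mul8(a_hi, b_lo);
    uint32_t hh = mul8(a_hi, b_hi);

    // The "combiner" step: align each cross-product and add.
    return ll + ((lh + hl) << 8) + (hh << 16);
}
```

The same decomposition extends to 32x32, which needs sixteen 8x8 cross-products.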
For the MAC to support signed operations, negator blocks manipulate the sign of the multiplication inputs and outputs so that the unsigned multipliers can be reused without extra hardware. The first negator block takes the absolute value of all inputs to allow for unsigned multiplication, while the second negator assigns the correct sign after the multiplication has completed. This double negation layer lets us multiply signed inputs without needing signed multipliers.
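The double negation strategy can be sketched in software for a single 8-bit lane. This is an illustration of the arithmetic only, under the assumption that the result sign is the XOR of the input signs; signed_mul8 is a hypothetical name and does not correspond to a module in the design.

```cpp
#include <cstdint>

// Signed 8x8 multiply using only an unsigned multiplier.
int16_t signed_mul8(int8_t a, int8_t b) {
    // First negator: take absolute values. |-128| = 128 still fits in 8 bits
    // once the value is treated as unsigned.
    uint8_t ua = static_cast<uint8_t>(a < 0 ? -static_cast<int>(a) : a);
    uint8_t ub = static_cast<uint8_t>(b < 0 ? -static_cast<int>(b) : b);

    // Unsigned multiplier core.
    uint32_t product = static_cast<uint32_t>(ua) * static_cast<uint32_t>(ub);

    // Second negator: restore the sign when exactly one input was negative.
    bool negative = (a < 0) != (b < 0);
    int32_t result = negative ? -static_cast<int32_t>(product)
                              : static_cast<int32_t>(product);
    return static_cast<int16_t>(result);
}
```

The largest magnitude result, (-128) * (-128) = 16384, still fits in the 16-bit output.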
For more exact block diagrams of each component, you can check out the diagrams directory.
I'm a resident of the looking-glass world!! (Every figure in this page is mirrored by y-axis)
The modules are in the src directory. There is a testbench for each module in the testbench directory.
The unidirectional version is in the unidir_src directory. Its README is also in that directory.
I define a big-tile as shown in the figure below. It has one column for MAC and MEM in the middle. There may be multiple MAC modules and MEM modules in one big-tile, in which case the MAC modules are placed above all of the MEM modules. On the edges of the FPGA, there are io blocks, one for each SB on the edges and two for each SB at the corners. There are also data io blocks for DSBs.
fpga.v in the fpgatop branch implements this, but it has problems and has yet to be simulated. iverilog ran into internal errors (buffer overflow) when I increased some parameters. verilator seems to have problems with 2D arrays. Anyway, it might help you understand how these parameters should be treated.
The signals starting with “c” represent configuration bits.
I use this cell when one of the data is not inout. This enables verilator to compile the modules. The cell design of this module would be the same as transmission_gate.v.
universal_switch_box.v
clb_switch_box.v
One universal switch box for single lines and the following circuit for double lines.
This is used as a switch box where there is no horizontal line in layered interconnect.
The number of inputs of the CLB is CLBIN, but only the first CLBIN0 (CLBIN1) bits are connected to the tracks. When CLBX, a boolean parameter, is 1, there are direct connections. The number of switches for each output is limited by a parameter, and the position of the switches is shifted per output. The amount of the last shift is passed to the next connection block as a bias. Please note the positions of c31 and c32.
For data input/output for MAC/MEM.
For control (address) input for MAC/MEM.
Just connecting wires under modulus. Short circuit should not happen because the other end of wire can be 1'bz in switch boxes.
The config_tile module holds the configuration bits for every configurable option in a given tile (excluding SRAM bits). The config_tile module is designed to be fed from either a hardwired shift chain that spans a column of tiles, or from the fabric, allowing for arbitrarily granular dynamic reconfiguration of tiles.
shift_enable signal to freeze the bitstream in place in the chain.
set signal on each column which will load the bits into their respective configuration latches.
shift_in and set inputs away from the hardwired column inputs, and to inputs that come from the regular device interconnect.

The config_sram_data module performs boot-time initialization of the SRAM Blocks with contents, and resides within every SRAM Block tile.
Each block has a shift register with an address segment followed by a data segment. Each segment feeds the appropriate signal of the write port on the SRAM Block. When the set signal of the config_sram_data module is asserted, the module asserts the write_enable signal on its associated SRAM Block, which writes the data contained in the shift register at the appropriate address. In order to configure a column, every config_sram_data module in a column is linked end to end to create one long shift register. The RISC-V processor pumps in address-data pairs and asserts the set signal to load data into every SRAM block.

The config team is building a set of modules to enable the initialization of the device, and to enable dynamic reconfiguration of the device. Behavioral Verilog modules are stored in src/verilog with testbenches for each in sim/. (Hopefully) up-to-date diagrams of each module are stored in the docs/ folder, in both SVG and xcircuit PostScript files.
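The config_sram_data shift chain described above can be modelled behaviourally in a few lines. The sketch below is not the Verilog in src/verilog. It assumes hypothetical segment widths (4 address bits, 8 data bits), that the address segment occupies the upper bits of each shift register, and that bits are shifted in MSB first with the word for the last block in the chain sent first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical segment widths; the real sizes depend on the SRAM block.
constexpr unsigned kAddrBits = 4;
constexpr unsigned kDataBits = 8;
constexpr unsigned kWordBits = kAddrBits + kDataBits;

struct SramConfigModel {
    uint32_t shift_reg = 0;  // {address segment, data segment}
    std::vector<uint8_t> mem = std::vector<uint8_t>(1u << kAddrBits, 0);

    // Shift one bit in at the bottom; return the bit that leaves the top.
    bool shift(bool bit_in) {
        bool bit_out = (shift_reg >> (kWordBits - 1)) & 1u;
        shift_reg = ((shift_reg << 1) | (bit_in ? 1u : 0u)) &
                    ((1u << kWordBits) - 1u);
        return bit_out;
    }

    // Model of asserting set: write the data segment at the address segment.
    void set() {
        uint32_t addr = shift_reg >> kDataBits;
        mem[addr] = static_cast<uint8_t>(shift_reg & ((1u << kDataBits) - 1u));
    }
};

// Load one address-data pair into every block of a column.
// words[i] is the {address, data} pair intended for column[i].
void load_column(std::vector<SramConfigModel>& column,
                 const std::vector<uint32_t>& words) {
    for (std::size_t i = words.size(); i-- > 0;) {
        for (int b = static_cast<int>(kWordBits) - 1; b >= 0; --b) {
            bool carry = (words[i] >> b) & 1u;
            // Each block takes the bit leaving the previous block.
            for (auto& block : column) {
                carry = block.shift(carry);
            }
        }
    }
    for (auto& block : column) {
        block.set();
    }
}
```

Repeating load_column with a new set of pairs writes one more address in every block of the column, which mirrors pumping in address-data pairs and asserting set once per pair.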