Reference Hardware Implementations of Bit Extract/Deposit Instructions

The BEXT (bit extract, aka parallel extract, PEXT) instruction takes as arguments a value and a bit mask. The bits in the mask indicate which bits in the value should be extracted. All extracted bits are then packed into the LSB end of the result:

uint32_t bext32(uint32_t v, uint32_t mask)
{
    uint32_t c = 0, m = 1;
    while (mask) {
        uint32_t b = mask & -mask;
        if (v & b)
            c |= m;
        mask -= b;
        m <<= 1;
    }
    return c;
}

The BDEP (bit deposit, aka parallel deposit, PDEP) instruction performs the inverse operation. It takes the LSB bits of the value and places them in the positions marked in the mask argument:

uint32_t bdep32(uint32_t v, uint32_t mask)
{
    uint32_t c = 0, m = 1;
    while (mask) {
        uint32_t b = mask & -mask;
        if (v & m)
            c |= b;
        mask -= b;
        m <<= 1;
    }
    return c;
}

Simple 16-bit example:

value 0011_0110_0000_0101
mask  0011_1100_0000_0000
bext  0000_0000_0000_1101
bdep  0001_0100_0000_0000

This instructions may be valuable features in bit-manipulation ISA extensions. However, they are not trivial to implement in an efficient manner. This repository contains permissively licensed Verilog reference implementations for 32-bit and 64-bit implementations of those instructions using different architectures:

  • Pipelined prallel extract/deposit
  • Single-stage parallel extract/deposit
  • Sequential in chunks with fixed cycle count
  • Sequential with one cycle per set mask bit

Additionally, some of the cores do also implement the generalised bit reversal (GREV) instruction, since the GREV instruction can in some cases share resources with BEXT/BDEP. GREV can be used to reverse the bit order in a word, or swap the two halfs of a word, or any in-between operation (like endianess convertion), or combinations of them:

uint32_t grev32(uint32_t x, int k)
{
    if (k &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (k &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (k &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (k &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (k & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

The Verilog code for the cores can be found in bextdep.v. The Verilog file bextdep_pps.v is also needed. It is generated by ppsmaker.py (make bextdep_pps.v) and contains prefix-sum cores that are used by most of the cores in bextdep.v.

Evaluation

This are the 64 bit Verilog cores:

NameGatesRelRel+GDelayDescription
bextdep64g396722.002.00223-stage pipeline with GREV
bextdep64p388461.832.17223-stage pipeline
bextdep64g280751.671.67402-stage pipeline with GREV
bextdep64p271401.481.82402-stage pipeline
bextdep64g164061.331.3349single-stage with GREV support
bextdep64p155151.141.4851single-stage implementation
bextdep64sh43240.901.24334-cycles sequential implementation
bextdep64sb35890.741.08248-cycles sequential implementation
bextdep64sn31920.661.001716-cycles sequential implementation
bextdep64sx31230.650.9971up-to-64-cycles sequential implementation
bextdep64go16410.340.3415single-stage GREV-only core
vscalealu6448251.001.34131ALU from V-Scale CPU (for comparison)

And this are the 32 bit Verilog cores:

NameGatesRelRel+GDelayDescription
bextdep32g342141.911.91163-stage pipeline with GREV
bextdep32p338591.752.08163-stage pipeline
bextdep32g234511.571.57282-stage pipeline with GREV
bextdep32p230621.391.72272-stage pipeline
bextdep32g126161.191.1938single-stage with GREV support
bextdep32p121981.001.3336single-stage implementation
bextdep32sb19760.901.22244-cycles sequential implementation
bextdep32sn16700.761.09178-cycles sequential implementation
bextdep32sx15430.701.0339up-to-32-cycles sequential implementation
bextdep32go7210.330.3313single-stage GREV-only core
vscalealu3222031.001.3368ALU from V-Scale CPU (for comparison)

The gate counts in the tables above are for the cores mapped to NAND, NOR, and NOT gates, and DFFs, with NOT counting as 0.5 gates and DFFs counting as 4 gates. (See Makefile for synthesis scripts.)

The “Rel” column lists the size relative to the the ALU of a V-Scale RISC-V processor. (The 64-bit version is just the 32 bit ALU with extended bit-width, it does not support the RV64 *W opcodes. See vscalealu.v.) The “Rel+G” column lists the relative size when the size of a GREV core is added to the cores that don't implement GREV.

The “Delay” column lists the number of date delays in the longest path from a primary input or FF output to a primary output or FF input.

The sequential cores require an additional cycle for initialization. So the initiation interval (II) of the bextdep32sb core is 5 and the II of the bextdep64sx core is up to 65 (when all mask bits in the input are set).

Here are the sizes and timings for the cores when mapped to a Xilinx UltraScale Kintex device (speed grade -3) with Vivado 2017.3. This evaluation also includes the Rocket “TinyConfig” core for size comparison:

NameLUTsRelRel+GMax Freq.
bextdep64g37740.800.80451 MHz
bextdep64p36750.700.90462 MHz
bextdep64g27620.790.79430 MHz
bextdep64p26780.700.91406 MHz
bextdep64g18540.890.89277 MHz
bextdep64p17330.760.96285 MHz
bextdep64sh5530.570.78351 MHz
bextdep64sb4370.450.66543 MHz
bextdep64sn3140.330.53668 MHz
bextdep64sx3090.320.52421 MHz
bextdep64go1950.200.201189 MHz
vscalealu649641.001.00713 MHz
NameLUTsRelRel+GMax Freq.
bextdep32g33220.750.75521 MHz
bextdep32p32790.650.84537 MHz
bextdep32g23180.740.74497 MHz
bextdep32p22760.640.83497 MHz
bextdep32g13700.860.86350 MHz
bextdep32p12830.660.85352 MHz
bextdep32sb2630.610.80545 MHz
bextdep32sn1680.390.58686 MHz
bextdep32sx1580.370.56427 MHz
bextdep32go830.190.191116 MHz
vscalealu324311.001.00819 MHz
tinyrocket42809.93-----211 MHz

Interfaces

All cores implement the same interface (XLEN=32 or XLEN=64):

input clock, reset;

input             din_valid;
output            din_ready;
input  [     1:0] din_mode;
input  [XLEN-1:0] din_value;
input  [XLEN-1:0] din_mask;

output            dout_valid;
input             dout_ready;
output [XLEN-1:0] dout_result;

The din* signals are a valid/ready-style input stream. Data is transferred in to the core in cycles where valid and ready are high. The core will raise ready when it is ready to receive data without waiting for an external valid signal, and ready is guaranteed to stay high until a transfer happens (or the core is reset).

The dout* signals are a valid/ready-style output stream. Data is transferred out of the core in cycles where valid and ready are high. The core will raise valid when it is ready to transmit data without waiting for an external ready signal, and valid is guaranteed to stay high until a transfer happens (or the core is reset).

The assigned values for din_mode are (GREV mode is only valid for cores with GREV support):

ModeDescription
2'b00BEXT
2'b01BDEP
2'b10GREV
2'b11reserved

All cores have FFs near the dout_result output, but may borrow from the timing path of the incomming din_mode, din_value, and din_mask signals.

Why do you call those instructions BEXT/BDEP and not PEXT/PDEP like everyone else?

Because the P in PEXT/PDEP stands for parallel and thus implies a certain implementation. That's just bad ISA naming. Imo an instruction name should describe what the instruction is doing, but leave it to the implementation to decide how to do it.

Limitations, Todos and Future Work

  • The 3-stage pipelined cores don't have very well balanced pipeline stages in Vivado synthesis. This might be worth looking into.

References