Reference Hardware Implementations of Bit Extract/Deposit Instructions

The BEXT (bit extract, aka parallel extract, PEXT) instruction takes as arguments a value and a bit mask. The bits in the mask indicate which bits in the value should be extracted. All extracted bits are then packed into the LSB end of the result:

uint32_t bext32(uint32_t v, uint32_t mask)
{
    uint32_t c = 0, m = 1;
    while (mask) {
        uint32_t b = mask & -mask;
        if (v & b)
            c |= m;
        mask -= b;
        m <<= 1;
    }
    return c;
}

The BDEP (bit deposit, aka parallel deposit, PDEP) instruction performs the inverse operation. It takes the LSB bits of the value and places them in the positions marked in the mask argument:

uint32_t bdep32(uint32_t v, uint32_t mask)
{
    uint32_t c = 0, m = 1;
    while (mask) {
        uint32_t b = mask & -mask;
        if (v & m)
            c |= b;
        mask -= b;
        m <<= 1;
    }
    return c;
}

Simple 16-bit example:

value 0011_0110_0000_0101
mask  0011_1100_0000_0000
bext  0000_0000_0000_1101
bdep  0001_0100_0000_0000

This instructions may be valuable features in bit-manipulation ISA extensions. However, they are not trivial to implement in an efficient manner. This repository contains permissively licensed Verilog reference implementations for 32-bit and 64-bit implementations of those instructions using different architectures:

Pipelined prallel extract/deposit
Single-stage parallel extract/deposit
Sequential in chunks with fixed cycle count
Sequential with one cycle per set mask bit

Additionally, some of the cores do also implement the generalised bit reversal (GREV) instruction, since the GREV instruction can in some cases share resources with BEXT/BDEP. GREV can be used to reverse the bit order in a word, or swap the two halfs of a word, or any in-between operation (like endianess convertion), or combinations of them:

uint32_t grev32(uint32_t x, int k)
{
    if (k &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (k &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (k &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (k &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (k & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

The Verilog code for the cores can be found in bextdep.v. The Verilog file bextdep_pps.v is also needed. It is generated by ppsmaker.py (make bextdep_pps.v) and contains prefix-sum cores that are used by most of the cores in bextdep.v.

Evaluation

This are the 64 bit Verilog cores:

Name	Gates	Rel	Rel+G	Delay	Description
`bextdep64g3`	9672	2.00	2.00	22	3-stage pipeline with GREV
`bextdep64p3`	8846	1.83	2.17	22	3-stage pipeline
`bextdep64g2`	8075	1.67	1.67	40	2-stage pipeline with GREV
`bextdep64p2`	7140	1.48	1.82	40	2-stage pipeline
`bextdep64g1`	6406	1.33	1.33	49	single-stage with GREV support
`bextdep64p1`	5515	1.14	1.48	51	single-stage implementation
`bextdep64sh`	4324	0.90	1.24	33	4-cycles sequential implementation
`bextdep64sb`	3589	0.74	1.08	24	8-cycles sequential implementation
`bextdep64sn`	3192	0.66	1.00	17	16-cycles sequential implementation
`bextdep64sx`	3123	0.65	0.99	71	up-to-64-cycles sequential implementation
`bextdep64go`	1641	0.34	0.34	15	single-stage GREV-only core
`vscalealu64`	4825	1.00	1.34	131	ALU from V-Scale CPU (for comparison)

And this are the 32 bit Verilog cores:

Name	Gates	Rel	Rel+G	Delay	Description
`bextdep32g3`	4214	1.91	1.91	16	3-stage pipeline with GREV
`bextdep32p3`	3859	1.75	2.08	16	3-stage pipeline
`bextdep32g2`	3451	1.57	1.57	28	2-stage pipeline with GREV
`bextdep32p2`	3062	1.39	1.72	27	2-stage pipeline
`bextdep32g1`	2616	1.19	1.19	38	single-stage with GREV support
`bextdep32p1`	2198	1.00	1.33	36	single-stage implementation
`bextdep32sb`	1976	0.90	1.22	24	4-cycles sequential implementation
`bextdep32sn`	1670	0.76	1.09	17	8-cycles sequential implementation
`bextdep32sx`	1543	0.70	1.03	39	up-to-32-cycles sequential implementation
`bextdep32go`	721	0.33	0.33	13	single-stage GREV-only core
`vscalealu32`	2203	1.00	1.33	68	ALU from V-Scale CPU (for comparison)

The gate counts in the tables above are for the cores mapped to NAND, NOR, and NOT gates, and DFFs, with NOT counting as 0.5 gates and DFFs counting as 4 gates. (See Makefile for synthesis scripts.)

The “Rel” column lists the size relative to the the ALU of a V-Scale RISC-V processor. (The 64-bit version is just the 32 bit ALU with extended bit-width, it does not support the RV64 *W opcodes. See vscalealu.v.) The “Rel+G” column lists the relative size when the size of a GREV core is added to the cores that don't implement GREV.

The “Delay” column lists the number of date delays in the longest path from a primary input or FF output to a primary output or FF input.

The sequential cores require an additional cycle for initialization. So the initiation interval (II) of the bextdep32sb core is 5 and the II of the bextdep64sx core is up to 65 (when all mask bits in the input are set).

Here are the sizes and timings for the cores when mapped to a Xilinx UltraScale Kintex device (speed grade -3) with Vivado 2017.3. This evaluation also includes the Rocket “TinyConfig” core for size comparison:

Name	LUTs	Rel	Rel+G	Max Freq.
`bextdep64g3`	774	0.80	0.80	451 MHz
`bextdep64p3`	675	0.70	0.90	462 MHz
`bextdep64g2`	762	0.79	0.79	430 MHz
`bextdep64p2`	678	0.70	0.91	406 MHz
`bextdep64g1`	854	0.89	0.89	277 MHz
`bextdep64p1`	733	0.76	0.96	285 MHz
`bextdep64sh`	553	0.57	0.78	351 MHz
`bextdep64sb`	437	0.45	0.66	543 MHz
`bextdep64sn`	314	0.33	0.53	668 MHz
`bextdep64sx`	309	0.32	0.52	421 MHz
`bextdep64go`	195	0.20	0.20	1189 MHz
`vscalealu64`	964	1.00	1.00	713 MHz

Name	LUTs	Rel	Rel+G	Max Freq.
`bextdep32g3`	322	0.75	0.75	521 MHz
`bextdep32p3`	279	0.65	0.84	537 MHz
`bextdep32g2`	318	0.74	0.74	497 MHz
`bextdep32p2`	276	0.64	0.83	497 MHz
`bextdep32g1`	370	0.86	0.86	350 MHz
`bextdep32p1`	283	0.66	0.85	352 MHz
`bextdep32sb`	263	0.61	0.80	545 MHz
`bextdep32sn`	168	0.39	0.58	686 MHz
`bextdep32sx`	158	0.37	0.56	427 MHz
`bextdep32go`	83	0.19	0.19	1116 MHz
`vscalealu32`	431	1.00	1.00	819 MHz
`tinyrocket`	4280	9.93	-----	211 MHz

Interfaces

All cores implement the same interface (XLEN=32 or XLEN=64):

input clock, reset;

input             din_valid;
output            din_ready;
input  [     1:0] din_mode;
input  [XLEN-1:0] din_value;
input  [XLEN-1:0] din_mask;

output            dout_valid;
input             dout_ready;
output [XLEN-1:0] dout_result;

The din* signals are a valid/ready-style input stream. Data is transferred in to the core in cycles where valid and ready are high. The core will raise ready when it is ready to receive data without waiting for an external valid signal, and ready is guaranteed to stay high until a transfer happens (or the core is reset).

The dout* signals are a valid/ready-style output stream. Data is transferred out of the core in cycles where valid and ready are high. The core will raise valid when it is ready to transmit data without waiting for an external ready signal, and valid is guaranteed to stay high until a transfer happens (or the core is reset).

The assigned values for din_mode are (GREV mode is only valid for cores with GREV support):

Mode	Description
2'b00	BEXT
2'b01	BDEP
2'b10	GREV
2'b11	reserved

All cores have FFs near the dout_result output, but may borrow from the timing path of the incomming din_mode, din_value, and din_mask signals.

Why do you call those instructions BEXT/BDEP and not PEXT/PDEP like everyone else?

Because the P in PEXT/PDEP stands for parallel and thus implies a certain implementation. That's just bad ISA naming. Imo an instruction name should describe what the instruction is doing, but leave it to the implementation to decide how to do it.

Limitations, Todos and Future Work

The 3-stage pipelined cores don't have very well balanced pipeline stages in Vivado synthesis. This might be worth looking into.

Reference Hardware Implementations of Bit Extract/Deposit Instructions

Evaluation

Interfaces

Why do you call those instructions BEXT/BDEP and not PEXT/PDEP like everyone else?

Limitations, Todos and Future Work

References