In late 1996 I developed a recursive hardware multiplier, and presented it at the Synopsys User Group conference in 1998. I recently ran across Karatsuba algorithm for fast multiplication, where its recursive application reminded me of my recursive multiplier. I was mainly after increasing performance for fairly small multipliers, ease of pipelining, and not in reducing the number of multipliers as Karatsuba was. Here is the recreated VHDL implementation of my recursive multiplier:

-- Author: Victor J. Duvanenko 1/5/1997 library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.STD_LOGIC_ARITH.all; entity mult_splt1 is port ( a: in unsigned( 7 downto 0 ); b: in unsigned( 7 downto 0 ); z: out unsigned( 15 downto 0 )); end mult_splt1; architecture behavior of mult_splt1 is -- Multiplier width at which recursion should terminate constant BottomLevelWidth: integer := 4; -- Split NxN multiplication into four N/2xN/2 multiplications and three additions -- Assumes that N is even. function SplitMult( a, b: unsigned ) return unsigned is variable A0B0: unsigned( a'length - 1 downto 0 ); -- N/2xN/2 => N-bit result variable A1B0: unsigned( a'length - 1 downto 0 ); -- N/2xN/2 => N-bit result variable A0B1: unsigned( a'length - 1 downto 0 ); -- N/2xN/2 => N-bit result variable A1B1: unsigned( a'length - 1 downto 0 ); -- N/2xN/2 => N-bit result variable Tmp0: unsigned( a'length downto 0 ); -- N/2xN/2 + N/2xN/2 => (N+1)-bits variable Tmp1: unsigned( a'length + a'length - 1 downto 0 ); variable R: unsigned( a'length + a'length - 1 downto 0 ); -- Number of bits in ( N/2 + 1 ) value - e.g. 8x8 => 5 => 3 bits to represent 5 constant NumBitsInN_2Plus1: integer := ( a'length / 2 ) + 1; begin if ( a'length <= BottomLevelWidth ) then return( a * b ); else A0B0 := SplitMult( conv_unsinged( a( a'length/2 - 1 downto 0 ), a'length/2 ), conv_unsigned( b( a'length/2 - 1 downto 0 ), a'length/2 ); A1B0 := SplitMult( conv_unsinged( a( a'length - 1 downto a'length/2 ), a'length/2 ), conv_unsigned( b( a'length/2 - 1 downto 0 ), a'length/2 ); A0B1 := SplitMult( conv_unsinged( a( a'length/2 - 1 downto 0 ), a'length/2 ), conv_unsigned( b( a'length - 1 downto a'length/2 ), a'length/2 ); A1B1 := SplitMult( conv_unsinged( a( a'length - 1 downto a'length/2 ), a'length/2 ), conv_unsigned( b( a'length - 1 downto a'length/2 ), a'length/2 ); Tmp0 := conv_unsigned( A1B0, Tmp0'length) + conv_unsigned( A0B1, Tmp0'length); Tmp1 := A1B1 & A0B0; R := SHL( conv_unsigned( Tmp0, R'length ), conv_unsigned( a'length/2, NumBitsInN_2Plus1 )) + Tmp1; return R; end if; end; begin process( a, b ) begin z <= SplitMult( a, b ); end process; end;

The following is the paper from the conference.

## Abstract

We demonstrate performance improvements of multiplier. We code a modified multiplier structure that is nearly as fast as Booth coded Wallace tree implementation. We also demonstrate the power and simplicity of synthesizable VHDL recursion coding style.

## Faster Multiplication

In the Basic DesignWare package, Synopsys provides a classic implementation of multiplication by a 2-D array of full adders. In the Advanced Math DesignWare package a Booth-coded Wallace tree multiplier implementation is provided. The Booth-Wallace implementation provides a significant performance increase, at the cost of a higher gate count, of course.

In some instances, designers may want a multiplier that is faster than the 2-D array multiplier implementation, but that is not as costly in gates as the Booth-Wallace implementation. Several options are available to designers at this point:

- Pipeline the multiplier, if your application can tolerate the increase in latency due to pipelining. This can be done manually, or by using “balance_register” command, or by instantiating pipelined multiplier in one of the DesignWare libraries.
- Use the old trick of splitting a multiply into 4 quarter-size multiplies and 3 additions [1], which is the method that we’ll explain.

For simplicity of explanation, let’s start with an NxN multiplier where N is of even number of bits (e.g. 8×8). The multiplicand A and multiplier B (of the A*B multiplication operation) are each split into two parts:

A = A0 + A1 * 2^W

B = B0 + B1 * 2^W

Where W=N/2 (e.g. 4). Thus, for our example, A0 represents the lower 4 bits of the multiplicand, and (A1 * 2^W) represents the upper 4 bits. Then,

A * B = A0 * B0 + (A0 * B1 + A1 * B0) * 2^W + A1 * B1 * 2^2W

So, instead of doing an NxN multiplication, we’ve done four N/2 by N/2 multiplications followed by 3 additions to combine them into a single result. Of course, the 3 additions can be done in a binary tree fashion as

A * B = (A0 * B1 + A1 * B0) * 2^W + (A0 * B0 + A1 * B1 * 2^2W)

Actually, this is only one of several possibilities, but is an interesting one, because the right-most addition does not need to be performed at all, since there is no overlap between the addends. This addition is replaced by concatenation in the VHDL implementation. One also realizes that the middle addition does not need to be full length, since the first addend is shifted left 2^W bits. In other words, the additions are not as costly as they may seem at first. Additions can be arranged as a binary tree.

## Results in 350 nm Technology

The results compare 2-D array, Booth-Wallace, and 4-way split multiplier implementations:

- 2-D array of full-adders grows linearly in speed O(N) and as square in area O(N²). For example, a 16×16 multiplier is nearly twice as slow as an 8×8 and the area is four times larger. 2-D array always uses the least area.
- 4-way split multiplier is always faster than the 2-D array of full adders, even down to 4×4. For the 2×2 result, the 4-way split is equivalent to 2-D array to within “synthesis noise floor” – i.e. +/- 1 gate delay.
- For small multiplier sizes (less than or equal to 6×6) the split multiplier is the fastest.
- Booth-Wallace multiplier is fastest starting near 7×7 multiplication, with nearly O(logN) speed growth and nearly O(N²) area growth.
- 4-way split mutiplier is smaller than Booth-Wallace up to nearly 16×16. Thus, up to 16×16 the split multiplier offers space savings over Booth-Wallace, as well as speed saving over 2-D array implementations. This is the sweet spot for the split multiplier.
- Booth-Wallace beats the split implementation starting at 16×16, in speed and in area. Thus, there is no technical advantage offered by the split multiplier past that point. This break-even point will vary with technologies.

## Using Recursion

One may wonder if the split multiplier technique can be applied recursively to obtain further speed improvements and possibly O(logN) implementation. For example, we could split a 16×16 multiplication into four 8×8 multiplications, and then split each 8×8 into four 4×4 (for a grand total of 16 4×4 multiplications). A recursive implementation is shown above.

Notice that the SplitMult() calls itself four times and then combines the results. It knows to stop splitting when the arguments have been reduced to or below BottomLevelWidth constant that the user can set. For example, setting BottomLevelWidth := 4 will split a 16×16 multiplier into 16 4×4 multipliers.

Applying recursion to the Split multiplier did not produce encouraging results. For example, 16×16 multiplier split down to 4×4 ran only slightly faster and used more gates than a split down to 8×8. Splitting down to 2×2 yielded faster speed, but used more area, but was not as good as Booth-Wallace. Synthesis times increased substantially.

The split multiplier lends itself very well to manual or automated pipelining. One could put all of the leaf cells within one clock cycle, while placing the additions in another clock cycle. Pipelining deeper than a couple of clock cycles may loose its effectiveness, however.

When using the split multiplier, designers are advised to use “compare_design” command to compare this implementation to the standard multiplier implementation (and make sure that you compare A to B and then B to A as well). This is a great and quick way to ensure that two circuits are equivalent. We demonstrated only NxN multiplier where N is even. It’s possible and has been done successfully, however, to support NxM as well without restricting either N or M to be even. We also demonstrated ease, simplicity and power of using synthesizable VHDL recursion.

## References

- K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984, pp. 239-241.